AI Engineer (Evaluation Systems)

Skills

Anthropic, AWS, Docker, OpenAI, PostgreSQL, Python

Design a structured, configurable evaluation engine combining deterministic checks with LLM-as-judge verdicts. Build calibration workflows using expert-labeled examples, measure precision and recall accurately, handle delayed outcomes and low-confidence review flows, and store structured verdicts to power dashboards and analytics.
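
As a rough illustration only, the sketch below pairs cheap deterministic checks with a pluggable LLM-as-judge call. Every name in it (the Verdict record, the check functions, stub_judge) is a hypothetical placeholder; a production judge would call an LLM API such as OpenAI or Anthropic with a scoring rubric and parse a structured reply.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    check: str
    passed: bool
    confidence: float   # 1.0 for deterministic checks, judge-reported otherwise
    detail: str = ""

# Deterministic checks: pure functions over the candidate output.
def check_nonempty(output: str) -> Verdict:
    return Verdict("nonempty", bool(output.strip()), 1.0)

def check_max_length(output: str, limit: int = 2000) -> Verdict:
    return Verdict("max_length", len(output) <= limit, 1.0, f"{len(output)} chars")

def evaluate(output: str,
             judge: Callable[[str], Verdict],
             deterministic: list[Callable[[str], Verdict]]) -> list[Verdict]:
    """Run cheap deterministic checks first; spend the LLM-as-judge call
    only on outputs that pass every deterministic gate."""
    verdicts = [check(output) for check in deterministic]
    if all(v.passed for v in verdicts):
        verdicts.append(judge(output))
    return verdicts

# Stub judge: a real implementation would call an LLM API (OpenAI, Anthropic, ...)
# with a rubric prompt and parse the reply into a Verdict.
def stub_judge(output: str) -> Verdict:
    return Verdict("llm_judge:helpfulness", True, 0.8, "stubbed verdict")

if __name__ == "__main__":
    for v in evaluate("The capital of France is Paris.",
                      stub_judge, [check_nonempty, check_max_length]):
        print(v)
```

Gating the judge behind the deterministic checks is one possible design choice: it reserves the expensive LLM call for outputs that already pass the cheap gates.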

Key Responsibilities
  • Design a configurable evaluation engine
  • Combine deterministic checks with LLM-as-judge verdicts
  • Build calibration workflows using expert-labeled examples (see the calibration sketch after this list)
  • Measure precision and recall accurately
  • Handle delayed outcomes and low-confidence review flows
  • Store structured verdicts to power dashboards and analytics
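
A calibration pass can be sketched as follows: judge verdicts are compared against expert labels on the same items, treating "failure" as the positive class. The function name and the sample data below are invented purely for illustration.

```python
def precision_recall(pairs):
    """pairs: (judge_flags_failure, expert_flags_failure) booleans per item.
    With 'failure' as the positive class, precision asks how many judge-flagged
    failures the experts confirmed; recall asks how many expert-labeled
    failures the judge caught."""
    pairs = list(pairs)
    tp = sum(1 for j, e in pairs if j and e)
    fp = sum(1 for j, e in pairs if j and not e)
    fn = sum(1 for j, e in pairs if not j and e)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented calibration set, purely for illustration.
calibration = [(True, True), (True, False), (False, True),
               (True, True), (False, False)]
p, r = precision_recall(calibration)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```
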
Required Skills & Qualifications
  • 4+ years backend / ML engineering experience
  • 2+ years building production AI/LLM systems
  • Python, Docker, and PostgreSQL experience
  • Working knowledge of AWS and of LLM APIs such as OpenAI and Anthropic
  • Proven experience building LLM-based production systems
  • Experience developing evaluation/QA/scoring pipelines
  • Remote role with a LATAM focus
  • Independent contractor engagement via a payroll platform
  • Work allocated by the client, performed remotely
  • Human-in-the-loop workflow design (Plus)
  • OpenTelemetry familiarity (Plus)

Job Type: Remote

Salary: Not Disclosed

Experience: Entry

Duration: 12 Months

Similar Jobs

Risk Engineering Software Engineer

Posted 18 days ago

Automate risk workflows using Go and AI tools.

Prototype and move experiments into production.

Anthropic, CoPilot, Go, Kubernetes

AI Product Engineer Role

Posted 292 days ago

Rapid prototyping and deployment of AI features.

Building scalable applications using modern cloud stacks.
Building scalable applications using modern cloud stacks

Anthropic, Next.js, OpenAI, Python