AI Engineer (Evaluation Systems)

Skills

Anthropic, AWS, Docker, OpenAI, PostgreSQL, Python

Design a structured, configurable evaluation engine combining deterministic checks with LLM-as-judge verdicts. Build calibration workflows using expert-labeled examples, measure precision and recall accurately, handle delayed outcomes and low-confidence review flows, and store structured verdicts to power dashboards and analytics.
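
As a rough illustration only, the sketch below pairs cheap deterministic checks with a pluggable LLM-as-judge call. Every name in it (the Verdict record, the check functions, stub_judge) is a hypothetical placeholder; a production judge would call an LLM API such as OpenAI or Anthropic with a scoring rubric and parse a structured reply.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    check: str
    passed: bool
    confidence: float   # 1.0 for deterministic checks, judge-reported otherwise
    detail: str = ""

# Deterministic checks: pure functions over the candidate output.
def check_nonempty(output: str) -> Verdict:
    return Verdict("nonempty", bool(output.strip()), 1.0)

def check_max_length(output: str, limit: int = 2000) -> Verdict:
    return Verdict("max_length", len(output) <= limit, 1.0, f"{len(output)} chars")

def evaluate(output: str,
             judge: Callable[[str], Verdict],
             deterministic: list[Callable[[str], Verdict]]) -> list[Verdict]:
    """Run cheap deterministic checks first; spend the LLM-as-judge call
    only on outputs that pass every deterministic gate."""
    verdicts = [check(output) for check in deterministic]
    if all(v.passed for v in verdicts):
        verdicts.append(judge(output))
    return verdicts

# Stub judge: a real implementation would call an LLM API (OpenAI, Anthropic, ...)
# with a rubric prompt and parse the reply into a Verdict.
def stub_judge(output: str) -> Verdict:
    return Verdict("llm_judge:helpfulness", True, 0.8, "stubbed verdict")

if __name__ == "__main__":
    for v in evaluate("The capital of France is Paris.",
                      stub_judge, [check_nonempty, check_max_length]):
        print(v)
```

Gating the judge behind the deterministic checks is one possible design choice: it reserves the expensive LLM call for outputs that already pass the cheap gates.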

Key Responsibilities
  • Design a configurable evaluation engine
  • Combine deterministic checks with LLM-as-judge verdicts
  • Build calibration workflows using expert-labeled examples (see the calibration sketch after this list)
  • Measure precision and recall accurately
  • Handle delayed outcomes and low-confidence review flows
  • Store structured verdicts to power dashboards and analytics
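
A calibration pass can be sketched as follows: judge verdicts are compared against expert labels on the same items, treating "failure" as the positive class. The function name and the sample data below are invented purely for illustration.

```python
def precision_recall(pairs):
    """pairs: (judge_flags_failure, expert_flags_failure) booleans per item.
    With 'failure' as the positive class, precision asks how many judge-flagged
    failures the experts confirmed; recall asks how many expert-labeled
    failures the judge caught."""
    pairs = list(pairs)
    tp = sum(1 for j, e in pairs if j and e)
    fp = sum(1 for j, e in pairs if j and not e)
    fn = sum(1 for j, e in pairs if not j and e)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented calibration set, purely for illustration.
calibration = [(True, True), (True, False), (False, True),
               (True, True), (False, False)]
p, r = precision_recall(calibration)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```
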
Required Skills & Qualifications
  • 4+ years backend / ML engineering experience
  • 2+ years building production AI/LLM systems
  • Python, Docker, and PostgreSQL experience
  • Working knowledge of AWS and of LLM APIs such as OpenAI and Anthropic
  • Proven experience building LLM-based production systems
  • Experience developing evaluation/QA/scoring pipelines
  • Remote role with a LATAM focus
  • Independent contractor engagement via a payroll platform
  • Work allocated by the client, performed remotely
  • Human-in-the-loop workflow design (Plus)
  • OpenTelemetry familiarity (Plus)

Job Type: Remote

Salary: Not Disclosed

Experience: Entry

Duration: 12 Months

Similar Jobs

Risk Engineering Software Engineer

Posted 18 days ago

Automate risk workflows using Go and AI tools.

Prototype and move experiments into production.

Anthropic, CoPilot, Go, Kubernetes

AI Product Engineer Role

Posted 292 days ago

Rapid prototyping and deployment of AI features.

Building scalable applications using modern cloud stacks.
Building scalable applications using modern cloud stacks

Anthropic, Next.js, OpenAI, Python