Sonnet Code
AI / ML · 2025 · 6 months

Human-in-the-loop evaluation harness

Overview

Built the platform the team uses to grade model outputs before and after training runs: rubric authoring, blind pairwise comparisons, rater calibration, and exportable scorecards that feed directly into the RLHF preference dataset. Replaced a patchwork of spreadsheets and ad-hoc scripts.
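
To give a sense of how scorecards can feed the preference dataset, here is a minimal sketch of a JSONL export step. The record schema (prompt, chosen, rejected, rubric_scores, rater_id) and the function name are illustrative assumptions, not the platform's actual format.

```python
# Minimal sketch of a scorecard -> preference-pair JSONL export.
# Field names here are illustrative assumptions, not the real schema.
import json

def export_preference_pairs(comparisons, path="preferences.jsonl"):
    """Write blind pairwise results as one preference record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for c in comparisons:
            record = {
                "prompt": c["prompt"],
                "chosen": c["winner_output"],    # output preferred by the rater
                "rejected": c["loser_output"],   # the other side of the pair
                "rubric_scores": c.get("rubric_scores", {}),
                "rater_id": c["rater_id"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```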

What we delivered

  • Rubric authoring UI
  • Blind pairwise comparison flow
  • Rater onboarding + calibration
  • Scorecard export (CSV + JSONL)
  • Admin metrics (IRR, drift, throughput; see the IRR sketch after this list)
  • Model-provider adapter for reference outputs
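
For a sense of the admin metrics, below is an illustrative sketch of one IRR calculation: Cohen's kappa over two raters' pairwise verdicts ("A", "B", or "tie"). It is a simplified stand-in, not the platform's exact implementation.

```python
# Illustrative inter-rater reliability (IRR) metric: Cohen's kappa for two
# raters labelling the same pairwise comparisons. Simplified stand-in only.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two raters' categorical labels."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[k] / n) * (freq_b[k] / n) for k in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:  # edge case: chance agreement is already perfect
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1 - expected)

# Example: two raters over five blind comparisons
print(cohens_kappa(["A", "B", "A", "tie", "B"], ["A", "B", "B", "tie", "B"]))
```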

Stack

Python · FastAPI · PostgreSQL · Next.js · LangGraph · OpenAI API


Duration: 6 months
Year: 2025
Industry: AI / ML

Want us to build yours? Schedule a 15-minute call.