What actually happened to the data-labeling and RLHF market
The structural change in the human-feedback industry through 2025 and into 2026 is now legible enough that the data points cohere into a single picture, even where the individual reports come from different research firms with different methodologies.
The operationally important data points, synthesized from the consolidated coverage:
- Surge AI — founded by Edwin Chen in 2020, operating profitably from year one, and bootstrapped without external capital for five years — hit $1.2B in annualized revenue with the frontier-lab cohort (OpenAI, Google, Anthropic, Microsoft, Meta) as the dominant customer base. The company maintains roughly 50,000 expert contractors and ~130 full-time employees, and initiated its first-ever capital raise in mid-2025 at a reported valuation between $15B and $25B.
- Scale AI sits at a $14B valuation on top of an integrated platform that spans RLHF workflows, RL environments, and secure-enclave deployment for sovereign and regulated buyers.
- The RLHF platform market is forecast to grow from $2.8B in 2025 to $18.6B by 2034 on the consolidated industry analyst read, a roughly 6.6x increase over nine years (an implied CAGR of about 23%).
- The AI data-labeling market is sized at $2.3B in 2026 and is forecast to reach $6.5B by 2031, a 22.95% CAGR over the forecast period.
- The iMerit State of AI in the Enterprise study reports that 96% of companies say human-in-the-loop (HITL) is essential or nice-to-have for AI/ML projects, and 86% say it is strictly essential. The 86% figure is the operating signal that matters for the buyer-side conversation.
- The standard 2026 training pipeline at every major lab is Pre-training → SFT → Preference Optimization (RLHF/DPO) → RLVR (for reasoning). DPO has emerged as the de facto default for alignment fine-tuning in most production lines, with reward-model-based RLHF retained for the cases where training an explicit reward model and sampling online is worth the extra cost.
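The two market forecasts in the list above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `cagr` and `project` are our own, and the inputs are the figures quoted above, treated as nine compounding years for the RLHF platform forecast (2025 to 2034) and five for the data-labeling forecast (2026 to 2031).

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value, and a span."""
    return (end / start) ** (1.0 / years) - 1.0


def project(start: float, rate: float, years: int) -> float:
    """Value after compounding `start` at `rate` for `years` periods."""
    return start * (1.0 + rate) ** years


# RLHF platform forecast: $2.8B (2025) to $18.6B (2034), nine compounding years.
rlhf_multiple = 18.6 / 2.8
rlhf_cagr = cagr(2.8, 18.6, 2034 - 2025)

# Data-labeling forecast: $2.3B (2026) compounding at 22.95% through 2031.
labeling_2031 = project(2.3, 0.2295, 2031 - 2026)

print(f"RLHF platform multiple: {rlhf_multiple:.2f}x")
print(f"RLHF platform implied CAGR: {rlhf_cagr:.1%}")
print(f"Data-labeling 2031 projection: ${labeling_2031:.2f}B")
```

Under these assumptions the two forecasts imply similar annual growth rates (both in the low twenties, in percent), even though they come from different firms and cover different windows.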
Worth reading carefully: the demand shift over the last 18 months has been not for more annotation volume but for more skilled-reviewer judgment. Surge's customer base is the frontier-lab cohort because the bottleneck on the frontier-lab training pipelines is domain-deep human judgment on the hardest tail of the distribution, not crowdsourced annotation volume. The same pattern is reproducing in the enterprise tier underneath: the buyers who are getting value from their alignment spend are the ones who treat the reviewer pool as a managed senior-judgment surface with calibration, rubric authoring, and gold-set discipline; the buyers who treat it as crowdsourced annotation volume are not.
Why the discipline shift matters more than the market-size shift
The temptation, reading "$1.2B bootstrapped, $18.6B forecast, 86% strictly essential," is to conclude that data labeling is back as an investable category. That read is correct as a finance-side observation and structurally incomplete as an engineering-side observation. The structurally important shift is in the discipline of human-feedback work, not the volume of it.
Three honest reads on the discipline shift.
The bottleneck moved from annotation volume to senior-judgment depth. Through 2023 and most of 2024, the operating constraint on a frontier-lab training run was "how many labelers can you put on this dataset, how cheaply, and with what minimum quality threshold?" The cost surface that the data-labeling industry was optimized for was cents per annotation, scaled to billions of annotations. The constraint moved through 2025 as the frontier models passed the easy parts of the distribution and the gains started coming from the hardest tail: the cases where the judgment is "which of these two model outputs is better?" when both are plausible and the answer depends on domain knowledge the average labeler doesn't have. The cost surface that the next generation of human-feedback work is optimized for is senior expert hours, with calibration, rubric discipline, and multi-judge agreement protocols. The labor mix is different; the unit economics are different; the buyer relationships are different. Surge AI's positioning around expert reviewers and high-context judgment is the bootstrapped-to-$1.2B story; the business model that mirrors it is the one that compounds through the rest of the cycle.
DPO becoming the alignment default does not reduce the demand for expert human feedback — it changes the shape. Direct Preference Optimization is structurally cheaper to run than full RLHF because it doesn't require training a separate reward model, and the production lines that adopted it through 2025 saw the per-training-run cost decline. The data that DPO consumes is preference pairs — the same primitive the buyer has been generating for years. What changed is that the quality of the preference pairs now matters more, because there is no reward model in between to absorb the noise. The downstream effect: the labs that produce good preference pairs from senior reviewers produce models that align well; the labs that produce noisy preference pairs from average reviewers produce models that don't. The shift toward DPO concentrates value on the preference-pair quality surface, which is the surface where senior-judgment discipline lives.
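To make the "no reward model in between" point concrete, here is a minimal sketch of the per-pair DPO objective as it is commonly written: the negative log-sigmoid of a scaled difference between the policy's and a frozen reference model's log-probability margins on the chosen and rejected responses. The function names, the `beta=0.1` default, and the log-probability values are illustrative assumptions, not any lab's production setup.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Illustrative numbers: the policy already leans toward the better response.
clean_loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
# Same pair with the reviewer's label flipped (chosen and rejected swapped).
flipped_loss = dpo_loss(-15.0, -12.0, -14.0, -13.0)

print(f"clean label loss:   {clean_loss:.4f}")
print(f"flipped label loss: {flipped_loss:.4f}")
```

The reviewer's label enters the gradient directly: a flipped label raises the loss on a pair the policy was already ranking sensibly, so the update pushes the policy toward the worse response. That is the mechanism behind the claim that preference-pair quality carries more weight under DPO.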
RLVR for reasoning is a new column of demand the industry is just learning to staff. Reinforcement Learning with Verifiable Rewards — the post-training step that the reasoning-tier models (Opus, MAI-Thinking-1, GPT-5.5, the Gemini line) all use — requires gold sets of problems where the correct answer is verifiable. Math problems, programming problems, scientific reasoning problems, multi-step proof problems. The work of building those gold sets — selecting problems at the right difficulty distribution, verifying solutions, writing the verifier code, designing the curriculum — is a new category of human-feedback work that didn't exist in industrial volume two years ago. The teams that are building RLVR datasets at frontier-lab scale today are inventing the discipline as they go. The teams that are buying that discipline downstream — for their own reasoning-tier fine-tuning — are paying premium rates because the supply is thin.
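A sketch of what "writing the verifier code" can look like for the simplest case, a math gold set with exact rational answers. Everything here is a hypothetical illustration: the `GoldProblem` fields, the 1-to-5 difficulty tier, and the `ANSWER:` output convention are assumptions made for the example, not a format any lab is known to use.

```python
import re
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class GoldProblem:
    problem_id: str
    prompt: str
    answer: Fraction  # reference answer, verified by a human expert
    difficulty: int   # hypothetical 1-5 curriculum tier


# Assumed convention: the completion must end with "ANSWER: <integer or p/q>".
ANSWER_PATTERN = re.compile(r"ANSWER:\s*(-?\d+(?:/\d+)?)\s*$")


def verify(problem: GoldProblem, completion: str) -> float:
    """Return 1.0 if the final answer equals the gold answer exactly, else 0.0."""
    match = ANSWER_PATTERN.search(completion.strip())
    if match is None:
        return 0.0  # no parseable final answer means no reward
    try:
        predicted = Fraction(match.group(1))
    except ZeroDivisionError:
        return 0.0  # e.g. "1/0" is not a valid answer
    return 1.0 if predicted == problem.answer else 0.0


gold = GoldProblem("p1", "What is 1/2 + 1/4?", Fraction(3, 4), difficulty=1)
print(verify(gold, "Common denominator is 4. ANSWER: 3/4"))
print(verify(gold, "Adding eighths gives 6/8. ANSWER: 6/8"))
print(verify(gold, "I believe it is about two thirds. ANSWER: 2/3"))
```

Comparing as `Fraction` rather than as strings means equivalent forms such as 6/8 and 3/4 earn the same reward. Decisions like that one (what counts as the same answer, what happens on unparseable output) are where much of the human effort in gold-set work goes.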
What changes about the buyer-side AI-training conversation
Four shifts that follow from the discipline rerating for any enterprise that has alignment fine-tuning, RLHF, or RLVR on the FY27 roadmap.
The make-vs-buy decision on the human-feedback layer can no longer default to make. The conventional 2024 answer was "spin up an internal labeling team, train them on the rubric, run the workflow on a homegrown annotation tool." That answer worked when volume was the constraint and the quality threshold was modest. It does not work for senior-judgment workflows where the calibration cadence runs weekly, the rubric authoring needs domain experts on staff, and the multi-judge agreement protocols need a managed reviewer pool with attrition planning. The honest 2026 answer is a set of questions: which fraction of the workflow do we own, which fraction do we buy, and how is the boundary structured so that the customer's signal (the rubrics, the gold sets, the workload-specific examples) stays portable? The buyer that defaults to make without staffing the discipline ends up with a labeling team that produces volume the model can't learn from; the buyer that defaults to buy without owning the signal ends up locked into a service vendor whose roadmap is not the buyer's roadmap.
The eval-and-governance discipline moves to the center of the procurement conversation. A buyer signing for $5M of RLHF data needs to know — at procurement time — what the gold sets look like, what the multi-judge agreement protocol is, what the calibration cadence is, how the senior-reviewer pool is managed for attrition and drift, and what signal the buyer owns at the end of the engagement. The vendor conversation that doesn't anchor on those questions is a vendor conversation that will discover the discipline gap in production, not in procurement. The buyer that sets those questions as table stakes at RFP time gets a vendor relationship that produces durable model quality; the buyer that anchors on the per-annotation rate gets the conventional outcome — the model improves marginally, the dashboard moves slowly, and the budget gets rolled into next year on the assumption that more volume will fix it.
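One way to make the multi-judge agreement question concrete at procurement time is to ask for a chance-corrected agreement statistic on a shared sample. The sketch below computes Fleiss' kappa, a standard measure for a fixed number of judges per item. The sample labels are invented for illustration, and a real protocol would also set the threshold a pool has to clear.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that were each labeled by the same number of judges."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_judges = len(ratings[0])
    if n_judges < 2 or any(len(r) != n_judges for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of judges")

    counts = [Counter(item) for item in ratings]
    # Per-item observed agreement: fraction of judge pairs that agree.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_judges) / (n_judges * (n_judges - 1))
        for cnt in counts
    ]
    observed = sum(per_item) / n_items

    # Agreement expected by chance, from the overall label proportions.
    totals: Counter = Counter()
    for cnt in counts:
        totals.update(cnt)
    expected = sum((t / (n_items * n_judges)) ** 2 for t in totals.values())

    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


# Invented sample: three judges pick the better of responses "A" and "B" on four pairs.
sample = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["A", "A", "B"],
    ["B", "A", "B"],
]
print(f"Fleiss' kappa: {fleiss_kappa(sample):.3f}")
```

Raw percent agreement would flatter this pool, because two of the four items are unanimous. Kappa discounts the agreement that chance alone would produce, which is why it is the more useful number to track across calibration rounds.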
The senior-reviewer pool becomes a strategic asset, not a cost line. A managed pool of senior reviewers — engineers, domain experts, bilingual practitioners — calibrated to the buyer's specific workload, with continuous rubric refinement and multi-judge agreement protocols, is not the same procurement category as a labeling team. It is closer in shape to a research-engineering team: high-context, slow to onboard, expensive to lose, valuable in proportion to the depth of the workload-specific knowledge that has been built up over time. The buyer who treats it as a cost line and rotates the pool quarterly will rebuild the discipline every quarter and never see the compounding return. The buyer who treats it as a strategic asset and invests in the pool's continuity will see the model-quality compounding curve that the frontier labs have figured out for themselves.
The data-portability question becomes the make-vs-buy hinge. The customer's signal — the gold sets, the rubrics, the preference-pair history, the multi-judge agreement records — is the durable asset of the alignment program. Whether the customer owns that signal in a portable representation, or whether it lives platform-locked inside a vendor's annotation tool, is the question that determines whether the alignment investment compounds into the buyer's roadmap or evaporates with the next vendor switch. The contract terms that determine this are unglamorous and structurally decisive: export formats, schema ownership, the right to re-use rubrics across vendor relationships. The buyer who negotiates these at procurement time keeps the asset; the buyer who doesn't gives it away.
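As an illustration of what "a portable representation" can mean in practice, the sketch below defines a vendor-neutral preference record and round-trips it through JSON Lines. The field names, the `schema_version` value, and the record layout are hypothetical; the point is only that the rubric version and every judge's decision travel with the pair in a format the buyer can read without the vendor's tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, List


@dataclass
class JudgeDecision:
    judge_id: str
    preferred: str  # "a" or "b"
    rationale: str


@dataclass
class PreferenceRecord:
    record_id: str
    prompt: str
    response_a: str
    response_b: str
    rubric_id: str
    rubric_version: str
    decisions: List[JudgeDecision] = field(default_factory=list)
    schema_version: str = "1.0"  # hypothetical; bump on breaking changes


def to_jsonl(records: Iterable[PreferenceRecord]) -> str:
    """Serialize records to JSON Lines, one self-describing record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text: str) -> List[PreferenceRecord]:
    """Rebuild records, including nested judge decisions, from JSON Lines text."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        raw = json.loads(line)
        raw["decisions"] = [JudgeDecision(**d) for d in raw["decisions"]]
        records.append(PreferenceRecord(**raw))
    return records


record = PreferenceRecord(
    record_id="r-001",
    prompt="Summarize the indemnification clause.",
    response_a="Draft A",
    response_b="Draft B",
    rubric_id="legal-summary",
    rubric_version="3",
    decisions=[JudgeDecision("j-17", "a", "B omits the liability cap.")],
)
restored = from_jsonl(to_jsonl([record]))
print(restored[0] == record)
```

A contract clause that guarantees export in a documented format like this, on request and at termination, is the practical form of "schema ownership."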
What this does not change
Three honest caveats.
It does not eliminate the volume tier of labeling work. Most enterprise AI projects still need a tier of human-feedback work that is closer to traditional annotation — labeling support tickets, tagging documents, basic preference comparisons on lower-stakes outputs. The volume tier doesn't go away; it becomes the smaller fraction of the engagement, with the senior-judgment tier becoming the larger fraction. The buyer's procurement needs to decompose the spend into the two tiers and structure each appropriately, not bundle them into a single per-annotation rate.
It does not eliminate the auto-grader and synthetic-data approaches. Through 2025 and into 2026, synthetic preference data and LLM-as-judge approaches have grown into a credible complement to human feedback on many workload classes, especially for the easier parts of the distribution where the cost of human review exceeds the marginal value. The honest production pipeline routes the easier cases to auto-graders and the harder cases to senior human reviewers; the all-human path is increasingly the wrong answer on cost grounds, and the all-synthetic path is increasingly the wrong answer on quality grounds. The discipline of deciding which class of work goes where, calibrated against the buyer's eval matrix, is what makes the hybrid work.
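The routing rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the "eval matrix" is modeled as a per-workload measurement of how often the auto-grader agrees with senior reviewers, and the workload names, scores, and both thresholds are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class Route(Enum):
    AUTO_GRADER = "auto_grader"
    SENIOR_REVIEW = "senior_review"


@dataclass(frozen=True)
class Item:
    item_id: str
    workload: str
    grader_confidence: float  # auto-grader's self-reported confidence, 0 to 1


# Hypothetical eval matrix: measured auto-grader agreement with senior reviewers.
EVAL_MATRIX = {
    "support_ticket_tagging": 0.97,
    "contract_clause_review": 0.71,
}


def route(
    item: Item,
    eval_matrix: Mapping[str, float],
    min_agreement: float = 0.90,
    min_confidence: float = 0.80,
) -> Route:
    """Send an item to the auto-grader only when both checks pass."""
    agreement = eval_matrix.get(item.workload)
    # Unmeasured or weakly measured workloads always go to humans.
    if agreement is None or agreement < min_agreement:
        return Route.SENIOR_REVIEW
    # Even on a trusted workload, low-confidence items are escalated.
    if item.grader_confidence < min_confidence:
        return Route.SENIOR_REVIEW
    return Route.AUTO_GRADER


for item in [
    Item("t-1", "support_ticket_tagging", 0.95),
    Item("t-2", "support_ticket_tagging", 0.55),
    Item("c-1", "contract_clause_review", 0.99),
]:
    print(item.item_id, route(item, EVAL_MATRIX).value)
```

The design choice worth noting is the default: a workload that has not been measured goes to senior review. The auto-grader earns a workload by demonstrating agreement on it, and it does not hold one by assumption.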
It does not eliminate the model-lead question. Better human-feedback discipline on a smaller model does not, in general, beat a worse-disciplined run on a stronger model — at least not on most production workloads where the model gap is wide. The buyer who invests heavily in alignment discipline while running on a two-generations-old base model will see the compounding return capped by the base model's ceiling. The honest play is to pair the discipline investment with a routing portfolio that keeps the buyer on the current frontier across multiple vendors, so the alignment work compounds on top of an underlying capability surface that is moving forward, not stuck.
Where Sonnet Code fits
A bootstrapped $1.2B vendor at the top of the market, an $18.6B forecast on the platform tier, and 86% of enterprises saying human-in-the-loop is strictly essential: that is the easy half of the AI-training conversation. The hard half is the engineering and human-judgment work that turns "we hired a vendor for RLHF" into "the senior-reviewer pool is calibrated to our workload, the rubrics are authored against gold sets we own, the multi-judge agreement protocol is producing preference data the model can actually learn from, the auto-grader and human routing is honestly tuned, and the signal we are building stays portable as the platform landscape shifts." AI training at Sonnet Code is the human-judgment half: senior engineers, domain experts, and bilingual reviewers who design the gold sets that grade alignment work honestly on the buyer's actual workload, calibrate the senior-review queue for the failure modes a production-tier alignment run produces, author the rubrics that the eval harness runs against, and serve as the senior-judge pool whose calibrated decisions compound model quality across training cycles. AI development is the engineering half: building the workflow infrastructure that owns the buyer's signal in a portable representation, instrumenting the multi-judge agreement protocols and the calibration cadence as first-class observability surfaces, routing the auto-grader and human-feedback work to the right tier based on the eval matrix, and integrating the alignment pipeline into the buyer's existing MLOps surface so the signal compounds rather than evaporating at the next platform switch.
The conversation about whether expert humans belong in the training loop just ended for anyone still having it. The teams that walk into FY27 planning with the senior-reviewer pool calibrated, the gold sets authored, the rubric discipline mature, the auto-grader and human routing honestly tuned, and the signal owned in a portable representation are the teams that will compound model quality through the back half of 2026 and the FY27 budget cycle. The teams that anchor on the per-annotation rate and skip the discipline investment will keep watching their alignment dashboards move slowly while the buyer down the road — who treated human-feedback discipline as a strategic investment — defines the model-quality curve that the rest of the cohort will inherit a year later.

