The number that reframes the market
There's a single statistic that does more to explain the state of AI in 2026 than any model release. Surge AI (founded in 2020, no venture capital, run by a small team) crossed $1.2 billion in annualized revenue in 2024 while remaining profitable. By mid-2025 it was valued at $25+ billion, and in July 2025 it opened its first external fundraise targeting up to $1 billion at a $15B+ valuation. Scale AI, its better-known and venture-backed competitor, posted $870 million over the same window. Scale's partial acquisition by Meta then triggered a client exodus to Surge, with Google, OpenAI, Microsoft, and xAI all moving meaningful volume.
The customer list is roughly twelve organizations. The work is human-generated training data for frontier models. The market it sits inside is projected to grow from $4.87 billion in 2025 to $29.11 billion by 2032 at a 29% CAGR. Each major frontier lab is spending roughly $1 billion a year on human data. That's not a vendor business. That's a capital-expenditure line item on the same order as compute.
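The growth rate quoted above can be sanity-checked from the two endpoint figures. This is a minimal sketch using the standard compound-annual-growth-rate formula; the function name `cagr` is ours, and the only inputs are the market sizes and years already stated in the text.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the text: $4.87B in 2025 growing to $29.11B by 2032.
rate = cagr(4.87, 29.11, 2032 - 2025)
print(f"Implied CAGR: {rate:.1%}")  # lands close to the quoted 29%
```

The two endpoints and the quoted growth rate are consistent with each other, so the projection describes roughly a sixfold increase over seven years.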
The surface story is "AI is big, so the data labelers are big." The structural story is what kind of data is now scarce — and that's the part that changes the calculus for every company trying to train a model on its own domain.
Why the cheap-volume era ended
For the first wave of large language models, the bottleneck was volume. Get enough labeled examples in front of the model and the loss curve goes down. The vendors that won that phase were the ones who could scale annotation throughput cheaply — pools of crowdworkers, fast pipelines, low per-task cost.
That phase ended for a specific technical reason: frontier labs discovered that training models on their own outputs creates feedback loops. The model starts producing synthetic data, fine-tunes on it, amplifies its own mistakes, and degrades. The only break in the loop is fresh, expert-verified human evaluation. And "expert-verified" is the operative word. A crowdworker can label whether an image contains a stop sign. A crowdworker cannot tell you whether a legal-research agent's interpretation of a securities ruling will hold up in litigation. The work that matters now is the work the crowdworker can't do.
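The feedback loop can be seen in a toy setting. The sketch below is an illustration under loud assumptions, not a description of how any lab measures the effect: a one-dimensional Gaussian stands in for the model, refitting it stands in for fine-tuning, and samples drawn from the true distribution stand in for fresh expert-verified data. The function name `run_generations` and all parameter values are hypothetical.

```python
import random
import statistics


def run_generations(n_samples: int, n_fresh: int, generations: int, seed: int = 0) -> float:
    """Refit a Gaussian 'model' each generation on its own samples.

    n_fresh of the n_samples come from the true distribution N(0, 1)
    instead of from the model; n_fresh=0 is a fully closed loop.
    Returns the model's standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the true distribution
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_fresh)]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]  # stand-in for human data
        data = synthetic + fresh
        mu = statistics.mean(data)
        sigma = statistics.pstdev(data)
    return sigma


closed_loop = run_generations(n_samples=20, n_fresh=0, generations=300)
anchored = run_generations(n_samples=20, n_fresh=10, generations=300)
print(f"closed loop spread: {closed_loop:.2e}")
print(f"anchored spread:    {anchored:.2e}")
```

In the closed loop each refit loses a little of the spread and the losses compound, so the toy model narrows toward a single point. Mixing in fresh samples each generation keeps the spread on the order of the true distribution's. Real language models fail in more complicated ways, but the compounding is the same idea.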
So the spend pattern inverted. Frontier labs cut their cheap-annotation budgets and raised their expert-evaluation budgets. The going rate for the work reflects it: data annotators at $15–25/hr, AI tutors at $20–55/hr, RLHF specialists at $50–65/hr, prompt engineers at $40–65/hr, red teamers at $100–200/hr, and domain experts from $130 all the way up to $1,000/hr — a radiologist ranking model outputs on chest CTs, a securities attorney red-teaming a financial agent. The rate card has six rungs and the value lives at the top.
This is why a bootstrapped company without a sales team can be worth $25B. It's not selling annotation hours. It's selling expert judgment at industrial scale, which is a very different — and much more defensible — business.
What this means if you're not a frontier lab
Most companies reading this are not going to train a foundation model. They're going to take a frontier model and adapt it to their domain — fine-tune, RLHF, or just build a careful eval harness and a strong prompt scaffold. The same economic pattern applies to them, in miniature.
Volume annotation is now a commodity. If your training-data plan is "hire a team of generalists to label thousands of examples," a frontier model can already do that work nearly as well as the team. The fine-tune that gets you marginal lift on a generic task is not a defensible advantage in 2026.
Expert-verified evaluation is the scarce input. What separates a model that performs in your domain from one that doesn't is whether you have evaluations graded by people who would know if the answer were wrong. A finance model evaluated by people who don't read financial filings is a model that ships confident-sounding hallucinations. A code agent evaluated by junior reviewers is a code agent that produces convincing-looking regressions. The eval is only as good as the judgment behind it.
The economics favor depth over breadth. A small set of expert-graded examples — well-designed, covering the failure modes you actually care about — is now worth more than a large set of crowd-graded examples. This is true at frontier-lab scale ($1B a year on a few thousand experts) and it's true at company scale (a few dozen high-quality evaluations from senior staff, not a thousand mediocre ones from contractors).
What "depth over breadth" looks like in practice
The shape of a credible AI-training program in 2026 isn't a vendor purchase order. It's an operating model that looks more like staffing a clinical-trials team than running a labeling pipeline. Three pieces matter.
A rubric written by someone who knows the domain. Not "is the answer correct," which is only the question being asked. The rubric is the failure-mode taxonomy: what specifically goes wrong in your domain, in what ways, with what severity, and how a reviewer should weigh them. A securities-law rubric and a clinical-decision-support rubric look nothing alike. Generic rubrics produce generic models.
Reviewers who are paid what the expertise costs. The single biggest mistake in custom-model training is assigning evaluation to whoever is cheapest. The rates above ($130–$1,000/hr for domain experts) are where the market clears for a reason: expert reviewers catch the failure modes that matter, and non-expert reviewers don't. Underpay this line item and you're paying for the appearance of evaluation, not the substance.
Adversarial evaluation, not just acceptance testing. The frontier-lab pattern that pays off is red-teaming: expert reviewers actively trying to break the model on the cases that matter, not just confirming it works on the easy ones. The 4× reduction in unflagged code flaws in Claude Opus 4.8 didn't come from more examples. It came from reviewers who were paid to find the unflagged ones.
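One way to make the rubric and the adversarial split concrete is to record each expert review as a list of flagged failure modes, weight them by severity, and report acceptance and adversarial cases separately. The sketch below is a minimal illustration, not a prescribed schema: the failure-mode names, the weights, the `Review` type, and the sample reviews are all hypothetical, loosely styled on a securities-law rubric.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical failure-mode taxonomy; a domain expert would write the real one.
SEVERITY_WEIGHTS: Dict[str, int] = {
    "misstates_holding": 5,      # answer gets the ruling itself wrong
    "cites_superseded_rule": 3,  # relies on authority that no longer applies
    "omits_caveat": 1,           # correct but missing a material qualification
}


@dataclass
class Review:
    case_id: str
    kind: str            # "acceptance" or "adversarial"
    failures: List[str]  # failure modes the expert reviewer flagged


def penalty(review: Review) -> int:
    """Severity-weighted penalty for one reviewed case (0 means clean)."""
    return sum(SEVERITY_WEIGHTS[f] for f in review.failures)


def mean_penalty_by_kind(reviews: List[Review]) -> Dict[str, float]:
    """Average penalty, reported separately for each kind of case."""
    grouped = defaultdict(list)
    for r in reviews:
        grouped[r.kind].append(penalty(r))
    return {kind: sum(p) / len(p) for kind, p in grouped.items()}


# Made-up reviews for illustration only.
reviews = [
    Review("a1", "acceptance", []),
    Review("a2", "acceptance", ["omits_caveat"]),
    Review("r1", "adversarial", ["misstates_holding", "omits_caveat"]),
    Review("r2", "adversarial", ["cites_superseded_rule"]),
]
summary = mean_penalty_by_kind(reviews)
print(summary)
```

Keeping the two kinds of case apart matters because a blended average hides the gap: in the made-up data the model looks nearly clean on acceptance cases and considerably worse on the cases reviewers designed to break it.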
This is the inverted economics in concrete form: small team, high judgment, adversarial design, expensive per hour, cheap per unit of actual model improvement. The teams that internalize this will train models that perform in their domain. The teams that don't will buy a lot of labeled data and ship a model that demos well in the meeting and fails in production.
Where Sonnet Code fits
A $1B-a-year line item on human data at the frontier labs is the easy story. The harder, more useful story is that the shape of the work changed — and the same shape applies at any scale below frontier. AI training at Sonnet Code is exactly the operating model the new economics demand: senior engineers and domain experts who design the failure-mode rubrics that make evaluation meaningful, conduct adversarial red-team review on the cases your model has to get right, and stand up the human-in-the-loop discipline that breaks the model-trains-on-its-own-output loop in your specific domain. AI development is the engineering half: the eval harnesses that turn that expert judgment into a measured number, the fine-tune and RLHF pipelines that actually consume the evaluations, and the production guardrails that route around the failure modes the experts surfaced.
The cheap-volume era is over. The teams that win the next phase are the ones whose training program looks more like a clinical-trials operation than a labeling vendor — and that's the program worth standing up before the next model lands.

