The numbers, in one paragraph
The AI data-labeling category, which two years ago was a single-vendor market with Scale AI at the center, has reshaped twice in twelve months. Surge AI (bootstrapped in 2020, profitable, and quiet) crossed $1 billion in annual revenue and raised external capital for the first time at a $25 billion-plus valuation. Mercor raised at a $2B valuation in early 2025, was reported in talks at $10B by year-end, and now contracts with 30,000+ pre-vetted experts at an average rate of $95/hour, disbursing more than $1.5 million per day to evaluators. Scale AI's partial acquisition by Meta triggered a client exodus, with Google, OpenAI, Microsoft, and xAI all seeking alternatives, that turned Scale's near-monopoly into a three-way competitive market within eighteen months. The category itself is projected to grow from $2.32B in 2026 to $6.53B by 2031, with the fastest growth in the senior-expertise tier, not the volume-labeling tier.
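As a rough sanity check on those figures, the short sketch below derives two numbers the paragraph implies but does not state: the compound annual growth rate behind the market projection, and the expert hours per day implied by the disbursement figure. The second calculation assumes every disbursed dollar is hourly pay at the average rate, which is a simplification made here, not something the reported figures confirm.

```python
# Back-of-envelope checks on the figures quoted above.
# Inputs are the article's numbers; the derived values are illustrative only.

def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` years."""
    return (end_value / start_value) ** (1 / years) - 1

# Category size: $2.32B in 2026 to $6.53B in 2031 (five annual steps).
market_cagr = cagr(2.32, 6.53, 2031 - 2026)

# Assumption: all daily disbursements are hourly pay at the average rate.
daily_disbursed_usd = 1_500_000
avg_rate_usd_per_hour = 95
implied_hours_per_day = daily_disbursed_usd / avg_rate_usd_per_hour

print(f"Implied market CAGR: {market_cagr:.1%}")
print(f"Implied expert hours per day: {implied_hours_per_day:,.0f}")
```

Under those assumptions the projection works out to roughly 23% a year, and the disbursement figure to somewhere near 15,800 expert hours per day.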
The headline framing in most coverage is "data labeling firms are growing fast." The substance is one tier deeper, and it's the part every team building a serious AI program should be reading carefully: the senior-domain-expert layer of the AI stack — the people who write rubrics, grade trajectories, author golden examples, and red-team agents — has become the binding constraint on model and product quality. The vendor that owns that bench owns the bottleneck. The customer that can't access that bench can't ship the program.
Why "data labeling" is the wrong frame for what these firms now do
Two years ago, the work being sold by Scale and its peers was, in fact, labeling: bounding boxes around objects, sentiment tags on text, transcriptions of audio. The labor pool was a global crowd, the price per task was measured in cents, and the value-add was scale and quality control on a fundamentally low-skill task.
That market still exists, but it's no longer the interesting market. The interesting market — the one Surge and Mercor are racing each other in, and the one Scale is repositioning to defend — is senior-domain-expert evaluation. The buyers are AI labs and enterprises building production agents. The work is:
- Rubric authoring: a senior practitioner in medicine, finance, law, software engineering, or research writes down what a good answer looks like for a workflow, broken into criteria that can be graded individually.
- Trajectory grading: an expert watches an agent execute a multi-step task — call a tool, read the output, decide what to do next — and grades whether each step was correct, separately from whether the final output happened to be right.
- Golden-example creation: an expert writes the ideal response or trajectory for a representative scenario, which becomes both a training signal (via SFT or rejection sampling) and an eval baseline.
- Red-teaming: an expert deliberately probes the agent for failure modes, writes the adversarial prompts that surface them, and helps the team patch the rubric so future versions are graded on resilience.
- Calibration grading: an expert grades not just the model's outputs but the model's confidence — flagging cases where the agent was right but uncertain, or wrong but confident, both of which point to deployment risk.
This is not labeling. This is the same activity a senior partner does when training a junior associate, a senior physician does when supervising a resident, or a principal engineer does when running a design review. It just happens to be the input that determines whether a frontier model can be deployed in a regulated workflow.
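To make the work items above concrete, here is a minimal sketch of how a rubric, a trajectory grade, and a calibration flag might be represented. Every name in it (`Criterion`, `StepGrade`, `rubric_score`, `trajectory_verdict`, `calibration_flag`), the weights, and the 0.5 confidence threshold are hypothetical choices for illustration; none of them comes from Surge, Mercor, or Scale tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance, chosen by the rubric author

@dataclass
class StepGrade:
    action: str    # what the agent did at this step, e.g. a tool call
    correct: bool  # the expert's judgment on this step alone

def rubric_score(criteria: List[Criterion], passed: Dict[str, bool]) -> float:
    """Weighted share of criteria met; each criterion is graded individually."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if passed.get(c.name, False))
    return earned / total

def trajectory_verdict(steps: List[StepGrade], final_output_correct: bool) -> str:
    """Grade the path separately from whether the final output was right."""
    path_ok = all(s.correct for s in steps)
    if path_ok and final_output_correct:
        return "sound"
    if final_output_correct:
        return "right answer, flawed path"
    if path_ok:
        return "sound path, wrong answer"
    return "failed"

def calibration_flag(correct: bool, confidence: float,
                     threshold: float = 0.5) -> Optional[str]:
    """Flag the two mismatches between correctness and stated confidence."""
    if correct and confidence < threshold:
        return "right but uncertain"
    if not correct and confidence >= threshold:
        return "wrong but confident"
    return None

# Illustrative rubric for a made-up claims workflow.
criteria = [
    Criterion("cites the governing policy", 2.0),
    Criterion("states the correct amount", 3.0),
    Criterion("flags missing documents", 1.0),
]
score = rubric_score(criteria, {
    "cites the governing policy": True,
    "states the correct amount": True,
})

steps = [StepGrade("look up account", True), StepGrade("skip verification", False)]
verdict = trajectory_verdict(steps, final_output_correct=True)
flag = calibration_flag(correct=False, confidence=0.9)
```

The point of keeping the three functions separate is that an agent can score well on the rubric, reach the right answer by a flawed path, and still be miscalibrated, and each of those shows up as a different signal.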
The Surge / Mercor / Scale repositioning
The three big firms have ended up in different parts of the same market.
Surge AI: bootstrapped to scale, profitable at scale, and quiet about it. Surge's pitch from day one was higher quality at higher cost, with a smaller and more curated labor pool than Scale's. That positioning has aged well: when Meta acquired part of Scale and the labs needed an alternative they could fully trust, Surge was already running a senior-expert pipeline at billion-dollar scale, with medical doctors charged out at $200–$500/hour, PhD researchers at $150–$350/hour, and software engineers at $100–$300/hour. Surge sells expertise as the product; the labeling tooling is the delivery mechanism.
Mercor: built on the explicit thesis that the rate-card is the product. It has 30,000+ pre-vetted experts at an average of $95/hour, with more than $1.5M disbursed per day to the contractor base. Mercor's bet is that what AI labs and enterprises want is access to senior practitioners on demand, with the vetting, contracting, payments, and scheduling absorbed by the platform. The labels are an output of that access; the access itself is the SKU.
Scale AI: repositioning. The Meta transaction was bad for Scale's lab business in the short term, but Scale retains the largest labeling infrastructure in the industry, deep relationships with defense and public-sector buyers (Scale's government work is a separate, large business), and significant capability in agentic-AI evaluation tooling. The pitch this year is less "we label your data" and more "we operate the eval and training data lifecycle end-to-end, including the senior-expert layer." Whether that lands depends on how successfully Scale rebuilds the trust it lost with the frontier labs.
The long tail: Labelbox, Sama, Toloka, iMerit, Centaur Labs, a hundred smaller specialist providers, plus the in-house teams every major lab built when it stopped trusting any single vendor. That tail is large, growing, and structurally fragmented; the buyer's experience of it is closer to "manage a vendor portfolio" than "pick a vendor."
Why the rubric author is the bottleneck — and why everyone underestimates it
For two years, the dominant narrative inside every AI lab and every enterprise AI program has been some variant of "the model is the bottleneck." Bigger model, better answers. Newer model, better answers. The benchmark scores keep climbing; the launch posts keep promising new capability tiers; the procurement conversations keep returning to which frontier model to standardize on.
The ground truth, observable in the practice of every team actually shipping production AI, is different. The model has not been the bottleneck for at least eighteen months. The bottleneck has migrated to two things: the rubric, meaning the explicit, written-down standard against which the model's outputs are graded, and the rubric author, meaning the senior practitioner who can write a rubric that actually predicts whether the model is doing the right thing in the workflow.
Three facts that make this concrete.
A frontier model with a bad rubric grades as a mediocre model. If the rubric is wrong — if it gives credit for plausible-sounding answers that happen to be incorrect, or if it penalizes correct answers for surface-level format issues — then no amount of model capability shows up in the eval. Teams that complain their model "hasn't improved" often have a rubric problem, not a model problem.
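A toy illustration of that failure, using entirely made-up grades: the newer model below is correct on every case but often misses an expected surface format. Under a rubric that withholds credit for format, the two models score the same; under a rubric that credits correctness first, the gap appears. The data, the two rubric functions, and the 0.9 / 0.1 weighting are all hypothetical.

```python
# Each tuple is (factually_correct, matches_expected_format) for one test case.
old_model = [(False, True), (False, True), (True, True), (False, True)]
new_model = [(True, False), (True, False), (True, True), (True, False)]

def format_first_rubric(correct: bool, well_formatted: bool) -> float:
    """Flawed rubric: no credit unless the surface format also matches."""
    return 1.0 if (correct and well_formatted) else 0.0

def correctness_first_rubric(correct: bool, well_formatted: bool) -> float:
    """Revised rubric: correctness carries the credit, format is a small bonus."""
    return (0.9 if correct else 0.0) + (0.1 if well_formatted else 0.0)

def mean_score(rubric, answers) -> float:
    """Average rubric score across all graded answers."""
    return sum(rubric(c, f) for c, f in answers) / len(answers)

flawed_old = mean_score(format_first_rubric, old_model)
flawed_new = mean_score(format_first_rubric, new_model)
revised_old = mean_score(correctness_first_rubric, old_model)
revised_new = mean_score(correctness_first_rubric, new_model)

print(f"flawed rubric:  old={flawed_old:.3f} new={flawed_new:.3f}")
print(f"revised rubric: old={revised_old:.3f} new={revised_new:.3f}")
```

With the flawed rubric, the eval reports no improvement even though the newer model went from one correct answer to four.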
A mediocre model with a great rubric ships. The agent-deployment teams that have made it to production with regulated workloads almost universally got there by writing a rubric that was precise enough to catch the failure modes their senior practitioners cared about, then iterating the model and the prompts against that rubric. The rubric is what makes the engineering tractable; the model is the substrate.
Rubric authoring doesn't scale linearly with engineering headcount. Hiring two more ML engineers doesn't get you twice as much rubric. Rubric authoring requires the senior practitioner's judgment about what "correct" means in the workflow, expressed in a form that can be applied consistently across thousands of cases. That work scales with senior-practitioner availability, which is what Surge and Mercor are selling and what every internal AI team is short of.
What it changes for buyers — model labs, enterprises, services firms
Four structural shifts to plan against this quarter.
Model labs are bidding the senior-expert market up. The labs need rubrics across an expanding surface of capabilities — coding, math, scientific reasoning, multi-step agency, vision, audio, multilingual, domain-specific. Surge and Mercor and the long tail are competing for the same population of senior practitioners willing to do this work, and the rate-card has been moving up for two years. Enterprises that wait to staff their own rubric programs are going to find the price-per-expert-hour higher next quarter than this one.
Enterprise AI programs are starting to look like senior-expert engagements with engineering attached, not engineering engagements with experts attached. The shape of a working production AI program — pick a workflow, get a senior practitioner in the room, write the rubric, build the eval, then iterate the model and the prompts — puts the practitioner first and the engineering second. Programs structured the other way around (engineering builds the agent, then asks the practitioner to grade it after the fact) tend to ship later and reach lower quality.
Services firms with senior-expert benches have a structural advantage that compounds. A firm that can put a clinician, a controller, an underwriter, a procurement specialist, a principal engineer into a customer engagement isn't selling labor — it's selling the binding constraint. The same firm that does the AI engineering for the customer can author the rubric, run the calibration, and grade the agent in production, which is a tighter loop than coordinating across two or three vendors. The customer's procurement team will eventually notice this and consolidate.
For most companies, in-house labeling and eval teams are not a substitute for buying that access. A FAANG lab can hire a hundred senior practitioners and run rubric authoring in-house; a $500M regional bank cannot. The price floor on senior-practitioner-hours is set by the labs, and a regional bank's internal hiring will not clear it. For most enterprises, the senior-expert layer is structurally bought, not built, and the choice is which vendor to buy from.
What it doesn't change
Three things worth saying out loud.
The senior expert is not a substitute for the engineering. A perfect rubric authored by a perfect practitioner does not, by itself, produce a deployed agent. The rubric needs to be wired into a CI pipeline, the trajectories need to be captured by an observability layer, the failure cases need to be queued for review, and the rollout needs to be gated on the eval metrics. That's all engineering work, and it's substantial. The rubric without the engineering is a Google Doc that sits in a folder.
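As one hedged sketch of what "gating the rollout on the eval metrics" could look like, the snippet below compares eval results against minimum thresholds and produces a nonzero exit code when any gate fails. The metric names, the thresholds, and the `failed_gates` / `gate_exit_code` helpers are hypothetical; a real pipeline would read the metrics from its own eval run instead of a literal dictionary.

```python
from typing import Dict, List

# Hypothetical thresholds a team might agree on with its rubric author.
GATES: Dict[str, float] = {
    "rubric_pass_rate": 0.95,
    "trajectory_step_accuracy": 0.90,
}

def failed_gates(metrics: Dict[str, float],
                 gates: Dict[str, float] = GATES) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    return sorted(
        name for name, minimum in gates.items()
        if metrics.get(name, 0.0) < minimum
    )

def gate_exit_code(metrics: Dict[str, float]) -> int:
    """Print each failure and return 1 if any gate failed, else 0."""
    failures = failed_gates(metrics)
    for name in failures:
        print(f"GATE FAILED: {name}={metrics.get(name)} (minimum {GATES[name]})")
    return 1 if failures else 0

# Illustrative values only; a CI job would pass the result to sys.exit().
example_metrics = {"rubric_pass_rate": 0.97, "trajectory_step_accuracy": 0.88}
exit_code = gate_exit_code(example_metrics)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an eval that silently stops reporting should block the rollout, not wave it through.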
Quantity of expert hours doesn't substitute for quality of expert. A team that buys 1,000 hours of generalist labeler time and a team that buys 100 hours of senior-practitioner time get very different rubrics. The labor market knows this; the rate-cards reflect it; the procurement conversation should too.
The expert population is finite. Surge's medical doctors at $200–$500/hour are paid that rate for a reason — there are only so many physicians willing and able to do this work, and the labs and enterprises competing for them are not going to discover a cheaper supply. The rate is going to go up. Programs that assume it goes down are going to be surprised.
Where we'd push back on the category narrative
"30,000 pre-vetted experts" is a real number and a coarse one. Across 30,000 contractors there is a meaningful range of seniority, domain depth, and grading consistency. The customer's experience of "a Mercor expert" or "a Surge expert" depends entirely on which expert the platform staffed onto the engagement. Smart buyers ask about the staffing model, the named-expert option, and the platform's process for swapping experts who under-perform. The number is the headline; the staffing is the substance.
"Senior-domain-expert" can mean a lot of things. A board-certified specialist with 15 years of practice, a recent residency graduate, a PhD candidate, a retired senior practitioner doing this part-time — all might show up under the same label. Customers should ask for the resume, the practice profile, and the assignment history, the same way they would if they were hiring a senior contractor for any other engagement.
The platforms are also evaluating the customer, not just delivering experts to it. Surge and Mercor both ration access to their best contractors based on the customer's program quality. Engagements that pay well, scope cleanly, and give the experts good feedback get the best staffing; engagements that are chaotic and poorly scoped get the bench. Customers that assume the platform is a neutral marketplace are missing the dynamic. The platform has favorites, and you want to become one.
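On the grading-consistency question raised earlier in this section, one check a buyer could run for itself is to have two assigned experts grade the same sample and measure their agreement. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic; it is a technique suggested here, not one any of the platforms is known to use, and the labels are invented.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two graders on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both graders must label the same non-empty sample")
    n = len(labels_a)
    # Share of items where the two graders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each grader's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented grades from two experts on the same eight trajectories.
expert_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
expert_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(expert_a, expert_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement in this sample is 75%, but kappa is noticeably lower because two graders who pass most items will agree fairly often by chance. A low value is a prompt to ask the platform about its staffing and swap process, not a verdict on either expert.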
What we'd build differently this week
- Inventory your current rubric coverage. For each workflow your team is automating with AI, write down: does a rubric exist, who authored it, is it under version control, when was it last updated, what failure modes is it known to miss. Most teams discover the rubric is either missing or six months out of date, and that the workflow has changed since.
- Identify the senior practitioner you would put in the room. Inside your org or outside. For each automated workflow, name the person whose judgment defines whether the agent is doing the work correctly. If the answer is "we don't have one," that's the hiring or contracting decision you make before you scale the agent.
- Decide whether you're buying senior expertise or growing it. A small AI program in a regulated industry should probably buy from Surge / Mercor / a senior boutique; a large program with proprietary workflows should plan an internal expert-recruitment motion alongside the vendor relationship. Most successful programs do both.
- Wire rubric authorship into the development cycle. Not as a one-time setup — as a recurring practice. Every time the workflow changes, the model is upgraded, or a regression surfaces, the rubric gets touched. Treat it like a test suite, not like documentation.
- Audit your vendor relationship for the staffing model. Who specifically grades your trajectories? Is it the same expert each week or a rotating pool? What's the swap procedure if quality drops? Get the answers in writing before the rubric becomes load-bearing for a regulator-facing program.
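The inventory step in the list above can start as something very small. The sketch below records the questions from that first item as fields and reports what is missing or stale for each workflow. The `RubricRecord` fields, the `audit` helper, the example workflows, and the 180-day staleness window (standing in for "six months") are all hypothetical choices made for this illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

@dataclass
class RubricRecord:
    workflow: str
    author: Optional[str] = None           # None means no rubric exists yet
    under_version_control: bool = False
    last_updated: Optional[date] = None
    known_gaps: List[str] = field(default_factory=list)  # failure modes it misses

def audit(record: RubricRecord, today: date,
          max_age: timedelta = timedelta(days=180)) -> List[str]:
    """Return human-readable findings for one workflow's rubric."""
    if record.author is None or record.last_updated is None:
        return ["no rubric on file"]
    findings = []
    if not record.under_version_control:
        findings.append("not under version control")
    if today - record.last_updated > max_age:
        findings.append(f"stale: not updated in over {max_age.days} days")
    if not record.known_gaps:
        findings.append("no known failure modes recorded")
    return findings

# Fixed date so the example is reproducible; use date.today() in practice.
today = date(2026, 1, 15)
inventory = [
    RubricRecord("claims triage"),
    RubricRecord("invoice matching", author="controller",
                 under_version_control=True,
                 last_updated=date(2025, 5, 1),
                 known_gaps=["multi-currency invoices"]),
]
report = {r.workflow: audit(r, today) for r in inventory}
for workflow, findings in report.items():
    print(workflow, "->", findings or ["ok"])
```

Running a report like this on every workflow change or model upgrade is one way to treat the rubric like a test suite instead of documentation.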
Sonnet Code's take
The Surge / Mercor / Scale reshape is the moment "data labeling" stopped being a back-office cost and became the senior-practitioner-access market that decides whether a customer's AI program ships at production quality or doesn't. The right read isn't whose valuation is biggest or whose contractor base is largest. It's that the binding constraint on most enterprise AI programs is not the model, not the framework, not the cloud capacity — it's the senior practitioner who writes the rubric, grades the trajectories, and signs off on the eval. The teams that win this year are the ones who treat that practitioner as the load-bearing part of the program, not the optional one.
We staff that work directly. AI training at Sonnet Code is the engagement where senior practitioners — clinicians, controllers, underwriters, principal engineers, security architects, domain specialists from whichever vertical the customer operates in — author the rubrics, write the golden examples, grade trajectories, run calibration, and red-team the agents the customer is shipping into regulated workflows. We pair it with AI development — the engineering that wires the rubric into CI, captures the trajectories in observability, queues the failures for expert review, and gates the rollout on the eval metrics — so the rubric is enforced where it matters and the expert's time is spent on the cases that move the program forward.
If your team is reading the Surge / Mercor coverage this week and wondering whether the rubric layer in your AI program is doing what it needs to do, the next conversation isn't about which labeling vendor to RFP. It's about which workflow you'd put a senior practitioner on first, what their rubric would say, and the engineering work that turns that rubric into a deployment gate.

