The number
Adobe's enterprise AI report this quarter found that only 31% of organizations have implemented a measurement framework for agentic AI. Forty-seven percent either have no framework or aren't sure whether one exists. Twenty-two percent are "in progress." LangChain's 2026 State of AI Agents report puts another stat next to it: 57% of organizations have agents in production, and 32% cite quality as the top barrier to deployment.
Stack the three numbers together and the picture is uncomfortable. More than half the market has agents in production. A third cite quality as the binding constraint. And only one in three has the measurement framework that would tell them whether quality is improving, regressing, or steady. That's not a market that's about to ship its way out of the problem with more agents. It's a market that's about to discover, in Q3, that it can't tell which of its agents are quietly getting worse.
What an "evaluation framework" actually means in 2026
The phrase has gotten loose. Some teams hear it and think "we have a Slack channel where the QA team reports issues." That isn't a framework. The shape of an actually functional agent eval system has stabilized around five components, and most of the 31% that Adobe counted have all five:
1. Simulation environments. A controlled sandbox where the agent runs against realistic but synthetic scenarios — recreations of historical incidents, edge cases pulled from production traces, adversarial prompts the red team has authored. The sim runs deterministically, and the team can replay any failure case to debug.
2. Trajectory evaluation, not just output evaluation. The old eval question was "did the agent produce the right answer?" The current one is "did the agent get there for the right reasons?" For an agent that calls five tools and routes through three reasoning steps, the final output can be correct because the agent guessed and got lucky. Trajectory eval grades the reasoning path itself: was the right tool called at the right step, was each intermediate decision justified, did the agent escalate when it should have. (A minimal grading sketch follows this list.)
3. Real-time observability. Production traces piped into a system that can show, for any session, every tool call, every model response, every retry, every escalation. Without this, the team can't even reproduce the failure cases they're trying to fix.
4. Drift alerts. Agents that perform fine on Tuesday can degrade by Friday — model providers update, upstream APIs change, the input distribution shifts. The framework needs threshold-based alerts on the metrics that matter (task success rate, tool correctness, escalation rate, cost per successful outcome) so regression is caught in hours, not weeks. (A threshold-check sketch also follows this list.)
5. Human review loops. A sampled portion of production sessions — randomly selected, plus all flagged failures — gets reviewed by a domain expert who can grade what the agent did, propose corrections, and feed those corrections back into the training and eval data. The expert is the part that's missing in most "we have monitoring" claims.
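To make the trajectory-versus-output distinction concrete, here is a minimal sketch of a trajectory grader in Python. It assumes the production trace has already been exported as an ordered list of tool-call steps; the step fields, metric names, and example tools are illustrative placeholders, not any particular platform's schema.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str          # which tool the agent called at this step
    justified: bool    # did a reviewer (or LLM judge) accept the reasoning for it?

def grade_trajectory(steps: list[Step], expected_tools: list[str],
                     final_answer_correct: bool, escalated: bool,
                     escalation_required: bool) -> dict:
    """Grade the path the agent took, not just where it ended up."""
    # Tool correctness: right tool at the right step, in order.
    matched = sum(1 for s, exp in zip(steps, expected_tools) if s.tool == exp)
    tool_correctness = matched / max(len(expected_tools), 1)

    # Reasoning quality: fraction of intermediate decisions a reviewer accepted.
    justified_rate = sum(s.justified for s in steps) / max(len(steps), 1)

    # Escalation behaviour is graded explicitly, not inferred from the answer.
    escalation_ok = escalated == escalation_required

    return {
        "final_answer_correct": final_answer_correct,
        "tool_correctness": round(tool_correctness, 2),
        "justified_rate": round(justified_rate, 2),
        "escalation_ok": escalation_ok,
        # An agent can be "right for the wrong reasons": correct output, bad path.
        "lucky_guess": final_answer_correct and tool_correctness < 1.0,
    }

# Example: correct final answer, but the agent skipped the policy-lookup step.
trace = [Step("fetch_account", True), Step("draft_reply", False)]
print(grade_trajectory(trace, ["fetch_account", "policy_lookup", "draft_reply"],
                       final_answer_correct=True, escalated=False,
                       escalation_required=False))
```

The point of the sketch is the last field: an output-only eval would have scored this session a pass, while the trajectory grader flags it as a lucky guess worth a human look.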
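And a similarly minimal sketch of the threshold check behind a drift alert, assuming the relevant metrics are already aggregated per day from production traces. The metric names and the threshold values are placeholders you would set from your own baseline run, not a vendor's defaults.

```python
# Illustrative thresholds: what "normal" looks like for this agent, taken from
# the baseline eval run. Names and values are placeholders, not a product schema.
THRESHOLDS = {
    "task_success_rate": {"min": 0.90},
    "tool_correctness":  {"min": 0.95},
    "escalation_rate":   {"max": 0.15},   # too many escalations = agent is flailing
    "cost_per_success":  {"max": 0.40},   # dollars per successful outcome
}

def check_drift(daily_metrics: dict[str, float]) -> list[str]:
    """Compare today's aggregated metrics against the baseline thresholds."""
    alerts = []
    for name, value in daily_metrics.items():
        bounds = THRESHOLDS.get(name)
        if bounds is None:
            continue
        if "min" in bounds and value < bounds["min"]:
            alerts.append(f"{name} dropped to {value:.2f} (floor {bounds['min']})")
        if "max" in bounds and value > bounds["max"]:
            alerts.append(f"{name} rose to {value:.2f} (ceiling {bounds['max']})")
    return alerts

# Tuesday looked fine; Friday's numbers trip two alerts.
friday = {"task_success_rate": 0.84, "tool_correctness": 0.96,
          "escalation_rate": 0.21, "cost_per_success": 0.35}
for alert in check_drift(friday):
    print("DRIFT:", alert)   # wire this to paging or Slack instead of stdout
```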
If the team can show all five with running dashboards, they're in the 31%. If they can show two or three, they're in the 22% in-progress. If they have a Slack channel and gut feel, they're in the 47%.
Why most teams don't have one
Three reasons keep showing up:
- Eval is unsexy work. Agent demos get screen time at the all-hands. Eval dashboards do not. The team building the agent gets the budget and the headcount; the team that would have built the eval often doesn't exist.
- Generic benchmarks don't grade what matters. SWE-Bench Verified, Terminal-Bench, GDPVal — these tell you the model is good at some workload. They do not tell you the model is good at the workload your underwriting team or your support team or your procurement team actually runs. The eval that matters is the one written against the customer's own workflow, and that eval has to be written by someone who understands the workflow.
- The domain expertise isn't on staff. A good underwriting agent eval needs an underwriter to write the cases. A good clinical-decision-support eval needs a clinician. A good procurement-contract-review eval needs a procurement specialist. These people are not on most engineering teams, and the engineering team writing the agent often doesn't have the budget to pay them at the rate it would take to get a senior version of them in the room.
What this is actually about
The gap is not a tools gap. The platforms exist — Maxim, Galileo, LangSmith, Arize, half a dozen others all sell credible eval suites. The gap is a domain-expertise-in-the-loop gap. The 31% who built a working framework didn't do it by buying a better tool. They did it by getting the right humans into the eval-authoring and review loops, often as senior contractors brought in for the engagement.
This is what the industry has started calling human-in-the-loop AI training — and it's the layer of the agentic stack that's structurally undersupplied. You can buy frontier-model API access by the millions of tokens. You can buy eval platforms by the seat. You cannot, at most companies, buy a senior underwriter who's also fluent enough in agent failure modes to write demonstrations and red-team prompts that actually catch the agent's mistakes. That talent is rare, expensive, and the bottleneck on most agent-program quality plans.
What buyers should do this quarter
Three concrete moves if you're sitting in the 47% or the 22%:
- Audit which of the five components you have. Be honest. Most teams have one or two; pretending you have four is how you end up in the 47% explaining quality regressions to a customer in Q4.
- Pick one workflow, build the framework around it, then port. Trying to build a universal eval system before any single workflow has one is how the project never ships. Pick the agent that matters most, build the full five-component framework for it, and only then generalize the parts that can be generalized.
- Budget for the humans, not just the tools. The expensive line item on a working eval framework is the senior domain expert reviewing trajectories, writing red-team cases, and grading edge cases. If your budget for the eval program is "buy a platform," it isn't going to work. The platform without the humans is a dashboard with no signal in it.
What it doesn't change
A working eval framework is necessary, not sufficient. Three things it won't do for you:
- It won't make a bad agent good. It will tell you the agent is bad. Closing the gap is still engineering work — better prompts, better tool definitions, better retrieval, sometimes a different model. Eval shows you the problem; it doesn't fix it.
- It won't survive a model swap without re-baselining. When you switch models — and you will — the eval needs to be re-run from scratch on the new model and the baselines updated. Teams that treat eval as a one-time setup get burned the first time the model changes underneath them.
- It won't pay for itself in the first quarter. The ROI on eval shows up the first time a regression is caught before it ships, and the second time a customer complaint is closed in a day instead of a week. Both happen, but neither happens in the first sprint.
Sonnet Code's take
The 31% / 47% gap is the bottleneck on most agent rollouts this year, and it's the gap our AI training practice is built around closing. We staff senior domain experts — underwriters, clinicians, procurement specialists, software engineers — into client eval programs to write the trajectories, the demonstrations, and the red-team coverage that take a generic agent and make it deployable in a regulated workflow. We pair that with the AI development side of our work — the engineering that wires the eval framework into CI, into observability, into the rollout gate — so the eval doesn't sit in a notebook but gets enforced where it matters.
If you're in the 47% and the next quarterly review is going to ask you to defend a quality number you can't actually measure, the right next step is to pick one workflow, scope the framework, and staff the eval engagement deliberately. That's the conversation we're best at this year. Bring the workflow; we'll bring the experts and the engineering.

