Sonnet Code
AI Training · May 22, 2026 · 8 min read

ICLR 2026's "Reasoning Trap" Paper: Training Models to Reason Harder Made Tool Hallucination Worse — The Eval Rubric Is the Fix, Not the Next Model

The release, in one paragraph

At ICLR 2026 in Rio de Janeiro this week, a paper titled "The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination" documented something every team running production agents has been quietly observing: post-training a model to reason longer, more rigorously, and across more steps increases the rate at which it fabricates tool calls, including invented function names, hallucinated parameter schemas, and references to capabilities that don't exist in the connected toolset. The pattern held across Claude Opus 4.7, GPT-5.5, Gemini 3.5, and Llama 4.5, with and without extended thinking enabled. Companion benchmark data from the Suprmind May 2026 hallucination report puts the numbers in production terms: citation accuracy is the worst-performing task family across the frontier, averaging a 12.4% hallucination rate even with extended thinking enabled, and across frontier models the range is 4–19% on accuracy-critical workloads. A separate 2026 State of AI Development survey reports that 96% of enterprises now run AI agents in production, while 63% cannot enforce purpose limitations on those agents and 60% cannot quickly terminate a misbehaving one.

The surprising line isn't "models hallucinate." Every team running production AI has known that since 2023. The surprising line is the inversion of an assumption the entire frontier has been built on: that thinking harder produces better outputs. The ICLR paper shows that when the reasoning chain is allowed to generate intermediate plans involving tools, the model becomes more confident about more options, including options that don't exist. The hallucinated tool call is then executed (or attempted) in production, and the failure mode lands as a real error in a real system that paid real money to deploy the agent. The right read isn't "reasoning is broken." It's that the eval rubric for tool-using agents — the rubric that grades whether the called tool actually exists, whether the parameters are valid, whether the trajectory matches a verified-correct path — is the lever that brings the failure rate down. Choosing a different model doesn't help; choosing a better eval does.

Why the reasoning-tradeoff finding moves the work onto the eval team

For three years, the production-AI playbook has assumed that capability upgrades arrive through model upgrades. New model ships, hallucination drops, deploy and move on. The ICLR finding breaks that loop for tool-using agents, because the post-training step that improves general reasoning quality is the same step that introduces the failure mode. You can't fix it by waiting for the next model; the next model has the same tradeoff. You fix it by building the eval rubric that catches the hallucinated tool call before the trajectory commits.

Tool hallucination is a different failure class than text hallucination. A model that hallucinates a citation in a generated essay produces a bad essay. A model that hallucinates a tool call inside an autonomous agent triggers a real attempt to invoke a function — and depending on the harness, either fails noisily (best case) or silently produces a malformed result the downstream consumer treats as authoritative (worst case). The blast radius is structurally larger. The eval discipline has to be structurally tighter.

Extended-thinking modes are where the tradeoff bites hardest. The benchmark data shows the gap most clearly: extended thinking improves general reasoning quality on most tasks, but on tool-using accuracy tasks, the longer chain of intermediate plans is where the hallucinated tool name most often appears. Teams using extended thinking as a default for agentic workloads are paying twice: once in latency for the slower inference, and again for the higher rate of bad tool calls the supervision layer has to catch and re-route. Either pay the eval-design cost up front, or pay the incident-response cost downstream.

The supervision gap shows up in the survey numbers. 96% of enterprises run agents; 63% can't enforce purpose limits; 60% can't kill a misbehaving agent quickly. Those numbers describe an industry that adopted the productivity story without adopting the supervision architecture that makes it safe. The Reasoning Trap paper lands inside that gap. It tells teams that the failure they've been watching isn't a bug the next model release fixes; it's a structural feature of the post-training method, and only a deliberate eval-and-supervision investment closes it.

What the right eval rubric for tool-using agents looks like

Tool existence is the first check. Before scoring whether the agent's use of a tool was correct, the rubric has to score whether the tool the agent called exists in the connected toolset. A trajectory that invokes internal_payments.charge_customer_v2 against a system where only internal_payments.charge_customer exists is a failure regardless of how good the rest of the chain looked. The check is mechanical, cheap, and catches a meaningful fraction of the trapped-reasoning failures the paper documents.
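
A minimal sketch of that mechanical check, assuming the harness can extract each call's tool name from the trajectory. The `ToolCall` type, the helper names, and the `internal_payments.refund` entry are hypothetical and only illustrate the idea; the two `charge_customer` names come from the example above.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation extracted from an agent trajectory (hypothetical type)."""
    name: str
    arguments: dict


def find_nonexistent_calls(trajectory, registered_tools):
    """Return every call whose tool name is absent from the connected toolset."""
    return [call for call in trajectory if call.name not in registered_tools]


def existence_score(trajectory, registered_tools):
    """Fraction of calls that name a real tool; 1.0 for an empty trajectory."""
    if not trajectory:
        return 1.0
    bad = find_nonexistent_calls(trajectory, registered_tools)
    return 1.0 - len(bad) / len(trajectory)


# Illustrative toolset: only the un-versioned charge tool exists.
REGISTERED = frozenset({
    "internal_payments.charge_customer",
    "internal_payments.refund",  # hypothetical second tool
})

trajectory = [
    ToolCall("internal_payments.charge_customer",
             {"customer_id": "c_1", "amount_cents": 500}),
    ToolCall("internal_payments.charge_customer_v2",  # hallucinated name
             {"customer_id": "c_1", "amount_cents": 500}),
]

hallucinated = find_nonexistent_calls(trajectory, REGISTERED)
print([call.name for call in hallucinated])
print(existence_score(trajectory, REGISTERED))
```

Exact set membership is deliberate here: a fuzzy match that "corrects" `charge_customer_v2` to `charge_customer` would hide the failure the rubric is supposed to count.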

Parameter schemas have to be enforced, not just hoped for. The next failure mode is a correctly named tool called with a hallucinated parameter structure: extra fields, wrong types, references to identifiers that don't resolve. The eval rubric scores each parameter against the tool's declared schema, fails the trajectory at the first schema-invalid call, and retains the failure case as a regression scenario. This is the same discipline that strictly typed APIs already enforce at runtime; the eval layer just brings it into the rubric so it counts against the model's score, not just against the production system's error rate.
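
As a rough sketch of failing at the first schema-invalid call, the following uses a deliberately simplified, hand-rolled schema format (field name mapped to expected type and a required flag) as a stand-in for whatever declared schema your tools actually carry, such as JSON Schema. The field names and helper functions are hypothetical.

```python
def validate_arguments(arguments, schema):
    """Return a list of schema violations; an empty list means the call is valid.

    `schema` maps each allowed field name to (expected_type, required).
    """
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in arguments:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        value = arguments[field]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    for field in arguments:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def first_schema_failure(trajectory, schemas):
    """Return (index, errors) for the first schema-invalid call, or None.

    `trajectory` is a list of (tool_name, arguments) pairs. Calls to unknown
    tools are skipped here because existence is scored on its own axis.
    """
    for index, (name, arguments) in enumerate(trajectory):
        if name not in schemas:
            continue
        errors = validate_arguments(arguments, schemas[name])
        if errors:
            return index, errors
    return None


# Hypothetical declared schema for the charge tool.
SCHEMAS = {
    "internal_payments.charge_customer": {
        "customer_id": (str, True),
        "amount_cents": (int, True),
        "memo": (str, False),
    }
}

trajectory = [
    ("internal_payments.charge_customer",
     {"customer_id": "c_1", "amount_cents": 500}),
    ("internal_payments.charge_customer",
     {"customer_id": "c_1", "amount_cents": "5.00", "currency": "USD"}),
]

failure = first_schema_failure(trajectory, SCHEMAS)
print(failure)
```

The returned index and error list are what you would store as the regression scenario. This sketch does not cover identifiers that fail to resolve, which needs a lookup against real data rather than a type check.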

Trajectory-level grading beats step-level grading. A trajectory of ten tool calls where each individual call passed the existence-and-schema checks can still fail at the trajectory level — the agent called the right tools in the wrong order, or skipped a verification step a senior reviewer would have included, or didn't roll back when a previous step's output was anomalous. The eval rubric has to grade the shape of the trajectory, not just the legality of each step. That requires a senior-practitioner author who can specify what a good trajectory looks like for the workload — a credentialed clinician for medical-decision-support agents, a senior corporate attorney for contract-review agents, a principal staff engineer for coding agents.
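
One way to make a trajectory rubric executable is to encode the practitioner's spec as required steps plus ordering constraints. The sketch below assumes that simple form; real rubrics will need richer rules, such as rollback on anomalous output. The tool names and the `grade_trajectory` helper are hypothetical.

```python
def grade_trajectory(tool_names, required, must_precede):
    """Grade the shape of a trajectory from the ordered list of tool names called.

    required: sequence of tool names that must appear at least once.
    must_precede: (earlier, later) pairs; the first use of `later` must come
        after at least one use of `earlier`.
    Returns a list of violations; an empty list means the shape passes.
    """
    violations = []
    for name in required:
        if name not in tool_names:
            violations.append(f"skipped required step: {name}")
    for earlier, later in must_precede:
        seen_earlier = False
        for name in tool_names:
            if name == earlier:
                seen_earlier = True
            elif name == later and not seen_earlier:
                violations.append(f"{later} called before {earlier}")
                break
    return violations


# Hypothetical spec authored by the domain owner for a payment workload.
REQUIRED = ("verify_customer", "charge_customer")
MUST_PRECEDE = [("verify_customer", "charge_customer")]

good = ["verify_customer", "charge_customer", "send_receipt"]
wrong_order = ["charge_customer", "verify_customer", "send_receipt"]
skipped = ["charge_customer", "send_receipt"]

print(grade_trajectory(good, REQUIRED, MUST_PRECEDE))
print(grade_trajectory(wrong_order, REQUIRED, MUST_PRECEDE))
print(grade_trajectory(skipped, REQUIRED, MUST_PRECEDE))
```

Every call in `wrong_order` would pass the existence and schema checks individually; only the trajectory-level rule catches it.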

Confidence calibration has to be measured separately. A model that hallucinates a tool call and reports low confidence about the call is a problem the supervision layer can catch. A model that hallucinates a tool call and reports high confidence is a problem that lands in production. The rubric should measure both — the rate of bad calls, and the calibration of the model's confidence about those calls — because the supervision strategy depends on which mode the model is operating in.
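
A small sketch of measuring both quantities, assuming the harness records a confidence in [0, 1] for each tool call (self-reported or derived some other way; how it is obtained is outside this example) together with a post-hoc correctness label. The function name, metric names, and the 0.8 cutoff are illustrative assumptions.

```python
def calibration_report(records, high_confidence=0.8):
    """Summarize bad-call rate and confidence calibration.

    records: list of (confidence, was_correct) pairs, one per tool call.
    high_confidence: cutoff above which a wrong call counts as confident-and-wrong.
    """
    if not records:
        raise ValueError("need at least one record")
    total = len(records)
    wrong_confidences = [conf for conf, ok in records if not ok]
    accuracy = 1 - len(wrong_confidences) / total
    mean_confidence = sum(conf for conf, _ in records) / total
    confident_wrong = sum(1 for conf in wrong_confidences if conf >= high_confidence)
    return {
        # Share of all calls that were wrong.
        "bad_call_rate": len(wrong_confidences) / total,
        # Share of all calls that were wrong AND reported high confidence.
        "confident_wrong_rate": confident_wrong / total,
        # Positive values mean the model is more confident than it is accurate.
        "overconfidence_gap": mean_confidence - accuracy,
    }


# Illustrative log: four tool calls with (confidence, was_correct).
records = [(0.875, True), (0.875, False), (0.25, False), (1.0, True)]
report = calibration_report(records)
print(report)
```

The split matters for routing: low-confidence wrong calls can be sent to review by a threshold, while a high `confident_wrong_rate` means a threshold alone will not catch the failures.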

What the finding doesn't change

Reasoning models are still net-positive on most workloads. The Reasoning Trap is a real tradeoff on a real workload class — tool-using agents in production — but it doesn't invalidate extended thinking for the workloads where reasoning quality is the dominant axis (complex retrieval-grounded answers, multi-step analysis, code generation without execution). The right read is "reasoning has a workload-specific cost," not "reasoning is broken."

The frontier still moves. Measured against 2024 baselines, hallucination rates across frontier models have dropped 3–8×. The Reasoning Trap is a specific failure mode that the new generation of models inherits; it doesn't reverse the overall trend. The teams that need accuracy-critical workloads brought from 19% to under 1% can still hit that target with extended thinking, retrieval grounding, and the eval-rubric-plus-human-in-the-loop architecture the paper effectively prescribes.

The supervision layer is the durable fix, not a tactical patch. Once you've built the eval rubric, the trajectory grader, the schema enforcer, the calibration tracker, and the human-review surface for the trajectories that fall below the confidence threshold, you've built the infrastructure that survives the next three model upgrades. The teams that treat each model release as a one-shot eval pass are buying themselves a regression every quarter. The teams that build the supervision layer once amortize the cost across every model that ships next.

Where we'd push back on the framing

"Hallucination rate" is a single number on a benchmark that's a poor proxy for production behavior. A 12.4% citation hallucination rate on a public benchmark doesn't tell you what your domain-specific tool-using agent will do in production, against your specific toolset, on your specific workload. Use the published benchmarks to compare frontier models against each other; build your own eval rubric to score what's actually going to ship.

The paper's framing is partly an artifact of its benchmark choice. The Reasoning Trap result is robust on the tool-call accuracy benchmarks the authors chose; the size of the tradeoff varies on different benchmarks. The honest read is "reasoning amplifies tool hallucination on this benchmark family by a measurable amount, and the pattern is consistent enough across models to take seriously." The dishonest read is "reasoning is broken." The first is actionable; the second is internet outrage.

Whether a 12.4% hallucination rate is acceptable depends on the workload. A 12% bad-tool-call rate on a customer-facing payment agent is a Sev-1 incident waiting to happen. A 12% bad-tool-call rate on an internal idea-generation agent is operationally fine. Match the eval discipline to the consequence of failure. Build the rigor where the blast radius is large; relax it where the blast radius is small. The eval portfolio is a triage exercise, not a uniform standard.
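
The triage can be written down as a per-tier release gate. In the sketch below, the tier names and threshold values are placeholders invented for illustration, not recommendations; each team would set its own from the cost of a failure.

```python
# Hypothetical tolerances: maximum acceptable bad-tool-call rate per tier.
# The numbers are placeholders chosen only to make the example concrete.
MAX_BAD_CALL_RATE = {
    "high_blast_radius": 0.01,  # e.g. a customer-facing payment agent
    "low_blast_radius": 0.15,   # e.g. an internal idea-generation agent
}


def release_gate(tier, measured_bad_call_rate):
    """Return True if the measured rate is within the tolerance for this tier."""
    return measured_bad_call_rate <= MAX_BAD_CALL_RATE[tier]


# The same 12% measurement blocks one release and passes the other.
print(release_gate("high_blast_radius", 0.12))
print(release_gate("low_blast_radius", 0.12))
```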

"Just add a verifier" is not the eval rubric. A common response to tool hallucination is to add a second model call that checks whether the proposed tool exists before the call is executed. That helps. It doesn't replace the rubric, because the verifier model is itself susceptible to the same Reasoning Trap on its own decision step. The verifier is a useful component of a supervision stack; it's not the supervision stack.

What we'd build differently this week

  • Audit your agentic eval suites for tool-existence checks. For every production agent, answer: does the eval rubric explicitly score whether the called tool exists in the connected toolset? If the answer is "the test suite would catch it eventually," the rubric has a gap.
  • Score parameter-schema validity as a separate axis. Existence and schema-correctness are two failure modes; the rubric should grade them independently and report them separately. A model that's strong on existence but weak on schema is a different supervision problem than the inverse.
  • Hire (or contract) the senior practitioner who specifies the trajectory rubric. A trajectory grader is only as good as the spec of "what a correct trajectory looks like." That spec needs an owner — a named practitioner with the seniority to defend it at quarterly review, the domain knowledge to author it credibly, and the ongoing availability to update it as workloads evolve.
  • Instrument confidence calibration on every tool-using agent. Log the model's confidence about each tool call alongside the call itself; correlate it with the post-hoc judgment of whether the call was correct. A persistent gap between confidence and correctness is the signal that the model is operating in the "confident-and-wrong" mode the supervision layer has to catch.
  • Build the trajectory-review surface, not just the diff-review surface. The output of a tool-using agent isn't a diff. It's a trajectory of decisions, calls, and side effects. The reviewer needs a UI that surfaces the trajectory, not just the final state. The teams that ship trajectory review well will look like the teams that already ship deployment dashboards well — operational visibility as a first-class product surface.
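
For the separate-axis reporting described in the list above, a minimal sketch might aggregate per-scenario pass/fail flags into one failure rate per axis instead of a single blended score. The axis names and the result format are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical axis names; each eval scenario reports pass/fail per axis.
AXES = ("tool_existence", "schema_validity", "trajectory_shape")


def per_axis_failure_rates(results):
    """Return the failure rate for each axis across all scenario results.

    results: list of dicts mapping every axis name to True (pass) or False (fail).
    """
    if not results:
        raise ValueError("need at least one result")
    failures = Counter()
    for result in results:
        for axis in AXES:
            if not result[axis]:
                failures[axis] += 1
    return {axis: failures[axis] / len(results) for axis in AXES}


# Illustrative suite of four scenarios.
results = [
    {"tool_existence": True, "schema_validity": True, "trajectory_shape": True},
    {"tool_existence": False, "schema_validity": True, "trajectory_shape": True},
    {"tool_existence": True, "schema_validity": False, "trajectory_shape": True},
    {"tool_existence": True, "schema_validity": False, "trajectory_shape": True},
]

rates = per_axis_failure_rates(results)
print(rates)
```

Keeping the axes apart shows, in this made-up suite, a model that is weaker on schema validity than on tool existence, which a single pass rate would blur.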

Sonnet Code's take

The Reasoning Trap paper is the moment the production-AI community has to stop assuming the next model release closes the reliability gap. The gap isn't a bug the frontier labs will patch in the next checkpoint; it's a structural property of the post-training method, and the only durable fix is an eval-and-supervision layer built around the specific workload. The teams that invest in that layer get a flywheel — every model release is graded against the same rubric, every regression is caught the same way, every supervisor sees the same trajectory surface. The teams that don't invest will spend the next year debugging the same failure mode with a different model name each quarter.

That's where our work lives. AI training at Sonnet Code is the senior-practitioner side of the eval engagement — staff engineers, security architects, principal reviewers, and (through partner networks in regulated domains) credentialed clinicians and attorneys — who author the tool-existence checks, the schema enforcers, the trajectory rubrics, and the failure-mode catalogs the eval layer scores against. AI development is the engineering plumbing — the eval harness, the verifier stack, the trajectory-review UI, the calibration tracker, the human-in-the-loop surface — that turns the rubric into a release-blocking artifact your supervisor team can actually operate. If your team is reading the Reasoning Trap result this week and quietly recognizing failure modes you've already seen in production, the next conversation isn't about which model to swap to. It's about who authors your tool-using eval rubric, who owns the trajectory review, and the supervision contract that catches the hallucinated call before it lands in the audit log.