Sonnet Code
Developer Tools · May 28, 2026 · 8 min read

Claude Opus 4.7 Hit 87.6% on SWE-bench Verified. The Score Is Real. The Gap Between a Benchmark and Shipped Software Is What's Worth Studying Now.

A real number, not a marketing one

On April 16, 2026, Anthropic shipped Claude Opus 4.7. The headline result is the one that has driven the developer-tools conversation for the last six weeks: SWE-bench Verified jumped from 80.8% to 87.6%, a 6.8-point gain in a single version bump. SWE-bench Pro climbed even more, from 53.4% to 64.3%, a 10.9-point gain. CursorBench moved from 58% to 70%. By mid-May, follow-on benchmarks from third parties showed the same shape: the gap between Opus 4.7 and the rest of the frontier on agentic coding tasks is the largest single-model gap of the year so far.
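To make the size of those jumps concrete, here is a small sketch that recomputes the point gains from the scores quoted above. It also derives a second view that the article does not state: how much of the remaining gap to 100% each jump closed. That second metric is our own framing, an assumption about what is useful to look at, not something the benchmark authors report.

```python
# Scores quoted in the article: (before, after), in percent.
scores = {
    "SWE-bench Verified": (80.8, 87.6),
    "SWE-bench Pro": (53.4, 64.3),
    "CursorBench": (58.0, 70.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (point gain, share of the remaining gap to 100% that was closed, in %)."""
    point_gain = after - before
    gap_closed = point_gain / (100.0 - before) * 100.0
    return round(point_gain, 1), round(gap_closed, 1)


summary = {name: gains(before, after) for name, (before, after) in scores.items()}

for name, (points, gap_closed) in summary.items():
    print(f"{name}: +{points} points, {gap_closed}% of the remaining gap closed")
```

Read this way, the Verified jump is a 6.8-point gain that removes roughly a third of the tasks the previous model left unresolved, which is why the result is a real one and not a rounding error.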

Pricing held flat at $5 per million input tokens and $25 per million output tokens. Same API. Same context. Same SDK. Drop in the new model name and the bench numbers go up. That's the part that's true. The part that's quietly false is the inference everyone is rushing to: that shipping software with AI got roughly 7 points better last month.
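Flat pricing plus a higher resolve rate has a simple arithmetic consequence, sketched below. The per-token prices are the ones quoted above; the token counts for one agentic attempt are hypothetical placeholders, and the calculation assumes token usage per attempt did not change between versions, which you would need to measure on your own workload.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def attempt_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one model attempt at the quoted prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


def cost_per_resolved_task(cost_per_attempt: float, resolve_rate: float) -> float:
    """Expected USD spent per task that actually gets resolved."""
    if not 0.0 < resolve_rate <= 1.0:
        raise ValueError("resolve_rate must be in (0, 1]")
    return cost_per_attempt / resolve_rate


# Hypothetical token usage for one attempt (an assumption, not a measurement).
per_attempt = attempt_cost(input_tokens=200_000, output_tokens=20_000)

# Resolve rates taken from the SWE-bench Verified scores quoted in the article.
before = cost_per_resolved_task(per_attempt, 0.808)
after = cost_per_resolved_task(per_attempt, 0.876)

print(f"per attempt: ${per_attempt:.2f}")
print(f"per resolved task: ${before:.2f} before, ${after:.2f} after")
```

The model bill per resolved benchmark task goes down. Nothing in this calculation covers the human time spent framing, reviewing, and deploying the change, which is the cost the rest of this article is about.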

What the benchmark is measuring, and what it isn't

SWE-bench Verified is a curated set of real GitHub issues from real Python projects, each with a hidden test suite. The model gets the issue text and the repo, has to produce a patch, and the patch either passes the tests or it doesn't. It is a genuinely good benchmark — far better than the toy code-completion ones it replaced. It rewards a model that can read a non-trivial codebase, locate the right file, and write a change that actually works. A 7-point jump on it is a real engineering achievement.
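The pass-or-fail scoring described above can be sketched in a few lines. This is a simplified illustration, not the official harness: it assumes each task records which previously failing tests now pass (the tests that capture the issue) and which previously passing tests still pass (the regression guard), and counts a task as resolved only if both groups are fully green. The class name, field names, and task IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    # Tests that failed before the patch and should pass after it.
    fail_to_pass: dict[str, bool]
    # Tests that passed before the patch and must keep passing.
    pass_to_pass: dict[str, bool]


def is_resolved(result: TaskResult) -> bool:
    """A task counts only if the issue is fixed and nothing else broke."""
    if not result.fail_to_pass:
        return False  # nothing demonstrates that the issue was fixed
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved; there is no partial credit."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Hypothetical results for three tasks.
results = [
    TaskResult("task-1", {"test_bug": True}, {"test_old": True}),
    TaskResult("task-2", {"test_bug": True}, {"test_old": False}),  # fix broke something
    TaskResult("task-3", {"test_bug": False}, {"test_old": True}),  # issue not fixed
]
print(f"resolved: {resolve_rate(results):.1f}%")
```

The binary outcome is what makes the number trustworthy, and it is also why the score depends entirely on the tests someone wrote before the model ever saw the task.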

But notice what the benchmark assumes the human has already done before the model starts: the issue is well-written. The reproduction is clear. The right tests exist. The bug has been triaged. The architectural constraints are implicit in the test suite. Every one of those is, in your codebase, a thing your team had to do. Strip those scaffolds away and you don't have an 87.6% bug-fixer; you have a very capable junior engineer who needs the work framed before they can do it.

The gap is the scaffolding, and it didn't move

Here is the part that gets lost in the leaderboard chase. The journey from "the model passes the test" to "the change is in production at your company" goes through a series of human and engineering checkpoints that haven't gotten meaningfully cheaper just because the model got better:

  • The ticket has to be specified well enough for the model to act on. A vague Jira card is still vague to an agent, and "fix the dashboard" is not a SWE-bench prompt.
  • The repo has to be intelligible. A model loose in a half-undocumented monorepo with five overlapping conventions does not score 87.6%. It scores whatever the worst surface area lets it score.
  • The tests have to exist, and they have to be the right tests. Models close the loop on what you measure; if your tests don't cover the regression the change can cause, the agent's "passing" patch is a future incident.
  • The change has to be reviewed by someone who understands the system. A reviewer who rubber-stamps a green PR has not added review; they have added latency.
  • The deployment has to be safe to roll back. Agentic edits that ship through a deploy pipeline without good observability turn a clever model into an efficient incident generator.
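One way to make the five checkpoints above explicit is a pre-flight gate that an agentic task must clear before it is dispatched. The sketch below is a minimal illustration under our own assumptions: the field names are hypothetical, and in practice each flag would be backed by a real check (a ticket template, a coverage report, a rollback drill) and not by a hand-set boolean.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Readiness:
    """One flag per checkpoint in the list above (names are hypothetical)."""
    ticket_is_specific: bool
    repo_is_intelligible: bool
    tests_cover_change: bool
    reviewer_knows_system: bool
    rollback_is_safe: bool


def missing_scaffolding(readiness: Readiness) -> list[str]:
    """Return the checkpoints that are not yet satisfied, in declaration order."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def ready_for_agent(readiness: Readiness) -> bool:
    """Dispatch the task to an agent only when nothing is missing."""
    return not missing_scaffolding(readiness)


# Example: a well-written ticket in a repo with thin tests and no rollback plan.
task = Readiness(
    ticket_is_specific=True,
    repo_is_intelligible=True,
    tests_cover_change=False,
    reviewer_knows_system=True,
    rollback_is_safe=False,
)
print(ready_for_agent(task), missing_scaffolding(task))
```

The point of returning the list and not just a boolean is that the missing items are the work: they tell the team what to fix before a stronger model can help.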

None of that is what SWE-bench tests, which is fine: no benchmark does. The problem is that the delta between Opus 4.6 and Opus 4.7 on Verified is being read as the delta in shipped-software productivity, and they are not the same number. The model got 7 points better. The scaffolding around it got 0 points better unless your team did the work.

What actually changes when a frontier coding model gets sharper

This isn't an argument for ignoring the new model. Opus 4.7 is genuinely the strongest agentic coding model right now, and the right read is to put it to work. The change is in what the bottleneck is once you do.

Pre-4.7, the limit on autonomous coding tasks was the model's ability to reason across a real codebase. That limit moved. The new limit is your team's ability to frame work well, evaluate output honestly, and integrate agent-produced changes into your existing review and deploy discipline. Those skills are not in the model's weights. They are in your engineering process — and they're what determines whether a 7-point benchmark gain translates into a single percentage point of shipped-feature velocity or a flood of subtle, hard-to-debug regressions.
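Evaluating output honestly usually means comparing models on your own tasks, not on a public leaderboard. The sketch below assumes you already have a pass/fail outcome per internal task for the old and new model (how you produce those outcomes is up to your harness; the task IDs here are hypothetical). It reports the two resolve rates and, more usefully, which tasks were newly fixed and which regressed.

```python
def compare_runs(old: dict[str, bool], new: dict[str, bool]) -> dict:
    """Paired comparison of two models over the tasks both of them attempted."""
    shared = sorted(old.keys() & new.keys())
    if not shared:
        raise ValueError("the two runs share no tasks")
    return {
        "tasks": len(shared),
        "old_rate": sum(old[t] for t in shared) / len(shared),
        "new_rate": sum(new[t] for t in shared) / len(shared),
        # Tasks the old model failed and the new model resolved.
        "fixed": [t for t in shared if not old[t] and new[t]],
        # Tasks the old model resolved and the new model now fails.
        "regressed": [t for t in shared if old[t] and not new[t]],
    }


# Hypothetical outcomes on four internal tasks.
old_run = {"billing-12": True, "auth-7": False, "ui-3": True, "etl-9": False}
new_run = {"billing-12": True, "auth-7": True, "ui-3": False, "etl-9": False}

report = compare_runs(old_run, new_run)
print(report)
```

In this made-up example the aggregate rate is unchanged while one task was fixed and another regressed. A headline number alone would hide that, which is the case for looking at paired outcomes before switching models in a production workflow.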

The cheapest way to waste a frontier model is to point it at a workflow that wasn't ready for it. The most expensive way is to do it in production.

Where Sonnet Code fits

A model that jumps 7 points on Verified is the easy half of the equation. The hard half is the layer around it — and that's where our work lives. AI development at Sonnet Code is the engineering that turns a frontier coding model into a production capability: framing tickets to be agent-actionable, wiring the model into your review and CI so changes are governed instead of injected, and building the observability so a regression is a finding, not a Tuesday. AI training is the human-judgment half: senior engineers and domain experts who define what "correct" looks like for your codebase, build the evaluation harnesses that turn "the model feels better" into a measured number, and stand up the review discipline that lets your team trust agentic edits without rubber-stamping them.

87.6% on SWE-bench Verified is real. Whether it shows up in your deployment frequency depends on the scaffolding around it. That scaffolding is the work — and it's the work that compounds when the next model lands.