The release in one paragraph
On April 16, Anthropic released Claude Opus 4.7. The headline numbers: 87.6% on SWE-bench Verified (up from 80.8% on 4.6), 64.3% on SWE-bench Pro (up from 53.4%), and 70% on CursorBench (up from 58%). Price is unchanged at $5 / $25 per million tokens. Vision input jumped to 3.75 megapixels. A new xhigh effort level sits between high and max, a task-budgets beta caps runaway agent spend, and Claude Code picked up a new command called /ultrareview.
The benchmark bump is the headline. The command is the story.
What /ultrareview actually changes
/ultrareview runs Opus 4.7 at its highest effort level against a diff, a PR, or a staged change, and produces a review that — in our early testing — surfaces roughly the same set of issues a careful senior engineer would on a second pass. It is not the first AI code review tool. It is the first one with a hit rate high enough that skipping it feels like skipping the build step.
The practical shift is that the first human reviewer on a PR no longer starts cold. They start with a triaged set of findings, ordered by severity, with the issues a careful pass would catch (style drift, missing null checks, dead branches, even subtle concurrency bugs) already flagged. Their job becomes adjudication and architectural judgment, not line-by-line pattern matching.
If you run a team of ten engineers, this is not a 10% efficiency win. It is a reshaping of what the PR queue looks like. The review bottleneck that every mid-sized engineering org deals with now has a real pressure release valve.
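For a sense of what that pass looks like mechanically, here is a minimal sketch that reproduces its shape directly against the messages API, outside Claude Code. The model id, the review prompt, and the severity scheme are our assumptions rather than the actual /ultrareview internals, and the new xhigh effort setting is left out entirely.

```python
# Sketch: an /ultrareview-style pass over a staged change, done directly
# against the messages API. Model id, prompt, and severity scheme are
# assumptions, not the command's documented internals.
import subprocess

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The staged diff is the same input the command would see for a staged change.
diff = subprocess.run(
    ["git", "diff", "--staged"], capture_output=True, text=True, check=True
).stdout

REVIEW_PROMPT = """Review the following diff as a senior engineer on a second pass.
List findings ordered by severity (blocker, major, minor, nit), one per line,
each with a file:line reference and a one-sentence explanation.

{diff}"""

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model id
    max_tokens=4000,
    messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
)

print(response.content[0].text)
```

The real command has the full repository in context and runs at the model's highest effort level; the sketch only shows the shape of the input and output.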
The agentic scratchpad upgrade matters more than the benchmarks
Buried in the release notes: agents that write to and read from scratchpads or notes files across long sessions get noticeably more reliable behavior. Multi-session work that previously lost context now holds it.
This is the change that unlocks real production agent workflows. The failure mode that has killed most agent-in-the-loop features in the past 18 months was not the model being wrong on a given step. It was the model losing its own context across steps and confidently acting on a stale understanding of the task. A scratchpad that actually persists intent across calls is the primitive that turns an impressive demo into a reliable daily driver.
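Here is a minimal sketch of that primitive, assuming a plain JSON notes file on disk and the standard messages API. The file layout, the prompt, and the parsing are illustrative choices on our part, not Anthropic's scratchpad implementation.

```python
# Sketch: persisting agent intent across sessions in a notes file, so a new
# session resumes from the model's last recorded understanding of the task
# instead of a cold prompt. Layout, prompt, and parsing are illustrative.
import json
from pathlib import Path

import anthropic

SCRATCHPAD = Path("agent_scratchpad.json")
client = anthropic.Anthropic()

def load_scratchpad() -> dict:
    """Recover the agent's own record of goal, progress, and open questions."""
    if SCRATCHPAD.exists():
        return json.loads(SCRATCHPAD.read_text())
    return {"goal": None, "done": [], "open_questions": []}

def run_session(task: str) -> str:
    state = load_scratchpad()
    if state["goal"] is None:
        state["goal"] = task  # first session: record intent, not just the raw prompt
    prompt = (
        "You are resuming a long-running task. Your persisted working memory:\n"
        f"{json.dumps(state, indent=2)}\n\n"
        f"Current request: {task}\n"
        "Do the next step, then restate your updated working memory as JSON "
        "after a line that says WORKING_MEMORY:"
    )
    response = client.messages.create(
        model="claude-opus-4-7",  # hypothetical model id
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.content[0].text
    if "WORKING_MEMORY:" in reply:
        try:
            state = json.loads(reply.split("WORKING_MEMORY:", 1)[1].strip())
        except ValueError:
            pass  # keep the previous state if the memory block doesn't parse
    SCRATCHPAD.write_text(json.dumps(state, indent=2))
    return reply
```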
Teams with an agent feature stalled in staging should retest on 4.7 before writing the capability off. We have seen workflows that failed consistently on 4.6 start passing on 4.7 without any prompt changes — just the architecture change around how the model handles its own working memory.
The gaps worth naming
Opus 4.7 loses on Terminal-Bench and BrowseComp relative to the best frontier alternatives. If your product is a browser agent, or if your workflow is dominated by shell-driven automation, this is not the model. The same team that wins on structured coding tasks gives ground on open-ended web navigation. That is a useful reminder that there is no single best model anymore. There is a best model per workload, and the workload-to-model map now matters more than the headline benchmark.
What we would build against this today
For product teams with AI features in or near production:
- Wire /ultrareview into CI as a non-blocking signal. Let it comment on every PR for two sprints before deciding how much weight to give it. The correct deployment of AI review is an advisory signal that sometimes escalates, not a gate that sometimes blocks. (A sketch of the wiring follows this list.)
- Rebuild your agent scratchpad layer. If your agent framework was designed around 4.6's context limitations, the extra reliability on 4.7 only helps if you let the model use it. Revisit how your agent persists intent between calls.
- Reprice your coding-assist features. At the same per-token price but with 87.6% SWE-bench performance, the cost per successful task dropped roughly 8%: with flat cost per attempt, cost per solved task scales with the inverse of the pass rate, and 1 - 80.8/87.6 ≈ 8%. Not huge, but enough to widen the margin on automation products that were previously marginal.
- Do not ship multi-model routing this quarter without measuring. The xhigh effort level is cheap enough that for many workloads, routing everything to 4.7 at xhigh beats a cleverly-routed mix of smaller models. Measure before you architect.
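On the first item, a sketch of the non-blocking wiring: the script below reviews the PR diff, posts the findings as a comment, and exits 0 no matter what, so it can never hold a merge. It assumes a GitHub-hosted repo, a CI step that exposes GITHUB_TOKEN, GITHUB_REPOSITORY, and a PR_NUMBER variable, and the messages API standing in for the actual /ultrareview command.

```python
# Sketch: advisory AI review in CI. Review the diff, comment on the PR, always
# exit 0. Env vars and the direct API call in place of /ultrareview are assumptions.
import os
import subprocess
import sys

import anthropic
import requests

def review_diff(diff: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-7",  # hypothetical model id
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": "Review this diff; list findings ordered by severity:\n\n" + diff,
        }],
    )
    return response.content[0].text

def post_pr_comment(body: str) -> None:
    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "org/repo"
    pr_number = os.environ["PR_NUMBER"]
    requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": body},
        timeout=30,
    ).raise_for_status()

if __name__ == "__main__":
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"], capture_output=True, text=True
    ).stdout
    try:
        post_pr_comment("Automated review (advisory, non-blocking):\n\n" + review_diff(diff))
    except Exception as exc:  # advisory means it never fails the build
        print(f"review skipped: {exc}", file=sys.stderr)
    sys.exit(0)
```

The one load-bearing choice is the unconditional exit 0: the review can inform or escalate, but it cannot block while the team is still calibrating how much to trust it.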
The broader read
April 2026 is now on track to be the densest release month of the cycle: Opus 4.7 from Anthropic, a Codex overhaul from OpenAI, Gemini 3.1 Ultra from Google, and whispered frontier releases from at least two other labs. The pattern we see in the teams shipping well through this: they stopped trying to pick the model and started building evaluation harnesses that make swapping models a 30-minute exercise rather than a quarter-long migration. The model picks itself when the eval is in place. Teams without an eval pipeline are still debating whose vibes-test matters most.
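What a thirty-minute swap looks like in practice is roughly this: the tasks and the scoring live in the harness, and the model behind them is a single argument. A sketch with placeholder tasks and hypothetical model ids:

```python
# Sketch: a model-agnostic eval harness. Tasks and scoring belong to the
# harness; the model is one swappable call. The task data, pass criterion,
# and model ids below are placeholders, not a real benchmark.
from dataclasses import dataclass
from typing import Callable

import anthropic

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # did the output solve the task?

def make_model(model_id: str) -> Callable[[str], str]:
    client = anthropic.Anthropic()
    def call(prompt: str) -> str:
        response = client.messages.create(
            model=model_id,
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    return call

def run_eval(model: Callable[[str], str], tasks: list[Task]) -> float:
    passed = sum(1 for t in tasks if t.check(model(t.prompt)))
    return passed / len(tasks)

tasks = [Task(prompt="Reverse the string 'abc'.", check=lambda out: "cba" in out)]
for model_id in ["claude-opus-4-6", "claude-opus-4-7"]:  # hypothetical model ids
    print(model_id, run_eval(make_model(model_id), tasks))
```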
The floor for what "AI-integrated product" means moved again this month. Teams with the right substrate will feel the lift. Teams without it will keep chasing headlines.

