Computer-Use Agents Hit GA in May — But the 78% OSWorld Ceiling Is Where the Real Engineering Starts

The capability is real now

Computer-use agents, meaning AI that operates ordinary software the way a person does by reading the screen, clicking, and typing, spent most of 2025 as a demo genre. As of this month they're a product you can buy with an SLA. Microsoft shipped Copilot Studio computer-use agents to general availability on May 13, 2026, across all commercial Power Platform geographies. As of late May, Anthropic's computer use is still in paid-plan beta and Google's Gemini Computer Use is in public preview. By Microsoft's framing, that makes Copilot Studio the only computer-use platform an enterprise can deploy today against a production SLA with audit-compliant logging, RBAC credential isolation, and broad geographic availability.

That governance triad — auditable logs, role-based credential isolation, geo-broad availability — is what moves a capability from "interesting" to "deployable." It's also a tell: when a vendor leads with governance rather than raw capability, it's because capability alone was never the thing blocking production.

The ceiling nobody puts on the slide

Here is the number that matters more than the GA date. On OSWorld-Verified, the standard benchmark for agents operating a real desktop, the independent aggregator Vellum reports Claude Sonnet 4.6 at 72.5% and Claude Opus 4.6 at 72.7% as of February 2026, with a later measurement putting Claude Opus 4.7 at 78.0% — the current high-water mark. These are the best scores available, on the best models, and they top out in the high 70s.

Sit with what that means operationally. Read as a per-step figure, 78% means a little more than one step in five goes wrong on a hard task. And computer-use work is almost never one step. It's a chain: open the app, find the record, edit three fields, submit, confirm. Per-step error compounds: if steps fail independently, a chain of n steps that each succeed with probability p finishes correctly with probability p^n. A five-step task where each step succeeds 90% of the time finishes correctly only about 59% of the time (0.9^5 ≈ 0.59). At 78% per step, the same five-step task finishes correctly only about 29% of the time. The benchmark score is a per-step number; the business cares about the end-to-end number, and the end-to-end number on an unscaffolded multi-step task is worse than the headline.
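
To make the compounding concrete, here is a minimal sketch in Python. It assumes steps fail independently, which is a simplification (real failures often cluster on the same hard screen), and the function name `end_to_end_success` is illustrative, not part of any vendor SDK.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step succeeds independently with the same probability.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Compare the article's illustrative 90% figure with the 78% headline score.
for p in (0.90, 0.78):
    for n in (1, 3, 5, 10):
        print(f"per-step={p:.2f} steps={n:2d} -> end-to-end={end_to_end_success(p, n):.1%}")
```

Under the same independence assumption, a ten-step chain at 90% per step lands near 35%, which is why chain length matters as much as the per-step score.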

This is not a knock on the models — they are remarkable, and the trend line is up. It's a statement about where the engineering is. The gap between 78% per-step and a process you'd let run unattended over real systems is not closed by waiting for a better model. It's closed by everything you build around the model.

What actually closes the gap

The teams getting reliable value from computer-use agents aren't the ones with a magic prompt. They're the ones who treat the agent as an unreliable-but-cheap worker and engineer accordingly. Notably, Microsoft shipped agent evaluations alongside the computer-use GA — and that pairing is the whole point. The work that closes the gap looks like this:

Evals first, in your own environment. As Anthropic's own guidance on the topic puts it, evals turn vague expectations into measurable checks that catch regressions early. You validate against your scenarios, your policies, and your production data — measuring quality across a full test set, not the three cases that happened to work in the demo. Without this you have no idea what your real end-to-end success rate is, which means you can't tell when an update made it worse.
Scope the task to the agent's actual reliability. A 78% per-step agent is excellent at narrow, well-bounded, reversible tasks and dangerous on long, branching, irreversible ones. The engineering decision is which steps to hand it — and that decision should be driven by your eval numbers, not by ambition.
A human-review surface for low-confidence and high-stakes actions. The agent does the work; a person confirms anything that's expensive to get wrong. Done well, this is not a bottleneck — it's the routing that lets you ship at all, sending the routine 80% to the agent and the consequential tail to someone qualified to catch it.
Governance that survives an audit. Credential isolation so the agent acts with a scoped identity, not a shared admin login. Logs detailed enough to reconstruct what it did and why. A kill switch. These are the GA features for a reason; they're also the features teams disable for convenience and regret later.
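
The review routing described above can be sketched as a small policy function. This is an illustration under stated assumptions, not Copilot Studio's or any other vendor's API: the `ProposedAction` fields, the thresholds, and the idea that the agent reports a usable confidence score are all hypothetical and would need calibrating against your own eval data.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"                  # agent may execute without review
    HUMAN_REVIEW = "human_review"  # a person confirms before execution


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical description of one action the agent wants to take."""
    description: str
    confidence: float     # assumed agent-reported score in [0, 1]
    reversible: bool      # can the action be undone cheaply?
    cost_if_wrong: float  # rough estimate of the damage from a mistake


def route(action: ProposedAction,
          min_confidence: float = 0.90,
          max_auto_cost: float = 100.0) -> Route:
    """Send irreversible, expensive, or low-confidence actions to a person."""
    if not action.reversible:
        return Route.HUMAN_REVIEW
    if action.cost_if_wrong > max_auto_cost:
        return Route.HUMAN_REVIEW
    if action.confidence < min_confidence:
        return Route.HUMAN_REVIEW
    return Route.AUTO


# Illustrative actions only; the values are made up.
routine = ProposedAction("update phone number field", 0.97, True, 5.0)
refund = ProposedAction("submit refund", 0.99, False, 250.0)
unsure = ProposedAction("select matching customer record", 0.62, True, 20.0)

for a in (routine, refund, unsure):
    print(f"{a.description}: {route(a).value}")
```

Checking reversibility and cost before confidence is a deliberate choice in this sketch: a high confidence score should not be able to wave through an action that is expensive to get wrong.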

The mistake to avoid

The predictable failure mode this year will be teams that watch the GA demo, see the agent fly through a happy-path task, and roll it out across a process with the per-step reliability assumptions of a deterministic script. Then the 22% bites — in production, on real data, on the step that was expensive to get wrong — and the conclusion drawn is "the technology isn't ready." The technology is ready, for the right scope, with the right scaffolding. What wasn't ready was the engineering around it.
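
One way to avoid carrying script-like assumptions into a rollout is to measure the end-to-end rate on your own test set first. The sketch below is a hypothetical harness, not a real product's evaluation API: `EvalCase`, `evaluate`, and `regressed` are names invented for illustration, and the scripted outcomes stand in for real agent runs checked against the final state of your system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One end-to-end scenario; run() returns True if the final state is correct."""
    name: str
    run: Callable[[], bool]


def evaluate(cases: Sequence[EvalCase], trials: int = 3) -> Dict[str, float]:
    """Run each case several times and return its end-to-end pass rate."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    results: Dict[str, float] = {}
    for case in cases:
        passes = sum(1 for _ in range(trials) if case.run())
        results[case.name] = passes / trials
    return results


def overall_rate(results: Dict[str, float]) -> float:
    """Unweighted mean pass rate across all cases."""
    return sum(results.values()) / len(results)


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the current rate fell below the baseline by more than the tolerance."""
    return current < baseline - tolerance


def scripted(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Stand-in for a real agent run: replays fixed outcomes in order."""
    it = iter(outcomes)
    return lambda: next(it)


# Made-up cases and outcomes, used only to exercise the harness.
cases = [
    EvalCase("update_address", scripted([True, True, True])),
    EvalCase("issue_refund", scripted([True, False, False])),
]
results = evaluate(cases, trials=3)
rate = overall_rate(results)
print(results)
print(f"overall={rate:.1%} regressed={regressed(rate, baseline=0.80)}")
```

Running each case more than once matters because agent behavior is not deterministic; a single passing run says little about the rate you would see in production.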

Where Sonnet Code fits

This is AI development work in the most literal sense: the model is the easy part, and the value is in what you build around it. Concretely, that's standing up the eval harness that tells you your real end-to-end success rate on your tasks before anything ships, scoping each step to what the agent is actually reliable enough to do, building the human-review surface for the low-confidence and high-stakes cases, and wiring in the credential isolation and audit logging that let it pass a security review. The eval and human-judgment layer also leans on our AI training practice — the senior people who define what "correct" means for your task and author the test cases that measure it.

Computer-use agents going GA is real and worth acting on. Just don't confuse the GA announcement with the work. The announcement is the capability. The work is making it trustworthy on your systems, and that part is still done by hand, and done well by people who've shipped it before.