Sonnet Code
AI & Machine Learning · May 8, 2026 · 7 min read

Dreaming, Outcomes, Multiagent: Anthropic Just Productionized the Three Things Custom Agents Were Failing At

The release, in one paragraph

On May 6, 2026, opening day of Code w/ Claude 2026, Anthropic shipped three additions to Claude Managed Agents: Dreaming (a research preview where a scheduled process reviews past sessions and memory stores to surface recurring mistakes, converged workflows, and team-shared preferences); Outcomes (the agent runs a loop in which a separate grader, in its own context window, scores output against a rubric the team authored, and the lead agent re-attempts when the grader flags gaps); and Multiagent Orchestration (a lead agent decomposes the job and dispatches specialist subagents, each with its own model, prompt, and tools, operating in parallel on a shared filesystem and contributing back to the coordinator's context). Outcomes, multiagent, and managed memory are in public beta; Dreaming is in research preview.

The headline framing is "three new features." The substance is that each one is the productionized answer to a specific failure mode that every team running real agents has been hitting since the platform shipped: memory drift, reward gaming, and serial bottlenecking. Read the release as three problems Anthropic finally took ownership of solving instead of leaving every customer to reinvent the fix on their own.

Why Outcomes is the most consequential of the three

Dreaming will get the magazine coverage. Multiagent will get the architecture-diagram tweets. The feature that actually changes how teams ship agents is Outcomes, and it's worth saying plainly: Outcomes is a rubric-as-reward training loop, but applied at inference time inside the agent runtime.

The pattern most senior teams have been hand-rolling for a year — agent generates a draft, a separate grader applies a rubric, agent revises against the grader's feedback — is now first-class infrastructure. The grader runs in its own context window, so it doesn't inherit the agent's reasoning chain. The rubric is something the customer writes and versions, not something Anthropic ships. The loop is bounded; the lead agent gets the grader's structured feedback and either accepts the result or takes another pass.
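
For teams that want to reason about what the managed loop actually does, here is a minimal sketch of the hand-rolled pattern it replaces, written against the public Anthropic Messages API. The model IDs, the rubric text, the PASS/FAIL convention, and the three-pass bound are our illustrative assumptions, not the Managed Agents Outcomes schema.

```python
# Hand-rolled sketch of the grade-and-revise loop that Outcomes productionizes.
# Uses the public Anthropic Messages API; model IDs, rubric wording, and the
# PASS/FAIL convention are illustrative assumptions, not the Outcomes schema.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """Grade the draft against each criterion. First line: PASS or FAIL.
Then one bullet per unmet criterion, stating what is missing.
1. Every factual claim is supported by the provided source material.
2. The summary is under 200 words.
3. Every action item names an owner and a date."""

def grade(draft: str) -> str:
    """Grader in its own context: it sees only the rubric and the draft,
    never the lead agent's reasoning chain."""
    reply = client.messages.create(
        model="claude-sonnet-4-5",   # assumed grader model
        max_tokens=512,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\nDRAFT:\n{draft}"}],
    )
    return reply.content[0].text

def run_with_outcomes(task: str, max_passes: int = 3) -> str:
    """Bounded lead-agent loop: draft, get structured grader feedback, revise."""
    draft, feedback = "", ""
    for _ in range(max_passes):
        prompt = task if not feedback else (
            f"{task}\n\nPrevious draft:\n{draft}\n\n"
            f"Grader feedback:\n{feedback}\n\n"
            "Revise the draft to address every unmet criterion."
        )
        reply = client.messages.create(
            model="claude-sonnet-4-5",   # assumed lead-agent model
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        draft = reply.content[0].text
        feedback = grade(draft)
        if feedback.strip().upper().startswith("PASS"):
            break   # grader accepted; stop re-attempting
    return draft
```

The separation is the point: grade() never sees the lead agent's chain of reasoning, and the lead agent only ever sees the grader's structured feedback.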

Three things follow from that, and most of them haven't sunk in yet:

The rubric is now the load-bearing artifact. Before Outcomes, most teams' "agent quality" came from prompt-engineering the lead agent harder. After Outcomes, agent quality comes from the rubric the grader applies. That changes who needs to write the rubric — it's a senior practitioner's job, not a prompt engineer's — and it changes how the rubric is reviewed (versioned, code-reviewed, regression-tested), and it changes the procurement question ("who authors our rubrics" becomes the new "which model do we use").

Eval and runtime collapse into one surface. The grader that runs inside the Outcomes loop is the same artifact you'd use to run an offline eval against a candidate model. Teams that had separate "rubric for evals" and "rubric for production" stop having that split: same rubric, used twice (see the sketch after these three points), and the drift between them goes to zero. That's a win even before counting the production-quality gain.

Sycophancy and reward gaming get harder. A grader sitting in a separate context, scoring against explicit criteria written by a senior, is meaningfully harder for an agent to talk its way past than a single in-loop reflection step. It is not impossible — graders can still be gamed — but the gameable surface area shrinks, and that matters.
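
The sketch referenced above: the same grader used inside the production loop, pointed offline at a golden task set to score a candidate setup before promotion. It assumes the earlier loop sketch is saved as outcomes_loop.py and that golden tasks live in a JSONL file with a "task" field; both are illustrative assumptions, not a shipped eval harness.

```python
# Offline reuse of the production grader: score a candidate setup against a
# golden task set with the exact rubric the Outcomes loop applies at runtime.
# Assumes the previous sketch is saved as outcomes_loop.py; file name and
# JSONL shape are illustrative.
import json

from outcomes_loop import grade, run_with_outcomes

def offline_pass_rate(golden_path: str = "golden_tasks.jsonl") -> float:
    tasks = [json.loads(line) for line in open(golden_path)]
    passes = 0
    for item in tasks:
        # One pass only: this evaluates the model, not the revision loop.
        draft = run_with_outcomes(item["task"], max_passes=1)
        if grade(draft).strip().upper().startswith("PASS"):
            passes += 1
    # Same rubric in eval and in production, so there is nothing to drift.
    return passes / len(tasks)

if __name__ == "__main__":
    print(f"pass rate: {offline_pass_rate():.0%}")
```

Because the rubric is the same object in both places, a change to it shows up in the offline pass rate before it ever ships.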

What Dreaming actually solves

Dreaming is positioned as memory, but the more accurate framing is session-level pattern extraction. Without it, every agent session starts cold against the same long-term store: the agent re-discovers the same anti-patterns, makes the same mistakes that someone made last week, fails to notice that the team has standardized on a workflow it keeps rebuilding from scratch.

A scheduled process that reviews completed sessions and memory stores, pulls out the recurring failure shapes and the converged workflows, and curates that back into the agent's accessible memory is the kind of work that, until this week, every serious team was building by hand: log scraping, embedding clustering, manual rubric review of failed runs, somebody quietly editing the system prompt every two weeks to inject the lessons. Anthropic just took ownership of that loop.

The caveat that goes with research-preview status: anything that automatically promotes patterns from past sessions into the agent's working memory is a system that can amplify a bad pattern as readily as a good one. If the agent learned a workaround in session 47 that gets the right answer for the wrong reason, Dreaming may consolidate that workaround into a memory and propagate it. The honest posture for the next two quarters is: let it run, audit what gets consolidated, treat the dream output the same way you'd treat a junior engineer's PR — useful, frequently right, never auto-merged.
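
One way to operationalize "never auto-merged" is to route whatever Dreaming consolidates into a pending queue and require a human accept or reject before anything enters the memory the agent actually reads. The directory layout and JSON shape below are our assumptions about how you might stage consolidations for review; they are not Dreaming's actual storage format.

```python
# A minimal human-in-the-loop gate for consolidated memories: staged output
# lands in pending/, a reviewer accepts or rejects each item, and only
# accepted items move to approved/, where the agent's memory loader reads them.
# Directory names and the JSON fields are illustrative assumptions.
import json
import pathlib
import shutil

PENDING = pathlib.Path("memory/pending")
APPROVED = pathlib.Path("memory/approved")
REJECTED = pathlib.Path("memory/rejected")

def review_pending() -> None:
    APPROVED.mkdir(parents=True, exist_ok=True)
    REJECTED.mkdir(parents=True, exist_ok=True)
    for path in sorted(PENDING.glob("*.json")):
        item = json.loads(path.read_text())
        print(f"\n--- {path.name} ---")
        print("pattern: ", item.get("pattern", "<missing>"))
        print("evidence:", item.get("source_sessions", []))
        verdict = input("accept this consolidation? [y/N] ").strip().lower()
        dest = APPROVED if verdict == "y" else REJECTED
        # Nothing reaches the agent's working memory without a reviewer's verdict.
        shutil.move(str(path), str(dest / path.name))

if __name__ == "__main__":
    review_pending()
```

The accept/reject ratio a gate like this produces is also the number the pilot in the checklist below asks you to track.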

Multiagent orchestration and the new shape of the harness

Multiagent Orchestration is the least surprising of the three — every senior team has been writing this loop manually for a year. The interesting thing isn't that it exists; it's that it now ships with a few specific operational properties that hand-rolled versions usually skipped:

Specialists run in parallel, not in series. A lead agent that dispatches three specialists and waits for all three is fundamentally different from a coordinator that calls one specialist, gets a response, decides what to do next, and then calls another. Parallel changes the latency profile, and with it the economics of thoroughness: three subagent calls in parallel cost the same tokens as three in serial, but take roughly a third of the wall-clock time, which makes "thorough" an economically rational default instead of a luxury.

Specialists share a filesystem, not just a context. This matters more than it sounds. A specialist that writes its findings to a file the lead agent can read is operationally cleaner than one that has to stuff its full output back into the coordinator's context window. Token budgets stop being the bottleneck for fan-out width; the bottleneck becomes the file structure the team designs for inter-agent communication.

Each specialist gets its own model. A research subagent on a smaller, cheaper model. A code-writing subagent on Opus. A red-team subagent on a different vendor entirely. The orchestration layer is now where multi-model routing lives at the agent level, not just at the request level. That's the same structural shift we wrote about with the DeepSeek V4 release — now it's inside every Managed Agent by default.
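
For comparison with what the managed version absorbs, here is a hand-rolled sketch of that orchestration shape: specialists dispatched in parallel, each on its own model, each writing findings to its own file on a shared scratch directory that the lead agent then reads. It uses the public async Anthropic SDK; the specialist roles, model IDs, and file layout are illustrative assumptions, not the Managed Agents orchestration API.

```python
# Hand-rolled sketch of the orchestration shape: parallel specialists, each on
# its own model, each writing to its own file in a shared scratch directory.
# Specialist roles, model IDs, and the file layout are illustrative assumptions.
import asyncio
import pathlib

import anthropic

client = anthropic.AsyncAnthropic()
SCRATCH = pathlib.Path("scratch")

SPECIALISTS = {
    "research": {"model": "claude-haiku-4-5", "prompt": "Collect the relevant background for: "},
    "drafting": {"model": "claude-opus-4-1", "prompt": "Write a first implementation plan for: "},
    "red_team": {"model": "claude-sonnet-4-5", "prompt": "List the ways this task could go wrong: "},
}

async def run_specialist(name: str, task: str) -> pathlib.Path:
    spec = SPECIALISTS[name]
    reply = await client.messages.create(
        model=spec["model"],
        max_tokens=1024,
        messages=[{"role": "user", "content": spec["prompt"] + task}],
    )
    out = SCRATCH / f"{name}.md"             # findings go to the filesystem,
    out.write_text(reply.content[0].text)    # not back into the lead agent's context
    return out

async def orchestrate(task: str) -> str:
    SCRATCH.mkdir(exist_ok=True)
    # Fan out: all specialists run concurrently, so wall-clock time is roughly
    # the slowest specialist rather than the sum of all three.
    files = await asyncio.gather(*(run_specialist(n, task) for n in SPECIALISTS))
    combined = "\n\n".join(f"## {f.stem}\n{f.read_text()}" for f in files)
    reply = await client.messages.create(
        model="claude-opus-4-1",              # assumed lead-agent model
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": f"Task: {task}\n\nSpecialist findings:\n{combined}\n\nProduce the final answer."}],
    )
    return reply.content[0].text

if __name__ == "__main__":
    print(asyncio.run(orchestrate("migrate the billing service to the new event schema")))
```

The write to scratch files rather than back into the coordinator's context is exactly the shared-filesystem property described above.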

Where we'd push back on the launch narrative

Two gaps worth flagging.

Outcomes only works as well as the rubric. Anthropic is shipping infrastructure; the rubric is still on the customer. A team that flips Outcomes on with a rubric that was written for a different workload, or that captures the wrong success criteria, will get faster-converging agents that reliably produce the wrong answer. The cost of a mediocre rubric just went up — it gets applied on every run, not just once during eval. Treat rubric authorship as a senior-practitioner role, treat the rubric as a versioned production artifact, and run the same regression tests against the rubric that you'd run against the agent.

Multiagent fan-out increases blast radius, not just throughput. Three subagents in parallel mean three sets of tool calls, three sets of side effects, three sets of audit-trail entries. Without explicit capability-scoping per subagent (this one can read but not write, this one can call MCP server X but not Y, this one writes only to its own scratch directory), the convenience of orchestration becomes a security and observability problem. Build the capability-scoping model before you build the third specialist subagent, not after.
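
A concrete way to hold that line is to make the scope a declared artifact that the tool dispatcher enforces, rather than a convention buried in the prompt. The sketch below is one possible shape; the field names, tool names, and MCP server labels are hypothetical, not a Managed Agents configuration schema.

```python
# One possible shape for per-subagent capability scoping: the scope is a
# declared artifact and the dispatcher checks it before any call runs.
# Field names, tool names, and MCP server labels are hypothetical.
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class SubagentScope:
    name: str
    allowed_tools: frozenset[str]          # e.g. read but not write tools
    allowed_mcp_servers: frozenset[str]    # e.g. server X but not server Y
    scratch_dir: Path                      # the only place this subagent may write

    def check_tool(self, tool: str) -> None:
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not call {tool}")

    def check_write(self, path: Path) -> None:
        if self.scratch_dir.resolve() not in path.resolve().parents:
            raise PermissionError(f"{self.name} may only write inside {self.scratch_dir}")

# Example: the research specialist can read and search, never writes outside
# its own scratch directory, and never touches the payments MCP server.
research_scope = SubagentScope(
    name="research",
    allowed_tools=frozenset({"read_file", "search_docs"}),
    allowed_mcp_servers=frozenset({"jira"}),
    scratch_dir=Path("scratch/research"),
)
```

Declaring the scope per specialist before wiring the third subagent also keeps the audit trail legible: every blocked call is an explicit PermissionError you can log, not a silent side effect.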

What we'd build differently this week

  • Stand up Outcomes on one workflow with a senior-authored rubric. Pick a workflow where you already know what "good" looks like (a code-review agent, a customer-email drafter, a deal-memo summarizer). Get the senior practitioner who would grade junior work to write the rubric. Versioned, code-reviewed, in your repo. The rubric — not the agent — is the artifact you're investing in.
  • Pilot Dreaming on one team with explicit audit gates. Let it run. Read every consolidated memory. Reject the ones that promote a workaround over a fix. Track the rate of useful consolidations vs noise. Decide rollout scope based on that ratio, not on the launch demo.
  • Refactor one custom agent to use Multiagent Orchestration with capability scoping. Pick the agent currently doing five things in serial. Decompose to three specialists, each with the narrowest tool surface that gets the job done. Measure latency, cost, and audit-trail quality before and after. The shape of the answer informs the next ten agents you build.
  • Author one shared rubric library in your repo. Same way you'd treat a shared prompt library, but versioned at a finer grain. Each rubric is a YAML or markdown artifact with a CHANGELOG, an owner, and an eval suite that catches when a rubric change breaks downstream behavior.
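
For concreteness, here is one possible shape for such an artifact, shown as a YAML document inside a small Python loader with the minimal checks a CI step might run before the rubric is allowed into production. The field names are our assumptions about what owner, changelog, and eval suite look like in practice, not a prescribed schema.

```python
# One possible shape for a versioned rubric artifact, plus minimal CI checks.
# Field names and file paths are illustrative assumptions, not a prescribed
# Managed Agents schema. Requires PyYAML.
import yaml

RUBRIC_ARTIFACT = """
id: deal-memo-summarizer
version: 3
owner: jane.doe@example.com                # the senior practitioner accountable for grading
changelog:
  - "v3: tightened the sourcing criterion after the Q2 regression"
eval_suite: evals/deal_memo_golden.jsonl   # golden set rerun on every rubric change
criteria:
  - every figure is traceable to a cited document
  - the recommendation section states a position, not a hedge
  - the memo is under 800 words
"""

def validate(raw: str) -> dict:
    rubric = yaml.safe_load(raw)
    for required in ("id", "version", "owner", "changelog", "eval_suite", "criteria"):
        if required not in rubric:
            raise ValueError(f"rubric is missing required field: {required}")
    if not rubric["criteria"]:
        raise ValueError("rubric has no criteria; the grader would pass everything")
    return rubric

if __name__ == "__main__":
    print(validate(RUBRIC_ARTIFACT)["id"])
```

The eval_suite field is what makes rubric changes regression-testable: CI reruns the golden set whenever the artifact's version bumps.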

Sonnet Code's take

Claude Managed Agents stopped being a chat surface with tools and started being a stack with three legs: a runtime that fans out, a memory that consolidates, and a rubric-graded loop that closes on "good" instead of on "the model said it was done." The teams that win this cycle aren't the ones who flipped the features on first; they're the ones with the senior practitioners ready to author the rubrics, the platform engineering ready to scope the subagent capabilities, and the audit discipline to keep Dreaming from amplifying yesterday's workaround.

We staff that work directly: AI training at Sonnet Code means senior domain reviewers (engineers, clinicians, financial analysts, lawyers, depending on the workload) authoring the Outcomes rubrics, the golden examples for Dreaming to consolidate against, and the red-team prompts that stress-test the multiagent orchestration. We pair it with AI development engagements that wire the rubrics into your Managed Agents config, scope the subagent capabilities, and stand up the audit surface that tells you whether the dreams consolidated last week made the agent better or only louder. If your team flipped Outcomes on yesterday and is now wondering who should write the rubric, the next conversation isn't about the feature flag. It's about the practitioner whose grading you'd actually defend in front of an auditor.