What Microsoft actually shipped
At Microsoft Build 2026 on June 2, Microsoft introduced Frontier Tuning — a managed post-training and continuous-improvement system positioned as the enterprise-grade analog of the reinforcement-learning loops that produce the capability gains inside the frontier labs. The framing in Microsoft's announcement is deliberate: Frontier Tuning is not pitched as another supervised fine-tuning service; it is pitched as the reinforcement-learning environment that has lived only inside the frontier labs, packaged for enterprise use, operating inside the customer's compliance boundary.
The operational shape, summarized from Microsoft AI's own announcement and the practitioner write-ups in the 24 hours after the keynote:
- Three parts that work together: a managed Reinforcement Learning Environment (RLE) where learning happens; the unique inputs the customer provides from its own workflows, processes, and conventions; and the tuned output models, skills, and harness that the system produces.
- Training signal is the trace of work, not a labeled dataset. The system learns from the sequence of tool calls the agent made, the decisions a human reviewer applied, the corrections that landed, and the eventual outcomes — both successful and not.
- The compliance boundary is the customer's. Training data, eval signals, model weights at intermediate post-training checkpoints, and the final tuned artifact all stay inside the customer-controlled environment that the RLE runs inside.
- The RLE is the same environment used at inference. The model running in production is the model being continuously refined; the post-training is not a quarterly batch process running on a copy.
- A reference internal-Microsoft HR deployment reported successful task completion rising from 13% to 87% over the post-training run, a step-change that is hard to achieve with supervised fine-tuning on static labeled data and is closer to the kind of capability gain frontier labs report from RL post-training.
- Private preview availability through Forward Deployed Engineers, with broader availability planned in Microsoft Copilot Studio and Microsoft Foundry later in the quarter.
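To make "the trace of work" concrete, here is a minimal sketch of what a single trace record could contain. The announcement summarized above does not publish a trace schema, so every class and field name here (ToolCall, ReviewDecision, WorkTrace, is_trainable) is a hypothetical illustration rather than the Frontier Tuning format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ReviewDecision(Enum):
    """What the human reviewer did with the agent's proposal (hypothetical)."""
    APPROVED = "approved"
    MODIFIED = "modified"
    REJECTED = "rejected"


@dataclass
class ToolCall:
    tool: str
    arguments: dict
    result_summary: str


@dataclass
class WorkTrace:
    """One unit of real work: what the agent did and how it turned out."""
    task_id: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    review: Optional[ReviewDecision] = None
    correction: Optional[str] = None   # reviewer's corrected output, if any
    succeeded: Optional[bool] = None   # None until the outcome is known

    def is_trainable(self) -> bool:
        # A trace carries signal once its outcome is known, whether the
        # outcome was a success or a failure.
        return self.succeeded is not None and len(self.tool_calls) > 0


trace = WorkTrace(task_id="hr-0001")
trace.tool_calls.append(
    ToolCall("lookup_policy", {"topic": "parental leave"}, "found policy")
)
trace.review = ReviewDecision.MODIFIED
trace.correction = "Cite the regional policy, not the global one."
trace.succeeded = False
```

Note that the failed, corrected task above is still a usable record: the point of learning from traces is that unsuccessful outcomes are signal too.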
The headline framing ("teaching AI to work the way your business does") is correct as a sales line and slightly misleading as a description of what the system is doing. It does not load the model with knowledge about the business; it uses the workflow itself as the training environment, with the human-judgment signal embedded in the corrections and outcomes that flow through the workflow.
Why "RL on the trace of real work" is structurally different from supervised fine-tuning on a labeled dataset
Supervised fine-tuning is what every cloud vendor has been selling under the enterprise customization banner since 2023. The pattern is well understood: the customer curates a dataset of examples that demonstrate the desired model behavior, the vendor runs a fine-tuning job, the resulting model is deployed, and the customer hopes that the production workload distribution matches the curated dataset closely enough that the fine-tuning lift survives contact with reality.
The enterprises that have actually deployed supervised fine-tuning at scale know the failure mode. The curated dataset captures a snapshot of how the workflow worked at the moment the dataset was curated. The workflow then drifts: new tools come online, new processes get standardized, new exception cases enter the queue, and the eval signal that determines what "good" means shifts subtly. The fine-tuned model's performance decays over weeks rather than holding for quarters. The remediation is to curate a new dataset and rerun the job, a multi-week cycle that costs senior ML engineering time and domain-expert review time, and that does not fully close the gap because the model that ships always lags the workflow it is supposed to support.
Reinforcement learning on the trace of real work is structurally different in three ways that matter for production deployment.
The training signal is the workflow, not a snapshot of the workflow. The model is learning from the actual sequence of tool calls, decisions, and outcomes that flow through production. The drift problem doesn't compound the same way, because the model is continuously absorbing the drift as it happens, rather than ratcheting against a stale dataset.
The reward function is the eval signal, not the loss against a label. The signal that drives the post-training is the same signal that the customer uses to grade success in production — task completion, user acceptance, error rates against the production guardrails, escalation patterns from the senior-review queue. The model is being optimized for the metric the business already measures, not for a proxy metric that approximates it.
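As a rough illustration of "the reward function is the eval signal", the sketch below folds the four production signals named above into one scalar reward. The field names, the linear form, and especially the weights are assumptions made for this example; nothing here is taken from Microsoft's system, and a real program would set the weights from its own rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeSignals:
    """Production signals for one completed task (hypothetical fields)."""
    task_completed: bool
    user_accepted: bool
    guardrail_violations: int
    escalated: bool


def reward(s: EpisodeSignals,
           w_complete: float = 1.0,
           w_accept: float = 0.5,
           w_violation: float = 0.75,
           w_escalation: float = 0.25) -> float:
    """Combine business metrics into a scalar reward (weights are illustrative)."""
    score = 0.0
    if s.task_completed:
        score += w_complete
    if s.user_accepted:
        score += w_accept
    # Each guardrail violation is penalized, so violations accumulate.
    score -= w_violation * s.guardrail_violations
    if s.escalated:
        score -= w_escalation
    return score


clean = EpisodeSignals(task_completed=True, user_accepted=True,
                       guardrail_violations=0, escalated=False)
messy = EpisodeSignals(task_completed=False, user_accepted=False,
                       guardrail_violations=2, escalated=True)
print(reward(clean), reward(messy))  # 1.5 -1.75
```

The design point is that every term is something the business already measures in production, so there is no separate proxy metric to drift away from the real one.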
The capability gains compound rather than plateau. Supervised fine-tuning on a fixed dataset has a capability ceiling — once the model is performing well against the dataset, additional training adds little. Reinforcement learning on a continuously updating workflow has a much higher ceiling, because every new failure mode the workflow encounters is a new training opportunity, every corrected output is a new positive example, every escalation to the senior-review queue is a new high-value signal. The 13%-to-87% Microsoft internal-HR result is, in shape, exactly the kind of step-change that RL produces and supervised fine-tuning struggles to replicate.
What this means for the human-in-the-loop training data conversation
Four shifts that follow when the post-training loop is RL-on-workflow rather than SFT-on-dataset.
The role of the human shifts from labeler to judge. Supervised fine-tuning needs a lot of labelers writing a lot of examples. Reinforcement learning on the workflow needs a smaller number of senior domain experts whose job is to judge, correct, and escalate, and in doing so to be the source of the high-quality reward signal that the RL loop can amplify into broader model capability. The throughput per senior expert goes up; the number of junior labelers needed goes down; the cost per unit of capability gained shifts in favor of fewer, higher-paid senior judges over more, lower-paid junior labelers. That is a different headcount mix than the one most enterprise AI training programs are currently staffed for.
The eval rubric becomes the most valuable asset in the program. The reward function is the eval signal. The eval signal is defined by the rubric. The rubric is the artifact that decides which trajectories get reinforced and which get penalized. If the rubric is poorly authored, the model learns to game it; if the rubric is well authored, the model learns to satisfy the underlying business intent. The rubric-authoring discipline — what is good, what is bad, what is a borderline case, how do we calibrate multiple expert judges to agree — moves from a side activity that most enterprise AI training programs deprioritize to the central engineering deliverable of the program.
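One common way to check whether two expert judges are calibrated against the same rubric is Cohen's kappa, which measures agreement beyond what chance alone would produce. The sketch below is a plain-Python version under that assumption; the article does not say which agreement statistic any particular program uses, and the labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(judge_a: list[str], judge_b: list[str]) -> float:
    """Cohen's kappa for two judges labeling the same cases."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge_a)
    # Fraction of cases where the two judges gave the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Agreement expected by chance, from each judge's label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


a = ["good", "good", "bad", "bad"]
b = ["good", "bad", "bad", "bad"]
print(cohens_kappa(a, b))  # 0.5
```

In this example the judges agree on three of four cases (0.75 observed agreement) but half of that is expected by chance, which is why the kappa is only 0.5. A rubric whose borderline cases are well specified should push this number up as judges are calibrated.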
The senior-review queue becomes a high-leverage data source. Every case that the senior-review queue resolves is a piece of training data the RL loop can use — the model proposed an action, the senior reviewer approved or modified or rejected it, the modified or rejected version became the high-value reward signal. The senior-review queue is no longer just a safety mechanism; it is a continuous post-training data source. The teams that staff the queue with senior judges whose corrections are also good training data (because they explain the reasoning, because the rubric is shared, because the calibration discipline is in place) compound a meaningful capability advantage. The teams that staff the queue with reviewers whose only job is to approve or reject get the safety benefit but not the post-training benefit.
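The difference between an approve-or-reject queue and a queue that produces post-training signal can be sketched as a mapping from review decisions to training records. The record shapes below ("positive", "preference", "negative") and all names are hypothetical; they illustrate the idea and are not a documented Frontier Tuning format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedCase:
    prompt: str
    proposed: str                    # what the model proposed
    decision: str                    # "approved", "modified", or "rejected"
    corrected: Optional[str] = None  # reviewer's rewrite, if any
    rationale: Optional[str] = None  # reviewer's explanation, if any


def to_training_signal(case: ReviewedCase) -> dict:
    """Turn one senior-review decision into a training record."""
    if case.decision not in ("approved", "modified", "rejected"):
        raise ValueError(f"unknown decision: {case.decision!r}")
    if case.decision == "approved":
        return {"kind": "positive", "prompt": case.prompt,
                "output": case.proposed}
    if case.decision == "modified" and case.corrected:
        # A correction yields a preference pair: the reviewer's version is
        # preferred over the model's, with the reasoning attached.
        return {"kind": "preference", "prompt": case.prompt,
                "chosen": case.corrected, "rejected": case.proposed,
                "rationale": case.rationale}
    # Rejections, and modifications with no recorded correction, only say
    # what not to do.
    return {"kind": "negative", "prompt": case.prompt,
            "output": case.proposed, "rationale": case.rationale}


rich = to_training_signal(ReviewedCase(
    prompt="Draft the leave-approval reply",
    proposed="Approved for 10 days.",
    decision="modified",
    corrected="Approved for 10 days, pending manager sign-off.",
    rationale="Policy requires manager sign-off above 5 days.",
))
thin = to_training_signal(ReviewedCase(
    prompt="Draft the leave-approval reply",
    proposed="Approved for 10 days.",
    decision="rejected",
))
```

The "rich" record carries a preferred output and a reason; the "thin" record carries only a rejection. Both provide the safety benefit, but only the first gives the loop something specific to learn toward.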
The continuous-improvement story changes the procurement framing. A supervised-fine-tuning offering is sold once, deployed once, and decays. A reinforcement-learning-on-workflow offering is sold continuously, deployed continuously, and improves. That has implications for the contract shape (per-token inference plus per-RL-step training, not flat-fee fine-tune-and-deploy), for the operational model (continuous platform engagement, not project-style hand-off), and for the budget-cycle conversation (annual operating expense for the platform plus the senior-judge pool, not capital outlay for a fine-tuning job). The teams that get the procurement framing right early avoid the awkward conversation where finance discovers that the supposed fine-tuning project is in fact a permanent operational dependency.
What this does not change
Three honest caveats.
It does not eliminate the cold-start problem. Frontier Tuning improves the model by learning from the workflow trace. If the workflow trace doesn't yet exist — because the agent is being deployed into a new domain, or because the existing workflow doesn't generate the right signal — the RL loop has nothing to amplify. The first 90 days of any Frontier Tuning deployment look very similar to a supervised-fine-tuning deployment: a small bootstrap dataset, a thin set of human-judged examples, and a model that performs near baseline. The capability gains compound only after the workflow trace volume crosses a meaningful threshold. The procurement team that signs expecting 87% completion in week two will be disappointed; the team that signs expecting 87% completion in quarter two will be on track.
It does not collapse the rubric-authoring labor. The training signal is the rubric. A bad rubric produces a model that is confidently wrong in ways the auto-grader rewards. A good rubric requires senior domain experts whose time is expensive, whose calibration discipline is mature, and whose authorial output is reviewed against multiple-judge agreement before it becomes the reward function. The platform vendor cannot supply this; the customer has to. The teams that treat rubric authoring as the highest-leverage activity in the program ship the capability gain that justifies the platform; the teams that treat it as paperwork ship a model that has been trained to satisfy a rubric that didn't capture the business intent.
It does not eliminate the multi-vendor portability question. Frontier Tuning runs inside Microsoft Foundry, on top of MAI-family or peer-frontier models. The post-trained artifact lives inside the Microsoft platform. The training signal — the workflow trace, the rubric, the senior-judge labels — is the customer's; the post-trained model is platform-coupled. The portability story has to be designed in: which signals are owned by the customer in a portable representation, which artifacts can be re-derived on a different platform if the platform contract changes, which platform-specific dependencies are acceptable and which are not. The customer that signs without thinking about this gets a meaningful capability gain that is also a meaningful platform lock-in.
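One way to design the portability in is to keep the customer-owned signal in a platform-neutral file format from day one. The sketch below writes records as JSON Lines and keeps only an allow-list of customer-owned fields. The field names, including the platform_checkpoint_id that gets dropped, are assumptions for illustration and do not describe any real Foundry export.

```python
import io
import json
from typing import Iterable, TextIO

# Customer-owned fields that should survive a platform change (hypothetical).
PORTABLE_FIELDS = ("task_id", "tool_calls", "rubric_version",
                   "judge_label", "outcome")


def export_portable(records: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line, keeping only the portable fields."""
    written = 0
    for record in records:
        missing = [f for f in PORTABLE_FIELDS if f not in record]
        if missing:
            raise ValueError(
                f"record {record.get('task_id')!r} is missing {missing}"
            )
        # Platform-specific keys are dropped by the allow-list.
        portable = {f: record[f] for f in PORTABLE_FIELDS}
        out.write(json.dumps(portable, sort_keys=True) + "\n")
        written += 1
    return written


buffer = io.StringIO()
count = export_portable([{
    "task_id": "hr-0001",
    "tool_calls": [{"tool": "lookup_policy", "arguments": {"topic": "leave"}}],
    "rubric_version": "2026-06-v1",
    "judge_label": "modified",
    "outcome": "failed",
    "platform_checkpoint_id": "opaque-internal-id",  # not portable, dropped
}], buffer)
```

An allow-list rather than a deny-list is the deliberate choice here: a new platform-specific field added later stays out of the portable export unless someone decides it belongs to the customer.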
Where Sonnet Code fits
A managed reinforcement-learning environment inside the customer's compliance boundary is the easy half of the story. The hard half is the engineering and human-judgment work that turns "the platform is available" into "the capability gain has compounded, and the senior-judge pool is calibrated to keep compounding it": the rubric authoring with multi-judge calibration, the senior-review queue restructured to produce high-quality post-training signal rather than just approve-or-reject decisions, the bootstrap-dataset curation that gets the cold start to the RL-can-amplify threshold faster, the workflow-trace instrumentation that captures the right signal at the right granularity, and the eval-harness extension that grades the post-trained artifact honestly against held-out gold sets. AI training at Sonnet Code is precisely that work: senior engineers and domain experts who design the rubrics, calibrate the judge pool, author the gold sets, and run the senior-review queue as a continuous post-training data source, staffed with the kind of senior judgment that compounds with the RL loop instead of just feeding it noise. AI development is the platform-engineering half: instrumenting the workflow trace, wiring the RLE into the production observability surface, building the post-training-attribution dashboards that surface which classes of work compound capability fastest, and structuring the portability layer so the customer-owned signal is not platform-locked even when the model artifact is.
The enterprise fine-tuning conversation just stopped being about static datasets and started being about workflows. The teams that walk into Q3 with the rubric discipline mature, the senior-judge pool calibrated, the bootstrap-dataset curation underway, and the workflow trace instrumented at the right granularity are the teams that turn Frontier Tuning into a real, compounding, defensible capability advantage. The teams that defer the rubric work will discover that the platform is doing exactly what it was designed to do — amplifying whatever signal it gets — and that the signal they fed it does not produce the model they wanted.

