GitHub Copilot Just Replaced Premium Requests With Per-Token AI Credits on June 1 — Developers Watched 10x to 100x Cost Swings Land on Their Dashboards on Day One, 82% of a Monthly Allowance Burned in a Single Editor Session, and Every Coding-Agent Buyer's Procurement Conversation Just Acquired a Per-Token Accountability Layer — The FinOps Discipline That Has Been a 'We'll Look at It Next Quarter' Item Is Now a Live Monitoring Surface, and the Engineering Team's Senior Judgment on Routing Just Got the Receipts to Match.

What GitHub flipped on June 1 and the cost surface that lands with it

On June 1, 2026, GitHub flipped every Copilot plan from the prior premium-request allowance to usage-based billing through AI Credits — 1 credit = $0.01, with input, output, and cached tokens all metered against the published per-model API rates. Annual-plan holders stay on the legacy request-based allowance until renewal. Monthly-plan holders migrated automatically, and the day-one developer reports were sharp.

The operationally important pieces:

The pricing skeleton stays familiar; the meter underneath changes. Copilot Pro is still $10/month but now ships with $15 in credits; Pro+ is $39 with $70 in credits; Business stays at $19 with $19 in credits; Enterprise stays at $39 with $39 in credits; the new Max tier lands at $100 with $200 in credits. The seat-license number on the procurement spreadsheet is unchanged. The cost the seat actually generates is not.
The meter respects the model choice. A turn routed against GPT-5.5 burns at GPT-5.5's per-token rate; a turn against Claude Opus 4.8 burns at Claude's rate; a turn against the smaller-and-cheaper model class burns at the cheaper rate. The IDE vendor stopped subsidizing the model selection. The cost of which model the developer reached for is now visible to the buyer for the first time.
The day-one swings were not subtle. Multiple practitioner reports documented 10x to 100x cost swings on heavy workloads against the prior request-based allowance. The widely-shared screenshot of an engineer watching 82% of a monthly allowance evaporate in a single editor session was not the outlier the vendor's pricing FAQ implied — it was the long tail of every team that had been quietly running the most expensive flagship behind every tab-complete keystroke and had been shielded from the math by the request abstraction.
The annual-plan carve-out is the buyer's last quiet window. A team on annual Pro or Pro+ stays on the request allowance until plan expiration, which means the procurement team has somewhere between 30 days and 11 months to land the cost-discipline work before the meter starts everywhere. The team that uses the window is the team that walks into the next renewal with the data; the team that defers it discovers the meter the same way the monthly cohort did.
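The metering rule described above (1 credit = $0.01, with input, output, and cached tokens billed at the chosen model's rate) can be sketched as a small calculation. The model labels and per-million-token rates below are hypothetical placeholders, not GitHub's or any vendor's published prices; only the credit conversion comes from the text.

```python
from dataclasses import dataclass

CREDIT_USD = 0.01  # from the article: 1 AI Credit = $0.01


@dataclass(frozen=True)
class Rate:
    """USD per one million tokens. Every value used below is hypothetical."""
    input: float
    output: float
    cached: float


# Hypothetical rates for illustration; substitute the published per-model rates.
RATES = {
    "flagship": Rate(input=10.0, output=40.0, cached=1.0),
    "small": Rate(input=0.5, output=2.0, cached=0.05),
}


def turn_credits(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Return the AI Credits one turn would consume under the assumed rates."""
    r = RATES[model]
    usd = (input_tokens * r.input
           + output_tokens * r.output
           + cached_tokens * r.cached) / 1_000_000
    return usd / CREDIT_USD


if __name__ == "__main__":
    for model in RATES:
        print(model, round(turn_credits(model, 20_000, 2_000, 50_000), 2))
```

Under these assumed rates the same turn costs 20 times more on the flagship than on the small model, which is the kind of gap the request abstraction used to hide.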

The structural read isn't that the IDE vendor raised prices. It's that the cost surface of coding-agent usage, which has been a flat, predictable, $19/seat line item on the procurement spreadsheet, just became a per-token accountability layer the engineering team has to monitor like any other production cost. The IDE vendor moved the cost from something the vendor absorbs to something the buyer measures, which is the same structural move the cloud vendors made on egress, on storage tiers, and on inference itself over the last decade. The procurement team that treats it as a one-time price change misses what it actually is: the day the coding-agent stack acquired a real FinOps surface.

What the meter restructures about coding-agent procurement

Four concrete shifts that follow when the per-token cost surface becomes visible.

The model selection becomes a routing decision the team owns. Twelve months ago, the developer reached for the flagship by default because the vendor had no incentive to surface the cost. Now the flagship-versus-smaller-class decision shows up on the FinOps dashboard within minutes, decomposed per developer, per repository, per workload class. The team that authors a routing matrix from the cost data — inline completions against the smaller dense model, multi-file refactor against the flagship, architectural mapping against the long-context flagship, terminal automation against the open-weight class running on owned inference — gets to ship the same workload at a fraction of the prior cost. The team that lets every developer keep reaching for the flagship gets the 10x-100x swing the day-one screenshots warned about.
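A routing matrix of the kind described above can be as simple as a lookup table. The sketch below is one possible encoding; the workload names and model-class labels are placeholders taken from the paragraph, not product identifiers, and the cheap-class default for unclassified work is an assumption.

```python
from enum import Enum
from typing import Optional


class Workload(Enum):
    INLINE_COMPLETION = "inline_completion"
    MULTI_FILE_REFACTOR = "multi_file_refactor"
    ARCHITECTURE_MAPPING = "architecture_mapping"
    TERMINAL_AUTOMATION = "terminal_automation"


# Model-class labels are placeholders, not real model names.
ROUTING_MATRIX = {
    Workload.INLINE_COMPLETION: "small-dense",
    Workload.MULTI_FILE_REFACTOR: "flagship",
    Workload.ARCHITECTURE_MAPPING: "long-context-flagship",
    Workload.TERMINAL_AUTOMATION: "open-weight-self-hosted",
}

# Assumption: work that has not been classified falls back to the cheap class.
DEFAULT_MODEL_CLASS = "small-dense"


def route(workload: Optional[Workload]) -> str:
    """Return the model class for a workload, or the default if unclassified."""
    return ROUTING_MATRIX.get(workload, DEFAULT_MODEL_CLASS)


if __name__ == "__main__":
    for w in Workload:
        print(f"{w.value:22s} -> {route(w)}")
```

Keeping the matrix as data rather than scattered conditionals makes it easy to review, version, and update when the cost data changes.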

The procurement spreadsheet acquires a usage-per-developer axis. A seat license is a deterministic line item; a per-token meter is a usage distribution. The team's top-quartile developer can burn 5x to 20x what the median developer burns, depending on workload mix and routing habits. The procurement team that reports "we spent $X on Copilot last month" against the seat count is reporting the wrong number; the procurement team that reports the per-developer cost distribution, the per-workload-class cost distribution, and the routing decision that drives the 80/20 of the spend is reporting the number the CFO will eventually ask for. The dashboard discipline has to be built; the question is whether the engineering team builds it now or builds it under pressure two quarters from now.
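To make the "distribution, not line item" point concrete, here is a minimal sketch that summarizes per-developer spend. The developer names and dollar figures are invented for illustration, and the summary uses the top spender against the median as a simple stand-in for the top-quartile comparison.

```python
from statistics import median

# Hypothetical monthly spend per developer, in USD.
MONTHLY_SPEND_USD = {"ana": 10.0, "bo": 12.0, "cy": 100.0, "di": 8.0, "ed": 20.0}


def spend_summary(spend, top_fraction=0.2):
    """Summarize a spend distribution.

    Assumes non-empty data with a nonzero median.
    """
    values = sorted(spend.values(), reverse=True)
    total = sum(values)
    top_n = max(1, round(len(values) * top_fraction))
    mid = median(values)
    return {
        "total": total,
        "median": mid,
        "max_over_median": values[0] / mid,           # heaviest user vs. typical user
        "top_share": sum(values[:top_n]) / total,     # share of spend from the top slice
    }


if __name__ == "__main__":
    for name, value in spend_summary(MONTHLY_SPEND_USD).items():
        print(f"{name}: {value:.2f}")
```

With these made-up numbers, one developer out of five accounts for two thirds of the spend, which a seat-count report would never surface.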

The cost-per-successful-task replaces the cost-per-seat as the procurement metric. A coding agent that burns 3x the per-token cost but lands the merged PR in half the time at a higher first-pass acceptance rate is cheaper than the coding agent that burns less per token and produces lower-quality output the developer re-runs three times. The buyer who optimizes for the per-token number alone optimizes the wrong objective. The buyer who instruments the success-rate dashboard — did the PR merge on first review, did the test suite pass, did the senior reviewer accept the change — alongside the cost meter gets the cost-per-successful-task signal the routing decisions should grade against.
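The metric in the paragraph above reduces to total spend divided by the number of tasks that cleared the team's success bar. A minimal sketch follows; the per-attempt costs and success counts are hypothetical, and what counts as a "success" (first-review merge, passing tests, reviewer acceptance) is left to the team.

```python
def cost_per_successful_task(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by tasks that met the success bar (inf if none did)."""
    if successes <= 0:
        return float("inf")
    return total_cost_usd / successes


# Hypothetical numbers: 100 attempted tasks per agent.
# Agent A costs 3x more per attempt but succeeds far more often.
agent_a = cost_per_successful_task(total_cost_usd=100 * 0.30, successes=90)
agent_b = cost_per_successful_task(total_cost_usd=100 * 0.10, successes=25)

if __name__ == "__main__":
    print(f"agent A: ${agent_a:.3f} per successful task")
    print(f"agent B: ${agent_b:.3f} per successful task")
```

In this illustration the agent with the higher per-attempt cost is still cheaper per successful task, because fewer of its attempts are wasted.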

The negotiating position against every model vendor in the stack improves. A team whose routing data shows "we spent $X on Claude Opus 4.8, $Y on GPT-5.5, $Z on Gemini 3.5 Flash," with the workload-class breakdown attached, walks into the next renewal cycle with a different negotiating position than the team whose only data point is "we spent $W on Copilot seats." The model vendors know the routing data is the buyer's leverage; the contracts the buyer signs in the next twelve months will reflect that. The team that walks in with the per-workload-class success-rate-and-cost matrix in hand will get a meaningfully better contract than the team that walks in with the prior allowance-era posture.

Where the meter is signal and where it is noise

Four honest reads on what the per-token number actually tells the buyer.

Signal: the cost of reflexive flagship reach. The team that defaulted to the most expensive flagship on every turn has a cost surface that is now visible and addressable. The routing-matrix work — which model class lands on which workload — turns a meaningful share of the FinOps bill into recoverable savings without changing what the team ships. The savings are real.

Signal: the workload tail that has been hiding behind the request abstraction. A long-running terminal automation, a multi-file refactor that the agent re-runs eight times before landing the diff, an inline completion that the developer accepts and then immediately rewrites — these are workload patterns whose cost was masked by the request allowance. The meter exposes them, and the engineering team that grades the failure-mode tail honestly finds the work that needs to move to a different model class, a different prompt, or a different orchestration topology.

Noise: the meter does not measure code quality. A turn that costs $0.40 against the flagship and ships a clean, mergeable diff reads the same on the meter as four turns that cost $0.10 each against the cheaper model, but it is cheaper overall when those four turns produce a diff the developer has to rewrite by hand. The per-token number alone is not the procurement signal; the per-successful-task number is. The team that reads the meter as the objective to minimize will under-spend on the workloads where the flagship is the right answer and over-spend on the senior-review queue that catches the cheaper model's failure-mode tail.

Noise: the meter does not measure the senior-judgment workload behind the diff. The engineering hours the team spends on the senior-review queue, the rubric authoring, the eval-gold-set refresh, the routing-matrix calibration — none of that lands on the meter, and none of it is the cost the developer's keystroke generated. The buyer who reads the meter as the total cost of the coding-agent stack misses the human-judgment workload that turns the agent's output into something the team ships. The meter is the easy half of the cost picture.

What the team should do inside the first 90 days

Four concrete actions that close the gap between the meter and the discipline the meter requires.

Stand up the per-developer, per-workload-class cost dashboard. The data is in the GitHub billing API and the per-request telemetry. The dashboard cost is a few engineer-days. The procurement value compounds for as long as the meter exists. The team that ships the dashboard in the first 30 days walks into the rest of the cost-discipline work with the data; the team that defers it operates on anecdotes and reactive panic.
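The core of such a dashboard is a pivot of usage rows by developer and workload class. The sketch below works on in-memory rows; the field names are assumptions made for this example, not the schema of GitHub's billing API or any export format, and loading real data is left out.

```python
from collections import defaultdict

# Hypothetical telemetry rows. Field names are assumptions for this sketch.
ROWS = [
    {"developer": "ana", "workload": "inline", "credits": 120.0},
    {"developer": "ana", "workload": "refactor", "credits": 900.0},
    {"developer": "bo", "workload": "inline", "credits": 80.0},
    {"developer": "bo", "workload": "terminal", "credits": 400.0},
    {"developer": "ana", "workload": "inline", "credits": 30.0},
]


def pivot(rows, row_key="developer", col_key="workload", value_key="credits"):
    """Sum `value_key` into a nested dict: table[row][col] -> total."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_key]][row[col_key]] += row[value_key]
    return {r: dict(cols) for r, cols in table.items()}


def render(table, row_label="developer"):
    """Format the pivot as fixed-width text, one line per row key."""
    cols = sorted({c for cells in table.values() for c in cells})
    lines = [row_label.ljust(10) + "".join(c.rjust(10) for c in cols)]
    for r in sorted(table):
        cells = "".join(f"{table[r].get(c, 0.0):10.1f}" for c in cols)
        lines.append(r.ljust(10) + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(pivot(ROWS)))
```

Swapping `row_key` and `col_key` gives the per-workload-class view from the same rows, so one aggregation function can back both axes of the dashboard.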

Author the routing matrix from the team's own success-rate data, not from vendor positioning. The vendors' marketing surface tells a story about which model is best for which workload; the team's own success-rate-and-cost data tells the honest story for the team's workload distribution. The work is the gold-set authoring per workload class, the per-model grading against those gold sets, and the routing decision encoded from the result. The team that owns the routing matrix from the team's data is the team whose routing decisions hold up under audit and survive the next model release.
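One way to encode the routing decision from gold-set results, sketched under stated assumptions: each model is graded per workload class, models below a minimum pass rate are excluded, and the cheapest remaining model per expected passing result wins. The grades, costs, model labels, and the 0.8 threshold are all hypothetical.

```python
# Hypothetical grading results per workload class and model:
# (gold-set items passed, gold-set items total, mean USD per attempt).
GRADES = {
    "inline": {"small": (46, 50, 0.002), "flagship": (48, 50, 0.04)},
    "refactor": {"small": (20, 50, 0.03), "flagship": (44, 50, 0.40)},
}


def build_routing_matrix(grades, min_pass_rate=0.8):
    """Pick, per workload, the eligible model with the lowest cost per pass.

    A workload maps to None when no model clears the pass-rate bar.
    """
    matrix = {}
    for workload, models in grades.items():
        eligible = {}
        for model, (passed, total, cost) in models.items():
            if total > 0 and passed > 0 and passed / total >= min_pass_rate:
                # Expected USD per passing result: cost per attempt / pass rate.
                eligible[model] = cost * total / passed
        matrix[workload] = min(eligible, key=eligible.get) if eligible else None
    return matrix


if __name__ == "__main__":
    print(build_routing_matrix(GRADES))
```

Because the matrix is derived from the grades, rerunning the grading after a model release regenerates the routing decision instead of relying on anyone's memory of why a model was chosen.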

Wire the success-rate-and-cost metric into the engineering review cadence. The same review cadence the team uses for incidents, on-call load, and shipping velocity should now include a coding-agent slot: which workload classes are over the cost-quality threshold, which routing decisions are not landing, where the failure-mode tail is growing, and what model release in the last 30 days might change the matrix. The cadence keeps the discipline alive after the dashboard's new-thing shine wears off.
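The "over the cost-quality threshold" question for the review slot can be answered with a simple check per workload class. This is a sketch only; the thresholds and figures are placeholders a team would set from its own data.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    workload: str
    cost_usd: float   # spend in the review period
    tasks: int        # tasks attempted
    successes: int    # tasks that met the team's success bar


def review_flags(stats, max_cost_per_success, min_success_rate):
    """Return (workload, reasons) for each class that breaches a threshold."""
    flags = []
    for s in stats:
        rate = s.successes / s.tasks if s.tasks else 0.0
        cost_per_success = s.cost_usd / s.successes if s.successes else float("inf")
        reasons = []
        if rate < min_success_rate:
            reasons.append("success rate below threshold")
        if cost_per_success > max_cost_per_success:
            reasons.append("cost per success above threshold")
        if reasons:
            flags.append((s.workload, reasons))
    return flags


if __name__ == "__main__":
    period = [
        ClassStats("inline", 50.0, 1000, 900),
        ClassStats("refactor", 400.0, 100, 40),
    ]
    for workload, reasons in review_flags(period, 1.0, 0.7):
        print(workload, "->", "; ".join(reasons))
```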

Calibrate the senior-review queue for the cheaper-model-class failure modes. A team that shifts routing toward cheaper model classes to manage the meter is a team whose senior-review queue is now catching a different failure-mode shape than before. The rubrics that grade the queue have to be refreshed; the senior reviewers have to be calibrated against the new distribution; the gold sets the queue grades against have to reflect the routing change. The team that shifts routing without refreshing the queue gets the cost savings on the meter and the regression risk in production.

What this does not change

Three honest caveats.

The meter does not solve the procurement decision. GitHub Copilot is one node in a multi-vendor coding-agent stack. The cost-discipline work the meter forces on the Copilot surface is the same cost-discipline work the team owes the open-source coding-agent surface, the IDE-bundled surface, and the orchestration layer that routes across them. The meter is one input; the procurement discipline is the work on top.

The meter does not eliminate the model-vendor contract surface. Every model the routing matrix targets still has its own per-token rate, its own data-handling terms, its own SLA, and its own residency commitments. The meter exposes the cost; it does not eliminate the contract.

The meter does not eliminate the senior-judgment workload. The agent's failure-mode tail — the cases where the cheaper model class ships a confident wrong answer the meter does not flag — is the cost the buyer pays for the routing decision. The senior-review queue calibrated to catch that failure-mode tail is the engineering-and-human-judgment workload the cost discipline depends on.

Where Sonnet Code fits

A per-token meter on every Copilot turn is the right pressure to land the routing-and-FinOps discipline every coding-agent buyer has been deferring for two years. The meter is the easy half. The engineering and human-judgment work that turns the meter into the routing matrix, the per-workload-class success-rate dashboard, the senior-review-queue calibration, and the contract leverage at the next renewal is the work the cost discipline requires.

AI development at Sonnet Code is the engineering half: standing up the per-developer-and-per-workload-class cost dashboard against the team's existing Copilot, OpenCode, IDE-bundled, and orchestration surfaces; encoding the routing matrix from the team's own success-rate-and-cost data per workload class; wiring the success-rate-and-cost metric into the engineering review cadence so the discipline survives the new-thing-shine; and delivering the per-vendor cost attribution that the next contract renewal grades against.

AI training at Sonnet Code is the human-judgment half: senior engineers and domain experts who author the gold sets that grade each candidate model honestly against the team's specific workload classes; design the senior-judgment rubrics that calibrate the senior-review queue against the cheaper-model-class failure-mode tail the routing shift exposes; refresh the gold sets quarterly so the routing decisions do not silently drift as the model surface evolves; and serve as the senior-judge pool whose calibrated decisions feed the routing-matrix updates the next release cycle's eval surface reflects.

The coding-agent stack just acquired a real FinOps surface. The teams that walk into Q3 with the per-developer dashboard wired, the routing matrix encoded against the team's workload, the senior-review queue recalibrated for the cheaper-class failure-mode tail, and the success-rate-and-cost metric in the engineering review cadence are the teams that turn the meter into a compounding cost-and-quality advantage. The teams that read the meter as a price-hike line item and stop there will discover the routing-honesty gap, the senior-review-queue calibration gap, and the contract leverage they walked away from — six months after the buyer down the road figured out how to grade them honestly.