What DeepSeek and Moonshot shipped and the permanent floor that just settled
The DeepSeek V4 Pro permanent rate card that took effect on June 1, 2026 — after the 75% promotional discount expired on May 31 — is the point where the open-weight frontier coding tier stopped being a promotional bet that might revert and started being a permanent floor on the model-spend line that the multi-vendor routing portfolio either captures or doesn't. The permanent rate is $0.435 / $0.87 per 1M input/output tokens (cache-hit at $0.003625/M), roughly one order of magnitude below the closed Western flagships at the equivalent capability tier — Claude Opus 4.8 at $5/$25, GPT-5.5 in a broadly comparable band, Gemini 3.1 Pro in the same neighborhood. Kimi K2.6, released April 20 by Moonshot, sits at $0.95/$4.00 for the 1T-parameter MoE, 32B-active, 256K-context model under a Modified MIT license. DeepSeek V4 weights are MIT-licensed, with V4-Pro at 1.6T total / ~49B active and V4-Flash at 284B total / ~13B active.
The operationally important specifications, summarized from the consolidated April–May rollouts and the June 1 rate-card settlement:
- DeepSeek V4 Pro: 1.6T total / ~49B active MoE, MIT-licensed weights, permanent rate $0.435/$0.87 per 1M (input/output), cache-hit $0.003625/M.
- DeepSeek V4 Flash: 284B total / ~13B active MoE, MIT-licensed weights, the lighter-cost tier for routine work.
- Kimi K2.6: 1T total / 32B active MoE, 256K context, Modified MIT license, $0.95/$4.00 per 1M.
- License posture: both families ship under licenses that permit commercial use, modification, and on-prem deployment without a per-deployment royalty — the procurement-side floor for the in-perimeter inference path.
- Benchmark posture: both families are in the frontier-tier band on the published agentic-coding evals (SWE-Bench Pro, Terminal-Bench 2.0-Terminus, the broader coding leaderboards) — close enough to the closed Western flagships that the workload-specific divergence is the eval question rather than a generic capability gap.
- Pricing structure: API-served pricing is published and stable; self-hosted pricing is the marginal cost of the inference substrate the customer operates, which falls outside the published rate card entirely.
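To make the rate-card gap concrete, here is a minimal sketch that prices a single request at the list rates quoted above. The model keys, the function name, and the 200K-in / 20K-out request size are illustrative assumptions, and the sketch ignores cache-hit pricing entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# List rates as quoted in this article; the dictionary keys are
# hypothetical identifiers, not official API model names.
RATES = {
    "deepseek-v4-pro": RateCard(0.435, 0.87),
    "kimi-k2.6": RateCard(0.95, 4.00),
    "claude-opus-4.8": RateCard(5.00, 25.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list price (no cache discount)."""
    card = RATES[model]
    return (
        input_tokens * card.input_per_m + output_tokens * card.output_per_m
    ) / 1_000_000


if __name__ == "__main__":
    # Assumed request shape: 200K input tokens, 20K output tokens.
    for name in RATES:
        print(f"{name}: ${request_cost(name, 200_000, 20_000):.4f}")
```

Under those assumptions the same request comes to roughly $0.10 on V4 Pro, $0.27 on Kimi K2.6, and $1.50 on Opus 4.8. The exact ratio depends on the input/output mix of the workload, since the output-token gap is wider than the input-token gap.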
Worth flagging clearly: the permanent rate card is not, by itself, a routing-strategy revolution. The published rates have been visible since April, and the multi-vendor routing cohort has been integrating the open-weight tier for two months. What's new in the June 1 settlement is the permanence: the rate is no longer a promotional posture that might revert against the FY27 budget assumptions; it is the standing floor against which the routing portfolio's cost-per-successful-task math resolves. The buyer whose Q2 routing decisions ran on "the promotional rate is the floor, but we should plan against a higher number" now has the actual floor in front of them, and the budget conversation moves from the contingency shape to the durable shape.
Why an order-of-magnitude pricing delta on an open-weight frontier model resets the routing portfolio
For the last eighteen months the multi-vendor routing conversation has anchored on the closed-flagship cohort — Claude, GPT, Gemini — with the open-weight tier treated as the lower-capability lane for workloads where the cost mattered more than the marginal capability. The framing was approximately correct through Q1 2025 when the open-weight tier was a capability generation behind. It is no longer correct in 2026 when the open-weight tier (Qwen 3.7 Plus, DeepSeek V4 Pro, Kimi K2.6, GLM-5.1, the broader cohort) is in the frontier band on the agentic-coding eval surface, and the cost delta is an order of magnitude rather than a 30% discount.
Three honest reads on why the routing portfolio reshapes around the permanent open-weight floor.
The cost-per-successful-task math inverts on the routine workload distribution. A token-throughput routing strategy that sends the routine work to a $0.435/$0.87 model and reserves the closed-flagship tier for the workload tail where the capability difference actually matters has a meaningfully lower blended cost-per-successful-task than a strategy that sends everything to the closed-flagship cohort. The buyer who ran closed-flagship-only routing through Q2 will see that omission on the cost-per-successful-task line when compared with the buyer who built multi-vendor routing over the same window. The delta is not a marginal optimization; it is the kind of FY27 budget mismatch that surfaces in the CFO conversation.
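The blended math can be sketched in a few lines. The sketch assumes a failed attempt is simply retried on the same model with the same independent success rate, so the expected cost of one successful task is the cost per attempt divided by the success rate. Every number in the example (shares, per-attempt costs, success rates) is hypothetical and chosen only to show the shape of the calculation.

```python
def cost_per_successful_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one success, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def blended_cost(lanes: list[tuple[float, float, float]]) -> float:
    """Share-weighted cost per successful task across routing lanes.

    Each lane is (share_of_tasks, cost_per_attempt_usd, success_rate);
    the shares must sum to 1.
    """
    total_share = sum(share for share, _, _ in lanes)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("lane shares must sum to 1")
    return sum(
        share * cost_per_successful_task(cost, rate)
        for share, cost, rate in lanes
    )


if __name__ == "__main__":
    # Hypothetical: everything goes to a closed flagship.
    flagship_only = blended_cost([(1.0, 1.50, 0.90)])
    # Hypothetical: 80% routine work on an open-weight model,
    # 20% capability-tail work stays on the flagship.
    split = blended_cost([(0.8, 0.10, 0.80), (0.2, 1.50, 0.90)])
    print(f"flagship-only: ${flagship_only:.2f}  split: ${split:.2f}")
```

With these made-up inputs the flagship-only portfolio lands near $1.67 per successful task and the split portfolio near $0.43. The real figures have to come from measured success rates on the buyer's own workloads.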
The in-perimeter deployment path is a permission, not a constraint. MIT-licensed weights mean the customer can deploy the model inside its own VPC, on its own GPU substrate, against its own observability surface, without a per-deployment royalty conversation. For the regulated-industry buyer where the in-perimeter path is the only path, the open-weight tier at frontier capability is the deployment object the procurement conversation has been waiting for. For the non-regulated buyer, the in-perimeter path is an optionality the routing portfolio carries — the buyer can route certain workloads to the API-served endpoint, certain workloads to the in-perimeter endpoint, and certain workloads to the closed-flagship cohort, with the routing decision made per workload rather than per platform.
The permanence of the rate card resolves the budget contingency. A promotional discount that expires forces the FY27 budget conversation to plan against the post-promotional rate. A permanent rate card lets the routing portfolio compound the cost-per-successful-task win across the budget cycle. The customer whose Q2 routing decisions were made against the promotional rate has the budget math to redo against the permanent rate; the customer whose Q2 decisions were made against the post-promotional rate has the math already done. The settlement of the rate card moves the budget conversation from the contingency shape to the durable shape, and the routing portfolio's compounding window opens against the durable shape.
What changes about the multi-vendor routing matrix
Four shifts that follow when the open-weight frontier tier has a permanent floor and the in-perimeter deployment path is on the table.
The routing matrix becomes a per-workload selection across three model classes, not a per-vendor commitment. The Q1 routing strategy had a two-class shape: closed-flagship cohort for the serious work; open-weight cohort for the cost-sensitive tail. The Q3 strategy has a three-class shape: closed-flagship cohort for the capability tail; API-served open-weight tier for the volume work where the in-perimeter path isn't justified; in-perimeter self-hosted open-weight tier for the workload classes where the egress posture, the data sensitivity, or the substrate economics make the self-hosted path the cheaper or more compliant answer. The buyer whose routing logic encodes only the two-class shape is running on the obsolete matrix.
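As a rough illustration of the three-class shape, the sketch below routes each workload to one of the three model classes. The class names, the two boolean flags, and the precedence (perimeter constraint first, then capability tail, then the API-served default) are assumptions made for the example; a real router would derive these flags from measured data and policy rather than hard-code them.

```python
from dataclasses import dataclass
from enum import Enum


class ModelClass(Enum):
    CLOSED_FLAGSHIP = "closed-flagship"
    API_OPEN_WEIGHT = "api-open-weight"
    IN_PERIMETER_OPEN_WEIGHT = "in-perimeter-open-weight"


@dataclass(frozen=True)
class Workload:
    name: str
    data_must_stay_in_perimeter: bool  # egress or data-sensitivity constraint
    capability_tail: bool              # evals show the flagship gap matters here


def route(workload: Workload) -> ModelClass:
    """Pick a model class per workload rather than per vendor."""
    # Assumed precedence: a perimeter constraint overrides everything else.
    if workload.data_must_stay_in_perimeter:
        return ModelClass.IN_PERIMETER_OPEN_WEIGHT
    if workload.capability_tail:
        return ModelClass.CLOSED_FLAGSHIP
    # Default lane for volume work with no perimeter requirement.
    return ModelClass.API_OPEN_WEIGHT


if __name__ == "__main__":
    for w in (
        Workload("routine-refactoring", False, False),
        Workload("long-horizon-planning", False, True),
        Workload("claims-doc-extraction", True, False),
    ):
        print(w.name, "->", route(w).value)
```

A two-class router would only have the second and third branches, which is the gap the paragraph above describes.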
The cost-per-successful-task attribution becomes the routing primitive, not the model-quality leaderboard. A routing decision made on the published benchmark — this model is N points ahead on SWE-Bench Pro, so we route serious work to it — is a routing decision made on the wrong axis. The right axis is the cost-per-successful-task on the customer's specific workload class: which model produces the lowest cost-per-successful-task on routine refactoring; which produces the lowest on long-horizon planning; which produces the lowest on the test-generation surface; which produces the lowest on the doc-extraction work. The cost-per-successful-task math factors both the token cost (where the open-weight tier wins by an order of magnitude) and the success rate (where the closed-flagship cohort still wins on the harder tail). The routing decision falls out of the math, and the math has to be decomposed per workload, per model, per tier.
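One way to let the routing decision fall out of the math is to compute cost-per-successful-task for every model on every workload class and take the minimum. The sketch below assumes per-attempt cost and success rate have already been measured; the workload names, model labels, and all figures are hypothetical.

```python
def pick_models(
    stats: dict[str, dict[str, tuple[float, float]]]
) -> dict[str, str]:
    """Choose the cheapest model per workload by cost per successful task.

    stats maps workload -> model -> (cost_per_attempt_usd, success_rate).
    """
    routing = {}
    for workload, by_model in stats.items():
        # Models that never succeed on a workload are not candidates.
        viable = {
            model: cost / rate
            for model, (cost, rate) in by_model.items()
            if rate > 0
        }
        if not viable:
            raise ValueError(f"no model succeeds on workload {workload!r}")
        routing[workload] = min(viable, key=viable.get)
    return routing


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    measured = {
        "routine_refactoring": {
            "open_weight": (0.10, 0.85),
            "closed_flagship": (1.50, 0.92),
        },
        "long_horizon_planning": {
            "open_weight": (0.40, 0.05),
            "closed_flagship": (2.00, 0.60),
        },
    }
    print(pick_models(measured))
```

In this made-up data the open-weight model wins routine refactoring on token cost, while the flagship wins long-horizon planning because the open-weight success rate is too low for the cheaper attempts to pay off.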
The eval discipline extends to grade the open-weight tier on the customer's specific codebase, not on the leaderboard's benchmark. The published benchmark numbers are reproducible from the eval methodology; the workload-specific performance on the customer's actual codebase is not predicted by the benchmark and has to be measured. The teams whose eval harness was already standing for the closed-flagship cohort have to extend the harness to the open-weight tier, with a column added for the cost-per-successful-task on each workload class. The teams whose eval harness was on the "we'll build it next quarter" roadmap will produce the comparison too late to inform the FY27 budget cycle.
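The added harness column could be as simple as an aggregation over eval run records. This is a minimal sketch, assuming each record carries the model, the workload class, the spend for that run, and a pass/fail verdict from the harness; the record layout and labels are hypothetical.

```python
from collections import defaultdict


def cost_per_success_table(runs):
    """Aggregate eval runs into cost per successful task.

    runs is an iterable of (model, workload, cost_usd, succeeded) tuples.
    Returns {(model, workload): total_spend / successes}, or infinity for
    a cell with spend but no successes.
    """
    totals = defaultdict(lambda: [0.0, 0])  # [total spend, success count]
    for model, workload, cost_usd, succeeded in runs:
        cell = totals[(model, workload)]
        cell[0] += cost_usd
        if succeeded:
            cell[1] += 1
    return {
        key: (spend / wins if wins else float("inf"))
        for key, (spend, wins) in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical run log, for illustration only.
    runs = [
        ("open_weight", "refactoring", 0.10, True),
        ("open_weight", "refactoring", 0.10, False),
        ("closed_flagship", "refactoring", 1.50, True),
        ("open_weight", "planning", 0.40, False),
    ]
    for key, value in cost_per_success_table(runs).items():
        print(key, value)
```

Dividing total spend by successes charges failed runs to the cell that produced them, which is what the comparison needs. A cell with no successes reports infinity instead of being silently dropped.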
The gold-set authoring discipline extends to grade non-Western-trained models honestly. DeepSeek and Kimi are trained on substantially different distributions from the Western flagships, with deeper Chinese-language exposure and different fine-tuning posture. The places where the workload-specific performance diverges from the benchmark performance are different in shape from the Western models' divergences. The gold sets that grade the open-weight tier honestly need a row of cases that exercise the failure modes specific to a model trained on a different distribution — the Chinese-language false-cognates the agent will misread, the regional dialect of the customer's codebase the model will normalize against, the library-version misidentifications that come from a different training-data cutoff. That row of the eval matrix is new engineering work that the senior judges have to author, not a copy-paste of the existing harness.
What this does not change
Three honest caveats.
It does not eliminate the closed-flagship cohort from the routing portfolio. The open-weight tier is competitive on the bulk of the workload distribution; it is not the frontier on the hardest tail. The portfolio that routes 100% of work to the open-weight tier on the day after the rate-card settlement will spend the rest of the quarter rolling back routing decisions on the workloads where the closed-flagship capability difference actually mattered. The portfolio is broader, not cheaper everywhere.
It does not eliminate the geopolitical and supply-chain considerations. Some buyers — public sector, defense, certain financial-services compliance regimes — have explicit constraints on the use of Chinese-built models, including the open-weight tier, regardless of the technical posture. Those constraints are unaffected by the rate card or the license. The honest play is to know which constraints the organization actually has, in writing, with the right legal review, before the procurement conversation, rather than discovering them after the platform team has deployed the in-perimeter path.
It does not collapse the senior-review queue. A routing portfolio extended to the open-weight tier is a portfolio whose hardest failure modes still have to be caught by humans whose judgment is calibrated to the specific model's behavior on the customer's workload. The senior-review queue's calibration has to extend to the new entries; the queue's volume has to scale with the volume the new entries pick up. The buyer who treats the rate card as "the open-weight tier is safe to run unsupervised" will get the audit log of incidents the queue should have caught.
Where Sonnet Code fits
A permanent floor on the open-weight frontier coding tier inside a multi-vendor routing matrix is the easy half of the procurement conversation. The hard half is the engineering and human-judgment work that turns "the rate card is on the table" into "the routing matrix is decomposed per workload across the closed-flagship cohort, the API-served open-weight tier, and the in-perimeter self-hosted open-weight tier; the cost-per-successful-task attribution is a first-class dashboard per model per workload; the eval discipline grades the open-weight tier honestly on the customer's specific codebase; and the gold sets exercise the failure modes specific to a non-Western-trained distribution."
AI development at Sonnet Code is the engineering half: extending the MCP-native routing layer to treat DeepSeek V4 Pro, V4 Flash, and Kimi K2.6 as first-class peers of the closed-flagship cohort; standing up the in-perimeter self-hosted deployment on the GPU substrate the platform team already operates; wiring the cost-per-successful-task attribution per model and per workload as a first-class dashboard rather than a quarterly reconstruction; and instrumenting the routing decisions so the per-workload selection falls out of the data rather than out of the engineer's preference.
AI training is the human-judgment half: senior engineers, domain experts, and bilingual reviewers who design the gold sets that grade a model trained on a different distribution honestly against the customer's workload, calibrate the senior-review queue for the failure modes that differ from the Western flagships, build the rubrics that decide which class of work auto-routes to the in-perimeter endpoint, which auto-routes to the API-served open-weight tier, and which always escalates to the closed-flagship cohort.
The open-weight frontier coding tier just settled at a permanent price floor one order of magnitude below the closed Western flagships. The teams that walk into Q3 with the routing matrix decomposed per workload across the three model classes, the cost-per-successful-task attribution decomposed per model per workload, the eval discipline graded honestly on the customer's codebase, and the gold sets exercising the non-Western-trained distribution's failure modes are the teams that turn the rate-card settlement into a compounding cost-and-capability win through the rest of 2026. The teams that read the settlement as "the open-weight tier got cheaper" and renew the closed-flagship-only routing into FY27 will discover, two renewal cycles later, that the buyer down the road who built the three-class routing matrix is paying meaningfully less for the same engineering output.

