The release
On April 24, DeepSeek dropped a preview of V4 — two open-weight models, V4 Pro (1.6 trillion total parameters, 49 billion active) and V4 Flash (284 billion total, 13 billion active), both mixture-of-experts, both with 1 million token context windows, both released under a standard MIT license, with API access live the same day. DeepSeek's own technical report claims V4 Pro matches GPT-5.5 and Claude Opus 4.7 on most agentic benchmarks at 10–13x lower output cost per token.
The pricing tells the rest of the story. V4 Pro charges $0.145 per million input tokens and $3.48 per million output tokens. V4 Flash charges $0.14 per million input and $0.28 per million output. For comparison, Claude Opus 4.7 lists at $5 input and $25 output, and GPT-5.5 prices are in the same neighborhood. The cost gap is not a rounding error.
Why this lands harder than V3 did
DeepSeek V3 was the first open-weight model that competed credibly on price and quality. V4 is the first that competes on price and on the agentic-reasoning benchmarks that were closed-frontier territory until this month. The difference matters because agentic workloads are where the cost gap compounds: one ChatGPT-style turn might cost a few cents either way, but one long-running agent session reading a codebase, calling tools, and iterating on output can produce 50,000–200,000 output tokens. At 10x the price, that's the line between "we'll route the hard tasks to Opus" and "we'll route everything to Opus and absorb the bill." V4 puts the price-conscious option back inside the same capability envelope.
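To make the compounding concrete, here's the back-of-envelope arithmetic for one agent session in Python, using the list prices above. The session profile (500k input tokens, 100k output tokens) is an assumed example, not a measurement.

```python
# Back-of-envelope cost for one long-running agent session.
# Assumption: 500k input tokens, 100k output tokens (hypothetical session profile).
INPUT_TOKENS = 500_000
OUTPUT_TOKENS = 100_000

# List prices per million tokens (input, output), from the pricing section above.
PRICES = {
    "deepseek-v4-pro": (0.145, 3.48),
    "claude-opus-4.7": (5.00, 25.00),
}

for model, (inp, out) in PRICES.items():
    cost = INPUT_TOKENS / 1e6 * inp + OUTPUT_TOKENS / 1e6 * out
    print(f"{model}: ${cost:.2f} per session")

# deepseek-v4-pro: $0.42 per session; claude-opus-4.7: $5.00 per session.
# Multiply by thousands of sessions a day and the gap becomes a budget line, not noise.
```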
The other shift is licensing. Mistral Medium 3.5 launched last week under a modified MIT license: open-weight, but with a non-trivial commercial-use rider. DeepSeek V4 ships under standard MIT. That's the friendliest possible license for self-hosting, fine-tuning, and commercial redistribution. For teams whose continuity plan hinged on "we could self-host this if we had to," the option just got cleaner.
What changes in the routing playbook
The Q1 2026 routing playbook for AI-integrated products had three tiers:
- Closed frontier (Opus 4.7, GPT-5.5) for the hardest reasoning and most agentic work
- Mid-tier (Sonnet 4.6, GPT-5.5-mini, Composer 2) for the bulk of traffic
- Cheap-and-fast (Haiku, mini, Flash-Lite) for trivial calls
V4 forces a fourth row, slotted above the mid-tier and below the absolute frontier:
- Open-weight frontier-adjacent (DeepSeek V4 Pro, Mistral Medium 3.5) — agentic-class capability at a price point that lets you run it as a default, not as a fallback.
For most workloads that previously went to the closed frontier by default, the right architectural question is no longer "can we afford to keep using Opus?" but "is the marginal capability of Opus worth 10x the cost on this specific task?" The honest answer is yes for a smaller share of workloads than most teams currently route to the frontier.
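In code, the four-tier table can be as small as the sketch below. Everything in it is illustrative: the model names stand in for whatever your routing layer already calls, and the difficulty thresholds are placeholders to be tuned against your own evals, not published guidance.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    max_difficulty: float  # route tasks scored at or below this threshold here

# Illustrative four-tier table; scores and model choices are placeholders.
ROUTING_TABLE = [
    Route("haiku",           max_difficulty=0.3),   # cheap-and-fast
    Route("sonnet-4.6",      max_difficulty=0.6),   # mid-tier bulk
    Route("deepseek-v4-pro", max_difficulty=0.9),   # open-weight frontier-adjacent default
    Route("claude-opus-4.7", max_difficulty=1.0),   # closed frontier, hardest tasks only
]

def pick_model(task_difficulty: float) -> str:
    """Return the cheapest tier whose ceiling covers the task."""
    for route in ROUTING_TABLE:
        if task_difficulty <= route.max_difficulty:
            return route.model
    return ROUTING_TABLE[-1].model
```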
The honest caveats
A few things to keep in the calculation:
- Provenance. DeepSeek is a Chinese AI lab. For regulated US workloads — defense-adjacent, financial services with national-security sensitivity, healthcare with FedRAMP requirements — calling the DeepSeek API or running their hosted endpoints is a non-starter regardless of price or capability. Self-hosting the open weights inside a customer-controlled environment is the only practical path for that buyer class. Make sure your serving stack, fine-tuning pipeline, and audit posture are ready before promising the cost reduction in a procurement conversation.
- MoE serving complexity. A 1.6T-parameter MoE with 49B active parameters is not a model you spin up on a single H100. The infrastructure to serve V4 Pro at scale is non-trivial: multi-GPU serving, careful expert routing, decent observability. If your team is currently calling APIs and has never operated a self-hosted model, the cost arithmetic that looks great on paper can erode fast against operational overhead. Run the TCO honestly; a rough sketch of that comparison follows this list.
- "Matches on most agentic benchmarks" ≠ "matches on yours." Generic benchmark numbers tell you the model is in the same league. They don't tell you it's the right pick for your specific workflow, your specific data, your specific eval rubric. The teams that move fastest are the ones who can run their own internal eval against any new model in a day, not the ones who wait six weeks for a public benchmark to confirm what their evals would have shown earlier.
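On that TCO point, here is a minimal sketch of what running it honestly looks like. Every number is a placeholder assumption to be replaced with your own traffic, GPU pricing, and staffing figures; the point is the shape of the comparison, not the output.

```python
# Rough self-host vs. API comparison. All figures are placeholder assumptions.
MONTHLY_OUTPUT_TOKENS = 2_000_000_000   # assumed traffic
API_OUTPUT_PRICE_PER_M = 3.48           # V4 Pro list price, output side only for brevity

GPU_HOURLY_COST = 2.50                  # assumed per-GPU cloud rate
GPUS_FOR_SERVING = 16                   # assumed multi-GPU MoE deployment
ENGINEER_MONTHLY_COST = 15_000          # assumed share of an SRE/ML engineer's time

api_cost = MONTHLY_OUTPUT_TOKENS / 1e6 * API_OUTPUT_PRICE_PER_M
self_host_cost = GPU_HOURLY_COST * GPUS_FOR_SERVING * 24 * 30 + ENGINEER_MONTHLY_COST

print(f"API:       ${api_cost:,.0f}/month")
print(f"Self-host: ${self_host_cost:,.0f}/month")
# At this assumed volume the API wins comfortably; the crossover depends entirely
# on your traffic and utilization, which is exactly why the TCO has to be run, not assumed.
```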
What to do this month
If you operate an AI-integrated product, three concrete moves:
- Add V4 Pro to your routing benchmark. If you have a routing layer, drop the model in behind a feature flag and measure; a minimal shadow-test sketch follows this list. Teams whose architecture committed to a single closed-frontier vendor are the ones who'll spend a sprint catching up.
- Pull the self-host plan off the shelf. "We could self-host if we had to" is now a credible plan, not a hedge. Pick a small workload, stand up V4 Flash on owned infrastructure, and prove the deployment path works; the second sketch below shows how small that start can be. The exercise has value separate from whether you migrate any production traffic to it.
- Re-run your unit economics. Most AI-integrated product P&Ls were modeled six months ago against closed-frontier pricing. The cost compression of the last quarter (V4, Medium 3.5, Composer 2, Gemini Flash-Lite) has changed the answers to which features are profitable, which pricing tier should bundle AI, and which workloads you should stop charging for. Bring the spreadsheet up to today's prices before the next planning cycle.
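A minimal sketch of the shadow-test pattern behind the first item: mirror a small slice of traffic to the candidate model, score both answers with your internal eval, and keep serving the incumbent until the aggregate numbers justify a switch. `call_model` and `score` are stand-ins for whatever your stack already provides.

```python
import random

CANDIDATE_FLAG_PCT = 0.05  # fraction of requests that also hit the candidate

def handle_request(prompt: str, call_model, score) -> str:
    """Serve the incumbent; shadow-test the candidate on a sampled slice."""
    primary = call_model("claude-opus-4.7", prompt)
    if random.random() < CANDIDATE_FLAG_PCT:
        candidate = call_model("deepseek-v4-pro", prompt)
        # Routing decisions come from the aggregate of these logs, not any one sample.
        log_eval_result(score(primary), score(candidate))
    return primary  # users keep getting the incumbent until the numbers say otherwise

def log_eval_result(primary_score: float, candidate_score: float) -> None:
    print(f"eval: primary={primary_score:.2f} candidate={candidate_score:.2f}")
```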
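And for the second item, the deployment-path proof can start as small as the sketch below, assuming V4 Flash lands with the same vLLM support path DeepSeek's earlier releases did. The model ID, parallelism, and context settings are guesses, not published values.

```python
from vllm import LLM, SamplingParams

# Assumptions: hypothetical model ID and placeholder GPU settings.
# Even at 13B active parameters, the 284B total weights need multiple GPUs.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical ID; check the actual release card
    tensor_parallel_size=8,                  # placeholder for your GPU count
    max_model_len=32_768,                    # start small; raise once serving is stable
)

outputs = llm.generate(
    ["Summarize the open issues in this repository."],
    SamplingParams(max_tokens=512, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```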
What it doesn't change
A few things V4 does not solve:
- Frontier ceiling. The hardest reasoning, the longest agent loops, the most ambiguous tool selection — these are still cleaner with Opus 4.7 or GPT-5.5. V4 narrows the gap; it doesn't close it. Routing still matters.
- The eval problem. A cheaper model only helps if you can measure that it's good enough on your workload. Teams without an internal eval suite will not capture the cost savings — they'll see latency or quality regressions in production and roll back, then conclude V4 "wasn't ready," when the actual problem is they had no way to verify before shipping.
- Compliance and red-team coverage. A new model means a new threat surface. Prompt injection, jailbreaks, data exfiltration paths, and tool-misuse patterns all need re-testing. Don't ship a routing change without re-running your red-team suite.
Sonnet Code's take
The frontier is splitting into two rails: a closed, premium-priced top end where capability advantages are real but margins are getting tested, and an open-weight, fast-improving tier where the gap to frontier shrinks every quarter. V4 is the clearest signal yet that the open-weight tier is going to keep tightening, and that the routing-and-substitution architecture we've been recommending all year is the right default — not a hedge.
The work that compounds isn't picking the right model; it's the eval suite, the routing logic, and the human-in-the-loop training data that lets you swap models confidently. That's split between AI development — building the routing, the tool definitions, the observability — and AI training, where domain experts produce the evals, demonstrations, and red-team coverage that make a model swap a measured decision instead of a leap of faith. If your team is staring at the V4 release wondering whether the cost story is too good to be true, the answer is: probably not, but only if you can measure it on your own workload. That measurement layer is what we build.

