The month made the decision for you
In the last 30 days, Anthropic shipped Opus 4.7 (87.6% SWE-bench Verified), OpenAI overhauled Codex and teased GPT-6 with a reported sub-0.1% hallucination rate from its two-tier inference framework, Google shipped Gemini 3.1 Ultra with native multimodal reasoning and a million-token context window, and xAI released Grok 4.20 with real-time web access. Open source did not sleep either — four Gemma 4 variants under Apache 2.0, more Qwen releases, and an upgraded Mistral frontier checkpoint.
The teams trying to pick the one best model for their product lost time every week of that month. The teams running a router shipped.
What routing actually means
The pattern that consistently produces good outcomes in Q2 2026 is not "pick the best model and stick with it." It is a three-tier routing setup with an honest evaluation harness underneath:
- A cheap default for the 70%+ of requests that are routine: short completions, classification, extraction, routing itself. The economics of this tier depend on running a small open-source model or a flagship vendor's smallest offering. Price per call should be measured in fractions of a cent.
- A strong mid-tier for the 25% of work that is the actual product — the generation the user sees, the reasoning that matters, the outputs that determine whether your app is good. This is where Sonnet-class, GPT-5-class, and Gemini Pro-class models live. Price per call is the lever your margin turns on.
- A premium option for the 5% of requests where the difference between a 90% answer and a 98% answer justifies the cost. Opus 4.7, GPT-6, Gemini 3.1 Ultra. Price per call is two orders of magnitude above the default tier. Used deliberately, it shows up in the output. Used by default, it shows up in the P&L.
The numbers vary by product. The shape does not.
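A minimal sketch of that shape in Python. The tier names, model identifiers, prices, and the hand-written rule in `pick_tier` are illustrative stand-ins, not recommendations; the point is the structure.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str          # whatever identifier your gateway or SDK expects
    approx_cost: float  # rough dollars per request, for budget tracking

TIERS = {
    "cheap":   Tier("cheap",   "small-open-model",   0.0005),
    "mid":     Tier("mid",     "mid-frontier-model", 0.01),
    "premium": Tier("premium", "top-frontier-model", 0.10),
}

def pick_tier(request: dict) -> Tier:
    """Hand-written starting point. The version that holds up is a trained
    classifier, as the next section argues."""
    if request.get("task") in {"classify", "extract", "route"}:
        return TIERS["cheap"]
    if request.get("escalate"):  # e.g. flagged by a quality check or a user retry
        return TIERS["premium"]
    return TIERS["mid"]
```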
The routing logic is a model
The most common mistake we see in production routers is treating the routing decision as a static config — a map from task type to model, hard-coded and rarely updated. This works for about three months, after which the routing table is a museum of assumptions from the last release cycle.
The version that holds up: the router is itself a small model or classifier, trained on a labeled dataset of (request, which model handled it well) pairs, and updated continuously from production traffic. The task becomes "train the router" rather than "configure the router." Every month the router learns what the underlying vendors improved and shifts traffic accordingly, without anyone rewriting a config file.
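One way that learned router can look, as a sketch. It assumes you have logged (request text, cheapest tier that handled it well) pairs from production; the TF-IDF plus logistic regression choice is only an illustration, and any small classifier or embedding model will do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_router(requests: list[str], best_tier: list[str]):
    """best_tier labels come from your graders: the cheapest tier whose
    answer met the quality bar for that request."""
    router = make_pipeline(
        TfidfVectorizer(max_features=20_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    router.fit(requests, best_tier)
    return router

# Retrain on a schedule from fresh production traffic so the routing
# decision tracks vendor improvements instead of fossilizing:
# router = train_router(logged_requests, logged_tiers)
# tier = router.predict(["summarize this support ticket ..."])[0]
```

The classifier is the easy part; the labeling pipeline and retraining loop around it are the infrastructure most teams underestimate.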
This is where a surprisingly large number of production AI features die. Teams underestimate the amount of infrastructure required to treat routing as a learned component, try to maintain a static table, and lose the margin war to competitors who automated the same decision.
The eval harness is the real competitive moat
The single biggest predictor we have seen for whether a product team ships durable AI features in 2026 is not which model they use. It is whether they have an evaluation harness that runs in under an hour on a representative slice of their production traffic.
Teams with that harness swap models at the cadence of vendor releases, which is roughly monthly now, and pick up the performance and cost improvements each release brings. Teams without it are still running whatever they deployed last quarter, paying a premium they no longer need without realizing it, and losing ground to competitors who kept up.
Building that harness is boring work. Curate 200–500 real production requests with expected outputs or rubric-based graders. Wire them to a test runner that can hit any of your candidate models with the same inputs. Produce a single dashboard with cost-per-request, latency, and quality score per model. That is the entire stack. It does not require buying tooling; it requires treating the ability to swap models as a first-class capability of your product.
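A sketch of that stack under stated assumptions: `call_model` stands in for your model gateway and `grade` for your rubric- or reference-based grader; neither is a real API.

```python
import time

def call_model(model: str, prompt: str) -> tuple[str, float]:
    """Placeholder: returns (output, cost in dollars) from your gateway."""
    raise NotImplementedError

def grade(output: str, case: dict) -> float:
    """Placeholder: rubric- or reference-based quality score in [0, 1]."""
    raise NotImplementedError

def run_eval(cases: list[dict], models: list[str]) -> dict[str, dict]:
    results = {}
    for model in models:
        costs, latencies, scores = [], [], []
        for case in cases:  # the 200-500 curated production requests
            start = time.perf_counter()
            output, cost = call_model(model, case["prompt"])
            latencies.append(time.perf_counter() - start)
            costs.append(cost)
            scores.append(grade(output, case))
        n = len(cases)
        results[model] = {
            "cost_per_request": sum(costs) / n,
            "p50_latency_s": sorted(latencies)[n // 2],
            "mean_quality": sum(scores) / n,
        }
    return results  # the three numbers the dashboard needs, per model
```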
What the routing tiers look like in practice
A rough template, based on what we see working in Q2 2026:
- Cheap tier: Gemma 4 9B on your own infrastructure, or Haiku 4.5, or GPT-5-nano. ~$0.0005 per request on average.
- Mid tier: Sonnet 4.6, GPT-5.4, or Gemini 3.1 Pro. ~$0.01 per request.
- Premium tier: Opus 4.7 at xhigh, GPT-6, Gemini 3.1 Ultra. ~$0.10 per request.
The premium tier fires on maybe 5% of traffic. The math is brutal if it fires on 30%: with the prices above, your blended cost per request roughly quadruples without a corresponding user-visible quality jump, and your margin vanishes.
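The back-of-envelope math, using the per-request prices above and illustrative traffic splits:

```python
prices = {"cheap": 0.0005, "mid": 0.01, "premium": 0.10}

def blended(split: dict[str, float]) -> float:
    return sum(share * prices[tier] for tier, share in split.items())

disciplined = blended({"cheap": 0.70, "mid": 0.25, "premium": 0.05})  # ~$0.0079
sloppy      = blended({"cheap": 0.50, "mid": 0.20, "premium": 0.30})  # ~$0.032
# Letting the premium tier take 30% of traffic roughly quadruples the
# blended cost per request without a matching quality jump.
```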
The one-paragraph working rule
Have a default. Have an escalation. Have a second escalation for the small tail of requests where the quality gap is large enough to notice. Train the router on real traffic instead of guessing. Run an eval harness monthly, and swap models without fear when the numbers move. The teams operating this way are the ones still shipping through the April 2026 release cycle without losing a week. The teams still debating which is the best model are the ones losing that week every month.

