Sonnet Code
AI Development · June 1, 2026 · 10 min read

Subquadratic Emerged From Stealth With $29M, a 12M-Token Context Window, and 300× Cost-Reduction Claims on RULER 128K. The Quadratic-Attention Tax Era of LLM Economics May Be Ending — and Architectural Lock-In Is Suddenly a Real Category of Risk Again.

What Subquadratic shipped, in the numbers

On May 5, 2026, Subquadratic emerged from stealth with $29 million in seed funding led by Javier Villamizar (formerly of SoftBank Vision Fund) and Justin Mateen (Tinder co-founder, founder of JAM), with participation from early backers of Anthropic, OpenAI, Stripe, and Brex. The company is led by CEO Justin Dangel and CTO Alexander Whedon.

The launch shipped three products at once: SubQ API for direct developer and enterprise access to the 12M-token model; SubQ Code, a CLI coding agent built around the long context for whole-codebase reasoning in a single session; and a consumer-facing search product offered free as a land-and-expand surface. None of the products are open-weight; Subquadratic ships SubQ as a closed model with customization available for specific use cases.

The published numbers, all of which carry a "not yet independently verified" asterisk, are the headlines that moved the model out of the architecture-paper category and into the production-platform conversation:

  • 95% accuracy on RULER 128K at $8 per full evaluation, against Claude Opus at 94% and ~$2,600 for the same eval — a claimed 300× cost reduction at parity quality.
  • 52× faster than FlashAttention at 1M tokens in prefill on B200 GPUs, with intermediate speedups of 7.2× at 128K, 13.2× at 256K, and 23× at 512K.
  • ~1,000× less compute at 12M tokens versus a dense quadratic-attention Transformer at the same context length.
  • A 12-million-token context window in production, with the SubQ Code agent able to take an entire mid-sized codebase as a single session input rather than relying on retrieval.
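
As a quick sanity check on the figures above, the following Python sketch recomputes the ratios they imply. The dollar amounts and speedups are the ones Subquadratic published (and remain unverified); the function and variable names are illustrative only.

```python
# Published (not independently verified) figures from the launch.
SUBQ_RULER_COST_USD = 8.0
OPUS_RULER_COST_USD = 2600.0  # approximate, per the announcement

# Prefill speedup over FlashAttention on B200, keyed by context length in tokens.
PREFILL_SPEEDUP = {128_000: 7.2, 256_000: 13.2, 512_000: 23.0, 1_000_000: 52.0}


def cost_ratio(baseline_usd: float, candidate_usd: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline_usd / candidate_usd


def speedup_growth(speedups: dict[int, float]) -> list[tuple[int, int, float]]:
    """Ratio of successive speedups: (shorter context, longer context, growth)."""
    lengths = sorted(speedups)
    return [(a, b, speedups[b] / speedups[a]) for a, b in zip(lengths, lengths[1:])]


if __name__ == "__main__":
    ratio = cost_ratio(OPUS_RULER_COST_USD, SUBQ_RULER_COST_USD)
    print(f"RULER 128K cost ratio: {ratio:.0f}x")
    for a, b, growth in speedup_growth(PREFILL_SPEEDUP):
        print(f"{a:>9,} -> {b:>9,} tokens: speedup grows {growth:.2f}x")
```

The cost ratio works out to 325×, which the announcement rounds down to 300×. The speedup grows by roughly 1.7× to 2.3× each time the context roughly doubles, which is the pattern you would expect if the baseline is quadratic and the new mechanism is close to linear.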

Subquadratic calls the architecture Subquadratic Sparse Attention (SSA): a sparse-attention mechanism in which the model learns during pretraining which token-to-token relationships matter, rather than computing all of them densely and discarding most, with the entire stack optimized so that cost scales linearly rather than quadratically with context length.

What "subquadratic" actually changes about the cost curve

To understand why this is potentially structural rather than incremental, it helps to be precise about the thing it would replace.

For every frontier LLM since the original Transformer paper, the attention compute needed to process a context has grown roughly as N², where N is the context length (equivalently, the cost of each additional token grows roughly as N). The reason is straightforward: dense attention computes a relationship between every token in the context and every other token. Double the context, quadruple the attention compute. Engineering work like FlashAttention has aggressively optimized the constant factor on that quadratic curve, and architectural work like sliding-window attention and ring attention has carved out useful exceptions, but the underlying curve has been N² since 2017. That curve is the thing that makes a 1M-token context window expensive, a 10M-token context window approximately impossible at frontier quality, and a 100M-token context window science fiction.
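
The "double the context, quadruple the compute" relationship is easy to check by counting query-key pairs. This is a minimal sketch that counts pairs only; it ignores head count, layer count, and hidden size, all of which multiply the total without changing the shape of the curve. The function name is illustrative.

```python
def dense_attention_pairs(n_tokens: int, causal: bool = False) -> int:
    """Count the query-key pairs dense attention scores over n_tokens.

    With causal masking each token attends to itself and earlier tokens,
    which roughly halves the count but leaves the growth quadratic.
    """
    if causal:
        return n_tokens * (n_tokens + 1) // 2
    return n_tokens * n_tokens


if __name__ == "__main__":
    base = dense_attention_pairs(128_000)
    for n in (128_000, 256_000, 1_000_000, 12_000_000):
        pairs = dense_attention_pairs(n)
        print(f"{n:>12,} tokens: {pairs:.3e} pairs ({pairs / base:,.0f}x the 128K count)")
```

A 12M-token context is about 94 times longer than a 128K one, so under dense attention it needs roughly 8,800 times as many pair scores.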

If SubQ's architectural claim holds — and the if is doing real work in that sentence — then the curve isn't N². It's roughly N, or N·log(N), or N times some small constant that doesn't grow with context length. That changes the economics of three workload classes immediately.
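
To make the candidate curves concrete, here is a rough comparison under stated assumptions: dense attention costs N² pair scores, a linear mechanism costs N·k where each query attends to k keys, and an N·log(N) mechanism costs N·log₂(N). These cost models are simplifications chosen for illustration, not a description of how SSA works internally, and they count attention only.

```python
import math


def dense_cost(n: int) -> float:
    """Pair scores for dense attention: every token against every token."""
    return float(n) * n


def linear_cost(n: int, keys_per_query: float) -> float:
    """Pair scores if each query attends to a fixed number of keys."""
    return float(n) * keys_per_query


def nlogn_cost(n: int) -> float:
    """Pair scores for a hypothetical N * log2(N) mechanism."""
    return n * math.log2(n)


def implied_keys_per_query(n: int, claimed_reduction: float) -> float:
    """If dense/linear == claimed_reduction, solve N*N / (N*k) for k."""
    return n / claimed_reduction


if __name__ == "__main__":
    n = 12_000_000
    k = implied_keys_per_query(n, claimed_reduction=1_000)
    print(f"Implied keys per query at {n:,} tokens: {k:,.0f}")
    print(f"Dense vs linear(k): {dense_cost(n) / linear_cost(n, k):,.0f}x")
    print(f"Dense vs N*log2(N): {dense_cost(n) / nlogn_cost(n):,.0f}x")
```

Under the linear model, the claimed ~1,000× reduction at 12M tokens corresponds to each query attending to about 12,000 keys on average. That figure is an inference from the arithmetic, not something Subquadratic has stated, and it leaves out feed-forward layers and whatever it costs to select which keys to attend to.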

Whole-codebase reasoning. SubQ Code's "whole codebase in one session" framing stops being a marketing claim and becomes the operating mode. A 1M-LOC repo at frontier quality, with the model able to reason about a change in module A against the type system in module B and the migration in module C, is a different shape of agent than anything the 200K–400K-context generation can run.

Document review and contract intelligence. Workflows that require the model to hold a 500-page contract, a regulatory framework, and twenty supporting documents in a single context are currently solved with retrieval-augmented patterns that lose information at every chunking step. If a 12M-token context is priced anywhere near the claimed 300× reduction (the $8 versus ~$2,600 comparison was measured on RULER 128K, not at 12M tokens), those workflows turn from RAG-with-engineering-overhead into "load everything and ask."

Long-horizon agentic work. The constraint on multi-hour agentic sessions today is in part the cost of feeding the model the accumulated context of the session itself. At a claimed ~1,000× less compute at 12M tokens, that constraint loosens enough that an agent that runs for an entire workday, accumulates context the whole time, and makes decisions against the full session history becomes practical at production cost.

None of these are speculative if the architectural claim holds. All of them depend on the claim holding for workloads outside the narrow benchmark surface Subquadratic published on.

The verification window

The most important thing to be clear about: the headline numbers are not yet independently verified. The model is not open-weight. The training methodology is private. The benchmark surface is narrow — RULER 128K is a long-context retrieval benchmark, not a general capability eval. The prefill speedup numbers are reported on a single GPU class. And the architectural family — sparse attention, long-context-first — has a long history of looking great in papers and disappointing in production once the workload broadens.

What to watch for over the next 90 days:

Third-party reproductions on workloads the company didn't publish. Coding-agent benchmarks (SWE-Bench Verified, Terminal-Bench). Multi-document QA outside RULER. Long-horizon agentic evals that grade trajectory, not just final answer. If SubQ holds at 80%+ of frontier-model quality on any one of those at the published cost, the architectural premise is real. If it falls off a cliff outside RULER and a narrow band of long-context tasks, the conclusion is that SSA works for a specific shape of problem and is not a general replacement.

Behavior on the failure modes long-context architectures historically share. "Lost in the middle," the degradation in retrieval accuracy for tokens in the middle of a very long context, is the classic failure mode. The published RULER 128K result is suggestive, but 128K tokens is roughly 1% of the 12M-token window. A high-quality independent eval at 4M, 8M, and 12M tokens, on tasks that require the model to actually use the long tail of the context rather than synthesize a summary, is the test that matters.
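
A depth sweep is the simplest way to probe for lost-in-the-middle behavior: plant one fact at different positions in a long context and check whether the model can still retrieve it. The sketch below is a minimal harness, not a substitute for the harder multi-hop evals described above. `ask_model` is a hypothetical callable you would wire to whatever API you are testing, and `head_tail_stub` is a toy stand-in that deliberately fails in the middle so the harness can be run without a model.

```python
from typing import Callable

AskModel = Callable[[str, str], str]  # (context, question) -> answer text


def build_haystack(filler: str, needle: str, n_filler: int, depth: float) -> str:
    """Place the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)


def depth_sweep(
    ask_model: AskModel,
    depths: list[float],
    n_filler: int,
    secret: str = "7421",
) -> dict[float, bool]:
    """Return, for each depth, whether the model's answer contains the secret."""
    filler = "The sky was a uniform grey that afternoon."
    needle = f"The vault code is {secret}."
    question = "What is the vault code?"
    results: dict[float, bool] = {}
    for depth in depths:
        context = build_haystack(filler, needle, n_filler, depth)
        results[depth] = secret in ask_model(context, question)
    return results


def head_tail_stub(context: str, question: str) -> str:
    """Toy stand-in that only 'sees' the first and last 200 characters."""
    return context[:200] + context[-200:]


if __name__ == "__main__":
    print(depth_sweep(head_tail_stub, [0.0, 0.25, 0.5, 0.75, 1.0], n_filler=1_000))
    # -> {0.0: True, 0.25: False, 0.5: False, 0.75: False, 1.0: True}
```

For a real run you would scale `n_filler` until the context reaches 4M, 8M, and 12M tokens, use varied filler text instead of one repeated sentence, and repeat each depth enough times to report an accuracy rather than a single pass or fail.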

Pricing stability. $8 per RULER 128K evaluation is a launch number. The unit economics of running SSA inference at scale, with B200 GPUs at current cloud spot prices, on the actual usage patterns of paying customers, are what determine whether that price holds or compresses toward the rest of the market. The first sign of pressure would be tier restructuring within the next two quarters.

The honest summary: the announcement is interesting; the architecture is plausible; the claims are large enough and verifiable enough that the next 90 days will resolve them in one direction or the other. Treating SubQ as either inevitable or vaporware in June 2026 is overconfident. Treating it as a real possibility that warrants engineering preparation is correct.

What changes for production teams if the claims hold

These are the concrete decisions that any team running long-context workloads, or planning to, should make, contingent on the verification work going the right way.

Your retrieval architecture is up for reconsideration. RAG was built around the assumption that you can't afford to put everything in context, so you have to retrieve. If everything fits in 12M tokens at something like 1/300 of the previous cost, the RAG stack (chunking, embedding, vector store, retriever, re-ranker, and all the engineering glue around it) is competing against "load the whole document set and ask." That doesn't mean RAG goes away. It does mean RAG becomes the answer for the subset of workloads where the document corpus is genuinely larger than 12M tokens, rather than the default for everything. The team that's built a large RAG investment should be ready to defend why it's still the right shape for each specific workload.
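
One way to make that defense concrete is to write the comparison down per workload. The sketch below compares the per-query input cost of loading the full corpus against a retrieval pipeline. Every price in the example is a made-up placeholder, the function names are illustrative, and cost is only one input: answer quality on your own evals should carry at least as much weight.

```python
def per_query_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single query."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


def pick_approach(
    corpus_tokens: int,
    context_window: int,
    retrieved_tokens: int,
    long_context_price: float,
    rag_model_price: float,
    rag_overhead_usd: float = 0.0,
) -> str:
    """Return 'full_context' or 'retrieval' for one workload.

    Prices are USD per million input tokens. rag_overhead_usd stands in for
    per-query embedding, vector-store, and re-ranking costs.
    """
    if corpus_tokens > context_window:
        return "retrieval"  # the corpus simply does not fit
    full = per_query_cost_usd(corpus_tokens, long_context_price)
    rag = per_query_cost_usd(retrieved_tokens, rag_model_price) + rag_overhead_usd
    return "full_context" if full <= rag else "retrieval"


if __name__ == "__main__":
    # Hypothetical prices for illustration only.
    print(pick_approach(2_000_000, 12_000_000, 8_000, 0.01, 3.00))
    print(pick_approach(20_000_000, 12_000_000, 8_000, 0.01, 3.00))
```

With these placeholder numbers the 2M-token corpus goes to full context and the 20M-token corpus goes to retrieval. The useful part is less the answer than having the inputs written down where they can be challenged.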

Your codebase agents should be designed assuming whole-repo context is becoming cheap. Agents that today perform careful pre-retrieval to figure out which files are relevant to a task should be redesigned with an alternative path where the agent loads the entire repo into a single session. If SubQ holds, the alternative path becomes the default for repos under 1M LOC, which covers a substantial fraction of the production codebases in the world. The portable design is one where retrieval is optional, not assumed.
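
A retrieval-optional design can be as small as one function that loads the whole repo when it fits and falls back to a retriever when it doesn't. This is a sketch under assumptions: `estimate_tokens` uses a crude four-characters-per-token proxy that you would replace with a real tokenizer, and the `Retriever` signature is hypothetical.

```python
from typing import Callable, Optional

# Hypothetical interface: (task, all files) -> paths worth loading.
Retriever = Callable[[str, dict[str, str]], list[str]]


def estimate_tokens(text: str) -> int:
    """Crude proxy of about 4 characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(
    task: str,
    repo_files: dict[str, str],
    token_budget: int,
    retriever: Optional[Retriever] = None,
) -> dict[str, str]:
    """Return the files to place in context for this task.

    Whole-repo loading is the first choice; retrieval is the fallback.
    """
    total = sum(estimate_tokens(src) for src in repo_files.values())
    if total <= token_budget:
        return dict(repo_files)
    if retriever is None:
        raise ValueError(
            f"repo needs ~{total:,} tokens, budget is {token_budget:,}, "
            "and no retriever was supplied"
        )
    # Note: this sketch does not re-check the retriever's selection against the budget.
    return {path: repo_files[path] for path in retriever(task, repo_files)}


if __name__ == "__main__":
    repo = {"billing.py": "x" * 4_000, "auth.py": "y" * 4_000}
    by_name = lambda task, files: [p for p in files if p.split(".")[0] in task]
    print(list(assemble_context("fix auth bug", repo, token_budget=100_000)))
    print(list(assemble_context("fix auth bug", repo, 500, retriever=by_name)))
```

The same agent code then runs against a 200K-context model with a retriever and against a 12M-context model without one; only `token_budget` and the `retriever` argument change.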

Your eval matrix needs another axis. SubQ joins Composer 2.5, the SWE-1.5 family, and the frontier-lab models as a real option for a category of workload. The eval discipline that grades cost-per-successful-task across this entire matrix — not just within the frontier-lab subset — is what surfaces which workload belongs where. The teams that built that discipline for the three-vendor world in March will extend it for the five-or-six-option world in June with minor effort; the teams that didn't will be making model selections by vibes.
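
Cost-per-successful-task is simple to compute once each eval run records what it cost and whether it passed. Here is a minimal sketch; the `TaskRun` record and the model labels in the example are hypothetical, and a real harness would also track latency and group results by workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    model: str
    cost_usd: float
    succeeded: bool


def cost_per_successful_task(runs: list[TaskRun]) -> dict[str, float]:
    """Total spend divided by number of successes, per model.

    Failed runs still count toward spend. A model with zero successes
    gets float('inf') so it sorts last.
    """
    spend: dict[str, float] = {}
    wins: dict[str, int] = {}
    for run in runs:
        spend[run.model] = spend.get(run.model, 0.0) + run.cost_usd
        wins[run.model] = wins.get(run.model, 0) + int(run.succeeded)
    return {
        model: (spend[model] / wins[model] if wins[model] else float("inf"))
        for model in spend
    }


if __name__ == "__main__":
    runs = [
        TaskRun("model_a", 1.00, True),
        TaskRun("model_a", 1.00, False),
        TaskRun("model_b", 0.25, True),
        TaskRun("model_b", 0.25, True),
    ]
    for model, cost in sorted(cost_per_successful_task(runs).items(), key=lambda kv: kv[1]):
        print(f"{model}: ${cost:.2f} per successful task")
```

Charging failed runs to the total is the point of the metric: a cheap model that fails half the time can cost more per success than an expensive one that rarely fails.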

Architectural lock-in is back as a category of risk

The larger structural point, the one most analyses are skipping: architectural lock-in just stopped being a solved problem.

For the last two years, the operating assumption in nearly every enterprise AI conversation was that the model architecture (a dense Transformer with quadratic attention, scaled up) was a permanent fact of the landscape. The variables in the planning model were vendor (Claude / GPT / Gemini), pricing, and the relative-capability ranking on this week's evals. Stack portability meant "can I swap the model from one vendor to another without rewriting my code?" and the answer, with MCP and templated prompts, was "yes, with discipline."

If subquadratic architectures pan out — and SubQ is the first credible production-scale data point in that direction, not the last — then the assumption underneath that portability conversation breaks. Code written against a 200K-context model has a different shape than code written against a 12M-context model. Retrieval architectures, prompt structures, eval design, tool-call patterns, error handling on truncation — all of these are sensitive to assumptions about how long the context can be and how much it costs to use. Portability across architectures is a harder problem than portability across vendors of the same architecture.

The defense is the same kind of discipline the multi-vendor world has needed for a year, raised one level: design your stack so the architectural assumption — how big can the context be, what does it cost, what does the model do well or poorly inside that context — is an explicit configuration rather than a hidden assumption baked into a hundred files. Teams that have built explicit context-size policies, retrieval-as-optional designs, and eval harnesses that grade the same task across architectures will absorb the next architectural shift as a configuration change. Teams that haven't will spend a quarter discovering which of their assumptions were architecture-shaped and which weren't.
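
In practice, "explicit configuration" can mean a single profile object that the rest of the stack reads instead of hard-coding context limits. The sketch below is one possible shape. The profile names, prices, and field choices are all placeholders invented for illustration, not published figures for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextProfile:
    """The architectural assumptions a stack is allowed to rely on."""
    name: str
    max_context_tokens: int
    usd_per_million_input_tokens: float  # placeholder pricing
    retrieval_by_default: bool
    known_weak_spots: tuple[str, ...] = ()


# Hypothetical profiles; swap in measured values from your own evals.
PROFILES = {
    "dense_200k": ContextProfile("dense_200k", 200_000, 3.00, True),
    "sparse_12m": ContextProfile(
        "sparse_12m", 12_000_000, 0.05, False,
        known_weak_spots=("quality beyond 128K not independently verified",),
    ),
}


def fits(profile: ContextProfile, prompt_tokens: int, reserve_for_output: int = 8_000) -> bool:
    """True if the prompt plus an output reserve fits the profile's window."""
    return prompt_tokens + reserve_for_output <= profile.max_context_tokens


if __name__ == "__main__":
    for profile in PROFILES.values():
        print(profile.name, "fits a 1M-token prompt:", fits(profile, 1_000_000))
```

When an architectural shift lands, the change is a new entry in `PROFILES` plus an eval run, rather than a search through a hundred files for the number 200,000.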

Where Sonnet Code fits

A subquadratic-attention frontier model with 12M context and a 300× cost-reduction claim is the easy half of the story. The hard half is the engineering above the model that turns "architectural option in the matrix" into "portable production capability you can adopt safely if and when the claims hold up." AI development at Sonnet Code is that engineering: building the architecture-aware portability layer that lets your stack adopt a SubQ-shaped model the same week independent reproductions confirm it works, designing the retrieval-as-optional patterns that don't break when the cheap-long-context option becomes real, and standing up the cross-architecture eval discipline that grades cost-per-successful-task across vendors and architecture families. AI training is the human-judgment half: senior engineers and domain experts who design the long-context evaluation rubrics that grade what the model actually does in the middle of a 4M-token document (not just at the ends), run the adversarial review on the failure modes that long-context architectures historically share, and stand up the senior-reviewer queue for the workloads where the cheaper option is most likely to silently underperform.

The next 90 days will resolve whether SubQ's claims hold. The decision your team needs to make this month is not "do I bet on SubQ?" but "is my stack ready to evaluate an architectural shift the week it lands?" The teams that built that readiness in the multi-vendor era will absorb the next shift cheaply. The teams that didn't are the ones for whom a confirmed verification result turns into a quarter of engineering work before they can take advantage of it.