Google I/O 2026 Shipped Gemini Spark on MCP and Gemini 3.5 Flash at Half the Frontier Price — The Agentic Tier Is No Longer Optional

The release, in one paragraph

At Google I/O 2026 on May 19, Google announced Gemini Spark, a cloud-based personal AI agent powered by Gemini 3.5 that runs continuously, manages tasks across connected apps, surfaces work that needs human attention, and finishes the rest in the background. Spark integrates Gmail, Docs, and Slides out of the box and adds Model Context Protocol connections to Canva, OpenTable, and Instacart at launch, with more partners on the way. The agent runs on Antigravity, the same harness Google says powers its internal tooling. Alongside Spark, Google shipped Gemini 3.5 Flash — a lighter-weight family member priced at roughly half to one-third the per-token cost of comparable frontier models — and bundled the agentic features into a Google AI Ultra subscription that dropped from $250 to $200 per month. Spark is in beta this week for trusted testers and rolls out next week to US Ultra subscribers.

The surprising line in the keynote isn't "Google has an agent." Every frontier lab has an agent. The surprising line is that Spark ships natively on MCP — the protocol Anthropic authored and OpenAI publicly aligned on earlier this year — and that Google is positioning Spark as the user-facing default of the Gemini app, not a power-user feature behind a toggle. The agentic tier just stopped being a thing customers opt into and started being the surface they get out of the box.

Why the MCP commitment is the architecture decision, not the model number

For two years, the agentic-AI conversation has been a question of harness lock-in. Each vendor shipped a harness — OpenAI's Agents SDK, Anthropic's Claude Agent SDK, Microsoft's AutoGen, Google's now-public Antigravity — with its own tool-call format, its own state management, its own ecosystem of integrations. The Spark announcement reframes that conversation because the integration layer is now an open protocol — MCP — that all three frontier labs ship clients and servers for, and that the largest productivity surfaces (Google Workspace, Microsoft 365, the long tail of B2B SaaS) are publicly committing to.

MCP is the integration standard, the way HTTP is the integration standard. Twelve months ago, the question "how does this agent reach into Salesforce / Slack / Notion / Figma?" had a different answer per vendor and a different SDK per integration. Today the answer is "the MCP server." That's a profound simplification. A team that writes an MCP server for its internal CRM exposes the same surface to Claude, ChatGPT, Gemini, and any future agent runtime that speaks the protocol. The integration cost stops being O(agents × integrations) and starts being O(integrations).

The agentic harness is becoming commoditized; the integration server is becoming differentiated. Antigravity is Google's harness; the Claude Agent SDK is Anthropic's; the Agents SDK is OpenAI's. They are converging on similar primitives — tool calls, state persistence, parallel subagents, eval hooks. The interesting engineering work is no longer the harness — it's the MCP servers that expose the customer's systems with the right granularity, the right permissions, the right audit trail, and the right error semantics. That's where production AI in 2027 will earn or lose its reliability budget.

The price compression on the agentic-tier model is real and structural. Gemini 3.5 Flash sits in the same band as Anthropic's Haiku and OpenAI's Mini tier — and is priced explicitly to be the default behind agentic workloads where the agent makes many small model calls per user interaction. The economics of an agent that issues ten or twenty tool calls per task only work when the per-call price is in the fractions-of-a-cent range. The frontier model still runs the planning turn; the efficiency tier runs the execution turns. Spark is built on that split, and so is everyone else's agent now.

What the Antigravity disclosure actually changes

The second-most-skimmed line in the announcement is the harness. Google is shipping Spark on Antigravity, an agent runtime they describe as the same harness behind their own internal tools, and exposing developer-facing surfaces of it through AI Studio and Vertex.

This matters more than the keynote suggested.

Three production-grade harnesses, one protocol. Antigravity joins the Claude Agent SDK and OpenAI's Agents SDK as a credible, vendor-supported runtime for building agentic systems. All three speak MCP. That is now a real "pick the harness that fits your stack" decision — and a real conversation about which harness is best suited to which workload class. Google's harness has the advantage of integrating cleanly with the Workspace and Cloud surfaces; Anthropic's is the most mature on long-horizon agentic coding; OpenAI's has the broadest model-ecosystem support. The right answer is per-workload, not per-vendor.

The harness is a deployment choice, not a coding choice. Because MCP is the integration interface, the same MCP server that powers a Claude-based agent powers a Gemini-based agent powers a GPT-based agent. The team that designs its agentic system as MCP servers plus a swappable runtime gets to pick the harness on the day they deploy, not the day they write the code. The team that builds against a specific harness's idioms (decorators, lifecycle hooks, native state) pays a migration tax every time the model market moves.

The harness market will consolidate, the protocol won't. Two years from now there will be fewer credible harnesses than today — Antigravity, the Claude Agent SDK, the Agents SDK, and probably one open-source survivor. The protocol that lets them all reach into customer systems will be MCP, the same way HTTP/2 is the protocol no matter which web framework you use. Build for the protocol; pick the harness on the day; refactor the harness in a week, not a quarter, when the market moves.

What it doesn't change

The agent's reliability still depends on the eval suite, not the harness. Spark, like every agentic product before it, only delivers as much value as the evaluation rubric behind it lets you measure. A team that ships an internal Spark-class agent without an eval harness scoring it nightly against a representative workload set is going to discover regressions the way Amazon did in March: in production, at scale, attributed to a deploy nobody flagged. The harness is necessary; the eval is what makes the harness safe.

The frontier model still owns the planning turn. Spark routes the high-stakes reasoning turn to Gemini 3.5 Pro (or higher) and the execution turns to Flash. That split is doing real economic work, and copying it for your own agents is a free win — but only if you have an eval that catches the cases where the cheap tier is not good enough for the execution turn it just got handed. The escape rate on the cheap tier is the metric that determines whether the split saves money or quietly costs you quality.

MCP is a protocol, not a product. Shipping an MCP server is the technical part; designing the right granularity of capability it exposes is the hard part. An MCP server that exposes the entire customer database with a single query tool is a security incident waiting to happen; an MCP server that exposes thirty narrow, well-named, well-permissioned tools — get_customer_by_id, update_customer_status, list_open_tickets_for_account — is an agentic surface that scales safely. The protocol doesn't write the design for you. A senior engineer does.

Where we'd push back on the launch narrative

"24/7 personal AI agent" is the right framing for the consumer-facing demo and the wrong framing for the production conversation. The Spark demo is genuinely impressive — an agent that processes inbox triage, drafts replies, books reservations, and surfaces only what needs human attention. The hard production translation is not "build the same agent for our enterprise." It's: which workloads in our org map to "surface for attention," which map to "complete in the background," and which should stay synchronous because the cost of an unsupervised wrong answer is too high? That triage is the actual agentic-engineering work. The demo skipped it.

"At half to one-third the price" is true on Google's pricing sheet and partial on the cross-vendor comparison. Gemini 3.5 Flash at its announced pricing is competitive with Haiku 4.5 and GPT-5.5 Mini — not dramatically cheaper, and on some workloads slightly more expensive on output tokens. Read the price as Google making sure they're at the floor, not Google undercutting the floor. The differentiated value of Flash is the first-party integration into Spark + Workspace + Vertex, not the per-token price. Procure accordingly.

"Ultra dropped from $250 to $200" is a margin signal worth reading carefully. The frontier productivity-AI subscription tier — $20-$30/mo on every vendor a year ago, $200-$250/mo six months ago, sliding back toward $200 this quarter — is repricing fast as supply catches up to demand. Teams budgeting agentic-AI seats for 2027 should plan for both directions: deeper discounts on consumer-facing tiers, harder negotiation on enterprise tiers as the vendors recoup margin from the workloads that can't easily switch.

What we'd build differently this week

Take MCP seriously as your integration substrate. If your roadmap has even a single agentic feature in the next six months, the integration layer should be MCP servers wrapping your internal systems, exposed with the right granularity and permissions. Build it once; let every vendor's harness consume it. The team that wires its agent into Notion/Salesforce/internal systems through a vendor-specific tool-call format pays the migration tax the next time the harness market moves.
Pick your harness per workload, not per vendor. Antigravity for Workspace-heavy automation; the Claude Agent SDK for long-horizon coding and engineering agents; OpenAI's Agents SDK for the workloads that benefit from the broadest tool ecosystem. The decision should be reviewable; document the rationale; revisit quarterly.
Stand up an eval harness for every agentic workload before it ships. Score against a rubric a senior practitioner signed off on. Run nightly. Treat regressions as deploy-blocking. The teams that skip this step ship Spark-class features and then quietly roll them back when the escape rate climbs above the threshold a customer is willing to tolerate.
Design MCP servers like you'd design an API for partners, not for prompts. Narrow tool surfaces, clear permissions, audited side effects, predictable error envelopes. The model writing the tool calls is making the same architectural mistakes a junior engineer would; the difference is it's making them at scale, at machine speed, every few seconds. The MCP server design is what bounds the blast radius.
Decide which agentic workloads are "surface for attention" and which are "complete in background." The first category is a UX problem — the agent's job is to triage and present, the human's job is to decide. The second is a reliability problem — the agent's job is to ship work, the system's job is to roll back when it ships the wrong thing. The two categories deserve different evals, different harnesses, and different deploy gates. Document the split.

Sonnet Code's take

The Spark + Antigravity + Gemini 3.5 Flash release is the moment the agentic AI tier stopped being a power-user feature and started being the default user surface for every major productivity stack. The right read isn't "Google caught up." It's that production AI engineering in 2026 is now a protocol-and-eval problem, not a which-model-do-we-call problem. Teams that build for MCP, pick their harness per workload, and instrument every agentic surface against a senior-authored rubric will ship reliable agentic features through this cycle and the next. Teams that bind their integration code to a vendor-specific tool-call format and skip the eval layer will spend 2027 explaining incidents to executives who saw the I/O keynote and assumed agents "just work."

We staff that work directly. AI development at Sonnet Code is the engineering that builds MCP servers with the right granularity, the agentic eval harnesses that catch regressions before deploy, the model-routing layer that splits the planning turn from the execution turns, and the observability plumbing that traces every tool call end-to-end across whichever harness shipped this quarter. We pair it with AI training engagements where senior practitioners — staff engineers, security reviewers, domain experts — author the rubrics and golden examples that grade what your agents actually do on your workloads, separate from the demo Google ran on stage. If your team is reading the Spark announcement this week and wondering whether your roadmap needs an agentic tier of its own, the next conversation isn't about which harness to pick. It's about which workloads are agent-shaped, who owns the MCP surface, and the senior practitioner whose rubric defines whether the agent is ready for production traffic.