One engine for every medium
At Google I/O on May 19, 2026, Google introduced Gemini Omni, and the architectural detail matters more than the demo reel. Omni is natively multimodal: it takes images, audio, video, and text into a single core engine and processes them simultaneously, rather than routing each medium through a separate model and stitching the results together. From that one engine it can generate high-quality video grounded in real-world knowledge and edit through conversation: change backgrounds, remove objects, alter camera angles, swap characters, with each edit building on the last while keeping the scene consistent. It also carries an improved intuitive grasp of physics such as gravity and fluid dynamics. Every clip it produces ships with a non-optional SynthID watermark, a signal embedded in the pixels at generation time that is invisible to the eye but readable by detection tools.
The headline is "AI makes video now." The actual shift is quieter and more consequential: multimodal generation has become a primitive you can build a product on, not a research artifact you demo and shelve.
Why "native" changes what you can build
The old way to put multimodal into a product was a pipeline: a vision model here, a speech model there, a generation model bolted on, glue code translating between them, and quality that degraded at every handoff. A native multimodal engine collapses that stack. One model that genuinely understands images, audio, video, and text together — and can generate across them with scene consistency — is a different building block than a pile of stitched single-purpose models.
For a product team, that means features that used to be research projects become integrations: conversational video editing inside your app, multimodal support flows that reason over a screenshot and a voice note at once, generated training or marketing assets that stay on-model across edits. The capability stopped being the obstacle. Which is exactly when the obstacle moves somewhere else.
The generation got easy. Judging it didn't.
Here is the trap with every generative leap: the impressive part — making the thing — gets cheap, and people assume the whole problem got cheap. It didn't. When generation is a single API call, the binding constraint becomes evaluation: is this output actually correct, on-brand, factually grounded, and safe to ship?
That's a human-judgment problem, and a hard one for multimodal specifically:
- "Good" is not automatable. Whether a generated video is on-brand, whether the physics looks right enough, whether the edit preserved the thing that mattered — these are judgments a model can't reliably score itself on. You need people who know the domain rating the output against a real rubric.
- Multimodal failure modes are subtle. A plausible-looking clip with a wrong logo, a hallucinated product detail, or a physically impossible motion ships an error that's harder to catch than a bad sentence, because it looks finished.
- "It looks great" is how bad output ships. Without a held-out evaluation set and a defined standard, you're judging on vibes — and vibes scale badly when you're generating thousands of assets.
The capability democratized. The ability to tell good output from confident-looking garbage did not.
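To make "a real rubric" and "a held-out evaluation set" concrete, here is a minimal Python sketch of turning human ratings into a measured pass/fail. Everything in it is an assumption for illustration: the criterion names, the 1-to-5 scale, the thresholds, and the `Review` structure are hypothetical, and a real rubric would come from your own domain experts.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rubric: criterion -> minimum mean reviewer score on a 1-5 scale.
RUBRIC = {
    "on_brand": 4.0,
    "physics_plausible": 3.5,
    "edit_preserved_subject": 4.0,
}


@dataclass
class Review:
    asset_id: str           # which generated asset was rated
    reviewer: str           # who rated it
    scores: dict[str, int]  # criterion -> score from 1 to 5


def evaluate_asset(reviews: list[Review], rubric: dict[str, float] = RUBRIC) -> dict[str, bool]:
    """Return pass/fail per criterion, using the mean of the human scores."""
    if not reviews:
        raise ValueError("an asset needs at least one human review")
    return {
        criterion: mean(r.scores[criterion] for r in reviews) >= threshold
        for criterion, threshold in rubric.items()
    }


def pass_rate(reviews: list[Review], rubric: dict[str, float] = RUBRIC) -> float:
    """Fraction of assets in the held-out set that pass every criterion."""
    by_asset: dict[str, list[Review]] = {}
    for review in reviews:
        by_asset.setdefault(review.asset_id, []).append(review)
    if not by_asset:
        raise ValueError("the held-out set is empty")
    passed = sum(
        all(evaluate_asset(asset_reviews, rubric).values())
        for asset_reviews in by_asset.values()
    )
    return passed / len(by_asset)
```

The point of the design is that the score comes from people and the threshold is written down before generation starts, so "it looks great" gets replaced by a number you can track across model versions.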
Provenance is now a product requirement
The SynthID watermark is the other half of the story, and it's not a footnote. The moment your product can generate convincing video, proving what was machine-made stops being a nice-to-have and becomes a requirement — for trust, for compliance, for not becoming a vector for misinformation. Provenance, watermarking, and the policy around what your system will and won't generate are now part of the spec, the same way auth and audit logging are. Building multimodal generation into a product without a provenance and safety story is building a liability with a nice UI.
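Treating provenance "the same way auth and audit logging are" treated could look something like the release gate sketched below. This is an illustration under stated assumptions, not a real integration: `detect_watermark` is a hypothetical callable you would back with whatever detection tool you actually use (no specific SynthID API is assumed here), and the keyword blocklist is a deliberately crude stand-in for a real content policy.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class AuditRecord:
    asset_sha256: str         # fingerprint of the exact bytes that were checked
    watermark_detected: bool  # result of the (hypothetical) detector
    policy_ok: bool           # result of the content-policy check
    released: bool            # True only if every check passed
    checked_at: str           # UTC timestamp in ISO 8601 format


def release_gate(
    asset: bytes,
    prompt: str,
    detect_watermark: Callable[[bytes], bool],  # hypothetical detector, injected
    blocked_terms: set[str],                    # crude stand-in for a real policy
) -> AuditRecord:
    """Check provenance and policy, and return an audit record either way."""
    watermark_detected = detect_watermark(asset)
    lowered = prompt.lower()
    policy_ok = not any(term in lowered for term in blocked_terms)
    return AuditRecord(
        asset_sha256=hashlib.sha256(asset).hexdigest(),
        watermark_detected=watermark_detected,
        policy_ok=policy_ok,
        released=watermark_detected and policy_ok,
        checked_at=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```

The gate writes a record whether or not the asset is released, so the audit log can answer both "what did we ship" and "what did we refuse to ship".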
Where Sonnet Code fits
Native multimodal generation sits squarely on the seam between our two service lines. AI development is the engineering that turns Omni-class capability into a real feature: integrating the model into your product, wiring the conversational-editing and generation flows, and building the provenance and safety layer (watermark verification, content policy, audit) so what you ship is trustworthy by construction. AI training is the human-judgment half: the domain experts who define what "good" means for your output, build the evaluation rubrics and held-out sets that turn "it looks great" into a measured pass/fail, and red-team the generative system to find where it produces convincing-but-wrong results before your users do.
Gemini Omni made generating multimodal content the easy part. Knowing whether what you generated is correct, on-brand, and safe — and being able to prove where it came from — is the part that's now worth real engineering. That's the conversation to have before you ship a generative feature, not after.

