
Google Plans to Merge Gemini and Veo for Multimodal AI Vision

Image: Gemini (with icons for text, audio, and images) and Veo (with video elements) merging into a glowing neural orb, set against abstract visuals of real-world physics.

Image Source: ChatGPT-4o


Google is planning to integrate its Gemini and Veo AI models in a step toward building a highly capable, multimodal digital assistant, according to DeepMind CEO Demis Hassabis.

Speaking on the Possible podcast with Reid Hoffman, Hassabis said the eventual fusion of the two models will help Gemini better understand the physical world—bringing the company closer to its vision of a “universal digital assistant.”

“We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” Hassabis said. “And the reason we did that [is because] we have a vision for this idea of a universal digital assistant, an assistant that […] actually helps you in the real world.”

The Push Toward Omni Models

The broader AI industry is increasingly focusing on what Hassabis referred to as “omni” models—systems capable of handling and generating a range of modalities, including text, audio, images, and video.

Google’s Gemini already supports audio, images, and text. Meanwhile, Veo, the company’s advanced video-generation model, is expected to boost Gemini’s real-world comprehension once the two are combined. OpenAI’s ChatGPT now includes image generation, and Amazon is preparing its own “any-to-any” AI model for release later this year.

Training such models requires large, diverse datasets spanning video, audio, images, and text.

YouTube’s Role in Training Veo

Hassabis hinted that YouTube, Google’s massive video platform, is a key source of training data for Veo:

“Basically, by watching YouTube videos — a lot of YouTube videos — [Veo 2] can figure out, you know, the physics of the world,” he said.

Google previously told TechCrunch that its AI models “may be” trained on “some” YouTube content, depending on agreements with creators. The company reportedly broadened its terms of service last year to expand how content can be used in AI training.

What This Means

Google’s plan to merge Gemini with Veo signals a clear direction: AI that isn’t just conversational or visual, but grounded in an understanding of how the physical world works. By drawing on the vast and varied data of platforms like YouTube, Google is positioning itself to build an AI assistant that can synthesize, reason, and act across all forms of media.

It’s not just about smarter chat—it’s about AI that sees, hears, and understands life in motion.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.