DeepMind researchers believe video models could become as flexible and general-purpose for visual tasks as large language models are for text. The claim centers on Veo 3, which handles object segmentation, physical property understanding, and maze solving without specific training for those tasks.
The model’s zero-shot capabilities mark what some researchers call a “GPT-3 moment” for machine vision. Where earlier visual AI required separate models for different jobs, Veo 3 approaches multiple tasks through a single architecture. It generates video at 4K resolution with synchronized audio while maintaining what DeepMind describes as physics-based realism.
One Model, Multiple Tasks: The Inflection Point in Machine Vision
The technical achievement lies in how the model reasons about visual information. Video models function as zero-shot learners and reasoners, processing spatial relationships and temporal dynamics without task-specific fine-tuning. When presented with a maze, Veo 3 works out possible paths through it. When shown objects, it segments them from their backgrounds. These abilities are not trained in directly; they emerge from the model's general training on video data.
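To make the zero-shot framing concrete, the sketch below shows roughly what prompting a video model with a maze-solving task could look like through Google's google-genai Python SDK: the task is stated in the prompt and seeded with a maze image, with no maze-specific model or fine-tuning involved. The model identifier, config fields, and response layout are assumptions drawn from the SDK's general video-generation interface, not a confirmed Veo 3 recipe.

```python
"""Illustrative sketch of zero-shot visual prompting via a video generation API."""
import time

from google import genai
from google.genai import types

# NOTE: model name, config fields, and response shape below are assumptions
# based on the google-genai SDK's video-generation interface; details may
# differ from what Veo 3 actually exposes.
client = genai.Client()  # reads the API key from the environment

with open("maze.png", "rb") as f:
    maze_bytes = f.read()

# The "task" is expressed purely as a prompt plus a starting frame: no
# maze-solving head, just a description of the desired video.
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # hypothetical identifier
    prompt=(
        "A red dot moves from the maze entrance to the exit, "
        "following a valid path without crossing any walls."
    ),
    image=types.Image(image_bytes=maze_bytes, mime_type="image/png"),
    config=types.GenerateVideosConfig(aspect_ratio="16:9"),
)

# Video generation runs as a long-running operation; poll until it finishes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download the generated clip; the traced path can be read off the frames.
generated = operation.response.generated_videos[0]
client.files.download(file=generated.video)
generated.video.save("maze_solution.mp4")
```

In this framing, a segmentation or physics probe differs only in the prompt and the seed image, which is the sense in which a single model covers many visual tasks.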
DeepMind’s position places Veo 3 in direct competition with OpenAI’s Sora and Runway’s Gen-3 models. Veo 3 emphasizes physical accuracy and multimodal output, distinguishing itself through audio integration and high-resolution rendering. The three systems now define the commercial video generation market, with applications ranging from film production to scientific simulation.
Veo 3 and Sora differ in rendering approaches and prompt interpretation, though both achieve photorealistic output. Runway targets creative workflows with editing tools, while Google and OpenAI focus on generation quality and physical coherence.
The shift from specialized visual models to general-purpose video systems mirrors the trajectory of text AI. Three years ago, language models required separate training for translation, summarization, and question answering. Now a single model handles all three. DeepMind argues video models will follow the same path, with one system eventually managing visual tasks from medical imaging to robotics training.
Industry focus has shifted from technical capability to profitability, testing whether these models can generate revenue at scale. DeepMind has not disclosed Veo 3’s computational costs or commercial availability timeline. OpenAI faces similar questions with Sora, which remains in limited release nine months after its announcement.
The technical demonstration raises questions about how video models learn visual reasoning. Unlike text, where word sequences carry explicit logical structure, video encodes physics and spatial relationships implicitly. Veo 3 appears to extract these patterns from training data, though researchers have not detailed the mechanism. The model’s ability to solve mazes suggests it builds internal representations of space and causality.
