What if video felt alive?
Today, every video is a pre-recorded, fixed capture that plays out the same way for everyone. At Odyssey, we asked ourselves the question: what if video could feel alive? What if it could be generated and interacted with instantly? What if video shifted from something passive to something interactive?
With Odyssey-2, we’ve taken a major step forward on that journey. Odyssey-2 is a frontier interactive video model that dreams up AI video instantly and lets you interact with it as it plays. You experience it much like a language model: you type, and the video responds in the moment. It feels like magic.




Interactive video models are nascent, and so is Odyssey-2, but we hope that when you experience this model, you’ll see the same exciting potential we do. In seconds, and with only a few words typed, Odyssey-2 begins streaming minutes of imagined video to any screen or device. This feels like the beginning of a fundamental shift for AI, media, and much more.
Unlocking open-ended interactivity in video models
Foundational to Odyssey-2 is a video model with deep, general knowledge of the world. Like most video models, it begins life bidirectional, relying on both future and past frames when generating. This works well for clips with fixed endings, but it prevents the open-ended interactivity needed to make video feel alive. In a truly interactive stream, any action at any moment alters all possible futures.




To make a video model interactive, it must generate the future without knowing it in advance, responding only to what’s happened so far. In other words, it must be causal and autoregressive, generating each frame solely from the context of prior frames and your actions. With Odyssey-2, we’ve made this possible through a novel multi-stage training pipeline that transitions the model to causal behavior—yielding a real-time, action-aware video model that responds continuously to input. As the video plays, you shape it in real time using natural text prompts—much like talking to a language model. The result is a continuous stream of multi-minute video that listens, adapts, and reacts.
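To make the idea concrete, here is a minimal Python sketch of what a causal, action-conditioned generation loop could look like. It is illustrative only: predict_next_frame, the Frame type, and the context-window size are hypothetical stand-ins, not Odyssey’s actual interface.

```python
# Minimal sketch of causal, autoregressive frame generation conditioned on
# text actions. All names and sizes here are hypothetical stand-ins.
from collections import deque
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    summary: str  # stand-in for pixel data

def predict_next_frame(past_frames, action_text):
    """Toy stand-in for the model: condition on prior frames and the latest
    text action only, never on frames that haven't been generated yet."""
    last = past_frames[-1].index if past_frames else -1
    return Frame(index=last + 1,
                 summary=f"frame {last + 1} reacting to '{action_text}'")

CONTEXT_FRAMES = 64                      # assumed rolling context length
context = deque(maxlen=CONTEXT_FRAMES)   # only the past is ever visible
action = "a wave rolls toward the shore"

for step in range(6):
    if step == 3:
        action = "the camera pans toward a lighthouse"   # user types mid-stream
    frame = predict_next_frame(list(context), action)    # causal generation
    context.append(frame)                                # past grows by one frame
    print(frame.summary)
```

The key property is that each new frame depends only on the growing history of frames and actions, so a prompt typed mid-stream can redirect everything that follows.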
Real-time AI video shaped by open-ended text or audio is a research direction we’re excited to explore more deeply, while continuing to build on the navigation conditioning and long-term memory we focused on with Odyssey-1, a research preview we released earlier this year.
Odyssey-2 is really, really fast
Video only feels alive when it’s generated instantly. Most bidirectional video models today take 1-2 minutes to generate only 5 seconds of footage. Odyssey-2, on the other hand, begins streaming video instantly, producing a new frame of video every 50 milliseconds (meaning imagined video is streamed at 20 frames per second).
The latency reduction is game-changing, and it transforms what’s possible with AI video. Instead of waiting minutes for a short clip to be generated, you interact with a stream of video in the moment. Achieving this while retaining the generality and fidelity of the model required deep optimization at every level. We tuned our model architecture, data pipeline, and inference stack to balance speed, quality, and responsiveness, so Odyssey-2 can dream freely, without delay.
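As a back-of-the-envelope illustration of that budget, the sketch below paces a toy generation loop at 50 milliseconds per frame, which works out to 20 frames per second. generate_frame is a hypothetical stand-in, not Odyssey’s inference stack.

```python
# Toy pacing loop for a 50 ms per-frame budget (20 frames per second).
import time

FRAME_BUDGET_S = 0.050                   # 50 ms per frame
print(f"target rate: {1 / FRAME_BUDGET_S:.0f} frames per second")   # -> 20

def generate_frame(step):
    """Hypothetical stand-in for model inference; pretend it takes ~10 ms."""
    time.sleep(0.010)
    return f"frame {step}"

start = time.monotonic()
for step in range(20):                   # roughly one second of streaming
    frame = generate_frame(step)
    # ... hand the frame to the streaming client here ...
    deadline = start + (step + 1) * FRAME_BUDGET_S
    time.sleep(max(0.0, deadline - time.monotonic()))   # hold a steady cadence
```

The point of the sketch is simply that the whole pipeline, from inference to delivery, has to fit inside each 50 millisecond window for the stream to feel instant.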




Interactive video models are burgeoning world simulators
Interactive video models predict the next frame using only the past, and then roll forward. Long rollouts punish bad guesses, so these models must internalize how motion, lighting, and contact evolve over time. In doing so, they become implicit world simulators: systems that learn to model and generate the world in real time.
Waves on the ocean are a great example. From prior video frames of a wave, you can infer surface slope, curvature, and a velocity field. With that, you can predict what comes next: the crest advances, troughs fill, foam drifts, highlights slide, and the wave bends around a rock. Although we’re early on this journey, that’s exactly what Odyssey-2 does, learned entirely from decades of video data. The same concept applies to many other kinds of dynamics and behavior.




Language models showed how far a simple next-step objective can go: predicting the next word unlocked reasoning and creativity. Interactive video models take that idea further. By learning to predict the next frame of video, they begin to approximate the rules that govern our world. As these models learn to act and react with ever-higher realism, they will transition from enabling emergent media to serving as general-purpose world simulators.
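The parallel with next-token prediction can be made concrete. The sketch below scores a toy predictor with a next-frame objective: at each step it sees only the frames so far and is penalized on how far its guess falls from the true next frame. The predictor and the mean-squared-error loss are illustrative assumptions, not Odyssey-2’s training setup.

```python
# Illustrative next-frame objective, analogous to next-token prediction in
# language models. The predictor and loss are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
video = rng.random((16, 8, 8, 3))        # (time, height, width, channels) toy clip

def toy_predictor(past_frames):
    """Stand-in model: guess that the next frame simply repeats the last one."""
    return past_frames[-1]

losses = []
for t in range(1, len(video)):
    prediction = toy_predictor(video[:t])                        # causal: frames 0..t-1 only
    target = video[t]                                            # ground-truth next frame
    losses.append(float(np.mean((prediction - target) ** 2)))    # per-frame MSE

print(f"mean next-frame loss over the clip: {np.mean(losses):.4f}")
```

A model that drives this loss down over long rollouts has little choice but to pick up the regularities of motion, lighting, and contact that the text above describes.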
From fixed media to emergent media
Video is fast shifting from passive to interactive. As that shift takes hold, new consumer and enterprise applications (our API is coming soon!) will emerge across gaming, film, education, social media, advertising, training, companionship, simulation, and much more.




Imagine stepping into an old photo and exploring that memory once more. Taking a guided tour of an ancient civilization as if you were there. Learning piano with instant feedback. Asking Einstein to teach you physics—asking him questions along the way. Learning a new language by speaking with people on the streets of an imagined Paris or Tokyo. Entering a sports game at a pivotal moment and changing the outcome. These applications aren't solved today, but Odyssey-2 is an early, humble step towards this exciting future.

The team that brought this to life
Odyssey-2 was made possible by the incredible Odyssey team: Andy Kolkhorst, Aravind Kaimal, Ben Graham, Boyu Liu, Derek Sarshad, Fabian Güra, James Grieve, Jeff Hawke, Jenny Seidenschwarz, Jessica Inman, Jon Sadeghi, Kristy McDonough, Nima Rezaeian, Oliver Cameron, Philip Petrakian, Richard Shen, Robin Tweedie, Sarah King, Sirish Srinivasan, Vinh-Dieu Lam, and Zygmunt Łenyk. Join us!
