
Introducing Starchild-1: The First Multimodal World Model
Learning to simulate both the visuals and sounds of the world, all in real time

Oliver Cameron
May 16th, 2026
World models are a new form of generative intelligence. Unlike language models, which learn from text, world models learn directly from the world itself through the pixels, motion, and actions encoded in large-scale video. In the process, these models become capable of understanding and simulating an approximation of the world in real time.
Today, we’re excited to share a research preview of Starchild-1, the world’s first multimodal world model. In Starchild-1, we’ve trained a general world model that autoregressively generates synchronized audio and video in real time, while continuously responding to streaming user input. Starchild-1 goes beyond traditional world models, which have been limited to learning from and generating visuals alone, with no sound.



Beyond Visual Simulation
The world is not silent. It's full of conversation, emotion, crashing waves, and chirping birds. Sound is a rich signal about how the world works, and humans use it constantly to understand and explore the world around them. Machines should too.
We believe that to revolutionize robotics, education, gaming, healthcare, defense, and many other industries, world models must learn the full, multimodal richness of the world. Starchild-1, and its successors, will enable a new class of interactive multimodal systems that are dramatically more natural, expressive, and intelligent. We see this as an early step toward general world intelligence.






Towards General World Intelligence
Traditional audio-video models like DeepMind’s Veo generate video clips of a fixed length, offline. While these models have made major progress in visual fidelity and audio synchronization, they do not continuously evolve in response to user interaction: once generation begins, the future trajectory of the output is fixed. In contrast, Starchild-1 is a causal multimodal world model: it autoregressively predicts the next audio and video state of a world, conditioned on past observations and streaming user input, enabling real-time interaction.
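To make the causal recipe concrete, here’s a toy rollout loop in Python. This is an illustrative sketch, not Starchild-1’s actual code: the ToyWorldModel class, its step method, and the tensor shapes are all assumptions. The only point it demonstrates is that each new audio-video state is predicted from the generated past plus whatever user input has just streamed in, so the trajectory is never fixed in advance.

```python
import numpy as np

class ToyWorldModel:
    """Stand-in for a causal multimodal world model (illustrative only)."""

    def step(self, history, user_input):
        # A real model would attend over `history` via its KV cache and decode
        # latents; here we emit random placeholders of plausible shapes.
        video_frame = np.random.rand(64, 64, 3)   # one low-res RGB frame
        audio_chunk = np.random.rand(1600)        # e.g. 100 ms of 16 kHz audio
        return video_frame, audio_chunk

model = ToyWorldModel()
history = []                                       # past audio-video states
streamed = [None, "make it rain", None, None]      # user input arriving per step

for prompt in streamed:
    frame, audio = model.step(history, prompt)     # next state, conditioned on the past
    history.append((frame, audio))                 # the prediction becomes new context
    # In a real system, frame and audio would be rendered and played immediately.
```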
Achieving this required solving a number of new systems and modeling challenges unique to multimodal world simulation. Audio and video evolve at fundamentally different temporal frequencies and information densities, and during long-horizon rollout, errors in one modality can rapidly destabilize the other.
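As a rough illustration of that mismatch, consider the token budgets of the two streams under assumed (not reported) rates: a tokenized video stream and a neural-audio-codec stream step at very different frequencies and carry very different amounts of information per second.

```python
# Back-of-the-envelope comparison with made-up numbers; none of these rates
# are reported for Starchild-1.
video_fps         = 24      # assumed video frame rate
tokens_per_frame  = 256     # assumed spatial tokens per video frame
audio_latent_hz   = 75      # assumed audio codec latent frame rate
tokens_per_latent = 4       # assumed codebooks per audio latent frame

video_tokens_per_s = video_fps * tokens_per_frame         # 6144 tokens/s
audio_tokens_per_s = audio_latent_hz * tokens_per_latent  # 300 tokens/s

print(f"video: {video_tokens_per_s} tokens/s at {video_fps} steps/s")
print(f"audio: {audio_tokens_per_s} tokens/s at {audio_latent_hz} steps/s")
```

Under these made-up numbers, video dominates the per-second token budget while audio ticks at a much finer temporal grain, which is exactly the kind of asymmetry the training and inference stack described next has to absorb.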
To address these challenges, we developed a new multimodal causal training and inference stack for real-time world simulation. This includes a causal distillation pipeline that adapts a bidirectional audio-video foundation model into a real-time autoregressive world model, while preserving synchronized multimodal generation. We additionally introduced an asynchronous KV-cache architecture and rollout adaptation strategy designed for the very different temporal characteristics of audio and video during long-horizon rollout.
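The technical report covers the cache design in detail; as a loose sketch of the general idea, the snippet below keeps a separate rolling KV window per modality so that each stream can append entries at its own rate without forcing the other to re-cache. The class name, the window sizes, and the 3:1 step ratio are all invented for illustration.

```python
from collections import deque

class AsyncKVCache:
    """Illustrative per-modality KV cache; not Odyssey's actual architecture."""

    def __init__(self, max_video_steps=48, max_audio_steps=300):
        # Separate rolling windows because the two streams tick at different rates.
        self.video = deque(maxlen=max_video_steps)
        self.audio = deque(maxlen=max_audio_steps)

    def append_video(self, kv):
        self.video.append(kv)

    def append_audio(self, kv):
        self.audio.append(kv)

    def context(self):
        # At each prediction step the model attends over both windows jointly.
        return list(self.video), list(self.audio)

cache = AsyncKVCache()
for step in range(100):
    cache.append_audio({"k": step, "v": step})   # audio ticks every step
    if step % 3 == 0:                            # video ticks less often (assumed ratio)
        cache.append_video({"k": step, "v": step})

video_ctx, audio_ctx = cache.context()
print(len(video_ctx), len(audio_ctx))            # 34 video entries, 100 audio entries
```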
Together, these innovations allow Starchild-1 to maintain synchronized audio-video generation over extended interactions while continuously responding to streaming text, speech, and action inputs in real time.
Interactive Multimodal Generation
A core capability of Starchild-1 is interactive multimodal generation. During autoregressive rollout, users can continuously stream new prompts and inputs into the model, dynamically altering both the visuals and sounds being generated in real time. This allows environments, conversations, ambient sound, and world dynamics to evolve interactively rather than following a fixed trajectory. As described in our technical report, enabling this required new approaches to synchronized audio-video rollout and long-horizon multimodal stability.
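One way to picture this, purely as an illustrative sketch rather than the shipped interface, is a prompt queue that the rollout loop drains before every step: inputs can arrive at arbitrary times, and whatever has landed since the last step reshapes the next predicted state.

```python
import queue

prompt_queue = queue.Queue()
world_state = "quiet forest, birdsong"

def next_state(state, prompts):
    # Placeholder for the model call; a real step would return the next
    # synchronized video frame and audio chunk.
    return f"{state} | {' + '.join(prompts)}" if prompts else state

for step in range(6):
    if step == 3:
        prompt_queue.put("a storm rolls in")   # user input landing mid-rollout
    pending = []
    while not prompt_queue.empty():            # drain without blocking generation
        pending.append(prompt_queue.get())
    world_state = next_state(world_state, pending)
    print(step, world_state)
```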
More Natural, Capable Forms of Computational Intelligence
“Nothing is in the intellect that was not first in the senses” is a principle associated with Thomas Aquinas and the tradition of empiricism: the idea that knowledge emerges through observation and interaction with the world. This principle ultimately gave rise to the scientific method, where hypotheses are validated through experimentation and grounded in evidence from the natural world. For centuries, this process has driven human scientific and technological progress.
It remains an open question where the next major step-change in computational intelligence will come from. One view is that increasingly capable AI systems will recursively improve themselves and contribute to their own research and development. We agree and are excited about where this could lead. However, we also believe greater intelligence will come from learning directly from the world itself. This belief motivates our research on world models.



Starchild-1 is an early step beyond world models that learn only from visual observation, toward systems that learn from richer multimodal interaction with the world. We believe multimodal world models will ultimately enable more natural and capable forms of computational intelligence grounded in how the real world actually evolves and behaves.
We’re excited to hear what you think. In an accompanying technical report, we share the architecture, training pipeline, and systems innovations behind Starchild-1, including our work on causal multimodal rollout, synchronized audio-video generation, and long-horizon real-time interaction.
The Team That Brought This to Life
Starchild-1 was made possible by the amazing Odyssey team.
Core Contributors
Ahmad Nazeri, Amogh Adishesha, Jenny Seidenschwarz, Richard Shen, Sarah King, Tobiah Rex, Vighnesh Birodkar.
Full Team
Ahmet Hamdi Guzel, Andy Kolkhorst, Aravind Kaimal, Ben Graham, Derek Sarshad, Fabian Güra, Finley Code, James Grieve, Jesse Allardice, Jessica Inman, Jonathan Sadeghi, Kaiwen Guo, Kristy McDonough, Nicolas Griffiths, Nima Rezaeian, Robin Tweedie, Sirish Srinivasan, Vinh-Dieu Lam, Zygmunt Łenyk.
Leadership
Jeff Hawke, Oliver Cameron.






