Why We Must Build World Models

From naive software to systems capable of interacting intelligently with us and our world, unlocking intelligence and scientific breakthroughs

Oliver Cameron

February 17th, 2026

Humans have been trying to model the world since our beginning. Why we’re here and how everything around us works are questions we will never stop asking.

Early humans carved notches into bone to track the lunar cycle. Polynesian navigators crossed the open ocean using models of swell and stars. Plato argued the physical world was a shadow of deeper structure. Da Vinci sketched anatomy and fluid dynamics because, to him, understanding how the world works was a prerequisite to doing anything useful. Einstein showed the rules of the universe could be compressed into elegant mathematical form. Penrose demonstrated that causal structure is the deep architecture of physical reality, and Pearl argued true understanding requires causality, not just correlation.

A 34,000-year-old lunar calendar

Over thousands of years, we’ve worked to model the world better, and to make it a better place. We’ve made much progress, but we can go much further. This is why I work on world models.

Are Language Models Enough to Understand the World?

Many argue that language models learn a powerful model of the world. And it’s undeniable that training on all of language has produced an emergent understanding of grammar, logic, reasoning, and common sense. These models are incredible.

But that model, as powerful as it is, remains an incomplete world model, and it is only as good as the data it primarily learns from. Language is a thin, biased slice of reality. It contains only what people chose to write down, and vast amounts of human experience never make it to the page. The way someone’s body language signals hesitation, the way you position your body to stay afloat while swimming, the weight of a heavy object in your hand, the sound footsteps make on different surfaces: none of this is captured in text with the depth needed to understand how the world truly works or feels. Not to mention the endless volume of things that just aren’t exciting enough to write down.

Could you learn to swim purely by reading a book describing it?

So if not language, what data source is large and rich enough to teach a comprehensive model of the world?

As humans, we begin life by exploring the world. We stare at our parents’ faces looking for cues, listen for our name being said, throw food to see what happens, hit our heads on tables and cry, with each experience refining our internal model of the world. All of this can be described in words, but to give machines a true sense of human experience and our world, we have to show them the richest, largest-volume representation of the world: video.

Since the smartphone era began, a new tool has become widely available: the ability to record the world in video at massive scale. Each frame of video is light captured by a lens, and together with its audio it contains rich observations of physics, dynamics, sound, and human behavior in action. In aggregate, this creates trillions of observations of our world, with a depth and diversity no other data source approaches. What feels ordinary to us (how things move, how lighting shifts, how actions cause reactions, how people interact) is incredibly rich structure for a model to learn from.

Narrow World Models That Learn From Video Exist

If video is the richest data source for learning the world, we should expect next-state world models to emerge first in constrained domains. This has already happened. In a single domain, the challenge of learning superhuman next-state prediction from video was largely solved by self-driving car labs years ago.

Predicting the next state of the world was key to solving self-driving cars

Given a representation of the road from sensors, companies like Waymo trained models that could predict what the world would look like moments into the future. Learned from vast volumes of driving video and sensor data, these models predict with incredible precision where pedestrians and drivers will be at future timesteps. These models also learned the effect of the AI driver on the world itself: by staying still or moving, you affect the set of possible futures.

This is what we’d consider a narrow world model, and we see no reason to think this approach—predicting the next state of a world from video observations—is unique to driving. General world models, learned from far broader video observations, appear within reach.

From Video Models to True World Simulators

Given that video appears to be so key to learning the world, does this mean video models are general world models? In my opinion, not quite. Bidirectional diffusion models learn a powerful model of world dynamics over short clips. But they are not trained to roll the world forward causally under intervention. In practice, this leads to exposure bias and instability when users take open-ended actions, since the model was never optimized for step-by-step, action-conditioned prediction.

A true world model is causal and action-conditioned. It predicts the next state of the world step-by-step, based on the current state of the world and the actions taken. This open-endedness is essential, because outcomes in our world are not pre-determined.
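To make that contract concrete, here is a minimal sketch in Python (the names WorldModel, step, and rollout are illustrative, not any particular system’s API): the model exposes a single step function from the current state and an action to the next state, and rolling the world forward is just applying that step repeatedly.

```python
from typing import Protocol, TypeVar

State = TypeVar("State")    # e.g. the latest video frame(s), audio, proprioception
Action = TypeVar("Action")  # e.g. controller input, language, robot commands


class WorldModel(Protocol[State, Action]):
    """Illustrative interface: a causal, action-conditioned next-state predictor."""

    def step(self, state: State, action: Action) -> State:
        """Sample the next world state given the current state and the action taken."""
        ...


def rollout(model: WorldModel, state: State, actions: list[Action]) -> list[State]:
    """Roll the world forward one step at a time. Because outcomes are not
    pre-determined, repeated rollouts can sample different plausible futures."""
    trajectory = [state]
    for action in actions:
        state = model.step(state, action)
        trajectory.append(state)
    return trajectory
```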

A multimodal world model, simulating in real-time

The most promising architecture for world models today is the autoregressive diffusion transformer (AR DiT). Unlike video models that generate an entire clip at once, an AR DiT generates frames sequentially, conditioning each new frame on all previous frames and action inputs. Diffusion has proven it can represent the incredible fidelity and realism of the world, while autoregression models how the world evolves over time. To make this world model general, training must expand from narrow domains, like driving observations, to diverse, large-scale general video paired with actions across many environments and interactions, under a single next-state prediction objective. The same next-state objective should extend to multimodal state (video, audio, actions, and language), though this is a nascent research frontier that will mature over time.
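As a rough illustration of that autoregressive structure, here is a toy sketch in PyTorch. Everything in it (module names, dimensions, the few-step denoising loop standing in for a real diffusion sampler) is assumed for illustration and is not Odyssey’s architecture; the point is only the loop: each new frame latent starts from noise and is refined while attending to all previously generated frames and the current action.

```python
import torch
import torch.nn as nn


class FrameDenoiser(nn.Module):
    """Toy stand-in for a diffusion transformer: refines a noisy latent for the next
    frame, conditioned on the history of past frame latents and the current action."""

    def __init__(self, latent_dim: int = 64, action_dim: int = 8, n_heads: int = 4):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, latent_dim)
        self.attend_history = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latent, history, action):
        # noisy_latent: (B, 1, D), history: (B, T, D), action: (B, A)
        query = noisy_latent + self.action_proj(action).unsqueeze(1)
        context, _ = self.attend_history(query, history, history)
        return self.out(context)  # refined latent for the next frame


@torch.no_grad()
def generate(model: FrameDenoiser, first_frame, actions, n_denoise_steps: int = 4):
    """Autoregressive rollout: frames are produced one at a time, each conditioned
    on all previously generated frames and on the action taken at that step."""
    history = first_frame.unsqueeze(1)                          # (B, 1, D)
    for action in actions:                                      # one action per frame
        latent = torch.randn_like(first_frame).unsqueeze(1)     # start from pure noise
        for _ in range(n_denoise_steps):                        # crude stand-in for diffusion sampling
            latent = model(latent, history, action)
        history = torch.cat([history, latent], dim=1)           # the new frame joins the context
    return history                                              # (B, 1 + len(actions), D)


# Usage: a two-frame rollout from a random starting latent and random actions.
model = FrameDenoiser()
frame0 = torch.randn(1, 64)
actions = [torch.randn(1, 8), torch.randn(1, 8)]
video_latents = generate(model, frame0, actions)
print(video_latents.shape)  # torch.Size([1, 3, 64])
```

A production system would operate on learned video latents, use a proper diffusion or flow sampler, and cache attention over a long history; the sketch only shows the causal, action-conditioned loop itself.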

Today, we are beginning to see early versions of this generality emerge. Current systems can sustain stable, interactive simulations on the order of a minute, responding to simple control inputs like WASD or open-ended language. This is meaningful progress, but we are still far from full generality, rich interactivity, and long-horizon stability. We need multimodal interactivity, and simulations that remain stable over hours and days rather than minutes. With the level of talent, compute, and capital now flowing into world models, we expect these gaps to close quickly.

Systems That Understand and Interact With the World

Today, software is largely naive to the real world. It has little sense of how the world works, or how to interact with it and with humans in grounded ways. A general world model, in contrast, will learn this capability from trillions of observations of the world encoded in video, and eventually all multimodal data. Suddenly we transition from naive software to systems capable of interacting intelligently with us and our world. This is a new capability for humanity.

J.A.R.V.I.S.

For robotics, this unlocks intelligent robots capable of operating in general environments and serving billions of customers. World models learn how humans manipulate the world and enable robots to rehearse complex tasks before acting, simulating the reaching, navigation, and manipulation learned from video. Instead of brittle automation in controlled settings, robots can adapt to all the variability of the real world, like humans can.

Going further, a world model’s prediction loop can run in real-time, producing continuous, interactive simulations that enable entirely new experiences. Imagine new kinds of devices that naturally interact with you and the world. Being able to step into an old photo and explore that memory again. Taking a guided tour of an ancient civilization as if you were there. Practicing a difficult conversation or public speech with real, grounded feedback. Learning piano with instant, expert help. Asking Einstein to teach you physics and questioning him along the way. Soldiers and disaster-response teams training in ultra-realistic, evolving environments. Learning a new language by speaking with people on the streets of an imagined Paris or Tokyo. Over time, most pixels on all screens will be simulations generated by world models, with simulations shape-shifting into whatever form is most natural for the task and place.

What’s also becoming clear is that these models will serve as scientific instruments, deepening our understanding of the world and the physics of our universe. By learning from trillions of observations across domains, they may internalize the dynamics of physics, human behavior, and complex systems at a scale no human or single laboratory ever could.

The possible futures of world models are beyond exciting, and if this direction resonates, we’re building it at Odyssey. We’re an AI lab focused on general-purpose world models: causal, multimodal systems that learn to predict and interact with the world over long horizons. If you’re a researcher interested in pushing beyond narrow models toward learned world simulators, we’d love to talk!

