On the Origin of Species (of World Models)

World models are entering a renaissance, but the term is being overloaded, frequently and badly. Rather than calling everything a world model, we need a better taxonomy.

Jeff Hawke

February 27th, 2026

The past year has seen enormous interest in world models from the research community and from industry. This is visible in machine learning conferences, new startups, and even in Google search trends. These trends show signs of acceleration, not slowing: new papers, demos, and systems are appearing every week. That momentum is meaningful. It means a fast-growing number of very smart people, who could spend their time on almost anything, believe world models are where their time is best spent.

So, what is being modelled?

With that question as a lens, I'd like to propose a taxonomy of four categories of AI model currently being described as a ‘world model’:

  1. World models: “learn how the world evolves”

  2. Spatial intelligence: “learn how the world appears”

  3. Behaviour models: “learn how to act within a world”

  4. Proxy world models: “learn an abstraction of the world”

[Figure: The architecture of a world model]

World Models: “Learn how the world evolves”

The canonical definition of a world model is one trained to predict “how the world evolves.” The term was coined by Ha and Schmidhuber (2018) in a reinforcement learning context.

The model predicts potential future states of the world given some change to be effected in that world representation. Importantly, this means it is a dynamics model: it predicts a change given an action. This definition is common in reinforcement learning and robotics, typically paired with a belief in learning ‘end-to-end’. Odyssey and DeepMind’s Genie are good public examples of this category of model. The JEPA proposal from Yann LeCun is a specific interpretation of this definition of a world model.

This is a very general type of model, which can be thought of as learning structure, dynamics, and behaviour end-to-end from data.
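
To make the definition concrete, the sketch below shows the shape of an action-conditioned dynamics model: predict the next state given the current state and an action, and roll that prediction forward to imagine trajectories. The class and a fixed linear dynamics function are purely illustrative, not any particular system's architecture; a real world model would learn its dynamics from observation and action sequences.

```python
import numpy as np

class WorldModel:
    """Toy action-conditioned dynamics model: s_{t+1} = f(s_t, a_t).

    Here f is a fixed random linear map for illustration only; a learned
    world model would fit these parameters from sequences of
    (observation, action) pairs.
    """

    def __init__(self, state_dim: int, action_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.B = rng.normal(scale=0.1, size=(state_dim, action_dim))

    def step(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Predict how the world evolves given an intervention (the action).
        return state + self.A @ state + self.B @ action

    def rollout(self, state: np.ndarray, actions: list) -> list:
        # Imagine a trajectory of future states without touching the real world.
        states = [state]
        for action in actions:
            states.append(self.step(states[-1], action))
        return states
```

The key property is the signature of `step`: the action is an input, which is what distinguishes this category from the appearance-focused and policy-focused categories below.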

World models offer a route to general multimodal AI. They offer an accelerant to general-purpose robotics, outlined well in Jim Fan's blog post, ‘The Second Pre-training Paradigm’. They offer new categories of experiences that go beyond game engines. They also enable an enormous long-tail of new applications that feel like sci-fi.

Spatial Intelligence: “Learn how the world appears”

The second model definition is better thought of as “spatial intelligence” than a world model. This definition has come primarily from the 3D computer vision community, rather than reinforcement learning or robotics.

These models focus on “learning how the world appears.” The argument is that explicit representations are necessary for spatial understanding, though there are widely held alternative perspectives; Vincent Sitzmann's post outlines our perspective well. World Labs is a good example of this category, and 3D Gaussian splat reconstruction is often referred to in this manner. Importantly, this category is neither a dynamics model nor action-conditioned.

These models offer a route to improving productivity with 3D toolchains, for example by simplifying game development built on existing game engines. Importantly, the route from these models to more general AI is not clear, as the explicit representations they use make end-to-end learning difficult.

Behaviour Models: “Learn how to act within a world”

The third category is agent behaviour, where a model learns what action to take in a given state. This is also referred to as ‘policy learning’.
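
In code, the contrast with a world model is simply the function signature: a policy maps state to action and predicts nothing about how the world will respond. A minimal sketch, with illustrative names and a fixed random linear map standing in for learned weights:

```python
import numpy as np

class Policy:
    """Toy behaviour model: a = pi(s).

    Maps states directly to actions, with no model of the world's
    dynamics. The weights here are random for illustration; a learned
    policy would fit them from demonstrations or reward.
    """

    def __init__(self, state_dim: int, action_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(action_dim, state_dim))

    def act(self, state: np.ndarray) -> np.ndarray:
        # tanh keeps actions bounded, as is common for continuous control.
        return np.tanh(self.W @ state)
```

Note that nothing in `act` predicts a next state: the world's dynamics are left implicit in whatever data the policy was trained on.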

There are a range of industry research efforts in this category. Some, like General Intuition, are focused on agent behaviour in a game environment with the belief that this will generalise into broader environments. Others, like Simile, are focused on human behaviour modelling with agentic language models for enterprise use. It’s arguable that World Models do model behaviour implicitly, in that they must model the behaviour of agents in order to predict state dynamics.

This category of models offers behaviour modelling in limited domains. This can be incredibly valuable, even if an AI model cannot extend beyond the target training domain. Self-driving models such as Wayve's driver or robot foundation models such as Physical Intelligence fit this category.

Proxy World Models: “Learn an abstraction of the world”

Finally, we sometimes hear the statement “LLMs have a world model inside.” This is motivated by research such as Emergent Representations of Program Semantics in Language Models Trained on Programs (August 2024). There is evidence for language models forming internal representations of simple world state.

However, this notion of world state is simplistic compared to what is required for a ‘neural simulation model’. Models trained this way might be able to reason about how fluid moves, but would not be able to demonstrate a pattern of vortex shedding. At best the models learn a proxy for underlying phenomena through language.
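
The evidence for this category typically comes from probing: fitting a simple classifier on a model's hidden states to test whether world state can be linearly read out. The sketch below illustrates the generic idea on synthetic data; it is not the cited paper's setup, and the "hidden states" here are simulated rather than taken from a real language model.

```python
import numpy as np

# Synthetic stand-in for LM hidden states: each vector is a noisy
# linear encoding of an underlying discrete world-state label.
rng = np.random.default_rng(0)
n, hidden_dim, n_states = 500, 32, 3
labels = rng.integers(0, n_states, size=n)
codebook = rng.normal(size=(n_states, hidden_dim))
hidden = codebook[labels] + 0.1 * rng.normal(size=(n, hidden_dim))

# Linear probe: least-squares fit from hidden states to one-hot labels.
onehot = np.eye(n_states)[labels]
W, *_ = np.linalg.lstsq(hidden, onehot, rcond=None)

# High readout accuracy suggests the state is linearly decodable
# from the representation -- the usual probing argument.
accuracy = ((hidden @ W).argmax(axis=1) == labels).mean()
```

A decodable proxy of world state is a much weaker property than the action-conditioned dynamics prediction that defines the first category.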

This category of models is simply Language Models as we know them today, with all the amazing things they can do.

We are excited about the shift of world models from single-domain to general-purpose utility

Autonomous driving was solved on the basis of 1) predicting the world, and 2) learning the right behavioural response to the situation. Solving this was a huge frontier AI research challenge. This investment has been worth it, with the industry now shifting into a scaling phase.

World models formed the foundation on which autonomous driving was built, predicting potential futures. Excitingly, they offer the straightest path to a future of multimodal, autoregressive models which learn causality from observations.

This is the most general form of intelligence, bringing together vision, audio, and language. The model learns from an immense scale of observations of the real world, as well as synthetic observations where we have a desire to deviate from reality or augment model performance.

In this future, rather than getting a text box for interaction with AI, you will be presented with a stream of intelligent pixels.

It’s easy to imagine all the ways these ‘intelligent pixels’ will transform our lives. Robotics and entertainment are obvious examples of industries which will change through these models, however this is only the start. This is really the birth of the Holodeck, and world models are the foundational technology underpinning this future.

Want to go deeper?

I recommend reading the origins of world model research. Originally, world models were developed for single domains, such as autonomous driving, RL test environments, or platform games. Most of these models have not been released publicly. This is changing, with Genie 3 and Odyssey-2 both being released and having a broad degree of applicability. There are differences in model capabilities being prioritised between frontier labs, though we expect these to converge.
