Thoughts on Video World Models and Long Video Generation
Some random thoughts about video world models and long video generation that I have been carrying around since working on Mixture of Contexts, whose title could almost have been "Learnable Sparse Attention for Long Video Generation".
Semi-long post alert: this is a loose collection of views rather than one tightly argued thesis.
1. Learnable Sparse Attention Is Still Underrated
I still think learnable sparse attention is underrated for video, 3D/4D, and world models.
- Different from text. Text often hinges on single-token dependencies; video almost never does. The visual signals that matter tend to form patch or tube structures that persist and evolve across frames.
- The wrong mental model. The "needle-in-a-haystack" token recall test for LLMs does not map cleanly onto video. Long video rarely requires recovering one lone token from far in the past. Because viewpoint, lighting, scale, occlusion, articulation, motion blur, and even editing all change so much, there is no invariant single token to recover anyway.
- Visual content is physically structured. Continuity, locality, bounded acceleration, and limited parallax drastically shrink the search space. Targets usually reappear across multiple frames and move in predictable ways.
- Compression versus sparsity. Compression is a blunt tool for space-time recurrence. Learnable sparsity routes computation directly toward recurring, structured signal instead of risking the loss of fine but persistent cues. For visual domains, that may be more suitable than purely compression-centric strategies. That said, the two are not orthogonal: in MoC we still use a naive attention sink, which is a form of compression, and it helps.
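To make the routing idea concrete, here is a toy NumPy sketch of chunk-level top-k sparse attention. This is not the actual MoC implementation: using the mean key as a chunk descriptor and selecting top-k chunks independently per query are illustrative simplifications, and all names here are made up.

```python
import numpy as np

def chunked_topk_attention(q, k, v, chunk_size, top_k):
    """Toy chunk-level sparse attention: each query attends only to the
    top_k chunks whose descriptor (mean key) scores highest against it."""
    n, d = k.shape
    n_chunks = n // chunk_size
    kc = k[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)
    vc = v[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)
    desc = kc.mean(axis=1)                    # (n_chunks, d) routing signal
    scores = q @ desc.T / np.sqrt(d)          # (n_queries, n_chunks)
    out = np.empty((q.shape[0], d))
    for i, qi in enumerate(q):
        sel = np.argsort(scores[i])[-top_k:]  # top-k chunks for this query
        ks = kc[sel].reshape(-1, d)
        vs = vc[sel].reshape(-1, d)
        a = qi @ ks.T / np.sqrt(d)            # dense attention, but only
        a = np.exp(a - a.max())               # over the selected chunks
        a /= a.sum()
        out[i] = a @ vs
    return out
```

Compute scales with top_k * chunk_size rather than the full sequence length, and because the selection is score-based rather than fixed-pattern, the router can learn to keep following a recurring object across a long context.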
2. What Should "Memory" Mean in Video World Models?
When people say memory, context, state, or history for long video generation, I think what we really want is context that supports a self-evolving world state, broadly in the spirit of Yann LeCun's view.
- After scaling up, simply maintaining scene or character consistency starts to feel relatively trivial: our MoC works, Context-as-Memory works, TTT/LaCT works, and so does nano-banana.
- What we need is a more expressive form of context. A simple behavioral test is this: a car enters from the left, the camera looks away, and when it returns the car should have advanced plausibly. That requires a state that evolves off-screen and enables the model to deduce what is happening.
- This requires something beyond 3D caches. Pure 3D memory, meaning geometry and appearance, does not carry ongoing events through occlusions or field-of-view changes. We need an evolving 4D latent state that tracks identity, pose, momentum, interactions, and constraints. In other words, the model needs to preserve "what's going on" even when it is unseen.
- This also means we need more than a memory bank. Consistency of characters or assets is not enough; we need state transitions that continue even while unobserved.
- This does not necessarily mean using an SSM. It means placing a deductive step somewhere inside the model. Full attention can do this well given enough data, since it effectively behaves like a dynamic graph, but it becomes intractable at long contexts. That is exactly why learnable sparsity matters, and it was a core motivation for us in MoC.
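The behavioral test above can be caricatured with a toy simulator (all names illustrative, constant velocity for simplicity). The point is the property we want a world model's latent state to have: the transition function runs every frame, whether or not that frame is observed.

```python
from dataclasses import dataclass

@dataclass
class CarState:
    x: float  # position along the road (arbitrary units)
    v: float  # velocity, held constant in this toy

def step(state: CarState, dt: float = 1.0) -> CarState:
    # The world state evolves every frame, rendered or not.
    return CarState(state.x + state.v * dt, state.v)

def rollout(state: CarState, visible: list[bool]) -> tuple[list, CarState]:
    """Advance the state once per frame; emit an observation only
    when the camera is looking (None marks off-screen frames)."""
    frames = []
    for vis in visible:
        state = step(state)
        frames.append(state.x if vis else None)
    return frames, state
```

A model that only caches appearances would resume the car where it was last seen; a model with an evolving state resumes it where the dynamics say it should be, which is what the hidden `step` calls here stand in for.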
3. Algorithms Are Not the Main Problem; Data Is
I increasingly feel that algorithms are not the main bottleneck for handling memory in video world models. The real bottleneck is data, especially video-action or video-interaction paired data.
- We more or less know how to represent and route long-range visual context.
- What we do not have enough of is data that truly stresses long-horizon prediction: persistent identity, occlusions, off-screen dynamics, multi-agent interactions, and meaningful action-conditioned changes.
- This mirrors the hard part of the VLA (vision-language-action) problem: scalable, high-quality interaction data is the real rate limiter for grounded state evolution and robust deduction.
- The encouraging part is that under the video world model setting, we may not face as much of a Sim2Real gap as classic robotics pipelines do.
4. What Is the Role of Explicit 3D?
On this point I side with purely implicit, data-driven approaches. Explicit 3D still matters, but mainly for data and alignment rather than as the foundational representation the whole model is built on top of.
5. The Future Is a Unified Model
My current bet is that the future is a unified model.
- Put deduction in the right place. A unified model is the most direct way to put the deductive step into the semantic representation space and train it end-to-end.
- Borrow more, borrow better. Shared representations let the model transfer motion priors, physics, and identity persistence across tasks and modalities. It also becomes easier to borrow more, and borrow better, from years of effort in the LLM community.
- Consistent routing and compression. Unified training can yield more stable sparsity policies about what to attend to, when to attend to it, and how to compress it.
- Richer supervision. Multi-task signals should sharpen the evolving latent state and improve long-horizon deduction ability.
There is still a lot to do.