Thoughts on Video World Models and Long Video Generation

Some notes I have been carrying around since working on Mixture of Contexts.

What follows are some loose thoughts on video world models and long video generation, accumulated while working on Mixture of Contexts, a paper whose title could almost have been "Learnable Sparse Attention for Long Video Generation".

Semi-long post alert: this is a loose collection of views rather than one tightly argued thesis.

1. Learnable Sparse Attention Is Still Underrated

I still think learnable sparse attention is underrated for video, 3D/4D, and world models.
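To make "learnable sparse attention" concrete, here is a toy sketch, not the actual Mixture of Contexts mechanism: the context is split into chunks, each chunk is summarized by a mean-pooled key, a query routes to its top-k chunks, and dense attention runs only over the selected tokens. The routing scheme and all function names here are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def sparse_attention(query, chunks_kv, top_k=2):
    """Attend only over the top_k chunks whose pooled key best matches the query.

    chunks_kv: list of (keys, values) pairs, one per chunk,
    where keys and values are lists of d-dimensional vectors.
    """
    # Route: score each chunk by the query's similarity to its mean-pooled key.
    scores = [dot(query, mean_pool(keys)) for keys, _ in chunks_kv]
    selected = sorted(range(len(chunks_kv)), key=lambda i: -scores[i])[:top_k]

    # Dense attention restricted to the selected chunks' tokens.
    keys = [k for i in selected for k in chunks_kv[i][0]]
    vals = [v for i in selected for v in chunks_kv[i][1]]
    weights = softmax([dot(query, k) for k in keys])
    d = len(query)
    return [sum(w * v[i] for w, v in zip(weights, vals)) for i in range(d)]
```

The point of making the routing learnable (here it is just dot products, but in a real model the chunk summaries and scores would be trained end to end) is that compute stays near-constant as the context grows, while the model itself decides which history matters.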

2. What Should "Memory" Mean in Video World Models?

When people say "memory," "context," "state," or "history" in long video generation, I think what we really want is context that supports a self-evolving world state, broadly in the spirit of Yann LeCun's view.
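The distinction can be sketched in a few lines: instead of conditioning on the raw history of frames, the generator conditions on a state that is folded forward as events happen, so that late frames stay consistent with early ones. Everything here (`WorldState`, `update_state`, the dict-of-facts representation) is a hypothetical illustration, not a proposed architecture.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    # Hypothetical latent summary of everything generated so far.
    facts: dict = field(default_factory=dict)
    step: int = 0

def update_state(state, event):
    """Fold one new event into the evolving world state (illustrative only)."""
    key, value = event
    new_facts = dict(state.facts)
    new_facts[key] = value  # later events overwrite earlier ones
    return WorldState(facts=new_facts, step=state.step + 1)

def roll_forward(events):
    """Condition on the evolved state, not the raw event history."""
    state = WorldState()
    for ev in events:
        state = update_state(state, ev)
    return state
```

A raw-history model has to rediscover "the door is now closed" from the full frame sequence every step; a state-carrying model reads it off directly, which is closer to what "memory" should mean here.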

3. Algorithms Are Not the Main Problem; Data Is

I increasingly feel that algorithms are not the main bottleneck for handling memory in video world models. The real bottleneck is data, especially video-action or video-interaction paired data.

4. What Is the Role of Explicit 3D?

On this point I side largely with implicit, data-driven approaches. Explicit 3D still matters, but mainly for data curation and alignment rather than as the foundational representation on which the whole model is built.

5. The Future Is a Unified Model

My current bet is that the future is a unified model.

There is still a lot to do.