Mixture of Contexts for Long Video Generation

“Minute-long context memory with short-video cost”
1Stanford University 2ByteDance Seed 3Johns Hopkins University 4CUHK 5ByteDance
*Work done at ByteDance Seed   Corresponding Author
Learnable Sparse Attention Routing

A non-parametric yet learnable router selects the most informative history chunks for each query to attend to. The routing is implicitly differentiable and is optimized end-to-end on large-scale long-video data.
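The routing idea above can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's implementation: it assumes each history chunk is summarized by mean-pooling its keys, each query is scored against those chunk descriptors by a dot product, and only the top-k chunks are kept for attention.

```python
import numpy as np

def route_chunks(q, keys, chunk_size, top_k):
    """Hypothetical non-parametric router sketch: score each history chunk
    by the dot product between the query and the chunk's mean-pooled key,
    then keep only the top_k highest-scoring chunks for attention."""
    n_chunks = len(keys) // chunk_size
    chunk_keys = keys[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, -1)
    descriptors = chunk_keys.mean(axis=1)   # one descriptor vector per chunk
    scores = descriptors @ q                # relevance of each chunk to the query
    selected = np.argsort(scores)[-top_k:]  # indices of the kept chunks
    return np.sort(selected)

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 8))  # 64 history tokens, key dim 8
q = rng.standard_normal(8)           # a single query token
print(route_chunks(q, keys, chunk_size=8, top_k=2))
```

Because the selection is driven by the same keys and queries the attention layer already produces, the router adds no parameters; gradients flowing through the attended chunks are what make the routing learnable end-to-end.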

Multi-shot Long Video Generation

For 8-shot, minute-long videos (~180k tokens), MoC prunes ≈85% of token pairs and cuts FLOPs by 7×, while delivering seamless cross-cut coherence.
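The FLOPs figure follows directly from the pruning ratio, since self-attention cost scales with the number of token pairs. A quick sanity check of the arithmetic:

```python
pruned = 0.85                 # fraction of token pairs pruned by MoC routing
kept = 1.0 - pruned           # fraction of attention actually computed
speedup = 1.0 / kept          # attention-FLOPs reduction factor
print(round(speedup, 1))      # → 6.7, i.e. roughly the reported 7x
```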

Some prompts are adapted from artworks shared on Twitter.

Single-shot Short Video Generation

On 8-second, 320×192 clips (~6.5k tokens), MoC routing prunes ≈83% of token pairs while maintaining or even improving video quality.

BibTeX
@inproceedings{cai2025moc,
  title={Mixture of Contexts for Long Video Generation},
  author={Cai, Shengqu and Yang, Ceyuan and Zhang, Lvmin and Guo, Yuwei and Xiao, Junfei and Yang, Ziyan and Xu, Yinghao and Yang, Zhenheng and Yuille, Alan and Guibas, Leonidas and Agrawala, Maneesh and Jiang, Lu and Wetzstein, Gordon},
  year={2025},
  booktitle={arXiv},
}