Mixture of Contexts for Long Video Generation

“Minute-long context memory with short-video cost”
1Stanford University 2ByteDance Seed 3Johns Hopkins University 4CUHK 5ByteDance
*Work done at ByteDance Seed   Corresponding Author
Learnable Sparse Attention Routing

A non-parametric yet learnable router selects the most informative history chunks for each query to attend to. The routing is implicitly differentiable and is optimized end-to-end on large-scale long-video data.
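The routing idea above can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's implementation: it assumes each history chunk is summarized by mean-pooling its keys, each query is scored against those chunk descriptors by a dot product, and only the top-k chunks are kept for attention.

```python
import numpy as np

def route_chunks(q, keys, chunk_size, top_k):
    """Hypothetical non-parametric router sketch: score each history chunk
    by the dot product between the query and the chunk's mean-pooled key,
    then keep only the top_k highest-scoring chunks for attention."""
    n_chunks = len(keys) // chunk_size
    chunk_keys = keys[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, -1)
    descriptors = chunk_keys.mean(axis=1)   # one descriptor vector per chunk
    scores = descriptors @ q                # relevance of each chunk to the query
    selected = np.argsort(scores)[-top_k:]  # indices of the kept chunks
    return np.sort(selected)

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 8))  # 64 history tokens, key dim 8
q = rng.standard_normal(8)           # a single query token
print(route_chunks(q, keys, chunk_size=8, top_k=2))
```

Because the selection is driven by the same keys and queries the attention layer already produces, the router adds no parameters; gradients flowing through the attended chunks are what make the routing learnable end-to-end.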

Multi-shot Long Video Generation

For 8-shot, minute-long videos (~180k tokens), MoC prunes ≈85% of token pairs and cuts FLOPs by 7×, while delivering seamless cross-cut coherence.
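The FLOPs figure follows directly from the pruning ratio, since self-attention cost scales with the number of token pairs. A quick sanity check of the arithmetic:

```python
pruned = 0.85                 # fraction of token pairs pruned by MoC routing
kept = 1.0 - pruned           # fraction of attention actually computed
speedup = 1.0 / kept          # attention-FLOPs reduction factor
print(round(speedup, 1))      # → 6.7, i.e. roughly the reported 7x
```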

Some prompts are adapted from artworks shared on Twitter.

Single-shot Short Video Generation

On 8-second, 320×192 clips (~6.5k tokens), MoC routing prunes ≈83% of token pairs while maintaining or even improving video quality.

BibTeX
@inproceedings{cai2025moc,
  title={Mixture of Contexts for Long Video Generation},
  author={Cai, Shengqu and Yang, Ceyuan and Zhang, Lvmin and Guo, Yuwei and Xiao, Junfei and Yang, Ziyan and Xu, Yinghao and Yang, Zhenheng and Yuille, Alan and Guibas, Leonidas and Agrawala, Maneesh and Jiang, Lu and Wetzstein, Gordon},
  year={2025},
  booktitle={arXiv},
}