A non-parametric yet learnable router selects the most informative history chunks for each query to attend to. The routing is implicitly differentiable and optimized end-to-end on large-scale long-video data.
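The chunk-routing idea can be sketched as follows. This is a simplified, hypothetical illustration (single query, mean-pooled chunk descriptors, keys reused as values), not the paper's implementation; `moc_sparse_attention` and its parameters are names chosen for this sketch:

```python
import numpy as np

def moc_sparse_attention(q, kv, chunk_size=4, top_k=2):
    """Sketch of chunk-level top-k routing for sparse attention.

    q:  (d,)   a single query token
    kv: (n, d) history tokens (keys reused as values for brevity)
    Each history chunk is summarized by its mean-pooled key; the
    query attends only to tokens inside the top_k best-scoring chunks,
    so the remaining token pairs are pruned from the attention.
    """
    n, d = kv.shape
    chunks = kv.reshape(n // chunk_size, chunk_size, d)
    descriptors = chunks.mean(axis=1)          # (num_chunks, d) chunk summaries
    scores = descriptors @ q                   # router score per chunk
    keep = np.argsort(scores)[-top_k:]         # indices of selected chunks
    selected = chunks[keep].reshape(-1, d)     # tokens from routed chunks only
    logits = selected @ q / np.sqrt(d)         # scaled dot-product attention
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ selected, np.sort(keep)
```

With `top_k=2` out of 4 chunks, half of all query-key pairs are skipped; the pruning ratios quoted below come from applying this kind of selection at much larger scale.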
For 8-shot, minute-long videos (~180k tokens), MoC prunes ≈85% of token pairs and cuts FLOPs by 7×, while delivering seamless cross-cut coherence.
Some prompts are adapted from artworks shared on Twitter.
On 8-second, 320×192 clips (~6.5k tokens), MoC routing prunes ≈83% of token pairs while maintaining, or even improving, video quality.
@inproceedings{cai2025moc,
  title     = {Mixture of Contexts for Long Video Generation},
  author    = {Cai, Shengqu and Yang, Ceyuan and Zhang, Lvmin and Guo, Yuwei and Xiao, Junfei and Yang, Ziyan and Xu, Yinghao and Yang, Zhenheng and Yuille, Alan and Guibas, Leonidas and Agrawala, Maneesh and Jiang, Lu and Wetzstein, Gordon},
  year      = {2025},
  booktitle = {arXiv},
}