Mode Seeking meets Mean Seeking for Fast Long Video Generation

1Stanford University 2NVIDIA Research 3NYU Courant
*Equal contribution.
Note: All videos on this page are compressed to reduce loading time.
Overview
Overview figure
Mode seeking meets mean seeking: a shared long-context condition encoder Eϕ maps a noisy long-video latent to a unified representation ht. Two lightweight decoder heads read out velocities from ht: the long-context Flow Matching head is trained with supervised flow matching on real long videos (mean-seeking), while the segment-wise Distribution Matching head is trained via on-policy sliding-window reverse-KL alignment to an expert short-video teacher (mode-seeking). Both objectives update the shared encoder, but each head receives only its corresponding signal. During inference, only the Distribution Matching head is used, enabling fast long-video generation.
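The shared-encoder, two-head training described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the module shapes, names, and the simple velocity-regression stand-in for the on-policy reverse-KL objective are all assumptions; it only shows how both losses update the shared encoder while each head receives only its own gradient signal.

```python
import torch
import torch.nn as nn

class DualHeadModel(nn.Module):
    """Toy sketch: shared encoder with two velocity decoder heads.
    Shapes and layer choices are hypothetical."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)   # stands in for the long-context encoder E_phi
        self.fm_head = nn.Linear(dim, dim)   # Flow Matching head (mean-seeking)
        self.dm_head = nn.Linear(dim, dim)   # Distribution Matching head (mode-seeking)

    def forward(self, x_t):
        h_t = self.encoder(x_t)              # unified representation h_t
        return self.fm_head(h_t), self.dm_head(h_t)

def training_step(model, x0, x1, teacher_velocity, opt):
    """One combined update. loss_fm is supervised flow matching on real data;
    loss_dm is a simplified surrogate (regression to a teacher velocity) for
    the segment-wise reverse-KL alignment used in the paper."""
    t = torch.rand(x0.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1              # linear interpolation path
    v_target = x1 - x0                       # flow-matching target velocity
    v_fm, v_dm = model(x_t)
    loss_fm = ((v_fm - v_target) ** 2).mean()
    with torch.no_grad():
        v_teacher = teacher_velocity(x_t, t)
    loss_dm = ((v_dm - v_teacher) ** 2).mean()
    # Summing the losses routes each head only its own gradient,
    # while the shared encoder receives both signals.
    loss = loss_fm + loss_dm
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

At inference time, only `dm_head` would be evaluated, matching the figure's description of fast generation through the Distribution Matching head alone.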
Gallery
Gallery (14B)
Gallery (1.3B)
Qualitative Comparisons
BibTeX
@article{cai2026mmm,
  title={Mode Seeking meets Mean Seeking for Fast Long Video Generation},
  author={Cai, Shengqu and Nie, Weili and Liu, Chao and Berner, Julius and Zhang, Lvmin and Ma, Nanye and Chen, Hansheng and Agrawala, Maneesh and Guibas, Leonidas and Wetzstein, Gordon and Vahdat, Arash},
  journal={arXiv preprint},
  year={2026}
}