Mode Seeking meets Mean Seeking for Fast Long Video Generation

1Stanford University · 2NVIDIA Research · 3NYU Courant
*Equal contribution.
Overview
Overview figure
Mode seeking meets mean seeking: a shared long-context condition encoder Eϕ maps a noisy long-video latent to a unified representation ht. Two lightweight decoder heads read out velocities from ht: the long-context Flow Matching head is trained with supervised flow matching on real long videos (mean-seeking), while the segment-wise Distribution Matching head is trained via on-policy sliding-window reverse-KL alignment to an expert short-video teacher (mode-seeking). Both objectives update the shared encoder, but each head receives only its corresponding signal. During inference, only the Distribution Matching head is used, enabling fast long-video generation.
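The two-head readout described above can be sketched in code. The following is a minimal toy illustration, not the paper's implementation: the encoder and heads are stand-in linear maps, all names and dimensions are hypothetical, and only the supervised flow-matching (mean-seeking) objective is computed, assuming a rectified-flow interpolation; the reverse-KL teacher alignment for the Distribution Matching head is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: latent dim d, hidden dim h, batch b.
d, h, b = 8, 16, 4

# Shared long-context condition encoder E_phi and the two lightweight
# decoder heads, modeled here as single linear maps for illustration.
W_enc = rng.normal(0, 0.1, (h, d))   # E_phi (shared by both objectives)
W_fm  = rng.normal(0, 0.1, (d, h))   # long-context Flow Matching head
W_dm  = rng.normal(0, 0.1, (d, h))   # segment-wise Distribution Matching head

def encode(x_t):
    """Shared representation h_t = E_phi(x_t)."""
    return np.tanh(W_enc @ x_t)

def fm_loss(x1, x0, t):
    """Supervised flow-matching (mean-seeking) loss on real data.

    Assumes the rectified-flow interpolation x_t = (1 - t) x0 + t x1,
    whose regression target for the velocity is x1 - x0.
    """
    x_t = (1 - t) * x0 + t * x1
    v_pred = W_fm @ encode(x_t)
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

def dm_velocity(x_t):
    """Distribution Matching head read-out; only this head runs at inference.

    Its on-policy reverse-KL alignment to a short-video teacher is
    not implemented in this sketch.
    """
    return W_dm @ encode(x_t)

x1 = rng.normal(size=(d, b))   # stand-in for a clean long-video latent
x0 = rng.normal(size=(d, b))   # Gaussian noise sample
loss = fm_loss(x1, x0, t=0.5)  # mean-seeking signal (updates W_enc, W_fm)
v = dm_velocity(0.5 * (x0 + x1))  # mode-seeking head's velocity read-out
```

Both losses would backpropagate into `W_enc`, while each head receives only its own gradient, mirroring the shared-encoder design described in the caption.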
Gallery
One-Minute Previews
Qualitative Comparisons
One-Minute Qualitative Comparisons
BibTeX
@article{cai2026mmm,
  title   = {Mode Seeking meets Mean Seeking for Fast Long Video Generation},
  author  = {Cai, Shengqu and Nie, Weili and Liu, Chao and Berner, Julius and
             Zhang, Lvmin and Ma, Nanye and Chen, Hansheng and Agrawala, Maneesh and
             Guibas, Leonidas and Wetzstein, Gordon and Vahdat, Arash},
  journal = {arXiv preprint},
  year    = {2026},
}