Mode Seeking meets Mean Seeking for Fast Long Video Generation

1Stanford University · 2NVIDIA Research · 3NYU Courant
*Equal contribution.
Overview
Overview figure
Mode seeking meets mean seeking: a shared long-context condition encoder Eϕ maps a noisy long-video latent to a unified representation ht. Two lightweight decoder heads read out velocities from ht: the long-context Flow Matching head is trained with supervised flow matching on real long videos (mean-seeking), while the segment-wise Distribution Matching head is trained via on-policy sliding-window reverse-KL alignment to an expert short-video teacher (mode-seeking). Both objectives update the shared encoder, but each head receives only its corresponding signal. During inference, only the Distribution Matching head is used, enabling fast long-video generation.
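The two-head readout described above can be sketched in code. The following is a minimal toy illustration, not the paper's implementation: the encoder and heads are stand-in linear maps, all names and dimensions are hypothetical, and only the supervised flow-matching (mean-seeking) objective is computed, assuming a rectified-flow interpolation; the reverse-KL teacher alignment for the Distribution Matching head is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: latent dim d, hidden dim h, batch b.
d, h, b = 8, 16, 4

# Shared long-context condition encoder E_phi and the two lightweight
# decoder heads, modeled here as single linear maps for illustration.
W_enc = rng.normal(0, 0.1, (h, d))   # E_phi (shared by both objectives)
W_fm  = rng.normal(0, 0.1, (d, h))   # long-context Flow Matching head
W_dm  = rng.normal(0, 0.1, (d, h))   # segment-wise Distribution Matching head

def encode(x_t):
    """Shared representation h_t = E_phi(x_t)."""
    return np.tanh(W_enc @ x_t)

def fm_loss(x1, x0, t):
    """Supervised flow-matching (mean-seeking) loss on real data.

    Assumes the rectified-flow interpolation x_t = (1 - t) x0 + t x1,
    whose regression target for the velocity is x1 - x0.
    """
    x_t = (1 - t) * x0 + t * x1
    v_pred = W_fm @ encode(x_t)
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

def dm_velocity(x_t):
    """Distribution Matching head read-out; only this head runs at inference.

    Its on-policy reverse-KL alignment to a short-video teacher is
    not implemented in this sketch.
    """
    return W_dm @ encode(x_t)

x1 = rng.normal(size=(d, b))   # stand-in for a clean long-video latent
x0 = rng.normal(size=(d, b))   # Gaussian noise sample
loss = fm_loss(x1, x0, t=0.5)  # mean-seeking signal (updates W_enc, W_fm)
v = dm_velocity(0.5 * (x0 + x1))  # mode-seeking head's velocity read-out
```

Both losses would backpropagate into `W_enc`, while each head receives only its own gradient, mirroring the shared-encoder design described in the caption.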
Gallery
One-Minute Previews
Qualitative Comparisons
One-Minute Qualitative Comparisons
BibTeX
@article{cai2026mmm,
  title   = {Mode Seeking meets Mean Seeking for Fast Long Video Generation},
  author  = {Cai, Shengqu and Nie, Weili and Liu, Chao and Berner, Julius and
             Zhang, Lvmin and Ma, Nanye and Chen, Hansheng and Agrawala, Maneesh and
             Guibas, Leonidas and Wetzstein, Gordon and Vahdat, Arash},
  journal = {arXiv preprint},
  year    = {2026},
}