On "Error Accumulation" in Causal AR World Models
Notes on train-test drift, self-forcing, teacher models, and the effect of scale.
I've been meaning to write about this topic for a long time.
Recently, as world models are truly beginning to enter downstream fields like robotics, I happened to have the opportunity to speak with people on the core team behind a highly prestigious world model. Interestingly, they had barely paid attention to this issue at all. Armed with powerful base models and massive compute scale, the problem almost seems not to exist for them. That conversation made me realize that now is the perfect time to summarize and reflect on error accumulation in a more systematic way.
This post is mostly for myself, as a way to clear my thoughts, but I figure it may also be of interest to others, so I am sharing it publicly. Thinking about it, I might be among the first group of researchers to tackle this issue directly within the visual generation community. Funny how, after all these years, I ended up circling back to almost the same problem.
What Is Error Accumulation?
First, let us clarify the definition. The core concept is actually quite simple: during autoregressive generation, the training and testing distributions become increasingly inconsistent and drift apart. As the number of rollout steps increases, the model's ability to generalize degrades. In other words, the model is forced to condition on more and more of its own imperfect outputs, moving further away from the distribution it saw during training.
In video generation models, the specific manifestation is highly intuitive: as the frames progress, the picture becomes increasingly messy and begins to structurally collapse before your eyes.
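To make the drift mechanism concrete, here is a deliberately tiny toy (a scalar stand-in for a video model, with made-up coefficients of my own choosing): the same slightly-wrong model is evaluated one step at a time from ground truth under teacher forcing, and chained on its own outputs under free running. Only the second case compounds.

```python
# Toy AR "model": the true dynamics are x' = a * x, but the learned
# coefficient is slightly off. These numbers are illustrative only.
TRUE_A, MODEL_A = 0.99, 1.01

x_true, x_free = 1.0, 1.0
teacher_errs, free_errs = [], []
for _ in range(60):
    x_prev = x_true
    x_true = TRUE_A * x_prev
    # teacher forcing: predict the next frame from the ground-truth frame,
    # so each step's error is just the single-step model mismatch
    teacher_errs.append(abs(MODEL_A * x_prev - x_true))
    # free running: predict from the model's own previous output,
    # so the mismatch is applied on top of all inherited drift
    x_free = MODEL_A * x_free
    free_errs.append(abs(x_free - x_true))
```

Under teacher forcing the per-step error stays bounded (here it even shrinks with the decaying state), while the free-running error grows with rollout length, which is exactly the train-test gap described above.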
Why Do LLMs Seem Unaffected?
Following this logic, a natural question arises: large language models are also based on autoregressive next-token prediction, so why do they seem almost immune to this problem?
Well, not entirely. Error accumulation was a fatal flaw in early RNN frameworks, which prompted academia to design countless workarounds, many of which strongly resemble today's Self-Forcing strategies [1]. Even modern LLMs still suffer from exposure bias, which is essentially the same phenomenon, just less fatal in practice.
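For a flavor of what those RNN-era workarounds looked like, here is a scheduled-sampling-style sketch (the function name and the `model_step` stand-in are my own illustrative choices, not from any particular codebase): with some probability, the next step conditions on the model's own prediction instead of the ground-truth frame, so training gets exposed to its own drift.

```python
import random

def scheduled_sampling_rollout(gt_frames, model_step, p_model):
    """Scheduled-sampling-style training rollout: with probability p_model,
    condition the next step on the model's own prediction instead of the
    ground-truth frame. `model_step` stands in for one forward pass."""
    prev = gt_frames[0]
    inputs, targets = [], []
    for t in range(1, len(gt_frames)):
        inputs.append(prev)            # what the model conditions on
        targets.append(gt_frames[t])   # what it is supervised against
        pred = model_step(prev)
        prev = pred if random.random() < p_model else gt_frames[t]
    return inputs, targets
```

With `p_model = 0` this is pure teacher forcing; annealing it toward 1 over training approaches the fully self-conditioned regime that Self-Forcing-style methods push to its logical conclusion.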
For modern LLMs, the problem is nowhere near as destructive as it is in video generation, mainly for two reasons:
- Discrete versus continuous domains. LLMs operate in a discrete space. No matter how much distribution drift occurs, the output still remains a valid token from a predefined vocabulary. The system has an extremely high tolerance for small deviations. In video, by contrast, once pixel-level errors begin to accumulate, they quickly trigger the loss of high-frequency information, leading to blur, oversaturation, broken structure, and eventual collapse.
- The overwhelming advantage of the training domain. The training corpus for LLMs is so vast that the models have already been exposed to a huge number of bad or unusual distributions, which grants them very strong generalizability. That is why, even during prefilling, you can feed a model a string of incomprehensible gibberish and it can still continue with something vaguely coherent.
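A toy caricature of the first bullet (this is emphatically not how tokenization actually works; it only isolates the snapping effect): a discrete decoder projects each output back onto a fixed grid, so sub-quantization noise is absorbed at every step, while a continuous state carries every tiny error forward.

```python
import random

VOCAB_SIZE = 256

def snap(x):
    """Discrete decoding: project the output onto the nearest entry of a
    fixed 'vocabulary' grid on [0, 1]."""
    return round(x * VOCAB_SIZE) / VOCAB_SIZE

random.seed(0)
x_cont = x_disc = 0.5                  # 0.5 sits exactly on the grid (128/256)
for _ in range(1000):
    noise = random.gauss(0.0, 2e-4)    # tiny per-step prediction error
    x_cont += noise                    # continuous: every error is carried forward
    x_disc = snap(x_disc + noise)      # discrete: the state snaps back each step
```

After a thousand steps the continuous state has random-walked away from its start, while the discrete state is still exactly on its original grid point, because no single-step error was large enough to cross a bin boundary.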
It is also worth noting that this is not exclusive to pure AR frameworks. Diffusion models also experience error accumulation, since the diffusion process can itself be viewed as autoregressive generation over timesteps. It is simply less obvious there than it is across frame rollouts. A prime example of explicitly addressing this in diffusion is Backward Simulation in DMD 2 [2], which targets a closely related pain point around distribution drift.
A Personal Perspective
I first started systematically focusing on this issue in the context of Infinite Nature's [3] scene extension framework. For those unfamiliar, the underlying mechanism of Infinite Nature is essentially Self-Forcing. That demo made a massive impression on my younger self, so I followed that line of thought and later developed DiffDreamer [4] with Eric Chan and Songyou Peng, which ultimately became my second paper and my master's thesis. I also got to spend a good week in Paris for ICCV 2023.
In DiffDreamer [4], we explored perpetual view generation: given a natural image, the camera continuously "flies" deeper into the scene. This kind of long-range extrapolation has highly typical autoregressive characteristics. As the camera moves forward, newly generated frames depend entirely on the model's previous outputs. As the model gradually strays from real ground truth and gets trapped inside its own hallucinated distribution, high-frequency details begin to disappear, warping artifacts compound, and eventually the image homogenizes and collapses.
Our solution was to frame the problem as predicting a future anchor frame first, and then interpolating the intermediate frames. That strategy turned out to be highly effective. More broadly, using future prediction to address error accumulation also has deep parallels in the SLAM domain.
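A minimal sketch of why anchoring helps (the numbers and the perfect-interpolator stand-in are illustrative assumptions, not DiffDreamer's actual models): in chained next-frame prediction, each call stacks its own error on top of all inherited drift, while anchor-then-interpolate exposes every frame to at most one call's worth of error.

```python
import math

T, EPS = 30, 0.02  # rollout length and a made-up per-call model error
gt = [math.sin(0.3 * t) for t in range(T + 1)]   # ground-truth trajectory

# Chained next-frame prediction: each call adds EPS on top of inherited drift.
chained = [gt[0]]
for t in range(T):
    chained.append(chained[-1] + (gt[t + 1] - gt[t]) + EPS)
chained_err = abs(chained[-1] - gt[-1])          # grows linearly: T * EPS

# Anchor-first: one long-range call predicts the endpoint (one EPS), then
# each in-between frame is interpolated conditioned on the clean first frame
# and the anchor, so it carries only its own single EPS, never a chain.
anchor = gt[T] + EPS
interpolated = [gt[t] + EPS for t in range(1, T)]  # interpolator stand-in
anchored = [gt[0]] + interpolated + [anchor]
anchored_err = max(abs(a - g) for a, g in zip(anchored, gt))
```

The chained rollout ends T model-errors away from the truth, while the anchored rollout's worst frame is off by a single model-error, regardless of T. That bounded-error structure is the whole point of predicting the future anchor first.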
This line of thinking later translated naturally into FramePack [5], under a more formal autoregressive video generation framework based on next-frame prediction. In FramePack, the first contribution used context packing to handle long-context memory. The second contribution, designed explicitly to combat error accumulation, introduced an early-established endpoints mechanism, essentially using a relatively clean future prediction to guide the AR process. You can check out the FramePack-P1 [6] demo for some of the best visual results. The Seoul World Model [16], by jyseo_cv, follows a similar idea and achieves very impressive outcomes.
On Self-Forcing
When discussing error accumulation, it is impossible not to talk about Self-Forcing (SF) [1]. The authors deserve enormous credit for engineering the system to work and for open-sourcing it to the community.
However, from my perspective, the community currently harbors a major misunderstanding about SF. Whenever it is mentioned, people subconsciously assume it is a framework designed to solve long video generation and then reason from that assumption.
But is that really the case? If you look at follow-up works such as LongLive [7] and Infinity-RoPE [8], you will notice that in most scenarios SF fundamentally collapses after rollouts of only a dozen seconds or so. If you carefully read the original paper and project page, the authors are actually very honest about this: long video generation remains a limitation. At its core, SF is a system for accelerating video generation via causal AR distillation to enable interactivity. It is not yet a true foundational solution for long videos.
So where does the core bottleneck lie? In my view, it comes down to two things: the number of rollout steps, and the context window and overall capability of the teacher model.
On Rollout Steps
During training, SF [1] only rolls out to about 5 seconds. That inherently means that during inference, any rollout beyond 5 seconds is pure extrapolation. But the nature of accumulated error is complex and diverse; we absolutely cannot assume that the error distribution accumulated over a 5-second rollout mirrors the error distribution accumulated over a 30+ second rollout.
Viewed through the lens of extrapolation failure, works like LongLive [7], Self-Forcing++ [9], and APT 2 [10] may actually be somewhat closer to fundamentally tackling long video generation.
Unless you brute-force the issue with heavy parallel infrastructure like Context Parallelism (CP) [14], the rollout process already pushes the limits of per-iteration computational cost. In practice, it is difficult for SF to even run native diffusion or flow matching; it is forced to resort to few-step distillation like DMD [2]. That is not inherently a bad thing, since reverse KL has its own benefits, but it does make the design space narrower.
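On the reverse-KL point, here is a small numerical sketch (the grid discretization and the specific distributions are arbitrary choices of mine): when a unimodal model is fit to a bimodal target, reverse KL rewards nailing one mode, while forward KL rewards spreading probability over both.

```python
import numpy as np

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]

def normal(mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / (d.sum() * dx)          # normalize on the grid

# Bimodal "data" distribution and two unimodal candidate generators.
p = 0.5 * normal(-2, 0.5) + 0.5 * normal(2, 0.5)
q_mode = normal(2, 0.5)   # nails one mode, ignores the other
q_mean = normal(0, 2.2)   # hedges across both modes

def kl(a, b):
    m = a > 1e-12          # skip regions where `a` has no mass
    return float(np.sum(a[m] * np.log(a[m] / b[m])) * dx)

rev = kl(q_mode, p), kl(q_mean, p)   # reverse KL: mode-seeking
fwd = kl(p, q_mode), kl(p, q_mean)   # forward KL: mean-seeking
```

Reverse KL scores the single-mode fit better (it never forces the model to put mass where the data has none), while forward KL heavily punishes the single-mode fit for missing the other mode. This mode-seeking behavior is a real benefit for sample quality, even as it narrows the design space.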
On the Teacher Model
World models, in any context, must possess causal-level memory and long-context capabilities. This becomes especially critical as we push them toward downstream applications like robotics. The memory and long context we need in those domains are extremely abstracted, enabling long-horizon tasks, as in LingBot-VA [15], and even in-context learning, as in DVA [13]. To me, these capabilities strongly resemble the memory and context mechanisms we discuss in LLMs.
But SF is ultimately a distillation method, which means its ceiling is dictated by the teacher model. In the open-source world, what the Wan 1.3B model can achieve is limited by its own architectural ceiling. The SF codebase actually used a 14B model as the teacher to push results higher, but even that teacher only provided about a 5-second context window.
Without a long-context base model, any discussion or proposed solution around error accumulation risks simply overfitting to this 5-second regime. Larger-scale efforts such as LingBot-World [14] choose to directly train a long-context base model, on the order of 1 minute, before distillation. I also worked on MMM [11] to try to make genuine headway on long video generation by introducing real ground-truth long-duration data directly into the distillation stage through the underlying representation.
This is also a core reason why I remain highly skeptical of treating static 3D representation or consistency as a primary direction for world models. Under most causal world-model settings, that framing does not make much sense to me, except perhaps marginally when the world model is treated more like a simulator.
On the Effect of Scale
When we look at truly top-tier commercial models, backed by massive compute, parameter counts, and data, their video extension capabilities are actually already quite robust. The visual collapse caused by error accumulation has been pushed much farther out.
To me, that is a strong indicator that the base model is the key. It is also exactly why several of my more recent projects, including FramePack [5, 6], PFP [17], MMM [11], and MoC [12], focus heavily on endowing base models with genuine long-context capability from the ground up.
A sad reality is that, at small scale and with limited data, we often walk into a dead end. The only viable path may be to overfit to narrow domains like Minecraft or Doom. That can still prove the concept, but it can be hard to tell whether the method would truly scale.
While developing FramePack [5], we also noticed a telling detail: if the prompt matched a highly common data point in the training set, like a city walk, error accumulation was noticeably mitigated.
Of course, on-policy distillation remains highly valuable and absolutely deserves continued research. But my personal prediction is that as we move toward larger scales, especially in a near future dominated by large-scale unified models that directly absorb vast amounts of information into their understanding modules, our entire perspective on this problem may be completely upended.
References
- [1] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- [2] Improved Distribution Matching Distillation for Fast Image Synthesis
- [3] Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
- [4] DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models
- [5] Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models
- [6] FramePack-P1
- [7] LongLive: Real-time Interactive Long Video Generation
- [8] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- [9] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- [10] APT 2: Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- [11] Mode Seeking meets Mean Seeking for Fast Long Video Generation
- [12] Mixture of Contexts for Long Video Generation
- [13] Causal Video Models Are Data-Efficient Robot Policy Learners
- [14] LingBot-World: Advancing Open-source World Models
- [15] LingBot-VA: Causal World Modeling for Robot Control
- [16] Seoul World Model: Grounding World Simulation Models in a Real-World Metropolis
- [17] Pretraining Frame Preservation for Lightweight Autoregressive Video History Embedding