End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
TL;DR Summary
This paper presents a teacher-free self-resampling method for training autoregressive video diffusion models, addressing exposure bias and enabling efficient long-horizon generation with competitive performance and improved temporal consistency.
Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
1.2. Authors
-
Yuwei Guo (The Chinese University of Hong Kong, ByteDance Seed)
-
Ceyuan Yang (ByteDance Seed)
-
Hao He (The Chinese University of Hong Kong)
-
Yang Zhao (ByteDance Seed)
-
Meng Wei (ByteDance)
-
Zhenheng Yang (ByteDance)
-
Weilin Huang (ByteDance Seed)
-
Dahua Lin (The Chinese University of Hong Kong)
The author list indicates a strong collaboration between academia (The Chinese University of Hong Kong) and a major industry research lab (ByteDance). This combination often results in research that is both theoretically grounded and practically focused on scalability and state-of-the-art performance.
1.3. Journal/Conference
The paper is available on arXiv, a preprint server. The publication date in the metadata is listed as December 17, 2025, which is a future placeholder date, common for submissions to conferences with future deadlines. This indicates the work is recent but not yet peer-reviewed or officially published at a conference. arXiv is a standard and respected platform in the machine learning community for rapid dissemination of cutting-edge research.
1.4. Publication Year
2025 (as a placeholder on arXiv). The content suggests it was developed in late 2024 or early 2025.
1.5. Abstract
The paper addresses a critical challenge in autoregressive video diffusion models: exposure bias. This bias arises from a mismatch where models are trained on perfect ground-truth data but must rely on their own imperfect, self-generated outputs during inference, leading to error accumulation. While previous methods tackle this with post-training steps involving external "teacher" models, this paper introduces Resampling Forcing, a novel, teacher-free framework that enables end-to-end training of these models from scratch. The core of the method is a self-resampling scheme, where the model simulates its own future inference-time errors on past frames during the training process. By training on these "degraded" histories, the model learns to be robust to its own imperfections. The framework uses a sparse causal mask for parallel training and introduces history routing, an efficient, parameter-free mechanism to handle long videos by dynamically selecting the most relevant past frames. Experiments show that this approach achieves performance comparable to distillation-based methods while demonstrating superior temporal consistency in longer videos, thanks to its ability to train on full-length sequences natively.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2512.15702
- PDF Link: https://arxiv.org/pdf/2512.15702v1.pdf
- Publication Status: This is a preprint available on arXiv and has not yet undergone formal peer review for publication in a journal or conference.
2. Executive Summary
2.1. Background & Motivation
-
Core Problem: The central problem is
exposure biasin autoregressive video generation models. Autoregressive models generate content sequentially (e.g., frame by frame), where each new frame depends on the previously generated ones. During training, these models are typically fed perfect, ground-truth previous frames (a technique calledteacher forcing). However, during inference (actual generation), they must use their own, inevitably imperfect, generated frames as input for the next step. This mismatch between training and inference conditions causes small errors to propagate and amplify over time, leading to a significant drop in quality or even "catastrophic collapse" in long videos. -
Importance and Gaps: Solving this problem is crucial for creating reliable "world simulators" or generating long, coherent videos, which have applications in gaming, content creation, and simulation. Existing solutions, such as
Self Forcing, attempt to fix this by "post-training" the autoregressive model. They use a powerful, pre-trained bidirectional (non-causal) "teacher" model to guide or "distill" knowledge into the autoregressive model. This approach has two major drawbacks:- Dependency: It requires a separate, often much larger, teacher model, which makes the training process complex, computationally expensive, and not "end-to-end".
- Causality Violation: The teacher model can see the future, potentially leaking non-causal information that compromises the strict temporal causality desired in a true autoregressive simulator.
-
Innovative Idea: The paper's key idea is to eliminate the need for an external teacher by making the model its own teacher. It proposes to explicitly simulate the errors that occur during inference within the training loop itself. By forcing the model to generate a clean target frame from a "degraded" history that contains simulated errors, the model learns to become robust and correct its own mistakes, preventing error accumulation. This enables a fully end-to-end, teacher-free training paradigm.
2.2. Main Contributions / Findings
The paper presents the following primary contributions:
-
Resampling ForcingFramework: A novel, end-to-end, and teacher-free training framework for autoregressive video diffusion models. It directly addresses theexposure biasproblem without relying on external teacher models or discriminators, making it more scalable and self-contained. -
Self-ResamplingMechanism: This is the core technical innovation. During training, the method takes ground-truth history frames, partially corrupts them with noise, and then uses the model's current weights to regenerate them. This process creates "degraded" history frames that mimic the model's own inference-time errors. The model is then trained to predict the next clean frame conditioned on this imperfect history. -
History Routingfor Efficiency: To address the computational challenge of generating long videos (where the history grows continuously), the paper introduces a dynamic and parameter-free attention mechanism. Instead of attending to all past frames,history routingintelligently selects the top-k most relevant frames for each new frame being generated, keeping attention complexity nearly constant. -
State-of-the-Art Performance with Better Causality: The experimental results show that this method achieves video quality comparable to existing distillation-based methods. More importantly, because it is trained end-to-end on long videos without a non-causal teacher, it exhibits superior temporal consistency and stricter adherence to physical causality in long-horizon generation tasks.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
-
Diffusion Models: These are a class of generative models that learn to create data by reversing a gradual noising process. The process involves two steps:
- Forward Process: A clean data sample (e.g., an image) is gradually corrupted by adding a small amount of Gaussian noise over many steps, eventually turning it into pure noise.
- Reverse Process: A neural network is trained to reverse this process. It takes a noisy sample and a timestep as input and predicts the noise that was added. By repeatedly subtracting the predicted noise, the model can generate a clean data sample starting from pure random noise. The paper uses a variant called
Flow Matching, which simplifies the training objective.
-
Autoregressive (AR) Models: These models generate complex data, like text or video, in a sequential manner. The generation of each element (a word, a pixel, or a video frame) is conditioned on the elements generated before it. The joint probability of a sequence is factorized as . LLMs like GPT are classic examples of AR models. In video, an AR model generates frame based on frames .
-
Teacher Forcing: This is a standard training technique for AR models. During training, to predict the next element , the model is given the ground-truth previous element as input, not its own prediction from the previous step. This stabilizes training and allows for parallel processing of the entire sequence. However, it creates a discrepancy with inference time.
-
Exposure Bias: This is the direct consequence of
teacher forcing. The model is never "exposed" to its own errors during training. At inference, it must use its own (potentially flawed) outputs as inputs for future steps. This can cause errors to compound, leading to a rapid decline in generation quality. -
Diffusion Transformer (DiT): This is a neural network architecture that adapts the Transformer (originally designed for text) to diffusion models. It treats an image or video as a sequence of "patches" (tokens) and uses self-attention to model relationships between them. This architecture has proven to be highly scalable and effective for high-resolution image and video generation.
3.2. Previous Works
-
Bidirectional Video Generation (e.g., Sora, Veo): These models generate all video frames jointly. Each frame's generation is influenced by both past and future frames through attention mechanisms. This allows for very high global consistency and quality but makes them non-causal. They cannot be used for real-time, interactive, or predictive tasks where the future is unknown.
-
Autoregressive Video Generation:
Self Forcing[32] andCausVid[77]: These are key baselines that also aim to solveexposure bias. Their approach is a form of distillation. They first train a large bidirectional "teacher" model. Then, they train a smaller autoregressive "student" model. During the student's training, it generates a video autoregressively, and the output is compared to the output of the teacher model. The student is trained to match the teacher's output distribution. The main drawback is the reliance on this powerful (and non-causal) teacher model.- Noise Injection [62, 67]: A simpler approach where small amounts of noise are added to the ground-truth history frames during training. This crudely approximates inference-time errors, but the paper argues that simple noise doesn't accurately reflect the complex, structured errors a model actually makes.
-
Efficient Attention for Video Generation:
- Standard Self-Attention: The computational core of Transformers. For a set of queries , keys , and values , the output is calculated as: Here, is the dimension of the keys. The complexity of this operation is quadratic with respect to the sequence length, making it a bottleneck for long videos.
- Sliding-Window Attention: A simple solution where each frame only attends to a fixed number of recent history frames. This is computationally cheap but sacrifices long-term dependency, leading to content drift over time.
Mixture of Block Attention (MoBA)[46]: An advanced sparse attention mechanism for LLMs. It dynamically selects a subset of "blocks" (groups of tokens) from the history for each query block to attend to, using a routing mechanism. The paper'shistory routingis directly inspired by this idea, adapting it for autoregressive video generation.
3.3. Technological Evolution
The field of video generation has evolved from early GAN-based methods to diffusion models. Initially, these were often bidirectional, prioritizing quality over causality (e.g., Sora). Recognizing the need for simulation and interactivity, the focus shifted towards autoregressive models. The first AR models suffered from exposure bias. The next generation of methods, like Self Forcing, addressed this using post-training distillation from teacher models. This paper represents the next step in this evolution: an end-to-end, teacher-free approach that integrates the correction for exposure bias directly into the training process itself, making the framework more elegant, scalable, and causally pure.
3.4. Differentiation Analysis
The core differentiation of this paper's method, Resampling Forcing, can be summarized in the following table:
| Method Type | Core Idea | Teacher Model? | End-to-End? | Causal Purity |
|---|---|---|---|---|
| Teacher Forcing | Train on ground-truth history. | No | Yes | High |
| Noise Augmentation | Add random noise to history. | No | Yes | High |
| Self Forcing / Distillation | Match output of a bidirectional teacher. | Yes (Required) | No (Post-training) | Low (Risk of leakage) |
| Resampling Forcing (This paper) | Simulate and train on own errors. | No (Teacher-free) | Yes | High |
The key innovation is replacing the external teacher with a self-correction loop. Instead of learning from a perfect, non-causal oracle, the model learns from its own simulated imperfections in a causally-strict manner.
4. Methodology
4.1. Principles
The fundamental principle of Resampling Forcing is to bridge the train-test gap (exposure bias) by making the model robust to its own errors. The intuition is that if a model is trained to generate a correct output (clean frame) even when given a slightly wrong input (degraded history), it will learn a "corrective" function rather than just a predictive one. During inference, when it inevitably produces a slightly wrong frame, this learned robustness will prevent the error from being amplified in subsequent steps. Instead, the model will stabilize, keeping the error level roughly constant.
4.2. Core Methodology In-depth
The methodology can be understood as a carefully designed training procedure that modifies the standard teacher forcing paradigm.
4.2.1. Background: Autoregressive Video Diffusion with Teacher Forcing
First, let's establish the baseline. An autoregressive video diffusion model generates a video of frames by factorizing the probability as a product of conditional probabilities, one for each frame:
-
is the probability of the -th frame given all previous frames and an external condition (e.g., a text prompt).
Each frame is generated using a diffusion process. Starting from pure noise , a neural network predicts the "velocity" to move towards the clean frame. The final frame is obtained by integrating this velocity from time to :
-
is the noisy version of frame at time .
-
is the neural network (a DiT in this case) that predicts the denoising direction.
-
are the history frames that the model is conditioned on.
The model is trained using a
Flow Matchingobjective. At a random time , a noisy frame is created by linearly interpolating the clean frame and a noise sample : The network is trained to predict the target velocity, which is simply . The loss function is the mean squared error between the predicted and target velocities: In standardteacher forcing, the history condition is the clean, ground-truth history. This is the source of exposure bias, as illustrated in the top part of Figure 2.
该图像是示意图,展示了单步生成与自回归生成的差异。上部分展示了教师强制方法中的历史帧与预测的关系,以及随之产生的累积错误;下部分展示了本研究提出的错误模拟方法,显示了有界错误的生成过程。图中标识了推断步骤和模型误差。
4.2.2. Core Innovation: Autoregressive Self-Resampling
Instead of using the clean history , Resampling Forcing trains the model on a degraded history that contains simulated model errors. This process is called self-resampling and is performed autoregressively for each video in a training batch.
The process to generate the degraded history is shown in Figure 3(a). For each frame from 1 to :
- Partial Noising: Take the clean ground-truth frame and noise it up to a specific timestep using Equation (3). This gives you .
- Self-Regeneration: Use the model's current weights to denoise back to a clean frame. Crucially, this denoising is conditioned on the previously generated degraded frames . This step simulates the full autoregressive generation process. The resulting degraded frame is given by:
-
is the history of already degraded frames, which simulates error propagation.
-
is the online model, meaning the model learns from its own current state.
-
Important: The gradients are detached from this entire self-resampling process. This prevents the model from learning a trivial solution (e.g., making the resampling process fail to produce errors). The resampling is only used to generate the input conditions.
After generating the full degraded history , the model is trained using the same loss function as before (Equation 4), but with the degraded history as input: Note that the prediction target remains the clean frame , while the condition is the imperfect history .
-
该图像是图示,展示了自回归重采样、并行训练和因果掩码的过程。图(a)说明了通过因果 DiT 进行自回归重采样的步骤,图(b)展示了并行训练中的扩散损失,而图(c)展示了因果掩码的结构。这些方法旨在改善视频扩散模型的训练和生成效果。
4.2.3. Controlling the Simulation Strength
The simulation timestep is a crucial hyperparameter.
-
If is very small (close to 0), the resampling is weak, is very similar to , and the training is close to teacher forcing. This risks error accumulation.
-
If is very large (close to 1), the resampling is strong, allowing the model to deviate significantly from the history. This risks
content drifting.To balance this, the paper samples from a
LogitNormal(0, 1)distribution. The logit of follows a standard normal distribution: This distribution naturally concentrates probability mass in the intermediate range, avoiding extreme values. To further control the distribution, a shifting parameter is applied: -
pushes the distribution of towards lower values (weaker resampling).
-
pushes it towards higher values (stronger resampling). The authors use , biasing the simulation towards weaker resampling to maintain faithfulness to the history.
4.2.4. Algorithm Summary and Warmup
The complete training process is detailed in Algorithm 1. In the initial stages of training, the model is too random to generate meaningful content, so self-resampling would just introduce noise. Therefore, the authors first warm up the model with standard teacher forcing until it can generate basic autoregressive sequences, and only then switch to Resampling Forcing.
| Algorithm 1 Resampling Forcing | |
| Require: Video Dataset D | |
| Require: Shift Parameter s | |
| Require: Autoregressive Video Diffusion Model vθ(·) | |
| 1: while not converged do | |
| 2: | ▷ sample simulation timestep |
| 3: | ▷ shift timestep (equation (7)) |
| 4: Sample video and condition | |
| 5: with gradient disabled do | |
| 6: for to do | ▷ autoregressive resampling |
| 7: | ▷ using numerical solver and KV cache (equation (5)) |
| 8: end for | |
| 9: end with | |
| 10: Sample training timestep | |
| 11: Sample | |
| 12: | |
| 13: `\mathcal{L} \leftarrow \frac{1}{N} \sum_{i=1}^N \|(\epsilon^i - x^i) - v_\theta(x_t^i, \tilde{x}^{ | ▷ parallel training with causal mask (equation (4)) |
| 14: Update with gradient descent | |
| 15: end while | |
| 16: return |
4.2.5. Efficient Long-Horizon Attention: History Routing
For long videos, the history grows, making the attention computation in increasingly slow. To solve this, the paper introduces history routing, a sparse attention mechanism. For a query token in the current frame, instead of attending to all tokens in the history keys and values , it only attends to a small subset.
该图像是一个示意图,展示了历史路由机制。图中,Top-K Router 动态选择了与当前帧相关的前 个重要帧,通过查询令牌 进行注意力机制处理,以实现高效的视频生成。
-
Selection: For a query , it finds the top- most relevant history frames. The relevance is measured by the dot product between the query and a descriptor for each history frame , . The paper uses a simple mean pooling for , which is parameter-free. The set of indices of the selected frames is: where .
-
Attention: The attention is then computed only over the keys and values corresponding to the selected frames: This reduces the attention complexity from being linear in the history length to being constant . Since different tokens and attention heads can select different sets of frames, the model can still capture complex, long-range dependencies.
5. Experimental Setup
5.1. Datasets
The paper does not specify the exact training dataset used. It mentions training on a large-scale dataset of 5s and 15s videos. Given the baseline model (WAN2.1) and prompts used in the examples (e.g., "A red balloon floating in an abandoned street"), it is likely a large, diverse internal dataset of text-video pairs curated by ByteDance, similar in nature to those used for training other large-scale generative models. The choice to train on both 5s and 15s videos is to validate the method's ability to handle and benefit from native long-video training.
5.2. Evaluation Metrics
The paper uses VBench [34], a comprehensive benchmark suite for evaluating video generation models. VBench assesses models across multiple dimensions. The key metrics reported are:
-
Temporal Quality:
- Conceptual Definition: This metric evaluates the smoothness, consistency, and realism of motion and temporal changes in the video. It penalizes flickering, object disappearance/reappearance, and unnatural motion.
- Calculation:
VBenchuses multiple sub-metrics for this, includingTemporal Flickering(measured by computing the temporal variation of frame-level features) andMotion Smoothness(often evaluated using optical flow consistency or specialized models trained to detect jerky motion). A higher score indicates better temporal quality.
-
Visual Quality:
- Conceptual Definition: This measures the aesthetic quality of individual frames in the video, such as sharpness, clarity, color fidelity, and absence of artifacts.
- Calculation:
VBenchtypically uses no-reference image quality assessment (NR-IQA) models, such asDOVER, which are trained on human preference scores to predict the perceptual quality of an image. The scores are averaged across frames. A higher score means better visual quality.
-
Text-Video Alignment (Text):
- Conceptual Definition: This measures how well the generated video corresponds to the meaning and details of the input text prompt.
- Calculation: This is typically measured using a video-text retrieval model like
CLIPor a dedicated Video-Language Model. The cosine similarity between the embeddings of the generated video and the text prompt is calculated. A higher similarity score indicates better alignment. The formula for cosine similarity between two vectors and is:- and are the feature vectors (embeddings) of the video and text, respectively.
- is the dimension of the feature vectors.
5.3. Baselines
The paper compares its method against a range of recent autoregressive video generation models:
-
SkyReelsV2[11]: A clip-level autoregressive model that generates 5s segments sequentially. -
MAGI-1[60]: A model that relaxes strict causality by starting the generation of the next chunk before the current one is fully finished. -
NOVA[17]: An autoregressive model that operates without vector quantization. -
Pyramid Flow[35]: An efficient video generative model based on flow matching. -
CausVid[77]: A distillation-based model that learns from a bidirectional teacher. -
Self Forcing[32]: A key baseline representing the state-of-the-art in post-training distillation for autoregressive models. -
LongLive[73]: A concurrent work similar toSelf Forcingbut designed for longer videos, also using distillation.These baselines are representative because they cover different strategies for autoregressive video generation: simple autoregression, relaxed causality, and distillation-based alignment. The comparison with
CausVid,Self Forcing, andLongLiveis particularly important, as these are the main competitors that rely on teacher models, which the proposed method aims to replace.
6. Results & Analysis
6.1. Core Results Analysis
The main quantitative results are presented in Table 1, where the performance on 15-second videos is broken down into three 5-second segments to analyze quality over time.
The following are the results from Table 1 of the original paper:
| Method | #Param | Teacher Model | Video Length = 0-5 s | Video Length = 5-10 s | Video Length = 10-15 s | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Temporal | Visual | Text | Temporal | Visual | Text | Temporal | Visual | Text | |||
| SkyReels-V2 [11] | 1.3B | - | 81.93 | 60.25 | 21.92 | 84.63 | 59.71 | 21.55 | 87.50 | 58.52 | 21.30 |
| MAGI-1 [60] | 4.5B | - | 87.09 | 59.79 | 26.18 | 89.10 | 59.33 | 25.40 | 86.66 | 59.03 | 25.11 |
| NOVA [17] | 0.6B | - | 87.58 | 44.42 | 25.47 | 88.40 | 35.65 | 20.15 | 84.94 | 30.23 | 18.22 |
| Pyramid Flow [35] | 2.0B | - | 81.90 | 62.99 | 27.16 | 84.45 | 61.27 | 25.65 | 84.27 | 57.87 | 25.53 |
| CausVid [77] | 1.3B | WAN2.1-14B(5s) | 89.35 | 65.80 | 23.95 | 89.59 | 65.29 | 22.90 | 87.14 | 64.90 | 22.81 |
| Self Forcing [32] | 1.3B | WAN2.1-14B(5s) | 90.03 | 67.12 | 25.02 | 84.27 | 66.18 | 24.83 | 84.26 | 63.04 | 24.29 |
| LongLive [73] | 1.3B | WAN2.1-14B(5s) | 81.84 | 66.56 | 24.41 | 81.72 | 67.05 | 23.99 | 84.57 | 67.17 | 24.44 |
| Ours (75% sparsity) | 1.3B | - | 90.18 | 63.95 | 24.12 | 89.80 | 61.95 | 24.19 | 87.03 | 61.01 | 23.35 |
| Ours | 1.3B | - | 91.20 | 64.72 | 25.79 | 90.44 | 64.03 | 25.61 | 89.74 | 63.99 | 24.39 |
Analysis:
-
Superior Temporal Quality: The proposed method ("Ours") consistently achieves the highest
Temporalquality score across all time segments (91.20, 90.44, 89.74). This is a direct validation of the core hypothesis:Resampling Forcingeffectively mitigates error accumulation, leading to more stable and coherent videos over time. Notably, the distillation-basedSelf Forcingmodel shows a significant drop in temporal quality from the first to the second segment (90.03 to 84.27), indicating it is more vulnerable to error accumulation in longer rollouts. -
Teacher-Free and Competitive: The most significant finding is that "Ours" achieves this performance without a teacher model. It is competitive with or better than
CausVid,Self Forcing, andLongLive, all of which require a massive 14B-parameter teacher model. This demonstrates a major leap in training efficiency and practicality. -
Effectiveness of History Routing: The "Ours (75% sparsity)" result, which uses
history routingwith , shows only a very minor drop in performance compared to the dense attention version ("Ours"). This confirms thathistory routingis an effective mechanism for reducing computational cost with negligible quality loss, making long-horizon generation feasible. -
Qualitative Causality (Figure 5): The qualitative comparison in Figure 5 provides a powerful, intuitive argument. While
LongLiveproduces a physically impossible video (milk level rising and falling during continuous pouring), the proposed model maintains strict causality. This suggests that distillation from a non-causal teacher can "contaminate" the student model with non-causal behaviors, a problem that the end-to-end, causally pure training ofResampling Forcingavoids.
该图像是展示了不同视频生成模型的定性比较,包括我们的模型与其他几种技术在长视频生成中的稳定性。上半部分展示了各模型的输出效果,下半部分则对比了在倒牛奶过程中液体的变化,红色箭头强调了每帧液体的水平变化。
6.2. Ablation Studies / Parameter Analysis
6.2.1. Error Simulation Strategies
Table 2 compares the proposed autoregressive resampling with two simpler strategies for simulating errors.
The following are the results from Table 2 of the original paper:
| Simulation Strategies | Temporal | Visual | Text |
|---|---|---|---|
| noise augmentation | 87.15 | 61.90 | 21.44 |
| resampling - parallel | 88.01 | 62.51 | 24.51 |
| resampling - autoregressive | 90.46 | 64.25 | 25.26 |
Analysis: Autoregressive resampling significantly outperforms the other two methods. This is strong evidence that both aspects of the simulation are crucial:
- Resampling > Noise: Simulating errors by regenerating frames (
resampling) is better than just adding random noise (noise augmentation), because the model's errors have a specific structure that noise does not capture. - Autoregressive > Parallel: Simulating error accumulation (
autoregressive) is better than simulating per-frame errors independently (parallel). This confirms that training the model to handle propagated errors is key to long-term stability.
6.2.2. Simulation Timestep Shifting
Figure 6 shows the effect of the shifting factor , which controls the strength of the self-resampling.

Analysis:
- Small (Slight Shift): This corresponds to weak resampling, close to teacher forcing. The figure shows that this leads to quality degradation over time, demonstrating error accumulation.
- Large (Large Shift): This corresponds to strong resampling, where the model is allowed to deviate heavily from the history. This leads to
content drifting, where the generated content loses semantic consistency with the initial frames. - Moderate : This strikes the right balance, mitigating error accumulation while preserving historical context, leading to a stable and coherent generation.
6.2.3. Sparse History Strategies
Figure 7 compares history routing with dense causal attention and the common sliding-window attention.

Analysis:
History Routingvs.Dense Attention: Routing to the top-5 frames (75% sparsity) yields quality nearly identical to dense attention, validating its efficiency. Even with extreme sparsity (top-1, 95% sparsity), the degradation is minor.History Routingvs.Sliding Window: This is the most telling comparison. Bothtop-1 routingand asliding window of size 1have the same sparsity (attending to only one past frame). However,history routingmaintains much better consistency (the fish's appearance is stable). This is because the sliding window is restricted to the immediately preceding frame, exacerbating drift. In contrast,history routingcan dynamically choose any frame from the past (e.g., the very first frame) that is most relevant, allowing it to maintain global consistency.
6.2.4. History Routing Frequency
Figure 8 visualizes which history frames are selected by the router.
该图像是图表,展示了在生成第21帧时,前20帧被选择的频率。不同颜色的条形图分别表示 top-1、top-3、top-5 和 top-7 的选择频率,最大值为 0.524,位于第20帧。
Analysis: The visualization reveals a fascinating and intuitive pattern. The router learns to prioritize two types of frames:
- The most recent frames: These are crucial for maintaining short-term continuity and motion.
- The very first frames: These act as an "attention sink," providing a global context anchor that helps prevent semantic drift over long periods. This hybrid "sliding-window" + "attention-sink" pattern has been proposed as a handcrafted heuristic in other works. This paper shows that a dynamic, content-aware routing mechanism can learn this optimal strategy automatically.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Resampling Forcing, a novel and effective end-to-end framework for training autoregressive video diffusion models. By proposing a self-resampling mechanism, it elegantly solves the long-standing problem of exposure bias without the need for complex post-training or large, external teacher models. This makes training more scalable, efficient, and causally pure. The method's effectiveness is demonstrated through extensive experiments, where it achieves state-of-the-art temporal consistency and competitive overall quality. Furthermore, the introduction of history routing provides a practical solution for efficient long-horizon video generation. This work marks a significant step towards building robust and scalable video world models.
7.2. Limitations & Future Work
The authors acknowledge two main limitations:
- Inference Latency: As a diffusion-based model, inference requires multiple iterative denoising steps, which is computationally slow and not suitable for real-time applications without further optimization. They suggest future work could explore few-step distillation or advanced samplers to accelerate inference.
- Training Cost: The training process involves processing two sequences simultaneously (the noisy diffusion samples and the clean history for the resampling target). While manageable, they suggest architecture optimizations could improve this.
7.3. Personal Insights & Critique
-
Personal Insights: The core concept of "learning from one's own simulated mistakes" is a powerful and generalizable principle. It feels more fundamentally sound than relying on an external, potentially mismatched supervisor. This idea could be applied to other autoregressive domains beyond video, such as long-form text generation, music composition, or any task prone to error accumulation. The
history routinganalysis is also insightful, as it shows that a simple, dynamic mechanism can learn complex and effective attention patterns that were previously handcrafted. -
Critique and Potential Issues:
- Computational Overhead of Resampling: While the paper mentions efficient implementation with KV caching, the
self-resamplingstep still introduces a non-trivial computational cost to the training loop, as it involves a full denoising process for every frame in the history. A more detailed analysis of this overhead would be beneficial. - Heuristics in Timestep Sampling: The choice of the
LogitNormaldistribution and the manual setting of the shifting parameter feel somewhat heuristic. A potential area for future work could be to learn an adaptive schedule for the resampling strength , perhaps dependent on the model's training stage or the content of the video. - Reproducibility: The experiments are built upon
WAN2.1-1.3B, a model whose weights may not be publicly available. This could pose a challenge for the research community to precisely reproduce the results and build upon this work. - Evaluation of Causality: The claim of stricter causality is primarily supported by a single, albeit compelling, qualitative example. A more systematic, quantitative evaluation of physical or logical consistency would further strengthen this important conclusion. For example, testing the model on a dedicated benchmark designed to probe physical laws.
- Computational Overhead of Resampling: While the paper mentions efficient implementation with KV caching, the
Similar papers
Recommended via semantic vector search.