<caption><strong>Chunk-wise AR</strong></caption>
<thead>
<tr>
<th></th>
<th colspan="3">Evaluation scores ↑</th>
</tr>
<tr>
<th></th>
<th>Total Score</th>
<th>Quality Score</th>
<th>Semantic Score</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="4"><em>Many (50×2)-step models</em></td>
</tr>
<tr>
<td>Diffusion Forcing (DF)</td>
<td>82.95</td>
<td>83.66</td>
<td>80.09</td>
</tr>
<tr>
<td>Teacher Forcing (TF)</td>
<td>83.58</td>
<td>84.34</td>
<td>80.52</td>
</tr>
<tr>
<td colspan="4"><em>Few (4)-step models</em></td>
</tr>
<tr>
<td>DF + DMD</td>
<td>82.76</td>
<td>83.49</td>
<td>79.85</td>
</tr>
<tr>
<td>TF + DMD</td>
<td>82.32</td>
<td>82.73</td>
<td>80.67</td>
</tr>
<tr>
<td><strong>Self Forcing (Ours, DMD)</strong></td>
<td><strong>84.31</strong></td>
<td><strong>85.07</strong></td>
<td><strong>81.28</strong></td>
</tr>
<tr>
<td>Self Forcing (Ours, SiD)</td>
<td>84.07</td>
<td>85.52</td>
<td>78.24</td>
</tr>
<tr>
<td>Self Forcing (Ours, GAN)</td>
<td>83.88</td>
<td>85.06</td>
<td>-</td>
</tr>
</tbody>
</table></div>
</td>
<td>
<div class="table-wrapper"><table>
<caption><strong>Frame-wise AR</strong></caption>
<thead>
<tr>
<th></th>
<th colspan="3">Evaluation scores ↑</th>
</tr>
<tr>
<th></th>
<th>Total Score</th>
<th>Quality Score</th>
<th>Semantic Score</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="4"><em>Many (50×2)-step models</em></td>
</tr>
<tr>
<td>Diffusion Forcing (DF)</td>
<td>77.24</td>
<td>79.72</td>
<td>67.33</td>
</tr>
<tr>
<td>Teacher Forcing (TF)</td>
<td>80.34</td>
<td>81.34</td>
<td>76.34</td>
</tr>
<tr>
<td colspan="4"><em>Few (4)-step models</em></td>
</tr>
<tr>
<td>DF + DMD</td>
<td>80.56</td>
<td>81.02</td>
<td>78.71</td>
</tr>
<tr>
<td>TF + DMD</td>
<td>78.12</td>
<td>79.62</td>
<td>72.11</td>
</tr>
<tr>
<td><strong>Self Forcing (Ours, DMD)</strong></td>
<td><strong>84.26</strong></td>
<td><strong>85.25</strong></td>
<td><strong>80.30</strong></td>
</tr>
<tr>
<td>Self Forcing (Ours, SiD)</td>
<td>83.54</td>
<td>-</td>
<td>78.86</td>
</tr>
<tr>
<td>Self Forcing (Ours, GAN)</td>
<td>83.27</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table></div>
</td>
Rolling KV Cache Analysis: Figure 7 qualitatively shows the importance of the proposed training trick for the rolling KV cache. The naive implementation leads to severe blurring and artifacts in extrapolated frames, while the proposed method with a restricted attention window during training maintains high quality.
Figure 7 (description): Qualitative comparison of video extrapolation with the rolling KV cache, contrasting the naive baseline against the proposed method. Frames extrapolated by the naive baseline exhibit severe visual artifacts such as blurring and distortion, especially in fine details, while the proposed method yields sharper, more realistic frames without noticeable artifacts, highlighting its advantage in maintaining quality during extrapolation.
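To make the rolling-KV-cache idea concrete, below is a minimal sketch of the two pieces involved: a fixed-size cache that evicts the oldest frame's keys/values at inference time, and a training-time attention mask restricted to the same window so the model never learns to depend on context the cache will have discarded. All names here (`RollingKVCache`, `local_window_mask`) are hypothetical illustrations, not the authors' code.

```python
import torch

class RollingKVCache:
    """Fixed-capacity per-layer KV cache: once full, the oldest frame's
    keys/values are evicted so memory and attention cost stay constant.
    Hypothetical sketch; layout and names are ours, not the paper's."""

    def __init__(self, max_frames: int):
        self.max_frames = max_frames
        self.keys: list[torch.Tensor] = []    # one (tokens, dim) tensor per frame
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.max_frames:  # roll: evict the oldest frame
            self.keys.pop(0)
            self.values.pop(0)

    def kv(self) -> tuple[torch.Tensor, torch.Tensor]:
        return torch.cat(self.keys, dim=0), torch.cat(self.values, dim=0)


def local_window_mask(num_frames: int, tokens_per_frame: int, window: int) -> torch.Tensor:
    """Training-side counterpart: a causal mask that also limits each frame's
    attention to the `window` most recent frames, matching what the rolling
    cache will expose at inference time (True = may attend)."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    q, k = frame_idx[:, None], frame_idx[None, :]
    return (k <= q) & (k > q - window)
```

The key point the figure illustrates is the train/test match: if training attends over the full sequence but inference only sees the rolling window, the model extrapolates poorly; restricting the training window to the cache size closes that gap.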
Training Efficiency: Figure 6 reveals a surprising and crucial finding: the per-iteration training time of Self Forcing is comparable to, or even lower than, that of parallel methods like TF and DF. This is because SF uses standard, highly optimized full-attention kernels (such as FlashAttention), whereas TF/DF require custom causal attention masks that are less well optimized. The right panel shows that SF not only trains quickly but also converges to higher quality in the same amount of wall-clock time, making it superior in both performance and efficiency.
Figure 6 (description): Comparison of training efficiency across video diffusion models. The left bar chart compares per-iteration training time for generator and discriminator updates across chunk-wise, few-step autoregressive video diffusion algorithms (Diffusion Forcing, Teacher Forcing, and Self Forcing with different step counts). The right line chart plots video quality (VBench score) against wall-clock training time. Self Forcing generally has lower per-iteration training time and reaches higher video quality within the same training budget.
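A rough illustration of the kernel argument (a sketch with assumed shapes, not a benchmark): Self Forcing's per-chunk denoising is an unmasked attention call, which PyTorch can dispatch to fused kernels such as FlashAttention, whereas parallel TF/DF training needs an explicit block-causal mask, which generally falls back to a slower masked path.

```python
import torch
import torch.nn.functional as F

B, H, D = 2, 8, 64                     # batch, heads, head dim (assumed sizes)
frames, tokens = 16, 256               # context frames, tokens per frame (assumed)

# Self Forcing: the current chunk attends over [cached context + itself] with
# ordinary full attention -- no mask, so fused kernels (e.g. FlashAttention) apply.
q = torch.randn(B, H, tokens, D)
kv = torch.randn(B, H, frames * tokens + tokens, D)
out_sf = F.scaled_dot_product_attention(q, kv, kv)

# Teacher/Diffusion Forcing: all frames are processed in one parallel pass, so a
# block-causal mask must forbid attention to future frames. Arbitrary boolean
# masks typically rule out the fastest fused-kernel code paths.
S = frames * tokens
fid = torch.arange(frames).repeat_interleave(tokens)
block_causal = fid[:, None] >= fid[None, :]          # (S, S), True = may attend
qf = torch.randn(B, H, S, D)
out_tf = F.scaled_dot_product_attention(qf, qf, qf, attn_mask=block_causal)
```

The trade-off is that SF pays for the sequential rollout with multiple smaller attention calls, but each call runs on the fast unmasked path, which is consistent with the per-iteration timings reported in the left panel of Figure 6.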
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces Self Forcing, a training paradigm that resolves the long-standing exposure-bias problem in autoregressive video models. By aligning the training and inference processes and optimizing a holistic, video-level loss, the method significantly reduces error accumulation, yielding video generators that are not only high-quality (matching or surpassing slower, non-causal systems) but also highly efficient, enabling real-time generation with low latency.
- Limitations & Future Work: The authors acknowledge two main limitations:
  - Quality can still degrade when generating videos that are substantially longer than the context length seen during training.
  - The gradient truncation strategy, while necessary for efficiency, might limit the model's ability to learn very long-range temporal dependencies.
  Future research could focus on better extrapolation techniques and on exploring alternative architectures such as state-space models (SSMs), which are inherently recurrent and may better balance efficiency and long-context modeling.
- Personal Insights & Critique:
- This paper presents a very elegant and powerful solution to a fundamental problem. The core idea of "train as you test" is not new, but its application to autoregressive diffusion models, combined with clever efficiency optimizations, is a significant contribution.
- The finding on training efficiency is particularly impactful. It challenges the conventional wisdom that parallelizable training is always superior and suggests that for certain sequence modeling tasks, a well-designed sequential training loop can be more efficient by leveraging optimized kernels.
- The proposed paradigm of "parallel pre-training and sequential post-training" is a compelling direction for the future. A large model can be pre-trained efficiently in a parallel fashion (like a standard DiT), and then fine-tuned with Self Forcing to adapt it for autoregressive inference. This combines the strengths of both worlds.
- The work beautifully integrates three major families of generative models: the sequential structure of AR models, the high-fidelity generation of Diffusion models, and the powerful distribution-matching principle from GANs. This synthesis demonstrates the complementary nature of these approaches.
- The authors also responsibly include a discussion on the Broader Societal Impact in the appendix, acknowledging the dual-use nature of real-time video generation technology and the potential for misuse in creating deepfakes and spreading disinformation. This is a crucial consideration for such powerful technology.