- Core Results: The main quantitative results are presented in Table 1. AR-Diffusion consistently outperforms or is highly competitive with prior SOTA methods across all four datasets.
(This is a transcribed version of Table 1 from the paper.)
| Method | Taichi-HD [28] FVD16 | Taichi-HD [28] FVD128 | Sky-Timelapse [40] FVD16 | Sky-Timelapse [40] FVD128 | FaceForensics [25] FVD16 | FaceForensics [25] FVD128 | UCF-101 [31] FVD16 | UCF-101 [31] FVD128 |
|---|---|---|---|---|---|---|---|---|
| Generative Adversarial Models | | | | | | | | |
| MoCoGAN [35] | - | - | 206.6 | 575.9 | 124.7 | 257.3 | 2886.9 | 3679.0 |
| + StyleGAN2 backbone | - | - | 85.9 | 272.8 | 55.6 | 309.3 | 1821.4 | 2311.3 |
| MoCoGAN-HD [34] | - | - | 164.1 | 878.1 | 111.8 | 653.0 | 1729.6 | 2606.5 |
| DIGAN [45] | 128.1 | 748.0 | 83.11 | 196.7 | 62.5 | 1824.7 | 1630.2 | 2293.7 |
| StyleGAN-V [30] | 143.5 | 691.1 | 79.5 | 197.0 | 47.4 | 89.3 | 1431.0 | 1773.4 |
| MoStGAN-V [27] | - | - | 65.3 | 162.4 | 39.7 | 72.6 | - | - |
| Auto-regressive Generative Models | | | | | | | | |
| VideoGPT [41] | - | - | 222.7 | - | 185.9 | - | 2880.6 | - |
| TATS [9] | 94.6 | | 132.6 | | - | | 420 | |
| Synchronous Diffusion Generative Models | | | | | | | | |
| Latte [19] | 159.6 | | 59.8 | - | 34.0 | | 478.0 | |
| VDM [46] | 540.2 | | 55.4 | 125.2 | 355.9 | | 343.6 | 648.4 |
| LVDM [11] | 99.0 | | 95.2 | - | | | 372.9 | 1531.9 |
| VIDM [20] | 121.9 | 563.6 | 57.4 | 140.9 | | | 294.7 | |
| Asynchronous Diffusion Generative Models | | | | | | | | |
| FVDM [17] | 194.6 | | 106.1 | | 55.0 | 555.2 | 468.2 | |
| Diffusion Forcing [4] | 202.0 | 738.5 | 251.9 | 895.3 | 175.5 | 99.5 | 274.5 | 836.3 |
| AR-Diffusion (ours) | 66.3 | 376.3 | 40.8 | 175.5 | 71.9 | 265.7 | 186.6 | 572.3 |
- Key Findings: AR-Diffusion achieves the best FVD16 scores on Taichi-HD, Sky-Timelapse, and UCF-101. On UCF-101, its FVD16 of 186.6 is a significant improvement over all baselines, including the prior best asynchronous model, FVDM (468.2), demonstrating its effectiveness on complex action videos.
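For context, FVD16 and FVD128 denote the Fréchet Video Distance computed on 16-frame and 128-frame clips, respectively: real and generated clips are embedded with a pretrained I3D network and the Fréchet distance between the two Gaussian-fitted feature distributions is reported. The sketch below shows only that final distance computation, assuming the I3D features have already been extracted; the function name and array shapes are illustrative, not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    real_feats, gen_feats: arrays of shape (num_videos, feat_dim), e.g. I3D
    embeddings of 16-frame clips (FVD16) or 128-frame clips (FVD128).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; drop tiny imaginary parts
    # that can arise from numerical error.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```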
- Ablations / Parameter Sensitivity: Table 2 explores the impact of the inference timestep difference s.
(This is a transcribed version of Table 2 from the paper.)
Columns report FID-img, FID-vid, and FVD for FaceForensics [25] (FF), Sky-Timelapse [40] (Sky), Taichi-HD [28] (Taichi), and UCF-101 [31] (UCF), followed by inference time in seconds.

| s | FF FID-img | FF FID-vid | FF FVD | Sky FID-img | Sky FID-vid | Sky FVD | Taichi FID-img | Taichi FID-vid | Taichi FVD | UCF FID-img | UCF FID-vid | UCF FVD | Inference Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16-frame Video Generation | | | | | | | | | | | | | |
| 0 | 14.0 | 6.9 | 71.9 | 10.0 | 9.2 | 40.8 | 13.8 | 9.2 | 80.9 | 30.3 | 17.6 | 194.4 | 2.4 |
| 5 | 14.0 | 6.2 | 78.1 | 11.1 | 11.6 | 55.2 | 13.0 | 5.9 | 66.3 | 30.0 | 17.7 | 194.0 | 5.2 |
| 10 | 13.6 | 6.1 | 84.4 | 10.3 | 11.2 | 57.6 | 12.4 | 5.8 | 70.9 | 30.1 | 18.8 | 212.2 | 7.9 |
| 15 | 14.3 | 6.6 | 83.5 | 9.4 | 11.4 | 55.6 | 12.2 | 5.8 | 69.4 | 30.0 | 16.3 | 186.6 | 10.8 |
| 20 | 14.8 | 6.3 | 83.3 | 9.2 | 10.9 | 56.3 | 12.7 | 6.0 | 67.0 | 31.0 | 17.4 | 201.1 | 13.6 |
| 25 | 14.1 | 6.1 | 79.0 | 9.7 | 10.5 | 48.4 | 12.9 | 6.5 | 75.1 | 29.6 | 17.3 | 191.6 | 16.4 |
| 50 | 14.2 | 6.1 | 82.8 | 10.1 | 10.7 | 50.6 | 13.1 | 5.9 | 71.7 | 29.5 | 17.1 | 192.6 | 30.5 |
| 128-frame Video Generation | | | | | | | | | | | | | |
| 5 | 14.7 | 9.5 | 265.7 | 12.2 | 25.2 | 185.1 | 8.9 | 10.8 | 376.3 | 32.5 | 24.4 | 592.7 | 42.1 |
| 10 | 15.2 | 8.9 | 278.1 | 12.1 | 23.9 | 182.6 | 8.8 | 12.2 | 401.9 | 31.5 | 24.9 | 605.3 | 78.0 |
| 25 | 15.4 | 9.3 | 348.6 | 12.2 | 22.8 | 175.5 | 8.8 | 12.3 | 402.5 | 31.8 | 23.3 | 572.3 | 184.8 |
- Analysis: The optimal value of s varies by dataset. For 16-frame generation, s=0 (synchronous) is best for FaceForensics and Sky-Timelapse, while a small non-zero s (e.g., 5-15) is better for the more complex Taichi-HD and UCF-101 datasets. This shows that a degree of asynchronicity is beneficial for capturing complex motion. Inference time increases linearly with s, highlighting the trade-off between performance and efficiency. For longer 128-frame videos, a small s=5 provides a good balance.
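To make the role of s concrete, the sketch below builds one possible asynchronous denoising schedule in which frame f lags frame 0 by f·s steps, so the per-frame timesteps are non-decreasing along the temporal axis at every iteration, and s=0 collapses to a synchronous schedule. This is an illustrative reading of the timestep-difference idea; the clipping behaviour and all names are assumptions, not the paper's exact sampler.

```python
import numpy as np

def async_timesteps(num_frames: int, total_steps: int, s: int) -> np.ndarray:
    """Per-frame timesteps for each denoising iteration (illustrative sketch).

    Frame f starts f * s iterations behind frame 0, so at any iteration the
    timesteps are non-decreasing across frames (earlier frames are cleaner).
    s = 0 gives a synchronous schedule; larger s makes generation more
    auto-regressive but adds (num_frames - 1) * s extra iterations.
    """
    num_iters = total_steps + (num_frames - 1) * s
    schedule = np.empty((num_iters, num_frames), dtype=int)
    for it in range(num_iters):
        for f in range(num_frames):
            t = total_steps - 1 - (it - f * s)   # frame f lags by f * s iterations
            schedule[it, f] = int(np.clip(t, 0, total_steps - 1))
    return schedule

# Example: 16 frames, 50 denoising steps, s = 5.
# schedule[0]  -> all frames at the maximum timestep (pure noise)
# schedule[5]  -> frame 0 has advanced 5 steps, later frames still at the maximum
# schedule[-1] -> all frames at t = 0 (fully denoised)
sched = async_timesteps(num_frames=16, total_steps=50, s=5)
print(sched.shape)  # (125, 16): inference length grows linearly with s
```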
The paper also includes a crucial ablation study in Table 3.
(This is a transcribed version of Table 3 from the paper.)
| Configuration | FID | FVD-img | FVD |
|---|---|---|---|
| AR-Diffusion (full model) | 12.2 | 13.4 | 62.8 |
| w/o FoPP Timestep Scheduler | 11.0 | 16.8 | 101.0 |
| w/o Improved VAE | 13.1 | 29.6 | 148.3 |
| w/o Temporal Causal Attention | 15.9 | 50.2 | 209.8 |
| w/o x0 Prediction Loss | 27.9 | 58.0 | 257.6 |
| w/o Non-decreasing Constraint | 32.2 | 87.9 | 272.5 |
- Analysis: Removing each component significantly degrades performance (higher FVD). The most critical components are the Non-decreasing Constraint, x0 Prediction Loss, and Temporal Causal Attention, whose removal causes FVD to rise sharply (from 62.8 for the full model to over 200). This confirms that every proposed component contributes meaningfully to the model's performance.
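Of the ablated components, temporal causal attention is the most straightforward to illustrate: tokens of a given frame may attend to the current and earlier frames but not to later ones. Below is a minimal sketch of such a frame-level causal mask; the tensor layout, function name, and use with PyTorch's scaled_dot_product_attention are assumptions for illustration, not the paper's implementation.

```python
import torch

def temporal_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask of shape (L, L), with L = num_frames * tokens_per_frame.

    Entry (i, j) is True where attention is allowed: token i may attend to
    token j only if j belongs to the same or an earlier frame.
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2, ...] for 2 tokens/frame.
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)

# Usage: pass as attn_mask to torch.nn.functional.scaled_dot_product_attention,
# which treats True as "may attend" and False as "masked out".
mask = temporal_causal_mask(num_frames=4, tokens_per_frame=2)
print(mask.int())
```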
- Qualitative Comparison: Figure 4 visually compares generated videos. The results show that AR-Diffusion produces videos with more dynamic and significant motion than the baselines. For instance, on Taichi-HD, other methods show minimal movement, while AR-Diffusion generates clear figures performing noticeable actions.
This image is Figure 4, showing a qualitative comparison between the AR-Diffusion model and other existing video generation methods. The image is divided into three parts, corresponding to the UCF-101, Sky-Timelapse, and Taichi-HD datasets. Each part uses multiple rows of video frame sequences to directly compare how well the different models generate visually realistic and temporally coherent videos, with AR-Diffusion's results shown at the bottom of each dataset panel.
- Training Stability: The paper compares the training loss curves of AR-Diffusion and Diffusion Forcing (a fully asynchronous model). AR-Diffusion's loss curve (Figure 6) is much smoother and more stable than that of Diffusion Forcing (Figure 5), which exhibits large spikes. This provides strong evidence that the non-decreasing constraint successfully stabilizes the training of an asynchronous diffusion model.
This image is Figure 5, showing the loss curve of Diffusion Forcing [4] on the UCF-101 dataset. The blue line plots the training loss against the step count, dropping quickly from a high initial value and then leveling off, indicating that the model converges during training. The shaded region around the loss curve likely represents the fluctuation range or standard deviation of the loss, reflecting training stability.
(Note: The paper's text incorrectly labels the above image as Figure 6. It corresponds to AR-Diffusion's loss curve.)