* **Ablation on Loss Components (Table 2):** This table, transcribed below, shows the importance of each loss term.

| Loss functions | PSNR ↑ | SSIM ↑ | MS-SSIM ↑ | LPIPS ↓ | FID ↓ | CMMD ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Jtotal (1), JD (7) | 25.75 | 0.7630 | 0.9022 | 0.1170 | 5.48 | 0.074 |
| − J^SSL (1), (5) | 26.31 | 0.7794 | 0.9140 | 0.1096 | 5.79 | 0.122 |
| − J′ of Jrec (1), (2), (3) | 25.75 | 0.7487 | 0.8943 | 0.1756 | 18.64 | 0.594 |
| − JL2 of Jrec (1), (2) | 23.87 | 0.7304 | 0.8713 | 0.1307 | 6.10 | 0.107 |
| − JG (1), (6) | 27.11 | 0.8041 | 0.9178 | 0.1759 | 17.97 | 0.393 |
| No fine-tuning | 25.08 | 0.7690 | 0.9018 | 0.1207 | 5.82 | 0.385 |
**Analysis:** The "No fine-tuning" model has a high CMMD, indicating a perceptual gap to the target domain. Removing the GAN loss (−JG) or the perceptual loss (−J′) severely degrades the perceptual metrics (FID, CMMD), even though distortion metrics such as PSNR/SSIM may still look good. This confirms that for generative tasks, perceptual and adversarial losses are critical. The full Jtotal provides the best balance, especially in perceptual quality (lowest FID and CMMD). A hedged sketch of such a composite loss follows below.
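To make the interplay of these terms concrete, here is a minimal PyTorch sketch of how such a composite fine-tuning objective could be assembled. The weights `lam_*`, the `lpips_net` perceptual network, and the teacher/student feature hooks are illustrative assumptions; the paper's actual equations (1)–(7) and coefficients may differ.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a VQGAN-style composite loss in the spirit of
# Table 2. Weights and networks are placeholders, not the paper's values.
def total_loss(x, x_hat, feat_teacher, feat_student, logits_fake,
               lpips_net, lam_perc=1.0, lam_ssl=0.1, lam_gan=0.5):
    j_l2 = F.mse_loss(x_hat, x)                        # J_L2: pixel reconstruction
    j_perc = lpips_net(x_hat, x).mean()                # J': perceptual term (e.g., LPIPS)
    j_ssl = 1 - F.cosine_similarity(                   # J^SSL: match features of a
        feat_student, feat_teacher, dim=-1).mean()     #        teacher such as DINOv2
    j_gan = -logits_fake.mean()                        # J_G: hinge-style generator term
    return j_l2 + lam_perc * j_perc + lam_ssl * j_ssl + lam_gan * j_gan
```

Dropping any one term from this sum corresponds to one ablation row of Table 2; the discriminator loss JD is optimized separately, as is standard for GAN training.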
* **Qualitative Improvement (Figure 5):**

*Figure 5 shows an example region of interest (RoI) of an image after encoding, vector quantization, and decoding ($\mathrm{ENC} + \mathrm{VQ} + \mathrm{DEC} = \mathrm{TOK} + \mathrm{DEC}$). Subfigure (a) marks the RoI in the original driving-scene image $x_t$; (b) is a zoomed view of that RoI; (c) shows the poor reconstruction of the car objects (parked cars) by the pretrained VQGAN without domain-specific fine-tuning, with blurry detail; (d) shows the markedly improved image quality after the fine-tuning proposed in the paper, with clearly sharper car details.*

This figure visually confirms the quantitative results. The non-fine-tuned model (c) struggles to reconstruct the parked cars, producing blurry, distorted shapes. The proposed fine-tuned model (d) reconstructs the cars with much greater detail and structural integrity.

**Core Results: World Model and System Integration**

* **Teacher Model Ablation (Table 4):** This experiment shows that using DINOv2 as the teacher model for the J^SSL loss during tokenizer training yields the best video generation quality, achieving the lowest FVD score (178.97). This demonstrates that a better tokenizer leads to a better overall system.
* **Top-k Sampling Analysis (Table 5):** This is a key experiment showing the trade-off between determinism and creativity in the world model. The table below is transcribed from the paper.

| System | k | FID14 ↓ | CMMD14 ↓ | FVD14 ↓ |
| :--- | :-: | :---: | :---: | :---: |
| TOK+DEC | − | 7.19 | 0.048 | 104.35 |
| TOK+WM+DEC | 1 | 28.04 | 0.255 | 644.71 |
| | 5 | 20.26 | 0.114 | 296.95 |
| | 10 | 18.23 | 0.092 | 243.97 |
| | 50 | 14.72 | 0.083 | 178.97 |
| | 200 | 12.22 | 0.081 | 160.48 |
| | 1000 | 11.29 | 0.090 | 163.80 |
| TOK+VDEC | − | 8.85 | 0.155 | 74.29 |
| OpenViGA (TOK+WM+VDEC) | 1 | 28.61 | 0.411 | 646.03 |
| | 5 | 22.14 | 0.244 | 276.93 |
| | 10 | 19.94 | 0.215 | 210.71 |
| | 50 | 16.67 | 0.229 | 153.19 |
| | 200 | 14.22 | 0.241 | 136.41 |
| | 1000 | 13.29 | 0.248 | 132.16 |

**Analysis:** The TOK+DEC and TOK+VDEC rows represent transcoding of ground-truth frames (no prediction), setting a quality reference. For prediction (TOK+WM+...), a `top-k` of 1 (greedy sampling) performs very poorly, suggesting it gets stuck in repetitive or low-quality loops. As k increases, allowing more diverse token choices, quality improves significantly across all metrics. For the final `OpenViGA` system, k=1000 gives the best FVD score, indicating it produces the most realistic video motion. A minimal sketch of the sampling mechanism follows below.
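The paper's sampling code is not reproduced here, but top-k sampling is a standard mechanism; the following minimal PyTorch sketch (the function name `sample_top_k` is ours) shows what varying k does: k=1 collapses to greedy argmax decoding, while larger k renormalizes and samples over a wider slice of the world model's next-token distribution.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample one token id per row, restricted to the k most likely tokens.

    logits: (batch, vocab_size) next-token scores from the world model.
    k=1 reduces to greedy decoding; larger k admits more diverse tokens.
    """
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)   # keep the k best tokens
    probs = torch.softmax(topk_vals, dim=-1)              # renormalize over them
    choice = torch.multinomial(probs, num_samples=1)      # sample within the top-k
    return topk_idx.gather(-1, choice).squeeze(-1)        # map back to vocab ids
```

This makes the trend in Table 5 intuitive: greedy decoding (k=1) can lock the rollout into repetitive token loops, while allowing some diversity lets the model escape them at a small cost in per-frame fidelity.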
* **Final System Performance (Table 6):** This table, transcribed below, compares the final system using the image decoder (DEC) versus the video decoder (VDEC), each with its best `top-k` setting.
| System | k | FID14 ↓ | CMMD14 ↓ | FVD14 ↓ |
| :--- | :-: | :---: | :---: | :---: |
| TOK+WM+DEC | 200 | 12.22 | 0.081 | 160.48 |
| **OpenViGA (TOK+WM+VDEC)** | **1000** | **13.29** | **0.248** | **132.16** |
**Analysis:** The video decoder (`VDEC`) achieves a substantially better FVD score (132.16 vs. 160.48) than the simple image decoder (`DEC`). This confirms that incorporating temporal context during the decoding stage is crucial for generating higher-quality, more consistent videos. This justifies the adoption of VDEC in the final OpenViGA system.
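As an illustration of what "temporal context during decoding" can look like, here is a hypothetical 3D-convolutional decoder block in PyTorch. The channel sizes, the SiLU activation, and the three-frame temporal kernel (matching the VDEC context mentioned in the limitations) are our assumptions, not the paper's exact VDEC architecture.

```python
import torch
import torch.nn as nn

class TemporalDecoderBlock(nn.Module):
    """Hypothetical VDEC-style block: fuses a short window of latent frames
    with a 3D convolution before spatial upsampling."""

    def __init__(self, in_ch: int = 256, out_ch: int = 128, t_ctx: int = 3):
        super().__init__()
        # Temporal kernel spans t_ctx consecutive latent frames.
        self.fuse = nn.Conv3d(in_ch, in_ch, kernel_size=(t_ctx, 3, 3),
                              padding=(t_ctx // 2, 1, 1))
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4,
                                     stride=2, padding=1)
        self.act = nn.SiLU()

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, channels, time, height, width) latent video tensor
        z = self.act(self.fuse(z))                    # mix neighboring frames
        b, c, t, h, w = z.shape
        z = z.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        z = self.act(self.up(z))                      # upsample each frame 2x
        return z.reshape(b, t, -1, 2 * h, 2 * w).permute(0, 2, 1, 3, 4)
```

A purely per-frame image decoder (DEC) lacks the `fuse` step, so each frame is reconstructed independently; the temporal mixing is one plausible reason VDEC yields more consistent motion and a better FVD.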
* **Qualitative Video Examples (Figures 7 & 8):**

*This figure shows video generation results of the OpenViGA system. It presents several automotive driving-scene video sequences, each composed of a series of key frames (frames 1, 2, 6, 10, and 14). The frames depict roads and vehicles (e.g., a white van) in an urban environment, demonstrating the model's ability to predict realistic driving-scene video frames.*

*This figure illustrates OpenViGA's video generation results for automotive driving scenes. Each row shows one generated sequence, where each annotated frame (e.g., frames 1, 2, 6, 10, 14) is split vertically into two halves: the left half shows the original (ground-truth) frame, the right half the model's predicted frame. The examples depict urban streets in wet weather, including several taxis and a red car, demonstrating the system's ability to generate realistic future frames from an input video clip.*
These figures show sequences of generated frames. They demonstrate that OpenViGA can produce temporally coherent videos. For example, the scenes show consistent forward motion, with objects like cars and buildings moving realistically through the frame over time. The quality appears stable over the 14 predicted frames, without significant degradation or artifacts.
# 7. Conclusion & Reflections
* **Conclusion Summary:** The authors successfully created OpenViGA, a video generation system for automotive scenes that is fully open and reproducible. They demonstrated that by carefully streamlining and fine-tuning existing open-source models (VQGAN, LWM) on a public dataset (BDD100K), it is possible to achieve realistic video generation on academic-scale hardware. The paper's rigorous component-wise analysis provides valuable insights for future research, and its release of code and models serves as an important benchmark for the community.
* **Limitations & Future Work:**
* The authors acknowledge that their 3D CNN-based video decoder is simpler and likely produces lower-quality frames compared to state-of-the-art diffusion-based decoders.
* The temporal context of the VDEC is limited to three frames. Extending this could improve long-term coherence.
* The system operates at a relatively low resolution ($256 \times 256$) and frame rate (4 fps), which may not be sufficient for all practical autonomous driving applications.