OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data
TL;DR Summary
OpenViGA presents an open, reproducible system for automotive driving scene video generation. It fine-tunes powerful open-source models (e.g., VQGAN, LWM) on the public BDD100K dataset, systematically evaluating the tokenizer, world model, and decoder components. The streamlined system predicts realistic driving videos frame by frame at 256×256 and 4 fps, with code and models released on GitHub.
Abstract
Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are: First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system by separate quantitative and qualitative evaluation: image tokenizer, world model, and video decoder. Second, we build purely upon powerful pre-trained open-source models from various domains, which we fine-tune with publicly available automotive data (BDD100K) on GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining the interfaces of our components. Fourth, due to the public availability of the underlying models and data, we allow full reproducibility. Finally, we also publish our code and models on GitHub. For an image size of 256×256 at 4 fps, we are able to predict realistic driving scene videos frame by frame with only one frame of algorithmic latency.
English Analysis
1. Bibliographic Information
- Title: OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data
- Authors: Björn Möller, Zhengyang Li, Malte Stelzer, Thomas Graave, Fabian Bettels, Muaaz Ataya, Tim Fingscheidt
- Affiliations: Technische Universität Braunschweig, Institute for Communications Technology
- Journal/Conference: The paper is available on arXiv, which is a preprint server for academic articles. It has not yet undergone formal peer review for a conference or journal.
- Publication Year: 2025 (the arXiv identifier 2509.15479v1 corresponds to a September 2025 submission).
- Abstract: The paper introduces OpenViGA, an open-source video generation system for automotive driving scenes. Unlike closed-source systems such as GAIA-1, OpenViGA is built by fine-tuning publicly available pre-trained models (a VQGAN for tokenization/decoding and LWM for future prediction) on a public dataset (BDD100K). The authors provide a detailed analysis of each of the three core components: the image tokenizer, the world model, and the video decoder. By streamlining the interfaces between these components, they create a coherent system that can be trained on academic-scale GPU hardware. The project emphasizes reproducibility by releasing all code and models. The final system predicts realistic driving videos in pixel space at 4 frames per second (fps) with only a single frame of algorithmic latency.
- Original Source Link:
  - arXiv Page: https://arxiv.org/abs/2509.15479v1
  - PDF Link: https://arxiv.org/pdf/2509.15479v1.pdf
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: High-quality video generation for autonomous driving is a critical area of research for simulating diverse and rare scenarios to improve safety. However, state-of-the-art systems like GAIA-1 are often "black boxes"—they use large, proprietary models, are trained on non-public datasets, and do not release their code.
- Research Gap: This lack of transparency and accessibility hinders academic research, making it difficult to reproduce results, understand design choices, or build upon existing work. There is a need for a powerful, open, and reproducible baseline for automotive video generation that can be run with academic-level computational resources.
- Innovation: OpenViGA's key innovation is not a brand-new model architecture but rather a new methodology: it demonstrates how to build a high-performing video generation system by intelligently selecting, streamlining, and fine-tuning existing open-source components. It champions a fully transparent and reproducible approach.
- Main Contributions / Findings (What):
- A Fully Open-Source System: The primary contribution is OpenViGA itself, a video generation system for driving scenes constructed entirely from open-source models and public data.
- Component-wise Deep Analysis: The paper provides a rigorous quantitative and qualitative evaluation of each of its three main components (tokenizer, world model, decoder), offering valuable insights into their individual performance and the impact of different training strategies.
- Effective Fine-Tuning Strategy: It shows that powerful, general-purpose pre-trained models can be successfully adapted to the specific domain of automotive scenes using a public dataset (BDD100K) and parameter-efficient fine-tuning techniques (LoRA) on limited hardware.
- Reproducibility: By publishing their code, models, and detailing their methodology, the authors enable full reproducibility, providing a solid foundation for future research in the academic community.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Autoencoders (AE): These are neural networks designed to learn compressed representations of data. An AE consists of two parts: an encoder that maps input data (like an image) to a low-dimensional latent space, and a decoder that reconstructs the original data from this latent representation. They are fundamental for data compression and feature extraction.
- Vector Quantized Generative Adversarial Network (VQGAN): VQGAN is a sophisticated image generation model. It uses an autoencoder structure, but with a key difference: the latent space is discrete. The encoder's output is mapped to the closest vector in a learned "codebook." This process, called vector quantization (VQ), turns an image into a sequence of discrete tokens, similar to words in a sentence; a minimal sketch of this codebook lookup appears after this list. It is trained with a GAN (Generative Adversarial Network) loss to produce highly realistic images.
- World Models (WM): A world model is a system that learns an internal representation (or "model") of how an environment works. It can then use this model to simulate the environment and predict future states. In this paper, the WM predicts future video frames by operating in the discrete latent space created by the tokenizer.
- Transformers and Large Language Models (LLMs): Transformers are a type of neural network architecture that excels at processing sequential data, like text. LLMs (e.g., LLaMA-2) are massive transformer models trained on vast amounts of text to predict the next word in a sequence. This next-token prediction capability is repurposed in OpenViGA to predict the next image token in a sequence.
- Low-Rank Adaptation (LoRA): Training a massive model like an LLM from scratch is computationally expensive. LoRA is a parameter-efficient fine-tuning (PEFT) technique that freezes the original model's weights and injects small, trainable "adapter" matrices into its layers. This dramatically reduces the number of trainable parameters, making fine-tuning feasible on limited hardware.
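To make the vector-quantization step concrete, here is a minimal, self-contained sketch of the codebook lookup described above. The tensor shapes and codebook size are illustrative placeholders, not the configuration of the VQGAN used in the paper.

```python
import torch

def quantize(z_e, codebook):
    """Vector quantization: map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) continuous encoder outputs, one D-dimensional vector per image patch
    codebook: (K, D) learned embedding vectors
    Returns the discrete token indices (N,) and the quantized vectors (N, D).
    """
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise Euclidean distances
    indices = dists.argmin(dim=1)        # index of the closest codebook entry per patch
    z_q = codebook[indices]              # quantized latents passed on to the decoder
    return indices, z_q

# Toy usage with illustrative sizes (not the paper's VQGAN configuration):
codebook = torch.randn(8192, 64)   # K = 8192 codebook entries of dimension D = 64
z_e = torch.randn(256, 64)         # 256 patch vectors, e.g., a 16x16 token grid
tokens, z_q = quantize(z_e, codebook)
print(tokens.shape, z_q.shape)     # torch.Size([256]) torch.Size([256, 64])
```

In a full VQGAN, this non-differentiable lookup is paired with a straight-through gradient estimator and codebook/commitment losses so that the encoder and the codebook can be trained jointly.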
- Previous Works:
- GAIA-1 & GAIA-2: These are powerful, world model-based systems for autonomous driving video generation. They demonstrated impressive qualitative results but are closed-source, use proprietary datasets, and lack detailed quantitative analysis, which motivates the creation of OpenViGA.
- DriveGAN, DriveDreamer, Drive-WM: These are other notable models in the field. They often use different architectures like LSTMs or diffusion models. While effective, many share the same reproducibility challenges as GAIA.
- Diffusion Models: Many recent video generation models are based on diffusion, which learns to generate data by reversing a noise-adding process. While they produce high-quality results, they can be computationally intensive to train. OpenViGA opts for a VQGAN-based frame-by-frame approach, which is of lower complexity.
- Differentiation: OpenViGA distinguishes itself from prior work not by claiming a superior new architecture but by its principled commitment to openness. It explicitly addresses the reproducibility crisis in this domain by using only public assets (models, data, code) and providing a detailed breakdown of its engineering and training process. This positions it as a foundational, verifiable benchmark for academic research.
4. Methodology (Core Technology & Implementation)
The OpenViGA system is a three-stage pipeline designed for predicting future video frames from an initial sequence of frames and a text prompt.
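The following pseudocode sketches how such a three-stage pipeline can be driven frame by frame at inference time. The interfaces `tok.encode`, `wm.predict`, and `vdec.decode` are hypothetical names chosen for illustration (they do not reflect the released API), and the three-frame decoder context is an assumption mirroring the temporal context mentioned later in the analysis.

```python
def generate_video(context_frames, text_prompt, tok, wm, vdec, num_future=14):
    """Illustrative frame-by-frame generation loop for a tokenizer / world model /
    video decoder (TOK -> WM -> VDEC) pipeline. All component interfaces are assumed."""
    token_history = [tok.encode(f) for f in context_frames]  # discrete tokens per input frame
    decoded = list(context_frames)
    for _ in range(num_future):
        # The world model autoregressively predicts the token indices of the next frame,
        # conditioned on the text prompt and all previously seen or predicted frame tokens.
        next_tokens = wm.predict(text_prompt, token_history)
        token_history.append(next_tokens)
        # The video decoder maps the new tokens back to pixels, using a few of the most
        # recently decoded frames as temporal context.
        frame = vdec.decode(next_tokens, context=decoded[-3:])
        decoded.append(frame)
    return decoded[len(context_frames):]                      # the predicted future frames
```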
Figure 4: Schematic of world model (WM) fine-tuning with a single training sample, which consists of a text token index sequence and an image token index sequence. The text and image tokens are embedded and merged, then passed through multiple decoder blocks, RMS normalization, a fully connected layer, and a softmax layer to predict image patch token index probabilities. The model is optimized by computing the cross-entropy loss between the predicted probabilities and the ground-truth image token indices.
The objective is to minimize the cross-entropy loss between the WM's predicted next-token probabilities and the ground-truth next tokens (a minimal code sketch follows below). In summation form, the loss reads

$$J^{\mathrm{CE}} = -\sum_{i} \log P\left(q_i \mid \mathbf{c}_i\right)$$

* $q_i$: the ground-truth token index at position $i$.
* $\mathbf{c}_i$: the context, including the text prompt and all preceding image tokens.
* $P(q_i \mid \mathbf{c}_i)$: the probability of the next token as predicted by the model.
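The following minimal PyTorch snippet illustrates this next-token cross-entropy objective. The batch size, sequence length, and vocabulary (codebook) size are toy values chosen for illustration, not the paper's training configuration.

```python
import torch
import torch.nn.functional as F

# Toy dimensions for illustration only (not the paper's training configuration):
B, S, V = 2, 16, 8192                  # batch size, token sequence length, codebook size
logits = torch.randn(B, S, V)          # world-model outputs: one distribution per position
targets = torch.randint(0, V, (B, S))  # ground-truth image token indices (shifted by one)

# Cross-entropy between predicted next-token distributions and ground-truth indices.
# F.cross_entropy expects (N, V) logits and (N,) class indices, so both are flattened.
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
print(loss.item())
```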
5. Experimental Setup
- Datasets:
- BDD100K (Berkeley DeepDrive): A large, publicly available dataset containing 100,000 driving videos (around 40 seconds each) recorded at 30 fps with a resolution of 1280×720. It captures diverse conditions (weather, time of day).
- The authors created several subsets for training and validation: `D_BDDvid` (the full video dataset), `D_train-70k,BDDvid-4fps` (70k training videos subsampled to 4 fps for the WM and VDEC), `D_BDDimg` (the official image splits), and a larger custom set of ~538k images for TOK training.
- Evaluation Metrics:
- Image Reconstruction Quality Metrics (comparing the reconstructed image to the original):
- PSNR (Peak Signal-to-Noise Ratio):
- Conceptual Definition: Measures the ratio between the maximum possible power of a signal (the pixel values) and the power of the corrupting noise (the error between the original and reconstructed image). Higher PSNR means better reconstruction quality. It is sensitive to pixel-wise errors; a minimal computation sketch appears after this metric list.
- Mathematical Formula: $\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$
- Symbol Explanation: $\mathrm{MAX}_I$ is the maximum possible pixel value (e.g., 255), and $\mathrm{MSE}$ is the Mean Squared Error between the original and reconstructed images.
- SSIM (Structural Similarity Index Measure):
- Conceptual Definition: Measures image quality by comparing three characteristics: luminance, contrast, and structure. It is designed to be more consistent with human perception than PSNR. A value closer to 1 indicates higher similarity.
- MS-SSIM (Multi-Scale SSIM):
- Conceptual Definition: An extension of SSIM that evaluates structural similarity at multiple scales (resolutions), providing a more robust quality assessment.
- LPIPS (Learned Perceptual Image Patch Similarity):
- Conceptual Definition: Measures the perceptual distance between two images using features extracted from a deep neural network (e.g., VGG). It aligns better with human judgments of similarity than pixel-based metrics. Lower LPIPS is better.
- Generative Quality Metrics (comparing the distribution of generated images/videos to real ones):
- FID (Fréchet Inception Distance):
- Conceptual Definition: Measures the distance between the feature distributions of real and generated images. Features are extracted using a pre-trained InceptionV3 network. A lower FID score indicates that the generated images are more similar to real images in terms of statistical properties.
- CMMD (CLIP Maximum Mean Discrepancy):
- Conceptual Definition: Similar to FID, but uses features from the CLIP model, which is trained on both images and text. This allows it to capture more semantic and abstract similarities. Lower CMMD is better.
- FVD (Fréchet Video Distance):
- Conceptual Definition: The video-based equivalent of FID. It measures the distance between the feature distributions of real and generated videos. The features are extracted from a network trained to classify videos, so it captures temporal consistency and motion quality. Lower FVD is better.
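As a concrete example of the simplest of these metrics, the snippet below computes PSNR directly from the MSE definition given above. It is a generic illustration rather than the paper's evaluation code; FID, CMMD, and FVD additionally require pre-trained feature extractors and are omitted here.

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio (in dB) between two images of identical shape."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")              # identical images: infinite PSNR
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage with a random 256x256 RGB image and a slightly perturbed copy:
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
noise = rng.integers(-5, 6, size=img.shape)
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(img, noisy):.2f} dB")
```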
- Baselines: The primary comparisons are internal ablations. The authors compare:
- The fine-tuned models against the original pre-trained models ("No fine-tuning").
- Different combinations of loss functions for tokenizer training.
- Different teacher models for the self-supervised loss.
- Different `top-k` sampling values for the world model's inference (see the sketch below).
- The final system with a video decoder (VDEC) vs. a simple image decoder (DEC).
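For reference, `top-k` sampling restricts the sampling of each predicted token to the k most probable entries of the world model's output distribution. The sketch below is a minimal generic implementation; k and temperature are illustrative parameters rather than the paper's exact settings.

```python
import torch

def sample_top_k(logits, k=200, temperature=1.0):
    """Sample one token index from the k highest-scoring entries of a logit vector."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)       # keep only the k largest logits
    probs = torch.softmax(topk_vals, dim=-1)          # renormalize over the kept entries
    choice = torch.multinomial(probs, num_samples=1)  # stochastic pick within the top-k set
    return topk_idx[choice].item()

# Toy usage over a vocabulary of 8192 codebook indices; k = 200 and k = 1000 are among
# the values compared in the paper's ablations.
logits = torch.randn(8192)
print(sample_top_k(logits, k=1000))
```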
6. Results & Analysis
- Core Results: Image Tokenizer Fine-Tuning
- Ablation on Loss Components (Table 2): This table, transcribed below, shows the importance of each loss term.

| Loss functions | PSNR ↑ | SSIM ↑ | MS-SSIM ↑ | LPIPS ↓ | FID ↓ | CMMD ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| J_total (1), J_D (7) | 25.75 | 0.7630 | 0.9022 | 0.1170 | 5.48 | 0.074 |
| − J_SSL (1), (5) | 26.31 | 0.7794 | 0.9140 | 0.1096 | 5.79 | 0.122 |
| − J′ of J_rec (1), (2), (3) | 25.75 | 0.7487 | 0.8943 | 0.1756 | 18.64 | 0.594 |
| − J_L2 of J_rec (1), (2) | 23.87 | 0.7304 | 0.8713 | 0.1307 | 6.10 | 0.107 |
| − J_G (1), (6) | 27.11 | 0.8041 | 0.9178 | 0.1759 | 17.97 | 0.393 |
| No fine-tuning | 25.08 | 0.7690 | 0.9018 | 0.1207 | 5.82 | 0.385 |

Analysis: The "No fine-tuning" model has a high CMMD, indicating a perceptual gap. Removing the GAN loss (J_G) or the perceptual loss (J′ of J_rec) severely degrades the perceptual metrics (FID, CMMD), even though PSNR/SSIM may still look good. This confirms that for generative tasks, perceptual and adversarial losses are critical. The full J_total provides the best balance, especially in perceptual quality (lowest FID and CMMD).
- Optimization of Loss Weights (Table 3): The authors found that their proposed discriminator, combined with a suitable weighting of the loss terms, provided the best perceptual results, achieving an FID of 3.97 and a CMMD of 0.048.
- Qualitative Improvement (Figure 5):
*Figure 5: Frames x_t after encoding, vector quantization, and decoding (ENC + VQ + DEC = TOK + DEC).*

- `top-k` Sampling for World Model Inference: Among the evaluated values, k = 1000 gives the best FVD score, indicating it produces the most realistic video motion.

- Final System Performance (Table 6): This table, transcribed below, compares the final system using the image decoder (DEC) versus the video decoder (VDEC), each with its best `top-k` setting.

| System | k | FID14 ↓ | CMMD14 ↓ | FVD14 ↓ |
| :--- | :-: | :---: | :---: | :---: |
| TOK+WM+DEC | 200 | 12.22 | 0.081 | 160.48 |
| **OpenViGA (TOK+WM+VDEC)** | **1000** | **13.29** | **0.248** | **132.16** |

Analysis: The video decoder (`VDEC`) achieves a substantially better FVD score (132.16 vs. 160.48) than the simple image decoder (`DEC`). This confirms that incorporating temporal context during the decoding stage is crucial for generating higher-quality, more consistent videos, and it justifies the adoption of VDEC in the final OpenViGA system.

- Qualitative Video Examples (Figures 7 & 8):

*Figure 7: Video generation results of the OpenViGA system. Each row shows a driving scene sequence as a series of key frames (frames 1, 2, 6, 10, and 14), depicting roads and vehicles (e.g., a white van) in urban environments and demonstrating the model's ability to predict realistic driving scene video frames.*

*Figure 8: Further OpenViGA generation results. Each row is a generated sequence in which every labeled frame (frames 1, 2, 6, 10, 14) is split vertically: the left half shows the original (ground-truth) frame and the right half the corresponding predicted frame. The examples show city streets in wet weather with several taxis and a red car, illustrating the system's ability to generate realistic future frames from a short input clip.*

These figures show sequences of generated frames and demonstrate that OpenViGA can produce temporally coherent videos. For example, the scenes show consistent forward motion, with objects such as cars and buildings moving realistically through the frame over time. The quality remains stable over the 14 predicted frames, without significant degradation or artifacts.

7. Conclusion & Reflections
- Conclusion Summary: The authors successfully created OpenViGA, a video generation system for automotive scenes that is fully open and reproducible. They demonstrated that by carefully streamlining and fine-tuning existing open-source models (VQGAN, LWM) on a public dataset (BDD100K), it is possible to achieve realistic video generation on academic-scale hardware. The paper's rigorous component-wise analysis provides valuable insights for future research, and its release of code and models serves as an important benchmark for the community.
- Limitations & Future Work:
  - The authors acknowledge that their 3D CNN-based video decoder is simpler and likely produces lower-quality frames than state-of-the-art diffusion-based decoders.
  - The temporal context of the VDEC is limited to three frames; extending it could improve long-term coherence.
  - The system operates at a relatively low resolution (256×256) and frame rate (4 fps), which may not be sufficient for all practical autonomous driving applications.
- Personal Insights & Critique:
- Strengths: The paper's primary contribution is its philosophy of openness and reproducibility. In a field dominated by closed-off, large-scale industrial labs, this work is a crucial step towards democratizing research in video generation. The detailed ablation studies are exemplary and provide a clear roadmap for others looking to build similar systems.
- Weaknesses: The evaluation lacks a direct comparison to other open-source video generation models, which would have helped to better contextualize OpenViGA's performance. The system's performance is heavily constrained by the architecture of the pre-trained models it is based on (e.g., the 256-token limit from LWM).
- Future Impact: OpenViGA has the potential to become a standard baseline for academic research in automotive video generation. It lowers the barrier to entry and allows researchers to focus on improving specific components (e.g., swapping the VDEC for a diffusion model) within a well-understood and reproducible framework. It highlights a pragmatic and effective path forward: building on the shoulders of giants by adapting powerful, general-purpose open-source models to specialized domains.