OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data
TL;DR Summary
OpenViGA presents an open, reproducible system for automotive driving scene video generation. It fine-tunes powerful open-source models (e.g., VQGAN, LWM) on the public BDD100K dataset, systematically evaluating the tokenizer, world model, and decoder components. The streamlined system predicts realistic driving videos frame by frame at 256×256 and 4 fps, with code and models released on GitHub.
Abstract
Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are: First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system by separate quantitative and qualitative evaluation: image tokenizer, world model, and video decoder. Second, we build purely upon powerful pre-trained open-source models from various domains, which we fine-tune with publicly available automotive data (BDD100K) on GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining the interfaces of our components. Fourth, due to the public availability of the underlying models and data, we allow full reproducibility. Finally, we also publish our code and models on GitHub. For an image size of 256×256 at 4 fps, we are able to predict realistic driving scene videos frame by frame with only one frame of algorithmic latency.
English Analysis
1. Bibliographic Information
- Title: OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data
- Authors: Björn Möller, Zhengyang Li, Malte Stelzer, Thomas Graave, Fabian Bettels, Muaaz Ataya, Tim Fingscheidt
- Affiliations: Technische Universität Braunschweig, Institute for Communications Technology
- Journal/Conference: The paper is available on arXiv, which is a preprint server for academic articles. It has not yet undergone formal peer review for a conference or journal.
- Publication Year: 2025 (the arXiv identifier 2509.15479v1 corresponds to a September 2025 submission).
- Abstract: The paper introduces OpenViGA, an open-source video generation system for automotive driving scenes. Unlike closed-source systems such as GAIA-1, OpenViGA is built by fine-tuning publicly available pre-trained models (a VQGAN for tokenization/decoding and LWM for future prediction) on a public dataset (BDD100K). The authors provide a detailed analysis of each of the three core components: the image tokenizer, the world model, and the video decoder. By streamlining the interfaces between these components, they create a coherent system that can be trained on academic-scale GPU hardware. The project emphasizes reproducibility by releasing all code and models. The final system predicts realistic driving videos in pixel space at 4 frames per second (fps) with only a single frame of algorithmic latency.
- Original Source Link:
  - arXiv Page: https://arxiv.org/abs/2509.15479v1
  - PDF Link: https://arxiv.org/pdf/2509.15479v1.pdf
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: High-quality video generation for autonomous driving is a critical area of research for simulating diverse and rare scenarios to improve safety. However, state-of-the-art systems like GAIA-1 are often "black boxes"—they use large, proprietary models, are trained on non-public datasets, and do not release their code.
- Research Gap: This lack of transparency and accessibility hinders academic research, making it difficult to reproduce results, understand design choices, or build upon existing work. There is a need for a powerful, open, and reproducible baseline for automotive video generation that can be run with academic-level computational resources.
- Innovation: OpenViGA's key innovation is not a brand-new model architecture but rather a new methodology: it demonstrates how to build a high-performing video generation system by intelligently selecting, streamlining, and fine-tuning existing open-source components. It champions a fully transparent and reproducible approach.
- Main Contributions / Findings (What):
- A Fully Open-Source System: The primary contribution is OpenViGA itself, a video generation system for driving scenes constructed entirely from open-source models and public data.
- Component-wise Deep Analysis: The paper provides a rigorous quantitative and qualitative evaluation of each of its three main components (tokenizer, world model, decoder), offering valuable insights into their individual performance and the impact of different training strategies.
- Effective Fine-Tuning Strategy: It shows that powerful, general-purpose pre-trained models can be successfully adapted to the specific domain of automotive scenes using a public dataset (BDD100K) and parameter-efficient fine-tuning techniques (LoRA) on limited hardware.
- Reproducibility: By publishing their code, models, and detailing their methodology, the authors enable full reproducibility, providing a solid foundation for future research in the academic community.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Autoencoders (AE): These are neural networks designed to learn compressed representations of data. An AE consists of two parts: an encoder that maps input data (like an image) to a low-dimensional latent space, and a decoder that reconstructs the original data from this latent representation. They are fundamental for data compression and feature extraction.
- Vector Quantized Generative Adversarial Network (VQGAN): VQGAN is a sophisticated image generation model. It uses an autoencoder structure, but with a key difference: the latent space is discrete. The encoder's output is mapped to the closest vector in a learned "codebook." This process, called vector quantization (VQ), turns an image into a sequence of discrete tokens, similar to words in a sentence; a minimal sketch of this codebook lookup appears after this list. It is trained with a GAN (Generative Adversarial Network) loss to produce highly realistic images.
- World Models (WM): A world model is a system that learns an internal representation (or "model") of how an environment works. It can then use this model to simulate the environment and predict future states. In this paper, the WM predicts future video frames by operating in the discrete latent space created by the tokenizer.
- Transformers and Large Language Models (LLMs): Transformers are a type of neural network architecture that excels at processing sequential data, like text. LLMs (e.g., LLaMA-2) are massive transformer models trained on vast amounts of text to predict the next word in a sequence. This next-token prediction capability is repurposed in OpenViGA to predict the next image token in a sequence.
- Low-Rank Adaptation (LoRA): Training a massive model like an LLM from scratch is computationally expensive. LoRA is a parameter-efficient fine-tuning (PEFT) technique that freezes the original model's weights and injects small, trainable "adapter" matrices into its layers. This dramatically reduces the number of trainable parameters, making fine-tuning feasible on limited hardware.
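To make the vector-quantization step concrete, here is a minimal, self-contained sketch of the codebook lookup described above. The tensor shapes and codebook size are illustrative placeholders, not the configuration of the VQGAN used in the paper.

```python
import torch

def quantize(z_e, codebook):
    """Vector quantization: map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) continuous encoder outputs, one D-dimensional vector per image patch
    codebook: (K, D) learned embedding vectors
    Returns the discrete token indices (N,) and the quantized vectors (N, D).
    """
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise Euclidean distances
    indices = dists.argmin(dim=1)        # index of the closest codebook entry per patch
    z_q = codebook[indices]              # quantized latents passed on to the decoder
    return indices, z_q

# Toy usage with illustrative sizes (not the paper's VQGAN configuration):
codebook = torch.randn(8192, 64)   # K = 8192 codebook entries of dimension D = 64
z_e = torch.randn(256, 64)         # 256 patch vectors, e.g., a 16x16 token grid
tokens, z_q = quantize(z_e, codebook)
print(tokens.shape, z_q.shape)     # torch.Size([256]) torch.Size([256, 64])
```

In a full VQGAN, this non-differentiable lookup is paired with a straight-through gradient estimator and codebook/commitment losses so that the encoder and the codebook can be trained jointly.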
- Previous Works:
- GAIA-1 & GAIA-2: These are powerful, world model-based systems for autonomous driving video generation. They demonstrated impressive qualitative results but are closed-source, use proprietary datasets, and lack detailed quantitative analysis, which motivates the creation of OpenViGA.
- DriveGAN, DriveDreamer, Drive-WM: These are other notable models in the field. They often use different architectures like LSTMs or diffusion models. While effective, many share the same reproducibility challenges as GAIA.
- Diffusion Models: Many recent video generation models are based on diffusion, which learns to generate data by reversing a noise-adding process. While they produce high-quality results, they can be computationally intensive to train. OpenViGA opts for a VQGAN-based frame-by-frame approach, which is of lower complexity.
- Differentiation: OpenViGA distinguishes itself from prior work not by claiming a superior new architecture but by its principled commitment to openness. It explicitly addresses the reproducibility crisis in this domain by using only public assets (models, data, code) and providing a detailed breakdown of its engineering and training process. This positions it as a foundational, verifiable benchmark for academic research.
4. Methodology (Core Technology & Implementation)
The OpenViGA system is a three-stage pipeline designed for predicting future video frames from an initial sequence of frames and a text prompt.
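The following pseudocode sketches how such a three-stage pipeline can be driven frame by frame at inference time. The interfaces `tok.encode`, `wm.predict`, and `vdec.decode` are hypothetical names chosen for illustration (they do not reflect the released API), and the three-frame decoder context is an assumption mirroring the temporal context mentioned later in the analysis.

```python
def generate_video(context_frames, text_prompt, tok, wm, vdec, num_future=14):
    """Illustrative frame-by-frame generation loop for a tokenizer / world model /
    video decoder (TOK -> WM -> VDEC) pipeline. All component interfaces are assumed."""
    token_history = [tok.encode(f) for f in context_frames]  # discrete tokens per input frame
    decoded = list(context_frames)
    for _ in range(num_future):
        # The world model autoregressively predicts the token indices of the next frame,
        # conditioned on the text prompt and all previously seen or predicted frame tokens.
        next_tokens = wm.predict(text_prompt, token_history)
        token_history.append(next_tokens)
        # The video decoder maps the new tokens back to pixels, using a few of the most
        # recently decoded frames as temporal context.
        frame = vdec.decode(next_tokens, context=decoded[-3:])
        decoded.append(frame)
    return decoded[len(context_frames):]                      # the predicted future frames
```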
Figure 4: Schematic of world model (WM) fine-tuning with a single training sample, which consists of a text token index sequence and an image token index sequence. The text and image tokens are embedded and merged, then passed through multiple decoder blocks, RMS normalization, a fully connected layer, and a softmax layer to predict image patch token index probabilities. The model is optimized by computing the cross-entropy loss between the predicted probabilities and the ground-truth image token indices.
The objective is to minimize the cross-entropy loss between the WM's predicted next-token probabilities and the ground-truth next tokens (a minimal code sketch follows below). In summation form, the loss reads

$$J^{\mathrm{CE}} = -\sum_{i} \log P\left(q_i \mid \mathbf{c}_i\right)$$

* $q_i$: the ground-truth token index at position $i$.
* $\mathbf{c}_i$: the context, including the text prompt and all preceding image tokens.
* $P(q_i \mid \mathbf{c}_i)$: the probability of the next token as predicted by the model.
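The following minimal PyTorch snippet illustrates this next-token cross-entropy objective. The batch size, sequence length, and vocabulary (codebook) size are toy values chosen for illustration, not the paper's training configuration.

```python
import torch
import torch.nn.functional as F

# Toy dimensions for illustration only (not the paper's training configuration):
B, S, V = 2, 16, 8192                  # batch size, token sequence length, codebook size
logits = torch.randn(B, S, V)          # world-model outputs: one distribution per position
targets = torch.randint(0, V, (B, S))  # ground-truth image token indices (shifted by one)

# Cross-entropy between predicted next-token distributions and ground-truth indices.
# F.cross_entropy expects (N, V) logits and (N,) class indices, so both are flattened.
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
print(loss.item())
```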
5. Experimental Setup
- Datasets:
- BDD100K (Berkeley DeepDrive): A large, publicly available dataset containing 100,000 driving videos (around 40 seconds each) recorded at 30 fps with a resolution of 1280×720. It captures diverse conditions (weather, time of day).
- The authors created several subsets for training and validation: `D_BDDvid` (the full video dataset), `D_train-70k,BDDvid-4fps` (70k training videos subsampled to 4 fps for the WM and VDEC), `D_BDDimg` (the official image splits), and a larger custom set of ~538k images for TOK training.
- Evaluation Metrics:
- Image Reconstruction Quality Metrics (comparing the reconstructed image to the original):
- PSNR (Peak Signal-to-Noise Ratio):
- Conceptual Definition: Measures the ratio between the maximum possible power of a signal (the pixel values) and the power of the corrupting noise (the error between the original and reconstructed image). Higher PSNR means better reconstruction quality. It is sensitive to pixel-wise errors; a minimal computation sketch appears after this metric list.
- Mathematical Formula: $\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$
- Symbol Explanation: $\mathrm{MAX}_I$ is the maximum possible pixel value (e.g., 255), and $\mathrm{MSE}$ is the Mean Squared Error between the original and reconstructed images.
- SSIM (Structural Similarity Index Measure):
- Conceptual Definition: Measures image quality by comparing three characteristics: luminance, contrast, and structure. It is designed to be more consistent with human perception than PSNR. A value closer to 1 indicates higher similarity.
- MS-SSIM (Multi-Scale SSIM):
- Conceptual Definition: An extension of SSIM that evaluates structural similarity at multiple scales (resolutions), providing a more robust quality assessment.
- LPIPS (Learned Perceptual Image Patch Similarity):
- Conceptual Definition: Measures the perceptual distance between two images using features extracted from a deep neural network (e.g., VGG). It aligns better with human judgments of similarity than pixel-based metrics. Lower LPIPS is better.
- Generative Quality Metrics (comparing the distribution of generated images/videos to real ones):
- FID (Fréchet Inception Distance):
- Conceptual Definition: Measures the distance between the feature distributions of real and generated images. Features are extracted using a pre-trained InceptionV3 network. A lower FID score indicates that the generated images are more similar to real images in terms of statistical properties.
- CMMD (CLIP Maximum Mean Discrepancy):
- Conceptual Definition: Similar to FID, but uses features from the CLIP model, which is trained on both images and text. This allows it to capture more semantic and abstract similarities. Lower CMMD is better.
- FVD (Fréchet Video Distance):
- Conceptual Definition: The video-based equivalent of FID. It measures the distance between the feature distributions of real and generated videos. The features are extracted from a network trained to classify videos, so it captures temporal consistency and motion quality. Lower FVD is better.
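As a concrete example of the simplest of these metrics, the snippet below computes PSNR directly from the MSE definition given above. It is a generic illustration rather than the paper's evaluation code; FID, CMMD, and FVD additionally require pre-trained feature extractors and are omitted here.

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio (in dB) between two images of identical shape."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")              # identical images: infinite PSNR
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage with a random 256x256 RGB image and a slightly perturbed copy:
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
noise = rng.integers(-5, 6, size=img.shape)
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(img, noisy):.2f} dB")
```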
- Baselines: The primary comparisons are internal ablations. The authors compare:
- The fine-tuned models against the original pre-trained models ("No fine-tuning").
- Different combinations of loss functions for tokenizer training.
- Different teacher models for the self-supervised loss.
- Different `top-k` sampling values for the world model's inference (see the sketch below).
- The final system with a video decoder (VDEC) vs. a simple image decoder (DEC).
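For reference, `top-k` sampling restricts the sampling of each predicted token to the k most probable entries of the world model's output distribution. The sketch below is a minimal generic implementation; k and temperature are illustrative parameters rather than the paper's exact settings.

```python
import torch

def sample_top_k(logits, k=200, temperature=1.0):
    """Sample one token index from the k highest-scoring entries of a logit vector."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)       # keep only the k largest logits
    probs = torch.softmax(topk_vals, dim=-1)          # renormalize over the kept entries
    choice = torch.multinomial(probs, num_samples=1)  # stochastic pick within the top-k set
    return topk_idx[choice].item()

# Toy usage over a vocabulary of 8192 codebook indices; k = 200 and k = 1000 are among
# the values compared in the paper's ablations.
logits = torch.randn(8192)
print(sample_top_k(logits, k=1000))
```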
6. Results & Analysis
- Core Results: Image Tokenizer Fine-Tuning
- Ablation on Loss Components (Table 2): This table, transcribed below, shows the importance of each loss term.

| Loss functions | PSNR ↑ | SSIM ↑ | MS-SSIM ↑ | LPIPS ↓ | FID ↓ | CMMD ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| J_total (1), J_D (7) | 25.75 | 0.7630 | 0.9022 | 0.1170 | 5.48 | 0.074 |
| − J_SSL (1), (5) | 26.31 | 0.7794 | 0.9140 | 0.1096 | 5.79 | 0.122 |
| − J′ of J_rec (1), (2), (3) | 25.75 | 0.7487 | 0.8943 | 0.1756 | 18.64 | 0.594 |
| − J_L2 of J_rec (1), (2) | 23.87 | 0.7304 | 0.8713 | 0.1307 | 6.10 | 0.107 |
| − J_G (1), (6) | 27.11 | 0.8041 | 0.9178 | 0.1759 | 17.97 | 0.393 |
| No fine-tuning | 25.08 | 0.7690 | 0.9018 | 0.1207 | 5.82 | 0.385 |

Analysis: The "No fine-tuning" model has a high CMMD, indicating a perceptual gap. Removing the GAN loss (J_G) or the perceptual loss (J′ of J_rec) severely degrades the perceptual metrics (FID, CMMD), even though PSNR/SSIM may still look good. This confirms that for generative tasks, perceptual and adversarial losses are critical. The full J_total provides the best balance, especially in perceptual quality (lowest FID and CMMD).
- Optimization of Loss Weights (Table 3): The authors found that their proposed discriminator, combined with a suitable weighting of the loss terms, provided the best perceptual results, achieving an FID of 3.97 and a CMMD of 0.048.
- Qualitative Improvement (Figure 5):
*Figure 5: Frames x_t after encoding, vector quantization, and decoding (ENC + VQ + DEC = TOK + DEC).*

- `top-k` Sampling for World Model Inference: Among the evaluated values, k = 1000 gives the best FVD score, indicating it produces the most realistic video motion.

- Final System Performance (Table 6): This table, transcribed below, compares the final system using the image decoder (DEC) versus the video decoder (VDEC), each with its best `top-k` setting.

| System | k | FID14 ↓ | CMMD14 ↓ | FVD14 ↓ |
| :--- | :-: | :---: | :---: | :---: |
| TOK+WM+DEC | 200 | 12.22 | 0.081 | 160.48 |
| **OpenViGA (TOK+WM+VDEC)** | **1000** | **13.29** | **0.248** | **132.16** |

Analysis: The video decoder (`VDEC`) achieves a substantially better FVD score (132.16 vs. 160.48) than the simple image decoder (`DEC`). This confirms that incorporating temporal context during the decoding stage is crucial for generating higher-quality, more consistent videos, and it justifies the adoption of VDEC in the final OpenViGA system.

- Qualitative Video Examples (Figures 7 & 8):

*Figure 7: Video generation results of the OpenViGA system. Each row shows a driving scene sequence as a series of key frames (frames 1, 2, 6, 10, and 14), depicting roads and vehicles (e.g., a white van) in urban environments and demonstrating the model's ability to predict realistic driving scene video frames.*

*Figure 8: Further OpenViGA generation results. Each row is a generated sequence in which every labeled frame (frames 1, 2, 6, 10, 14) is split vertically: the left half shows the original (ground-truth) frame and the right half the corresponding predicted frame. The examples show city streets in wet weather with several taxis and a red car, illustrating the system's ability to generate realistic future frames from a short input clip.*

These figures show sequences of generated frames and demonstrate that OpenViGA can produce temporally coherent videos. For example, the scenes show consistent forward motion, with objects such as cars and buildings moving realistically through the frame over time. The quality remains stable over the 14 predicted frames, without significant degradation or artifacts.

7. Conclusion & Reflections
- Conclusion Summary: The authors successfully created OpenViGA, a video generation system for automotive scenes that is fully open and reproducible. They demonstrated that by carefully streamlining and fine-tuning existing open-source models (VQGAN, LWM) on a public dataset (BDD100K), it is possible to achieve realistic video generation on academic-scale hardware. The paper's rigorous component-wise analysis provides valuable insights for future research, and its release of code and models serves as an important benchmark for the community.
- Limitations & Future Work:
  - The authors acknowledge that their 3D CNN-based video decoder is simpler and likely produces lower-quality frames than state-of-the-art diffusion-based decoders.
  - The temporal context of the VDEC is limited to three frames; extending it could improve long-term coherence.
  - The system operates at a relatively low resolution (256×256) and frame rate (4 fps), which may not be sufficient for all practical autonomous driving applications.
- Personal Insights & Critique:
- Strengths: The paper's primary contribution is its philosophy of openness and reproducibility. In a field dominated by closed-off, large-scale industrial labs, this work is a crucial step towards democratizing research in video generation. The detailed ablation studies are exemplary and provide a clear roadmap for others looking to build similar systems.
- Weaknesses: The evaluation lacks a direct comparison to other open-source video generation models, which would have helped to better contextualize OpenViGA's performance. The system's performance is heavily constrained by the architecture of the pre-trained models it is based on (e.g., the 256-token limit from LWM).
- Future Impact: OpenViGA has the potential to become a standard baseline for academic research in automotive video generation. It lowers the barrier to entry and allows researchers to focus on improving specific components (e.g., swapping the VDEC for a diffusion model) within a well-understood and reproducible framework. It highlights a pragmatic and effective path forward: building on the shoulders of giants by adapting powerful, general-purpose open-source models to specialized domains.