- Title: DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT
- Authors: Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Ping Tan
- Affiliations: The Hong Kong University of Science and Technology, Horizon Robotics
- Journal/Conference: This paper is an arXiv preprint. Preprints are preliminary versions of research papers that have not yet undergone formal peer review. They are common in fast-moving fields like machine learning to disseminate findings quickly.
- Publication Year: 2024 (v2 submitted in December)
- Abstract: The paper presents DrivingWorld, a world model for autonomous driving based on the GPT architecture. Standard GPT models struggle with video generation because they are designed for 1D text data and cannot adequately model spatial and temporal dynamics. To address this, DrivingWorld introduces several spatial-temporal fusion mechanisms: a "next-state" prediction strategy for temporal coherence and a "next-token" prediction strategy for spatial detail within each frame. Additionally, novel masking and reweighting strategies are proposed to reduce long-term video degradation ("drifting") and allow for precise control. The model is shown to generate high-quality, consistent videos over 40 seconds long, outperforming prior state-of-the-art models in visual quality and controllability.
- Original Source Link: https://arxiv.org/abs/2412.19505v2
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The goal of DrivingWorld is to predict the next state $[\theta_{T+1}, (x_{T+1}, y_{T+1}), I_{T+1}]$ (vehicle orientation, location, and front-view image) from a history of past states $\{[\theta_t, (x_t, y_t), I_t]\}_{t=1}^{T}$. The process is broken down into tokenization, the world model itself, and decoding.
The overall pipeline is shown in Image 2.
This image is a schematic of DrivingWorld, showing components such as the vehicle pose encoder, the temporal-aware encoder, and the internal-state autoregressive module. Through these modules the model understands the vehicle orientation $\theta_t$, position $(x_t, y_t)$, and front-view image $I_t$, predicts future states, and generates videos longer than 40 seconds.
3.1. Tokenizer
The first step is to convert the continuous, multimodal state data into a sequence of discrete tokens: relative vehicle poses are quantized into discrete indices by a pose tokenizer, and front-view images are encoded into codebook indices by a temporal-aware VQ tokenizer built on a VQGAN codebook.
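The paper's tokenizers are learned components, but the core idea of mapping continuous pose values to discrete vocabulary indices (and back, for the decoder in Section 3.3) can be sketched with simple uniform binning. The ranges and bin count below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical quantization ranges and vocabulary size per pose channel.
THETA_RANGE = (-0.3, 0.3)   # relative yaw change (rad), assumed range
XY_RANGE = (-5.0, 5.0)      # relative x/y displacement (m), assumed range
N_BINS = 512                # assumed number of discrete bins per channel

def tokenize_scalar(value: float, lo: float, hi: float, n_bins: int = N_BINS) -> int:
    """Map a continuous value to a discrete bin index in [0, n_bins - 1]."""
    value = float(np.clip(value, lo, hi))
    return int(round((value - lo) / (hi - lo) * (n_bins - 1)))

def detokenize_scalar(index: int, lo: float, hi: float, n_bins: int = N_BINS) -> float:
    """Inverse mapping: recover an approximate continuous value from a token index."""
    return lo + index / (n_bins - 1) * (hi - lo)

# Example: the relative pose between two consecutive frames becomes three tokens.
d_theta, dx, dy = 0.02, 1.3, -0.4
pose_tokens = [
    tokenize_scalar(d_theta, *THETA_RANGE),
    tokenize_scalar(dx, *XY_RANGE),
    tokenize_scalar(dy, *XY_RANGE),
]
# The vehicle pose decoder later inverts this lookup to recover continuous values.
recovered = [
    detokenize_scalar(pose_tokens[0], *THETA_RANGE),
    detokenize_scalar(pose_tokens[1], *XY_RANGE),
    detokenize_scalar(pose_tokens[2], *XY_RANGE),
]
```

Image tokens, by contrast, come from the temporal-aware VQ tokenizer, whose codebook indices play the same role as these pose bins.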
3.2. World Model
This is the core transformer-based model that predicts future tokens. It avoids the inefficient vanilla GPT approach by using a decoupled architecture. Image 3 contrasts the vanilla approach with the proposed one.
This image is Figure 1 of the paper, contrasting two driving world model architectures: the conventional vanilla GPT model on the left, and the authors' model with the temporal-aware mechanism on the right, with detailed depictions of the temporal layers, multimodal layers, and the internal-state autoregressive module.
The proposed architecture consists of two main modules:
- Temporal-multimodal Fusion Module:
- Goal: To process the history of state tokens and produce a feature representation that encodes the likely next state.
- Structure: It alternates between two types of transformer layers:
- Temporal Layer: Uses causal attention along the time axis. A token at position i in frame t can only attend to tokens at the same position i in frames 1 through t. This efficiently captures how each part of the scene evolves over time.
- Multimodal Layer: Uses bidirectional attention within a single frame. All tokens within the same frame t can attend to each other. This allows the model to fuse information from different modalities (pose and image) and understand the spatial relationships within the scene.
- This decoupled design is more efficient than a full spatio-temporal attention mechanism, as it reduces the number of tokens involved in each attention operation (a sketch of the two attention patterns follows this subsection).
- Internal-state Autoregressive Module:
- Goal: To generate the high-quality tokens for the next state, $[\hat{\phi}_{T+1}, \hat{v}_{T+1}, \hat{q}_{T+1}]$, autoregressively.
- Process: It takes the feature output from the fusion module (which represents a prediction for the next state) and uses it to condition a standard GPT-style autoregressive generator. It predicts the tokens for the next frame one by one, from the orientation token to the last image token.
- Formula: The prediction of the $i$-th token of the next state, $\hat{r}_{T+1}^{i}$, is conditioned on the fused features from the previous state, $\mathring{h}_{T}$, and on all previously generated tokens of the current state:
  $$\hat{r}_{T+1}^{i} = G\left(\ldots,\ \mathrm{Emb}\big(\hat{r}_{T+1}^{i-1}\big) + \mathring{h}_{T}^{i}\right)$$
- $\hat{r}_{T+1}^{i}$: The predicted $i$-th token for the state at time $T+1$.
- $G$: The internal-state autoregressive transformer.
- $\mathring{h}_{T}^{i}$: The fused feature from the previous module, providing context for the prediction.
- $\mathrm{Emb}(\cdot)$: An embedding layer that converts discrete token indices into continuous vectors.
The model is trained with a standard cross-entropy loss to predict the correct sequence of ground-truth tokens.
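To make the decoupled design concrete, the sketch below constructs the two attention patterns described above: causal attention over the time axis for the temporal layers, and bidirectional attention within a frame for the multimodal layers. This is an illustrative PyTorch sketch under assumed tensor shapes (frame count, tokens per frame, channel width), not the authors' implementation.

```python
import torch

T, N = 4, 6   # assumed: T history frames, N tokens per frame (pose + image tokens)

def temporal_causal_mask(num_frames: int) -> torch.Tensor:
    # (T, T) lower-triangular mask: position i in frame t may attend to the
    # same position i in frames 1..t only (True = attention allowed).
    return torch.tril(torch.ones(num_frames, num_frames, dtype=torch.bool))

def intra_frame_mask(num_tokens: int) -> torch.Tensor:
    # (N, N) all-True mask: full bidirectional attention among the pose and
    # image tokens of a single frame.
    return torch.ones(num_tokens, num_tokens, dtype=torch.bool)

# The fusion module alternates the two layer types. Reshaping the token grid
# (T, N, C) lets each attention call run over a short sequence of length T or N,
# rather than the single T*N-long sequence a vanilla GPT would attend over.
x = torch.randn(T, N, 256)           # toy token features, channel dim C = 256
temporal_input = x.permute(1, 0, 2)  # (N, T, C): one time sequence per token position
multimodal_input = x                 # (T, N, C): one intra-frame sequence per frame
print(temporal_causal_mask(T).shape, intra_frame_mask(N).shape)
```

Because each attention operates over T or N tokens rather than T·N, memory grows much more slowly with sequence length, consistent with the memory comparison in Table 6.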
3.3. Decoder
Once the world model predicts the next-state tokens $(\hat{\phi}_{T+1}, \hat{v}_{T+1}, \hat{q}_{T+1})$, the decoder converts them back into physical data.
- Vehicle Pose Decoder: This performs the inverse operation of the pose tokenizer, converting the discrete token indices back into continuous relative pose values $(\Delta\hat{\theta}_{T+1}, \Delta\hat{x}_{T+1}, \Delta\hat{y}_{T+1})$.
- Temporal-aware Decoder: The predicted image tokens $\hat{q}_{T+1}$ are used to look up the corresponding vectors from the codebook. These vectors are then passed through a temporal self-attention layer (mirroring the encoder) before being fed to the VQGAN's decoder CNN to reconstruct the final image $\hat{I}_{T+1}$.
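A minimal sketch of this decoding path, assuming a VQGAN-style codebook implemented as an embedding table, a single temporal self-attention layer, and a toy stand-in for the decoder CNN (codebook size, latent grid, and layer widths are illustrative choices, not values from the paper):

```python
import torch
import torch.nn as nn

class TemporalAwareDecoder(nn.Module):
    """Illustrative decoder: token indices -> codebook vectors ->
    temporal self-attention over the frame history -> CNN image decoder."""
    def __init__(self, codebook_size: int = 8192, dim: int = 256, grid=(16, 32)):
        super().__init__()
        self.grid = grid                                  # assumed latent grid (H', W')
        self.codebook = nn.Embedding(codebook_size, dim)  # stands in for the VQGAN codebook
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cnn_decoder = nn.Sequential(                 # toy stand-in for the VQGAN decoder CNN
            nn.ConvTranspose2d(dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1),
        )

    def forward(self, token_indices: torch.Tensor) -> torch.Tensor:
        # token_indices: (T, H'*W') predicted image token ids for T frames.
        z = self.codebook(token_indices)                  # (T, H'*W', dim)
        z_t = z.permute(1, 0, 2)                          # (H'*W', T, dim): attend over time per position
        z_t, _ = self.temporal_attn(z_t, z_t, z_t)
        z = z_t.permute(1, 0, 2)                          # back to (T, H'*W', dim)
        h, w = self.grid
        t = z.shape[0]
        z = z.transpose(1, 2).reshape(t, -1, h, w)        # (T, dim, H', W')
        return self.cnn_decoder(z)                        # (T, 3, 4*H', 4*W') reconstructed frames

frames = TemporalAwareDecoder()(torch.randint(0, 8192, (4, 16 * 32)))
```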
3.4. Long-term Controllable Generation
Two key strategies are introduced to improve long-duration generation and controllability: a random masking strategy (RMS) applied during training to curb long-term content drifting, and a balanced attention (reweighting) strategy that enables more precise control over the generated futures.
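The strategies themselves are only named above; as a rough illustration of the masking idea, the sketch below randomly replaces a fraction of the ground-truth history tokens with a dedicated mask token during training, so the model learns to cope with imperfect context like the self-generated tokens it conditions on at inference time. The mask ratio, the choice of which tokens to corrupt, and the replacement scheme are all assumptions for illustration; the actual RMS in the paper may differ.

```python
import torch

MASK_TOKEN_ID = 0   # assumed: an id reserved for masked positions
MASK_RATIO = 0.15   # assumed masking probability; not a value from the paper

def random_masking(history_tokens: torch.Tensor, mask_ratio: float = MASK_RATIO) -> torch.Tensor:
    """Randomly replace history tokens with MASK_TOKEN_ID during training.

    history_tokens: (T, N) integer token grid for T past frames with N tokens each.
    Returns a corrupted copy; the training targets remain the clean ground truth.
    """
    corrupted = history_tokens.clone()
    mask = torch.rand(history_tokens.shape) < mask_ratio
    corrupted[mask] = MASK_TOKEN_ID
    return corrupted

# Training-time usage: condition on the corrupted history, supervise with the
# standard cross-entropy loss against the uncorrupted target tokens.
history = torch.randint(1, 8192, (15, 6))
noisy_history = random_masking(history)
```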
5. Experimental Setup
- Datasets:
- Training: The model was trained on a massive dataset of over 3456 hours of driving data. This includes 120 hours from the public NuPlan dataset and 3336 hours of private data. The tokenizer was also pre-trained on large image/video datasets like Openimages, COCO, and YoutubeDV.
- Evaluation: The model was evaluated on the test sets of NuPlan and NuScenes. The NuScenes evaluation is "zero-shot," meaning the model was not trained on any NuScenes data.
- Evaluation Metrics:
- Fréchet Inception Distance (FID):
- Conceptual Definition: FID measures the quality and diversity of generated images compared to a set of real images. It calculates the distance between the distributions of deep features (from an InceptionV3 network) of the two sets. A lower FID means the generated images are statistically more similar to real images.
- Mathematical Formula:
  $$\mathrm{FID}(x, g) = \lVert \mu_x - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2\,(\Sigma_x \Sigma_g)^{1/2}\right)$$
- Symbol Explanation: $\mu_x$ and $\mu_g$ are the mean vectors of the Inception features for real and generated images, respectively. $\Sigma_x$ and $\Sigma_g$ are their covariance matrices. $\mathrm{Tr}(\cdot)$ is the trace of a matrix. (A code sketch for FID and PSNR appears at the end of this section.)
- Fréchet Video Distance (FVD):
- Conceptual Definition: FVD extends FID to videos. It measures the quality and temporal consistency of generated videos. It computes the Fréchet distance between features extracted from real and generated videos using a pre-trained video classification network. Lower FVD indicates better video quality.
- Peak Signal-to-Noise Ratio (PSNR):
- Conceptual Definition: PSNR measures the pixel-level reconstruction quality of an image by comparing it to an original, ground-truth image. It is based on the mean squared error (MSE). A higher PSNR indicates a more accurate reconstruction.
- Mathematical Formula:
  $$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
- Symbol Explanation: $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image). MSE is the mean squared error between the original and reconstructed images.
- Learned Perceptual Image Patch Similarity (LPIPS):
- Conceptual Definition: LPIPS measures the perceptual similarity between two images. Unlike PSNR, which is pixel-based, LPIPS compares deep features from a pre-trained neural network, which aligns better with human perception of image similarity. A lower LPIPS score means the images are more perceptually similar.
- Baselines: The paper compares DrivingWorld against several other driving world models, including GAIA-1, Vista, DriveDreamer, WoVoGen, GenAD, and the popular video generation model Stable Video Diffusion (SVD).
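The FID and PSNR formulas above translate directly into code; a minimal NumPy/SciPy sketch is given below (feature extraction with InceptionV3, and the choice of video features for FVD, are assumed to happen elsewhere):

```python
import numpy as np
from scipy import linalg

def compute_fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two feature sets of shape (num_samples, feature_dim)."""
    mu_x, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_x @ sigma_g)   # matrix square root of Sigma_x Sigma_g
    if np.iscomplexobj(covmean):                # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))

def psnr(img_ref: np.ndarray, img_test: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between a reference and a test image (8-bit by default)."""
    mse = np.mean((img_ref.astype(np.float64) - img_test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

# Toy usage with random features/images. FVD follows the same recipe as FID but
# with features from a pretrained video network; LPIPS is usually computed with
# the reference `lpips` PyPI package rather than re-implemented.
fid_value = compute_fid(np.random.randn(500, 64), np.random.randn(500, 64))
psnr_value = psnr(np.zeros((64, 64), dtype=np.uint8), np.zeros((64, 64), dtype=np.uint8))
```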
6. Results & Analysis
Core Results
- Long-time Video Generation: Image 5 showcases the model's ability to generate a 64-second video (640 frames at 10 Hz) from just 15 initial frames. The generated sequence maintains high visual fidelity and structural consistency over this long duration.
This image shows sample frames from a long generated driving video at time steps numbered 1 through 640, illustrating DrivingWorld's ability to generate high-fidelity, continuous scenes for more than 40 seconds.
- Quantitative Comparison: In the following manually transcribed table (reconstructed from Table 2 in the text), DrivingWorld is compared to other methods on the NuScenes dataset. It achieves competitive FID and FVD scores while generating significantly longer videos. (Note: "DrivingWorld (w/o P)" likely refers to the model trained without the large private dataset.)
Table 2: Quantitative Comparison on NuScenes
(Manually transcribed from paper text)
| Metric | DriveDreamer [37] | WoVoGen [27] | Drive-WM [38] | Vista [9] | DriveGAN [30] | GenAD (OpenDV) [42] | DrivingWorld (w/o P) | DrivingWorld |
|---|---|---|---|---|---|---|---|---|
| FID ↓ | 52.6 | 27.6 | 15.8 | 6.9 | 73.4 | 15.4 | 16.4 | 7.4 |
| FVD ↓ | 452.0 | 417.7 | 122.7 | 89.4 | 502.3 | 184.0 | 174.4 | 90.9 |
| Max Duration / Frames | 4s / 48 | 2.5s / 5 | 8s / 16 | 15s / 150 | N/A | 4s / 8 | 30s / 300 | 40s / 400 |
Analysis: DrivingWorld is highly competitive with the state-of-the-art Vista on FID/FVD metrics, despite the evaluation being zero-shot. Crucially, its maximum generation duration (40 s / 400 frames) far exceeds all competitors.
- Qualitative Comparison: As shown in Image 6, when compared to SVD on a NuScenes scene, DrivingWorld produces videos with better temporal consistency, preserving details such as street lanes and vehicle identities more effectively over time.
This image is a side-by-side comparison showing, on a zero-shot NuScenes scene, the details inside the red boxes of 26-frame videos generated by SVD and by the proposed method. The results show that the proposed method better preserves street lane markings and vehicle identities.
- Image Tokenizer Comparison: The proposed Temporal-aware tokenizer outperforms other VQVAE methods, including a fine-tuned Llama-Gen, across all metrics (FVD, FID, PSNR, LPIPS), confirming its effectiveness in producing high-quality and temporally consistent token sequences.
Table 3: Quantitative Comparison of VQVAE Methods
(Manually transcribed from paper)
| VQVAE Methods | FVD12 ↓ | FID ↓ | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|---|
| VAR [34] | 164.66 | 11.75 | 22.35 | 0.2018 |
| VQGAN [8] | 156.58 | 8.46 | 21.52 | 0.2602 |
| Llama-Gen [33] | 57.78 | 5.99 | 22.31 | 0.2054 |
| Llama-Gen [33] Finetuned | 20.33 | 5.19 | 22.71 | 0.1909 |
| Temporal-aware (Ours) | 14.66 | 4.29 | 23.82 | 0.1828 |
Ablation Study
- Random Masking Strategy (RMS): Removing the RMS during training leads to a significant increase in FVD, especially for longer videos (FVD40). This confirms that the masking strategy is crucial for preventing content drifting and improving the model's robustness during long-term inference.
Table 4: Impact of Random Masking Strategy
(Manually transcribed from paper)
| Methods | FVD10 ↓ | FVD25 ↓ | FVD40 ↓ |
|---|---|---|---|
| w/o Masking | 449.40 | 595.49 | 662.60 |
| Ours | 445.22 | 574.57 | 637.60 |
- Comparison with Vanilla GPT structure: The proposed decoupled architecture is vastly superior to a vanilla GPT-2 approach.
- Performance: DrivingWorld achieves dramatically lower (better) FVD scores, indicating much higher video quality (Table 5).
- Efficiency: The vanilla GPT-2's memory usage grows quadratically with sequence length and quickly runs out of memory (OOM). DrivingWorld's memory consumption scales much more gracefully, making long-sequence processing feasible (Table 6; a back-of-the-envelope scaling comparison follows the table).
Table 5: Performance vs. GPT-2
(Manually transcribed from paper)
| Methods | FVD10 ↓ | FVD25 ↓ | FVD40 ↓ |
|---|---|---|---|
| GPT-2 [29] | 2976.97 | 3505.22 | 4017.15 |
| Ours | 445.22 | 574.57 | 637.60 |
Table 6: Memory Usage (GB) vs. GPT-2
(Manually transcribed from paper)
| Num. of frames | 5 | 6 | 7 | 8 | 9 | 10 | 15 |
|---|---|---|---|---|---|---|---|
| GPT-2 [29] | 31.555 | 39.305 | 47.237 | 55.604 | 66.169 | 77.559 | OOM |
| Ours | 21.927 | 24.815 | 26.987 | 29.877 | 31.219 | 34.325 | 45.873 |
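To see why the memory gap in Table 6 opens up, a back-of-the-envelope comparison of attention-matrix sizes is instructive. The per-frame token count below is an assumed example value chosen only for illustration; the point is the scaling behavior, not the exact numbers.

```python
# Rough count of attention-matrix entries per layer (ignoring heads and constants).
# N = tokens per frame is an assumed example value, not a number from the paper.
N = 340                                  # assumed tokens per frame (pose + image tokens)

for T in (5, 10, 15):                    # number of frames in the sequence
    vanilla = (T * N) ** 2               # one full attention over all T*N tokens
    decoupled = N * T ** 2 + T * N ** 2  # N temporal (T x T) + T intra-frame (N x N) attentions
    print(f"T={T:2d}  vanilla={vanilla:>12,}  decoupled={decoupled:>12,}  "
          f"ratio={vanilla / decoupled:5.1f}x")
```

The vanilla cost grows with $(T \cdot N)^2$ while the decoupled cost grows with $N T^2 + T N^2$, so the gap widens as more frames are added, matching the trend in Table 6.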
7. Conclusion & Reflections
- Conclusion Summary: DrivingWorld successfully adapts the GPT framework for autonomous driving by introducing a novel spatio-temporal architecture. By decoupling temporal and spatial modeling and incorporating specialized strategies for long-term consistency (RMS) and control (Balanced Attention), the model generates high-fidelity, controllable driving videos over 40 seconds long. This work represents a significant step forward in building powerful, scalable world models for autonomous driving simulation.
- Limitations & Future Work: The authors state their intention to extend the model by:
- Incorporating more multimodal information (e.g., LiDAR, radar).
- Integrating multiple camera view inputs to build a more comprehensive 3D understanding of the environment.
- Personal Insights & Critique:
- Strengths: The paper's core strength is its thoughtful architectural design. The decoupling of temporal and spatial modeling is an elegant solution to the scaling and performance issues of vanilla GPTs for video. The ablation studies convincingly demonstrate the value of each proposed component, from the temporal-aware tokenizer to the RMS and balanced attention.
- Potential Weaknesses: The model's impressive performance relies heavily on a very large, private dataset (3336 hours). This makes it difficult for the broader research community to reproduce the results and raises questions about how well the approach would work with only publicly available data. The "DrivingWorld (w/o P)" results in Table 2 suggest there is a noticeable performance drop without this private data.
- Future Impact: This work provides a strong blueprint for future research into autoregressive world models. The principles of decoupling information streams (spatial, temporal, multimodal) and designing specific training strategies (like RMS) to handle autoregressive error accumulation are highly transferable to other domains beyond driving, such as robotics and long-form video synthesis. It reinforces the idea that directly applying models from one domain (NLP) to another (vision) is rarely optimal; thoughtful adaptation is key.