ADriver-I: A General World Model for Autonomous Driving
TL;DR Summary
ADriver-I integrates vision-action pairs using MLLM and diffusion models to autoregressively predict control signals and future scenes, enabling iterative autonomous driving and significantly improving performance.
Abstract
Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning and control parts. Though interpretable, such modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLM) and diffusion techniques have demonstrated their superior performance on comprehension and generation ability. In this paper, we first introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals. Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are further conditioned to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction. Such a process can be repeated infinite times, ADriver-I achieves autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope our ADriver-I can provide some new insights for future autonomous driving and embodied intelligence.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "ADriver-I: A General World Model for Autonomous Driving." It proposes a novel framework that unifies control signal prediction and future scene generation for autonomous driving.
1.2. Authors
The authors are:
- Fan Jia* (MEGVII Technology)
- Weixin Mao* (Waseda University)
- Yingfei Liu (MEGVII Technology)
- Yucheng Zhao (MEGVII Technology)
- Yuqing Wen† (University of Science and Technology of China)
- Chi Zhang (Mach Drive)
- Xiangyu Zhang (MEGVII Technology)
- Tiancai Wang‡ (MEGVII Technology)
(* indicates equal contribution; † and ‡ indicate corresponding authors)
1.3. Journal/Conference
This paper was published on arXiv, a preprint server. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. arXiv is a widely used platform in the scientific community for rapid dissemination of research findings, allowing researchers to share their work before formal publication.
1.4. Publication Year
2023
1.5. Abstract
The paper introduces ADriver-I, a general world model for autonomous driving that addresses the redundancy inherent in traditional modular autonomous driving systems (perception, prediction, planning, control). Leveraging the strengths of multimodal large language models (MLLMs) and diffusion techniques, ADriver-I proposes the concept of an interleaved vision-action pair to unify visual features and control signals. The model takes these vision-action pairs as input to autoregressively predict the control signal for the current frame. Subsequently, these generated control signals, along with historical vision-action pairs, condition a diffusion model to predict future frames. This process allows ADriver-I to iteratively predict the next frame and then the control signal for that predicted frame, enabling "infinite driving" within its self-created world. Extensive experiments on the nuScenes dataset and a large-scale private dataset demonstrate ADriver-I's impressive performance compared to established baselines, offering new insights for future autonomous driving and embodied intelligence research.
1.6. Original Source Link
https://arxiv.org/abs/2311.13549v1 (published as a preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2311.13549v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the redundancy and inefficiency of traditional modular autonomous driving (AD) systems.
- Traditional Modular Design: Typically, autonomous driving systems are broken down into sequential components: perception (detecting objects and lanes), prediction (forecasting object trajectories), planning (determining the ego-car's path), and control (generating low-level steering and acceleration/braking commands). While this modular design is interpretable and easier to debug, the authors argue that it introduces a substantial amount of redundancy because information needs to be passed and re-processed between modules.
- Importance and Challenges: Autonomous driving is a safety-critical application where efficiency, robustness, and human-like reasoning are paramount. The challenges with modular systems include error propagation (an error in perception can cascade through prediction and planning) and sub-optimal global performance due to local optimizations within each module. Human drivers, in contrast, tend to act more directly based on visual information and implicitly predict the future, suggesting that a more end-to-end or world-model-like approach might be more natural and efficient.
- Paper's Innovative Idea: The paper's entry point is to unify control signal prediction (the action) and future scene generation (the world) into a single framework, inspired by recent advances in multimodal large language models (MLLMs) and diffusion techniques. The core innovation is the concept of an interleaved vision-action pair, which allows the model to process visual data and control signals in a unified format, enabling a system that can "understand" its environment, "act" within it, and "imagine" its future state.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Concept of the Interleaved Vision-Action Pair: Introduction of a novel data format that unifies visual features and control signals (actions). This allows MLLMs to process both modalities in a coherent, text-like manner, enabling multi-round conversations and flexible handling of unfixed frame lengths.
- ADriver-I World Model: Construction of a general world model for autonomous driving, named ADriver-I, built on the integration of an MLLM and a Video Diffusion Model (VDM). The model unifies two critical tasks: control signal prediction and future scene generation.
- Infinite Driving Capability: ADriver-I demonstrates "infinite driving" in a self-created world. By iteratively predicting the current control signal, generating the next visual frame based on that action, and then using the generated frame as input for the next control prediction, the model can simulate continuous driving without external visual input beyond an initial sequence.
- Strong Performance on Benchmarks: Extensive experiments on the nuScenes dataset and a large-scale private dataset show impressive performance in both control signal prediction (e.g., an L1 error of 0.072 m/s for speed and 0.091 rad for steer angle on nuScenes) and future scene generation (e.g., FID of 5.5 and FVD of 97.0 on nuScenes), outperforming several constructed baselines.
- New Insights for Embodied Intelligence: The authors hope that ADriver-I provides new insights for the future of autonomous driving and the broader field of embodied intelligence, particularly in leveraging generative models and large language models for complex real-world tasks.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand ADriver-I, a reader should be familiar with the following core concepts:
- Autonomous Driving (AD) System Architectures:
- Modular Design: The traditional approach, separating AD into distinct, sequential components:
  - Perception: Processing raw sensor data (cameras, LiDAR, radar) to detect, classify, and track objects (vehicles, pedestrians, lanes) and understand the environment.
  - Prediction: Forecasting the future behavior and trajectories of dynamic objects in the scene.
  - Planning: Determining the optimal, collision-free path and speed for the ego-vehicle, often represented as waypoints.
  - Control: Translating the planned trajectory into low-level executable commands for the vehicle's actuators (steering angle, throttle, brake).
- End-to-End Driving: An alternative approach where raw sensor inputs are directly mapped to control outputs (or planning commands) without explicit intermediate modules. This mimics human driving more closely.
- Multimodal Large Language Models (MLLMs):
- Large Language Models (LLMs): Powerful neural networks trained on vast amounts of text data to understand, generate, and reason with human language. Examples include GPT series, LLaMA, Vicuna. They excel at tasks like text completion, summarization, and question answering.
- Multimodal: The ability to process and integrate information from multiple modalities, such as text, images, and video. MLLMs extend LLMs by incorporating visual encoders, allowing them to comprehend and generate responses based on both text and visual inputs. They align features from different modalities into a common embedding space.
- Diffusion Models:
- A class of generative models that learn to reverse a diffusion process. In the forward diffusion process, noise is gradually added to data (e.g., an image) until it becomes pure noise. The reverse diffusion process learns to denoise the data, step by step, to generate new, realistic samples from noise. They have shown remarkable success in generating high-quality images and videos (a minimal reverse-denoising sketch appears after this list).
  - Latent Diffusion Models (LDMs): A specific type of diffusion model that performs the diffusion process in a compressed latent space rather than the pixel space. This makes them computationally more efficient while maintaining high generation quality. Stable Diffusion is a prominent example.
- World Models:
- In the context of reinforcement learning and robotics, a world model is a learned representation of the environment that allows an agent to predict future states or outcomes given its current state and a proposed action. It enables agents to plan and learn within a simulated environment, often referred to as mental simulation. The paper extends this concept to autonomous driving, where the "world" includes both future visual scenes and corresponding actions.
- Vision-Action Pairs: A fundamental concept in embodied AI, where a visual observation (image/video) is explicitly linked with an action taken by an agent. This pairing is crucial for learning policies that map observations to actions. The paper introduces an interleaved variant of this concept.
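To make the reverse-denoising idea above concrete, here is a minimal, illustrative DDPM-style sampling loop in Python/PyTorch. It is a sketch only: the noise-prediction network `eps_model`, the number of steps, and the linear beta schedule are assumptions chosen for illustration, not details taken from the paper.

```python
import torch

def ddpm_sample(eps_model, shape, num_steps=50, device="cpu"):
    """Minimal DDPM-style ancestral sampling loop (illustrative only).

    eps_model(x_t, t) is assumed to predict the noise that was added at step t.
    """
    # Linear beta schedule (an assumption; real models use tuned schedules).
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)              # start from pure noise
    for t in reversed(range(num_steps)):
        eps = eps_model(x, torch.tensor([t], device=device))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise         # one reverse (denoising) step
    return x

# Usage with a dummy noise predictor, just to show the call pattern.
dummy_eps = lambda x, t: torch.zeros_like(x)
sample = ddpm_sample(dummy_eps, shape=(1, 3, 32, 32))
```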
3.2. Previous Works
The paper contextualizes ADriver-I by discussing advancements in MLLMs, end-to-end autonomous driving, generative models for AD, and existing world models.
3.2.1. Multimodal Large Language Models (MLLMs) and Embodied Intelligence
- General MLLMs: Models like LLaVA [31, 32], miniGPT4 [67], and BLIP-2 [29] integrate image and text to achieve cross-modal understanding, enabling dialogue with images at different levels of granularity. They typically use a vision encoder (e.g., CLIP-ViT) to extract visual features and a visual adapter (e.g., an MLP) to align these features with the LLM's language embeddings.
- Embodied Intelligence with MLLMs: MLLMs are increasingly applied to embodied intelligence, where agents learn to act in physical or simulated environments.
  - VIMA [24]: Utilizes Mask R-CNN [16] to extract object regions, which are then combined with text descriptions and fed into a transformer to predict motor actions.
  - VoxPoser [22]: Combines pretrained MLLMs and LLMs to generate a value map for motion planning, eliminating the need for additional training.
  - PaLM-E [10]: A large embodied multimodal language model that can perform tasks such as motion planning, tabletop manipulation, and image description by combining PaLM [7] and ViT-22B [8].
  - RT-2 [3]: Proposes vision-language-action (VLA) models for robot control, directly outputting low-level control signals by leveraging web knowledge.
- Concurrent Work - DriveGPT4 [63]: This MLLM-based approach, concurrent with ADriver-I, takes video and text as input to output control signals and provides reasons for interpretability. ADriver-I's distinction lies in its unified control and future generation, and its 'infinite driving' capability.
3.2.2. End-to-End Autonomous Driving
- Encoder-Decoder Paradigms: Many state-of-the-art end-to-end methods use an encoder-decoder structure to extract information from raw sensor data and predict planning results.
  - Transfuser [40] and TCP [61]: Directly predict waypoints (planning results) without explicit scene representations.
  - ST-P3 [20]: Builds a dense cost map from semantic maps and uses hand-crafted rules for optimal trajectory planning.
  - UniAD [21]: Integrates diverse scene representations (segmentation maps, motion flow maps, BEV occupancy maps) hierarchically.
  - VAD [23]: Employs a fully vectorized approach using vectorized agent motion and map representations, reducing computational cost.

  ADriver-I differs by predicting low-level control signals directly and generating future scenes, rather than only planning waypoints.
3.2.3. Generative Models for Autonomous Driving
Scene generation is crucial for data augmentation, simulation, and understanding.
- Generative Adversarial Networks (GANs) [12] and Variational Auto-Encoders (VAEs) [27, 55]: Earlier generative models.
  - DriveGAN [26]: Predicts future driving videos by correlating actions with pixel changes.
  - BEVGen [51]: Built on VQ-VAE [55], it generates multi-view images from Bird's Eye View (BEV) layouts.
  - BEVControl [64]: Generates foreground and background in street-view images, allowing sketch-style inputs.
- Diffusion Models: More recent and powerful generative models.
  - Video Latent Diffusion Model [2]: Builds upon Stable Diffusion [46] to synthesize high-resolution video, demonstrating impressive generation quality. This is a direct inspiration for ADriver-I's VDM component.
  - Panacea [1]: A layout-conditioned video generation system for diversifying training data for perception models.

  ADriver-I leverages diffusion models for future scene generation but integrates them with control signal prediction, making the generation action-conditioned.
3.2.4. World Models for Autonomous Driving
World models are central to ADriver-I. The paper distinguishes two definitions:
- Pure Future Prediction: Models that only predict future states.
- Unifying Action Prediction and Future Generation: Models that predict both actions and their consequences in the environment. ADriver-I falls into this category.
- Reinforcement Learning/Robotics World Models: In these fields [13, 14, 25, 49, 60, 62], world models predict environmental responses to agent actions using various data types (RGB, depth, point clouds).
- AD-specific World Models:
  - GAIA-1 [19]: A generative world model that takes video, text, and action as inputs to generate realistic driving scenarios. Differentiation: ADriver-I notes that GAIA-1 primarily focuses on scenario generation, "ignoring the control signal prediction."
  - DriveDreamer [58]: Another world model that generates future driving scenarios and predicts control signals. Differentiation: DriveDreamer "relies heavily on rich prior information, like high-definition (HD) maps and 3D bounding boxes, for future generation." ADriver-I aims to eliminate the need for such extensive prior information.
3.3. Differentiation Analysis
ADriver-I distinguishes itself from previous work through several key innovations:
- Unified Framework: Unlike DriveGPT4, which focuses on control prediction and interpretability, or GAIA-1, which is primarily a scenario generator, ADriver-I explicitly unifies control signal prediction and future scene generation within a single, coherent framework.
- Minimal Prior Information for Generation: Unlike DriveDreamer, which requires extensive HD maps and 3D bounding boxes for future generation, ADriver-I aims to achieve robust future scene generation with minimal prior information, relying primarily on historical frames and predicted control signals.
- Interleaved Vision-Action Pair: This novel concept is central to ADriver-I, providing a unified format for handling both visual input and control signals within the MLLM and facilitating flexible multi-round reasoning.
- Infinite Driving Concept: ADriver-I is presented as the first to introduce and demonstrate the concept of infinite driving, where the model recursively generates its own future world and drives within it. This self-contained simulation loop is a significant conceptual advance for embodied AI in AD.
- MLLM + VDM Integration: The specific integration of a general-purpose MLLM (for reasoning and action prediction) with a specialized VDM (for high-fidelity, action-conditioned future scene generation) provides a powerful combination.
4. Methodology
4.1. Principles
The core idea behind ADriver-I is to create a general world model for autonomous driving that mimics a human driver's ability to both take actions based on visual input and anticipate the future consequences of those actions. This is achieved by tightly integrating a Multimodal Large Language Model (MLLM) with a Video Diffusion Model (VDM). The interleaved vision-action pair concept is the glue that unifies visual information and control signals, allowing the MLLM to reason over both, and enabling the VDM to generate future scenes conditioned on predicted actions. The system is designed to be autoregressive, meaning it uses its own predictions (actions and future frames) to drive subsequent predictions, leading to the "infinite driving" capability.
4.2. Core Methodology In-depth (Layer by Layer)
The ADriver-I framework is composed of two main components: a Multi-modal Large Language Model (MLLM) for control signal prediction and a Video Diffusion Model (VDM) for future scene generation. These components work in a synergistic loop.
4.2.1. Overall Architecture and Pipeline
The overall framework of ADriver-I is illustrated in Figure 1 of the original paper.
The general workflow is as follows:
- Input: The system receives a sequence of historical interleaved vision-action pairs and the current visual frame. An interleaved vision-action pair consists of a visual frame $I_t$ and its corresponding low-level control signal $A_t$.
- Control Signal Prediction (MLLM): The MLLM processes these inputs to autoregressively predict the low-level control signal $A_t$ for the current frame.
- Future Frame Generation (VDM): The predicted control signal $A_t$, together with the historical vision-action pairs, is then fed into the VDM, which generates a sequence of future frames.
- Infinite Driving Loop: The crucial aspect for "infinite driving" is that the generated next frame $I_{t+1}$ becomes the "current frame" for the next timestep $t+1$. This generated frame is paired with the action $A_t$ (which produced it) to form a new historical vision-action pair. The new pair, together with the newly generated current frame $I_{t+1}$, is fed back into the MLLM to predict the next control signal $A_{t+1}$. This process can be repeated indefinitely; a minimal sketch of the loop is given after Figure 1 below.

The following figure (Figure 1 from the original paper) shows the system architecture:
The figure is a schematic of the ADriver-I framework: historical interleaved vision-action pairs $\{I, A\}$ and the current visual frame are taken as input, a multimodal large language model reasons out the control signal $A_t$ of the current frame, and a diffusion model predicts the next frame $I_{t+1}$; this process can be iterated indefinitely.
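To make the loop concrete, here is a minimal, schematic sketch of the ADriver-I inference cycle in Python. The function names (`mllm_predict_action`, `vdm_generate_next_frame`), the dummy return values, and the deque-based history window are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

# Illustrative stand-ins for the two models (assumptions, not the real components):
# the MLLM maps (history, frame) -> control signal, the VDM -> next frame.
def mllm_predict_action(history, current_frame):
    return {"speed": 3.63, "steer_angle": -0.02}   # dummy control signal

def vdm_generate_next_frame(history, current_frame, action):
    return current_frame                            # dummy "generated" frame

def infinite_driving(initial_pairs, current_frame, num_steps=10, history_len=3):
    """Schematic ADriver-I loop: predict action -> generate next frame -> repeat."""
    history = deque(initial_pairs, maxlen=history_len)   # [(frame, action), ...]
    rollout = []
    for _ in range(num_steps):
        action = mllm_predict_action(list(history), current_frame)                  # MLLM step
        next_frame = vdm_generate_next_frame(list(history), current_frame, action)  # VDM step
        history.append((current_frame, action))    # new interleaved vision-action pair
        rollout.append((next_frame, action))
        current_frame = next_frame                  # drive inside the self-generated world
    return rollout

# Usage: three historical pairs plus the current frame, as in the paper's setup.
pairs = [("frame[-3]", {"speed": 3.5}), ("frame[-2]", {"speed": 3.6}), ("frame[-1]", {"speed": 3.6})]
rollout = infinite_driving(pairs, current_frame="frame[0]")
```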
4.2.2. Multi-modal Large Language Model (MLLM)
The MLLM is responsible for interpreting visual inputs and predicting control signals. It consists of three main modules:
- Pre-trained Large Language Model (LLM): Vicuna-7B-1.5 [5] is used as the base LLM. Vicuna is itself fine-tuned from LLaMA2 [53], providing strong language comprehension and generation capabilities.
- Visual Encoder: CLIP-ViT-Large [44] is employed to extract features from the input image frames. This encoder is pre-trained on a massive dataset of image-text pairs, enabling it to produce rich visual representations.
- Visual Adapter: Two Multi-Layer Perceptron (MLP) layers serve as the visual adapter. These layers are pre-trained by LLaVA-7B-1.5 [31] to align the visual features extracted by the encoder with the language feature space of the LLM. This alignment is critical for the LLM to understand and reason about visual inputs in a multimodal context (a small sketch follows this list).
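The visual pathway can be pictured as follows: a hedged PyTorch sketch in which CLIP-ViT patch features pass through a two-layer MLP adapter and are concatenated with text embeddings before entering the LLM. The dimensions, layer sizes, and module names are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Two-layer MLP that maps vision features into the LLM's embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, vision_feats):          # (batch, num_patches, vision_dim)
        return self.mlp(vision_feats)         # (batch, num_patches, llm_dim)

# Illustrative forward pass: CLIP-ViT features -> adapter -> prepend to text embeddings.
batch, patches, vision_dim, llm_dim, text_len = 1, 256, 1024, 4096, 32
clip_features = torch.randn(batch, patches, vision_dim)     # stand-in for CLIP-ViT output
text_embeddings = torch.randn(batch, text_len, llm_dim)     # stand-in for the tokenized prompt

adapter = VisualAdapter(vision_dim, llm_dim)
visual_tokens = adapter(clip_features)                          # <TOKEN_img>-style embeddings
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)  # fed into the LLM
print(llm_input.shape)   # torch.Size([1, 288, 4096])
```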
Prompt Construction for MLLM: The paper introduces a specific method to construct prompts for the MLLM, making visual and control information consumable by the LLM.
- Low-level Control Signal Conversion: The raw numerical low-level control signals (e.g., speed in m/s, steer_angle in radians) are converted into text-like expressions. For example, a speed of 3.63 m/s might be represented as "The speed is <num_start> 3.63 <num_end>". The tokens <num_start> and <num_end> are special tokens that delimit numerical values, facilitating their processing by the LLM.
- Visual Tokens: Each video frame is first processed by the CLIP-ViT-Large visual encoder and then by the visual adapter to produce visual tokens <TOKEN_img>.
- Action Tokens: The text-converted control signals are processed by a language tokenizer to obtain action tokens <TOKEN_act>.
- Interleaved Vision-Action Pair: A visual token <TOKEN_img> for frame $I_t$ is paired with its corresponding action token <TOKEN_act> for action $A_t$. This forms an interleaved vision-action pair.
  - Advantages of the Interleaved Vision-Action Pair:
    - Flexibility for Multi-Round Conversation: It enables flexible multi-round conversations with unfixed frame lengths, which is crucial for the autoregressive nature of the world model.
    - Unified Embedding Space: It unifies future generation and action prediction within the word embedding space of the LLM, allowing a cohesive reasoning process.
- System Prompt: A system prompt (<SYS>) is included to set the context and guide the MLLM's reasoning mode. This prompt helps the MLLM understand its role as an autonomous driving agent.
- Conversation Structure: The overall conversation structure for the MLLM can be summarized as (a string-assembly sketch follows below):
  Human: <SYS> <TOKEN_img> <TOKEN_act> <STOP>
  Agent: <TOKEN_act> <STOP>
  Here, <STOP> is a special token indicating the end of a turn or generation. The MLLM, acting as the Agent, is prompted to output the action tokens.
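A minimal sketch of how such an interleaved prompt might be assembled as a string is shown below. The exact prompt wording, the system-prompt text, and the helper names are assumptions for illustration; only the special tokens (<num_start>, <num_end>, <TOKEN_img>, <STOP>) follow the description above.

```python
def action_to_text(speed_mps, steer_rad, decimals=2):
    """Convert raw control signals into the text-like expression described above."""
    return (f"The speed is <num_start> {speed_mps:.{decimals}f} <num_end>, "
            f"the steer angle is <num_start> {steer_rad:.{decimals}f} <num_end>.")

def build_interleaved_prompt(history, system_prompt="<SYS>"):
    """Assemble a multi-round prompt from historical (frame_placeholder, action) pairs."""
    turns = [system_prompt]
    for _, (speed, steer) in history:
        # <TOKEN_img> stands in for the visual tokens of that frame.
        turns.append(f"Human: <TOKEN_img> {action_to_text(speed, steer)} <STOP>")
    turns.append("Human: <TOKEN_img> <STOP>")       # current frame, action to be predicted
    turns.append("Agent:")                           # the MLLM completes the action tokens
    return "\n".join(turns)

history = [("img_t-3", (3.50, -0.01)), ("img_t-2", (3.58, -0.02)), ("img_t-1", (3.63, -0.02))]
print(build_interleaved_prompt(history))
```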
4.2.3. Video Diffusion Model (VDM)
The VDM is responsible for generating realistic future frames based on historical context and predicted actions.
- Base Model: The VDM is built upon a latent diffusion model [47] designed for video generation. Specifically, it inherits the architecture of Stable Diffusion 2.1 [46] and incorporates temporal-awareness modules similar to those in the video latent diffusion model [2]. Temporal-awareness modules are crucial for generating coherent sequences of frames, ensuring continuity and realistic motion.
- History-Conditioned Functionality: The VDM is enriched with the ability to condition its generation on historical frames. This is achieved by integrating a reference-video control mechanism, which concatenates the given historical frames with the diffusion input. This allows the VDM to maintain context and consistency with the past.
- Action-Guided Future Generation: Text condition modules from the base diffusion model are retained. These modules enable the VDM to receive the predicted control signals (converted into natural language prompts) as conditions for future frame generation. This means the MLLM's predicted actions directly influence how the future scene unfolds visually.
Prompt Construction for VDM:
Since the VDM's text encoder is not as sophisticated in reasoning as the MLLM, it struggles with direct interpretation of numerical control signals (e.g., understanding that a negative steer angle means turning right).
- GPT3.5 for Motion Description: To address this, the authors utilize GPT3.5 [36] (a powerful LLM) to convert the low-level numerical control signals (speed, steer angle) into a descriptive motion prompt. This makes the condition more semantically understandable for the VDM's text encoder.
- Example Motion Prompt: As shown in Figure 3, control signals from consecutive frames are given to GPT3.5, which is instructed to output common driving states such as "steady speed," "accelerating," "decelerating," and "turning." For example, given specific speed and steer angle values, GPT3.5 might output: "The vehicle is moving at a moderate speed, with a slight right turn in progress. The steering angle remains consistent, indicating a gradual turn. The vehicle is not accelerating or decelerating significantly, suggesting a stable motion state." This descriptive text then conditions the VDM (a small conversion sketch follows this list).
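For illustration, the sketch below composes the kind of request one might send to GPT-3.5 and, instead of calling the API, includes a rule-based local stand-in that maps numeric controls to a coarse motion description. The wording of the instruction, the thresholds, and the sign convention (negative steer angle = turning right, as stated above) are assumptions.

```python
def build_gpt_instruction(controls):
    """Compose the kind of request sent to GPT-3.5 (exact wording is assumed)."""
    lines = [f"Frame {i}: speed={s:.2f} m/s, steer_angle={a:.3f} rad"
             for i, (s, a) in enumerate(controls)]
    return ("Describe the vehicle's motion state (steady speed, accelerating, "
            "decelerating, turning) for the following control signals:\n" + "\n".join(lines))

def rule_based_motion_prompt(controls):
    """Local stand-in for GPT-3.5: maps numeric controls to a coarse motion description."""
    speeds = [s for s, _ in controls]
    steers = [a for _, a in controls]
    accel = speeds[-1] - speeds[0]
    trend = ("accelerating" if accel > 0.2 else
             "decelerating" if accel < -0.2 else "moving at a steady speed")
    turn = ("turning right" if steers[-1] < -0.05 else
            "turning left" if steers[-1] > 0.05 else "keeping straight")
    return f"The vehicle is {trend} and {turn}."

controls = [(3.50, -0.01), (3.58, -0.02), (3.63, -0.02)]
print(build_gpt_instruction(controls))
print(rule_based_motion_prompt(controls))
```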
The following figure (Figure 3 from the original paper) illustrates an example of conversation used for guiding GPT3.5 to generate the corresponding motion prompt.
4.2.4. Model Training
The MLLM and VDM components are trained separately and then merged for inference.
- Training for MLLM:
  - Datasets: Pretrained on a large-scale private dataset (approx. 1.4 million vision-action pairs) focusing on highway scenarios. For supervised finetuning (SFT), it is trained on both the nuScenes dataset and the private dataset.
  - Pretraining Phase: During pretraining, the LLM (Vicuna-7B-1.5) is frozen. Only the parameters of the vision encoder (CLIP-ViT-Large) and the visual adapter layers are updated. This aligns visual features with the frozen LLM's language space.
  - Supervised Finetuning (SFT) Phase: During SFT, the vision encoder is frozen and the rest of the model (LLM and visual adapter) is trained. This allows the LLM to learn to predict control signals based on the aligned multimodal inputs (the freeze schedule is sketched at the end of this subsection).
  - Loss Function: The MLLM outputs text (action tokens), so cross-entropy loss is used for supervision, similar to how LLaVA [32] is trained.
  - Hyperparameters: Trained for 2 epochs for both pretraining and SFT, with a batch size of 16. Input image size is . Control signals are multiplied by 1000 to convert them to integers, simplifying convergence for the LLM. The AdamW optimizer is used with a learning rate of .
- Training for VDM:
- Datasets: Pretrained on the 1.4 million private dataset. Finetuned on the nuScenes dataset (approx. 23,000 video samples).
  - Pretraining Phase: The VDM inherits weights from Stable Diffusion [45] and is then pretrained on the private dataset for 40,000 steps with a batch size of 128.
  - Finetuning Phase: The VDM is finetuned on the nuScenes dataset for an additional 40,000 steps with a reduced batch size of 32.
  - Hyperparameters: Spatial resolution is , and the video length is 8 frames for both pretraining and finetuning. Learning rates are for pretraining and for finetuning. During inference, the DDIM [50] sampler is used with 50 sampling steps.
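The two-stage freeze schedule described above for the MLLM can be expressed compactly; the sketch below shows the idea in PyTorch, with small placeholder modules standing in for the real Vicuna/CLIP/adapter components. It illustrates the freezing logic only and is not the authors' training code.

```python
import torch.nn as nn

# Placeholder modules standing in for the real components (assumptions for illustration).
class MLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(1024, 1024)   # stands in for CLIP-ViT-Large
        self.visual_adapter = nn.Linear(1024, 4096)   # stands in for the 2-layer MLP adapter
        self.llm = nn.Linear(4096, 4096)              # stands in for Vicuna-7B-1.5

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = MLLM()

# Stage 1 (pretraining): freeze the LLM, train the vision encoder and the adapter.
set_trainable(model.llm, False)
set_trainable(model.vision_encoder, True)
set_trainable(model.visual_adapter, True)

# Stage 2 (SFT): freeze the vision encoder, train the adapter and the LLM.
set_trainable(model.vision_encoder, False)
set_trainable(model.visual_adapter, True)
set_trainable(model.llm, True)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters in the SFT stage: {trainable}")
```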
5. Experimental Setup
5.1. Datasets
The experiments for ADriver-I are conducted on two main datasets:
- nuScenes Dataset:
- Source: A publicly available, large-scale dataset for autonomous driving research.
- Characteristics: It provides a comprehensive set of sensor data, including 6 cameras, 5 radars, and 1 LiDAR, covering a 360-degree field of view. It contains 1000 scenes, each approximately 20 seconds long, with extensive annotations for 23 object classes and map layers.
- Domain: Diverse urban driving scenarios in Boston and Singapore.
- Usage in ADriver-I: Used for supervised finetuning of the MLLM and finetuning of the VDM (approx. 23K video samples).
- Why chosen: It is a standard benchmark in autonomous driving, allowing for comparison with other methods.
- Large-scale Private Dataset:
- Source: An internal dataset collected by the authors' organization (MEGVII Technology).
  - Scale: Contains nearly 1.4 million vision-action pairs.
  - Characteristics: Primarily focuses on highway scenarios, which implies less variance in steering angles and speeds compared to complex urban environments.
  - Usage in ADriver-I: Used for pretraining both the MLLM and the VDM, and also for supervised finetuning of the MLLM.
- Why chosen: Its large scale is crucial for pretraining large models, and its focus on highway scenarios helps the model learn stable driving behaviors.
5.2. Evaluation Metrics
The paper employs different evaluation metrics for control signal prediction and future scene generation.
5.2.1. Control Signal Prediction Metrics
- L1 Error: This metric calculates the mean absolute error between the predicted control signals and the ground truth, measuring the average magnitude of the errors.
- Conceptual Definition: The L1 error, or Mean Absolute Error (MAE), quantifies the average absolute difference between the predicted values and the true values. It is a robust measure that is less sensitive to outliers compared to L2 error (RMSE). For control signals, a lower L1 error indicates higher prediction accuracy.
- Mathematical Formula: Not explicitly provided in the text, but the common definition of L1 error (MAE) is: $ \mathrm{L1} = \frac{1}{N} \sum_{i=1}^{N} | \hat{x}_i - x_i | $
- Symbol Explanation:
  - $N$: The total number of samples (predictions).
  - $\hat{x}_i$: The predicted value for the $i$-th sample (e.g., predicted speed or steer angle).
  - $x_i$: The ground truth value for the $i$-th sample (e.g., actual speed or steer angle).
  - $|\cdot|$: The absolute value function.
- Accuracy with different thresholds ($A_\theta$): This metric measures the proportion of predictions that fall within a specified absolute error threshold of the ground truth.
  - Conceptual Definition: $A_\theta$ assesses how often the model's prediction is "close enough" to the ground truth, where "close enough" is defined by the threshold $\theta$. A higher $A_\theta$ indicates better performance for a given tolerance level.
  - Mathematical Formula: The paper provides the definitions of $F$ and $A_\theta$ (a small computation sketch follows this list): $ F(\hat{x}_i, x_i, \theta) = \begin{cases} 1, & |\hat{x}_i - x_i| \le \theta \\ 0, & |\hat{x}_i - x_i| > \theta \end{cases} $ and $ A_\theta = \frac{1}{N} \sum_{i=1}^{N} F(\hat{x}_i, x_i, \theta) $
  - Symbol Explanation:
    - $F(\hat{x}_i, x_i, \theta)$: An indicator function that returns 1 if the absolute difference between prediction and ground truth for sample $i$ is less than or equal to the threshold $\theta$, and 0 otherwise.
    - $\hat{x}_i$: The predicted value for the $i$-th sample.
    - $x_i$: The ground truth value for the $i$-th sample.
    - $\theta$: The predefined threshold for accuracy. The paper uses $\theta \in \{0.01, 0.03, 0.05, 0.07\}$.
    - $N$: The total number of validation samples.
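For concreteness, here is a small NumPy sketch that computes the L1 error and the thresholded accuracy $A_\theta$ exactly as defined above (the example numbers are made up, not results from the paper).

```python
import numpy as np

def l1_error(pred, gt):
    """Mean absolute error between predictions and ground truth."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return np.mean(np.abs(pred - gt))

def accuracy_at_threshold(pred, gt, theta):
    """A_theta: fraction of samples with |pred - gt| <= theta."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return np.mean(np.abs(pred - gt) <= theta)

# Made-up speed predictions (m/s), purely to show the call pattern.
pred = [3.61, 3.70, 3.55, 3.68]
gt   = [3.63, 3.65, 3.60, 3.62]
print(l1_error(pred, gt))
for theta in (0.01, 0.03, 0.05, 0.07):
    print(theta, accuracy_at_threshold(pred, gt, theta))
```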
5.2.2. Future Scene Generation Metrics
- Frechet Inception Distance (FID) [17]:
  - Conceptual Definition: FID measures the similarity between the distribution of generated images and the distribution of real images. It computes the Fréchet distance between two Gaussian distributions fitted to feature representations of real and generated images (extracted from an Inception-v3 network). Lower FID scores indicate higher quality and realism of generated images.
  - Mathematical Formula: The formula for FID is: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}) $
  - Symbol Explanation:
    - $\mu_1$: The mean feature vector of real images.
    - $\mu_2$: The mean feature vector of generated images.
    - $\Sigma_1$: The covariance matrix of features for real images.
    - $\Sigma_2$: The covariance matrix of features for generated images.
    - $||\cdot||^2$: The squared L2 norm (Euclidean distance).
    - $\mathrm{Tr}(\cdot)$: The trace of a matrix.
    - $(\Sigma_1\Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
- Frechet Video Distance (FVD) [54]:
  - Conceptual Definition: FVD is an extension of FID to video sequences. It assesses the visual quality and temporal consistency of generated videos by comparing their feature distributions (extracted from a pre-trained video classification network, often based on Inception-v3 or similar architectures capable of processing video) against real videos. Lower FVD scores indicate more realistic and temporally coherent generated videos.
  - Mathematical Formula: FVD uses the same mathematical form as FID but applies it to video features. If the subscript $r$ denotes real video features and $g$ denotes generated video features, then (a short computation sketch follows this list): $ \mathrm{FVD} = ||\mu_r - \mu_g||^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}) $
  - Symbol Explanation:
    - $\mu_r$: The mean feature vector of real video sequences.
    - $\mu_g$: The mean feature vector of generated video sequences.
    - $\Sigma_r$: The covariance matrix of features for real video sequences.
    - $\Sigma_g$: The covariance matrix of features for generated video sequences.
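The Fréchet-distance computation shared by FID and FVD can be sketched as follows, using NumPy and SciPy's matrix square root. Real FID/FVD pipelines first extract Inception or video-network features; this toy example replaces them with random arrays.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy example with random "features" standing in for network activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(128, 64))
gen = rng.normal(loc=0.1, size=(128, 64))
print(frechet_distance(real, gen))
```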
5.3. Baselines
For comparing the performance of control signal prediction, the authors constructed three competitive baselines:
- MLP-Only Baseline (a):
  - Architecture: This is the simplest baseline. It takes historical action sequences (control signals) from the three preceding frames as input, which are fed into three fully-connected layers (an MLP) to predict the action of the current frame.
  - Representativeness: Represents a purely regression-based approach that relies only on temporal patterns of past actions, without any visual context.
- CNN-Based Baseline (b):
  - Architecture: Builds upon the MLP-Only baseline by incorporating visual information. It uses a Convolutional Neural Network (CNN) (e.g., ResNet [15]) to encode features from the input image, which are then processed by Global Average Pooling (GAP). These global image features are concatenated with the historical action features, and the combined features are used to predict the current action, likely through additional fully-connected layers (a code sketch of this baseline appears after the list).
  - Representativeness: Represents a common approach to combining visual features with other data streams for prediction tasks.
- ViT-Based Baseline (c):
  - Architecture: Similar to the CNN-Based baseline, but the CNN is replaced with a Vision Transformer (ViT) (e.g., ViT-B [9]) for visual feature encoding. The rest of the architecture (concatenation with action features, prediction layers) remains the same as the CNN-Based baseline.
  - Representativeness: Represents a more modern approach to visual feature extraction, leveraging transformers, which have shown strong performance in vision tasks.
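As a concrete reference point (see the figure below for the original diagrams), here is a hedged PyTorch sketch of the CNN-based baseline: a ResNet-style image encoder with global average pooling, concatenated with the three historical actions and passed through fully-connected layers. The layer sizes, the torchvision backbone choice, and the two-dimensional action vector are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNBaseline(nn.Module):
    """CNN-based baseline: image features (GAP) + 3 historical actions -> current action."""
    def __init__(self, action_dim=2, history_len=3):
        super().__init__()
        backbone = resnet18(weights=None)                               # stand-in for the ResNet encoder
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])   # keeps GAP, drops the classifier
        self.head = nn.Sequential(
            nn.Linear(512 + history_len * action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_dim),                                  # predicts (speed, steer_angle)
        )

    def forward(self, image, past_actions):        # image: (B,3,H,W), past_actions: (B,3,action_dim)
        feats = self.encoder(image).flatten(1)     # (B, 512) global image features
        x = torch.cat([feats, past_actions.flatten(1)], dim=1)
        return self.head(x)

model = CNNBaseline()
pred = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 2))
print(pred.shape)   # torch.Size([2, 2])
```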
The following figure (from the original paper) illustrates the three baseline architectures for context (the paper's text refers to the CNN and ViT backbones).
The figure compares the three baseline architectures: (a) the MLP-only structure, (b) the CNN-based structure, and (c) the ViT-based structure, showing how the action input A and image features I are processed by the different modules.
6. Results & Analysis
6.1. Core Results Analysis
ADriver-I demonstrates impressive performance in both control signal prediction and future scene generation, outperforming constructed baselines and achieving competitive results against other generative AD models.
6.1.1. Control Signal Prediction
The quantitative results for control signal prediction are presented in Table 2, comparing ADriver-I with three baselines (MLP-Only, CNN-Based, ViT-Based) on both nuScenes and the private datasets.
The following are the results from Table 2 of the original paper:
| Method | Speed L1↓ (m/s) | Speed A0.01↑ | Speed A0.03↑ | Speed A0.05↑ | Speed A0.07↑ | Steer L1↓ (rad) | Steer A0.01↑ | Steer A0.03↑ | Steer A0.05↑ | Steer A0.07↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MLP-Only | 0.122 | 0.189 | 0.275 | 0.361 | 0.440 | 0.101 | 0.183 | 0.482 | 0.641 | 0.715 |
| CNN-Based | 0.106 | 0.191 | 0.301 | 0.407 | 0.474 | 0.095 | 0.277 | 0.527 | 0.648 | 0.721 |
| ViT-Based | 0.103 | 0.200 | 0.326 | 0.438 | 0.489 | 0.092 | 0.299 | 0.546 | 0.656 | 0.724 |
| ADriver-I | 0.072 | 0.237 | 0.398 | 0.535 | 0.640 | 0.091 | 0.411 | 0.575 | 0.664 | 0.731 |
| ADriver-I† | 0.035 | 0.295 | 0.519 | 0.790 | 0.862 | 0.015 | 0.643 | 0.840 | 0.925 | 0.964 |

(† denotes results on the private dataset; the other rows are evaluated on nuScenes.)
Analysis:
- Superiority of ADriver-I: On the nuScenes dataset, ADriver-I consistently outperforms all three baselines across almost all metrics for both speed and steer angle prediction.
  - For speed, ADriver-I achieves an L1 error of 0.072 m/s, significantly lower than the best baseline (ViT-Based at 0.103 m/s). Its accuracy metrics ($A_{0.01}$ to $A_{0.07}$) are also notably higher, especially at tighter thresholds (e.g., $A_{0.01}$ of 0.237 vs. ViT-Based at 0.200).
  - For steer angle, ADriver-I achieves an L1 error of 0.091 rad, slightly better than ViT-Based (0.092 rad). However, its accuracy at very tight thresholds (e.g., $A_{0.01}$ of 0.411 vs. ViT-Based at 0.299) is substantially higher. This suggests that while the mean error is similar to the best baseline in some cases, ADriver-I makes significantly more precise predictions. The authors attribute the similar L1 error for steer angle to "some special cases with large steer angle variance," implying that ADriver-I handles the majority of cases better.
- Impact of Visual Information: Comparing MLP-Only to CNN-Based and ViT-Based, there's a clear improvement when visual features are incorporated, highlighting the importance of understanding the scene for accurate control prediction.
- Impact of Transformer-based Vision: ViT-Based generally outperforms CNN-Based, indicating the benefit of Vision Transformers for extracting rich visual cues in autonomous driving tasks.
- Cross-Dataset Performance (ADriver-I†): The results for ADriver-I† (on the private dataset) show a dramatic improvement over the nuScenes performance: L1 errors of 0.035 m/s for speed and 0.015 rad for steer angle.
  - Reasons for Improvement: The authors attribute this to two factors:
    - The private dataset mainly focuses on highway scenarios, which typically involve smaller variance in speed and steer angles, making prediction easier.
    - The private dataset used for supervised finetuning is much larger (1.4M vs. 23K samples on nuScenes), providing more extensive training data.

  Figure 4 provides qualitative visualizations of control signal prediction. The left side shows input frames from timestamp t-3 to t. The right side's bar charts display the actions for t-3 to t-1, with the prediction (pred) and ground truth (GT) at time t. This helps to visually confirm the model's accuracy.
The following figure (Figure 4 from the original paper) shows qualitative visualization of control signal prediction:
The figure is a comparative illustration of ADriver-I's speed and steering-angle predictions at different time frames, showing the predicted values against the ground truth together with the corresponding scene images, reflecting the model's prediction accuracy and temporal coherence.
6.1.2. Ablation Studies for Control Signal Prediction
The paper conducts ablation studies to analyze the impact of key design choices on control signal prediction.
Ablation on Encoding Methods for Control Signal (Table 3): The following are the results from Table 3 of the original paper:
| Number Embedding | Speed L1↓ (m/s) | Steer angle L1↓ (rad) |
| --- | --- | --- |
| Num2English | 2.094 | 0.536 |
| Special Token | 0.094 | 0.106 |
| Relative Diff | 0.081 | 0.096 |
| Absolute Number | 0.072 | 0.091 |
Analysis:
- Direct Absolute Number is Best: Directly predicting the absolute number of the speed or steer angle yields the best performance (L1 errors of 0.072 m/s and 0.091 rad).
- Ineffectiveness of Num2English: Translating numbers into English descriptions (Num2English) performs poorly, indicating that the LLM struggles to perform precise numerical regression when values are represented as natural language.
- Benefits of Special Token and Relative Diff: Special Token (converting values to classification bins) and Relative Diff (predicting differences from the previous frame) perform better than Num2English but are still inferior to Absolute Number. This confirms that for precise numerical tasks, direct numerical prediction within the LLM's framework, facilitated by the special tokens used in ADriver-I's prompt construction, is most effective (an illustration of the four encodings is sketched below).
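To make the four encodings compared in Table 3 concrete, here is a small, purely illustrative Python sketch; the exact string formats and the bin width are hypothetical, chosen only to show how the representations differ.

```python
def num2english(x):
    """Spell the value out in words (one possible 'Num2English' format)."""
    digits = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
              "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine", ".": "point"}
    return " ".join(digits[c] for c in f"{x:.2f}")

def special_token(x, bin_width=0.1):
    """Quantize the value into a discrete bin token (illustrative bin width)."""
    return f"<bin_{int(round(x / bin_width))}>"

def relative_diff(x, prev_x):
    """Encode the change w.r.t. the previous frame instead of the absolute value."""
    return f"<num_start> {x - prev_x:+.2f} <num_end>"

def absolute_number(x):
    """Encode the absolute value between the special delimiter tokens (as in ADriver-I)."""
    return f"<num_start> {x:.2f} <num_end>"

speed_prev, speed_now = 3.58, 3.63
print(num2english(speed_now))                 # three point six three
print(special_token(speed_now))               # <bin_36>
print(relative_diff(speed_now, speed_prev))   # <num_start> +0.05 <num_end>
print(absolute_number(speed_now))             # <num_start> 3.63 <num_end>
```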
Ablation on Number of Decimal Places (Table 4): The following are the results from Table 4 of the original paper:
| Decimal places | Speed L1↓ (m/s) | Steer angle L1↓ (rad) |
| --- | --- | --- |
| 0 | 0.212 | 0.099 |
| 1 | 0.094 | 0.093 |
| 2 | 0.073 | 0.091 |
| 3 | 0.072 | 0.091 |
Analysis:
- Precision Matters: Using 0 or 1 decimal place significantly degrades performance, especially for speed prediction. This indicates that autonomous driving control signals require a certain level of precision.
- Optimal Decimal Places: Two decimal places achieve almost identical performance to three decimal places (e.g., a speed L1 error of 0.073 vs. 0.072, and a steer-angle L1 error of 0.091 for both). This suggests that two decimal places offer a good balance between precision and keeping the LLM's output space manageable.
Ablation on Effectiveness of Multi-Round Conversations (Table 5): The following are the results from Table 5 of the original paper:
| Conversation | Speed L1↓ (m/s) | Steer angle L1↓ (rad) |
| --- | --- | --- |
| Temporal Fusion | 0.078 | 0.092 |
| Single Round | 0.078 | 0.094 |
| Multi Round | 0.072 | 0.091 |
Analysis:
- Multi-Round Conversation is Key: The Multi Round conversation strategy significantly outperforms the Temporal Fusion and Single Round approaches, particularly for speed prediction (L1 error of 0.072 vs. 0.078).
- Benefit of Intermediate Supervision: Multi Round applies supervision to the intermediate action predictions during training. This distributed supervision helps reduce the cumulative error that can build up when only the final prediction is supervised, leading to more robust action prediction during inference.
- Comparison to Baselines: Temporal Fusion and Single Round are likely closer to the baselines in how they process temporal information, and their performance is worse, further validating the multi-round approach of ADriver-I.
6.1.3. Future Scene Generation
- Qualitative Results (Figure 6): The paper presents qualitative results for future frame prediction, showcasing the VDM's ability to generate realistic scenes for different driving maneuvers (e.g., turning left, turning right). The VDM successfully produces future scenes conditioned on historical frames and MLLM-predicted control signals, without needing high-level knowledge such as 3D bounding boxes or HD maps. This demonstrates its effectiveness in creating plausible visual futures.

  The following figure (Figure 6 from the original paper) shows qualitative results for future prediction generated by the video diffusion model.
- Quantitative Results (Table 6): The quality of the generated future frames is quantitatively evaluated using the FID and FVD metrics. The following are the results from Table 6 of the original paper:

  | Method | Input→Output | FID↓ | FVD↓ |
  | --- | --- | --- | --- |
  | DriveGAN | 1F→12F | 73.4 | 502.3 |
  | DriveDreamer | 1F+12B+12M→12F | 52.6 | 452.0 |
  | ADriver-I | 4F→4F | 5.5 | 97.0 |
Analysis:
- Superiority of ADriver-I: ADriver-I achieves significantly lower (better) FID (5.5) and FVD (97.0) scores compared to DriveGAN (FID 73.4, FVD 502.3) and DriveDreamer (FID 52.6, FVD 452.0).
- Efficiency in Input Priors: Notably, ADriver-I achieves this performance by conditioning on 4 historical frames (4F→4F) without relying on external 3D bounding boxes (B) or HD maps (M), which DriveDreamer heavily utilizes. This validates ADriver-I's claim of generating future scenes with minimal prior information. The superior performance, despite fewer explicit priors, highlights the power of the integrated MLLM+VDM approach and the effectiveness of diffusion models for high-quality video generation.
6.2. Joint Control & Generation
The paper verifies the "infinite driving" capability of ADriver-I, where the model drives in its self-created world.
- Recurrent Process: Given only three initial historical interleaved vision-action pairs, ADriver-I autonomously performs control signal prediction via the MLLM and future scene generation via the VDM in a recurrent loop.
- Interdependence: As shown in Figure 7, all video frames are generated by the VDM, and the corresponding speed and steer-angle predictions are produced by the MLLM. This visualization demonstrates a critical feedback loop: the predicted control signals directly influence the generated future scenes, and in turn, these generated scenes provide the visual input that prompts ADriver-I to take its subsequent actions. This continuous cycle effectively simulates autonomous driving in a dynamic, self-generated environment.
The following figure (Figure 7 from the original paper) shows the joint control and generation in ADriver-I, illustrating the infinite driving process:
The figure shows a sequence of visual frames from the driving scene along a time axis (from -1.5 s to +3.5 s), illustrating how the vehicle's speed and steering angle change over time. Each image is accompanied by a bar chart of the speed and steering angle at the corresponding timestamp, reflecting the link between the control signals and the visual input.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces ADriver-I, a novel world model for autonomous driving that unifies low-level control signal prediction and future scene generation. By leveraging a multimodal large language model (MLLM) and a video diffusion model (VDM) in an integrated framework, ADriver-I moves beyond traditional modular autonomous driving pipelines. The core innovation of interleaved vision-action pairs enables the MLLM to process multimodal information effectively. Experiments on nuScenes and a large private dataset demonstrate ADriver-I's impressive performance, outperforming strong baselines in control prediction and state-of-the-art methods in future scene generation, particularly without relying on extensive prior information like HD maps or 3D bounding boxes. Crucially, ADriver-I showcases the concept of "infinite driving" within a self-generated world, marking a significant step towards more holistic and human-like embodied intelligence in autonomous systems.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and outline potential future research directions:
- Low-Quality Video Frames: The generation module (VDM), acting as a closed-loop simulator, can produce low-quality video frames, especially during rapid control-signal changes. These low-quality frames could disturb the MLLM's control signal prediction in subsequent timesteps, leading to instability or errors in the infinite driving loop.
- Performance Gap for Deployment: Despite impressive results, the current performance of ADriver-I is still far from satisfactory for real-world deployment. The authors plan to update ADriver-I by training it with a much larger number of vision-control pairs from their continuously expanding private dataset, hoping to leverage scaling laws.
- Separate Training: The MLLM and VDM are trained separately, which prevents the system from benefiting from end-to-end optimization. A future goal is to develop a unified comprehension and generation framework that allows joint, end-to-end training.
- Lack of Routing Information: For long-distance autonomous driving, the current model lacks routing information (e.g., navigation maps). Integrating a navigation-map module is proposed to enable extended autonomous driving capabilities.
7.3. Personal Insights & Critique
This paper presents a truly fascinating and forward-looking approach to autonomous driving, moving beyond the traditional pipeline. The concept of "infinite driving" in a self-created world is a powerful demonstration of the potential of generative AI and MLLMs for embodied intelligence.
Inspirations and Applications:
- Holistic Learning: ADriver-I inspires a shift from fragmented, task-specific models to more holistic systems that can perceive, act, and imagine, much like a human. This could lead to more robust and generalized autonomous systems.
- Synthetic Data Generation: The VDM's ability to generate high-quality, action-conditioned future frames is invaluable for creating synthetic training data, especially for rare or dangerous scenarios, which could significantly accelerate AD development and testing.
- Explainable AI: While not a primary focus, the MLLM's inherent language capabilities could potentially be extended to provide interpretability for its actions, similar to DriveGPT4. Understanding why an autonomous vehicle chose a particular action is crucial for trust and deployment.
- Embodied AI Beyond Driving: The interleaved vision-action pair and the MLLM-VDM coupling could be a generalizable blueprint for other embodied AI tasks, such as robotics, virtual agents, or even human-computer interaction, where agents need to reason over multimodal inputs and predict the outcomes of their actions.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Reality Gap: The "infinite driving" in a self-created world, while innovative, faces the inherent reality-gap problem. The generated world might not perfectly reflect the complexities and unpredictable nuances of the real world. A critical challenge will be ensuring that policies learned in the self-generated world transfer effectively to reality. The authors acknowledge the VDM's potential for low-quality frames, which is a symptom of this.
- Computational Cost: MLLMs and diffusion models are computationally expensive to train and run. The feasibility of deploying such a complex system in real time on resource-constrained vehicle hardware remains a significant hurdle.
- Safety and Robustness under Novelty: The current system's ability to handle novel, out-of-distribution scenarios (e.g., unexpected obstacles, or adverse weather conditions not seen during training) is not explicitly evaluated. While MLLMs offer generalization, the reliance on a generated future raises questions about how the system would react to truly unforeseen events.
- Long-term Planning and High-Level Reasoning: While the system predicts near-future frames, long-term planning and high-level strategic reasoning (e.g., choosing a route, navigating complex intersections, interacting with human intent) are still handled implicitly or require external modules. The mention of routing information in the limitations hints at this. Integrating more sophisticated planning directly into the MLLM could be a future direction.
- Training Loop Stability: The separate training of the MLLM and VDM, while practical, means the VDM might generate frames that are "difficult" for the MLLM to act upon, and vice versa. An end-to-end trainable system could potentially learn to generate frames that are "easier" to act upon, or actions that lead to "easier-to-generate" futures, potentially creating a self-reinforcing but not necessarily optimal loop. A true joint optimization could lead to emergent behaviors but also carries risks of instability.

Overall, ADriver-I represents a compelling step towards more unified and intelligent autonomous systems, pushing the boundaries of what is possible with generative AI and large foundation models in embodied intelligence. The identified limitations highlight key challenges that the field needs to address for practical deployment.