
ADriver-I: A General World Model for Autonomous Driving

Published: 11/23/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ADriver-I integrates vision-action pairs using MLLM and diffusion models to autoregressively predict control signals and future scenes, enabling iterative autonomous driving and significantly improving performance.

Abstract

Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning and control parts. Though interpretable, such modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLM) and diffusion techniques have demonstrated their superior performance on comprehension and generation ability. In this paper, we first introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals. Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are further conditioned to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction. Such a process can be repeated infinite times, ADriver-I achieves autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope our ADriver-I can provide some new insights for future autonomous driving and embodied intelligence.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of the paper is "ADriver-I: A General World Model for Autonomous Driving." It proposes a novel framework that unifies control signal prediction and future scene generation for autonomous driving.

1.2. Authors

The authors are:

  • Fan Jia* (MEGVII Technology)

  • Weixin Mao* (Waseda University)

  • Yingfei Liu (MEGVII Technology)

  • Yucheng Zhao (MEGVII Technology)

  • Yuqing Wen† (University of Science and Technology of China)

  • Chi Zhang (Mach Drive)

  • Xiangyu Zhang (MEGVII Technology)

  • Tiancai Wang‡ (MEGVII Technology)

    (* indicates equal contribution, † indicates corresponding author, ‡ indicates corresponding author)

1.3. Journal/Conference

This paper was published on arXiv, a preprint server. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. arXiv is a widely used platform in the scientific community for rapid dissemination of research findings, allowing researchers to share their work before formal publication.

1.4. Publication Year

2023

1.5. Abstract

The paper introduces ADriver-I, a general world model for autonomous driving that addresses the redundancy inherent in traditional modular autonomous driving systems (perception, prediction, planning, control). Leveraging the strengths of multimodal large language models (MLLMs) and diffusion techniques, ADriver-I proposes the concept of an interleaved vision-action pair to unify visual features and control signals. The model takes these vision-action pairs as input to autoregressively predict the control signal for the current frame. Subsequently, these generated control signals, along with historical vision-action pairs, condition a diffusion model to predict future frames. This process allows ADriver-I to iteratively predict the next frame and then the control signal for that predicted frame, enabling "infinite driving" within its self-created world. Extensive experiments on the nuScenes dataset and a large-scale private dataset demonstrate ADriver-I's impressive performance compared to established baselines, offering new insights for future autonomous driving and embodied intelligence research.

Preprint (arXiv): https://arxiv.org/abs/2311.13549v1

PDF: https://arxiv.org/pdf/2311.13549v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the redundancy and inefficiency of traditional modular autonomous driving (AD) systems.

  • Traditional Modular Design: Typically, autonomous driving systems are broken down into sequential components: perception (detecting objects, lanes), prediction (forecasting object trajectories), planning (determining the ego-car's path), and control (generating low-level steering and acceleration/braking commands). While this modular design is interpretable and easier to debug, the authors argue it introduces a substantial amount of redundancy because information needs to be passed and re-processed between modules.
  • Importance and Challenges: Autonomous driving is a safety-critical application where efficiency, robustness, and human-like reasoning are paramount. The challenges with modular systems include error propagation (an error in perception can cascade through prediction and planning) and sub-optimal global performance due to local optimizations within each module. Human drivers, in contrast, tend to act more directly based on visual information and implicitly predict the future, suggesting a more end-to-end or world-model-like approach might be more natural and efficient.
  • Paper's Innovative Idea: The paper's entry point is to unify control signal prediction (the action) and future scene generation (the world) into a single framework, inspired by the recent advancements in multimodal large language models (MLLMs) and diffusion techniques. The core innovation is the concept of an interleaved vision-action pair, which allows the model to process visual data and control signals in a unified format, enabling a system that can both "understand" its environment, "act" within it, and "imagine" its future state.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Concept of Interleaved Vision-Action Pair: Introduction of a novel data format that unifies visual features and control signals (actions). This allows MLLMs to process both modalities in a coherent, text-like manner, enabling multi-round conversations and flexible handling of unfixed frame lengths.
  • ADriver-I World Model: Construction of a general world model for autonomous driving, named ADriver-I, based on the integration of an MLLM and a Video Diffusion Model (VDM). This model unifies two critical tasks: control signal prediction and future scene generation.
  • Infinite Driving Capability: ADriver-I demonstrates the capability of "infinite driving" in a self-created world. By iteratively predicting the current control signal, generating the next visual frame based on that action, and then using the generated frame as input for the next control prediction, the model can simulate continuous driving without external visual input beyond an initial sequence.
  • Strong Performance on Benchmarks: Extensive experiments on the nuScenes dataset and a large-scale private dataset show impressive performance in both control signal prediction (e.g., L1 error of 0.072 m/s for speed and 0.091 rad for steer angle on nuScenes) and future scene generation (e.g., FID of 5.5 and FVD of 97.0 on nuScenes), outperforming several constructed baselines.
  • New Insights for Embodied Intelligence: The authors hope that ADriver-I provides new insights for the future of autonomous driving and the broader field of embodied intelligence, particularly in leveraging generative models and large language models for complex real-world tasks.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand ADriver-I, a reader should be familiar with the following core concepts:

  • Autonomous Driving (AD) System Architectures:
    • Modular Design: The traditional approach, separating AD into distinct, sequential components:
      • Perception: Processing raw sensor data (cameras, LiDAR, radar) to detect, classify, and track objects (vehicles, pedestrians, lanes) and understand the environment.
      • Prediction: Forecasting the future behavior and trajectories of dynamic objects in the scene.
      • Planning: Determining the optimal, collision-free path and speed for the ego-vehicle, often represented as waypoints.
      • Control: Translating the planned trajectory into low-level executable commands for the vehicle's actuators (steering angle, throttle, brake).
    • End-to-End Driving: An alternative approach where raw sensor inputs are directly mapped to control outputs (or planning commands) without explicit intermediate modules. This mimics human driving more closely.
  • Multimodal Large Language Models (MLLMs):
    • Large Language Models (LLMs): Powerful neural networks trained on vast amounts of text data to understand, generate, and reason with human language. Examples include GPT series, LLaMA, Vicuna. They excel at tasks like text completion, summarization, and question answering.
    • Multimodal: The ability to process and integrate information from multiple modalities, such as text, images, and video. MLLMs extend LLMs by incorporating visual encoders, allowing them to comprehend and generate responses based on both text and visual inputs. They align features from different modalities into a common embedding space.
  • Diffusion Models:
    • A class of generative models that learn to reverse a diffusion process. In the forward diffusion process, noise is gradually added to data (e.g., an image) until it becomes pure noise. The reverse diffusion process learns to denoise the data, step-by-step, to generate new, realistic samples from noise. They have shown remarkable success in generating high-quality images and videos.
    • Latent Diffusion Models (LDMs): A specific type of diffusion model that performs the diffusion process in a compressed latent space rather than the pixel space. This makes them computationally more efficient while maintaining high generation quality. Stable Diffusion is a prominent example.
  • World Models:
    • In the context of reinforcement learning and robotics, a world model is a learned representation of the environment that allows an agent to predict future states or outcomes given its current state and a proposed action. It enables agents to plan and learn within a simulated environment, often referred to as mental simulation. The paper extends this concept to autonomous driving, where the "world" includes both future visual scenes and corresponding actions.
  • Vision-Action Pairs: A fundamental concept in embodied AI, where a visual observation (image/video) is explicitly linked with an action taken by an agent. This pairing is crucial for learning policies that map observations to actions. The paper introduces an interleaved variant of this concept.

3.2. Previous Works

The paper contextualizes ADriver-I by discussing advancements in MLLMs, end-to-end autonomous driving, generative models for AD, and existing world models.

3.2.1. Multimodal Large Language Models (MLLMs) and Embodied Intelligence

  • General MLLMs: Models like LLaVA [31, 32], miniGPT4 [67], and BLIP-2 [29] integrate image and text to achieve cross-modal understanding, enabling dialogue with images at different fine-grain levels. They typically use a vision encoder (e.g., CLIP-ViT) to extract visual features and a visual adapter (e.g., MLP) to align these features with the LLM's language embeddings.
  • Embodied Intelligence with MLLMs: MLLMs are increasingly applied to embodied intelligence, where agents learn to act in physical or simulated environments.
    • VIMA [24]: Utilizes Mask R-CNN [16] to extract object regions, which are then combined with text descriptions and fed into a transformer to predict motor actions.
    • VoxPoser [22]: Combines pretrained MLLMs and LLMs to generate a value map for motion planning, eliminating the need for additional training.
    • PaLM-E [10]: A large embodied multimodal language model (VLA model) that can perform various tasks like motion planning, tabletop manipulation, and image description by combining PaLM [7] and ViT-22B [8].
    • RT-2 [3]: Proposes vision-language-action (VLA) models for robot control, directly outputting low-level control signals by leveraging web knowledge.
  • Concurrent Work - DriveGPT4 [63]: This MLLM-based approach, concurrent with ADriver-I, takes video and text as input to output control signals and provides reasons for interpretability. ADriver-I's distinction lies in its unified control and future generation, and its 'infinite driving' capability.

3.2.2. End-to-End Autonomous Driving

  • Encoder-Decoder Paradigms: Many state-of-the-art end-to-end methods use an encoder-decoder structure to extract information from raw sensor data and predict planning results.
    • Transfuser [40] and TCP [61]: Directly predict waypoints (planning results) without explicit scene representations.
    • ST-P3 [20]: Builds a dense cost map from semantic maps and uses hand-crafted rules for optimal trajectory planning.
    • UniAD [21]: Integrates diverse scene representations (segmentation maps, motion flow maps, BEV occupancy maps) hierarchically.
    • VAD [23]: Employs a fully vectorized approach using vectorized agent motion and map representations, reducing computational intensity. ADriver-I differs by predicting low-level control signals directly and generating future scenes, rather than just planning waypoints.

3.2.3. Generative Models for Autonomous Driving

Scene generation is crucial for data augmentation, simulation, and understanding.

  • Generative Adversarial Networks (GANs) [12] and Variational Auto-Encoders (VAEs) [27, 55]: Earlier generative models.
    • DriveGAN [26]: Predicts future driving videos by correlating actions with pixel changes.
    • BEVGen [51]: Built on VQ-VAE [55], it generates multi-view images from Bird's Eye View (BEV) layouts.
    • BEVControl [64]: Generates foreground and background in street view images, allowing sketch-style inputs.
  • Diffusion Models: More recent and powerful generative models.
    • Video Latent Diffusion Model [2]: Builds upon Stable Diffusion [46] to synthesize high-resolution video, demonstrating impressive generation quality. This is a direct inspiration for ADriver-I's VDM component.
    • Panacea [1]: A layout-conditioned video generation system for diversifying training data for perception models. ADriver-I leverages diffusion models for future scene generation, but integrates it with control signal prediction, making the generation action-conditioned.

3.2.4. World Models for Autonomous Driving

World models are central to ADriver-I. The paper distinguishes two definitions:

  1. Pure Future Prediction: Models that only predict future states.
  2. Unifying Action Prediction and Future Generation: Models that predict both actions and their consequences in the environment. ADriver-I falls into this category.
  • Reinforcement Learning/Robotics World Models: In these fields [13, 14, 25, 49, 60, 62], world models predict environmental responses to agent actions using various data types (RGB, depth, point clouds).
  • AD-specific World Models:
    • GAIA-1 [19]: A generative world model that takes video, text, and action as inputs to generate realistic driving scenarios. Differentiation: ADriver-I notes that GAIA-1 primarily focuses on scenario generation, "ignoring the control signal prediction."
    • DriveDreamer [58]: Another world model that generates future driving scenarios and predicts control signals. Differentiation: DriveDreamer "relies heavily on rich prior information, like high-definition (HD) maps and 3D bounding boxes, for future generation." ADriver-I aims to eliminate the need for such extensive prior information for future generation.

3.3. Differentiation Analysis

ADriver-I distinguishes itself from previous work through several key innovations:

  • Unified Framework: Unlike DriveGPT4 which focuses on control prediction and interpretability, or GAIA-1 which is primarily a scenario generator, ADriver-I explicitly unifies control signal prediction and future scene generation within a single, coherent framework.
  • Minimal Prior Information for Generation: Unlike DriveDreamer which requires extensive HD maps and 3D bounding boxes for future generation, ADriver-I aims to achieve robust future scene generation with minimal prior information, primarily relying on historical frames and predicted control signals.
  • Interleaved Vision-Action Pair: This novel concept is central to ADriver-I, providing a unified format for handling both visual input and control signals within the MLLM, facilitating flexible multi-round reasoning.
  • Infinite Driving Concept: ADriver-I is presented as the first to introduce and demonstrate the concept of infinite driving where the model recursively generates its own future world and drives within it. This self-contained simulation loop is a significant conceptual advance for embodied AI in AD.
  • MLLM + VDM Integration: The specific integration of a general-purpose MLLM (for reasoning and action prediction) with a specialized VDM (for high-fidelity, action-conditioned future scene generation) provides a powerful combination.

4. Methodology

4.1. Principles

The core idea behind ADriver-I is to create a general world model for autonomous driving that mimics a human driver's ability to both take actions based on visual input and anticipate the future consequences of those actions. This is achieved by tightly integrating a Multimodal Large Language Model (MLLM) with a Video Diffusion Model (VDM). The interleaved vision-action pair concept is the glue that unifies visual information and control signals, allowing the MLLM to reason over both, and enabling the VDM to generate future scenes conditioned on predicted actions. The system is designed to be autoregressive, meaning it uses its own predictions (actions and future frames) to drive subsequent predictions, leading to the "infinite driving" capability.

4.2. Core Methodology In-depth (Layer by Layer)

The ADriver-I framework is composed of two main components: a Multi-modal Large Language Model (MLLM) for control signal prediction and a Video Diffusion Model (VDM) for future scene generation. These components work in a synergistic loop.

4.2.1. Overall Architecture and Pipeline

The overall framework of ADriver-I is illustrated in Figure 1 of the original paper.

The general workflow is as follows:

  1. Input: The system receives a sequence of historical interleaved vision-action pairs $\{I, A\}$ and the current visual frame $I_t$. An interleaved vision-action pair consists of a visual frame $I_k$ and its corresponding low-level control signal $A_k$.

  2. Control Signal Prediction (MLLM): The MLLM processes these inputs to autoregressively predict the low-level control signal $A_t$ for the current frame.

  3. Future Frame Generation (VDM): The predicted control signal $A_t$, along with the historical vision-action pairs, is then fed into the VDM. The VDM generates a sequence of future frames $\{I_{t+1}', \dots, I_{t+4}'\}$.

  4. Infinite Driving Loop: The crucial aspect for "infinite driving" is that the generated next frame $I_{t+1}'$ becomes the "current frame" for the next timestep $t+1$. This generated frame is paired with the action $A_t$ (which produced it) to form a new historical vision-action pair. This new pair, together with the newly generated current frame $I_{t+1}'$, is fed back into the MLLM to predict the next control signal $A_{t+1}'$. The process can be repeated indefinitely (a code-level sketch of this loop is given after Figure 1).

    The following figure (Figure 1 from the original paper) shows the system architecture:

    Figure 1 (from the original paper): Overview of the ADriver-I framework. It takes the historical interleaved vision-action pairs $\{I, A\}$ and the current visual tokens as inputs. The multimodal large language model (MLLM) reasons out the control signal $A_t$ of the current frame, and the video diffusion model then predicts the next frame $I_{t+1}'$; this process can be iterated indefinitely.
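To make the loop concrete, below is a minimal Python sketch of the inference cycle described above. It is an illustration, not the authors' implementation: `mllm_predict_action` and `vdm_generate_future` are hypothetical stand-ins for the MLLM and VDM components, which in the real system operate on tokenized prompts and latent video representations.

```python
from collections import deque

def infinite_driving(initial_pairs, current_frame,
                     mllm_predict_action, vdm_generate_future, steps=10):
    """Minimal sketch of ADriver-I's autoregressive control/generation loop.

    initial_pairs:        list of (frame, action) tuples, e.g. the three
                          historical vision-action pairs used to bootstrap.
    current_frame:        the current visual frame I_t.
    mllm_predict_action:  callable(history, frame) -> action    (MLLM stand-in)
    vdm_generate_future:  callable(history, action) -> [frames] (VDM stand-in)
    """
    history = deque(initial_pairs, maxlen=len(initial_pairs))
    trajectory = []

    for _ in range(steps):
        # 1) The MLLM predicts the low-level control signal for the current frame.
        action = mllm_predict_action(list(history), current_frame)

        # 2) The VDM generates future frames conditioned on history and the action.
        future_frames = vdm_generate_future(list(history), action)

        # 3) The (frame, action) pair joins the history, and the first generated
        #    frame becomes the "current" frame of the next timestep.
        history.append((current_frame, action))
        trajectory.append((current_frame, action))
        current_frame = future_frames[0]

    return trajectory
```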

4.2.2. Multi-modal Large Language Model (MLLM)

The MLLM is responsible for interpreting visual inputs and predicting control signals. It consists of three main modules:

  • Pre-trained Large Language Model (LLM): Vicuna-7B-1.5 [5] is used as the base LLM. Vicuna is itself fine-tuned on LLaMA2 [53], providing strong language comprehension and generation capabilities.
  • Visual Encoder: CLIP-ViT-Large [44] is employed to extract features from the visual input (image frames). This encoder is pre-trained on a massive dataset of image-text pairs, enabling it to produce rich visual representations.
  • Visual Adapter: Two Multi-Layer Perceptron (MLP) layers serve as the visual adapter. These layers are pre-trained by LLaVA-7B-1.5 [31] to align the visual features extracted by the encoder with the language feature space of the LLM. This alignment is critical for the LLM to understand and reason about visual inputs in a multimodal context.

Prompt Construction for MLLM: The paper introduces a specific method to construct prompts for the MLLM, making visual and control information consumable by the LLM.

  • Low-level Control Signal Conversion: The raw numerical low-level control signals (e.g., speed in m/s, steer_angle in radians) are converted into text-like expressions. For example, a speed of 3.63 m/s might be represented as "The speed is <num_start> 3.63 <num_end>". The tokens <num_start> and <num_end> are special tokens to delimit numerical values, facilitating their processing by the LLM.
  • Visual Tokens: Each video frame $I_k$ is first processed by the CLIP-ViT-Large visual encoder and then by the visual adapter to produce visual tokens <TOKEN_img>.
  • Action Tokens: The text-converted control signals are processed by a language tokenizer to obtain action tokens <TOKEN_act>.
  • Interleaved Vision-Action Pair: A visual token <TOKEN_img> for a frame $I_k$ is paired with its corresponding action token <TOKEN_act> for action $A_k$. This forms an interleaved vision-action pair.
    • Advantages of Interleaved Vision-Action Pair:
      1. Flexibility for Multi-Round Conversation: It enables flexible multi-round conversations with unfixed frame lengths, which is crucial for the autoregressive nature of the world model.
      2. Unified Embedding Space: It unifies future generation and action prediction within the word embedding space of the LLM, allowing a cohesive reasoning process.
  • System Prompt: A system prompt (<SYS>) is included to set the context and guide the MLLM's reasoning mode. This prompt helps the MLLM understand its role as an autonomous driving agent.
  • Conversation Structure: The overall conversation for the MLLM follows the pattern: Human: <SYS> <TOKEN_img> <TOKEN_act> <STOP>  Agent: <TOKEN_act> <STOP>. Here, <STOP> is a special token indicating the end of a turn or generation. The MLLM, acting as the Agent, is prompted to output the action tokens. A sketch of how such a prompt can be assembled is shown below.
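As a rough illustration of this prompt format, here is a small Python sketch that assembles an interleaved conversation as plain text. The templates and helper names are our own simplifications; in the actual model the <TOKEN_img> placeholders correspond to visual embeddings from the CLIP encoder and adapter, not literal text.

```python
def control_to_text(speed_mps: float, steer_rad: float) -> str:
    """Wrap numeric control signals in the <num_start>/<num_end> delimiters."""
    return (f"The speed is <num_start> {speed_mps:.2f} <num_end> m/s and "
            f"the steer angle is <num_start> {steer_rad:.2f} <num_end> rad.")

def build_conversation(system_prompt: str, history_pairs) -> str:
    """Assemble an interleaved vision-action conversation (illustrative only).

    history_pairs: list of (image_placeholder, (speed, steer)) tuples; the
                   final entry is the current frame, whose action the Agent
                   is asked to predict.
    """
    lines = [f"<SYS> {system_prompt}"]
    *past, (current_img, _) = history_pairs
    for img, (speed, steer) in past:
        lines.append(f"Human: {img} {control_to_text(speed, steer)} <STOP>")
    lines.append(f"Human: {current_img} Predict the control signal of this frame. <STOP>")
    lines.append("Agent:")  # the MLLM fills in <TOKEN_act> ... <STOP>
    return "\n".join(lines)

# Example usage with three historical pairs plus the current frame.
pairs = [("<TOKEN_img>", (3.55, -0.02)),
         ("<TOKEN_img>", (3.60, -0.01)),
         ("<TOKEN_img>", (3.63, -0.01)),
         ("<TOKEN_img>", (None, None))]   # current frame, action unknown
print(build_conversation("You are a driving agent controlling the ego vehicle.", pairs))
```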

4.2.3. Video Diffusion Model (VDM)

The VDM is responsible for generating realistic future frames based on historical context and predicted actions.

  • Base Model: The VDM is built upon a latent diffusion model [47] designed for video generation. Specifically, it inherits the architecture of Stable Diffusion 2.1 [46] and incorporates temporal-awareness modules similar to those in the video latent diffusion model [2]. Temporal-awareness modules are crucial for generating coherent sequences of frames, ensuring continuity and realistic motion.
  • History-Conditioned Functionality: The VDM is enriched with the ability to condition its generation on historical frames. This is achieved by integrating a reference-video control mechanism, which concatenates given historical frames with the diffusion input. This allows the VDM to maintain context and consistency with the past.
  • Action-Guided Future Generation: Text condition modules from the base diffusion model are retained. These modules enable the VDM to receive the predicted control signals (converted into natural-language prompts) as conditions for future frame generation, so the MLLM's predicted actions directly influence how the future scene unfolds visually (see the sketch after this list).
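The conditioning described above can be pictured with a small tensor-level sketch. This is an assumption-laden illustration (the paper does not specify the exact tensor layout): historical frame latents are concatenated with the noisy latents being denoised, while the motion prompt enters separately as a text embedding for cross-attention.

```python
import torch

def assemble_vdm_inputs(history_latents, noisy_future_latents, motion_text_embedding):
    """Hypothetical sketch of the VDM's conditioning inputs.

    history_latents:       (B, T_hist, C, h, w) latents of the given historical frames.
    noisy_future_latents:  (B, T_future, C, h, w) latents currently being denoised.
    motion_text_embedding: (B, L, D) encoding of the GPT-3.5 motion prompt.

    We assume concatenation along the temporal axis; channel-wise concatenation
    would be an equally plausible reading of "concatenates historical frames
    with the diffusion input".
    """
    video_input = torch.cat([history_latents, noisy_future_latents], dim=1)
    return video_input, motion_text_embedding  # text is consumed via cross-attention

# Shapes only, to show the bookkeeping:
hist = torch.randn(1, 4, 4, 32, 64)     # 4 historical frame latents
noisy = torch.randn(1, 4, 4, 32, 64)    # 4 future-frame latents being denoised
text = torch.randn(1, 77, 1024)         # motion prompt embedding
video_in, text_cond = assemble_vdm_inputs(hist, noisy, text)
print(video_in.shape)                   # torch.Size([1, 8, 4, 32, 64])
```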

Prompt Construction for VDM: Since the VDM's text encoder is not as sophisticated in reasoning as the MLLM, it struggles with direct interpretation of numerical control signals (e.g., understanding that a negative steer angle means turning right).

  • GPT3.5 for Motion Description: To address this, the authors utilize GPT3.5 [36] (a powerful LLM) to convert the low-level numerical control signals (speed, steer angle) into a descriptive motion prompt. This makes the condition more semantically understandable for the VDM's text encoder.

  • Example Motion Prompt: As shown in Figure 3, control signals from consecutive frames are given to GPT3.5. GPT3.5 is instructed to output common driving states like "steady speed," "accelerating," "decelerating," and "turning." For example, given specific speed and steer angle values, GPT3.5 might output: "The vehicle is moving at a moderate speed, with a slight right turn in progress. The steering angle remains consistent, indicating a gradual turn. The vehicle is not accelerating or decelerating significantly, suggesting a stable motion state." This descriptive text then conditions the VDM.

    The following figure (Figure 3 from the original paper) illustrates an example conversation used to guide GPT3.5 to generate the corresponding motion prompt; a code-level sketch of this conversion step follows below.

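A rough sketch of this conversion step, assuming the OpenAI Python client and the gpt-3.5-turbo chat endpoint; the system instruction below is a paraphrase of the paper's description, not the authors' exact prompt.

```python
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def motion_prompt_from_signals(signals):
    """Ask GPT-3.5 to summarise consecutive control signals as a motion description.

    signals: list of (speed_mps, steer_rad) tuples for consecutive frames.
    """
    listing = "\n".join(
        f"frame {i}: speed = {s:.2f} m/s, steer_angle = {a:.3f} rad"
        for i, (s, a) in enumerate(signals)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": ("You describe a vehicle's motion state for a video "
                         "generation model. Summarise the signals in a short "
                         "paragraph using terms such as steady speed, "
                         "accelerating, decelerating, turning left or right.")},
            {"role": "user", "content": listing},
        ],
    )
    return response.choices[0].message.content

# e.g. motion_prompt_from_signals([(5.1, -0.04), (5.2, -0.05), (5.2, -0.05)])
```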

4.2.4. Model Training

The MLLM and VDM components are trained separately and then merged for inference.

  • Training for MLLM:
    • Datasets: Pretrained on a large-scale private dataset (approx. 1.4 million vision-action pairs) focusing on highway scenarios. For supervised finetuning (SFT), it is trained on both the nuScenes dataset and the private dataset.
    • Pretraining Phase: During pretraining, the LLM model (Vicuna-7B-1.5) is frozen. Only the parameters of the vision encoder (CLIP-ViT-Large) and the visual adapter layers are updated. This aligns visual features with the frozen LLM's language space.
    • Supervised Finetuning (SFT) Phase: During SFT, the vision encoder is frozen, and the rest of the model (LLM and visual adapter) is trained. This allows the LLM to learn to predict control signals based on the aligned multimodal inputs.
    • Loss Function: The MLLM outputs text (action tokens), so cross-entropy loss is used for supervision, similar to how LLaVA [32] is trained.
    • Hyperparameters: Trained for 2 epochs for both pretraining and SFT, with a batch size of 16. Input image size is $336 \times 336$. Control signals are multiplied by 1000 to convert them to integers, simplifying convergence for the LLM. The AdamW optimizer is used with a learning rate of $2 \times 10^{-5}$.
  • Training for VDM:
    • Datasets: Pretrained on the 1.4 million private dataset. Finetuned on the nuScenes dataset (approx. 23,000 video samples).
    • Pretraining Phase: The VDM inherits weights from Stable Diffusion [45] and is then pretrained on the private dataset for 40,000 steps with a batch size of 128.
    • Finetuning Phase: The VDM is finetuned on the nuScenes dataset for an additional 40,000 steps with a reduced batch size of 32.
    • Hyperparameters: Spatial resolution is $256 \times 512$, and video length is 8 frames for both pretraining and finetuning. Learning rates are $4 \times 10^{-4}$ for pretraining and $3.2 \times 10^{-4}$ for finetuning. During inference, the DDIM [50] sampler is used with 50 sampling steps. These settings are collected in the configuration sketch below.
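For convenience, the reported hyperparameters can be gathered into a small configuration sketch. The numeric values come from the paper; the dataclass structure and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class MLLMTrainConfig:
    epochs: int = 2                   # both pretraining and SFT
    batch_size: int = 16
    image_size: int = 336             # 336 x 336 input resolution
    control_signal_scale: int = 1000  # signals multiplied to integers for the LLM
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5

@dataclass
class VDMTrainConfig:
    pretrain_steps: int = 40_000      # on the 1.4M-pair private dataset
    pretrain_batch_size: int = 128
    pretrain_lr: float = 4e-4
    finetune_steps: int = 40_000      # on nuScenes (~23K video samples)
    finetune_batch_size: int = 32
    finetune_lr: float = 3.2e-4
    resolution: tuple = (256, 512)    # spatial resolution (H, W)
    video_length: int = 8             # frames per sample
    sampler: str = "DDIM"
    sampling_steps: int = 50          # at inference time
```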

5. Experimental Setup

5.1. Datasets

The experiments for ADriver-I are conducted on two main datasets:

  • nuScenes Dataset:
    • Source: A publicly available, large-scale dataset for autonomous driving research.
    • Characteristics: It provides a comprehensive set of sensor data, including 6 cameras, 5 radars, and 1 LiDAR, covering a 360-degree field of view. It contains 1000 scenes, each approximately 20 seconds long, with extensive annotations for 23 object classes and map layers.
    • Domain: Diverse urban driving scenarios in Boston and Singapore.
    • Usage in ADriver-I: Used for supervised finetuning of the MLLM and finetuning of the VDM (approx. 23K video samples).
    • Why chosen: It is a standard benchmark in autonomous driving, allowing for comparison with other methods.
  • Large-scale Private Dataset:
    • Source: An internal dataset collected by the authors' organization (MEGVII Technology).
    • Scale: Contains nearly 1.4 million vision-action pairs.
    • Characteristics: Primarily focuses on highway scenarios. This implies less variance in steering angles and speeds compared to complex urban environments.
    • Usage in ADriver-I: Used for pretraining both the MLLM and the VDM. Also used for supervised finetuning of the MLLM.
    • Why chosen: Its large scale is crucial for pretraining large models, and its focus on highway scenarios helps the model learn stable driving behaviors.

5.2. Evaluation Metrics

The paper employs different evaluation metrics for control signal prediction and future scene generation.

5.2.1. Control Signal Prediction Metrics

  • L1 Error: This metric calculates the mean absolute error between the predicted control signals and the ground truth. It measures the average magnitude of errors.

    • Conceptual Definition: The L1 error, or Mean Absolute Error (MAE), quantifies the average absolute difference between the predicted values and the true values. It is a robust measure that is less sensitive to outliers compared to L2 error (RMSE). For control signals, a lower L1 error indicates higher prediction accuracy.
    • Mathematical Formula: Not explicitly provided in the text, but the common definition of L1 error (MAE) is: $ \mathrm{L1} = \frac{1}{N} \sum_{i=1}^{N} | \hat{x}_i - x_i | $
    • Symbol Explanation:
      • $N$: The total number of samples (predictions).
      • $\hat{x}_i$: The predicted value for the $i$-th sample (e.g., predicted speed or steer angle).
      • $x_i$: The ground truth value for the $i$-th sample (e.g., actual speed or steer angle).
      • | \cdot |: The absolute value function.
  • Accuracy with different thresholds ($A_\theta$): This metric measures the proportion of predictions that fall within a specified absolute error threshold $\theta$ of the ground truth (both metrics are implemented in the short code sketch at the end of this subsection).

    • Conceptual Definition: $A_\theta$ assesses how often the model's prediction is "close enough" to the ground truth, where "close enough" is defined by the threshold $\theta$. A higher $A_\theta$ indicates better performance for a given tolerance level.
    • Mathematical Formula: The paper defines the indicator $F(\hat{x}_i, x_i, \theta)$ and the accuracy $A_\theta$ as: $ F(\hat{x}_i, x_i, \theta) = \begin{cases} 1, & |\hat{x}_i - x_i| \le \theta \\ 0, & |\hat{x}_i - x_i| > \theta \end{cases} \qquad A_\theta = \frac{1}{N} \sum_{i=1}^{N} F(\hat{x}_i, x_i, \theta) $
    • Symbol Explanation:
      • $F(\hat{x}_i, x_i, \theta)$: An indicator function that returns 1 if the absolute difference between prediction and ground truth for sample $i$ is at most the threshold $\theta$, and 0 otherwise.
      • $\hat{x}_i$: The predicted value for the $i$-th sample.
      • $x_i$: The ground truth value for the $i$-th sample.
      • $\theta$: The predefined accuracy threshold. The paper uses $\theta \in \{0.01, 0.03, 0.05, 0.07\}$.
      • $N$: The total number of validation samples.
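A minimal NumPy implementation of the two metrics defined above (straightforward from the formulas; not the authors' evaluation code):

```python
import numpy as np

def l1_error(pred, gt):
    """Mean absolute error between predicted and ground-truth control signals."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt)))

def threshold_accuracy(pred, gt, theta):
    """A_theta: fraction of predictions within an absolute error of theta."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt) <= theta))

# Example with made-up speed predictions, evaluated at the paper's thresholds.
pred_speed = [3.61, 3.70, 3.55, 3.68]
gt_speed   = [3.63, 3.68, 3.60, 3.62]
print("L1:", l1_error(pred_speed, gt_speed))
print({t: threshold_accuracy(pred_speed, gt_speed, t) for t in (0.01, 0.03, 0.05, 0.07)})
```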

5.2.2. Future Scene Generation Metrics

  • Frechet Inception Distance (FID) [17]:
    • Conceptual Definition: FID measures the similarity between the distribution of generated images and the distribution of real images. It computes the Fréchet distance between two Gaussian distributions fitted to feature representations of real and generated images (extracted from an Inception-v3 network). Lower FID scores indicate higher quality and realism of generated images.
    • Mathematical Formula: The formula for FID is: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}) $
    • Symbol Explanation:
      • $\mu_1$: The mean feature vector of real images.
      • $\mu_2$: The mean feature vector of generated images.
      • $\Sigma_1$: The covariance matrix of features for real images.
      • $\Sigma_2$: The covariance matrix of features for generated images.
      • $||\cdot||^2$: The squared L2 norm (Euclidean distance).
      • $\mathrm{Tr}(\cdot)$: The trace of a matrix.
      • $(\Sigma_1\Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
  • Frechet Video Distance (FVD) [54]:
    • Conceptual Definition: FVD is an extension of FID to video sequences. It assesses the visual quality and temporal consistency of generated videos by comparing their feature distributions (extracted from a pre-trained video classification network, often based on Inception-v3 or similar architectures capable of processing video) against real videos. Lower FVD scores indicate more realistic and temporally coherent generated videos.
    • Mathematical Formula: FVD uses the same mathematical form as FID but applies it to video features. If $X_r$ represents real video features and $X_g$ represents generated video features, then: $ \mathrm{FVD} = ||\mu_r - \mu_g||^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}) $
    • Symbol Explanation:
      • $\mu_r$: The mean feature vector of real video sequences.
      • $\mu_g$: The mean feature vector of generated video sequences.
      • $\Sigma_r$: The covariance matrix of features for real video sequences.
      • $\Sigma_g$: The covariance matrix of features for generated video sequences.

    A minimal computation of this Fréchet distance from feature statistics is sketched below.
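Both scores reduce to a Fréchet distance between Gaussians fitted to feature statistics; only the feature extractor differs (Inception-v3 image features for FID, spatio-temporal video features for FVD). Here is a short NumPy/SciPy sketch of that distance, omitting the feature-extraction networks:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def fid_from_features(real_feats, fake_feats):
    """real_feats, fake_feats: (N, D) arrays of extracted features."""
    mu_r, sigma_r = real_feats.mean(0), np.cov(real_feats, rowvar=False)
    mu_f, sigma_f = fake_feats.mean(0), np.cov(fake_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_f, sigma_f)

# Example with random features (real use would take Inception / video-network features).
rng = np.random.default_rng(0)
print(fid_from_features(rng.normal(size=(512, 64)), rng.normal(size=(512, 64))))
```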

5.3. Baselines

For comparing the performance of control signal prediction, the authors constructed three competitive baselines:

  1. MLP-Only Baseline (a):
    • Architecture: This is the simplest baseline. It takes historical action sequences (control signals) from three preceding frames as input. These action sequences are then fed into three fully-connected layers (MLP) to predict the action of the current frame.
    • Representativeness: Represents a purely regression-based approach that relies only on temporal patterns of past actions, without any visual context.
  2. CNN-Based Baseline (b):
    • Architecture: Builds upon the MLP-Only baseline by incorporating visual information. It uses a Convolutional Neural Network (CNN) (e.g., ResNet [15]) to encode features from the input image. The image features are then processed by Global Average Pooling (GAP). These global image features are concatenated with the historical action features. The combined features are then used to predict the current action, likely through additional fully-connected layers.
    • Representativeness: Represents a common approach to combine visual features with other data streams for prediction tasks.
  3. ViT-Based Baseline (c):
    • Architecture: Similar to the CNN-Based baseline, but it replaces the traditional CNN with a Vision Transformer (ViT) (e.g., ViT-B [9]) for visual feature encoding. The rest of the architecture (concatenation with action features, prediction layers) remains the same as the CNN-Based baseline.

    • Representativeness: Represents a more modern approach to visual feature extraction, leveraging the power of transformers that have shown strong performance in vision tasks.

      The following figure (from the original paper) illustrates the three baseline architectures for context (the paper refers to the visual encoders simply as the CNN and ViT backbones); a code-level sketch of the CNN- and ViT-based variants follows below.

      (The figure compares three baseline architectures: (a) MLP-only, (b) CNN-based, and (c) ViT-based, showing how the action input A and the image features I are processed by the different modules.)
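A PyTorch-style sketch of baselines (b) and (c) as described: a visual backbone with global pooling, an MLP over the historical actions, and a fused prediction head. Layer widths, the ResNet-50/ViT-B-16 choices, and any other details not stated in the paper are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualActionBaseline(nn.Module):
    """Sketch of baselines (b)/(c): visual backbone + pooled features + action MLP."""

    def __init__(self, num_history=3, action_dim=2, hidden=256, backbone="resnet"):
        super().__init__()
        if backbone == "resnet":                      # baseline (b), CNN-based
            net = models.resnet50(weights=None)
            self.backbone = nn.Sequential(*list(net.children())[:-1])  # keep GAP, drop fc
            feat_dim = 2048
        else:                                         # baseline (c), ViT-based (224x224 input)
            self.backbone = models.vit_b_16(weights=None)
            self.backbone.heads = nn.Identity()       # use the class-token feature
            feat_dim = 768
        self.action_mlp = nn.Sequential(
            nn.Linear(num_history * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.head = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),            # predict (speed, steer_angle)
        )

    def forward(self, image, action_history):
        # image: (B, 3, H, W); action_history: (B, num_history * action_dim)
        feat = self.backbone(image).flatten(1)
        act = self.action_mlp(action_history)
        return self.head(torch.cat([feat, act], dim=1))

# Example: batch of 2 images plus three historical (speed, steer) pairs each.
model = VisualActionBaseline(backbone="resnet")
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 6))
print(out.shape)  # torch.Size([2, 2])
```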

6. Results & Analysis

6.1. Core Results Analysis

ADriver-I demonstrates impressive performance in both control signal prediction and future scene generation, outperforming constructed baselines and achieving competitive results against other generative AD models.

6.1.1. Control Signal Prediction

The quantitative results for control signal prediction are presented in Table 2, comparing ADriver-I with three baselines (MLP-Only, CNN-Based, ViT-Based) on both nuScenes and the private datasets.

The following are the results from Table 2 of the original paper:

| Method | Speed L1↓ (m/s) | Speed A0.01↑ | Speed A0.03↑ | Speed A0.05↑ | Speed A0.07↑ | Steer L1↓ (rad) | Steer A0.01↑ | Steer A0.03↑ | Steer A0.05↑ | Steer A0.07↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| MLP-Only | 0.122 | 0.189 | 0.275 | 0.361 | 0.440 | 0.101 | 0.183 | 0.482 | 0.641 | 0.715 |
| CNN-Based | 0.106 | 0.191 | 0.301 | 0.407 | 0.474 | 0.095 | 0.277 | 0.527 | 0.648 | 0.721 |
| ViT-Based | 0.103 | 0.200 | 0.326 | 0.438 | 0.489 | 0.092 | 0.299 | 0.546 | 0.656 | 0.724 |
| ADriver-I | 0.072 | 0.237 | 0.398 | 0.535 | 0.640 | 0.091 | 0.411 | 0.575 | 0.664 | 0.731 |
| ADriver-I† | 0.035 | 0.295 | 0.519 | 0.790 | 0.862 | 0.015 | 0.643 | 0.840 | 0.925 | 0.964 |

(† denotes results on the large-scale private dataset; the other rows are on nuScenes.)

Analysis:

  • Superiority of ADriver-I: On the nuScenes dataset, ADriver-I consistently outperforms all three baselines across almost all metrics for both speed and steer angle prediction.
    • For speed, ADriver-I achieves an L1 error of 0.072 m/s, significantly lower than the best baseline (ViT-Based at 0.103 m/s). Its accuracy metrics ($A_{0.01}$ to $A_{0.07}$) are also notably higher, especially at tighter thresholds (e.g., $A_{0.01}$ of 0.237 vs. 0.200 for ViT-Based).
    • For steer angle, ADriver-I achieves an L1 error of 0.091 rad, slightly better than ViT-Based (0.092 rad). However, its accuracy at very tight thresholds (e.g., $A_{0.01}$ of 0.411 vs. 0.299 for ViT-Based) is substantially higher. This suggests that while the mean error is similar to the best baseline in some cases, ADriver-I makes significantly more precise predictions. The authors attribute the similar L1 error for steer angle to "some special cases with large steer angle variance," implying ADriver-I handles the majority of cases better.
  • Impact of Visual Information: Comparing MLP-Only to CNN-Based and ViT-Based, there's a clear improvement when visual features are incorporated, highlighting the importance of understanding the scene for accurate control prediction.
  • Impact of Transformer-based Vision: ViT-Based generally outperforms CNN-Based, indicating the benefit of Vision Transformers for extracting rich visual cues in autonomous driving tasks.
  • Cross-Dataset Performance (ADriver-I†): The results for ADriver-I† (on the private dataset) show a dramatic improvement over the nuScenes numbers: L1 errors of 0.035 m/s for speed and 0.015 rad for steer angle.
    • Reasons for Improvement: The authors attribute this to two factors:
      1. The private dataset mainly focuses on highway scenarios, which typically involve smaller variance in speed and steer angles, making prediction easier.

      2. The private dataset used for supervised finetuning is much larger (1.4M vs. 23K samples on nuScenes), providing more extensive training data.

        Figure 4 provides qualitative visualizations of control signal prediction. The left side shows input frames from timestamp $t-3$ to $t$. The right side's bar chart displays the actions for $t-3$ to $t-1$, together with the prediction (pred) and ground truth (GT) at time $t$. This helps to visually confirm the model's accuracy.

The following figure (Figure 4 from the original paper) shows qualitative visualization of control signal prediction:

(The figure compares predicted and ground-truth speed and steer angle across time frames, alongside the corresponding scene images, illustrating the model's prediction accuracy and temporal consistency.)

6.1.2. Ablation Studies for Control Signal Prediction

The paper conducts ablation studies to analyze the impact of key design choices on control signal prediction.

Ablation on Encoding Methods for Control Signal (Table 3): The following are the results from Table 3 of the original paper:

| Number Embedding | Speed L1 (m/s) | Steer angle L1 (rad) |
|---|---|---|
| Num2English | 2.094 | 0.536 |
| Special Token | 0.094 | 0.106 |
| Relative Diff | 0.081 | 0.096 |
| Absolute Number | 0.072 | 0.091 |

Analysis:

  • Direct Absolute Number is Best: Directly predicting the absolute number of speed or steer angle yields the best performance (L1 errors of 0.072 m/s and 0.091 rad).
  • Ineffectiveness of Num2English: Translating numbers into English descriptions (Num2English) performs poorly, indicating that the LLM struggles to perform precise numerical regression when values are represented as natural language.
  • Benefits of Special Token and Relative Diff: Special Token (converting to classification bins) and Relative Diff (predicting differences from previous frames) perform better than Num2English but are still inferior to Absolute Number. This confirms that for precise numerical tasks, direct numerical prediction within the LLM's framework, facilitated by special tokens as done in ADriver-I's prompt construction, is most effective.

Ablation on Number of Decimal Places (Table 4): The following are the results from Table 4 of the original paper:

| Decimal places | Speed L1 (m/s) | Steer angle L1 (rad) |
|---|---|---|
| 0 | 0.212 | 0.099 |
| 1 | 0.094 | 0.093 |
| 2 | 0.073 | 0.091 |
| 3 | 0.072 | 0.091 |

Analysis:

  • Precision Matters: Using 0 or 1 decimal place significantly degrades performance, especially for speed prediction. This indicates that autonomous driving control signals require a certain level of precision.
  • Optimal Decimal Places: Two decimal places achieve almost identical performance to three decimal places (e.g., speed L1 error of 0.073 vs. 0.072, steer angle L1 error of 0.091 for both). This suggests that two decimal places offer a good balance between precision and potentially reducing the complexity of the output space for the LLM.

Ablation on Effectiveness of Multi-Round Conversations (Table 5): The following are the results from Table 5 of the original paper:

| Conversation | Speed L1 (m/s) | Steer angle L1 (rad) |
|---|---|---|
| Temporal Fusion | 0.078 | 0.092 |
| Single Round | 0.078 | 0.094 |
| Multi Round | 0.072 | 0.091 |

Analysis:

  • Multi-Round Conversation is Key: The Multi Round conversation strategy significantly outperforms Temporal Fusion and Single Round approaches, particularly for speed prediction (L1 error 0.072 vs. 0.078).
  • Benefit of Intermediate Supervision: Multi Round applies supervision to the intermediate action predictions ($A_{t-2}'$, $A_{t-1}'$) during training. This distributed supervision helps reduce the cumulative error that can build up when only the final prediction ($A_t'$) is supervised, leading to more robust action prediction during inference.
  • Comparison to Baselines: Temporal Fusion and Single Round are likely more akin to the baselines in how they process temporal information, and their performance is worse, further validating the multi-round approach of ADriver-I.

6.1.3. Future Scene Generation

  • Qualitative Results (Figure 6): The paper presents qualitative results for future frame prediction, showcasing the VDM's ability to generate realistic scenes for different driving maneuvers (e.g., turning left, turning right). The VDM successfully produces future scenes conditioned on historical frames and MLLM-predicted control signals, without needing high-level knowledge like 3D bounding boxes or HD maps. This demonstrates its effectiveness in creating plausible visual futures.

    The following figure (Figure 6 from the original paper) shows qualitative results for future prediction generated by the video diffusion model:

    (The figure shows a sequence of visual frames from a driving scene along a timeline from −1.5 s to +3.5 s; each frame is accompanied by bar charts of speed and steer angle at the corresponding time, illustrating the link between control signals and visual input.)

  • Quantitative Results (Table 6): The quality of generated future frames is quantitatively evaluated using FID and FVD metrics.

    The following are the results from Table 6 of the original paper:

    | Method | Input→Output | FID↓ | FVD↓ |
    |---|---|---|---|
    | DriveGAN | 1F→12F | 73.4 | 502.3 |
    | DriveDreamer | 1F+12B+12M→12F | 52.6 | 452.0 |
    | ADriver-I | 4F→4F | 5.5 | 97.0 |

Analysis:

  • Superiority of ADriver-I: ADriver-I achieves significantly lower (better) FID (5.5) and FVD (97.0) scores compared to DriveGAN (FID 73.4, FVD 502.3) and DriveDreamer (FID 52.6, FVD 452.0).
  • Efficiency in Input Priors: Notably, ADriver-I achieves this impressive performance by conditioning on 4 historical frames (4F → 4F) without relying on external 3D bounding boxes (B) or HD maps (M), which DriveDreamer heavily utilizes. This validates ADriver-I's claim of generating future scenes with minimal extensive prior information. The superior performance, despite fewer explicit priors, highlights the power of its integrated MLLM+VDM approach and the effectiveness of diffusion models for high-quality video generation.

6.2. Joint Control & Generation

The paper verifies the "infinite driving" capability of ADriver-I, where the model drives in its self-created world.

  • Recurrent Process: Given only three initial historical interleaved vision-action pairs, ADriver-I autonomously performs control signal prediction via the MLLM and future scene generation via the VDM in a recurrent loop.

  • Interdependence: As shown in Figure 7, all video frames are generated by the VDM, and the corresponding speed and steer angle predictions are produced by the MLLM. This visualization demonstrates a critical feedback loop: the predicted control signals directly influence the generated future scenes, and in turn, these generated future scenes provide the visual input that prompts ADriver-I to take subsequent actions. This continuous cycle effectively simulates autonomous driving in a dynamic, self-generated environment.

    The following figure (Figure 7 from the original paper) shows the joint control and generation in ADriver-I, illustrating the infinite driving process:

    (The figure shows a sequence of generated frames along a timeline from −1.5 s to +3.5 s; each frame is accompanied by bar charts of the predicted speed and steer angle at the corresponding time, illustrating the coupling between control signals and visual frames.)

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces ADriver-I, a novel world model for autonomous driving that unifies low-level control signal prediction and future scene generation. By leveraging a multimodal large language model (MLLM) and a video diffusion model (VDM) in an integrated framework, ADriver-I moves beyond traditional modular autonomous driving pipelines. The core innovation of interleaved vision-action pairs enables the MLLM to process multimodal information effectively. Experiments on nuScenes and a large private dataset demonstrate ADriver-I's impressive performance, outperforming strong baselines in control prediction and state-of-the-art methods in future scene generation, particularly without relying on extensive prior information like HD maps or 3D bounding boxes. Crucially, ADriver-I showcases the concept of "infinite driving" within a self-generated world, marking a significant step towards more holistic and human-like embodied intelligence in autonomous systems.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations and outline potential future research directions:

  1. Low-Quality Video Frames: The generation module (VDM), acting as a closed-loop simulator, might produce low-quality video frames, especially during rapid control signal changes. These low-quality frames could disturb the MLLM's control signal prediction in subsequent timesteps, leading to instability or errors in the infinite driving loop.
  2. Performance Gap for Deployment: Despite impressive results, the current performance of ADriver-I is still far from satisfactory for real-world deployment. The authors plan to update ADriver-I by training it with a much larger number of vision-control pairs from their continuously expanding private dataset, hoping to leverage scaling laws.
  3. Separate Training: The MLLM and VDM are trained separately, which prevents the system from benefiting from end-to-end optimization. A future goal is to develop a unified comprehension & generation framework that allows for joint, end-to-end training.
  4. Lack of Routing Information: For long-distance autonomous driving, the current model lacks routing information (e.g., navigation maps). Integrating a navigation map module is proposed to enable extended autonomous driving capabilities.

7.3. Personal Insights & Critique

This paper presents a truly fascinating and forward-looking approach to autonomous driving, moving beyond the traditional pipeline. The concept of "infinite driving" in a self-created world is a powerful demonstration of the potential of generative AI and MLLMs for embodied intelligence.

Inspirations and Applications:

  • Holistic Learning: ADriver-I inspires a shift from fragmented, task-specific models to more holistic systems that can perceive, act, and imagine, much like a human driver. This could lead to more robust and generalized autonomous systems.
  • Synthetic Data Generation: The VDM's ability to generate high-quality, action-conditioned future frames is invaluable for creating synthetic training data, especially for rare or dangerous scenarios, which could significantly accelerate AD development and testing.
  • Explainable AI: While not a primary focus, the MLLM's inherent language capabilities could potentially be extended to provide interpretability for its actions, similar to DriveGPT4. Understanding why an autonomous vehicle chose a particular action is crucial for trust and deployment.
  • Embodied AI Beyond Driving: The interleaved vision-action pair and the MLLM-VDM coupling could be a generalizable blueprint for other embodied AI tasks, such as robotics, virtual agents, or even human-computer interaction, where agents need to reason over multimodal inputs and predict outcomes of their actions.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Reality Gap: The "infinite driving" in a self-created world, while innovative, faces the inherent reality gap problem. The generated world might not perfectly reflect the complexities and unpredictable nuances of the real world. A critical challenge will be ensuring that policies learned in the self-generated world transfer effectively to reality. The authors acknowledge the VDM's potential for low-quality frames, which is a symptom of this.

  • Computational Cost: MLLMs and Diffusion Models are computationally expensive to train and run. The feasibility of deploying such a complex system in real-time on resource-constrained vehicle hardware remains a significant hurdle.

  • Safety and Robustness under Novelty: The current system's ability to handle novel, out-of-distribution scenarios (e.g., unexpected obstacles, adverse weather conditions not seen during training) is not explicitly evaluated. While MLLMs offer generalization, the reliance on a generated future raises questions about how it would react to truly unforeseen events.

  • Long-term Planning and High-Level Reasoning: While the system predicts "near future frames," long-term planning and high-level strategic reasoning (e.g., choosing a route, navigating complex intersections, interacting with human intent) are still implicitly handled or require external modules. The mention of routing information in limitations hints at this. Integrating more sophisticated planning directly into the MLLM could be a future direction.

  • Training Loop Stability: The separate training of MLLM and VDM, while practical, means the VDM might generate frames that are "difficult" for the MLLM to act upon, and vice-versa. An end-to-end trainable system could potentially learn to generate frames that are "easier" to act upon, or actions that lead to "easier-to-generate" futures, potentially creating a self-reinforcing but not necessarily optimal loop. A true joint optimization could lead to emergent behaviors but also carries risks of instability.

    Overall, ADriver-I represents a compelling step towards more unified and intelligent autonomous systems, pushing the boundaries of what is possible with generative AI and large foundation models in embodied intelligence. The identified limitations highlight key challenges that the field needs to address for practical deployment.
