Epona: Autoregressive Diffusion World Model for Autonomous Driving
TL;DR Summary
Epona is an autoregressive diffusion world model that decouples spatiotemporal factors and integrates trajectory planning, enabling long-horizon, high-res video generation with improved performance and reduced error accumulation in autonomous driving tasks.
Abstract
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at \href{https://github.com/Kevin-thu/Epona/}{https://github.com/Kevin-thu/Epona/}.
English Analysis
1. Bibliographic Information
- Title: Epona: Autoregressive Diffusion World Model for Autonomous Driving
- Authors: Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin.
- Affiliations: The authors are affiliated with a mix of industry research labs and universities, including Horizon Robotics, Tsinghua University, Peking University, Nanjing University, The Hong Kong University of Science and Technology, Nanyang Technological University, and Tencent Hunyuan. This collaboration between academia and industry often leads to research that is both theoretically sound and practically relevant.
- Journal/Conference: The paper is a preprint available on arXiv. As of this version, it has not yet been published in a peer-reviewed conference or journal, but venues like CVPR, ICCV, or NeurIPS would be typical targets for this kind of work.
- Publication Year: 2025. The first version was submitted to arXiv in June 2025.
- Abstract: The abstract introduces Epona, a new world model for autonomous driving that aims to solve the limitations of existing models. Current video diffusion models struggle with generating long, flexible-length videos and integrating trajectory planning. Epona addresses this by using an autoregressive approach combined with diffusion. Its key innovations are a decoupled spatiotemporal factorization (separating time dynamics from video rendering) and modular trajectory and video prediction. A novel chain-of-forward training strategy is also introduced to reduce error accumulation in long-term generation. The model achieves state-of-the-art performance, improving video quality (7.4% FVD improvement) and generating videos for minutes, far longer than prior work. It also functions as a high-performing, real-time motion planner.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2506.24113v1
- PDF Link: https://arxiv.org/pdf/2506.24113v1.pdf
2. Executive Summary
- Background & Motivation (Why): World models, which learn a simulation of the physical world, are a promising paradigm for autonomous driving. They allow an agent to "imagine" future scenarios and plan actions accordingly. Current approaches fall into two camps, each with significant drawbacks:
  - Video Diffusion Models (e.g., `Vista`): These models generate visually stunning, high-fidelity videos. However, they are designed to generate fixed-length clips and model the entire video at once. This makes them incapable of generating arbitrarily long videos and ill-suited for integrating flexible, real-time trajectory planning.
  - GPT-style Autoregressive Models (e.g., `GAIA-1`): These models excel at long-term, flexible-length generation by predicting one "token" at a time. However, to work with images, they must first quantize them into discrete tokens. This process severely degrades visual quality and reduces the precision needed for accurate motion planning.

  The paper identifies a clear gap: a need for a model that combines the visual quality of diffusion models with the temporal flexibility and planning capabilities of autoregressive models.
- Main Contributions / Findings (What): Epona is proposed as a unified framework to bridge this gap. Its main contributions are:
  - An Autoregressive Diffusion Architecture: Epona generates future video frames one at a time (autoregressively) but uses a diffusion model to render each frame. This preserves the high visual fidelity of diffusion models while enabling long-horizon, flexible-length video generation.
  - Decoupled Spatiotemporal Factorization: The model separates the task of understanding temporal dynamics from the task of generating high-quality visuals. A GPT-style transformer handles the temporal aspect in a compressed latent space, while dedicated diffusion models handle the spatial rendering (video frame) and trajectory generation.
  - Modular Trajectory and Video Prediction: Epona features two separate output heads that are trained jointly: one for predicting the next video frame and another for predicting a future trajectory. This modularity allows the model to function as a real-time motion planner by simply deactivating the computationally expensive video generation module.
  - Chain-of-Forward Training: To combat the classic problem of error accumulation in autoregressive models, the authors introduce a training strategy where the model is periodically forced to generate future frames based on its own (potentially noisy) previous predictions, rather than always using ground-truth data. This makes the model more robust during inference.
  - State-of-the-Art Performance: Epona achieves top results in both video generation (outperforming previous models in quality and generating videos up to 2 minutes long) and motion planning (outperforming strong baselines on the `NAVSIM` benchmark).
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- World Model: A neural network trained to learn a model of the world. By observing sensory data (like video), it builds an internal representation of how the world works, allowing it to simulate future events and plan actions.
- Diffusion Models: A class of generative models that learn to create data by reversing a noise-adding process. They start with random noise and iteratively "denoise" it, step-by-step, until a realistic sample (like an image) is formed. They are known for generating exceptionally high-quality images and videos.
- Autoregressive Models: Models that generate sequences one element at a time, where each new element is conditioned on all previously generated elements. Language models like GPT are classic examples. This sequential nature makes them naturally suited for generating variable-length sequences.
- Latent Space: A lower-dimensional, compressed representation of data. Instead of working with high-resolution images (millions of pixels), models often encode them into a compact latent space, which is computationally much more efficient. The results are then decoded back into pixel space.
- Rectified Flow: A specific type of diffusion model formulation that defines a straight path from noise to data. This simplifies the training objective and often leads to more efficient and higher-quality generation compared to traditional diffusion models.
- Diffusion Transformer (DiT): A modern architecture for diffusion models that replaces the commonly used U-Net backbone with a Transformer. Transformers are highly scalable and have proven to be very effective for large-scale generative modeling.
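The "straight path" idea behind Rectified Flow described above can be made concrete in a few lines. This is a generic, minimal sketch (not Epona's code): data and noise are connected by linear interpolation, and sampling follows a learned velocity field from pure noise back to data. `velocity_fn` is a placeholder for any trained velocity network.

```python
# Minimal illustration of rectified flow's straight noise-to-data path (sketch only).
import torch

def interpolate(x0: torch.Tensor, eps: torch.Tensor, s: float) -> torch.Tensor:
    """Point on the straight path between data x0 (s=0) and noise eps (s=1)."""
    return (1.0 - s) * x0 + s * eps

def sample(velocity_fn, shape, steps: int = 50) -> torch.Tensor:
    """Euler integration of the learned velocity field, starting from pure noise."""
    x = torch.randn(shape)
    ds = 1.0 / steps
    for i in range(steps):
        s = 1.0 - i * ds                  # walk s from 1 (noise) down to 0 (data)
        x = x - ds * velocity_fn(x, s)    # follow the predicted velocity toward the data end
    return x
```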
- Previous Works: The paper positions Epona against two dominant paradigms:
  - Video Diffusion-based World Models (`Vista`): These models take a sequence of past frames and generate a fixed-length sequence of future frames all at once. The paper argues this "global joint distribution" approach is flawed because it doesn't respect the step-by-step, causal nature of time, making long-term and flexible prediction impossible.
  - GPT-based World Models (`GAIA-1`, `DrivingWorld`): These models use an autoregressive transformer. They first convert images into a sequence of discrete tokens (like words in a sentence) using a `tokenizer`. Then, they predict the next token in the sequence. While this allows for long-term generation, the `tokenization` step is lossy and results in blurry, lower-quality videos and imprecise trajectories.
- Differentiation: Epona's key innovation is its hybrid approach, which is visualized in Figure 3.
  Figure 3 compares three world-modeling approaches: (top) conventional autoregressive methods that discretize images and predict the next token step by step; (middle) video diffusion methods that generate all future frames jointly; (bottom) the proposed method, which uses continuous tokenization and autoregressively predicts fine-grained future frames.
- Unlike GPT-style models, Epona operates in a continuous latent space, avoiding the quality loss from discrete tokenization.
- Unlike video diffusion models, Epona is autoregressive, predicting only the next frame at each step, which allows for flexible, long-horizon generation.
- Epona decouples planning and vision, allowing it to serve as a real-time planner, a feature lacking in other high-fidelity world models.
4. Methodology (Core Technology & Implementation)
The core of Epona is an autoregressive framework that predicts the future frame-by-frame. At each timestep $t$, given the history of observations $x_{1:t}$ and actions (ego poses) $p_{1:t}$, the model performs two tasks:
- Trajectory Planning: Predict a future trajectory over a horizon of $M$ steps: $\hat{p}_{t+1:t+M}$.
- Next-Frame Prediction: Predict the very next observation $\hat{x}_{t+1}$ conditioned on the history and the next action $p_{t+1}$.
This process is illustrated in the overall architecture diagram, which shows conditioning frames and historical trajectories passing through a variational autoencoder and a multimodal spatiotemporal transformer, followed by autoregressive prediction (with a chain-of-forward training strategy) of the next frame and the future trajectory. A per-step rollout sketch follows below.
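To make the per-step loop concrete, here is a minimal sketch of the autoregressive rollout. The module handles `dcae`, `mst`, `traj_dit`, and `vis_dit` are hypothetical interfaces that mirror the three components described in the next subsection; this illustrates the control flow only, not the authors' implementation.

```python
# Illustrative autoregressive rollout of an Epona-style world model (sketch only).
# For planner-only use, the vis_dit/dcae calls below can simply be skipped.

def rollout(dcae, mst, traj_dit, vis_dit, frames, poses, num_steps):
    """frames/poses: lists of conditioning observations and ego poses."""
    latents = [dcae.encode(f) for f in frames]           # compress conditioning frames
    video, plans = [], []
    for _ in range(num_steps):
        context = mst(latents, poses)                     # history -> context latent c_t
        plan = traj_dit.sample(context)                   # diffusion-sample a future trajectory
        plans.append(plan)
        next_action = plan[0]                             # first predicted waypoint as control
        next_latent = vis_dit.sample(context, next_action)  # diffusion-sample next frame latent
        video.append(dcae.decode(next_latent))
        latents.append(next_latent)                       # feed the prediction back as history
        poses.append(next_action)
    return video, plans
```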
The model consists of three main modules:
3.3. Epona: Autoregressive Diffusion World Model
- Multimodal Spatiotemporal Transformer (MST):
  - Purpose: To encode the entire history of visual observations and actions into a single, compact latent representation.
  - Process:
    - First, past video frames are compressed into latent representations using a `DCAE` encoder.
    - These visual latents and the historical trajectory waypoints are projected into an embedding space.
    - The MST uses interleaved `causal temporal attention` and `multimodal spatial attention` layers. `Causal temporal attention` looks back in time to model how the scene evolves. `Multimodal spatial attention` fuses information from the visual latents and action embeddings within each timestep. (A minimal sketch of this interleaving follows below.)
  - Output: The final output of the MST is a single context latent $c_t$, which encapsulates all the relevant historical context needed for future prediction.
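The following is an illustrative sketch (not the authors' code) of one such block: causal temporal attention operates along the time axis for each spatial token, and multimodal spatial attention mixes the visual and action tokens within each timestep. Shapes and layer choices are assumptions.

```python
# One MST-style block interleaving causal temporal and multimodal spatial attention (sketch).
import torch
import torch.nn as nn

class MSTBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, time, tokens_per_step, dim) -- visual + action tokens per frame
        b, t, n, d = tokens.shape

        # Causal temporal attention: each token attends only to its own past timesteps.
        x = tokens.permute(0, 2, 1, 3).reshape(b * n, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # mask future
        h, _ = self.temporal_attn(self.norm1(x), self.norm1(x), self.norm1(x),
                                  attn_mask=causal)
        x = (x + h).reshape(b, n, t, d).permute(0, 2, 1, 3)

        # Multimodal spatial attention: tokens within one timestep attend to each other.
        y = x.reshape(b * t, n, d)
        h, _ = self.spatial_attn(self.norm2(y), self.norm2(y), self.norm2(y))
        return (y + h).reshape(b, t, n, d)
```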
- Trajectory Planning Diffusion Transformer (TrajDiT):
  - Purpose: To generate a future 3-second trajectory based on the historical context $c_t$.
  - Architecture: It is a small `DiT` with a `Dual-Single-Stream` design. In the dual-stream phase, historical context and trajectory data are processed in parallel, linked by attention. In the single-stream phase, they are concatenated for deeper fusion.
  - Training: It is trained using a `Rectified Flow` objective. The model learns to predict the "velocity" required to transform random noise into a valid trajectory. The loss function is (see the training sketch after this list):
    $$\mathcal{L}_{\text{traj}} = \mathbb{E}_{\tau_0,\,\epsilon,\,s}\Big[\big\| v_\theta(\tau_s, s, c_t) - (\epsilon - \tau_0) \big\|_2^2\Big], \qquad \tau_s = (1 - s)\,\tau_0 + s\,\epsilon$$
    - $\tau_0$: The ground-truth future trajectory.
    - $\epsilon$: A random noise sample from a standard normal distribution.
    - $s$: A timestep from 0 to 1.
    - $\tau_s$: The noisy trajectory at time $s$.
    - $v_\theta$: The velocity prediction network (the TrajDiT).
    - $\epsilon - \tau_0$: The target velocity. The loss minimizes the difference between the predicted and target velocities.
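As a concrete illustration of the rectified-flow objective above, the sketch below shows one training step for the trajectory head. `traj_dit` is an assumed callable taking the noisy trajectory, the diffusion time, and the context latent; it is not the authors' interface.

```python
# One rectified-flow training step for a trajectory head (illustrative sketch).
import torch
import torch.nn.functional as F

def traj_rf_loss(traj_dit, context, gt_traj):
    """context: history latent c_t; gt_traj: ground-truth waypoints of shape (B, M, 2)."""
    eps = torch.randn_like(gt_traj)                                 # noise sample
    s = torch.rand(gt_traj.size(0), 1, 1, device=gt_traj.device)    # diffusion time in [0, 1]
    noisy = (1.0 - s) * gt_traj + s * eps                           # point on the linear path
    target_v = eps - gt_traj                                        # constant target velocity
    pred_v = traj_dit(noisy, s.squeeze(), context)                  # predicted velocity
    return F.mse_loss(pred_v, target_v)
```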
- Next-frame Prediction Diffusion Transformer (VisDiT):
  - Purpose: To generate the latent representation of the next video frame, $z_{t+1}$, conditioned on the history context $c_t$ and the next action $p_{t+1}$.
  - Architecture: Similar to `TrajDiT` but larger, and it includes an additional `modulation` branch to incorporate the action control signal.
  - Training: It is also trained with a `Rectified Flow` loss, aiming to denoise a random latent into the correct next-frame latent.
  - Total Loss: The entire model is trained end-to-end by summing the two losses: $\mathcal{L} = \mathcal{L}_{\text{traj}} + \mathcal{L}_{\text{vis}}$.
3.4. Chain-of-Forward Training
- Problem: During standard `teacher-forcing` training, the model always sees ground-truth past frames. During inference, it must rely on its own, imperfectly generated past frames. This mismatch leads to `autoregressive drift`, where errors accumulate and generation quality quickly degrades.
- Solution: The paper proposes `chain-of-forward` training. Periodically during the training process, the model performs several forward passes in a row, using its own predictions as input for the next step. (Figure 4 illustrates this: image latents or trajectories flow through the chain-of-forward step and the rectified-flow loss.)
  As shown in Figure 4, instead of only predicting $x_{t+1}$ from the ground-truth history $x_{1:t}$, the model also predicts $x_{t+2}$ from its own estimate $\hat{x}_{t+1}$. To make this efficient, it doesn't run the full diffusion sampling process. Instead, it estimates the final denoised next-frame latent in a single step from the noisy latent $z_s$ and the predicted velocity:
  $$\hat{z}_{t+1} = z_s - s \cdot v_\theta(z_s, s, c_t, p_{t+1})$$
  This one-step prediction, while noisy, simulates the kind of errors that occur during inference, teaching the model to be more robust. A minimal sketch follows below.
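Here is a minimal sketch of how such a chain-of-forward step could be organized. The `mst` and `vis_dit` handles and the list-based history are assumptions for illustration, not the authors' implementation; each chained step reuses the rectified-flow loss and then feeds the one-step denoised estimate back in as history.

```python
# Chain-of-forward training over k autoregressive steps (illustrative sketch).
import torch

def chain_of_forward(mst, vis_dit, latents, poses, gt_next_latents, next_actions, k=2):
    """latents/poses: lists of history frame latents and ego poses;
    gt_next_latents/next_actions: ground-truth targets for the next k steps."""
    total = 0.0
    for i in range(k):
        context = mst(latents, poses)
        z0 = gt_next_latents[i]
        eps = torch.randn_like(z0)
        s = torch.rand(z0.size(0), 1, 1, 1, device=z0.device)   # diffusion time in [0, 1]
        z_s = (1.0 - s) * z0 + s * eps                           # noisy latent on the linear path
        v = vis_dit(z_s, s.squeeze(), context, next_actions[i])  # predicted velocity
        total = total + torch.mean((v - (eps - z0)) ** 2)        # rectified-flow velocity loss
        z_hat = z_s - s * v                                      # single-step denoised estimate
        latents = latents[1:] + [z_hat]                          # replace history with own prediction
        poses = poses[1:] + [next_actions[i]]
    return total / k
```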
3.5. Temporal-aware DCAE Decoder
- Problem: The `DCAE` autoencoder used for compressing images is designed for single images and lacks temporal awareness. When used to decode video frames one by one, this can cause flickering and other temporal inconsistencies.
- Solution: The authors modify the `DCAE` decoder by adding spatiotemporal self-attention layers before the main decoding blocks. This allows the decoder to look at multiple latent frames simultaneously, enforcing temporal consistency and producing smoother videos. (A sketch of one possible wrapper follows below.)
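One way such a temporal module could be attached is sketched below: temporal self-attention over a short window of frame latents, followed by the usual frame-wise decoding. This is an illustrative wrapper with assumed shapes, not the paper's actual decoder modification.

```python
# Illustrative temporal-aware decoding wrapper (assumed design, not the authors' code).
import torch
import torch.nn as nn

class TemporalAwareDecoder(nn.Module):
    def __init__(self, frame_decoder: nn.Module, dim: int, heads: int = 8):
        super().__init__()
        self.frame_decoder = frame_decoder                    # pretrained per-frame decoder
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, time, height*width, dim) -- a short window of frame latents
        b, t, hw, d = latents.shape
        x = latents.permute(0, 2, 1, 3).reshape(b * hw, t, d)  # attend along time per location
        h, _ = self.temporal_attn(self.norm(x), self.norm(x), self.norm(x))
        x = (x + h).reshape(b, hw, t, d).permute(0, 2, 1, 3)
        # Decode each (now temporally smoothed) frame latent independently.
        return torch.stack([self.frame_decoder(x[:, i]) for i in range(t)], dim=1)
```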
5. Experimental Setup
- Datasets:
  - Training: A combination of the public `NuPlan` dataset and 700 scenes from the `NuScenes` dataset. All images are resized to a fixed resolution.
  - Evaluation:
    - Video Generation: `NuPlan` test set and `NuScenes` validation set.
    - Trajectory Planning: `NuScenes` benchmark and the `NAVSIM` benchmark.
- Evaluation Metrics:
  - Fréchet Inception Distance (FID): Measures the quality and realism of single generated images. It calculates the distance between the distribution of features from generated images and real images, as extracted by an InceptionV3 network. A lower FID means the generated images are more similar to real ones. (A small computational sketch follows below.)
    $$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$
    - $\mu_r, \mu_g$: The mean of the Inception features for real and generated images, respectively.
    - $\Sigma_r, \Sigma_g$: The covariance matrices of the Inception features.
    - $\operatorname{Tr}(\cdot)$: The trace of a matrix.
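For concreteness, here is a small NumPy/SciPy sketch of the FID computation from pre-extracted Inception features; feature extraction itself is assumed to happen elsewhere.

```python
# FID between two sets of Inception features (illustrative sketch).
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (num_samples, feature_dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # matrix square root
    covmean = covmean.real                                      # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```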
  - Fréchet Video Distance (FVD): An extension of FID for videos. It measures both the per-frame visual quality and the temporal consistency (realism of motion). It uses a different feature extractor (an I3D network pre-trained on Kinetics) to capture spatiotemporal features. A lower FVD is better. The formula is structurally identical to FID but uses video features.
  - L2 Error: Measures the accuracy of the predicted trajectory. It is the average Euclidean distance between the predicted waypoints and the ground-truth waypoints. For a trajectory with $N$ waypoints (see the sketch below):
    $$\text{L2} = \frac{1}{N} \sum_{i=1}^{N} \big\| \hat{p}_i - p_i \big\|_2$$
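A minimal NumPy sketch of this metric for a single predicted trajectory:

```python
# Average L2 (Euclidean) waypoint error for one trajectory (illustrative sketch).
import numpy as np

def l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: arrays of shape (N, 2) holding (x, y) waypoints."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())
```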
  - Collision Rate: The percentage of predicted trajectories that result in a collision with other objects (e.g., cars, pedestrians) in the scene. A lower rate is safer and better.
  - Predictive Driver Model Score (PDMS): A composite metric from the `NAVSIM` benchmark that evaluates the quality of a planned trajectory based on five factors (an illustrative aggregation sketch follows this list):
    - `NC` (No Collision): Collision avoidance score.
    - `DAC` (Drivable Area Compliance): Score for staying within drivable lanes.
    - `TTC` (Time-to-Collision): Score for maintaining safe following distances.
    - `Comf.` (Comfort): Score for producing smooth, comfortable trajectories (low jerk/acceleration).
    - `EP` (Ego Progress): Score for making progress towards the goal.
    Higher scores are better for all components and for the overall PDMS.
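The sketch below shows one way such a composite score can be assembled: hard safety terms act as multiplicative penalties and the remaining sub-scores are combined as a weighted average. The specific weights are assumptions chosen for illustration and should not be read as the official NAVSIM definition.

```python
# Illustrative PDMS-style aggregation (assumed weights, for illustration only).
def pdm_like_score(nc: float, dac: float, ttc: float, comfort: float, ep: float) -> float:
    """All sub-scores are assumed normalized to [0, 1]."""
    weighted = (5.0 * ttc + 2.0 * comfort + 5.0 * ep) / 12.0   # assumed soft-score weighting
    return nc * dac * weighted                                  # hard penalties multiply the result
```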
- Baselines: The paper compares Epona against several strong baselines, including:
  - Video Generation: `DriveGAN`, `DriveDreamer`, `WoVoGen`, `GenAD`, `Vista`, `DrivingWorld`.
  - Motion Planning: `ST-P3`, `UniAD`, `OccNet`, `VAD`, `GenAD`, `Doe-1`, `PARA-Drive`, `DRAMA`. These are state-of-the-art end-to-end planners.
6. Results & Analysis
Core Results
- Video Generation (Table 1): Epona is compared with other driving world models on the NuScenes benchmark.
  (Manual Transcription of Table 1)

  | Metric | DriveGAN [30] | DriveDreamer [56] | WoVoGen [36] | Drive-WM [57] | GenAD (OpenDV) [61] | Vista [17] | DrivingWorld [25] | Ours |
  |---|---|---|---|---|---|---|---|---|
  | FID ↓ | 73.4 | 52.6 | 27.6 | 15.8 | 15.4 | 6.9 | 7.4 | 7.5 |
  | FVD ↓ | 502.3 | 452.0 | 417.7 | 122.7 | 184.0 | 89.4 | 90.9 | 82.8 |
  | Max Duration / Frames* | N/A | 4s / 48 | 2.5s / 5 | 8s / 16 | 4s / 8 | 15s / 150 | 40s / 400 | 120s / 600 |

  - Analysis: Epona achieves the best `FVD` score (82.8), a 7.4% improvement over the previous best, `Vista`. This indicates superior video quality and temporal smoothness. While its `FID` is slightly worse than `Vista`'s, the most dramatic result is the Max Duration. Epona can generate coherent videos for 2 minutes (600 frames), an order of magnitude longer than any competing method, demonstrating the success of its autoregressive approach.
- Qualitative Comparison (Figure 5): Figure 5 contrasts video generations from Vista and Epona at timestamps ranging from 0 seconds to beyond 2 minutes; Epona's scenes remain noticeably sharper and more stable over long horizons.
  This figure visually confirms the quantitative results. While `Vista`'s generation quickly degrades into a blurry mess after a few seconds, Epona maintains high fidelity, consistency, and structural detail for over 2 minutes.
- Trajectory-controlled Generation (Figure 6): Figure 6 pairs driving-trajectory conditions with the corresponding generated video frames over 16 seconds, showing straight, left-turn, and right-turn trajectories at the top and the resulting scenes at different conditioning timestamps below.
  This shows that Epona can be controlled by providing an external trajectory. Given the same starting frame, it generates different future videos that precisely follow the specified motion (e.g., going straight vs. turning left), a crucial feature for simulation and data augmentation.
- Motion Planning (Tables 3 & 4):
  (Manual Transcription of Table 3: NuScenes Planning Benchmark)

  | Method | Input | Auxiliary Supervision | L2 1s (m) ↓ | L2 2s (m) ↓ | L2 3s (m) ↓ | L2 Avg. (m) ↓ | Coll. 1s (%) ↓ | Coll. 2s (%) ↓ | Coll. 3s (%) ↓ | Coll. Avg. (%) ↓ |
  |---|---|---|---|---|---|---|---|---|---|---|
  | UniAD [46] | Camera | Map & Box & Motion & Tracklets & Occ | 0.48 | 0.96 | 1.65 | 1.03 | 0.05 | 0.17 | 0.71 | 0.31 |
  | GenAD [65] | Camera | Map & Box & Motion | 0.36 | 0.83 | 1.55 | 0.91 | 0.06 | 0.23 | 1.00 | 0.43 |
  | Doe-1 [66] | Camera* | QA | 0.50 | 1.18 | 2.11 | 1.26 | 0.04 | 0.37 | 1.19 | 0.53 |
  | Ours | Camera* | None | 0.61 | 1.17 | 1.98 | 1.25 | 0.01 | 0.22 | 0.85 | 0.36 |

  (Note: Table 3 is abbreviated to show key comparisons)

  (Manual Transcription of Table 4: NAVSIM Planning Benchmark)

  | Method | Input | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑ |
  |---|---|---|---|---|---|---|---|
  | Human | / | 100 | 100 | 100 | 99.9 | 87.5 | 94.8 |
  | UniAD [46] | Camera | 97.8 | 91.9 | 92.9 | 100 | 78.8 | 83.4 |
  | PARA-Drive [59] | Camera | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.6 |
  | DRAMA [63] | Camera & Lidar | 98.0 | 93.1 | 94.8 | 100 | 80.1 | 85.5 |
  | Ours | Camera | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |

  - Analysis: On NuScenes (Table 3), Epona achieves competitive L2 error while having the lowest average collision rate. This is remarkable because, unlike other methods that use extensive supervision (3D boxes, maps, etc.), Epona is trained with no auxiliary supervision. On the more challenging `NAVSIM` benchmark (Table 4), Epona achieves the highest overall PDMS score (86.2), outperforming specialized end-to-end planners, even those that use Lidar. This demonstrates that learning to predict the visual future is a powerful supervisory signal for learning robust driving policies.
Ablations / Parameter Sensitivity
- Effect of Joint Training (Table 5): When video prediction is disabled (`w/o Joint Training`), the planning performance (`PDMS`) drops from 86.2 to 78.1. This confirms that jointly modeling video and trajectories forces the model to learn a much richer and more accurate representation of the world, which significantly benefits the planning task.
- Effect of Chain-of-Forward Training (Figure 8): Figure 8 plots FID on the NuPlan test set against the number of autoregressively generated frames, comparing models trained with and without the Chain-of-Forward strategy; the Chain-of-Forward curve stays markedly lower.
  This graph clearly shows that as the number of generated frames increases, the `FID` score of the model trained without `Chain-of-Forward` (red dashed line) degrades much faster than the model trained with it (blue solid line). This validates that the strategy is effective at mitigating autoregressive drift and is crucial for long-horizon generation.
- Effect of Temporal-aware DCAE Decoder (Table 6): Adding the temporal module to the decoder improves FVD scores across all prediction lengths (e.g., FVD40 drops from 100.11 to 74.88). This confirms that the modification helps reduce flickering and improves the temporal consistency of the generated videos.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces Epona, a novel autoregressive diffusion world model that successfully combines the visual quality of diffusion models with the temporal flexibility of autoregressive models. Through innovations like decoupled spatiotemporal modeling, modular prediction, and a `chain-of-forward` training strategy, Epona sets a new state of the art in long-horizon driving video generation and also functions as a highly effective motion planner. The results strongly suggest that learning to predict the future in a self-supervised manner is a powerful path toward building general-purpose autonomous driving systems.
- Limitations & Future Work: The paper itself does not explicitly state limitations. However, some can be inferred:
- Computational Cost: Epona is a large model (2.5B parameters) trained for two weeks on 48 A100 GPUs. Inference, especially for video generation, is also computationally intensive (Table 2 shows it takes ~2s per frame with 100 sampling steps). While the planning module is real-time, the full world simulation is not.
- Scope: The model is demonstrated on front-camera video. A full autonomous driving system would need to handle multi-camera inputs and potentially other sensor modalities like Lidar and Radar.
- Closed-loop Evaluation: While NAVSIM is a closed-loop benchmark, more extensive, real-world, closed-loop testing would be needed to fully validate the planner's safety and reliability.
- Personal Insights & Critique:
- Elegant Synthesis: The core strength of Epona is its elegant synthesis of two previously competing paradigms. It doesn't invent an entirely new component but rather combines existing, powerful ideas (Transformers, Diffusion, Autoregression) in a novel and highly effective architecture.
- Planning as a Byproduct: The most compelling finding is how well the motion planner performs despite not being the sole focus. It shows that a good understanding of world dynamics (learned through video prediction) is perhaps more important than relying on heavily annotated intermediate representations (like 3D boxes or semantic maps). This supports the argument for end-to-end, self-supervised approaches.
- The `Chain-of-Forward` Strategy: This is a simple but clever technique that addresses a fundamental problem in autoregressive modeling. It is general and could likely be applied to other sequential generation tasks beyond autonomous driving.
- Future Impact: Epona represents a significant step towards general-purpose, scalable world models for robotics and AI. By showing that high-fidelity simulation and effective planning can be learned jointly from raw sensory data, it paves the way for agents that can learn to act in complex environments with minimal human supervision.
Similar papers
DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT
DrivingWorld introduces a video GPT model with spatio-temporal fusion, combining next-state and next-token prediction to enhance autonomous driving video generation, achieving over 40 seconds of high-fidelity, coherent video with novel masking and reweighting to reduce drift.
STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation
STAGE uses hierarchical temporal feature transfer and multi-stage training to enhance long-horizon driving video consistency and quality, mitigating error accumulation and generating high-quality 600-frame videos outperforming prior methods.