
ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Published: 2025-12-03
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents ReCamDriving, a purely vision-based framework for camera-controlled novel-trajectory video generation. It leverages dense 3D Gaussian Splatting renderings as explicit geometric guidance, employs a two-stage training paradigm to enhance camera controllability and structural consistency, and introduces the ParaDrive dataset of over 110K parallel-trajectory video pairs built with a cross-trajectory data curation strategy.

Abstract

We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

1.2. Authors

Yaokun Li^{1}, Shuaixian Wang^{1,3}, Mantang Guo^{2}, Jiehui Huang^{4}, Taojun Ding^{2}, Mu Hu^{4}, Kaixuan Wang^{2}, Shaojie Shen^{4}, Guang Tan^{1*}

  • ^{1} Sun Yat-sen University
  • ^{2} ZYT
  • ^{3} Shenzhen Polytechnic University
  • ^{4} The Hong Kong University of Science and Technology

1.3. Journal/Conference

Published (UTC): 2025-12-03. Venue: arXiv (preprint). The content formatting suggests a submission to a major computer vision conference (e.g., CVPR/ICCV), but it is currently released as a preprint.

1.4. Publication Year

2025

1.5. Abstract

The paper proposes ReCamDriving, a framework for generating autonomous driving videos from new, unrecorded trajectories (novel viewpoints) using only camera data. Existing "repair-based" methods often struggle with complex artifacts, while "LiDAR-based" methods suffer from sparse and incomplete geometric information. ReCamDriving replaces LiDAR with dense 3D Gaussian Splatting (3DGS) renderings to provide explicit geometric guidance. It employs a two-stage training paradigm: Stage 1 learns coarse camera control using poses, and Stage 2 incorporates 3DGS renderings for fine-grained guidance. Additionally, a cross-trajectory data curation strategy is introduced to train lateral movement using monocular videos, resulting in the ParaDrive dataset (110K+ video pairs). Experiments show state-of-the-art performance in camera controllability and structural consistency.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: In autonomous driving, generating high-quality videos from trajectories different from the originally recorded path (novel-trajectory generation) is crucial for data augmentation and world model training.
  • Why Important: Collecting real-world multi-view data is expensive and difficult to synchronize. Simulating new views from a single recorded video is a scalable alternative.
  • Existing Gaps:
    1. Repair-based methods: First render a 3D scene (often with artifacts) and then use AI to "fix" it. These often fail when artifacts are severe or complex, leading to hallucinations.
    2. LiDAR-based methods: Use LiDAR point clouds to guide the generation of new views. However, LiDAR is sparse (has gaps) and often lacks data for backgrounds or occluded areas, leading to inconsistencies.
    3. Data Scarcity: Real driving datasets usually only have a single trajectory, making it hard to find "ground truth" target views to train models that simulate lateral (side-to-side) movement.

2.2. Main Contributions / Findings

  1. LiDAR-Free Framework: Proposed ReCamDriving, which uses 3DGS renderings instead of LiDAR. Although 3DGS renderings have artifacts, they provide dense and scene-complete geometric cues compared to sparse LiDAR.

  2. Two-Stage Training Strategy: To prevent the model from just learning to "fix artifacts" (restoration) instead of learning camera control:

    • Stage 1: Trains on camera poses for coarse control.
    • Stage 2: Freezes the base network and adds modules to process 3DGS renderings for fine-grained guidance.
  3. Cross-Trajectory Data Curation: A clever strategy to create training pairs from single-camera videos. It uses 3DGS to render an offset view to use as input, while using the real recorded video as the "ground truth" target. This forces the model to learn how to transform a view to match the real world.

  4. ParaDrive Dataset: Constructed a large-scale dataset with over 110K parallel-trajectory video pairs based on Waymo and NuScenes data.

  5. Performance: Achieves state-of-the-art (SOTA) results in visual quality, camera accuracy, and 3D consistency, especially significantly outperforming baselines in lateral view generation.

    The following figure (Figure 1 from the original paper) compares the proposed approach (c) with repair-based (a) and LiDAR-based (b) methods, illustrating how ReCamDriving leverages 3DGS for better consistency:

    Figure 1. Comparison of novel-trajectory generation. Repair-based methods (e.g., Difix3D+ [53]) suffer from severe artifacts under novel trajectories. The figure contrasts (a) 3DGS-based repair methods, (b) LiDAR-based controllable generation, and (c) the proposed ReCamDriving, which conditions generation on camera poses and 3DGS renderings.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • 3D Gaussian Splatting (3DGS): A modern technique for 3D scene reconstruction. Instead of using a mesh (triangles) or a neural network (NeRF), it represents a scene as millions of 3D "blobs" (Gaussians). Each blob has a position, color, opacity, and size. It renders extremely fast but can produce "blurry" or "needle-like" artifacts when viewed from angles not seen during training.
  • Latent Diffusion Models (LDMs): A type of generative AI (like Stable Diffusion) that creates images/videos by gradually removing noise from a random signal. "Latent" means it operates in a compressed mathematical space (latent space) rather than pixel space, making it more efficient.
  • Flow Matching: A newer alternative to standard diffusion training. Instead of modeling a random walk to remove noise, it models a direct "straight line" (flow) from noise to the target image/video. It is generally more stable and faster to train.
  • DiT (Diffusion Transformer): An architecture that replaces the traditional U-Net in diffusion models with a Transformer (like GPT). It splits images/videos into patches (tokens) and processes them, allowing for better scaling and context understanding.
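
To make the DiT "patchify" idea concrete, here is a minimal PyTorch sketch that turns a video latent into per-frame tokens; the latent shape and patch size are illustrative assumptions, not the paper's settings.

```python
# Minimal patchify sketch: split a video latent into non-overlapping spatial
# patches and treat each patch as one token (shapes are illustrative).
import torch

latent = torch.randn(16, 8, 60, 104)             # (channels c, frames f, height h, width w)
p = 2                                            # spatial patch size
c, f, h, w = latent.shape
tokens = (
    latent.reshape(c, f, h // p, p, w // p, p)   # split h and w into patch grids
    .permute(1, 2, 4, 0, 3, 5)                   # (f, h/p, w/p, c, p, p)
    .reshape(f, (h // p) * (w // p), c * p * p)  # (frames, tokens per frame, token dim)
)
print(tokens.shape)                              # torch.Size([8, 1560, 64])
```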

3.2. Previous Works

  • Repair-based Methods (e.g., Difix3D+, GSFixer): These follow a "Reconstruct-then-Repair" pipeline. They construct a 3D scene, render a new view (which usually looks bad), and use a diffusion model to fix the image.
    • Critique: They learn a local mapping from "bad image" to "good image." If the rendering artifacts look different from what the model saw during training (out-of-distribution), the repair fails.
  • Camera-Controlled Generation (e.g., FreeVS, StreetCrafter): These generate video directly based on camera pose sequences. To get better structure, they often use LiDAR projections.
    • Critique: LiDAR is expensive and sparse. It doesn't tell the model what the sky or distant buildings look like, leading to inconsistencies. Also, they often train on "segment-based" pairs (moving forward in time), which doesn't help much with lateral (sideways) movement.

3.3. Differentiation Analysis

  • Vs. Repair Methods: ReCamDriving does not treat the 3DGS rendering as a "draft to be fixed." Instead, it uses it as a structural condition (a guide map) to generate a completely new video from scratch, guided by camera poses.
  • Vs. LiDAR Methods: It replaces sparse LiDAR points with dense 3DGS renderings. Even if the 3DGS rendering is blurry, it provides information about the entire scene (geometry and texture) everywhere, not just where a laser beam hit.

4. Methodology

4.1. Principles

The core idea is to synthesize a video for a target trajectory ($V_t$) given a source video ($V_s$) and the relative camera movement. The authors identify that simply telling the model "move 2 meters right" (pose condition) is too vague. Using LiDAR provides precise depth but leaves gaps. Therefore, they use 3DGS renderings of the target trajectory as a "dense hint." The challenge is to prevent the model from lazily copying the artifacts from this "hint." They solve this by training in two stages: first teaching the physics of camera movement, then adding the 3DGS hint for refinement.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Mathematical Foundation: Flow Matching

The paper utilizes a Latent Diffusion Model trained with Flow Matching.

  • Forward Process: The process defines a path from pure noise ($x_0$) to the data ($x_1$) over time $t \in [0, 1]$. The state at time $t$ is a linear interpolation: $x_t = t x_1 + (1 - t) x_0$

    • $x_t$: The noisy latent at time $t$.
    • $x_1$: The clean target video latent.
    • $x_0$: The initial random noise.
  • Velocity Field: Differentiating the above with respect to $t$ gives the velocity (direction of change): $v_t = \frac{dx_t}{dt} = x_1 - x_0$

  • Training Objective: The neural network $\epsilon_\theta$ learns to predict this velocity field $v_t$ given the current noisy state $x_t$, the camera condition $c_{\mathrm{cam}}$, and time $t$. The loss function is the mean squared error: $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{x_0, x_1, c_{\mathrm{cam}}, t \sim U(0,1)} \left\| \epsilon_\theta(x_t; c_{\mathrm{cam}}, t) - v_t \right\|^2$

    • $\epsilon_\theta$: The neural network (DiT).
    • $\| \cdot \|^2$: Squared L2 norm.
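
To make the objective concrete, here is a minimal flow-matching training step in PyTorch; `model`, the latent shapes, and the conditioning interface are illustrative assumptions, not the authors' implementation.

```python
# Minimal flow-matching loss, assuming `model` predicts the velocity field
# from (x_t, c_cam, t). Shapes and the model interface are assumptions.
import torch

def flow_matching_loss(model, x1, c_cam):
    """x1: clean target video latents, shape (B, f, c, h, w)."""
    x0 = torch.randn_like(x1)                      # initial noise
    t = torch.rand(x1.shape[0], device=x1.device)  # t ~ U(0, 1), one per sample
    t_exp = t.view(-1, 1, 1, 1, 1)                 # broadcast over latent dims
    x_t = t_exp * x1 + (1.0 - t_exp) * x0          # linear interpolation path
    v_target = x1 - x0                             # ground-truth velocity
    v_pred = model(x_t, c_cam, t)                  # network predicts the velocity
    return torch.mean((v_pred - v_target) ** 2)    # L_FM
```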

4.2.2. Network Architecture & Inputs

The model is based on the Wan2.1 foundation model (a DiT-based video generator).

  1. Inputs:

    • Source Video $V_s$ and Target/Noisy Video $V_t$ are encoded into latents $x_s, x_t \in \mathbb{R}^{f \times c \times h \times w}$ using a 3D VAE encoder.
    • Patchify: These are flattened into sequences of tokens: $\boldsymbol{x}_s, \boldsymbol{x}_t \in \mathbb{R}^{f \times l \times d}$, where $l = h \times w$ (spatial tokens) and $f$ is the number of frames.
  2. Condition Embeddings:

    • Relative Pose ($c_r$): Encodes the transformation $\Delta T$ from source to target.
    • Identity Pose ($c_I$): Encodes "no motion" (the identity matrix $I_4$).
    • Frame Embedding ($E_f$): Learnable vectors indicating which frame is which (temporal alignment). A minimal sketch of these conditions follows this list.
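
The exact camera encoder is not specified in this summary; the sketch below shows one plausible way to map per-frame 4x4 relative transforms to the token dimension with a small MLP. The `CameraEncoder` class and its design are assumptions (the paper's encoder may use, e.g., Plücker-ray embeddings instead).

```python
# Hypothetical pose-condition encoder: flatten each 4x4 transform and project
# it to the token dimension d. Identity matrices yield the "no motion" c_I.
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(16, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (B, f, 4, 4) transforms -> (B, f, d), later broadcast over l tokens
        return self.mlp(poses.flatten(-2))

encoder = CameraEncoder(d_model=1024)
c_I = encoder(torch.eye(4).repeat(2, 16, 1, 1))  # identity-pose ("no motion") condition
```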

4.2.3. Stage 1: Relative Pose-Guided Camera Transformation

This stage teaches the model strictly about camera movement physics, without 3DGS guidance yet.

  • Input Construction: The camera and frame embeddings are added to the video latents via broadcasting (copying along the spatial dimension $l$): $\boldsymbol{x}_i = \mathrm{Cat}(\boldsymbol{x}_t + \boldsymbol{c}_r + \boldsymbol{E}_f,\ \boldsymbol{x}_s + \boldsymbol{c}_I + \boldsymbol{E}_f)$ (see the sketch after Figure 3 below).

    • $\mathrm{Cat}$: Concatenation along the frame dimension $f$. This effectively stacks the noisy target frames and the clean source frames into one long sequence for the Transformer to process jointly.
    • Note: The source frames ($\boldsymbol{x}_s$) are conditioned with "zero motion" ($\boldsymbol{c}_I$), while the noisy target frames ($\boldsymbol{x}_t$) are conditioned with the "relative motion" ($\boldsymbol{c}_r$).
  • Processing: The concatenated input $\boldsymbol{x}_i$ is fed into $N$ layers of DiT blocks.

    • Each block has Self-Attention, Feed-Forward Network (FFN), and Norms.

    • Only the Self-Attention parameters are trainable here; all other parameters are initialized from Wan2.1 and kept frozen.

      The architecture of the DiT blocks and the two-stage pipeline is illustrated below (Figure 3 from the original paper):

      Figure 3. Schematic of the ReCamDriving framework, showing the processing of the source-trajectory video, the camera encoder, the 3D VAE encoder, and the two-stage generation process. Key elements such as the Identity Pose, the Relative Pose, and the 3DGS renderings used to generate the novel trajectory are highlighted.
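
The Stage-1 input construction above can be written compactly as tensor operations. The following sketch, with assumed shapes, shows the broadcasting of the pose/frame embeddings over the spatial tokens and the concatenation along the frame axis.

```python
# Sketch of x_i = Cat(x_t + c_r + E_f, x_s + c_I + E_f); shapes are assumptions.
import torch

B, f, l, d = 2, 16, 880, 1024
x_t = torch.randn(B, f, l, d)   # noisy target latents (patchified)
x_s = torch.randn(B, f, l, d)   # clean source latents (patchified)
c_r = torch.randn(B, f, 1, d)   # relative-pose embedding, broadcast over l
c_I = torch.randn(B, f, 1, d)   # identity-pose ("no motion") embedding
E_f = torch.randn(1, f, 1, d)   # learnable frame embedding, shared across the batch

x_i = torch.cat([x_t + c_r + E_f, x_s + c_I + E_f], dim=1)  # (B, 2f, l, d)
# x_i is then flattened into one token sequence and passed through N DiT
# blocks, where only the self-attention parameters are trainable in Stage 1.
```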

4.2.4. Stage 2: Fine-grained Camera Control via 3DGS

Once the model understands coarse movement, 3DGS renderings are added for precision.

  • Input: The novel-trajectory 3DGS rendering $V_{gs}$ is encoded and patchified into $x_{gs}$.

  • Condition Augmentation: $\bar{x}_{gs} = x_{gs} + c_r + E_f$. The 3DGS latent is also tagged with the relative pose and frame information.

  • Architecture Modification:

    1. Freeze: The Self-Attention layers trained in Stage 1 are frozen. This forces the model to retain its learned motion physics.
    2. New Modules: Two new attention modules are added to each DiT block:
      • Rendering Attention: A self-attention layer that processes the 3DGS latent $\bar{x}_{gs}$ to extract spatio-temporal geometric features.
      • Cross Attention: Fuses the 3DGS features into the main diffusion stream. It takes the output of the frozen Self-Attention ($\bar{x}_i$) as query and the processed 3DGS features as key/value (see the sketch after this list).
    • Logic: This design ensures the 3DGS rendering is used as a reference for geometry, not as a template to simply copy-paste (which would copy artifacts).
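
A minimal sketch of a Stage-2 block is given below, using standard multi-head attention modules. The layer sizes, norm placement, and residual wiring are assumptions; treat this as an illustration of the frozen self-attention plus rendering/cross-attention pattern rather than the authors' code.

```python
# Sketch of a Stage-2 DiT block: frozen Stage-1 self-attention, new
# "rendering attention" over 3DGS tokens, and cross-attention fusing them.
import torch
import torch.nn as nn

class Stage2Block(nn.Module):
    def __init__(self, d: int, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)    # trained in Stage 1
        self.render_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # new, trainable
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)   # new, trainable
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        for p in self.self_attn.parameters():
            p.requires_grad = False  # freeze to retain the Stage-1 motion behavior

    def forward(self, x_i: torch.Tensor, x_gs: torch.Tensor) -> torch.Tensor:
        x_bar, _ = self.self_attn(x_i, x_i, x_i)        # main diffusion stream
        g, _ = self.render_attn(x_gs, x_gs, x_gs)       # 3DGS geometric features
        fused, _ = self.cross_attn(x_bar, g, g)         # query: x_bar, key/value: g
        return x_bar + fused + self.ffn(x_bar + fused)  # residual fusion (assumed wiring)
```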

4.2.5. Cross-Trajectory Data Curation

To train this system, one needs pairs of (Source View, Target View). In real datasets (e.g., Waymo), we only have one camera path.

  • The Strategy:

    1. Train Time:
      • Target ($x_1$): The real, clean recorded video from the dataset.
      • Source ($x_s$): A synthetic 3DGS rendering of a trajectory shifted laterally (e.g., 3 meters left).
      • Task: The model learns to take a "fake/artifact-prone offset view" and generate the "real/clean current view."
    2. Inference Time:
      • Source ($x_s$): The real, clean recorded video.
      • Target ($x_1$): The model generates a novel view (e.g., 3 meters right).
    • Why this works: Even though the training source is noisy (3DGS) and the test source is clean, the geometric transformation pattern (lateral shift) is consistent. The model learns "how to shift the view laterally," and providing a cleaner source at test time actually yields better results (as shown in the ablations).

      The data curation strategy is depicted in Figure 2 from the paper:

      Figure 2. (a, b) Comparison of training and inference camera-transformation patterns. (c) Our training and inference data strategy. (Trans.: Transformation; Traj.: Trajectory.) Panels (a) and (b) show the recorded and generated novel trajectories during training and inference, respectively; (c) illustrates how the camera conditions and the loss are applied at training versus inference time.
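
The pairing logic can be summarized as a small data-construction sketch; the `shift_lateral`, `render`, and `relative_to` calls are hypothetical placeholders for whatever 3DGS tooling (e.g., DriveStudio) is actually used.

```python
# Sketch of cross-trajectory pair construction: source = 3DGS rendering of an
# offset trajectory, target = the real recorded clip. APIs are hypothetical.
from dataclasses import dataclass

@dataclass
class TrainingPair:
    source_frames: list   # 3DGS rendering of the offset trajectory (artifact-prone)
    target_frames: list   # real recorded video (clean ground truth)
    relative_pose: list   # per-frame transform from source to target trajectory

def build_pair(scene, lateral_offset_m: float) -> TrainingPair:
    offset_traj = scene.recorded_trajectory.shift_lateral(lateral_offset_m)  # hypothetical API
    return TrainingPair(
        source_frames=scene.gaussian_splat.render(offset_traj),              # hypothetical API
        target_frames=scene.recorded_video,
        relative_pose=offset_traj.relative_to(scene.recorded_trajectory),    # hypothetical API
    )
```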

5. Experimental Setup

5.1. Datasets

  • ParaDrive Dataset: A new dataset constructed by the authors.
    • Source: ~1.6K scenes from Waymo Open Dataset (WOD) and NuScenes.
    • Construction: Used DriveStudio to train 3DGS models for each scene.
    • Scale: Over 110K parallel-trajectory video pairs.
    • Configuration: For training, they generated laterally shifted trajectories ($\pm 1$ m, $\pm 2$ m, $\pm 3$ m, $\pm 4$ m) to serve as source videos (see the sketch below).
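
Generating a laterally shifted trajectory amounts to translating each camera pose along its own right axis. The sketch below shows one way to do this with camera-to-world matrices; it is an illustrative construction, not the authors' exact pipeline.

```python
# Shift each camera-to-world pose along its local right (x) axis by offset_m.
import numpy as np

def shift_trajectory_lateral(c2w_poses: np.ndarray, offset_m: float) -> np.ndarray:
    """c2w_poses: (N, 4, 4) camera-to-world matrices; positive offset shifts right."""
    shifted = c2w_poses.copy()
    right_axes = c2w_poses[:, :3, 0]            # camera x-axis expressed in world coords
    shifted[:, :3, 3] += offset_m * right_axes  # move the camera center laterally
    return shifted

# e.g., offsets of ±1 m to ±4 m as used for ParaDrive
trajectories = [shift_trajectory_lateral(np.tile(np.eye(4), (8, 1, 1)), o)
                for o in (-4, -3, -2, -1, 1, 2, 3, 4)]
```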

5.2. Evaluation Metrics

The authors evaluate three aspects using the following metrics:

  1. Visual Quality:

    • Imaging Quality (IQ): A holistic metric for image fidelity (specific formula not detailed in text, likely a standard perceptual metric or user study proxy).
    • CLIP-F (Temporal Consistency): Measures the average cosine similarity of CLIP embeddings between adjacent frames: $\text{CLIP-F} = \frac{1}{T-1} \sum_{t=1}^{T-1} \text{Sim}(\text{CLIP}(I_t), \text{CLIP}(I_{t+1}))$
      • $I_t$: Frame at time $t$.
      • $\text{Sim}$: Cosine similarity.
  2. Camera Accuracy:

    • Uses MegaSaM (a SOTA pose estimator) to estimate camera parameters from the generated video and compares them to the target trajectory.
    • Rotation Error (RErr): Difference in rotation angles (degrees).
    • Translation Error (TErr): Euclidean distance between estimated and target positions (meters).
  3. View Consistency:

    • Fréchet Video Distance (FVD): Measures the distribution distance between generated videos and real videos in a feature space. Lower is better.
    • Fréchet Image Distance (FID): Similar to FVD but frame-by-frame. Lower is better.
    • CLIP-V: Frame-wise CLIP similarity between the generated video and the source video (or ground truth if available), measuring semantic preservation across views.
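
As an example of how the temporal-consistency metric could be computed, the sketch below implements CLIP-F with the open_clip package; the backbone choice and preprocessing are assumptions, not necessarily what the authors used.

```python
# CLIP-F sketch: average cosine similarity of CLIP embeddings of adjacent frames.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def clip_f(frames) -> float:
    """frames: list of PIL.Image frames from one generated video."""
    x = torch.stack([preprocess(f) for f in frames])  # (T, 3, H, W)
    emb = model.encode_image(x)
    emb = emb / emb.norm(dim=-1, keepdim=True)        # unit-normalize embeddings
    sims = (emb[:-1] * emb[1:]).sum(dim=-1)           # cosine similarity of adjacent frames
    return sims.mean().item()
```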

5.3. Baselines

  • DriveStudio: A pure 3DGS reconstruction method (renders novel views directly).
  • Difix3D+: A state-of-the-art repair-based method (renders 3DGS then fixes with diffusion).
  • FreeVS: A generative method using LiDAR projections on longitudinal segments.
  • StreetCrafter: Similar to FreeVS, uses LiDAR and Vista foundation model.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that ReCamDriving outperforms existing methods, particularly in maintaining structural consistency during large lateral movements.

Quantitative Comparison (NuScenes)

The following are the results from Table 1 of the original paper (Data for Lateral Offset ±1m and ±2m):

Columns group into Visual Quality (IQ, CLIP-F), Camera Accuracy (RErr., TErr.), and View Consistency (FID, FVD, CLIP-V).

Lateral Offset ±1 m:

| Method | IQ↑ | CLIP-F↑ | RErr.↓ | TErr.↓ | FID↓ | FVD↓ | CLIP-V↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DriveStudio | 52.13 | 98.84 | - | - | 83.32 | 25.37 | 94.78 |
| Difix3D+ | 64.24 | 98.92 | 1.36 | 2.42 | 56.35 | 27.80 | 95.32 |
| FreeVS | 62.74 | 95.74 | 1.71 | 2.88 | 63.06 | 37.06 | 88.99 |
| StreetCrafter | 63.57 | 97.31 | 1.52 | 2.53 | 28.18 | 20.51 | 96.01 |
| Ours | **65.18** | **99.31** | **1.32** | **2.37** | **13.76** | **13.27** | **97.96** |

Lateral Offset ±2 m:

| Method | IQ↑ | CLIP-F↑ | RErr.↓ | TErr.↓ | FID↓ | FVD↓ | CLIP-V↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DriveStudio | 47.32 | 98.49 | - | - | 104.24 | 39.79 | 94.23 |
| Difix3D+ | 63.11 | 98.41 | 1.64 | 2.66 | 57.73 | 31.88 | 92.85 |
| FreeVS | 60.16 | 92.59 | 2.12 | 2.93 | 67.87 | 43.59 | 88.41 |
| StreetCrafter | 63.78 | 97.17 | 1.79 | 2.77 | 46.78 | 22.81 | 94.74 |
| Ours | **65.34** | **99.03** | **1.45** | **2.43** | **25.01** | **14.08** | **97.18** |

Analysis:

  • Visual Quality (IQ): ReCamDriving scores highest (65.18 at 1m), indicating superior image fidelity.
  • Consistency (FID/FVD): The gap is massive. For 1m offset, ReCamDriving achieves an FID of 13.76 vs. StreetCrafter's 28.18. This suggests the generated images are much more realistic and closer to the ground truth distribution.
  • Robustness: As the offset increases to 2m, baseline performance drops significantly (e.g., StreetCrafter FID goes 28.18 -> 46.78), while ReCamDriving degrades much more gracefully (13.76 -> 25.01), proving the robustness of 3DGS guidance.

Qualitative Analysis

Figure 4 (shown below) and Figure 5 of the paper show:

  • Baselines (FreeVS/StreetCrafter): Suffer from blurred or inconsistent backgrounds (trees, buildings) because LiDAR points are sparse at distance.

  • ReCamDriving: Maintains sharp details in distant regions and complex structures (like lane markings) thanks to the dense 3DGS guidance.

  • Difix3D+: Often hallucinates or creates "wobbly" structures because it tries to repair artifacts without understanding the underlying geometry as well as the 3DGS-guided generative model.

    The following figure (Figure 4 from the paper) provides a qualitative comparison on NuScenes:

    Figure 4. Qualitative comparison results on NuScenes [7]. The first row shows the source views, the second row shows views generated by DriveStudio and Difix3D+, and the third row shows views generated by the proposed method. The methods differ in how well they preserve road markings and environmental details.

6.2. Ablation Studies

6.2.1. Camera Conditions (Pose vs. LiDAR vs. 3DGS)

The authors tested different conditions for Stage 2. The results are from Table 3 of the original paper:

| Camera Condition | IQ↑ | FID↓ | FVD↓ | RErr.↓ | TErr.↓ |
| --- | --- | --- | --- | --- | --- |
| Pose | 60.13 | 34.86 | 32.31 | 3.01 | 4.23 |
| Pose + LiDAR | 61.32 | 31.23 | 27.78 | 1.53 | 2.69 |
| Pose + LiDAR + GS | 63.42 | 24.75 | 19.27 | 1.41 | 2.47 |
| Pose + GS (Ours) | **63.63** | **24.88** | **19.18** | **1.49** | **2.55** |

Finding: Adding LiDAR helps compared to just Pose, but Pose + GS (3DGS) yields significantly better visual metrics (FID 24.88 vs 31.23 for LiDAR). Combining both (LiDAR + GS) gives a negligible improvement, justifying the "LiDAR-Free" design choice.

6.2.2. Training Strategy (One-stage vs. Two-stage)

  • One-stage: Training everything together makes the model behave like a repair model—it focuses on fixing local artifacts in the 3DGS rendering and fails to learn proper global geometry or camera movement.
  • Two-stage (Ours): Freezing the motion module (Stage 1) forces the model to rely on camera physics, using 3DGS only for refinement. This yields much lower FVD (18.32 vs 25.16).

6.2.3. Cross-Trajectory vs. Longitudinal Data

Using longitudinal segments (front/back of the same trip) as training pairs (like prior work) results in poor lateral control (RErr 1.97). The proposed Cross-Trajectory strategy (using lateral 3DGS offsets) significantly improves accuracy (RErr 1.49).

7. Conclusion & Reflections

7.1. Conclusion Summary

ReCamDriving successfully addresses the limitations of sparse LiDAR and artifact-prone rendering in autonomous driving video generation. By using 3DGS renderings as a dense structural condition within a two-stage flow matching framework, it achieves precise camera control and high visual fidelity. The accompanying ParaDrive dataset and cross-trajectory data curation strategy provide a scalable path for training such models using only monocular video data.

7.2. Limitations & Future Work

  • Small Distant Objects: The authors note that the method still struggles with small objects far away (e.g., distant pedestrians). This is because 3DGS itself might render these poorly (as blobs), and the diffusion model struggles to recover semantic detail from such small cues.
  • Future Direction: Integrating stronger structural priors or semantic guidance specifically for small objects could be a solution.

7.3. Personal Insights & Critique

  • Innovation: The shift from "LiDAR as geometry" to "Neural Rendering (3DGS) as geometry" is smart. 3DGS is imperfect but dense; LiDAR is precise but sparse. For a generative model that needs to "dream" pixels, dense (even if blurry) guidance seems more valuable than sparse points.
  • Methodology: The two-stage training is a critical engineering detail. It highlights a common pitfall in conditional generation: if the condition is too similar to the target (e.g., a rendered image vs. real image), the network becomes lazy and just learns a denoising filter. Forcing it to learn motion first (Stage 1) prevents this "shortcut learning."
  • Applicability: This approach could be very useful for training end-to-end driving models (World Models) where simulating rare "off-trajectory" scenarios (e.g., drifting out of a lane) is essential for safety training. The ability to generate consistent lateral views is key here.
