InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
TL;DR Summary
InfiniCube integrates sparse voxel 3D generation with video models, enabling scalable, controllable, and consistent dynamic 3D driving scenes using multimodal inputs like maps, bounding boxes, and text.
Abstract
We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
- Authors: Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Sanja Fidler, Hao Su.
- Affiliations: The authors are affiliated with several leading institutions in AI and autonomous driving research, including NVIDIA, University of Toronto, Vector Institute, Reconstruct Inc., and University of California, San Diego. This diverse team brings expertise from both industry research labs and academia.
- Journal/Conference: This paper is a preprint available on arXiv. While not yet formally peer-reviewed in a top-tier conference like CVPR or NeurIPS at the time of this analysis, the authors' affiliations and the quality of the work suggest it targets such high-impact venues.
- Publication Year: 2024
- Abstract: The authors present InfiniCube, a method to generate large-scale, dynamic, and controllable 3D driving scenes. Existing methods are often limited in scale or lack consistency. InfiniCube overcomes this by combining a 3D voxel generative model with a video generation model. First, it generates an "unbounded" 3D voxel world conditioned on High-Definition (HD) maps and vehicle bounding boxes. This voxel world then provides guidance to a video diffusion model, enabling the synthesis of long, consistent driving videos. Finally, a fast, feed-forward approach lifts these videos and voxels into a dynamic 3D Gaussian Splatting (3DGS) representation, which includes controllable moving objects. The paper claims this method produces realistic, controllable scenes and provides extensive experiments to validate its effectiveness.
- Original Source Link: https://arxiv.org/abs/2412.03934v2
- PDF Link: https://arxiv.org/pdf/2412.03934v2.pdf
2. Executive Summary
-
Background & Motivation (Why):
- The core problem is the generation of realistic, large-scale, and controllable 3D environments, particularly for autonomous vehicle (AV) simulation. Such simulations are crucial for training and testing self-driving systems safely and efficiently, especially in rare or dangerous "edge-case" scenarios.
- Previous methods fall short in several key areas.
- Direct 3D generation models can create 3D structures but often lack photorealistic appearance and fine details.
- Video generation models produce high-fidelity visuals but struggle with 3D geometric consistency, are limited to short video clips, and do not output a true 3D representation suitable for physics simulation (e.g., LiDAR, collision detection).
InfiniCube introduces a novel, synergistic approach that combines the strengths of both worlds. It uses a 3D generative model for consistent large-scale geometry and a video model for high-fidelity appearance, effectively bridging the gap between them.
-
Main Contributions / Findings (What):
- A New Three-Stage Pipeline: The paper introduces a complete pipeline for generating large-scale, dynamic 3D scenes in the 3DGS format. The generation is controllable via HD maps, vehicle bounding box trajectories, and text prompts.
- Unbounded Voxel World Generation: It proposes a method to generate an infinitely extensible semantic voxel world using a conditional diffusion model and a seamless outpainting technique. This provides a consistent geometric and semantic backbone for the entire scene.
- World-Guided Long Video Generation: A key innovation is the use of guidance buffers (semantic maps and 3D coordinate maps) rendered from the voxel world. These buffers ground the video generation process, drastically improving appearance consistency and allowing a standard short-video model (SVD-XT) to generate long, coherent driving sequences (200 frames from a 25-frame model).
- Fast Dual-Branch Dynamic 3D Reconstruction: The paper presents a novel feed-forward method to lift the generated 2D video into a dynamic 3DGS scene in seconds. It uses a voxel branch for the static background and a pixel branch for dynamic objects and mid-ground details, combining the geometric accuracy of the former with the fine-grained detail capture of the latter.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Latent Diffusion Models (LDM): A type of generative model that has become state-of-the-art for creating high-quality images and videos. Instead of working with high-resolution data directly, an LDM first compresses the data (e.g., an image) into a smaller, lower-dimensional "latent" representation. It then learns to generate new data by starting with random noise in this latent space and progressively "denoising" it, guided by a learned understanding of the data distribution and optional conditions (like text).
- Sparse Voxels: A voxel is the 3D equivalent of a 2D pixel—a small cube in a 3D grid. A "sparse" voxel grid is a memory-efficient representation where only voxels that contain actual scene surfaces are stored, while empty space is ignored. This is essential for representing large-scale 3D scenes without prohibitive memory requirements.
- 3D Gaussian Splatting (3DGS): A modern technique for representing and rendering 3D scenes. Instead of using traditional meshes or voxels, a scene is represented by a collection of millions of tiny, semi-transparent 3D "Gaussians" (ellipsoids). Each Gaussian has properties like position, shape (scale/rotation), color, and opacity. This representation allows for extremely fast, high-quality, photorealistic rendering from any viewpoint. A minimal parameter sketch is given at the end of this concept list.
- High-Definition (HD) Maps: Highly detailed maps used by autonomous vehicles for navigation. Unlike consumer maps, they contain precise 3D geometric information about road elements like lane boundaries, road edges, traffic signs, and curbs, accurate to the centimeter level.
- Outpainting: A generative AI technique used to extend an image beyond its original borders by hallucinating new content that is contextually consistent with the existing image.
InfiniCube applies this concept in 3D to extend its voxel world chunk by chunk.
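Returning to the 3D Gaussian Splatting representation described above, the following is a minimal, illustrative sketch (not the paper's code) of the per-Gaussian parameters such a scene stores; the field names and shapes are assumptions chosen for clarity.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianCloud:
    """Illustrative container for a 3D Gaussian Splatting scene (field names are assumptions)."""
    means: np.ndarray      # (N, 3) Gaussian centers in world space
    scales: np.ndarray     # (N, 3) per-axis extents of each ellipsoid
    rotations: np.ndarray  # (N, 4) unit quaternions giving ellipsoid orientation
    colors: np.ndarray     # (N, 3) RGB color (full systems often store SH coefficients)
    opacities: np.ndarray  # (N, 1) alpha in [0, 1] used during compositing

# A toy scene of 1000 random Gaussians
n = 1000
scene = GaussianCloud(
    means=np.random.randn(n, 3),
    scales=np.abs(np.random.randn(n, 3)) * 0.05,
    rotations=np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity rotations
    colors=np.random.rand(n, 3),
    opacities=np.random.rand(n, 1),
)
```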
-
Previous Works: The paper positions itself against two main lines of research:
- Direct 3D Scene Generation: Methods like InfiniCity and XCube learn to generate 3D scenes directly, often using voxel or neural field representations. While they can produce geometrically valid structures, they typically struggle to generate the rich, photorealistic textures and appearances seen in real-world videos.
- Controllable Video Generation: Methods like Vista and MagicDrive3D fine-tune powerful video diffusion models on driving data. They can generate realistic-looking driving videos conditioned on camera poses or HD maps. However, they suffer from two major drawbacks: the generated videos lack true 3D consistency (objects might warp or disappear between views), and they are limited to short durations, making them unsuitable for simulating long drives in large-scale scenes.
-
Technological Evolution:
InfiniCube represents a "best of both worlds" approach. It acknowledges that 3D models are best for geometric structure and consistency, while 2D video models excel at creating photorealistic appearance. The paper's novelty lies in creating a robust bridge between these two modalities, using the 3D voxel world to guide and stabilize the 2D video generation process over long sequences.
-
Differentiation: The following transcribed table from the paper highlights InfiniCube's unique capabilities.

Table 1. High-level comparison with existing solutions. This is a manual transcription of the table data.

| Method | Output Type: Video | Output Type: 3D Rep. | Detailed Geometry | Driving Length† |
| --- | --- | --- | --- | --- |
| Vista [19] | ✓ | ✗ | ✗ | 15 s |
| MagicDrive3D [16] | ✓ | 3DGS | ✗ | 6 s |
| InfiniCity [35] | ✗ | Voxels | ✗ | N/A |
| WoVoGen [39] | ✓ | Voxels | ✓ | 0.4 s |
| InfiniCube (Ours) | ✓ | Voxels + 3DGS | ✓ | 20 s‡ |

†: Measuring the video model generation length at 10 Hz. ‡: Our full pipeline outputs both long driving videos and a renderable 3DGS scene.

This table clearly shows that InfiniCube is the only method that produces both a long-duration video and a detailed 3D representation (3DGS), satisfying all the key requirements for a high-fidelity driving simulator.
4. Methodology (Core Technology & Implementation)
The InfiniCube pipeline, as illustrated in Image 2, is a three-stage process designed to generate an unbounded, dynamic 3D driving scene.
Figure caption (translated): A schematic of the paper's overall pipeline, covering unbounded voxel world generation, world-guided video generation, and dynamic 3D driving scene generation; it shows the input HD map and bounding boxes, the voxel diffusion and video diffusion stages, and the dual-branch 3D reconstruction that outputs a large-scale 3DGS scene.
4.1. Unbounded Voxel World Generation
This stage creates the large-scale, static geometric and semantic foundation of the scene.
-
Input: An HD map (containing 3D road lines and edges) and bounding boxes for vehicles (specifying their location and orientation).
-
Map Conditions: The diffusion model is conditioned on a 3D condition volume C, which encodes the input information (a minimal construction sketch follows this list). This volume has three parts:
- C_Map: Rasterized HD map polylines (road edges and lane lines) into two channels of a voxel grid.
- C_Road: A voxelized representation of the drivable road surface, derived by fitting planes to the HD map. This explicitly tells the model where the ground is, which was found to be crucial for generating correct geometry.
- C_Box: Voxelized bounding boxes. To preserve orientation information, which is lost in simple voxelization, the heading angle θ is encoded as a two-channel vector (sin θ, cos θ).
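The following is a minimal sketch (not the authors' code) of how such a condition volume could be assembled; the grid resolution, channel layout, and helper names are assumptions for illustration.

```python
import numpy as np

def build_condition_volume(map_polylines, road_points, boxes, grid_shape=(128, 128, 32), voxel_size=0.4):
    """Assemble a toy condition volume with map, road, and box channels (layout assumed)."""
    X, Y, Z = grid_shape
    c_map = np.zeros((2, X, Y, Z), dtype=np.float32)   # channel 0: road edges, 1: lane lines
    c_road = np.zeros((1, X, Y, Z), dtype=np.float32)  # drivable surface occupancy
    c_box = np.zeros((2, X, Y, Z), dtype=np.float32)   # (sin θ, cos θ) of vehicle heading

    def to_index(points):
        idx = np.floor(points / voxel_size).astype(int)
        keep = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
        return idx[keep]

    for channel, polyline in map_polylines:             # polyline: (N, 3) sampled points
        for i, j, k in to_index(polyline):
            c_map[channel, i, j, k] = 1.0

    for i, j, k in to_index(road_points):               # points sampled on fitted road planes
        c_road[0, i, j, k] = 1.0

    for center, heading in boxes:                       # one voxel per box center, for brevity
        i, j, k = np.floor(np.asarray(center) / voxel_size).astype(int)
        if 0 <= i < X and 0 <= j < Y and 0 <= k < Z:
            c_box[0, i, j, k] = np.sin(heading)
            c_box[1, i, j, k] = np.cos(heading)

    return np.concatenate([c_map, c_road, c_box], axis=0)  # (5, X, Y, Z)
```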
-
Single Chunk Generation: Using a sparse voxel Latent Diffusion Model (LDM) based on XCube, the system generates a single "chunk" of a semantic voxel grid. The model takes a random noise vector, concatenates it with the condition volume C, and iteratively denoises it using a 3D U-Net to produce a latent representation, which is then decoded into a sparse grid where each voxel has a semantic label (e.g., road, building, vegetation, car).
-
Unbounded Scene Outpainting: To create a scene larger than a single chunk, the model uses an outpainting strategy. When generating a new chunk adjacent to an existing one, the model uses the latent representation of the overlapping region from the existing chunk to guide the generation of the new chunk. This is achieved at each denoising step with the formula:

$$\tilde{\mathbf{z}}_t = (1 - \mathbf{m}) \odot \mathbf{z}_t^{\text{pred}} + \mathbf{m} \odot \mathbf{z}_t^{\text{known}}$$

- $\tilde{\mathbf{z}}_t$ is the updated latent for the full new chunk.
- $\mathbf{m}$ is a binary mask indicating the overlapping region.
- ⊙ is element-wise multiplication.
- $\mathbf{z}_t^{\text{pred}}$ is the newly predicted latent for the non-overlapping part.
- $\mathbf{z}_t^{\text{known}}$ is a noised version of the known latent from the existing, already generated chunk.

This process ensures a seamless and consistent transition between adjacent chunks, as shown in Image 6, which contrasts the "consistent transition" of InfiniCube with the artifacts from a naive approach (a code sketch of this blending step follows the figure below).
Figure caption (translated): A schematic comparing three partial generations of a dynamic 3D driving scene. The left shows the initial chunk, the middle shows the inconsistent transition region produced by the naïve outpainting baseline, and the right shows the consistent transition achieved by the paper's outpainting method.
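As a rough illustration of this blending rule, here is a sketch of one guided denoising step under the assumption of a generic diffusion sampler; the `model` and `scheduler` objects are stand-ins with an assumed interface, not the paper's implementation.

```python
import torch

def outpaint_denoise_step(z_t, cond, mask, z0_known, scheduler, model, t):
    """One guided denoising step: keep the overlap region consistent with the known chunk.

    mask:     1 inside the overlapping (already generated) region, 0 elsewhere.
    z0_known: clean latent of the existing chunk, padded to the new chunk's extent.
    scheduler/model: stand-ins for a diffusion noise scheduler and denoiser (assumed API).
    """
    # Predict the next latent for the whole new chunk from the current noisy latent.
    eps = model(torch.cat([z_t, cond], dim=1), t)   # noise prediction
    z_pred = scheduler.step(eps, t, z_t)            # denoised latent proposal

    # Re-noise the known latent to the matching noise level.
    noise = torch.randn_like(z0_known)
    z_known_t = scheduler.add_noise(z0_known, noise, t)

    # Blend: trust the known chunk inside the overlap, the new prediction elsewhere.
    return (1 - mask) * z_pred + mask * z_known_t
```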
-
4.2. World-Guided Video Generation
This stage synthesizes a photorealistic, long, and consistent video by "painting" an appearance onto the geometric skeleton from the previous stage.
-
Base Model: The method uses a pre-trained video diffusion model, Stable Video Diffusion XT (SVD-XT), which natively generates 25-frame videos. To generate longer videos, it's run auto-regressively (the last generated frame becomes the input for the next segment).
-
The Challenge: Auto-regressive generation quickly leads to accumulated errors, causing a significant drop in quality and consistency over long sequences.
-
Innovation: Guidance Buffers: To combat this, InfiniCube introduces novel guidance buffers rendered from the generated voxel world. These are fed as additional conditions to the video model at each step, keeping the generation grounded in the 3D scene.
- Semantic Buffer: A video sequence of rendered semantic maps from the ego-vehicle's perspective. Each semantic category (building, tree, road) has a distinct color. Crucially, different car instances are given different saturation levels to help the model distinguish them and maintain their individual appearances.
- Coordinate Buffer: A video sequence where each pixel's color value represents the normalized 3D coordinate of the voxel it corresponds to in the scene. This provides explicit and powerful 3D correspondence information across frames, helping the model understand how the scene geometry projects into the camera view as it moves.
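A minimal sketch of what a coordinate buffer encodes (an illustration, not the paper's renderer): given a per-pixel hit position in world space, normalize it into the scene's bounding box so it can be stored as an RGB image.

```python
import numpy as np

def coordinate_buffer(hit_points, scene_min, scene_max):
    """Map per-pixel 3D hit positions (H, W, 3) to normalized RGB values in [0, 1].

    hit_points come from ray-casting the camera against the voxel world
    (the ray-caster itself is omitted here); scene_min/scene_max bound the scene.
    """
    extent = np.maximum(scene_max - scene_min, 1e-6)
    coords = (hit_points - scene_min) / extent            # normalize each axis to [0, 1]
    return np.clip(coords, 0.0, 1.0).astype(np.float32)   # store x, y, z as R, G, B

# Toy usage: a 4x4 "image" of hit points inside a 100 m cube
pts = np.random.uniform(0, 100, size=(4, 4, 3))
buf = coordinate_buffer(pts, scene_min=np.zeros(3), scene_max=np.full(3, 100.0))
```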
-
Text Control: To allow for weather and style control (e.g., "sunny," "rainy"), the very first frame of the video is generated using a
ControlNet-like model conditioned on the semantic buffer and a user-provided text prompt. This initial frame then conditions the start of the SVD-XT video generation process.
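Putting this stage together, the sketch below shows the general shape of an auto-regressive rollout in which each 25-frame segment is conditioned on the previous segment's last frame plus the per-frame guidance buffers; the function names and segment bookkeeping are assumptions, not the released code.

```python
def generate_long_video(first_frame, semantic_buffers, coord_buffers, video_model, segment_len=25):
    """Auto-regressive rollout: each segment is conditioned on the previous segment's last frame.

    semantic_buffers / coord_buffers: per-frame guidance rendered from the voxel world.
    video_model: stand-in callable that returns one generated frame per guidance frame,
                 given a start frame and the matching slices of guidance (assumed interface).
    """
    frames, current = [], first_frame
    for start in range(0, len(semantic_buffers), segment_len):
        end = start + segment_len
        segment = video_model(
            start_frame=current,
            semantic=semantic_buffers[start:end],
            coords=coord_buffers[start:end],
        )
        frames.extend(segment)
        current = segment[-1]   # the last frame seeds the next segment
    return frames
```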
4.3. Dynamic 3DGS Scene Generation
The final stage lifts the 2D video and 3D voxel world into a full dynamic 3D scene represented by 3D Gaussian Splatting (3DGS).
- Motivation: A direct, single-model reconstruction is difficult. Voxel-based methods are good for static geometry but poor at fine details and dynamic objects. Pixel-based methods capture details but can have poor geometric accuracy.
InfiniCube proposes a dual-branch approach to get the best of both.
- Voxel Branch (for Static Background):
- This branch reconstructs the static parts of the scene (buildings, roads, vegetation).
- It takes the static voxel world and the generated video frames (with dynamic objects masked out) as input.
- A 3D sparse U-Net processes features unprojected from the images onto the voxels and predicts the 3DGS attributes (color, scale, etc.) for a set of Gaussians centered at each static voxel.
- Pixel Branch (for Mid-ground & Dynamics):
- This branch reconstructs dynamic objects (vehicles) and "mid-ground" regions—areas not covered by the main voxel geometry, like thin poles or distant objects.
- It uses a 2D U-Net that takes an image and predicts per-pixel 3DGS attributes, essentially becoming a depth-and-color estimation task.
- Key Training Strategy: To improve depth prediction, it is trained in a self-supervised manner using the known voxel geometry. During training, it receives a masked version of the depth map rendered from the voxels and is tasked to predict the full depth map. This forces the model to learn to in-paint depth for regions where it has no explicit voxel information.
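A rough sketch of this masked-depth self-supervision idea (the loss shape and masking scheme are assumptions, not the authors' training code):

```python
import torch
import torch.nn.functional as F

def masked_depth_loss(pixel_unet, image, voxel_depth, mask_ratio=0.5):
    """Self-supervised depth target: hide part of the voxel-rendered depth, predict all of it.

    pixel_unet:  stand-in 2D network mapping (image, masked depth) -> per-pixel depth.
    voxel_depth: (B, 1, H, W) depth rendered from the static voxel world.
    """
    # Randomly drop a fraction of the known depth pixels.
    keep = (torch.rand_like(voxel_depth) > mask_ratio).float()
    masked_depth = voxel_depth * keep

    pred_depth = pixel_unet(torch.cat([image, masked_depth], dim=1))

    # Supervise everywhere the voxel depth is valid, including the hidden pixels,
    # which forces the network to in-paint depth it was not shown.
    valid = (voxel_depth > 0).float()
    per_pixel = F.l1_loss(pred_depth, voxel_depth, reduction="none")
    return (per_pixel * valid).sum() / valid.sum().clamp(min=1)
```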
- Sky Modeling: The sky is modeled separately using an implicit MLP, which takes a viewing direction and outputs a color. This allows for a consistent and realistic sky that extends to the horizon.
- Inference and Composition: At inference time:
- The voxel branch generates the static 3DGS background once.
- The pixel branch is run for every few frames to generate Gaussians for dynamic objects and mid-ground areas.
- The Gaussians for each dynamic vehicle are isolated using the semantic buffer, transformed according to their known bounding box trajectories, and aggregated.
- All sets of Gaussians (static, dynamic, mid-ground, sky) are composed to form the final, renderable 3D scene.
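As a simplified illustration of the composition step (the pose convention and field names are assumptions), moving a vehicle's Gaussians along its bounding-box trajectory amounts to applying that frame's rigid transform to the Gaussian centers and orientations:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def place_vehicle_gaussians(means, quats, box_pose):
    """Move a vehicle's Gaussians (defined in the box's local frame) to a world-frame pose.

    means: (N, 3) centers; quats: (N, 4) orientations as (x, y, z, w) quaternions.
    box_pose: dict with 'rotation' (3x3) and 'translation' (3,) for one trajectory frame.
    """
    rot = np.asarray(box_pose["rotation"])
    trans = np.asarray(box_pose["translation"])
    world_means = means @ rot.T + trans
    world_quats = (R.from_matrix(rot) * R.from_quat(quats)).as_quat()
    return world_means, world_quats

# Toy usage: one trajectory frame that shifts the vehicle 5 m along x
pose = {"rotation": np.eye(3), "translation": np.array([5.0, 0.0, 0.0])}
m, q = place_vehicle_gaussians(np.zeros((10, 3)), np.tile([0, 0, 0, 1.0], (10, 1)), pose)
```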
5. Experimental Setup
-
Datasets:
- The model was trained on the Waymo Open Dataset, a large-scale dataset for autonomous driving research. It provides synchronized LiDAR scans, camera images, and high-quality annotations for HD maps and 3D vehicle bounding boxes.
- Data Processing: The ground-truth 3D geometry was extracted by combining LiDAR points with multi-view stereo reconstruction (
COLMAP). Dynamic car geometry was aggregated over time in the car's local coordinate frame. Textual descriptions (e.g., "Daytime, Sunny") were generated for video clips using a large vision-language model (Llama-3.2-90B-Vision-Instruct).
-
Evaluation Metrics: The paper uses several standard metrics to evaluate image and video quality.
-
Fréchet Inception Distance (FID):
- Conceptual Definition: Measures the realism and diversity of generated images by comparing the statistical distribution of their high-level features to that of real images. A lower FID score indicates that the generated images are more similar to real ones. Features are extracted using a pre-trained Inception-v3 network.
- Mathematical Formula:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
- Symbol Explanation:
- The subscripts r and g denote the real and generated image feature sets, respectively.
- μ_r and μ_g are the mean vectors of the features.
- Σ_r and Σ_g are the covariance matrices of the features.
- Tr(·) is the trace of a matrix.
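A small numerical sketch of this formula given pre-extracted Inception features (the feature extraction itself is omitted; the array shapes are assumptions):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """Compute FID from two feature matrices of shape (num_images, feature_dim)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of covariances; discard tiny imaginary noise.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy usage with random low-dimensional "features"
fid = frechet_inception_distance(np.random.randn(256, 64), np.random.randn(256, 64))
```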
-
Peak Signal-to-Noise Ratio (PSNR):
- Conceptual Definition: Measures the pixel-wise difference between a ground-truth image and a generated image. It is based on the mean squared error (MSE). A higher PSNR indicates better reconstruction quality. It is sensitive to small pixel errors but may not perfectly align with human perception.
- Mathematical Formula:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
- Symbol Explanation:
- MAX_I is the maximum possible pixel value of the image (e.g., 255 for 8-bit images).
- MSE is the mean squared error between the two images.
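For concreteness, a direct implementation of this definition for 8-bit images (a sketch; NumPy arrays of identical shape are assumed):

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak Signal-to-Noise Ratio between two images of identical shape (8-bit by default)."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```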
-
Structural Similarity Index Measure (SSIM):
- Conceptual Definition: Measures the perceptual similarity between two images, considering changes in luminance, contrast, and structure. It is generally better aligned with human judgment of quality than PSNR. A value closer to 1 indicates higher similarity.
- Mathematical Formula:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
- Symbol Explanation:
- μ_x and μ_y are the average pixel values of images x and y.
- σ_x² and σ_y² are the variances.
- σ_xy is the covariance.
- c_1 and c_2 are small constants to stabilize the division.
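In practice this is usually computed with an existing implementation rather than by hand; a minimal usage sketch with recent scikit-image versions, assuming 8-bit RGB arrays:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Two toy 8-bit RGB images of the same size
img_a = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)
img_b = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)

# channel_axis selects the color axis; data_range is the dynamic range of the pixel values.
score = structural_similarity(img_a, img_b, channel_axis=-1, data_range=255)
```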
-
Learned Perceptual Image Patch Similarity (LPIPS):
- Conceptual Definition: Measures the perceptual distance between two images by comparing their feature representations extracted from a deep neural network (e.g., VGG or AlexNet). It is designed to correlate well with human perception of image similarity. A lower LPIPS score indicates that two images are more perceptually similar.
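A typical way to compute it is with the reference `lpips` PyTorch package; the short sketch below assumes image tensors scaled to [-1, 1] in (N, 3, H, W) format.

```python
import torch
import lpips

# AlexNet-based LPIPS; a VGG variant is also available via lpips.LPIPS(net='vgg').
loss_fn = lpips.LPIPS(net='alex')

# Two toy image batches with values scaled to [-1, 1]
img_a = torch.rand(1, 3, 128, 128) * 2 - 1
img_b = torch.rand(1, 3, 128, 128) * 2 - 1

distance = loss_fn(img_a, img_b)  # lower means more perceptually similar
```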
-
-
Baselines: The paper compares its method against a range of state-of-the-art models for video generation (
Panacea, Vista) and 3D reconstruction (PixelNeRF, PixelSplat, DUSt3R, MVSplat, SCube). These baselines represent the leading alternative approaches for the sub-tasks that InfiniCube performs.
6. Results & Analysis
-
Core Results:
1. Large-Scale Dynamic Scene Generation: Qualitative results in Image 1 and 5 show that
InfiniCube can generate vast, detailed, and visually plausible 3D scenes with controllable dynamic vehicles and weather conditions.
Figure caption (translated): A schematic of the pipeline from HD maps and vehicle bounding boxes to a large-scale generated dynamic 3D Gaussian scene, including voxel diffusion to create the voxel world, video diffusion combined with guidance buffers to synthesize a long driving video, and a fast feed-forward dynamic reconstruction producing the 3D scene.
2. Voxel World Generation: An ablation study demonstrates the importance of the
C_Road condition. Without it, the model sometimes fails to generate a proper ground plane, highlighting the effectiveness of the paper's specific conditioning strategy.
3. World-Guided Video Generation:
-
Quantitative Comparison: In long video generation,
InfiniCube significantly outperforms baselines. Figure 9a shows that its FID score remains low and stable over 200 frames, whereas the FID of baselines like Panacea degrades sharply after about 100 frames. This confirms that the 3D-grounded guidance buffers are critical for preventing auto-regressive error accumulation.
Human Evaluation: A user study on HD map alignment (transcribed below) shows that users found
InfiniCube's generated videos to be more consistent with the input map layout, especially in later frames. This is a crucial validation for creating reliable simulation data.

Table 2. Human evaluation of HD map alignment. This is a manual transcription of the table data.

| Positive Rate ↑ | Frame 40 | Frame 80 | Frame 120 |
| --- | --- | --- | --- |
| InfiniCube (Ours) | 84.6% | 83.9% | 84.8% |
| Panacea [59] | 76.8% | 54.0% | 53.4% |
Ablation Study: Figure 9b shows that the semantic buffer is the most important component for maintaining video quality, but the coordinate buffer provides an additional boost, helping to resolve ambiguities.
4. 3DGS Scene Generation:
-
Quantitative Comparison: The transcribed Table 3 shows that
InfiniCube achieves state-of-the-art results in novel view synthesis. It outperforms all baselines, including its direct predecessor SCube, on PSNR, SSIM, and LPIPS. This improvement is attributed to the pixel branch, which effectively handles mid-ground and dynamic regions.

Table 3. Quantitative comparisons of novel view rendering. This is a manual transcription of the table data.

| Method | PSNR↑ (T+5) | SSIM↑ (T+5) | LPIPS↓ (T+5) | PSNR↑ (T+10) | SSIM↑ (T+10) | LPIPS↓ (T+10) |
| --- | --- | --- | --- | --- | --- | --- |
| PixelNeRF [71] | 15.21 | 0.52 | 0.64 | 14.61 | 0.49 | 0.66 |
| PixelSplat [7] | 20.11 | 0.70 | 0.60 | 18.77 | 0.66 | 0.62 |
| DUSt3R [55] | 17.08 | 0.62 | 0.56 | 16.08 | 0.58 | 0.60 |
| MVSplat [9] | 20.14 | 0.71 | 0.48 | 18.78 | 0.69 | 0.52 |
| MVSGaussian [37] | 16.49 | 0.70 | 0.60 | 16.42 | 0.60 | 0.59 |
| SCube [44] | 19.90 | 0.72 | 0.47 | 18.78 | 0.70 | 0.49 |
| InfiniCube (Ours) | 20.80 | 0.73 | 0.42 | 19.93 | 0.72 | 0.45 |
Qualitative Analysis: Image 3 (
Figure 10) visually demonstrates the benefit of the dual-branch approach. The voxel-only branch can miss details or create floaters, while the pixel-only branch can produce distorted geometry. The combined "Dual" output is clean and accurate, leveraging the strengths of both.
Figure caption (translated): This is Figure 10, showing input images from the Waymo dataset and novel-view renderings from the different branches of the voxel-based reconstruction. Dual-branch inference effectively removes the artifacts highlighted by red boxes in the single-branch results.
-
-
Applications: The coherent pipeline enables powerful editing applications:
-
Vehicle Insertion: As shown in Image 4 (
Figure 11), new vehicles can be inserted into a scene by adding them to the voxel world and re-running the video generation. The model realistically synthesizes their appearance, including plausible shadows.
Figure caption (translated): An illustration from the paper showing the video model's object-insertion capability; the top and bottom rows compare the scene before and after insertion. The post-insertion images show vehicles on the road casting realistic shadows, with red arrows indicating the shadow locations to emphasize the realism of the generated result.
Weather Control: By changing the text prompt at the start of the video generation stage, the same scene geometry can be rendered with different weather and lighting, such as "Sunny", "Foggy", or "Snowy", as demonstrated in Image 5 (
Figure 12). -
Sensor Simulation: The 3DGS output is a true 3D representation, allowing for realistic sensor simulation, such as the LiDAR simulation shown in Image 9.
Figure caption (translated): A schematic of Figure S16 showing LiDAR simulation in the generated 3D Gaussian scene. It depicts how the LiDAR point cloud changes as the ego vehicle moves forward between two timestamps, with red and orange arrows indicating the changing positions of key objects.
-
7. Conclusion & Reflections
-
Conclusion Summary:
InfiniCube presents a significant step forward in generative 3D scene modeling for autonomous driving. By ingeniously combining a 3D voxel LDM for scalable geometry, a world-guided video model for high-fidelity appearance, and a fast dual-branch reconstruction method, it successfully generates large-scale, dynamic, and highly controllable 3D driving scenes. The method addresses key limitations of prior work related to scale, consistency, and 3D representation, opening the door for creating vast, realistic virtual worlds for AV simulation.
-
Limitations & Future Work: The authors acknowledge one primary limitation: while consistency between adjacent chunks is high, subtle drifts in appearance or geometry could accumulate over very long distances. Future work will focus on improving this global consistency and scaling the model with more diverse training data to handle an even wider range of scenes and objects.
-
Personal Insights & Critique:
- Strength - Synergistic Design: The core brilliance of
InfiniCube is not just in its individual components, but in their seamless integration. The voxel world is not just an output but a crucial input for the subsequent stages, acting as a "scaffold" for both video synthesis and final 3D reconstruction. This synergy is what enables the system's impressive scale and consistency.
- Strength - The Guidance Buffer: The idea of using rendered semantic and coordinate buffers to guide a pre-trained video model is exceptionally clever. It's a pragmatic and effective way to inject strong 3D priors into a 2D generative process, solving the long-standing problem of temporal inconsistency in auto-regressive video generation.
- Potential Weakness - Computational Cost: The paper notes substantial training costs (e.g., 192 GPU-days for the video model). While this is common for large-scale generative models, it may limit the ability of smaller research groups to replicate or build upon this work.
- Potential Weakness - Input Dependency: The method's success relies on the availability of high-quality HD maps and accurate vehicle trajectories. In scenarios where such detailed prior information is unavailable or noisy, the generation quality would likely degrade. This makes it a powerful tool for "re-simulating" or augmenting existing data, but perhaps less so for generating entirely novel map layouts from scratch.
- Future Impact:
InfiniCube sets a new standard for world generation. Its approach of using a geometric scaffold to guide an appearance model is highly transferable and could influence future work in generative modeling for robotics, virtual reality, and digital twins beyond just autonomous driving.