ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Published: 09/04/2024 (arXiv preprint)

TL;DR Summary

This study introduces `ViewCrafter`, a method that synthesizes high-fidelity novel views from single or sparse images using video diffusion models, overcoming the dependence on dense multi-view captures. It incorporates coarse 3D clues from point-based representations for precise camera pose control, and features an iterative view synthesis strategy with content-adaptive camera trajectory planning that progressively extends the range of generated views, enabling applications such as efficient 3D-GS optimization and scene-level text-to-3D generation.

Abstract

Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of video diffusion model. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailored an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, and scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.


1. Bibliographic Information

1.1. Title

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

1.2. Authors

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian

1.3. Journal/Conference

The paper is a preprint available on arXiv. The official publication venue is not specified in the provided text but is typically a top-tier computer vision conference (e.g., CVPR, ICCV, ECCV) or a prestigious journal in the field. arXiv is a well-respected platform for disseminating research quickly, allowing for early sharing and feedback before formal peer review and publication.

1.4. Publication Year

2024

1.5. Abstract

The paper proposes ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images. It addresses the limitation of traditional neural 3D reconstruction methods that require dense multi-view captures. ViewCrafter leverages the powerful generation capabilities of video diffusion models and the coarse 3D clues provided by point-based representations to generate high-quality video frames with precise camera pose control. To expand the range of novel views, the method introduces an iterative view synthesis strategy coupled with a content-adaptive camera trajectory planning algorithm, which progressively extends the 3D clues and covered areas. The paper highlights various applications, including real-time rendering through efficient 3D-GS optimization using reconstructed 3D points and generated novel views, and scene-level text-to-3D generation. Extensive experiments on diverse datasets demonstrate ViewCrafter's strong generalization capability and superior performance in synthesizing high-fidelity and consistent novel views.

Original Link: https://arxiv.org/abs/2409.02048 (preprint). PDF: https://arxiv.org/pdf/2409.02048v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the synthesis of high-fidelity novel views for generic scenes from limited input, specifically single or sparse images. This task is crucial for applications like immersive experiences in games, mixed reality, and visual effects.

The problem is important because, despite significant advancements in neural 3D reconstruction techniques like NeRF (Neural Radiance Fields) and 3D-GS (3D Gaussian Splatting), their practical applicability is severely restricted by their heavy reliance on dense multi-view observations. In many real-world scenarios, obtaining such dense captures is impractical or impossible, leaving a significant gap for methods that can operate with sparse or single inputs.

Specific challenges or gaps in prior research include:

  • Limited Representation Capabilities: Early regression-based methods were often category-specific (e.g., human faces, indoor scenes) and produced artifacts due to their inability to fully capture complex 3D structures and appearances.

  • Lack of Precise Camera Control: Recent diffusion-based methods, while powerful for generation, often struggle with precise camera pose control. They might rely on high-level pose prompts (like text embeddings) which do not translate to exact 6-Degrees-of-Freedom (6-DoF) camera movements.

  • Inconsistency in Occlusion Regions: Methods utilizing depth-based warping and diffusion-based inpainting often produce inconsistent content or artifacts in occluded areas because they treat view synthesis as per-view inpainting rather than a coherent 3D reconstruction.

  • Scalability for Long View Ranges: Video diffusion models, while capable, face challenges in generating very long videos due to increased memory and computational costs, limiting their ability to synthesize novel views across a large range of perspectives.

    The paper's entry point or innovative idea is to integrate the generative power of video diffusion models with explicit 3D information derived from point cloud representations. This combination aims to leverage the world understanding and plausible content generation of diffusion models while providing the precise 3D control and structural guidance lacking in prior diffusion-only approaches.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • ViewCrafter Framework: It proposes ViewCrafter, a novel view synthesis framework that combines video diffusion models with point cloud priors. This framework is designed for synthesizing high-fidelity novel view sequences of generic scenes from single or sparse images, critically maintaining precise control over camera poses.
  • Iterative View Synthesis with Adaptive Trajectory Planning: It introduces an iterative view synthesis strategy coupled with a content-adaptive camera trajectory planning algorithm. This innovation addresses the challenge of generating long videos and large view ranges by progressively expanding the covered areas of novel views and refining the reconstructed point cloud. This allows for more complete scene generation by adaptively revealing occlusions.
  • Superior Performance and Generalization: The method achieves superior performance on various challenging datasets (Tanks-and-Temples, RealEstate10K, CO3D) in terms of both the quality of synthesized novel views and the accuracy of camera pose control. This demonstrates its strong generalization capability for diverse scenes.
  • Facilitates Downstream Applications: ViewCrafter facilitates several applications beyond novel view synthesis. These include:
    • Efficient 3D-GS Optimization: It enables the efficient optimization of a 3D-GS representation using its results, allowing for real-time rendering for immersive experiences.

    • Scene-Level Text-to-3D Generation: It shows potential for adaptation to scene-level text-to-3D generation, fostering more imaginative content creation by combining with text-to-image diffusion models.

      The key conclusions and findings are that by integrating the generative power of video diffusion models with explicit, albeit coarse, 3D priors from point clouds, it is possible to overcome previous limitations in novel view synthesis. The ViewCrafter method effectively solves the problems of high-fidelity generation, precise camera control, and consistency, even from sparse inputs, and scales to larger view ranges through its iterative approach.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of the following concepts is essential:

  • Diffusion Models: These are generative models that learn to reverse a diffusion process (gradual addition of noise) to generate data. They consist of a forward process (adding noise) and a reverse process (removing noise).

    • Forward Process: Starts with clean data $\pmb{x}_0$ and gradually adds Gaussian noise over $T$ time steps, creating a sequence of noisy data points $\pmb{x}_1, \ldots, \pmb{x}_T$. A key property is that $\pmb{x}_t$ can be directly sampled from $\pmb{x}_0$ using the formula $\pmb{x}_t = \alpha_t \pmb{x}_0 + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, and $\alpha_t, \sigma_t$ are hyperparameters controlling the noise level at time $t$.
    • Reverse Process: A neural network (often a U-Net) is trained to predict the noise $\epsilon$ added at each step, or to directly predict the clean data $\pmb{x}_0$. By iteratively subtracting this predicted noise, the model can generate new data from random noise.
    • Diffusion Loss: The model is optimized to minimize the difference between the predicted noise $\epsilon_\theta(\pmb{x}_t, t)$ and the actual noise $\epsilon$, typically with an L2 loss: $\min_\theta \mathbb{E}[\|\epsilon_\theta(\pmb{x}_t, t) - \epsilon\|_2^2]$.
  • Latent Diffusion Models (LDMs): LDMs mitigate the computational cost of diffusion models by performing the diffusion process in a lower-dimensional latent space instead of the high-dimensional pixel space.

    • Variational Autoencoder (VAE): LDMs use a VAE (specifically its encoder $\mathcal{E}$ and decoder $\mathcal{D}$) to compress images into a latent representation and reconstruct them. The diffusion process happens on these latents.
      • Encoder ($\mathcal{E}$): Maps a high-dimensional image $\pmb{x}$ to a lower-dimensional latent representation $\pmb{z}$.
      • Decoder ($\mathcal{D}$): Maps a latent representation $\pmb{z}$ back to a high-dimensional image $\pmb{x}$.
    • This makes training and inference much faster and more memory-efficient.
  • U-Net Architecture: A neural network architecture commonly used in diffusion models for its effectiveness in image-to-image tasks. It consists of an encoder path (downsampling layers) to capture context and a decoder path (upsampling layers) to enable precise localization, with skip connections between corresponding encoder and decoder layers to preserve fine-grained details. In video diffusion models, U-Nets are often augmented with temporal attention layers to model relationships across frames.

  • CLIP (Contrastive Language-Image Pre-training): A neural network trained on a vast dataset of image-text pairs. It learns to associate images with descriptive text. CLIP consists of an image encoder and a text encoder, which project images and text into a shared embedding space. This allows it to understand the content of an image based on textual descriptions, or vice versa, and is often used in diffusion models to provide high-level semantic conditioning.

  • Point Clouds: A set of data points in a 3D coordinate system. Each point represents a single measurement of a 3D surface, typically storing its 3D coordinates (x, y, z) and sometimes additional attributes like color (RGB). Point clouds are a direct representation of 3D geometry and can be used for explicit camera pose control and rendering. However, they can be sparse, have missing regions (occlusions), and suffer from geometric distortions if derived from limited views.

  • 3D Gaussian Splatting (3D-GS): A recent explicit 3D representation for novel view synthesis that achieves real-time rendering speeds. It represents a scene as a collection of 3D Gaussians, each defined by its 3D position, covariance (shape and orientation), color, and opacity. These Gaussians are projected onto the 2D image plane to render novel views, and their parameters can be efficiently optimized using differentiable rendering.

  • Dense Stereo: A computer vision technique that estimates dense 3D information (like depth maps or point clouds) from multiple images. Unlike sparse stereo, which only finds correspondences for a few distinct features, dense stereo attempts to estimate a depth value for every pixel. Methods like DUSt3R fall into this category, aiming to reconstruct 3D points and camera parameters from even a single or sparse set of images.

  • Weiszfeld Algorithm: An iterative optimization algorithm used to find the geometric median of a set of points. In this paper, it is used to optimize for the unknown focal length $f_0^*$ by minimizing a weighted sum of distances between projected 3D points and their 2D pixel locations.

  • Plücker Coordinates: A representation for lines in 3D space. A line can be uniquely described by a 6-dimensional vector using Plücker coordinates. In some concurrent works, these coordinates are used to condition video diffusion models for camera motion control by encoding the camera's path as a sequence of lines. The paper contrasts its point-cloud-based conditioning with this approach.

3.2. Previous Works

The paper categorizes related work into three main areas:

Regression-based Novel View Synthesis

These methods aim to train a feed-forward neural network to directly generate novel views from sparse or single input images.

  • Approach: Often use CNN (Convolutional Neural Network) or Transformer architectures to implicitly or explicitly build a 3D representation from input images. Examples include generating tri-plane representations [24, 25] for specific modalities (like human faces) or generic objects (LRM [26]), multi-plane images [7, 8, 9], or NeRF representations (PixelNeRF [10]).
  • Recent Developments: Inspired by 3D-GS, methods like PixelSplat [27] and MVSplat [28] train models to regress 3D Gaussian representations. Other works combine monocular depth estimation with image inpainting [3, 4, 5, 6].
  • Limitations: Generally limited to category-specific domains (objects, indoor scenes) and suffer from artifacts due to their constrained representation capabilities and difficulty generalizing to diverse, generic scenes.

Diffusion-based Novel View Synthesis

Leverage the power of diffusion models for novel view generation.

  • Optimization-based: Some approaches [32, 33, 34] optimize a 3D representation (e.g., NeRF) under the supervision of text-to-image (T2I) diffusion models.
    • Limitation: Require scene-specific optimization, hindering generalization.
  • Generalized Frameworks: GeNVS [35] trains a 3D feature-conditioned diffusion model on large-scale datasets. Zero-1-to-3 [11] and ZeroNVS [12] use camera pose-conditioned diffusion models trained on synthetic or mixed datasets for zero-shot novel view synthesis from a single image.
    • Limitations: Often category-specific (GeNVS), restricted to toy-level objects with simple backgrounds (Zero-1-to-3), or struggle with consistency and precise pose control (ZeroNVS treats camera pose as high-level text embeddings).
  • Other Diffusion-based Approaches: Reconfusion [41] uses PixelNeRF feature-conditioned diffusion for better pose control but lacks consistent novel views and requires multiple input images. Methods like LucidDreamer [14] and [15, 42] use depth-based warping followed by T2I diffusion for inpainting missing regions.
    • Limitations: Often produce artifacts and unrealistic content in inpainted regions due to inconsistent warping and per-view inpainting.

Conditional Video Diffusion Models

Focus on guiding video generation with various conditioning signals.

  • Image-to-Video (I2V): Models like DynamiCrafter [18] (used in ViewCrafter), AnimateDiff [53], and SVD [17] generate videos from a single image or text, but often lack precise 3D camera control.
  • Explicit Control Signals:
    • ControlNet [44], T2I-adapter [45], GLIGEN [46] for T2I generation.
    • RGB images [17, 18, 47], depth [48, 49], trajectory [50, 51], semantic maps [52] for video generation.
    • Camera Motion Control: AnimateDiff [53] and SVD [17] use class-conditioned video generation. MotionCtrl [13] uses camera extrinsic matrices as conditioning. CamCo [57] and CameraCtrl [58] introduce Plücker coordinates [59].
    • Limitations: Many rely on 1D numeric values or high-level embeddings for camera control, leading to imprecise control in complex real-world scenarios. MultiDiff [55] uses depth-based warping and trains on class-specific datasets, limiting generalization. These methods struggle with complicated mappings from numeric parameters to consistent videos or lack generalization to generic scenes.

3.3. Technological Evolution

The field of 3D vision and novel view synthesis has evolved from geometric-based methods to neural representations, and more recently, to leveraging generative models like diffusion models.

  1. Early Geometric Methods: Relied on explicit 3D models (e.g., meshes, point clouds) reconstructed from multiple views, often struggling with incompleteness or difficult scenes.
  2. Neural 3D Reconstruction (e.g., NeRF, MPI): Revolutionized view synthesis by representing scenes implicitly (e.g., continuous volumetric function) or explicitly (e.g., multi-plane images). While achieving high fidelity, these methods typically require many input views and extensive optimization per scene.
  3. Generalizable Neural Methods: Attempts to train models that can generalize to new scenes from sparse inputs, often by learning to predict NeRF or MPI parameters (PixelNeRF). These still face limitations in quality and generalization.
  4. Diffusion Models for Image Synthesis: The emergence of powerful T2I diffusion models opened new avenues for generating photorealistic content.
  5. Diffusion Models for Novel View Synthesis: Initial attempts involved using T2I models for inpainting warped images or fine-tuning diffusion models on 3D datasets. Challenges arose in maintaining 3D consistency and precise camera control.
  6. Conditional Video Diffusion Models: Extending diffusion to video, enabling generation of consistent sequences, but camera control remained a major hurdle.
  7. ViewCrafter's Position: This paper fits within the latest stage, aiming to bridge the gap between the generative power of video diffusion models (for realism and consistency) and explicit 3D information (for precise camera control and geometric accuracy), thereby pushing towards high-fidelity and controllable novel view synthesis from sparse inputs for generic scenes.

3.4. Differentiation Analysis

Compared to the main methods in related work, ViewCrafter offers several core differences and innovations:

  • Integration of Video Diffusion and Explicit 3D Prior:

    • Unlike Regression-based Methods: ViewCrafter moves beyond limited, category-specific regression models by utilizing a powerful video diffusion model, which inherently has a richer understanding of the world due to training on web-scale video datasets. This allows for synthesis of generic scenes, not just specific object categories or indoor environments.
    • Unlike Pure Diffusion-based NVS (e.g., ZeroNVS, MotionCtrl): These methods often rely on high-level embeddings (text, 1D numeric values) for camera pose conditioning, leading to imprecise control and inconsistent views. ViewCrafter explicitly incorporates point cloud renders as a conditional signal. This provides a direct, coarse 3D geometric prior, enabling precise 6-DoF camera pose control and improving 3D consistency across generated views.
    • Unlike Warping + Inpainting Methods (e.g., LucidDreamer): These methods suffer from artifacts in occlusion regions and inconsistencies due to per-view inpainting. ViewCrafter's video diffusion approach, conditioned on 3D geometry, generates temporally consistent video sequences, implicitly handling occlusions more coherently rather than relying on an independent inpainting step.
  • Iterative View Synthesis and Content-Adaptive Trajectory Planning:

    • Addressing Long Video Generation Limitations: Traditional video diffusion models struggle with memory and computational costs for very long videos. ViewCrafter's iterative strategy breaks down the problem, extending the view range progressively.
    • Adaptive NBV Planning: Unlike methods that use predefined camera trajectories (which may overlook scene-specific geometry and occlusions), ViewCrafter introduces a Next-Best-View (NBV) based camera trajectory planning algorithm. This algorithm adaptively selects camera poses that effectively reveal occlusions and complete the point cloud, leading to more comprehensive scene generation tailored to the specific geometry.
  • Robustness to Imperfect Priors: The ablation studies demonstrate ViewCrafter's robustness to imperfect point cloud conditions (e.g., occlusions, missing regions, geometric distortions). This is a significant advantage over methods that might be more sensitive to the quality of their 3D priors.

  • Efficient Downstream Applications: The framework naturally extends to optimizing 3D-GS models for real-time rendering and facilitating text-to-3D generation, showcasing broader utility beyond just novel view synthesis.

    In essence, ViewCrafter's innovation lies in its principled fusion of explicit 3D geometric guidance with the powerful generative capabilities of modern video diffusion models, addressing key limitations of prior works in fidelity, consistency, and precise control for novel view synthesis from sparse data.

4. Methodology

4.1. Principles

The core idea behind ViewCrafter is to combine the powerful generative capabilities of video diffusion models with explicit, coarse 3D geometric information provided by point cloud representations. The intuition is that while video diffusion models excel at producing plausible and high-quality visual content, they often lack precise control over 3D camera poses and struggle with geometric consistency. Conversely, point clouds offer direct 3D structural cues that can guide camera movements accurately, but they are often sparse, incomplete, and visually unappealing when rendered directly. ViewCrafter leverages the strengths of both: using point cloud renders to condition a video diffusion model, thereby enabling the generation of high-fidelity, geometrically consistent novel views with precise camera control. The method further extends its capabilities by iteratively building up the 3D scene and synthesizing views across a larger range through an adaptive camera trajectory planning strategy.

4.2. Core Methodology In-depth

The ViewCrafter method consists of several interconnected components: point cloud reconstruction, a point-conditioned video diffusion model, and an iterative view synthesis strategy with camera trajectory planning.

4.2.1. Preliminary: Video Diffusion Models

ViewCrafter builds upon the framework of video diffusion models, particularly Latent Diffusion Models (LDMs), to reduce computational cost.

Diffusion Process: A diffusion model operates through a forward process and a reverse process.

  • Forward Process ($q$): This process gradually adds Gaussian noise to clean data $\pmb{x}_0 \sim q_0(\pmb{x}_0)$ across time steps $t$. The noisy data $\pmb{x}_t$ at any time step $t$ can be directly sampled from $\pmb{x}_0$ and Gaussian noise $\epsilon$:
    $$\pmb{x}_t = \alpha_t \pmb{x}_0 + \sigma_t \epsilon$$
    where $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise, and $\alpha_t$ and $\sigma_t$ are hyperparameters satisfying $\alpha_t^2 + \sigma_t^2 = 1$ that control the noise schedule.

  • Reverse Process ($p_\theta$): This is the generative process in which the model learns to denoise the data. A neural network, typically a U-Net parameterized by $\theta$, acts as a noise predictor $\epsilon_\theta$. It is trained to estimate the noise $\epsilon$ that was added to $\pmb{x}_0$ to obtain $\pmb{x}_t$. The training objective is:
    $$\min_\theta \mathbb{E}_{t \sim \mathcal{U}(0, 1), \epsilon \sim \mathcal{N}(\mathbf{0}, I)} \left[ \| \epsilon_\theta ( \pmb{x}_t, t ) - \epsilon \|_2^2 \right]$$
    where $\mathcal{U}(0, 1)$ denotes the uniform distribution from which time steps $t$ are sampled, and $\mathbb{E}$ denotes the expectation over the sampled time steps and noise. (A minimal code sketch of this objective appears at the end of this subsection.)

Latent Diffusion Models (LDMs): To handle high-resolution video data and reduce computational complexity, LDMs are employed.

  • Encoding: Video data $\pmb{x} \in \mathbb{R}^{L \times 3 \times H \times W}$ (where $L$ is the video length, 3 is the number of RGB channels, and $H$, $W$ are the height and width) are first encoded into a lower-dimensional latent space using a pre-trained VAE encoder $\mathcal{E}$ frame by frame:
    $$\pmb{z} = \mathcal{E}(\pmb{x}), \quad \pmb{z} \in \mathbb{R}^{L \times C \times h \times w}$$
    Here, $C$ is the number of channels in the latent space, and $h, w$ are the latent height and width, which are smaller than $H, W$.
  • Diffusion in Latent Space: Both the forward and reverse processes are then performed in this latent space.
  • Decoding: After the reverse process generates clean latent representations $\hat{\pmb{z}}$, they are transformed back into pixel space using the VAE decoder $\mathcal{D}$ to produce the final generated videos:
    $$\hat{\pmb{x}} = \mathcal{D}(\hat{\pmb{z}})$$
    The paper builds its model on DynamiCrafter [18], an open-source image-to-video (I2V) diffusion model capable of generating dynamic videos from a single input image.
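To make the preliminaries concrete, here is a minimal PyTorch-style sketch of the forward noising step and the latent diffusion training loss described above. The `denoiser` callable and the `alphas`/`sigmas` noise-schedule tensors are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def forward_diffuse(z0, t, alphas, sigmas):
    """Forward process: sample z_t = alpha_t * z_0 + sigma_t * eps."""
    eps = torch.randn_like(z0)
    a = alphas[t].view(-1, 1, 1, 1, 1)   # broadcast over the (L, C, h, w) dims of each video latent
    s = sigmas[t].view(-1, 1, 1, 1, 1)
    return a * z0 + s * eps, eps

def diffusion_loss(denoiser, z0, alphas, sigmas, num_timesteps=1000):
    """L2 loss between predicted and true noise for a batch of video latents
    z0 of shape (B, L, C, h, w), following the objective above."""
    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    zt, eps = forward_diffuse(z0, t, alphas, sigmas)
    eps_pred = denoiser(zt, t)           # U-Net noise prediction
    return F.mse_loss(eps_pred, eps)
```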

4.2.2. Point Cloud Reconstruction from Single or Sparse Images

To provide accurate pose control, ViewCrafter first establishes a point cloud representation of the scene.

  • Dense Stereo Model: A dense stereo model, such as DUSt3R [19], is used. It takes a pair of RGB images $\mathbf{I}^0, \mathbf{I}^1 \in \mathbb{R}^{H \times W \times 3}$ as input.
  • Output: It generates corresponding point maps $\mathbf{O}^{0,0}, \mathbf{O}^{1,0} \in \mathbb{R}^{H \times W \times 3}$ and confidence maps $\mathbf{D}^{0,0}, \mathbf{D}^{1,0} \in \mathbb{R}^{H \times W}$.
    • The point maps (e.g., $\mathbf{O}^{0,0}$) provide the 3D coordinates (expressed in the camera coordinate system of the anchor view $\mathbf{I}^0$) for each pixel.
    • The confidence maps (e.g., $\mathbf{D}^{0,0}$) indicate the reliability of these 3D point estimates.
  • Camera Intrinsic Parameter Estimation: The model assumes the principal point is centered and pixels are square, so only the focal length $f_0^*$ is unknown. It is solved with a few optimization steps of the Weiszfeld algorithm [61]:
    $$f_0^* = \underset{f_0}{\arg\min} \sum_{u=0}^{W} \sum_{v=0}^{H} \mathbf{D}_{u,v}^{0,0} \left\| (u', v') - f_0 \frac{(\mathbf{O}_{u,v,0}^{0,0}, \mathbf{O}_{u,v,1}^{0,0})}{\mathbf{O}_{u,v,2}^{0,0}} \right\|$$
    where $u' = u - \frac{W}{2}$ and $v' = v - \frac{H}{2}$ are the pixel coordinates relative to the image center. The terms $\mathbf{O}_{u,v,0}^{0,0}, \mathbf{O}_{u,v,1}^{0,0}, \mathbf{O}_{u,v,2}^{0,0}$ are the x, y, z coordinates of the 3D point corresponding to pixel (u, v) in the point map, and the focal length $f_0$ scales the projected 3D coordinates to the image plane. The sum is weighted by the confidence map $\mathbf{D}_{u,v}^{0,0}$. (A small code sketch of this estimation follows this list.)
  • Handling Sparse Inputs:
    • If only a single input image is available, it is duplicated to create a pair for DUSt3R.
    • For more than two images, global point map alignment can be performed.
  • Colored Point Cloud: The reconstructed point maps are integrated with their corresponding RGB images to obtain a colored point cloud, which serves as the coarse 3D information.
  • Limitations of Raw Point Cloud: Due to sparse inputs, the reconstructed point cloud can have significant missing regions, occlusions, and artifacts, leading to low-quality direct renders. This necessitates the integration with video diffusion models.
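The focal-length estimation above can be written as Weiszfeld-style iteratively reweighted least squares over the point map and confidence map. The array layout, the initialization, and the iteration count in this sketch are illustrative assumptions, not DUSt3R's actual implementation.

```python
import numpy as np

def estimate_focal(point_map, conf, iters=10, eps=1e-8):
    """Weiszfeld-style estimate of the focal length f0 from a point map.

    point_map: (H, W, 3) 3D points in the anchor camera frame.
    conf:      (H, W)    per-pixel confidence weights.
    """
    H, W, _ = point_map.shape
    u, v = np.meshgrid(np.arange(W) - W / 2.0, np.arange(H) - H / 2.0)
    q = np.stack([u, v], axis=-1).reshape(-1, 2)                   # centered pixel coordinates
    p = (point_map[..., :2] / (point_map[..., 2:3] + eps)).reshape(-1, 2)  # (x/z, y/z)
    c = conf.reshape(-1)

    # Rough initialization: median ratio of pixel radius to projected-point radius.
    f = np.median((np.linalg.norm(q, axis=1) + eps) /
                  (np.linalg.norm(p, axis=1) + eps))
    for _ in range(iters):
        r = np.linalg.norm(q - f * p, axis=1) + eps                # residual norms
        w = c / r                                                  # IRLS weights
        f = np.sum(w * np.sum(p * q, axis=1)) / np.sum(w * np.sum(p * p, axis=1))
    return f
```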

4.2.3. Rendering High-fidelity Novel Views with Video Diffusion Models

The core of ViewCrafter involves a point-conditioned video diffusion model that enhances the imperfect point cloud renders.

  • Workflow (referencing Fig. 1):

     1. Initial 3D Information: Given a single reference image $\mathbf{I}^{\mathrm{ref}}$ (or sparse images), its point cloud, camera intrinsics, and camera pose $\mathbf{C}^{\mathrm{ref}}$ are obtained using the dense stereo model (DUSt3R).
     2. Camera Path Generation: A camera pose sequence $\mathbf{C} = \{ \mathbf{C}^0, \ldots, \mathbf{C}^{L-1} \}$ is planned, ensuring it contains $\mathbf{C}^{\mathrm{ref}}$.
     3. Point Cloud Rendering: The current point cloud is rendered from each pose in $\mathbf{C}$ to obtain a sequence of point cloud renders $\mathbf{P} = \{ \mathbf{P}^0, \ldots, \mathbf{P}^{L-1} \}$. These renders capture the correct view relationships but suffer from visual imperfections.
     4. High-fidelity Novel View Generation: The goal is to learn a conditional distribution $p(\pmb{x} \mid \mathbf{I}^{\mathrm{ref}}, \mathbf{P})$ that produces high-quality novel views $\pmb{x} = \{ \pmb{x}^0, \ldots, \pmb{x}^{L-1} \}$ from the point cloud renders $\mathbf{P}$ and the reference image(s) $\mathbf{I}^{\mathrm{ref}}$. This is modeled as the reverse process of a point-conditioned video diffusion model:
        $$\pmb{x} \sim \hat{p}_\theta(\pmb{x} \mid \mathbf{I}^{\mathrm{ref}}, \mathbf{P})$$
        where $\theta$ represents the model parameters.
  • Model Architecture (illustrated in Fig. 1): The model inherits the LDM [60] architecture:

    • VAE Encoder ($\mathcal{E}$) and Decoder ($\mathcal{D}$): Used for image compression and decompression; their parameters are frozen.
    • Video Denoising U-Net: The core component for noise estimation. It contains spatial layers (2D convolutions, spatial attention) and temporal layers (temporal attention) to capture both image content and temporal coherence.
    • CLIP Image Encoder: Used to understand the content of the reference image(s) $\mathbf{I}^{\mathrm{ref}}$ and provide high-level semantic conditioning.
  • Incorporating Conditional Signals:

    • Point Cloud Renders ($\mathbf{P}$): These are encoded by the VAE encoder $\mathcal{E}$ into latent representations, which are then concatenated channel-wise with the noisy latent video $\pmb{z}_t$ that the U-Net is denoising. This provides explicit geometric guidance at each time step.
    • Reference Image(s) ($\mathbf{I}^{\mathrm{ref}}$): The CLIP image encoder processes $\mathbf{I}^{\mathrm{ref}}$ to produce embeddings, which modulate the U-Net features through cross-attention, guiding the generation with the appearance and semantics of the input.
  • Training Process:

    • Training Data: Paired training data is created, consisting of point cloud renders $\mathbf{P} = \{ \mathbf{P}^0, \ldots, \mathbf{P}^{L-1} \}$ and corresponding ground-truth images $\mathbf{I} = \{ \mathbf{I}^0, \ldots, \mathbf{I}^{L-1} \}$. A key aspect is that the point cloud renders are forced to include at least one ground-truth view at a random location among the $L$ frames. This helps the model learn to transfer fine details from the reference images and handle arbitrary numbers of reference images.
    • Latent Space Training: The VAE parameters are frozen, and training occurs in the latent space. Ground-truth images $\mathbf{I}$ are encoded to latents $\pmb{z}$, and point cloud renders $\mathbf{P}$ (with reference view renders replaced by the actual reference images) are encoded to latent condition signals $\hat{\pmb{z}}$.
    • Diffusion Loss: The video denoising U-Net is optimized with the diffusion loss:
      $$\min_\theta \mathbb{E}_{t \sim \mathcal{U}(0, 1), \epsilon \sim \mathcal{N}(\mathbf{0}, I)} \left[ \| \epsilon_\theta ( \pmb{z}_t, t, \hat{\pmb{z}}, \mathbf{I}^{\mathrm{ref}} ) - \epsilon \|_2^2 \right]$$
      Here, $\pmb{z}_t = \alpha_t \pmb{z}_0 + \sigma_t \epsilon$ is the noisy latent video. The noise predictor $\epsilon_\theta$ takes the noisy latent $\pmb{z}_t$, the time step $t$, the conditional latent $\hat{\pmb{z}}$ (from point cloud renders/reference images), and the CLIP embedding of $\mathbf{I}^{\mathrm{ref}}$ as input.
  • Inference Process:

     1. Render Point Cloud Sequence: A sequence of point cloud renders $\mathbf{P} = \{ \mathbf{P}^0, \ldots, \mathbf{P}^{L-1} \}$ is generated along the desired camera pose sequence. The render result for the reference view(s) is replaced with the actual $\mathbf{I}^{\mathrm{ref}}$.
     2. Encode Conditions: These point cloud renders (with the replaced reference view) are encoded into latent space to obtain $\hat{\pmb{z}}$, and the CLIP encoder processes $\mathbf{I}^{\mathrm{ref}}$.
     3. Denoising: Starting from sampled noise $\epsilon \sim \mathcal{N}(\mathbf{0}, I)$, the trained U-Net iteratively denoises the latent representation, conditioned on $\hat{\pmb{z}}$ and $\mathbf{I}^{\mathrm{ref}}$ (via its CLIP embedding).
     4. Decode Novel Views: The final clean latents are decoded by the VAE decoder $\mathcal{D}$ into high-quality novel views $\pmb{x} = \{ \pmb{x}^0, \ldots, \pmb{x}^{L-1} \}$. (This procedure is summarized in the sketch below.)
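The inference procedure can be summarized in pseudocode. The sketch below assumes hypothetical stand-ins for the denoiser, VAE, CLIP encoder, point cloud renderer, and sampling scheduler; it only illustrates how the point cloud renders enter the model by channel-wise concatenation and how the reference image enters via cross-attention, and is not the released ViewCrafter code.

```python
import torch

@torch.no_grad()
def viewcrafter_inference(denoiser, vae, clip_encode, render_fn, scheduler,
                          point_cloud, poses, ref_image, ref_index):
    """Sketch of the point-conditioned inference loop (Sec. 4.2.3).
    All callables are hypothetical placeholders, not the authors' API."""
    # 1. Render the (imperfect) point cloud along the planned camera trajectory.
    renders = torch.stack([render_fn(point_cloud, c) for c in poses])   # (L, 3, H, W)
    renders[ref_index] = ref_image       # replace the reference view's render with the real image

    # 2. Encode the conditions: latent condition z_hat and a CLIP image embedding.
    z_hat = vae.encode(renders)          # (L, C, h, w)
    clip_emb = clip_encode(ref_image)

    # 3. Iteratively denoise from Gaussian noise; z_hat is concatenated channel-wise,
    #    while the CLIP embedding enters through cross-attention inside the U-Net.
    z = torch.randn_like(z_hat)
    for t in scheduler.timesteps:
        eps = denoiser(torch.cat([z, z_hat], dim=1), t, context=clip_emb)
        z = scheduler.step(eps, t, z)

    # 4. Decode the clean latents into the novel-view frames.
    return vae.decode(z)                 # (L, 3, H, W)
```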

4.2.4. Iterative View Synthesis and Camera Trajectory Planning

To overcome the limitation of video diffusion models in generating very long videos and to extend the view range, ViewCrafter employs an iterative strategy combined with adaptive camera trajectory planning.

  • Iterative Process: The core idea is to progressively extend the reconstructed point cloud and the areas covered by novel views.

     1. Initial Point Cloud: Start with an initial point cloud $\mathcal{P}_{\mathrm{ref}}$ reconstructed from the input reference image(s).
     2. Camera Navigation: Navigate the camera from one of the reference views ($\mathcal{C}_{\mathrm{ref}}$) to a target camera pose ($\mathcal{C}_{\mathrm{nbv}}$) chosen to reveal occlusions and missing regions in the current point cloud ($\mathcal{P}_{\mathrm{curr}}$).
     3. Novel View Generation: Use ViewCrafter (the point-conditioned video diffusion model) to generate a sequence of high-fidelity novel views along the planned camera path.
     4. Point Cloud Update: Back-project and align the generated novel views to update and complete the current point cloud $\mathcal{P}_{\mathrm{curr}}$.
     5. Iteration: Repeat steps 2-4 until a predefined limit on predicted poses ($N$) is reached.
  • Content-Adaptive Camera Trajectory Planning (Algorithm 1): This algorithm aims to adaptively choose the next best camera pose (NBV) to effectively reveal occlusions, rather than using a fixed trajectory.

    Algorithm 1: Iterative View Synthesis

    Input:

    • Reference image(s) $\mathcal{T}_{\mathrm{ref}}$
    • Dense stereo model $\mathcal{D}(\cdot)$
    • Point-conditioned video diffusion model $\mathcal{V}(\cdot)$
    • Initial point cloud $\mathcal{P}_{\mathrm{ref}}$
    • Searching space $S$
    • Initial pose $\mathcal{C}_{\mathrm{ref}}$
    • Maximum number of predicted poses $N$
    • Number of candidate poses $K$
    • Utility function $\mathcal{F}(\cdot)$

    Procedure:

    1: Initialize the current point cloud $\mathcal{P}_{\mathrm{curr}} \leftarrow \mathcal{P}_{\mathrm{ref}}$, the current camera pose $\mathcal{C}_{\mathrm{curr}} \leftarrow \mathcal{C}_{\mathrm{ref}}$, and $step \leftarrow 0$
    2: while $step \leq N$ do
    3:   Uniformly sample $K$ candidate poses $\mathcal{C}_{\mathrm{can}} = \{ \mathcal{C}_{\mathrm{can}}^1, \ldots, \mathcal{C}_{\mathrm{can}}^K \}$ from the searching space $S$ around the current pose $\mathcal{C}_{\mathrm{curr}}$; initialize the candidate mask set $\mathcal{M}_{\mathrm{can}} = \{\}$
    4:   for $\mathcal{C}$ in $\{ \mathcal{C}_{\mathrm{can}}^1, \ldots, \mathcal{C}_{\mathrm{can}}^K \}$ do
    5:     $\mathcal{M}_{\mathcal{C}} = \mathrm{Render}(\mathcal{P}_{\mathrm{curr}}, \mathcal{C})$  (render a binary mask where 1 indicates occluded/missing regions and 0 indicates regions already covered by $\mathcal{P}_{\mathrm{curr}}$ at pose $\mathcal{C}$)
    6:     $\mathcal{M}_{\mathrm{can}}.\mathrm{append}(\mathcal{M}_{\mathcal{C}})$
    7:   end for
    8:   $\mathcal{C}_{\mathrm{nbv}} = \underset{\mathcal{C} \in \mathcal{C}_{\mathrm{can}}}{\arg\max}\, \mathcal{F}(\mathcal{C})$  (select the next best view using the utility function)
    9:   $\mathcal{T}_{\mathrm{nbv}} = \mathcal{V}(\mathrm{interpolate}(\mathcal{C}_{\mathrm{curr}}, \mathcal{C}_{\mathrm{nbv}}), \mathcal{P}_{\mathrm{curr}})$  (generate novel views with ViewCrafter along an interpolated path from $\mathcal{C}_{\mathrm{curr}}$ to $\mathcal{C}_{\mathrm{nbv}}$, conditioned on $\mathcal{P}_{\mathrm{curr}}$)
    10:  $\mathcal{P}_{\mathrm{curr}} \leftarrow \mathcal{D}(\mathcal{T}_{\mathrm{nbv}}, \mathcal{P}_{\mathrm{curr}})$  (update the current point cloud by back-projecting and aligning the generated novel views)
    11:  $\mathcal{C}_{\mathrm{curr}} \leftarrow \mathcal{C}_{\mathrm{nbv}}$  (update the current camera pose)
    12:  $step \leftarrow step + 1$
    13: end while
    14: return

    • Searching Space ($S$): A forward-facing quarter-sphere with evenly distributed camera poses, centered at the origin of the point cloud's world coordinate system, with its radius set to the depth of the center pixel in the reference image.

    • Candidate Masks ($\mathcal{M}_{\mathrm{can}}$): For each candidate pose $\mathcal{C}$, a binary mask $\mathcal{M}_{\mathcal{C}}$ is rendered from the current point cloud $\mathcal{P}_{\mathrm{curr}}$. A value of 1 in the mask indicates occluded or missing regions, while 0 indicates already filled regions.

    • Utility Function ($\mathcal{F}(\cdot)$): This function determines the optimal camera pose $\mathcal{C}_{\mathrm{nbv}}$ for the next step by evaluating how effectively a candidate pose reveals new information. It balances revealing occlusions against opening overly large holes that would be difficult for ViewCrafter to fill plausibly:
      $$\mathcal{F}(\mathcal{C}) = \begin{cases} \dfrac{\mathrm{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H}, & \dfrac{\mathrm{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H} < \Theta \\[2ex] 1 - \dfrac{\mathrm{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H}, & \dfrac{\mathrm{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H} \geq \Theta \end{cases}$$
      where $\mathcal{C} \in \mathcal{C}_{\mathrm{can}}$ is a candidate camera pose, $\mathcal{M}_{\mathcal{C}} \in \mathcal{M}_{\mathrm{can}}$ is its corresponding mask, and $\mathrm{sum}(\mathcal{M}_{\mathcal{C}}) = \sum_{u=0}^{W} \sum_{v=0}^{H} \mathcal{M}_{\mathcal{C}}(u, v)$ counts the occluded/missing pixels; $W \times H$ is the total number of pixels. The function favors poses that reveal a moderate amount of new occluded region; if a pose reveals too much (i.e., $\frac{\mathrm{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H} \geq \Theta$), indicating a view too far from the existing reconstruction, its utility drops to $1 - \frac{\mathrm{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H}$. $\Theta$ is a threshold managing this trade-off. (A small code sketch of this selection follows below.)
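As a concrete illustration of the next-best-view selection above, the following minimal sketch scores candidate poses with the utility function and picks the argmax. `render_mask` is a hypothetical helper returning a binary occlusion mask from the current point cloud, and the default threshold value is an assumed placeholder since the paper's $\Theta$ is not given in this text.

```python
import numpy as np

def utility(mask, theta=0.6):
    """Utility of a candidate pose: fraction of occluded/missing pixels,
    penalized once the ratio exceeds the threshold theta (the F(C) above)."""
    ratio = mask.sum() / mask.size
    return ratio if ratio < theta else 1.0 - ratio

def next_best_view(point_cloud, candidate_poses, render_mask, theta=0.6):
    """Pick the candidate pose that reveals a moderate amount of new content."""
    scores = [utility(render_mask(point_cloud, c), theta) for c in candidate_poses]
    return candidate_poses[int(np.argmax(scores))]
```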

4.2.5. Applications

Beyond novel view synthesis, ViewCrafter facilitates two key applications:

  • Efficient 3D-GS Optimization:

    • Challenge: Directly optimizing 3D-GS from an incomplete initial point cloud using multiple ViewCrafter runs at different times can lead to inconsistencies.
    • Solution (referencing Fig. 2): The iterative view synthesis strategy is used to iteratively complete the initial point cloud and synthesize novel views. This provides consistent novel views as training data and a strong geometry initialization for the 3D-GS model.
    • Optimization: The centers of the 3D Gaussians are initialized from the completed dense point cloud, and their attributes (covariance, color, opacity) are optimized under the supervision of the synthesized novel views. The process is simplified by dropping 3D-GS tricks such as densification, splitting, and opacity reset, and runs for only 2,000 iterations, making it significantly faster than standard 3D-GS training (see the sketch after this list).
  • Scene-Level Text-to-3D Generation: ViewCrafter can be combined with text-to-image (T2I) diffusion models to enable text-to-3D generation.

    1. Image Generation: Given a text prompt, a T2I model generates a corresponding reference image.
    2. Novel View Synthesis and Reconstruction: ViewCrafter then uses this generated reference image to synthesize consistent novel views and reconstruct the 3D scene (e.g., as a point cloud or subsequently optimize a 3D-GS representation).
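The simplified 3D-GS optimization described above can be sketched as a plain fitting loop. This is a minimal sketch under stated assumptions: `render_gs` stands in for a differentiable Gaussian rasterizer, the loss is reduced to a single L1 term, and parameterization details (scales, quaternions, opacities) are illustrative rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def optimize_3dgs(points, colors, train_views, train_poses, render_gs,
                  iters=2000, lr=1e-2):
    """Simplified 3D-GS fitting in the spirit of Sec. 4.2.5: centers initialized
    from the completed point cloud, attributes optimized against the synthesized
    novel views, no densification/splitting/opacity reset."""
    n = points.shape[0]
    centers = points.clone().requires_grad_(True)
    log_scales = torch.zeros(n, 3, requires_grad=True)                 # per-axis scale (log-space)
    quats = torch.tensor([[1.0, 0.0, 0.0, 0.0]]).repeat(n, 1).requires_grad_(True)
    opacities = torch.zeros(n, 1, requires_grad=True)                  # pre-sigmoid opacity
    rgb = colors.clone().requires_grad_(True)

    opt = torch.optim.Adam([centers, log_scales, quats, opacities, rgb], lr=lr)
    for it in range(iters):
        i = it % len(train_views)                                      # cycle over supervision views
        pred = render_gs(centers, log_scales.exp(), quats,
                         opacities.sigmoid(), rgb, train_poses[i])
        loss = F.l1_loss(pred, train_views[i])                         # photometric supervision
        opt.zero_grad()
        loss.backward()
        opt.step()
    return centers, log_scales, quats, opacities, rgb
```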

5. Experimental Setup

5.1. Datasets

The experiments for zero-shot novel view synthesis and sparse view 3D-GS reconstruction were conducted on diverse real-world datasets:

  • CO3D [39]:
    • Characteristics: Object-centric scenes.
    • Usage: 10 scenes from this dataset were used for evaluation of zero-shot novel view synthesis.
  • RealEstate10K [7]:
    • Characteristics: Video clips of indoor scenes.
    • Usage: 10 scenes from its test set were adopted for evaluation of zero-shot novel view synthesis.
  • Tanks-and-Temples [21]:
    • Characteristics: Contains large-scale outdoor and indoor scenes, often used for 3D reconstruction benchmarks.
    • Usage: All 9 scenes were used for zero-shot novel view synthesis evaluation. For scene reconstruction, 6 scenes were used.
  • DL3DV [70]:
    • Characteristics: A large-scale dataset for learning-based 3D vision.

    • Usage: Used as part of the mixed training dataset for ViewCrafter.

Test Set Creation: For the CO3D, RealEstate10K, and Tanks-and-Temples benchmarks, frames were extracted from the original captured videos to create two types of test sets:

  • Easy Test Set: Generated using a small frame sampling stride, featuring slow camera motions and limited view ranges.
  • Hard Test Set: Produced with a large sampling stride, characterized by rapid camera motions and large view ranges.

Training Data Generation:

  • The model was trained on a mixed dataset combining DL3DV [70] and RealEstate10K [7].
  • Video data was divided into clips of 25 frames.
  • DUSt3R [19] was used to obtain camera trajectories and globally aligned point clouds for each video frame.
  • Point clouds of video frames were randomly selected and rendered along the estimated camera trajectories using PyTorch3D [71] to generate the conditional signals (point cloud renders); a rendering sketch follows this list.
  • In total, 632,152 video pairs were generated as training data.
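The point cloud rendering step above can be approximated with PyTorch3D's point rasterizer. The snippet below is a sketch only: the camera convention, focal-length handling, point radius, and compositing settings are assumptions and almost certainly differ from the authors' data-generation pipeline.

```python
import torch
from pytorch3d.structures import Pointclouds
from pytorch3d.renderer import (PerspectiveCameras, PointsRasterizationSettings,
                                PointsRasterizer, PointsRenderer, AlphaCompositor)

def render_point_cloud(xyz, rgb, R, T, focal, image_size=(576, 1024), device="cuda"):
    """Render a colored point cloud from a given camera pose with PyTorch3D.
    xyz: (N, 3) points, rgb: (N, 3) colors in [0, 1], R: (3, 3), T: (3,)."""
    cloud = Pointclouds(points=[xyz.to(device)], features=[rgb.to(device)])
    cameras = PerspectiveCameras(focal_length=((focal, focal),),
                                 R=R[None].to(device), T=T[None].to(device),
                                 device=device)
    raster_settings = PointsRasterizationSettings(image_size=image_size,
                                                  radius=0.01, points_per_pixel=8)
    renderer = PointsRenderer(
        rasterizer=PointsRasterizer(cameras=cameras, raster_settings=raster_settings),
        compositor=AlphaCompositor())
    return renderer(cloud)[0]            # (H, W, 3) rendered conditional image
```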

5.2. Evaluation Metrics

The paper employs a comprehensive set of metrics to evaluate both image quality and pose accuracy.

Image Quality Metrics:

  • Peak Signal-to-Noise Ratio (PSNR):

    • Conceptual Definition: PSNR measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. It's often used to quantify reconstruction quality for images and videos. A higher PSNR value indicates better quality (i.e., less noise or distortion relative to the original signal). It is typically expressed in decibels (dB).
    • Mathematical Formula:
      $$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$$
    • Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
      • $\mathrm{MSE}$: Mean squared error between the original and reconstructed image. For two images $I$ and $K$ of size $m \times n$:
        $$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$$
        • $I(i,j)$: Pixel value at row $i$, column $j$ of the original image.
        • $K(i,j)$: Pixel value at row $i$, column $j$ of the reconstructed image.
        • $m \times n$: Dimensions of the image.
  • Structural Similarity Index Measure (SSIM) [73]:

    • Conceptual Definition: SSIM is a perception-based model that considers image degradation as perceived change in structural information, while also incorporating luminance and contrast changes. It is designed to be more perceptually aligned with the human visual system than PSNR. Values typically range from -1 to 1, where 1 indicates perfect similarity.
    • Mathematical Formula:
      $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
    • Symbol Explanation:
      • $x$, $y$: Corresponding blocks of pixels from the original and reconstructed images.
      • $\mu_x$, $\mu_y$: Means of $x$ and $y$.
      • $\sigma_x^2$, $\sigma_y^2$: Variances of $x$ and $y$.
      • $\sigma_{xy}$: Covariance of $x$ and $y$.
      • $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$: Small constants to avoid division by zero.
      • $L$: Dynamic range of the pixel values (e.g., 255 for 8-bit grayscale images).
      • $K_1, K_2$: Small constants, typically $K_1 = 0.01$, $K_2 = 0.03$.
  • Learned Perceptual Image Patch Similarity (LPIPS) [74]:

    • Conceptual Definition: LPIPS measures the perceptual similarity between two images by comparing their feature activations in a pre-trained deep neural network (e.g., VGG, AlexNet). It is often considered to correlate better with human judgment of image similarity than traditional metrics like PSNR or SSIM, as it captures more abstract, perceptual differences. A lower LPIPS score indicates higher perceptual similarity.
    • Mathematical Formula:
      $$\mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\phi_l(x)_{hw} - \phi_l(x_0)_{hw}) \|_2^2$$
    • Symbol Explanation:
      • $x$: The original image.
      • $x_0$: The reconstructed image.
      • $\phi_l$: Feature stack from layer $l$ of a pre-trained network.
      • $w_l$: A learned per-channel scaling vector for layer $l$.
      • $H_l, W_l$: Height and width of the feature map at layer $l$.
      • $\odot$: Element-wise multiplication.
  • Fréchet Inception Distance (FID) [75]:

    • Conceptual Definition: FID measures the similarity between the feature distributions of generated images and real images. It computes the Fréchet distance (or Wasserstein-2 distance) between two Gaussian distributions fitted to the feature representations (typically from an Inception-v3 network) of real and generated images. A lower FID score indicates higher quality and more realistic generated images. It's particularly useful for evaluating the overall quality and diversity of image generation, especially when dealing with missing and occlusion regions.
    • Mathematical Formula:
      $$\mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}\right)$$
    • Symbol Explanation:
      • $\mu_1$, $\Sigma_1$: Mean and covariance matrix of the feature vectors for real images.
      • $\mu_2$, $\Sigma_2$: Mean and covariance matrix of the feature vectors for generated images.
      • $\mathrm{Tr}(\cdot)$: The trace of a matrix.
      (A small numeric sketch of PSNR and FID follows the metrics list.)
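To ground the formulas above, here is a small NumPy/SciPy sketch of PSNR and of the Fréchet distance between two Gaussian feature statistics (the Inception feature extraction itself is omitted). It follows the equations above rather than any specific library implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def psnr(img, ref, max_val=255.0):
    """PSNR in dB between two images of the same shape."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians fitted to Inception features."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):          # discard tiny numerical imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```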

Pose Accuracy Metrics:

To evaluate the accuracy of camera pose control, poses of generated novel views are estimated and compared to ground truth. DUSt3R [19] is used for robust pose estimation, as COLMAP [76] can be sensitive to inconsistent features in generated images.

  • Rotation Distance ($R_{\mathrm{dist}}$):

    • Conceptual Definition: $R_{\mathrm{dist}}$ quantifies the cumulative angular difference between the rotation matrices of the estimated camera poses and the ground truth camera poses over a sequence of novel views. A smaller value indicates more accurate rotational alignment. The camera coordinates of the estimated poses are transformed to be relative to the first frame, and the translation scale is normalized using the furthest frame, as specified in [58].
    • Mathematical Formula:
      $$R_{\mathrm{dist}} = \sum_{i=1}^{n} \arccos \left( \frac{\mathrm{tr}\left(\mathbf{R}_{\mathrm{gen}}^i (\mathbf{R}_{\mathrm{gt}}^i)^{\mathrm{T}}\right) - 1}{2} \right)$$
    • Symbol Explanation:
      • $n$: Total number of frames in the novel view sequence.
      • $\mathbf{R}_{\mathrm{gen}}^i$: The rotation matrix of the $i$-th generated novel view's camera pose.
      • $\mathbf{R}_{\mathrm{gt}}^i$: The ground truth rotation matrix of the $i$-th camera pose.
      • $(\mathbf{R}_{\mathrm{gt}}^i)^{\mathrm{T}}$: The transpose of the ground truth rotation matrix.
      • $\mathrm{tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
      • $\arccos(\cdot)$: Inverse cosine, which yields the rotation angle between the two rotation matrices.
  • Translation Distance ($T_{\mathrm{dist}}$):

    • Conceptual Definition: $T_{\mathrm{dist}}$ measures the cumulative Euclidean distance between the translation vectors of the estimated camera poses and the ground truth camera poses over a sequence. As with $R_{\mathrm{dist}}$, a smaller value indicates more accurate translational alignment, and the same normalization procedure is applied to the translation scale.
    • Mathematical Formula:
      $$T_{\mathrm{dist}} = \sum_{i=1}^{n} \| \mathbf{T}_{\mathrm{gt}}^i - \mathbf{T}_{\mathrm{gen}}^i \|_2$$
    • Symbol Explanation:
      • $n$: Total number of frames in the novel view sequence.
      • $\mathbf{T}_{\mathrm{gt}}^i$: The ground truth translation vector of the $i$-th camera pose.
      • $\mathbf{T}_{\mathrm{gen}}^i$: The estimated translation vector of the $i$-th generated novel view's camera pose.
      • $\| \cdot \|_2$: The Euclidean (L2) norm of a vector. (Both pose metrics are sketched in code below.)
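The two pose-accuracy metrics translate directly into a few lines of NumPy. This sketch assumes the estimated poses have already been aligned to the first frame and scale-normalized as described above.

```python
import numpy as np

def rotation_distance(R_gen, R_gt):
    """Sum of per-frame geodesic angles between generated and ground-truth rotations.
    R_gen, R_gt: (n, 3, 3) rotation matrices."""
    traces = np.einsum('nij,nij->n', R_gen, R_gt)   # tr(R_gen @ R_gt^T) per frame
    cos = np.clip((traces - 1.0) / 2.0, -1.0, 1.0)  # clamp for numerical safety
    return float(np.sum(np.arccos(cos)))

def translation_distance(T_gen, T_gt):
    """Sum of per-frame Euclidean distances between translation vectors.
    T_gen, T_gt: (n, 3) translations, already aligned/normalized."""
    return float(np.sum(np.linalg.norm(T_gt - T_gen, axis=1)))
```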

5.3. Baselines

The paper compares ViewCrafter against different baselines depending on the task:

For Zero-shot Novel View Synthesis:

These baselines are also diffusion-based generalizable novel view synthesis frameworks and are primarily designed for single-view input.

  • ZeroNVS [12]: A model fine-tuned from Zero-1-to-3 [11]. It generates novel views conditioned on a reference image and relative camera pose, where the camera pose is processed as CLIP text embedding and injected into the diffusion U-Net via cross-attention.
  • MotionCtrl [13]: A camera-conditioned video diffusion model fine-tuned from SVD [17]. It generates consistent novel views from a conditioned reference image and relative camera pose sequences, also using high-level camera embeddings injected into the video diffusion U-Net through cross-attention.
  • LucidDreamer [14]: Utilizes depth-based warping to synthesize novel views and then employs a pre-trained diffusion-based inpainting model [43] to refine missing regions.

For Scene Reconstruction Comparison (Sparse View 3D-GS Reconstruction):

These baselines are 3D-GS representation-based methods for sparse view reconstruction.

  • DNGaussian [77]: Uses a point cloud produced by COLMAP [76] for initialization and leverages both image supervision and depth regularization for sparse view reconstruction.
  • FSGS [78]: Similar to DNGaussian, it also utilizes COLMAP-produced point clouds for initialization and focuses on few-shot view synthesis using Gaussian Splatting.
  • InstantSplat [79]: Explores using a point cloud produced by DUSt3R [19] (like ViewCrafter for its initial point cloud) as initialization, aiming for efficient 3D-GS training from sparse images.

6. Results & Analysis

6.1. Core Results Analysis

Zero-shot Novel View Synthesis Comparison

The qualitative results highlight ViewCrafter's superior performance in generating high-fidelity and geometrically consistent novel views with precise camera control compared to baselines.

As shown in Figure 3, the results of LucidDreamer [14] exhibit severe artifacts. This is attributed to its depth-based warping approach, which struggles with in-the-wild images due to unknown camera intrinsics, leading to inaccurate warping. The subsequent use of an off-the-shelf inpainting model for refinement often introduces inconsistencies between original and inpainted content. ZeroNVS [12] also produces relatively low-quality and inaccurate novel views. The primary reason is its reliance on text embeddings from CLIP to introduce camera pose conditions, which lacks the precision required for exact 6-DoF control. MotionCtrl [13], while offering better fidelity than ZeroNVS, still falls short in precisely aligning generated views with given camera conditions, similarly due to its high-level camera embedding approach. In contrast, ViewCrafter (ours) demonstrates superior results in both pose control accuracy and overall image quality, which the authors attribute to its incorporation of explicit point cloud priors alongside the video diffusion model.

The following are the results from Table 1 of the original paper:

Easy set:

| Dataset | Method | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | R_dist ↓ | T_dist ↓ |
|---|---|---|---|---|---|---|---|
| Tanks-and-Temples | LucidDreamer [14] | 0.413 | 14.53 | 0.362 | 42.32 | 6.137 | 5.695 |
| Tanks-and-Temples | ZeroNVS [12] | 0.482 | 14.71 | 0.380 | 74.60 | 8.810 | 6.348 |
| Tanks-and-Temples | MotionCtrl [13] | 0.400 | 15.34 | 0.427 | 70.3 | 7.299 | 8.039 |
| Tanks-and-Temples | ViewCrafter (ours) | 0.194 | 21.26 | 0.655 | 27.18 | 0.471 | 1.009 |
| RealEstate10K | LucidDreamer [14] | 0.315 | 16.35 | 0.579 | 56.77 | 5.821 | 10.02 |
| RealEstate10K | ZeroNVS [12] | 0.364 | 16.50 | 0.577 | 96.18 | 6.370 | 9.817 |
| RealEstate10K | MotionCtrl [13] | 0.341 | 16.31 | 0.604 | 89.90 | 4.236 | 9.091 |
| RealEstate10K | ViewCrafter (ours) | 0.145 | 21.81 | 0.796 | 33.09 | 0.380 | 2.888 |
| CO3D | LucidDreamer [14] | 0.429 | 15.11 | 0.451 | 78.87 | 12.90 | 6.665 |
| CO3D | ZeroNVS [12] | 0.467 | 15.15 | 0.463 | 93.84 | 15.44 | 8.872 |
| CO3D | MotionCtrl [13] | 0.393 | 16.87 | 0.529 | 69.18 | 16.87 | 5.131 |
| CO3D | ViewCrafter (ours) | 0.243 | 21.38 | 0.687 | 24.63 | 2.175 | 1.033 |

Hard set:

| Dataset | Method | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | R_dist ↓ | T_dist ↓ |
|---|---|---|---|---|---|---|---|
| Tanks-and-Temples | LucidDreamer [14] | 0.558 | 11.69 | 0.267 | 200.8 | 8.998 | 9.305 |
| Tanks-and-Temples | ZeroNVS [12] | 0.569 | 12.05 | 0.309 | 131.0 | 8.860 | 8.557 |
| Tanks-and-Temples | MotionCtrl [13] | 0.473 | 13.29 | 0.384 | 196.8 | 9.801 | 9.112 |
| Tanks-and-Temples | ViewCrafter (ours) | 0.283 | 18.07 | 0.563 | 38.92 | 1.109 | 0.910 |
| RealEstate10K | LucidDreamer [14] | 0.400 | 14.13 | 0.511 | 71.43 | 7.990 | 10.85 |
| RealEstate10K | ZeroNVS [12] | 0.431 | 14.24 | 0.535 | 105.8 | 8.562 | 10.31 |
| RealEstate10K | MotionCtrl [13] | 0.386 | 16.29 | 0.587 | 70.02 | 8.084 | 9.295 |
| RealEstate10K | ViewCrafter (ours) | 0.178 | 22.04 | 0.798 | 24.89 | 1.098 | 2.867 |
| CO3D | LucidDreamer [14] | 0.517 | 12.69 | 0.374 | 157.8 | 16.43 | 8.301 |
| CO3D | ZeroNVS [12] | 0.524 | 13.31 | 0.426 | 143.2 | 15.02 | 10.22 |
| CO3D | MotionCtrl [13] | 0.443 | 15.46 | 0.502 | 112.7 | 18.81 | 5.575 |
| CO3D | ViewCrafter (ours) | 0.324 | 18.96 | 0.641 | 36.96 | 2.849 | 1.480 |

The quantitative comparison results in Table 1 demonstrate ViewCrafter's significant lead across all metrics on all three datasets (Tanks-and-Temples, RealEstate10K, and CO3D) for both "Easy" and "Hard" test sets.

  • Image Quality: ViewCrafter consistently achieves higher PSNR and SSIM values, indicating better image quality and structural similarity to the ground truth. Its LPIPS scores are significantly lower, suggesting superior perceptual accuracy. The FID scores, which measure the realism of generated images, are also substantially lower for ViewCrafter, particularly on the "Hard" test set, implying it captures the underlying data distribution more effectively and handles challenging missing/occlusion regions better.
  • Pose Accuracy: The R_dist and T_dist metrics for ViewCrafter are notably lower than all baselines across all datasets and test sets. This confirms the effectiveness of its design in enabling more precise camera pose control during novel view synthesis. For example, on the Tanks-and-Temples easy set, ViewCrafter has an R_dist of 0.471 and T_dist of 1.009, which are vastly superior to LucidDreamer's 6.137 and 5.695, ZeroNVS's 8.810 and 6.348, and MotionCtrl's 7.299 and 8.039. This stark difference underscores the benefit of using explicit point cloud priors for precise 3D guidance.

Scene Reconstruction Comparison

The qualitative results for scene reconstruction (Figure 4) show that baselines like DNGaussian [77] and FSGS [78] produce significant artifacts. InstantSplat [79], while using DUSt3R for initialization, fails to recover occlusion regions due to omitting the densification process of 3D-GS, leading to holes in novel views. ViewCrafter, by contrast, leverages priors from video diffusion models to generate high-fidelity novel views even with only 2 ground truth training images, implying better recovery of scene geometry and appearance in occluded regions.

The following are the results from Table 2 of the original paper:

| Method | LPIPS ↓ | PSNR ↑ | SSIM ↑ |
| --- | --- | --- | --- |
| DNGaussian [77] | 0.331 | 15.47 | 0.541 |
| FSGS [78] | 0.364 | 17.53 | 0.558 |
| InstantSplat [79] | 0.275 | 18.61 | 0.614 |
| ViewCrafter (ours) | 0.245 | 21.50 | 0.692 |

Table 2 presents the quantitative comparison for scene reconstruction on the Tanks-and-Temples dataset using only 2 ground truth training images. ViewCrafter consistently outperforms all baselines across LPIPS, PSNR, and SSIM. This further validates its effectiveness in reconstructing scenes from sparse views, implying its ability to generate high-quality and consistent novel views that are beneficial for downstream 3D-GS optimization. For instance, ViewCrafter's PSNR of 21.50 is significantly higher than InstantSplat's 18.61, FSGS's 17.53, and DNGaussian's 15.47.
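As a reference for how the image-quality metrics in Tables 1 and 2 are typically computed (not the authors' exact evaluation script), the sketch below uses scikit-image for PSNR/SSIM and the `lpips` package for LPIPS; FID is omitted because it is computed over sets of images rather than image pairs.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred, gt):
    """pred, gt: HxWx3 float arrays in [0, 1]. Returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    loss_fn = lpips.LPIPS(net='alex')
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = loss_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp

pred = np.random.rand(64, 64, 3).astype(np.float32)
gt = np.clip(pred + 0.05 * np.random.randn(64, 64, 3).astype(np.float32), 0, 1)
print(image_metrics(pred, gt))
```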

6.2. Ablation Studies / Parameter Analysis

Discussion on Pose Condition Strategy

The paper compares its point cloud-based pose conditioning strategy with a Plücker coordinate-based approach, which some concurrent works use for pose-controllable video generation. The Plücker model is trained with per-pixel Plücker coordinates (6 channels per pixel encoding each pixel's camera ray, resized to match the latent resolution) as the conditional input, while keeping the rest of the architecture identical to ViewCrafter.
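For context, the sketch below shows one common construction of per-pixel Plücker coordinates from camera intrinsics and extrinsics; the exact convention and the resizing to the latent resolution used for the baseline may differ.

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker coordinates (H, W, 6): world-space ray direction d and
    moment o x d, where o is the camera center (world-to-camera: x_cam = R x_w + t)."""
    o = -R.T @ t  # camera center in world coordinates

    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)              # (H, W, 3)

    # Rays in the camera frame, rotated into the world frame and normalized.
    d_cam = pix @ np.linalg.inv(K).T                              # (H, W, 3)
    d_world = d_cam @ R                                           # equals R^T applied per ray
    d_world /= np.linalg.norm(d_world, axis=-1, keepdims=True)

    moment = np.cross(o, d_world)
    return np.concatenate([d_world, moment], axis=-1)             # (H, W, 6)

K = np.array([[500., 0, 320.], [0, 500., 180.], [0, 0, 1.]])
emb = plucker_embedding(K, np.eye(3), np.zeros(3), H=360, W=640)
print(emb.shape)  # (360, 640, 6)
```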

The following are the results from Table 3 of the original paper:

| Method | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | R_dist ↓ | T_dist ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Plücker model | 0.370 | 17.51 | 0.546 | 49.33 | 2.688 | 2.570 |
| Ours | 0.270 | 20.25 | 0.649 | 38.17 | 0.552 | 0.983 |

Table 3 clearly shows that ViewCrafter (ours) significantly outperforms the Plücker model in all image quality metrics (LPIPS, PSNR, SSIM, FID) and especially in pose accuracy (R_dist, T_dist). This demonstrates that the explicit point cloud-based pose condition strategy in ViewCrafter achieves more accurate pose control for novel view synthesis. The qualitative comparison in Figure 6 visually supports this, showing ViewCrafter produces higher-fidelity novel views. Figure 7 further visualizes the pose accuracy, revealing that poses estimated from ViewCrafter's generated novel views align much more closely with ground truth poses than those from the Plücker model. The authors also observed that the Plücker model tends to ignore high-frequency camera movements.

Robustness for Point Cloud Condition

The paper highlights ViewCrafter's robustness to imperfections in the conditional point cloud renders. As shown in Figure 5, the conditioned point cloud renders often contain occlusions, missing regions, and geometric distortions (e.g., along object boundaries). However, the corresponding novel views produced by ViewCrafter successfully fill in these holes and correct inaccurate geometry. This indicates that ViewCrafter has developed a strong understanding of the 3D world, allowing it to generate high-quality novel views even from imperfect conditional information.
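To illustrate why such conditional renders contain holes in the first place, the following is a minimal z-buffer point-splatting sketch; the point renderer actually used in the paper is more sophisticated, so this is only an approximation of the idea.

```python
import numpy as np

def render_points(points, colors, K, R, t, H, W):
    """Project a colored point cloud into a target camera with a 1-pixel z-buffer.
    Pixels that no point lands on stay masked -- these are the holes / occlusion
    regions the diffusion model is asked to fill."""
    pts_cam = (R @ points.T + t.reshape(3, 1)).T                  # (N, 3)
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam, colors = pts_cam[in_front], colors[in_front]

    proj = (K @ pts_cam.T).T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, z, colors = uv[valid], pts_cam[valid, 2], colors[valid]

    image = np.zeros((H, W, 3))
    zbuf = np.full((H, W), np.inf)
    # Nearest point wins at each pixel (naive z-buffer).
    for (u, v), depth, c in zip(uv, z, colors):
        if depth < zbuf[v, u]:
            zbuf[v, u], image[v, u] = depth, c

    hole_mask = ~np.isfinite(zbuf)  # True where nothing projected
    return image, hole_mask

pts = np.random.rand(5000, 3) + np.array([0, 0, 2.0])
cols = np.random.rand(5000, 3)
K = np.array([[300., 0, 160.], [0, 300., 120.], [0, 0, 1.]])
img, holes = render_points(pts, cols, K, np.eye(3), np.zeros(3), H=240, W=320)
print(f"{holes.mean():.1%} of pixels are holes")
```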

Ablation on Training Paradigm

This study evaluates the effectiveness of different training choices.

The following are the results from Table 4 of the original paper:

| Training paradigm | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ |
| --- | --- | --- | --- | --- |
| Only train spatial layers | 0.301 | 18.82 | 0.595 | 42.30 |
| Directly train on 576×1024 | 0.314 | 18.55 | 0.582 | 41.01 |
| 16-frame model | 0.289 | 19.07 | 0.610 | 38.43 |
| Ours | 0.280 | 19.52 | 0.615 | 37.77 |

Table 4 shows the quantitative results for different training paradigms, with "Ours" representing the full ViewCrafter training strategy (training both spatial and temporal layers, progressive training, and 25-frame inference).

  • Training Spatial vs. Spatial + Temporal Layers: Training only spatial layers (0.301 LPIPS) yields worse performance than training both spatial and temporal layers (0.280 LPIPS for ours). This confirms the importance of temporal layers in the video denoising U-Net for maintaining consistency across frames.

  • Progressive Training: Directly training at the high resolution of 576×1024 results in worse performance (0.314 LPIPS) compared to the progressive training strategy (0.280 LPIPS for ours), which starts at a lower resolution and then fine-tunes at the target resolution. This indicates that progressive training is crucial for adapting the model to high resolutions effectively.

  • Inference Length: A model inferring 16 frames (0.289 LPIPS) performs slightly worse than the 25-frame model (0.280 LPIPS for ours). This suggests that longer inference sequences (up to a certain point) improve temporal consistency and synthesis quality, although there's a trade-off with computational cost. The authors chose 25 frames for their final model as a balance.

    These results collectively demonstrate the effectiveness of the chosen training paradigm, including training both spatial and temporal layers, employing a progressive training strategy, and selecting an appropriate inference frame length.
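As a rough sketch of how these choices might be implemented in code (the real U-Net module names in the underlying video diffusion codebase will differ), one could select trainable parameter groups by name and schedule training resolution in stages:

```python
import torch

def trainable_params(unet: torch.nn.Module, train_spatial: bool, train_temporal: bool):
    """Select which U-Net parameters to fine-tune. The name pattern 'temporal' is a
    placeholder -- real codebases use their own module naming, so adapt the filter."""
    params = []
    for name, p in unet.named_parameters():
        is_temporal = 'temporal' in name
        train_it = (is_temporal and train_temporal) or (not is_temporal and train_spatial)
        p.requires_grad_(train_it)
        if train_it:
            params.append(p)
    return params

# Toy usage with a placeholder module tree standing in for the denoising U-Net.
unet = torch.nn.ModuleDict({
    'spatial_attn': torch.nn.Linear(8, 8),
    'temporal_attn': torch.nn.Linear(8, 8),
})
optimizer = torch.optim.AdamW(
    trainable_params(unet, train_spatial=True, train_temporal=True), lr=1e-5)

# Progressive training would then run the same loop over a resolution schedule,
# e.g. a low-resolution stage followed by fine-tuning at 576x1024.
```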

Ablation on Camera Trajectory Planning

The ablation study on camera trajectory planning evaluates the effectiveness of the proposed Next-Best-View (NBV) algorithm compared to a predefined trajectory.

Figure 8 illustrates this comparison:

  • Predefined Circular Trajectory: When using a predefined circular camera trajectory, the reconstructed point cloud fails to effectively complete the occlusion region. The image shows noticeable missing areas and an incomplete reconstruction (Figure 8(a)).
  • Proposed Camera Trajectory Planning: In contrast, the point cloud reconstructed using ViewCrafter's content-adaptive camera trajectory planning algorithm reveals occlusion regions more effectively, leading to a more complete reconstruction of the scene (Figure 8(b)). This validates that the adaptive NBV approach is superior in guiding the camera to views that maximize new information gain, thus improving the overall scene reconstruction quality.
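The following is a minimal sketch of the underlying idea, not the paper's exact utility function: score each candidate camera by how much unknown area it would reveal when the current point cloud is rendered from it, while requiring enough overlap with already-reconstructed content to condition the diffusion model.

```python
import numpy as np

def next_best_view(candidate_poses, render_fn, min_overlap=0.6):
    """Greedy next-best-view selection (placeholder utility, not the paper's
    exact formulation): among candidates that still observe enough of the
    existing point cloud, pick the one revealing the largest unknown area."""
    best_pose, best_score = None, -np.inf
    for pose in candidate_poses:
        _, hole_mask = render_fn(pose)     # e.g. render_points(...) from the sketch above
        unknown = hole_mask.mean()         # fraction of pixels with no 3D support
        if (1.0 - unknown) < min_overlap:  # too little known content -> skip candidate
            continue
        if unknown > best_score:
            best_pose, best_score = pose, unknown
    return best_pose
```

After the selected view is synthesized and back-projected into the point cloud, the candidate set would be re-scored, mirroring the iterative view synthesis loop described above.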

6.3. Text-to-3D Generation

The paper also explores the application of ViewCrafter for text-to-3D generation by combining it with text-to-image (T2I) diffusion models. As depicted in Figure 9, given a text prompt (e.g., "An orange-adorned vanilla chocolate ice cream."), a T2I model first generates a reference image. ViewCrafter then synthesizes consistent novel views from this reference image. This demonstrates the framework's potential for creative content generation, extending its utility beyond real-world image inputs. The following figure (Figure 9 from the original paper) shows examples of text-to-3D generation:

The figure is an illustration showing a set of novel views generated by ViewCrafter, including multiple viewpoints of an astronaut, an anime-style castle, blooming flowers, and an ice cream cone. These novel views exhibit high fidelity and consistency, highlighting the method's strong generative capability.
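A high-level sketch of this two-stage pipeline is given below. The Stable Diffusion call uses the public diffusers API, while `viewcrafter_synthesize` is a hypothetical stand-in for ViewCrafter's released inference entry point; consult the official repository for the actual scripts.

```python
import torch
from diffusers import StableDiffusionPipeline

def viewcrafter_synthesize(image, trajectory, num_frames):
    """Hypothetical stand-in for ViewCrafter's inference entry point; the real
    project ships its own scripts and configs -- see the official repository."""
    raise NotImplementedError("Replace with the actual ViewCrafter inference call.")

# Stage 1: text -> reference image with an off-the-shelf T2I model.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
reference = t2i("An orange-adorned vanilla chocolate ice cream.").images[0]

# Stage 2: reference image -> consistent novel views along a camera trajectory,
# which can then be used to optimize a 3D-GS representation as described earlier.
novel_views = viewcrafter_synthesize(image=reference, trajectory="orbit", num_frames=25)
```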

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces ViewCrafter, a novel view synthesis framework that addresses the challenge of generating high-fidelity and accurate novel views from single or sparse images. By ingeniously combining the powerful generative capabilities of video diffusion models with the explicit 3D geometric priors provided by point clouds, ViewCrafter overcomes key limitations of prior approaches. It achieves precise 6-DoF camera pose control and ensures high fidelity and consistency of generated novel views across various scene types. Furthermore, the paper proposes an iterative view synthesis strategy coupled with a content-adaptive camera trajectory planning algorithm, enabling the progressive expansion of covered view areas and the reconstruction of more complete 3D scenes. The extensive experiments on diverse datasets (Tanks-and-Temples, RealEstate10K, CO3D) demonstrate ViewCrafter's superior performance in both image quality and pose accuracy metrics compared to state-of-the-art baselines. Beyond novel view synthesis, the framework's versatility is showcased through its applications in efficiently optimizing 3D-GS representations for real-time rendering and facilitating scene-level text-to-3D generation, marking significant advancements in immersive experiences and imaginative content creation.

7.2. Limitations & Future Work

The authors acknowledge several limitations of ViewCrafter:

  • Large View Range Synthesis from Limited 3D Clues: The method may face challenges when synthesizing novel views with extremely large view ranges, especially when the initial 3D clues are very sparse (e.g., generating a front-view image from only a back-view image). This highlights the inherent difficulty of hallucinating entirely unseen geometry from minimal input.

  • Robustness to Highly Inaccurate Point Clouds: While the method demonstrates robustness to low-quality point clouds with minor artifacts, significant inaccuracies in the conditioned point clouds could still pose challenges. The quality of the initial 3D prior remains a factor in performance.

  • Computational Cost of Multi-step Denoising: As a video diffusion model, ViewCrafter requires a multi-step denoising process during inference. This inherently leads to a relatively higher computing cost and slower generation compared to feed-forward models, limiting its real-time applicability for direct novel view generation (though its output can be used to optimize 3D-GS for real-time rendering).

    The paper implicitly suggests future work in improving the robustness to extreme view changes and very poor point cloud quality. Addressing the inference speed for direct novel view generation could also be a fruitful direction. Further exploration into the nuances of combining ViewCrafter with T2I models for even more sophisticated text-to-3D control and imagination is also a clear path forward.

7.3. Personal Insights & Critique

ViewCrafter represents a significant step forward in the challenging field of novel view synthesis from sparse inputs. The core insight of leveraging explicit 3D geometry (point clouds) to guide the generative power of video diffusion models is both elegant and effective. Previous works often struggled with either high-fidelity generation (regression-based), 3D consistency (pure diffusion-based with weak conditioning), or precise camera control (high-level embeddings). ViewCrafter successfully addresses these issues by providing a coarse but direct geometric "scaffolding" for the diffusion model to build upon.

One of the most impressive aspects is the iterative view synthesis combined with the content-adaptive camera trajectory planning. This mechanism is crucial for scaling the method beyond short video segments, allowing it to explore and reconstruct larger scene areas progressively. The utility function for Next-Best-View selection is a practical approach to guide the exploration in a geometrically meaningful way, avoiding redundant views and prioritizing unknown regions. The robustness to imperfect point clouds is also a critical practical advantage, as real-world sparse reconstruction is rarely pristine.

The applications highlighted, particularly the efficient 3D-GS optimization and text-to-3D generation, demonstrate the versatility and potential impact of the method. Using the generated high-quality novel views to supervise 3D-GS training is a clever way to bridge the gap between powerful generative models and real-time rendering solutions. This approach could significantly accelerate the creation of immersive virtual environments and 3D assets.

Potential areas for improvement or further investigation could include:

  • Efficiency of Iterative Process: While iterative, the process of generating a video, updating the point cloud, and re-planning might still be computationally intensive for very large-scale scenes or real-time exploration. Exploring more efficient point cloud updating or faster video generation techniques could enhance scalability.

  • Refinement of Point Cloud Prior: Although robust to low-quality point clouds, the overall fidelity could potentially be boosted by incorporating a more refined or dynamically improving 3D prior within the diffusion process itself, rather than relying solely on the initial coarse reconstruction.

  • Generalization to Dynamic Scenes: The current method focuses on static scenes. Extending ViewCrafter to dynamically changing scenes would be a formidable but impactful challenge, requiring modeling of both 3D geometry and motion.

  • User Control: While camera pose control is precise, future work could explore more intuitive user controls for scene manipulation or specific aesthetic outcomes.

    Overall, ViewCrafter lays a strong foundation for future research in controllable and high-fidelity 3D content generation, especially from limited input data. Its principled integration of 3D geometry with powerful generative models is a compelling direction for the field.
