Paper status: completed

GaussianObject: High-Quality 3D Object Reconstruction from Four Views with Gaussian Splatting

Published: 02/16/2024

TL;DR Summary

GaussianObject uses Gaussian splatting with visual hull and floater elimination to reconstruct high-quality 3D objects from only four views. A diffusion-based Gaussian repair model restores missing details, and a COLMAP-free variant enables pose-free reconstruction, outperforming prior state-of-the-art methods on several challenging datasets.

Abstract

Reconstructing and rendering 3D objects from highly sparse views is of critical importance for promoting applications of 3D vision techniques and improving user experience. However, images from sparse views only contain very limited 3D information, leading to two significant challenges: 1) Difficulty in building multi-view consistency as images for matching are too few; 2) Partially omitted or highly compressed object information as view coverage is insufficient. To tackle these challenges, we propose GaussianObject, a framework to represent and render the 3D object with Gaussian splatting that achieves high rendering quality with only 4 input images. We first introduce techniques of visual hull and floater elimination, which explicitly inject structure priors into the initial optimization process to help build multi-view consistency, yielding a coarse 3D Gaussian representation. Then we construct a Gaussian repair model based on diffusion models to supplement the omitted object information, where Gaussians are further refined. We design a self-generating strategy to obtain image pairs for training the repair model. We further design a COLMAP-free variant, where pre-given accurate camera poses are not required, which achieves competitive quality and facilitates wider applications. GaussianObject is evaluated on several challenging datasets, including MipNeRF360, OmniObject3D, OpenIllumination, and our-collected unposed images, achieving superior performance from only four views and significantly outperforming previous SOTA methods. Our demo is available at https://gaussianobject.github.io/, and the code has been released at https://github.com/GaussianObject/GaussianObject.

In-depth Reading

1. Bibliographic Information

1.1. Title

The title of the paper is "GaussianObject: High-Quality 3D Object Reconstruction from Four Views with Gaussian Splatting". The central topic is the reconstruction and rendering of high-quality 3D objects using an extremely sparse set of input images, specifically as few as four views, leveraging the 3D Gaussian Splatting representation.

1.2. Authors

The authors are:

  • CHEN YANG (MoE Key Lab of Artificial Intelligence, AI Institute, SJTU, China)

  • SIKUANG LI (MoE Key Lab of Artificial Intelligence, AI Institute, SJTU, China)

  • JIEMIN FANG (Huawei Inc., China)

  • RUOFAN LIANG (University of Toronto, Canada)

  • LINGXI XIE (Huawei Inc., China)

  • XIAOPENG ZHANG (Huawei Inc., China)

  • WEI SHEN (MoE Key Lab of Artificial Intelligence, AI Institute, SJTU, China)

  • QI TIAN (Huawei Inc., China)

    The authors are primarily affiliated with the MoE Key Lab of Artificial Intelligence at Shanghai Jiao Tong University (SJTU), China, and Huawei Inc., China, with one author from the University of Toronto, Canada. Their research backgrounds likely lie in computer vision, 3D reconstruction, neural rendering, and artificial intelligence, given the subject matter.

1.3. Journal/Conference

The paper is published in ACM Trans. Graph. 43, 6 (December 2024), 28 pages. This indicates publication in ACM Transactions on Graphics (TOG), a highly reputable and influential journal in the field of computer graphics. ACM TOG is a premier venue for publishing significant advancements in computer graphics research, often associated with the SIGGRAPH conference.

1.4. Publication Year

The paper first appeared as an arXiv preprint on February 15, 2024, and was formally published in ACM Transactions on Graphics in December 2024.

1.5. Abstract

The paper addresses the challenge of reconstructing and rendering 3D objects from highly sparse views, which is crucial for 3D vision applications and user experience but is difficult due to limited 3D information. Two key challenges are identified: 1) building multi-view consistency with few images, and 2) dealing with omitted or compressed object information due to insufficient view coverage.

To overcome these, the authors propose GaussianObject, a framework based on Gaussian Splatting that achieves high rendering quality from only 4 input images. The methodology involves:

  1. Structure Priors Injection: Introducing visual hull and floater elimination techniques to inject structural priors during initial optimization, helping establish multi-view consistency and yielding a coarse 3D Gaussian representation.

  2. Gaussian Repair Model: Constructing a diffusion model-based Gaussian repair model to supplement omitted object information and refine Gaussians. A self-generating strategy is designed to create image pairs for training this repair model.

  3. COLMAP-Free Variant: Developing a COLMAP-free version (CF-GaussianObject) that does not require pre-given accurate camera poses, broadening its applicability.

    GaussianObject is evaluated on challenging datasets (MipNeRF360, OmniObject3D, OpenIllumination, and custom unposed images), demonstrating superior performance from four views and significantly outperforming previous state-of-the-art methods.

The original source is the arXiv preprint at https://arxiv.org/abs/2402.10259 (PDF: https://arxiv.org/pdf/2402.10259v4.pdf), first posted on February 15, 2024, with formal publication in ACM Trans. Graph. in December 2024.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is high-quality 3D object reconstruction and rendering from extremely sparse multi-view images, specifically as few as four input images covering a $360^\circ$ range around the object.

This problem is critical because current 3D reconstruction techniques, while powerful, typically demand a large number of input images (dozens or more) to achieve high fidelity. This requirement makes these techniques cumbersome and impractical for many real-world applications and for users without expert knowledge, such as creating 3D assets for games, movies, or AR/VR products. The ability to reconstruct from very few images would significantly expedite and democratize these applications.

The paper identifies two significant challenges inherent in highly sparse view reconstruction:

  1. Difficulty in building multi-view consistency: With only a handful of images, there's very limited 3D information. This makes it hard to establish accurate geometric relationships between views, leading to models that might overfit individual input images and result in fragmented, unrealistic 3D representations.

  2. Partially omitted or highly compressed object information: Sparse captures, especially across a $360^\circ$ range, mean that large parts of the object might be poorly observed, completely occluded, or only seen from extreme angles. This "missing" or "degraded" information cannot be reliably reconstructed from the input images alone, leading to incomplete or artifact-ridden 3D models.

    The paper's entry point and innovative idea revolve around leveraging the efficiency and explicitness of 3D Gaussian Splatting as a base representation and augmenting it with two main components: structure priors to guide initial geometric consistency, and a diffusion model-based repair mechanism to synthesize missing details. This addresses the limitations of sparse data by combining explicit geometric guidance with generative capabilities.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Structure-Prior-Aided 3D Gaussian Optimization for Sparse Views: The introduction of techniques like visual hull for initialization and floater elimination during training. These explicitly inject structural priors, such as the object's basic outline, into the optimization process of 3D Gaussians. This helps to establish multi-view consistency from highly sparse inputs, yielding a better coarse 3D representation than previous methods that struggle with limited data for initialization (e.g., SfM points).

  2. Diffusion-Based Gaussian Repair Model: The proposal of a novel Gaussian repair model that utilizes large 2D diffusion models to address artifacts and missing information resulting from poorly observed regions. This model translates corrupted rendered images into high-fidelity ones, which are then used to refine the 3D Gaussians. A self-generating strategy is designed to create sufficient image pairs for training this repair model, overcoming the lack of such data in existing datasets.

  3. COLMAP-Free Variant for Wider Applications: The development of CF-GaussianObject, a variant that removes the dependency on accurate camera poses (intrinsics and extrinsics) provided by traditional Structure-from-Motion (SfM) pipelines like COLMAP. This significantly enhances the practical applicability of the framework, especially for casually captured images, while maintaining competitive reconstruction quality.

    The key findings are that GaussianObject consistently achieves superior performance compared to previous state-of-the-art methods across several challenging real-world datasets (MipNeRF360, OmniObject3D, OpenIllumination, and custom-collected unposed images). It demonstrates significantly higher perceptual quality (as measured by LPIPS) and competitive PSNR and SSIM scores, even with as few as four input views. The COLMAP-free variant also proves effective, reducing practical barriers for users. These findings demonstrate that combining explicit 3D representations with strong structural priors and generative models can effectively overcome the limitations of extremely sparse input data for high-quality 3D reconstruction.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the GaussianObject paper, a reader should be familiar with several foundational concepts in 3D computer vision and generative AI.

3.1.1. 3D Object Reconstruction

3D object reconstruction is the process of creating a 3D model of an object or scene from 2D images or other sensor data. This is a fundamental task in computer graphics and vision, enabling applications like augmented reality, virtual reality, robotics, and digital asset creation. The challenge often lies in accurately inferring depth and geometry from inherently 2D observations.

3.1.2. Novel View Synthesis (NVS)

Novel View Synthesis is the task of generating new images of a scene or object from viewpoints not present in the input training data. It's a key capability for truly immersive 3D experiences and often serves as a primary evaluation metric for 3D reconstruction methods. High-quality NVS requires an accurate and complete 3D representation of the scene.

3.1.3. Neural Radiance Fields (NeRF)

Neural Radiance Fields (NeRF) [Mildenhall et al., 2020] is a seminal neural rendering technique for novel view synthesis. A NeRF model represents a 3D scene as a continuous volumetric function, typically implemented by a Multi-Layer Perceptron (MLP). For any given 3D coordinate $(x, y, z)$ and viewing direction $(\theta, \phi)$, the MLP outputs a color $(r, g, b)$ and a volume density $\sigma$. To render an image from a particular viewpoint, rays are cast from the camera through the scene. Along each ray, samples are taken, and their predicted colors and densities are accumulated using volume rendering techniques (similar to classical computer graphics) to produce the final pixel color.

  • Key idea: Represent a scene implicitly using a neural network.
  • Strengths: Can achieve highly photo-realistic results, especially with dense input views.
  • Weaknesses: Computationally intensive training and rendering (though optimized versions exist), struggles with sparse input views, can be slow to optimize.
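To make the volume rendering step concrete, below is a minimal NumPy sketch of the quadrature NeRF uses along a single ray; the sample colors, densities, and spacings are hypothetical inputs standing in for the output of an actual MLP.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Accumulate per-sample colors into one pixel color via volume rendering.

    colors:    (S, 3) RGB predicted at S samples along the ray
    densities: (S,)   volume density sigma at each sample
    deltas:    (S,)   spacing between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)                       # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # transmittance T_i
    weights = trans * alpha                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                  # final pixel color

# Toy usage: 4 samples along one ray
rgb = composite_ray(np.random.rand(4, 3), np.array([0.1, 2.0, 5.0, 0.5]), np.full(4, 0.25))
```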

3.1.4. 3D Gaussian Splatting (3DGS)

3D Gaussian Splatting (3DGS) [Kerbl et al., 2023] is a recent explicit 3D representation method that has shown impressive performance in novel view synthesis, particularly regarding rendering speed and quality. Instead of an implicit neural field, 3DGS represents a 3D scene using a collection of hundreds of thousands or millions of 3D Gaussians. Each 3D Gaussian is a primitive defined by:

  • Its center location (mean) $\mu \in \mathbb{R}^3$.
  • A rotation quaternion $q \in \mathbb{H}$ (to orient the Gaussian in space).
  • A scaling vector $s \in \mathbb{R}^3$ (to control its size along its principal axes).
  • An opacity $\sigma \in [0, 1]$ (how transparent or opaque it is).
  • Spherical Harmonic (SH) coefficients $sh \in \mathbb{R}^{K \times 3}$ (to represent view-dependent color, where $K$ is the SH order). During rendering, these 3D Gaussians are projected onto the 2D image plane, becoming 2D Gaussians, which are then composited in depth-sorted order using alpha blending. The parameters of these Gaussians are optimized end-to-end to match the input images.
  • Key idea: Explicitly represent the scene with learnable 3D Gaussians, allowing for very fast differentiable rendering.
  • Strengths: Extremely fast training and real-time rendering, high visual quality.
  • Weaknesses: Still struggles with very sparse input views (similar to NeRFs) without additional priors, prone to "floaters" (unwanted Gaussians in empty space) or "holes" in unobserved regions.
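As a minimal illustration of the compositing step, the sketch below alpha-blends already-projected 2D Gaussians at a single pixel in front-to-back depth order. The projected means, covariances, opacities, colors, and depths are hypothetical inputs standing in for the output of the real splatting rasterizer.

```python
import numpy as np

def blend_pixel(pix, means2d, covs2d, opacities, colors, depths):
    """Front-to-back alpha blending of projected 2D Gaussians at one pixel."""
    order = np.argsort(depths)                       # sort splats near-to-far
    color, transmittance = np.zeros(3), 1.0
    for i in order:
        d = pix - means2d[i]
        g = np.exp(-0.5 * d @ np.linalg.inv(covs2d[i]) @ d)   # Gaussian falloff at this pixel
        alpha = min(opacities[i] * g, 0.99)
        color += transmittance * alpha * colors[i]
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:                      # early termination
            break
    return color
```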

3.1.5. Diffusion Models and Latent Diffusion Models (LDM)

Diffusion Models are a class of generative models that learn to reverse a diffusion process. During training, noise is progressively added to data (e.g., images) over several steps until it becomes pure noise. The model then learns to reverse this process, gradually denoising the noisy data back to its original clean form.

  • Latent Diffusion Models (LDM) [Rombach et al., 2022], like Stable Diffusion, operate in a compressed latent space rather than directly on pixel space. This makes them much more efficient. An encoder (part of a Variational Autoencoder, VAE) compresses an image into a lower-dimensional latent representation, and a decoder converts the latent back to an image. The diffusion process (adding noise and denoising) happens in this latent space.
    • Encoder ($\mathcal{E}$): $Z_0 = \mathcal{E}(X_0)$, where $X_0$ is the original image and $Z_0$ is its latent representation.
    • Noise Addition: Noise $\epsilon$ is added to $Z_0$ over $T$ steps to obtain $Z_t$.
    • Denoising (U-Net): A U-Net architecture is trained to predict the noise $\epsilon$ given $Z_t$, a timestep $t$, and optional conditioning (e.g., a text prompt or an image).
    • Decoder ($\mathcal{D}$): After denoising in latent space to get $\hat{Z}_0$, the decoder reconstructs the image: $\hat{X}_0 = \mathcal{D}(\hat{Z}_0)$.
  • Strengths: High-quality image generation, flexibility for various tasks (inpainting, outpainting, image-to-image translation, text-to-image).
  • Weaknesses: Can introduce semantic inconsistencies or artifacts if not properly conditioned or fine-tuned for specific tasks.
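The forward noising step that the denoiser learns to invert can be written in a few lines. This is a schematic PyTorch sketch under a hypothetical linear beta schedule, not the schedule of any particular Stable Diffusion checkpoint.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def add_noise(z0, t):
    """q(z_t | z_0): scale the clean latent and mix in Gaussian noise."""
    eps = torch.randn_like(z0)
    zt = alpha_bar[t].sqrt() * z0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return zt, eps   # the denoiser is trained to predict eps from (zt, t, conditioning)
```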

3.1.6. ControlNet

ControlNet [Zhang et al., 2023a] is an neural network architecture that allows Large Diffusion Models (like Stable Diffusion) to be controlled with additional input conditions, such as edge maps, segmentation maps, depth maps, or normal maps. It works by "locking" the original diffusion model's weights and adding a trainable copy of its U-Net encoder, connected via zero-convolution layers. This enables the model to learn new conditions without destroying the original model's generation quality.

  • Key idea: Provides fine-grained control over diffusion model output by conditioning it on structural information from input images.
  • The conditional loss for ControlNet is given by: $\mathcal{L}_{\mathrm{Cond}} = \mathbb{E}_{Z_0, t, \epsilon}\left[\|\epsilon_\theta(\sqrt{\bar{\alpha}_t}\, Z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t,\ c^{\mathrm{tex}},\ c^{\mathrm{img}}) - \epsilon\|_2^2\right]$ Where:
    • $Z_0$: The latent code of the original image (from the VAE encoder).
    • $t$: The current timestep in the diffusion process (noise level).
    • $\epsilon$: Gaussian noise sampled from a standard normal distribution.
    • $\epsilon_\theta$: The noise predicted by the diffusion model with parameters $\theta$.
    • $\sqrt{\bar{\alpha}_t}\, Z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$: The noisy latent $Z_t$ at timestep $t$; $\bar{\alpha}_t$ is a scaling factor from the noise schedule.
    • $c^{\mathrm{tex}}$: Text conditioning (e.g., a text prompt).
    • $c^{\mathrm{img}}$: Image conditioning (e.g., a Canny edge map or depth map).
    • $\|\cdot\|_2^2$: Squared L2 norm, measuring the difference between the predicted noise and the actual noise.
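The zero-convolution idea is simple to express: a 1x1 convolution initialized to zero, so the control branch contributes nothing at the start of training and grows its influence gradually. The sketch below is a simplified single-block PyTorch illustration of the locked-copy design, not the actual ControlNet implementation (which wires the trainable encoder copy into the frozen U-Net decoder).

```python
import copy
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at zero."""
    def __init__(self, channels):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlledBlock(nn.Module):
    """Frozen base block plus a trainable copy whose output enters via a zero conv."""
    def __init__(self, base_block, channels):
        super().__init__()
        self.control = copy.deepcopy(base_block)          # trainable copy of the block
        self.base = base_block.requires_grad_(False)      # original weights stay frozen
        self.zero_conv = ZeroConv2d(channels)

    def forward(self, x, cond_feat):
        # at initialization the zero conv outputs 0, so behavior equals the frozen base
        return self.base(x) + self.zero_conv(self.control(x + cond_feat))
```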

3.1.7. Visual Hull

Visual Hull [Laurentini, 1994] is a classical computer graphics technique for 3D reconstruction. It approximates the shape of an object by intersecting the visual cones (or frustums) formed by projecting the object's silhouette (mask) from multiple camera views into 3D space. The intersection of these cones forms a maximal volume that is guaranteed to contain the object.

  • Key idea: Reconstructs a coarse but geometrically consistent 3D shape from 2D silhouettes.
  • Strengths: Requires only object masks (silhouettes) and camera parameters, robust to lighting changes, provides a strong geometric prior.
  • Weaknesses: Cannot reconstruct concave features (because concavities are "filled in" by the intersection of cones), provides only a coarse approximation.
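A visual hull can be approximated by keeping only 3D points whose projections fall inside every silhouette. The following NumPy sketch assumes simple pinhole cameras and is only meant to illustrate the intersection-of-silhouettes idea, not any particular implementation.

```python
import numpy as np

def carve_visual_hull(grid_points, masks, intrinsics, extrinsics):
    """Keep 3D points whose projection lies inside the object mask in every view.

    grid_points: (P, 3) candidate 3D points (e.g., a regular voxel grid)
    masks:       list of (H, W) boolean silhouettes
    intrinsics:  list of (3, 3) camera matrices K
    extrinsics:  list of (3, 4) world-to-camera matrices [R | t]
    """
    keep = np.ones(len(grid_points), dtype=bool)
    homo = np.concatenate([grid_points, np.ones((len(grid_points), 1))], axis=1)
    for mask, K, Rt in zip(masks, intrinsics, extrinsics):
        cam = homo @ Rt.T                    # points in the camera frame
        uvw = cam @ K.T                      # project with the pinhole model
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (uvw[:, 2] > 0) & (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        hit = np.zeros(len(grid_points), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        keep &= hit                          # must lie inside the silhouette in every view
    return grid_points[keep]
```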

3.1.8. Structure from Motion (SfM) and COLMAP

Structure from Motion (SfM) [Schönberger and Frahm, 2016] is a photogrammetric technique used to determine the 3D structure of a scene and the 3D position and orientation (camera poses) of the cameras that captured a set of 2D images. It works by identifying and matching distinctive features (e.g., SIFT, SURF) across multiple images, then solving for camera poses and 3D point cloud via bundle adjustment. COLMAP is a general-purpose SfM and Multi-View Stereo (MVS) pipeline that provides state-of-the-art results for camera pose estimation and dense 3D reconstruction.

  • Key idea: Reconstructs 3D points and camera parameters simultaneously from 2D image correspondences.
  • Strengths: Highly accurate camera poses and detailed 3D point clouds, foundational for many 3D tasks.
  • Weaknesses: Requires a sufficient number of overlapping images with texture, struggles with highly sparse views (where feature matching becomes unreliable), requires expertise to run. The paper's COLMAP-free variant directly addresses this limitation.

3.2. Previous Works

The paper categorizes previous works into several groups, highlighting their limitations in extremely sparse $360^\circ$ view reconstruction:

3.2.1. Sparse-View NeRFs and Regularization Techniques

Vanilla NeRF struggles with sparse views due to overfitting and lack of multi-view consistency. Prior works attempted to mitigate this:

  • Visibility/Depth Priors: Methods like [Deng et al. 2022; Roessle et al. 2022; Somraj et al. 2024, 2023; Somraj and Soundararajan 2023] use SfM-derived visibility or depth information. However, these mostly focus on closely aligned views, and SfM itself struggles with extremely sparse $360^\circ$ setups. Xu et al. [2022] relies on ground truth depth, which is impractical. Guangcong et al. [2023] and Song et al. [2023b] use monocular depth estimators or sensors, but these can be coarse.
  • Semantic Priors: Jain et al. [2021] (DietNeRF) uses a vision-language model for unseen view rendering, but the semantic consistency is often too high-level to guide precise low-level geometric reconstruction.
  • Deep Image Priors: Shi et al. [2024b] (ZeroRF) combines a deep image prior with factorized NeRF, capturing overall appearance but potentially missing fine details from input views.
  • Information Theory/Continuity/Symmetry/Frequency Priors: Other priors [Kim et al. 2022; Niemeyer et al. 2022 (RegNeRF); Seo et al. 2023; Song et al. 2023a; Yang et al. 2023 (FreeNeRF)] are effective for specific scenarios but lack general applicability for arbitrary objects in sparse $360^\circ$ views.
  • Vision Transformer (ViT) based approaches: Recent works [Jang and Agapito 2024; Jiang et al. 2024; Xu et al. 2024c; Zou et al. 2024] employ ViTs to reduce reconstruction requirements for NeRFs and Gaussians, but their effectiveness in extreme sparsity might still be limited.

3.2.2. Diffusion Models in 3D Applications

The rise of diffusion models has significantly impacted 3D:

  • Text-to-3D Generation: Dreamfusion [Poole et al. 2023] introduced Score Distillation Sampling (SDS) to distill NeRFs from pre-trained diffusion models for text-to-3D object generation. This has been extensively refined [Chen et al. 2023; Lin et al. 2023; Metzer et al. 2023; Shi et al. 2024a; Tang et al. 2024b; Wang et al. 2023a,b; Yi et al. 2024] and extended to 3D/4D editing [Haque et al. 2023; Shao et al. 2024].
  • Single-Image 3D/View Synthesis: Burgess et al. [2024]; Chan et al. [2023]; Liu et al. [2023c] (Zero123-XL); Müller et al. [2024]; Pan et al. [2024]; Zhu and Zhuang [2024] adapted these for single-image 3D generation and view synthesis. However, they often have strict input requirements (e.g., object-centric, specific camera poses) and can produce overly saturated images or struggle with consistency across views.
  • Diffusion in Sparse Reconstruction: DiffusioNeRF [Wynn and Turmukhambetov 2023], SparseFusion [Zhou and Tulsiani 2023], Deceptive-NeRF [Liu et al. 2023b], ReconFusion [Wu et al. 2024] and CAT3D [Gao et al. 2024] integrate diffusion models with NeRFs for sparse reconstruction. These typically use SDS loss to guide NeRF training. The GaussianObject paper, however, notes that SDS can lead to unstable optimization in their sparse-view context.

3.2.3. Large Reconstruction Models (LRMs)

Recent LRMs [Hong et al. 2024; Li et al. 2024; Tang et al. 2024a (LGM); Wang et al. 2024b; Wei et al. 2024; Weng et al. 2023; Xu et al. 2024a,b; Zhang et al. 2024; Zou et al. 2024 (TriplaneGaussian)] are feed-forward models aiming for fast 3D reconstruction from sparse views. While effective, they often require extensive pre-training, strict requirements on view distribution and object location, and struggle with real-world captures.

3.2.4. Sparse-View Gaussian Splatting

Similar to NeRF, 3DGS also struggles with sparse $360^\circ$ views. FSGS [Zhu et al. 2024] builds upon Gaussian Splatting but still relies heavily on SfM points for initialization and typically requires over 20 views, which is still far denser than the paper's target of 4 views.

3.3. Technological Evolution

The evolution of 3D reconstruction from images has progressed from classical geometric methods (SfM, MVS, Visual Hull) to implicit neural representations (NeRF) and now to explicit neural representations (3D Gaussian Splatting). Each step aims to improve quality, speed, or reduce data requirements.

  • Classical methods: Provided geometric priors and accurate pose estimation but often required dense captures and were limited in appearance modeling.

  • Implicit Neural Representations (NeRF): Revolutionized NVS by modeling scenes as continuous functions, achieving unprecedented photorealism but suffering from slow training/rendering and sparsity.

  • Explicit Neural Representations (3DGS): Combined the photorealism of neural rendering with the efficiency of explicit primitives, drastically improving speed but retaining some sparsity challenges.

  • Generative Models (Diffusion): The integration of diffusion models represents a shift towards using powerful 2D generative priors to fill in missing information or hallucinate plausible details in 3D, especially when data is scarce.

    This paper's work fits into the current frontier by combining the strengths of explicit 3DGS (speed, quality) with classical structure priors (visual hull, floater elimination) and the generative power of diffusion models to address the critical challenge of extremely sparse input views, pushing the boundary from "few-shot" (often 20+ views) to "ultra-sparse" (4 views).

3.4. Differentiation Analysis

Compared to the main methods in related work, GaussianObject introduces several core differences and innovations:

  1. Ultra-Sparse View Reconstruction (4 Views): Unlike most sparse NeRFs or even sparse 3DGS methods (FSGS requiring >20 views), GaussianObject specifically targets and achieves high-quality reconstruction from only four $360^\circ$ input images. This is a significant reduction in data requirement.

  2. Explicit Structure Priors for 3DGS: Instead of relying solely on SfM points (which are unreliable in extreme sparsity) or implicit regularizations, GaussianObject explicitly injects geometric priors:

    • Visual hull for robust initial 3D Gaussian distribution, providing a strong starting point even with minimal data.
    • Floater elimination to remove spurious Gaussians that arise from under-constrained optimization, improving geometric fidelity. This contrasts with implicit methods that struggle to maintain structure or 3DGS methods that depend on good SfM initialization.
  3. Diffusion-Based Gaussian Repair Model: While other methods use diffusion models for SDS-based training or single-image generation, GaussianObject proposes a novel Gaussian repair model based on ControlNet specifically for correcting artifacts and supplementing omitted information in already rendered images. This model is trained using a unique self-generating strategy (leave-one-out training and 3D noise addition) to create degraded/target image pairs, which is more tailored to refining an existing 3DGS model than generic SDS guidance. The paper specifically notes SDS instability in their context.

  4. COLMAP-Free Variant: A major practical innovation is the CF-GaussianObject. By integrating DUSt3R to predict camera poses and intrinsics, it removes the dependency on accurate SfM pipelines, which often fail or are difficult to use in sparse scenarios. This broadens the method's applicability to casual captures without professional photogrammetry setups. Previous SfM-reliant methods (FSGS) would entirely break down without this.

  5. Focus on Perceptual Quality: The framework emphasizes improving perceptual quality, as evidenced by significant LPIPS improvements, often at the expense of marginal PSNR/SSIM differences. This is crucial for user experience and photorealistic rendering.

    In essence, GaussianObject combines the fast rendering of 3DGS with robust geometric initialization, and a targeted diffusion-based image-to-image repair mechanism, all while making it practical for real-world sparse capture scenarios via a COLMAP-free option. This holistic approach differentiates it from methods that only address one aspect (e.g., general sparse NeRF regularization, generic diffusion guidance, or 3DGS relying on dense SfM).

4. Methodology

The GaussianObject framework aims to reconstruct and render high-quality 3D objects from as few as four input images. It leverages 3D Gaussian Splatting (3DGS) as its core representation and addresses the challenges of sparse views through explicit structure priors and a diffusion-based repair mechanism.

The overall framework, as illustrated in Figure 2 from the original paper, operates in several stages:

  1. Initial Optimization with Structure Priors: This stage initializes 3D Gaussians using a visual hull and refines them with floater elimination to build a coarse 3D Gaussian representation ($\mathcal{G}_c$).

  2. Gaussian Repair Model Setup: This involves a self-generating strategy to create sufficient corrupted/clean image pairs, which are then used to fine-tune a ControlNet-based Gaussian repair model ($\mathcal{R}$).

  3. Gaussian Repair with Distance-Aware Sampling: The trained Gaussian repair model is used to rectify rendered images from sparsely observed viewpoints. These repaired images then guide the further refinement of the 3D Gaussians.

  4. COLMAP-Free Variant (Optional): For scenarios without accurate camera poses, a COLMAP-free variant integrates DUSt3R for pose estimation and adapts the optimization process.

    The following figure (Figure 2 from the original paper) shows the overall framework:

    Figure 2 (schematic): the overall GaussianObject pipeline. (a) Optimization with structure priors; (b) setup of the Gaussian repair model; (c) Gaussian repair with distance-aware sampling. The framework achieves high-quality 3D object reconstruction from only four views.

4.1. Principles

The core idea behind GaussianObject is to mitigate the information scarcity of ultra-sparse views by combining explicit geometric guidance with the powerful generative capabilities of diffusion models. The theoretical basis and intuition are as follows:

  • Explicit Representation for Injecting Priors: 3D Gaussian Splatting is chosen because its explicit, point-like structure makes it easier to directly inject geometric priors (like the object's outline from a visual hull) and manipulate individual Gaussians (e.g., floater elimination). This is harder with implicit NeRF representations.

  • Geometric Consistency from Priors: In sparse settings, SfM fails to provide enough reliable 3D points. The visual hull offers a robust, albeit coarse, geometric scaffold that enforces multi-view consistency from the beginning by constraining Gaussians to plausible object regions. Floater elimination further refines this by statistically removing spurious Gaussians that arise from under-constrained optimization.

  • Generative Models for Completing Missing Information: Sparse views inherently lead to missing or ambiguous object information. Traditional reconstruction methods struggle here. Diffusion models, particularly when fine-tuned, excel at hallucinating plausible details and correcting corrupted images while maintaining coherence. By training a specialized Gaussian repair model that can "fix" poor renderings from sparsely observed angles, the framework can infer and refine details that were not directly visible in the input.

  • Self-Supervised Data Generation: Training a repair model requires pairs of "corrupted" and "ideal" images. Since such data is not readily available for sparse views, the paper devises ingenious self-generating strategies (leave-one-out training and 3D noise addition) to create this synthetic training data, making the diffusion model applicable to this specific problem.

  • Distance-Aware Refinement: The quality of initial 3DGS rendering is better near input views and worse in between. The distance-aware sampling strategy prioritizes applying the Gaussian repair model to regions where the 3DGS model is least confident, effectively focusing the generative guidance where it's most needed.

  • Practicality via COLMAP-Free: Recognizing that SfM is a bottleneck for sparse real-world captures, the integration of DUSt3R for pose estimation directly addresses a major practical limitation, making the method more accessible.

    In essence, GaussianObject iteratively refines a 3D explicit representation by first providing it with strong geometric scaffolding, and then using a powerful generative image prior to "dream" in missing details and correct inconsistencies, guiding the 3D representation towards a complete and photorealistic object.

4.2. Core Methodology In-depth

4.2.1. Preliminary

The paper builds upon 3D Gaussian Splatting and ControlNet.

4.2.1.1. 3D Gaussian Splatting

3D Gaussian Splatting [Kerbl et al. 2023] represents a 3D scene as a collection of $P$ individual 3D Gaussians. Each Gaussian $G_i$ is characterized by a set of attributes:

  • $\mu_i$: The 3D center location of the Gaussian.
  • $q_i$: A rotation quaternion that defines the orientation of the Gaussian.
  • $s_i$: A scaling vector that determines the size and shape of the Gaussian along its principal axes.
  • $\sigma_i$: The opacity of the Gaussian, ranging from 0 (fully transparent) to 1 (fully opaque).
  • $sh_i$: Spherical Harmonic (SH) coefficients that model the view-dependent color of the Gaussian. Thus, a scene is represented as a set $\mathcal{G} = \{G_i : \mu_i, q_i, s_i, \sigma_i, sh_i\}_{i=1}^P$. During rendering, these 3D Gaussians are projected onto the 2D image plane, and their contributions are composited to form the final pixel color. All these attributes are optimized to accurately reproduce the input images from various viewpoints.

4.2.1.2. ControlNet

ControlNet [Zhang et al. 2023a] enhances generative diffusion models by allowing them to be conditioned on additional image inputs. Diffusion models operate by reversing a process that adds Gaussian noise $\epsilon$ to data over time $t$. They learn to predict this noise $\epsilon_\theta$ at each step to iteratively denoise a noisy input $X_t$ back to clean data $X_0$. Latent Diffusion Models (LDM) [Rombach et al. 2022] perform this in a latent space, where an encoder ($\mathcal{E}$) converts data $X_0$ to its latent $Z_0$, and a decoder ($\mathcal{D}$) converts latents back to data.

ControlNet integrates additional image conditioning $c^{\mathrm{img}}$ into the LDM's U-Net architecture without retraining the entire model. It does this by adding a trainable copy of the U-Net encoder, connected via zero-convolution layers, while keeping the original LDM weights frozen. The ControlNet is optimized with the following loss function: $\mathcal{L}_{\mathrm{Cond}} = \mathbb{E}_{Z_0, t, \epsilon}\left[\|\epsilon_\theta(\sqrt{\bar{\alpha}_t}\, Z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t,\ c^{\mathrm{tex}},\ c^{\mathrm{img}}) - \epsilon\|_2^2\right]$ Where:

  • $Z_0$: The latent representation of the original image, obtained from a Variational Autoencoder (VAE) encoder.
  • $t$: The current timestep of the diffusion process, representing the noise level; $t$ ranges from 0 to $T$.
  • $\epsilon$: A sample of Gaussian noise, drawn from a standard normal distribution, that is added to the latent.
  • $\epsilon_\theta$: The noise predicted by the diffusion model (specifically, the U-Net within the ControlNet framework) with learnable parameters $\theta$.
  • $\sqrt{\bar{\alpha}_t}\, Z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$: The noisy latent $Z_t$ at timestep $t$, a combination of the original latent $Z_0$ and the noise $\epsilon$ weighted by factors from the noise schedule; $\bar{\alpha}_t \in (0,1]$ is a decreasing sequence associated with the noise-adding process.
  • $c^{\mathrm{tex}}$: Text conditioning, typically an embedding of a text prompt, guiding the image generation semantically.
  • $c^{\mathrm{img}}$: Image conditioning, an additional input image (e.g., an edge map, depth map, or in this paper, a degraded rendering) that guides the structural aspects of the generated image.
  • $\|\cdot\|_2^2$: The squared L2 norm, measuring the mean squared error between the predicted noise $\epsilon_\theta$ and the actual noise $\epsilon$. The diffusion model is trained to minimize this difference, effectively learning to denoise.

4.2.2. Overall Framework Process

Given $N$ reference images $X^{\mathrm{ref}} = \{x_i\}_{i=1}^N$ captured from a $360^\circ$ range, their intrinsics $K^{\mathrm{ref}} = \{k_i\}_{i=1}^N$, extrinsics $\Pi^{\mathrm{ref}} = \{\pi_i\}_{i=1}^N$, and masks $M^{\mathrm{ref}} = \{m_i\}_{i=1}^N$, the goal is to obtain a 3D representation $\mathcal{G}$ (a set of 3D Gaussians) that can render photo-realistic images from any novel viewpoint $\pi$, denoted as $x = \mathcal{G}(\pi \mid \{x_i, \pi_i, m_i\}_{i=1}^N)$.

4.2.3. Initial Optimization with Structure Priors

This stage aims to create a coarse but structurally sound 3D Gaussian representation ($\mathcal{G}_c$) from the very sparse input views.

4.2.3.1. Initialization with Visual Hull

To overcome the lack of reliable SfM points in sparse view settings, the paper uses a visual hull for initializing 3D Gaussians.

  1. Visual Hull Construction: The visual hull is generated by taking the camera view frustums (the pyramidal volume seen by each camera) and the object masks ($m_i$) from the input images. The intersection of these projected masks in 3D space defines the visual hull. This process provides a geometric scaffold that bounds the object's plausible volume. SAM [Kirillov et al. 2023] is used to obtain the object masks.
  2. Gaussian Initialization: Points are randomly initialized within this visual hull using rejection sampling. This means:
    • Random 3D points are sampled uniformly in a bounding box.
    • Each sampled 3D point is projected onto all input image planes.
    • Only points that fall within the intersection of all image-space masks (i.e., inside the silhouette from every view) are retained. These points are considered part of the visual hull.
    • Point colors are assigned by averaging the bilinearly interpolated pixel colors from their projections onto the reference images.
  3. Conversion to 3D Gaussians: These initialized 3D points are then converted into 3D Gaussians:
    • Each point's 3D coordinate becomes the center location $\mu$ of a Gaussian.
    • The point's color is converted into spherical harmonic coefficients $sh$.
    • The scale $s$ is initialized based on the mean distance between adjacent points (e.g., using a k-d tree search for nearest neighbors).
    • The rotation $q$ is set to a default unit quaternion (no specific orientation initially).
    • The opacity $\sigma$ is set to a constant initial value. This visual hull initialization provides a much denser and more geometrically consistent starting point than SfM points in sparse scenarios.
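As a small illustration of the conversion step, the sketch below (assuming SciPy and a `points` array already retained by the rejection sampling described above) initializes per-Gaussian scales from the mean distance to nearby points, with identity rotations and a constant opacity; the constant used to convert RGB to the 0th-order SH coefficient follows the common 3DGS parameterization, and the specific values are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def init_gaussian_attributes(points, colors, k=4, init_opacity=0.1):
    """Turn hull-sampled points (P, 3) and their colors (P, 3) into initial Gaussian attributes."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)            # nearest neighbors, including the point itself
    mean_nn = dists[:, 1:].mean(axis=1)               # mean distance to the k nearest neighbors
    scales = np.log(np.repeat(mean_nn[:, None], 3, axis=1))      # log-scale parameterization
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (len(points), 1))  # identity quaternions
    opacities = np.full((len(points), 1), init_opacity)
    sh_dc = (colors - 0.5) / 0.28209479177387814      # RGB -> 0th-order SH coefficient
    return {"xyz": points, "scale": scales, "rot": rotations,
            "opacity": opacities, "sh_dc": sh_dc}
```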

4.2.3.2. Floater Elimination

Even with visual hull initialization, areas outside the true object boundary might contain spurious Gaussians (called floaters) due to insufficient observational data. These floaters degrade rendering quality.

  1. Statistical Detection: To eliminate floaters, the K-Nearest Neighbors (KNN) algorithm is used to calculate the average distance to the nearest $\sqrt{P}$ Gaussians for each Gaussian in the set $\mathcal{G}_c$. This provides a measure of local density.
  2. Adaptive Thresholding: A normative range for these distances is established by computing their mean and standard deviation. Gaussians whose mean neighbor distance exceeds an adaptive threshold $\tau$ are identified as floaters and removed.
    • The threshold is $\tau = \mathrm{mean} + \lambda_e \cdot \mathrm{std}$, where mean and std are the mean and standard deviation of neighbor distances, and $\lambda_e$ is a weighting factor.
    • $\lambda_e$ is linearly decreased to 0 throughout the optimization process, gradually refining the scene by removing more distant Gaussians. This process is repeated periodically (e.g., every 500 iterations).
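A minimal sketch of this statistical floater filter, assuming the Gaussian centers are available as a NumPy array and using SciPy's KD-tree for the neighbor search:

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_floaters(centers, lambda_e):
    """Drop Gaussians whose mean neighbor distance exceeds mean + lambda_e * std."""
    P = len(centers)
    k = max(2, int(np.sqrt(P)))                     # roughly sqrt(P) neighbors
    dists, _ = cKDTree(centers).query(centers, k=k + 1)
    mean_nn = dists[:, 1:].mean(axis=1)             # skip the zero distance to self
    tau = mean_nn.mean() + lambda_e * mean_nn.std() # adaptive threshold
    return mean_nn <= tau                           # boolean keep-mask over the Gaussians

# lambda_e is linearly annealed toward 0, e.g. every 500 iterations:
# keep = remove_floaters(xyz, lambda_e=max(0.0, lambda_start * (1 - it / max_it)))
```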

4.2.3.3. Initial Optimization

The coarse 3D Gaussian representation $\mathcal{G}_c$ is optimized using a combination of color, mask, and monocular depth losses.

  1. Color Loss: Combines L1 and D-SSIM losses, similar to standard 3D Gaussian Splatting: $\mathcal{L}_1 = \|x - x^{\mathrm{ref}}\|_1, \quad \mathcal{L}_{\text{D-SSIM}} = 1 - \mathrm{SSIM}(x, x^{\mathrm{ref}})$ Where:

    • $x$: The image rendered by the current 3D Gaussian model from a reference viewpoint.
    • $x^{\mathrm{ref}}$: The corresponding ground-truth reference image.
    • $\|\cdot\|_1$: L1 norm, measuring the absolute difference between pixel values.
    • $\mathrm{SSIM}(x, x^{\mathrm{ref}})$: Structural Similarity Index Measure, a perceptual metric that quantifies image similarity by considering luminance, contrast, and structure.
  2. Mask Loss: A Binary Cross-Entropy (BCE) loss [Jadon 2020] is applied to align the rendered object mask with the ground-truth mask: $\mathcal{L}_{\mathrm{m}} = -\left(m^{\mathrm{ref}} \log m + (1 - m^{\mathrm{ref}}) \log(1 - m)\right)$ Where:

    • $m$: The mask rendered by the 3D Gaussian model.
    • $m^{\mathrm{ref}}$: The ground-truth object mask for the reference image.
    • $\log$: Natural logarithm.
  3. Depth Loss: A shift- and scale-invariant depth loss guides the geometry, using monocular depth estimations: $\mathcal{L}_{\mathrm{d}} = \|D^{*} - D^{*}_{\mathrm{pred}}\|_1$ Where:

    • $D^{*}$: The normalized depth map rendered by the 3D Gaussian model.
    • $D^{*}_{\mathrm{pred}}$: The normalized monocularly estimated depth map (e.g., from ZoeDepth [Bhat et al. 2023]) for the reference image.
    • The normalization follows [Ranftl et al. 2020]: $D^{*} = \frac{D - \mathrm{median}(D)}{\frac{1}{M}\sum_{i=1}^{M}\left|D - \mathrm{median}(D)\right|}$ Where:
      • $D$: The raw depth map (either rendered or predicted).
      • $\mathrm{median}(D)$: The median value over all valid pixels in the depth map.
      • $M$: The total number of valid pixels in the depth map. This normalization makes the loss robust to global scale and shift differences between rendered and predicted depths.
  4. Overall Loss: The combined loss for initial optimization is: $\mathcal{L}_{\mathrm{ref}} = (1 - \lambda_{\mathrm{SSIM}})\,\mathcal{L}_1 + \lambda_{\mathrm{SSIM}}\,\mathcal{L}_{\text{D-SSIM}} + \lambda_{\mathrm{m}}\,\mathcal{L}_{\mathrm{m}} + \lambda_{\mathrm{d}}\,\mathcal{L}_{\mathrm{d}}$ Where:

    • $\lambda_{\mathrm{SSIM}}$, $\lambda_{\mathrm{m}}$, and $\lambda_{\mathrm{d}}$ are hyperparameters that control the weighting of each loss term. This initial optimization is fast, taking about 1 minute to train a coarse Gaussian representation.
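Putting the terms together, here is a hedged PyTorch sketch of $\mathcal{L}_{\mathrm{ref}}$; the `ssim` helper and the renderer outputs are assumed to exist, and the weights are placeholders rather than the paper's exact values.

```python
import torch
import torch.nn.functional as F

def normalize_depth(d):
    """Shift/scale-invariant normalization over valid pixels (Ranftl et al. 2020)."""
    med = d.median()
    return (d - med) / (d - med).abs().mean().clamp_min(1e-8)

def initial_loss(render_rgb, render_mask, render_depth,
                 gt_rgb, gt_mask, mono_depth,
                 w_ssim=0.2, w_mask=1.0, w_depth=0.1):
    l1 = (render_rgb - gt_rgb).abs().mean()
    l_dssim = 1.0 - ssim(render_rgb, gt_rgb)                 # assumed SSIM helper
    l_mask = F.binary_cross_entropy(render_mask.clamp(1e-6, 1 - 1e-6), gt_mask)
    l_depth = (normalize_depth(render_depth) - normalize_depth(mono_depth)).abs().mean()
    return (1 - w_ssim) * l1 + w_ssim * l_dssim + w_mask * l_mask + w_depth * l_depth
```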

4.2.4. Gaussian Repair Model Setup

The coarse model $\mathcal{G}_c$ still suffers from artifacts in poorly observed or unobserved regions. A Gaussian repair model $\mathcal{R}$ is introduced to fix these issues. This model takes a corrupted rendered image $x'(\mathcal{G}_c, \pi^{\mathrm{nov}})$ as input and outputs a photo-realistic, high-fidelity image $\hat{x}$. This repair capability is then used to refine the 3D Gaussians.

4.2.4.1. Self-Generating Strategies for Training Data

Training $\mathcal{R}$ (a diffusion model) requires sufficient pairs of corrupted and ideal images, which are not readily available. Two strategies are designed to generate these image pairs:

  1. Leave-One-Out Training:
    • From the $N$ input reference images, $N$ subsets are created. Each subset contains $N-1$ reference images and 1 left-out image ($x^{\mathrm{out}}$).
    • For each subset, a 3DGS model $\mathcal{G}_c^i$ is trained using the $N-1$ images. Since $x^{\mathrm{out}}$ is not used, renderings from this view will be "corrupted."
    • After initial training, the left-out image $x^{\mathrm{out}}$ is then used to further train the corresponding $\mathcal{G}_c^i$ into $\hat{\mathcal{G}}_c^i$ (the "ideal" model for that view).
    • Renderings from the left-out view at different iterations during both training phases (before and after using $x^{\mathrm{out}}$) are stored. These pairs of degraded renderings (from $\mathcal{G}_c^i$) and improved renderings (from $\hat{\mathcal{G}}_c^i$) for the $x^{\mathrm{out}}$ viewpoint form image pairs for training $\mathcal{R}$.
  2. Adding 3D Noises:
    • 3D noises $\epsilon_s$ are added to the attributes (e.g., positions, opacities, scales) of the Gaussians in $\mathcal{G}_c$. The magnitude of these noises is derived from the differences observed between the $\mathcal{G}_c^i$ and $\hat{\mathcal{G}}_c^i$ models from the leave-one-out strategy.
    • By rendering images $x'(\mathcal{G}_c(\epsilon_s), \pi^{\mathrm{ref}})$ from these noisy Gaussians at all reference views, an extensive set of (corrupted image, original reference image) pairs $(X', X^{\mathrm{ref}})$ is generated.
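The leave-one-out strategy can be summarized in a short, schematic loop. `train_gaussians`, `finetune`, and `render` are hypothetical helpers standing in for the coarse 3DGS training described in Section 4.2.3, and in practice renderings are collected at multiple snapshot iterations rather than once per view.

```python
def build_repair_pairs(images, cameras):
    """Leave-one-out strategy: collect (degraded, improved) rendering pairs per held-out view."""
    pairs = []
    for i, (x_out, cam_out) in enumerate(zip(images, cameras)):
        subset = [(x, c) for j, (x, c) in enumerate(zip(images, cameras)) if j != i]
        g_i = train_gaussians(subset)              # coarse model trained without view i
        degraded = render(g_i, cam_out)            # "corrupted" rendering of the held-out view
        g_hat_i = finetune(g_i, x_out, cam_out)    # further trained with the held-out image
        improved = render(g_hat_i, cam_out)        # "ideal" rendering of the same view
        pairs.append((degraded, improved))         # conditioning / target pair for the repair model
    return pairs
```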

4.2.4.2. Training the Gaussian Repair Model ($\mathcal{R}$)

The Gaussian repair model is implemented by injecting LoRA (Low-Rank Adaptation) [Hu et al. 2022] weights into a pre-trained ControlNet [Zhang et al. 2023b] (specifically, ControlNetTile based on Stable Diffusion v1.5). The training procedure is shown in Figure 3 from the original paper.

Fig. 3. Illustration of the Gaussian repair model setup. Gaussian noise $\epsilon$ is added to a reference image $x^{\mathrm{ref}}$ to form a noisy image. This noisy image, together with the corresponding degraded rendering $x'$, is fed into the pre-trained ControlNet, whose learnable LoRA layers predict the noise $\epsilon_\theta$; the difference between $\epsilon$ and $\epsilon_\theta$ is used to fine-tune the LoRA parameters.

The LoRA layers are injected into the text encoder, image-condition branch, and U-Net of the ControlNet. The model is fine-tuned using the generated image pairs from the self-generating strategies. The loss function for fine-tuning is: $\mathcal{L}_{\mathrm{tune}} = \mathbb{E}_{x^{\mathrm{ref}}, t, \epsilon, x'}\left[\|\epsilon_\theta(x_t^{\mathrm{ref}}, t, x', c^{\mathrm{tex}}) - \epsilon\|_2^2\right]$ Where:

  • $x^{\mathrm{ref}}$: The "clean" or "ideal" reference image from the generated image pairs (e.g., $x^{\mathrm{out}}$ or the original $X^{\mathrm{ref}}$).
  • $x'$: The "corrupted" or "degraded" rendered image (e.g., from $\mathcal{G}_c^i$ or $\mathcal{G}_c(\epsilon_s)$). This serves as the image conditioning $c^{\mathrm{img}}$ for ControlNet.
  • $t$: The current timestep of the diffusion process.
  • $\epsilon$: The Gaussian noise added to the latent of $x^{\mathrm{ref}}$.
  • $x_t^{\mathrm{ref}}$: The noisy latent representation of $x^{\mathrm{ref}}$ at timestep $t$.
  • $\epsilon_\theta(x_t^{\mathrm{ref}}, t, x', c^{\mathrm{tex}})$: The noise predicted by the ControlNet model, which takes the noisy latent $x_t^{\mathrm{ref}}$, timestep $t$, the degraded image $x'$ (as image condition), and a text prompt $c^{\mathrm{tex}}$ as inputs.
  • $c^{\mathrm{tex}}$: An object-specific language prompt, defined as "a photo of [V]" (where [V] is a placeholder for the specific object, following DreamBooth [Ruiz et al. 2023]). The loss minimizes the difference between the predicted noise and the actual noise, allowing the ControlNet to learn to "repair" $x'$ into $x^{\mathrm{ref}}$.
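A schematic PyTorch expression of $\mathcal{L}_{\mathrm{tune}}$ for one training pair follows; `vae`, `controlnet_unet`, and `alpha_bar` are assumed components of a LoRA-augmented ControlNet pipeline, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def tune_loss(vae, controlnet_unet, alpha_bar, x_ref, x_degraded, text_emb):
    """Noise-prediction loss: the repair model learns to map x' toward x_ref."""
    z0 = vae.encode(x_ref)                                   # clean latent of the reference image
    t = torch.randint(0, len(alpha_bar), (1,)).item()        # random diffusion step
    eps = torch.randn_like(z0)
    zt = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps
    eps_pred = controlnet_unet(zt, t, image_cond=x_degraded, text_cond=text_emb)
    return F.mse_loss(eps_pred, eps)                         # only LoRA layers receive gradients
```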

4.2.5. Gaussian Repair with Distance-Aware Sampling

After training $\mathcal{R}$, its learned object priors are distilled back into $\mathcal{G}_c$ to refine its rendering quality, especially in regions poorly observed by the original reference images.

4.2.5.1. Distance-Aware Sampling

The intuition is that the initial $\mathcal{G}_c$ provides good renderings near the original training views, but poor ones in between.

  1. Path Definition: An elliptical path is established around the object, centered on the object.

    • Reference Path: Arcs on this path near the original training viewpoints ($\Pi^{\mathrm{ref}}$) are considered regions where $\mathcal{G}_c$ renders high-quality images.

    • Repair Path: The other arcs, which are farther from the training views, are where renderings from $\mathcal{G}_c$ are likely to be degraded and need repair. These define the repair path. This concept is illustrated in Figure 4 from the original paper.

      Fig. 4. Illustration of distance-aware sampling. Blue and red indicate the reference and repair paths, respectively; the excavator model sits at the center, and cubes labeled $\pi_i$ and $\pi_j$ mark camera viewpoints on the two paths.

  2. Novel Viewpoint Sampling: In each iteration, novel viewpoints $\pi_j \in \Pi^{\mathrm{nov}}$ are randomly sampled from the repair path.

4.2.5.2. Repair Process and Loss

For each sampled novel viewpoint $\pi_j$:

  1. Render Degraded Image: The current 3D Gaussian model $\mathcal{G}_c$ renders an image $x_j$ from viewpoint $\pi_j$. This $x_j$ serves as a "degraded rendering" because $\pi_j$ lies in a sparsely observed region.
  2. Encode and Noise Latent:
    • The rendered image $x_j$ is encoded into its latent representation $\mathcal{E}(x_j)$ using the VAE encoder.
    • A cloned version of $\mathcal{E}(x_j)$ is then diffused into a noisy latent $z_t$: $z_t = \sqrt{\bar{\alpha}_t}\,\mathcal{E}(x_j) + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, I),\ t \in [0, T]$ Where:
      • $z_t$: The noisy latent at timestep $t$.
      • $\mathcal{E}(x_j)$: The latent code of the rendered image $x_j$.
      • $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$: Weighting factors from the diffusion schedule.
      • $\epsilon$: Gaussian noise sampled from a standard normal distribution.
      • $t$: The noise level or timestep, varying from 0 to $T$. This process is similar to SDEdit [Meng et al. 2022], where a noisy version of the input is subsequently denoised.
  3. Generate Repaired Image: The noisy latent $z_t$ is passed through the trained Gaussian repair model $\mathcal{R}$ (specifically, its denoising U-Net) along with the original $\mathcal{E}(x_j)$ as image conditioning and the text prompt. The DDIM sampling [Song et al. 2021] procedure is run for $k$ steps ($k = \lfloor 50 \cdot t/T \rfloor$) to denoise $z_t$. Finally, the VAE decoder $\mathcal{D}$ converts the denoised latent back into a repaired image $\hat{x}_j$: $\hat{x}_j = \mathcal{D}(\mathrm{DDIM}(z_t, \mathcal{E}(x_j)))$ Where:
    • $\hat{x}_j$: The high-fidelity, repaired image generated by $\mathcal{R}$.
    • $\mathcal{D}$: The VAE decoder.
    • $\mathrm{DDIM}(z_t, \mathcal{E}(x_j))$: The DDIM sampling process, which denoises the latent $z_t$ conditioned on $\mathcal{E}(x_j)$.
  4. Refinement Loss: The repaired image $\hat{x}_j$ now serves as a high-fidelity target image. A loss function guides the refinement of the 3D Gaussians $\mathcal{G}_c$ towards rendering images consistent with $\hat{x}_j$: $\mathcal{L}_{\mathrm{rep}} = \mathbb{E}_{\pi_j, t}\left[w(t)\,\lambda(\pi_j)\left(\|x_j - \hat{x}_j\|_1 + \|x_j - \hat{x}_j\|_2 + L_p(x_j, \hat{x}_j)\right)\right], \text{ where } \lambda(\pi_j) = \frac{2 \cdot \min_{i=1}^{N}\|\pi_j - \pi_i\|_2}{d_{\max}}$ Where:
    • $x_j$: The original rendering from the current $\mathcal{G}_c$ for viewpoint $\pi_j$.
    • $\hat{x}_j$: The repaired, high-fidelity image generated by $\mathcal{R}$ for viewpoint $\pi_j$.
    • $\|x_j - \hat{x}_j\|_1$: L1 loss between the rendered and repaired images.
    • $\|x_j - \hat{x}_j\|_2$: L2 loss between the rendered and repaired images.
    • $L_p(x_j, \hat{x}_j)$: Perceptual loss (LPIPS [Zhang et al. 2018]), which measures perceptual similarity using deep features and is crucial for visual quality.
    • $w(t)$: A noise-level-modulated weighting function from DreamFusion [Poole et al. 2023], which adjusts the weight of the loss based on the noise level $t$ used in generating $z_t$.
    • $\lambda(\pi_j)$: A distance-based weighting function that assigns higher weights to viewpoints $\pi_j$ farther away from any of the original reference views $\pi_i$. It is defined as: $\lambda(\pi_j) = \frac{2 \cdot \min_{i=1}^{N}\|\pi_j - \pi_i\|_2}{d_{\max}}$ Where:
      • $\min_{i=1}^{N}\|\pi_j - \pi_i\|_2$: The minimum Euclidean distance between the novel viewpoint $\pi_j$ and any of the $N$ reference viewpoints $\pi_i$.
      • $d_{\max}$: The maximum distance among neighboring reference viewpoints (a normalization factor). This weighting ensures that the repair process focuses more on challenging, under-observed regions. During this entire Gaussian repair procedure, the 3D Gaussians $\mathcal{G}_c$ are also continuously optimized using the reference image loss $\mathcal{L}_{\mathrm{ref}}$ (from Section 4.2.3.3) to maintain coherence with the original input images.
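The following is a compact, hedged sketch of one repair iteration, including the distance weighting $\lambda(\pi_j)$; `render`, `repair_model`, `lpips`, and `sample_repair_path` are assumed helpers rather than the released implementation, and the $w(t)$ modulation is omitted for brevity.

```python
import torch

def lambda_weight(pose_j, ref_poses, d_max):
    """Higher weight for viewpoints far from every reference view."""
    d_min = min(torch.norm(pose_j - p) for p in ref_poses)
    return 2.0 * d_min / d_max

def repair_step(gaussians, ref_poses, d_max, t_frac=0.4):
    pose_j = sample_repair_path()                 # viewpoint sampled on the repair path
    x_j = render(gaussians, pose_j)               # degraded rendering from the coarse model
    x_hat = repair_model(x_j, strength=t_frac)    # SDEdit-style noise injection + DDIM denoising
    x_hat = x_hat.detach()                        # repaired image acts as a fixed target
    diff = x_j - x_hat
    loss = diff.abs().mean() + diff.pow(2).mean() + lpips(x_j, x_hat)
    return lambda_weight(pose_j, ref_poses, d_max) * loss
```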

4.2.6. COLMAP-Free GaussianObject (CF-GaussianObject)

The COLMAP-free variant addresses the practical limitation of needing accurate camera parameters, which are hard to obtain with sparse inputs for traditional SfM pipelines.

  1. Pose Estimation with DUSt3R: Given reference input images $X^{\mathrm{ref}}$, the advanced sparse matching model DUSt3R [Wang et al. 2024a] is used to predict:

    • $\mathcal{P}$: An estimated coarse point cloud of the scene.
    • $\hat{\Pi}^{\mathrm{ref}}$: The predicted camera poses (extrinsics) for $X^{\mathrm{ref}}$.
    • $\hat{K}^{\mathrm{ref}}$: The predicted camera intrinsics for $X^{\mathrm{ref}}$. The operation is formulated as: $\mathcal{P}, \hat{\Pi}^{\mathrm{ref}}, \hat{K}^{\mathrm{ref}} = \mathrm{DUSt3R}(X^{\mathrm{ref}})$ Where:
    • $X^{\mathrm{ref}}$: The set of reference input images.
    • $\mathcal{P}$: The coarse 3D point cloud estimated by DUSt3R.
    • $\hat{\Pi}^{\mathrm{ref}}$: The estimated camera extrinsics (rotation and translation) for each reference image.
    • $\hat{K}^{\mathrm{ref}}$: The estimated camera intrinsics (focal length, principal point) for each reference image.
  2. Intrinsic Sharing Modification: For CF-GaussianObject, the intrinsic recovery module within DUSt3R is modified so that all images $x_i \in X^{\mathrm{ref}}$ share the same intrinsic camera parameters $\hat{K}$. This simplifies the camera model and makes it more robust for sparse scenes. DUSt3R is then used to retrieve $\mathcal{P}$, $\hat{\Pi}^{\mathrm{ref}}$, and the shared intrinsics $\hat{K}$.

  3. 3D Gaussian Initialization: The estimated coarse point cloud $\mathcal{P}$ from DUSt3R is used to initialize the 3D Gaussians, similar to how SfM points are typically used in 3DGS. This replaces the visual hull initialization step.

  4. Joint Optimization: After initialization, both the estimated camera poses $\hat{\Pi}^{\mathrm{ref}}$ and the initialized 3D Gaussians are jointly optimized. This optimization uses the input images $X^{\mathrm{ref}}$ and depth maps rendered from $\mathcal{P}$ simultaneously.

  5. Regularization Loss: A regularization loss is introduced to constrain deviations from the initial $\hat{\Pi}^{\mathrm{ref}}$ predicted by DUSt3R. This helps to stabilize the optimization of camera parameters, preventing them from drifting too far from DUSt3R's initial (potentially noisy) estimates. After this joint optimization, the refined 3D Gaussians and camera parameters are used in the Gaussian repair model setup and Gaussian repairing process described in Sections 4.2.4 and 4.2.5.
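As a small illustration of step 5, the sketch below penalizes drift of the optimized poses away from DUSt3R's initial estimates; the pose parameterization (translation plus unit quaternion) and the weights are illustrative assumptions, not the paper's exact regularizer.

```python
import torch

def pose_regularization(opt_trans, opt_quat, init_trans, init_quat, w_t=1.0, w_r=1.0):
    """Keep jointly optimized camera poses close to the DUSt3R initialization."""
    t_term = (opt_trans - init_trans).pow(2).sum(dim=-1).mean()
    # quaternions are sign-ambiguous, so compare |<q_opt, q_init>| against 1
    q_dot = (opt_quat * init_quat).sum(dim=-1).abs()
    r_term = (1.0 - q_dot).mean()
    return w_t * t_term + w_r * r_term
```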

4.2.7. Key Mathematical Symbols Summary

The following are the results from Table 1 of the original paper:

| Symbol | Meaning |
|---|---|
| $X^{\mathrm{ref}}$ | Reference images |
| $K^{\mathrm{ref}}$ | Intrinsics of $X^{\mathrm{ref}}$ |
| $\hat{K}^{\mathrm{ref}}$ | Estimated intrinsics of $X^{\mathrm{ref}}$ |
| $\hat{K}$ | Estimated shared intrinsics of $X^{\mathrm{ref}}$ |
| $\Pi^{\mathrm{ref}} = \{\pi_i\}_{i=1}^N$ | Extrinsics of $X^{\mathrm{ref}}$ |
| $\Pi^{\mathrm{nov}}$ | Extrinsics of viewpoints in repair path |
| $\hat{\Pi}^{\mathrm{ref}}$ | Estimated extrinsics of $X^{\mathrm{ref}}$ |
| $M^{\mathrm{ref}}$ | Masks of $X^{\mathrm{ref}}$ |
| $\mu$ | Center location of Gaussian |
| $q$ | Rotation quaternion of Gaussian |
| $s$ | Scale vector of Gaussian |
| $\sigma$ | Opacity of Gaussian |
| `sh` | Spherical harmonic coefficients of Gaussian |
| $\mathcal{G}_c$ | Coarse 3D Gaussians |
| $\mathcal{R}$ | Diffusion-based Gaussian repair model |
| $\mathcal{E}$ | Latent diffusion encoder of $\mathcal{R}$ |
| $\mathcal{D}$ | Latent diffusion decoder of $\mathcal{R}$ |
| $x'$ | Degraded rendering |
| $\hat{x}$ | Image repaired by $\mathcal{R}$ |
| $\epsilon_s$ | 3D noise added to attributes of $\mathcal{G}_c$ |
| $\epsilon$ | 2D Gaussian noise for fine-tuning |
| $\epsilon_\theta$ | 2D noise predicted by $\mathcal{R}$ |
| $c^{\mathrm{tex}}$ | Object-specific language prompt |
| $\mathcal{P}$ | Coarse point cloud predicted by DUSt3R |

5. Experimental Setup

5.1. Datasets

GaussianObject is evaluated on several challenging datasets that are suitable for sparse-view 360360^\circ object reconstruction, representing a variety of scene complexities and object types.

  1. MipNeRF360 [Barron et al. 2021]: This dataset features complex real-world scenes captured with a camera moving 360° around a central object. It includes scenes such as "kitchen," "garden," and "room," often with challenging lighting and reflective surfaces. The paper specifically uses the "kitchen" scene for ablation studies.
  2. OmniObject3D [Wu et al. 2023]: A dataset designed for 360360^\circ object reconstruction, likely featuring diverse objects under controlled or varied conditions, providing a benchmark for object-centric tasks.
  3. OpenIllumination [Liu et al. 2023a]: This dataset focuses on objects under varying illumination conditions, which is crucial for evaluating a model's ability to disentangle geometry and appearance.
  4. Our-collected unposed images: To demonstrate the practical utility of the COLMAP-free variant, the authors collected images of daily-life objects using an iPhone 13. These images represent real-world casual captures where accurate camera poses are not readily available.
    • For these custom images, SAM [Kirillov et al. 2023] (Segment Anything Model) is used to automatically obtain masks of the target objects, simplifying the data preparation for the visual hull initialization (a minimal usage sketch follows below).

      These datasets are chosen because they represent different facets of the 3D reconstruction problem: complex real-world scenes, diverse objects, varying illumination, and unposed real-world captures. This diversity allows for a comprehensive validation of the method's performance under various challenging conditions, especially for the sparse 360360^\circ setup.
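For the custom captures, object masks can be obtained with SAM roughly as in the sketch below. The checkpoint path, image filename, and the single center-point prompt are placeholders; the paper does not state exactly how SAM was prompted.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone (checkpoint path is a placeholder for an official ViT-H checkpoint).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# HxWx3 uint8 RGB image of one casually captured view.
image = np.array(Image.open("iphone_capture_0.jpg").convert("RGB"))
predictor.set_image(image)

# Prompt with a single foreground click near the object center (placeholder coordinates).
point = np.array([[image.shape[1] // 2, image.shape[0] // 2]])
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=np.array([1]),  # 1 marks a foreground point
    multimask_output=True,
)
object_mask = masks[np.argmax(scores)]  # keep the highest-scoring binary mask
```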

5.2. Evaluation Metrics

For evaluating the novel view synthesis performance, the paper uses three standard metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). These metrics assess different aspects of image quality.

5.2.1. Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a widely used quantitative metric to measure the quality of reconstruction of lossy compression codecs or, in this context, the quality of a generated image compared to its ground truth. It is typically expressed in decibels (dB). A higher PSNR value indicates a better reconstruction, implying that the generated image is closer to the ground truth in terms of pixel intensity values. It is sensitive to absolute pixel differences.
  • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $ Where:
    • MAXI\mathrm{MAX}_I: The maximum possible pixel value of the image. For an 8-bit image, this is 255. For floating-point images scaled to [0, 1], this is 1.
    • MSE\mathrm{MSE}: Mean Squared Error between the generated image and the ground truth image. $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $ Where:
      • M, N: The dimensions (height and width) of the image.
      • I(i,j): The pixel value at row ii and column jj of the generated image.
      • K(i,j): The pixel value at row ii and column jj of the ground truth image.
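A direct NumPy implementation of the PSNR formula (a generic sketch for images scaled to [0, 1], not the paper's evaluation script):

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB between a prediction and its ground truth, both in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Example: a rendering that deviates from ground truth by small Gaussian noise.
rng = np.random.default_rng(0)
gt = rng.random((128, 128, 3))
pred = np.clip(gt + rng.normal(scale=0.01, size=gt.shape), 0.0, 1.0)
print(f"PSNR: {psnr(pred, gt):.2f} dB")  # roughly 40 dB for noise sigma of 0.01
```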

5.2.2. Structural Similarity Index Measure (SSIM)

  • Conceptual Definition: SSIM is a perceptual metric designed to evaluate the perceived quality of an image. Unlike PSNR which focuses on absolute errors, SSIM attempts to model how the human visual system perceives image degradation. It considers three key factors: luminance, contrast, and structure. The SSIM value ranges from -1 to 1, where 1 indicates perfect similarity. Higher SSIM values are better.
  • Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $ Where:
    • x, y: Two image patches being compared (e.g., from the generated and ground truth images).
    • μx,μy\mu_x, \mu_y: The average (mean) of pixels in xx and yy, respectively.
    • σx2,σy2\sigma_x^2, \sigma_y^2: The variance of pixels in xx and yy, respectively.
    • σxy\sigma_{xy}: The covariance of xx and yy.
    • C1=(K1L)2,C2=(K2L)2C_1 = (K_1L)^2, C_2 = (K_2L)^2: Small constants to avoid division by zero when the denominators are very close to zero. LL is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and K1=0.01,K2=0.03K_1=0.01, K_2=0.03 are common default values. The SSIM calculation is typically applied locally over sliding windows in the image, and the final SSIM score for the image is the average of these local SSIM values.
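The formula can be evaluated directly on two grayscale patches as in the sketch below; note that standard implementations (e.g., skimage.metrics.structural_similarity) average this statistic over local sliding windows rather than computing it globally. This is an illustrative sketch, not the paper's evaluation code.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 1.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM of two grayscale patches with values in [0, L]."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return float(((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) /
                 ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

# Example: nearly identical patches score close to 1.
rng = np.random.default_rng(0)
x = rng.random((64, 64))
y = np.clip(x + rng.normal(scale=0.02, size=x.shape), 0.0, 1.0)
print(f"SSIM: {ssim_global(x, y):.4f}")
```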

5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)

  • Conceptual Definition: LPIPS is a metric that measures the perceptual similarity between two images, often correlating better with human judgment than PSNR or SSIM. It works by extracting features from images using a pre-trained deep neural network (e.g., VGG, AlexNet, SqueezeNet), and then calculating the L2 distance between these feature representations. A lower LPIPS score indicates higher perceptual similarity (i.e., the images look more alike to a human). The paper uses LPIPSLPIPS*, which is LPIPS multiplied by 10210^2.
  • Mathematical Formula: $ \mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\phi_l(x)_{hw} - \phi_l(y)_{hw}) \|_2^2 $ Where:
    • x, y: The two images being compared.
    • ϕl()\phi_l(\cdot): Feature stack from the ll-th layer of a pre-trained deep network (e.g., VGG).
    • wlw_l: A learned scalar weight for each channel in layer ll.
    • \odot: Element-wise product.
    • Hl,WlH_l, W_l: Height and width of the feature map at layer ll.
      • $\|\cdot\|_2^2$: Squared L2 norm. The metric sums the weighted L2 distances between feature activations across different layers of the network.
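In practice LPIPS is usually computed with the reference implementation in the `lpips` Python package; the snippet below is a minimal sketch with random tensors standing in for a rendering and its ground truth, and it does not claim to reproduce the paper's exact evaluation settings (e.g., backbone choice).

```python
import torch
import lpips  # pip install lpips

# VGG backbone, one of the standard choices for LPIPS evaluation.
loss_fn = lpips.LPIPS(net='vgg')

# LPIPS expects float tensors of shape (N, 3, H, W) scaled to [-1, 1].
pred = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for a rendered view
gt = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in for the ground-truth view

with torch.no_grad():
    d = loss_fn(pred, gt).item()

print(f"LPIPS = {d:.4f}, LPIPS* = {d * 100:.2f}")  # LPIPS* in the paper is LPIPS x 10^2
```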

5.3. Baselines

The proposed GaussianObject method is compared against several representative baselines from both NeRF-based and Gaussian Splatting-based approaches, as well as recent Large Reconstruction Models (LRMs).

5.3.1. NeRF-based Methods

  • DVGO [Sun et al. 2022]: A fast NeRF variant that represents scenes with explicit voxel grids, greatly improving training and rendering speed; it serves as a representative explicit-representation baseline.
  • DietNeRF [Jain et al. 2021]: A few-shot NeRF method that leverages a CLIP vision-language model as a semantic prior to guide novel view synthesis from sparse inputs.
  • RegNeRF [Niemeyer et al. 2022]: A NeRF method that uses various regularization techniques, such as depth and color consistency losses, to improve quality from sparse views.
  • FreeNeRF [Yang et al. 2023]: A NeRF approach that focuses on frequency regularization and implicit neural field priors to generalize better from sparse inputs.
  • SparseNeRF [Guangcong et al. 2023]: Another NeRF method specifically designed for sparse view settings, often incorporating monocular depth priors.
  • ZeroRF [Shi et al. 2024b]: Combines a deep image prior with factorized NeRF to effectively capture overall appearance, aiming for zero-shot or few-shot reconstruction.

5.3.2. Gaussian Splatting-based Methods

  • Vanilla 3DGS [Kerbl et al. 2023]: The original 3D Gaussian Splatting method, initialized randomly or with SfM points. It serves as a direct comparison to show the benefits of the structure priors and repair model added by GaussianObject.
  • FSGS [Zhu et al. 2024]: A few-shot Gaussian Splatting method that builds upon 3DGS and typically relies on SfM points for initialization. The paper mentions supplying extra SfM points to FSGS to make it work in the highly sparse 360360^\circ setting, implying that its default setup would not handle 4 views well.

5.3.3. Large Reconstruction Models (LRMs)

  • LGM [Tang et al. 2024a]: A recent Large Reconstruction Model for fast 3D generation. The paper tests LGM-4 (using 4 sparse captures directly) and LGM-1 (using MVDream [Shi et al. 2024a] to generate specific views conforming to LGM's strict input requirements).
  • TriplaneGaussian (TGS) [Zou et al. 2024]: Another LGM-like model that often takes a single image as input. The paper feeds it with frontal views of objects.

5.3.4. For COLMAP-Free Evaluation

  • For a fair comparison with the CF-GaussianObject, other SOTA methods are equipped with camera parameters predicted by DUSt3R (the same pose estimation method used in CF-GaussianObject) when evaluating on unposed images.

    These baselines are representative because they cover a range of approaches to sparse-view 3D reconstruction: implicit vs. explicit representations, different types of priors (semantic, depth, frequency, deep image), and both traditional optimization-based and modern generative/feed-forward LRM approaches. Comparing against vanilla 3DGS directly shows the impact of the proposed additions, while FSGS provides a 3DGS-specific few-shot baseline. The LRMs represent the cutting edge of fast, feed-forward reconstruction.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents extensive quantitative and qualitative evaluations of GaussianObject across multiple challenging datasets and configurations. The results consistently demonstrate GaussianObject's superior performance, especially in extremely sparse view settings (4 views).

6.1.1. Sparse 360° Reconstruction Performance

The following are the results from Table 2 of the original paper:

| Dataset | Method | 4-view LPIPS* ↓ | 4-view PSNR ↑ | 4-view SSIM ↑ | 6-view LPIPS* ↓ | 6-view PSNR ↑ | 6-view SSIM ↑ | 9-view LPIPS* ↓ | 9-view PSNR ↑ | 9-view SSIM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| MipNeRF360 | DVGO [Sun et al. 2022] | 24.43 | 14.39 | 0.7912 | 26.67 | 14.30 | 0.7676 | 25.66 | 14.74 | 0.7842 |
| | 3DGS [Kerbl et al. 2023] | 10.80 | 20.31 | 0.8991 | 8.38 | 22.12 | 0.9134 | 6.42 | 24.29 | 0.9331 |
| | DietNeRF [Jain et al. 2021] | 11.17 | 18.90 | 0.8971 | 6.96 | 22.03 | 0.9286 | 5.85 | 23.55 | 0.9424 |
| | RegNeRF [Niemeyer et al. 2022] | 20.44 | 13.59 | 0.8476 | 20.72 | 13.41 | 0.8418 | 19.70 | 13.68 | 0.8517 |
| | FreeNeRF [Yang et al. 2023] | 16.83 | 13.71 | 0.8534 | 6.84 | 22.26 | 0.9332 | 5.51 | 27.66 | 0.9485 |
| | SparseNeRF [Guangcong et al. 2023] | 17.76 | 12.83 | 0.8454 | 19.74 | 13.42 | 0.8316 | 21.56 | 14.36 | 0.8235 |
| | ZeroRF [Shi et al. 2024b] | 19.88 | 14.17 | 0.8188 | 8.31 | 24.14 | 0.9211 | 5.34 | 27.78 | 0.9460 |
| | FSGS [Zhu et al. 2024] | 9.51 | 21.07 | 0.9097 | 7.69 | 22.68 | 0.9264 | 6.06 | 25.31 | 0.9397 |
| | GaussianObject (Ours) | 4.98 | 24.81 | 0.9350 | 3.63 | 27.00 | 0.9512 | 2.75 | 28.62 | 0.9638 |
| | CF-GaussianObject (Ours) | 8.47 | 21.39 | 0.9014 | 5.71 | 24.06 | 0.9269 | 5.50 | 24.39 | 0.9300 |
| OmniObject3D | DVGO [Sun et al. 2022] | 14.48 | 17.14 | 0.8952 | 12.89 | 18.32 | 0.9142 | 11.49 | 19.26 | 0.9302 |
| | 3DGS [Kerbl et al. 2023] | 8.60 | 17.29 | 0.9299 | 7.74 | 18.29 | 0.9378 | 6.50 | 20.26 | 0.9483 |
| | DietNeRF [Jain et al. 2021] | 11.64 | 18.56 | 0.9205 | 10.39 | 19.07 | 0.9267 | 10.32 | 19.26 | 0.9258 |
| | RegNeRF [Niemeyer et al. 2022] | 16.75 | 15.20 | 0.9091 | 14.38 | 15.80 | 0.9207 | 10.17 | 17.93 | 0.9420 |
| | FreeNeRF [Yang et al. 2023] | 8.28 | 17.78 | 0.9402 | 7.32 | 19.02 | 0.9464 | 7.25 | 20.35 | 0.9467 |
| | SparseNeRF [Guangcong et al. 2023] | 17.47 | 15.22 | 0.8921 | 21.71 | 15.86 | 0.8935 | 23.76 | 17.16 | 0.8947 |
| | ZeroRF [Shi et al. 2024b] | 4.44 | 27.78 | 0.9615 | 3.11 | 31.94 | 0.9731 | 3.10 | 32.93 | 0.9747 |
| | FSGS [Zhu et al. 2024] | 6.25 | 24.71 | 0.9545 | 6.05 | 26.36 | 0.9582 | 4.17 | 29.16 | 0.9695 |
| | GaussianObject (Ours) | 2.07 | 30.89 | 0.9756 | 1.55 | 33.31 | 0.9821 | 1.20 | 35.49 | 0.9870 |
| | CF-GaussianObject (Ours) | 2.62 | 28.51 | 0.9669 | 2.03 | 30.73 | 0.9738 | 2.08 | 31.23 | 0.9757 |

The following are the results from Table 3 of the original paper:

| Method | 4-view LPIPS* ↓ | 4-view PSNR ↑ | 4-view SSIM ↑ | 6-view LPIPS* ↓ | 6-view PSNR ↑ | 6-view SSIM ↑ |
|---|---|---|---|---|---|---|
| DVGO | 11.84 | 21.15 | 0.8973 | 8.83 | 23.79 | 0.9209 |
| 3DGS | 30.08 | 11.50 | 0.8454 | 29.65 | 11.98 | 0.8277 |
| DietNeRF† | 10.66 | 23.09 | 0.9361 | 9.51 | 24.20 | 0.9401 |
| RegNeRF† | 47.31 | 11.61 | 0.6940 | 30.28 | 14.08 | 0.8586 |
| FreeNeRF† | 35.81 | 12.21 | 0.7969 | 35.15 | 11.47 | 0.8128 |
| SparseNeRF | 22.28 | 13.60 | 0.8808 | 26.30 | 12.80 | 0.8403 |
| ZeroRF | 9.74 | 24.54 | 0.9308 | 7.96 | 26.51 | 0.9415 |
| Ours | 6.71 | 24.64 | 0.9354 | 5.44 | 26.54 | 0.9443 |

The results from Table 2 and Table 3 clearly demonstrate GaussianObject's superior performance across MipNeRF360, OmniObject3D, and OpenIllumination datasets, particularly evident in the LPIPS metric.

  • Dominant LPIPS Scores: GaussianObject consistently achieves the lowest LPIPSLPIPS* scores across all datasets and view counts (4, 6, and 9 views). For example, on MipNeRF360 with 4 views, GaussianObject achieves LPIPSLPIPS* of 4.98, significantly outperforming the next best FSGS (9.51). On OmniObject3D, it drops to an impressive 2.07. Since LPIPS correlates strongly with human perceptual quality, this indicates that GaussianObject generates visually more pleasing and realistic novel views.

  • Strong PSNR and SSIM: While LPIPS is the most striking improvement, GaussianObject also achieves the highest PSNR and SSIM values in most scenarios, especially with more views. For instance, on OmniObject3D (4 views), it leads with PSNR 30.89 and SSIM 0.9756. This indicates high fidelity in terms of pixel accuracy and structural preservation.

  • Effectiveness with Only 4 Views: The method's strength is most apparent with only 4 input images, where many baseline methods (e.g., DVGO, RegNeRF, SparseNeRF) perform poorly, often yielding fragmented or blurred results. GaussianObject's structure priors and Gaussian repair model effectively address the challenges of extreme sparsity.

  • Outperforming with More Views: The paper highlights that GaussianObject even outperforms methods with more input views (e.g., its 4-view performance can be better than some baselines with 6 or 9 views), further validating its effectiveness.

  • Qualitative Observations: As shown in Figure 5 from the original paper, implicit NeRF-based methods and randomly initialized 3DGS fail to reconstruct coherent objects in extremely sparse settings, often appearing as fragmented pixel patches. ZeroRF, while having competitive PSNR and SSIM on OpenIllumination, often produces blurred renderings lacking fine details (Figure 6 from the original paper), whereas GaussianObject demonstrates superior fine-detailed reconstruction. This reinforces the importance of the Gaussian repair model for perceptual quality. The following figure (Figure 5 from the original paper) shows qualitative results of various methods:

    Fig. 5. Qualitative comparison of 3D object reconstructions. Each column shows a different object (e.g., loader, flowers, ice cream) reconstructed by the compared methods (DVGO, 3DGS, etc.) and GaussianObject, with ground-truth (GT) images in the bottom row.

    The following figure (Figure 6 from the original paper) shows qualitative results on the OpenIllumination dataset:

    Fig. 6. Qualitative results on the OpenIllumination dataset. Although ZeroRF shows competitive PSNR and SSIM, its renderings often appear blurred, while GaussianObject excels at restoring fine details and noticeably improves perceptual quality.

6.1.2. Comparison with LRMs

The following are the results from Table 4 of the original paper:

| Method | LPIPS* ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| TGS [Zou et al. 2024] | 9.14 | 18.07 | 0.9073 |
| LGM-4 [Tang et al. 2024a] | 9.20 | 17.97 | 0.9071 |
| LGM-1 [Tang et al. 2024a] | 9.13 | 17.46 | 0.9071 |
| GaussianObject (Ours) | 4.99 | 24.81 | 0.9350 |

The comparison with LRM-like methods (Table 4) on MipNeRF360 reveals that while LRMs are fast, their performance significantly lags GaussianObject in sparse, in-the-wild capture scenarios. LGM-4 and LGM-1 (even with MVDream-generated compliant inputs) yield much higher LPIPSLPIPS* (around 9.1-9.2) and lower PSNR/SSIM compared to GaussianObject's 4.99 LPIPSLPIPS*. This highlights LRMs' sensitivity to strict input view distributions and object positioning, making them less robust for practical sparse reconstruction compared to GaussianObject which has no such restrictions and does not require extensive pre-training.

6.1.3. Performance of CF-GaussianObject

The COLMAP-free variant (CF-GaussianObject) shows a slight performance degradation compared to the full GaussianObject (which assumes accurate camera poses) but remains highly competitive or even superior to other SOTA methods that do rely on accurate camera parameters (Table 2).

  • For MipNeRF360 (4 views), CF-GaussianObject achieves LPIPSLPIPS* 8.47, which is better than FSGS (9.51) and significantly better than most NeRF-based methods.
  • For OmniObject3D (4 views), CF-GaussianObject's LPIPSLPIPS* of 2.62 is still better than FSGS (6.25). This demonstrates that while eliminating the need for precise camera parameters, CF-GaussianObject still offers strong performance, making it highly practical for casually captured images. The paper notes that performance degradation can increase with more input views, potentially due to DUSt3R's pose estimation accuracy decline. Qualitative results on iPhone 13-captured images (Figure 7 from the original paper) confirm CF-GaussianObject's superior reconstruction capabilities and visual quality compared to other SOTAs equipped with DUSt3R-predicted poses. The following figure (Figure 7 from the original paper) shows qualitative results on our-collected images:

Fig. 7. Qualitative results on our-collected images captured by an iPhone 13. Other SOTAs (FreeNeRF, ZeroRF, FSGS) are equipped with camera parameters predicted by DUSt3R for a fair comparison. The results demonstrate the superior reconstruction quality and visual detail of CF-GaussianObject.

6.2. Ablation Studies

6.2.1. Key Components

The following are the results from Table 5 of the original paper:

| Method | LPIPS* ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| Ours w/o Visual Hull | 12.72 | 15.95 | 0.8719 |
| Ours w/o Floater Elimination | 4.99 | 24.73 | 0.9346 |
| Ours w/o Setup | 5.53 | 24.28 | 0.9307 |
| Ours w/o Gaussian Repair | 5.55 | 24.37 | 0.9297 |
| Ours w/o Depth Loss | 5.09 | 24.84 | 0.9341 |
| Ours w/ SDS [Poole et al. 2023] | 6.07 | 22.42 | 0.9188 |
| GaussianObject (Ours) | 4.98 | 24.81 | 0.9350 |

Ablation studies on MipNeRF360 (4 views, Table 5) confirm the critical role of each component:

  • Visual Hull (VH): Omitting Visual Hull initialization (Ours w/o Visual Hull) leads to a drastic drop in performance across all metrics (LPIPSLPIPS* jumps from 4.98 to 12.72, PSNR drops from 24.81 to 15.95). This confirms that visual hull is essential for building robust initial multi-view consistency from sparse inputs.

  • Floater Elimination (FE): Removing Floater Elimination (Ours w/o Floater Elimination) causes a slight increase in LPIPSLPIPS* (4.98 to 4.99) and minor drops in PSNR/SSIM. While the quantitative impact is small, qualitative evaluation (Figure 9 from the original paper) would likely show more spurious Gaussians.

  • Gaussian Repair Model (Setup and Repair Process): Both the setup (Ours w/o Setup) and the actual repair process (Ours w/o Gaussian Repair) are crucial. Without the setup (i.e., not fine-tuning the ControlNet), LPIPSLPIPS* increases to 5.53. Without the repair process itself, LPIPSLPIPS* increases to 5.55. This indicates that both learning the repair capability and applying it are vital for the significant perceptual improvements. Qualitatively, Figure 8 from the original paper demonstrates noticeable artifacts and lack of details without these components.

  • Depth Loss: Removing the depth loss (Ours w/o Depth Loss) leads to a minor increase in LPIPSLPIPS* (4.98 to 5.09) and slight changes in PSNR/SSIM. While its direct quantitative contribution might seem marginal, the authors state it enhances the robustness of the framework, likely by providing additional geometric guidance that helps stabilize optimization.

  • SDS Loss: Using Score Distillation Sampling (SDS) from DreamFusion (Ours w/ SDS) as an alternative to the proposed repair loss results in significantly worse performance (LPIPSLPIPS* 6.07, PSNR 22.42), indicating SDS is unstable and less effective in this sparse-view 3DGS context compared to the Gaussian repair model.

    The following figure (Figure 9 from the original paper) shows qualitative ablation studies on different components:

    Fig. 9. Ablation study on different components. "VH" denotes the visual hull and "FE" denotes floater elimination. The "GT" image is from a test view; the remaining panels show reconstructions without the visual hull (w/o VH), without floater elimination (w/o FE), with SDS (w/ SDS), and with the full method (Ours).

The following figure (Figure 8 from the original paper) shows the importance of the Gaussian repair model setup:

Fig. 8. Importance of our Gaussian repair model setup. Without the Gaussian repair process or the fine-tuning of the ControlNet, the renderings exhibit noticeable artifacts and lack of detail, particularly in regions with insufficient view coverage.

6.2.2. Structure of Repair Model

The following are the results from Table 6 of the original paper:

| Method | LPIPS* ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| Zero123-XL [Liu et al. 2023c] | 13.97 | 17.71 | 0.8921 |
| Dreambooth [Ruiz et al. 2023] | 6.58 | 21.85 | 0.9093 |
| Depth Condition | 7.00 | 21.87 | 0.9112 |
| Depth Condition w/ Mask | 6.87 | 21.92 | 0.9117 |
| GaussianObject (Ours) | 5.79 | 23.55 | 0.9220 |

Comparison with alternative Gaussian repair model structures (Table 6) highlights the effectiveness of the proposed design:

  • Zero123-XL: This well-known single-image reconstruction model yields the worst performance (LPIPS* 13.97), indicating its struggle with multi-view consistency despite generating visually acceptable individual images. Its strict input requirements (object-centered inputs, precise camera information) are also a limitation.

  • Dreambooth: While useful for semantic modifications, Dreambooth alone (LPIPSLPIPS* 6.58) fails to provide strong 3D-coherent synthesis, suggesting that purely semantic guidance isn't sufficient for geometric consistency.

  • Depth Condition (with/without mask): Using monocular depth conditioning for the repair model (similar to Song et al. [2023b]) shows some improvement over Dreambooth but still performs worse than GaussianObject (LPIPSLPIPS* 7.00 and 6.87 respectively). This suggests that relying solely on depth maps as conditioning might introduce roughness or artifacts, whereas the proposed method leverages the degraded rendering itself as a condition.

  • GaussianObject (Ours): Our model's design, which leverages the degraded rendering itself as the condition and fine-tunes a ControlNet with self-generated data, achieves the best performance (LPIPS* 5.79), excelling in both 3D consistency and detail fidelity. The following figure (Figure 10 from the original paper) shows qualitative comparisons ablating different Gaussian repair model setups, visually confirming these findings:

    Fig. 10. Qualitative comparisons of different Gaussian repair model setup methods.

6.2.3. Effect of View Numbers

The following figure (Figure 11 from the original paper) shows ablation on training view number:

Fig. 11. Ablation on training view number. Experiments are conducted on the scene kitchen in the MipNeRF360 dataset. Panel (a) plots LPIPS and panel (b) plots PSNR against the number of training views for GaussianObject and 3DGS.

Figure 11 from the original paper demonstrates that GaussianObject consistently outperforms vanilla 3DGS across varying numbers of training views (from 4 to 24). This indicates that the proposed improvements are not limited to just 4 views but generalize to other sparse-to-medium sparsity levels. Notably, GaussianObject with 24 training views achieves performance comparable to 3DGS trained on all views (243 views), showcasing its data efficiency and ability to extract more information from fewer images.

6.3. Qualitative Examples

The paper includes several qualitative examples beyond the main result figures to further illustrate the capabilities of GaussianObject.

  • Figure 18 from the original paper (qualitative examples on the OpenIllumination dataset with four input views) provides more detailed visual comparisons, reinforcing the quantitative findings that GaussianObject produces sharper images with better details compared to baselines.

  • Figure 19 from the original paper (qualitative samples of the Gaussian repaired models on several scenes from different views) showcases the improved realism and detail in objects like plush dolls, broccoli, gloves, and ceramic jugs after the Gaussian repair process, demonstrating the model's ability to maintain multi-view consistency while enhancing visual quality.

    These qualitative results complement the quantitative metrics by providing visual evidence of the high-quality 3D object reconstruction achieved by GaussianObject, especially in terms of fine details and overall perceptual realism.

7. Conclusion & Reflections

7.1. Conclusion Summary

GaussianObject is presented as a novel framework for high-quality 3D object reconstruction from extremely sparse 360360^\circ views, specifically leveraging only four input images. The framework builds upon 3D Gaussian Splatting, a representation known for its real-time rendering capabilities.

The core contributions that enable this breakthrough are:

  1. Structure-Prior-Aided Optimization: This involves using visual hull for robust initialization of 3D Gaussians and floater elimination during training. These explicit geometric priors are crucial for establishing multi-view consistency when input data is severely limited.

  2. Diffusion-Based Gaussian Repair Model: A specialized ControlNet-based model is introduced to address artifacts and supplement omitted object information that cannot be recovered from sparse views alone. This model is trained using a novel self-generating strategy to create necessary image pairs (degraded renderings and high-fidelity targets). The distance-aware sampling strategy efficiently guides the refinement process by focusing on poorly observed regions.

  3. COLMAP-Free Variant: Recognizing the practical challenges of obtaining accurate camera poses, CF-GaussianObject integrates DUSt3R for pose estimation, making the method applicable to casually captured images without requiring complex SfM pipelines.

    Evaluations on diverse and challenging datasets (MipNeRF360, OmniObject3D, OpenIllumination, and custom iPhone captures) demonstrate that GaussianObject consistently achieves superior performance over previous state-of-the-art methods, particularly in perceptual quality (LPIPS), even outperforming methods that use more input views. The COLMAP-free variant also proves competitive, significantly broadening the method's real-world applicability.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  1. Hallucinations in Unobserved Regions: In areas that are completely unobserved or extremely sparsely covered, the Gaussian repair model may generate hallucinations, meaning it might create plausible but non-existent details (e.g., filling a hole in a vase, as shown in Figure 12 from the original paper). The authors note that such regions are inherently non-deterministic, and other methods also struggle here. The following figure (Figure 12 from the original paper) shows hallucinations of non-existent details:

    Fig. 12. Hallucinations of non-existent details. GaussianObject may fabricate visually reasonable details in areas with little information. For instance, the hole in the stone vase is filled in.

  2. Limited View-Dependent Effects: Due to the high sparsity of input data, the model currently struggles to capture accurate view-dependent effects (e.g., reflections, specularity). It tends to "bake in" view-dependent features (like reflected white light) onto the surface, leading to incorrect appearance from novel viewpoints and unintended artifacts (Figure 13 from the original paper). The following figure (Figure 13 from the original paper) shows challenges in reconstructing view-dependent appearance:

    Fig. 13. Comparative visualization highlighting the challenge of reconstructing view-dependent appearance with only four input images.

    • Future Work: Fine-tuning diffusion models with more view-dependent data is suggested as a promising direction to address this.
  3. Integration with Surface Reconstruction: The paper suggests that integrating GaussianObject with surface reconstruction methods like 2DGS [Huang et al. 2024] and GOF [Yu et al. 2024] could be a promising direction. This could potentially yield more robust and geometrically precise mesh-based or explicit surface models.

  4. Performance Gap in COLMAP-Free Variant: While CF-GaussianObject achieves competitive performance, there remains a performance gap compared to using precisely ground-truth or highly accurate camera parameters.

    • Future Work: Leveraging confidence maps from matching methods within the pose estimation pipeline could lead to more accurate pose estimates and further close this performance gap.

7.3. Personal Insights & Critique

GaussianObject represents a significant step forward in making high-quality 3D reconstruction accessible from ultra-sparse inputs. Its strength lies in its hybrid approach, intelligently combining explicit geometric priors with the generative power of diffusion models, tailored specifically for the challenges of sparse 360360^\circ captures.

Key Strengths and Innovations:

  • Practicality: The COLMAP-free variant is a game-changer for real-world applications. By removing the need for professional photogrammetry setups, it empowers casual users to create 3D assets from simple phone captures. This directly addresses a major bottleneck for widespread adoption of 3D vision techniques.
  • Robustness in Extreme Sparsity: The ability to generate high-quality results from just four views is remarkable. The visual hull and floater elimination are elegant solutions to provide the much-needed geometric scaffolding when traditional SfM fails.
  • Targeted Diffusion Integration: Instead of just applying generic SDS (which the paper shows to be less stable in this context), the development of a specialized Gaussian repair model with its self-generating data strategy is highly innovative. It shows a deep understanding of how to adapt powerful generative models to solve specific reconstruction artifacts.
  • Perceptual Quality Focus: The emphasis on LPIPS and qualitative results highlights a user-centric design, prioritizing how good the rendered objects look over pure pixel accuracy, which is often more important for visual applications.

Potential Improvements and Critiques:

  • Hallucination Control: While acknowledged, the hallucination problem is inherent to generative models. Future work could explore methods to quantify reconstruction uncertainty and visually indicate regions where the model is "dreaming" rather than reconstructing, or integrate more semantic understanding to guide plausible completion.
  • View-Dependent Appearance: The limitation on view-dependent effects is a tough challenge for sparse views. Perhaps a hybrid model that can infer material properties or use physically-based rendering priors could disentangle appearance more effectively, even with limited data.
  • Dynamic Objects/Scenes: The current method is likely optimized for static objects. Extending it to dynamic scenes or objects with non-rigid deformation from sparse views would be a monumental but impactful next step.
  • Computational Cost for Repair Model Training: While the 3DGS optimization is fast, the ControlNet fine-tuning step, even with LoRA, still adds a significant training time and computational resource requirement. Optimizing this phase further could reduce the overall pipeline time.

Transferability and Future Value: The core ideas in GaussianObject—the judicious use of explicit geometric priors, a targeted generative repair mechanism, and a COLMAP-free approach—could be highly transferable.

  • Other Explicit Representations: The repair model concept could be adapted for other explicit 3D representations beyond Gaussians, such as meshes or point clouds, by repairing their rendered outputs.

  • Medical Imaging/Robotics: In fields where data capture is inherently sparse (e.g., medical scans, robotic exploration in constrained environments), similar principles could be applied to reconstruct complete 3D models from limited sensor data.

  • Asset Creation Pipelines: GaussianObject could become a standard tool in 3D asset creation, dramatically reducing the time and effort required to digitize real-world objects for games, film, and e-commerce.

    Overall, GaussianObject is a well-engineered and highly practical solution that pushes the boundaries of sparse-view 3D reconstruction, making high-quality 3D content creation more accessible.
