
G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

Published: 10/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

G4Splat integrates geometry-guided depth supervision with generative priors to improve 3D scene reconstruction quality and multi-view consistency, outperforming baselines on diverse datasets and supporting single-view and unposed video inputs.

Abstract

Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.

In-depth Reading

1. Bibliographic Information

1.1. Title

The title of the paper is "G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior".

1.2. Authors

The authors are Junfeng Ni, Yixin Chen, Zhifei Yang, Yu Liu, Ruijie Lu, Song-Chun Zhu, and Siyuan Huang. Their affiliations include Tsinghua University, State Key Laboratory of General Artificial Intelligence (BIGAI), and Peking University. Junfeng Ni's work was done as an intern at BIGAI, and Siyuan Huang is identified as the corresponding author.

1.3. Journal/Conference

The paper was posted to arXiv, a preprint server, on 2025-10-14 (UTC). While arXiv hosts preprints, the posting date suggests the work is intended for a major conference or journal in 2025. Given the topics (3D vision, generative models) and typical publication cycles, it is likely targeted at a top-tier computer vision conference such as CVPR, ICCV, or ECCV, all of which are highly reputable and influential venues in the field.

1.4. Publication Year

The publication year is 2025.

1.5. Abstract

Despite recent advancements in using pre-trained diffusion models as a generative prior for 3D scene reconstruction, current methods suffer from two key limitations. First, they lack reliable geometric supervision, leading to low-quality reconstructions in both observed and unobserved regions. Second, they struggle with multi-view inconsistencies in generated images, causing shape-appearance ambiguities and degraded scene geometry. This paper, G4SPLAT, proposes that accurate geometry is crucial for effectively leveraging generative models. It introduces a novel approach that uses the prevalence of planar structures in scenes to derive accurate metric-scale depth maps, providing reliable geometric supervision for both observed and unobserved areas. This geometry guidance is integrated throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when employing video diffusion models for inpainting. Experimental results on Replica, ScanNet++, and DeepBlending datasets demonstrate that G4SPLAT consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. The method also supports single-view inputs and unposed videos, showcasing strong generalizability and practical applicability in various scenarios.

https://arxiv.org/abs/2510.12099

https://arxiv.org/pdf/2510.12099v1.pdf

2. Executive Summary

2.1. Background & Motivation

The paper addresses the challenge of high-quality 3D scene reconstruction, particularly from sparse input views, which is a critical problem for applications like virtual reality, robotics, and embodied artificial intelligence. Recent progress in 3D Gaussian Splatting (3DGS) has enabled photo-realistic novel view synthesis but struggles with sparse inputs due to insufficient geometric and photometric supervision.

A new wave of methods attempts to overcome this by incorporating generative priors from pre-trained diffusion models to hallucinate missing content in "under-constrained areas" (regions distant from or invisible to input views). However, these methods face two critical limitations:

  1. Lack of reliable geometric supervision: Existing approaches often produce poor reconstruction quality even in observed regions with sparse inputs, which undermines the geometric foundation needed for inpainting unobserved areas. Monocular depth estimators, while useful, suffer from inherent "scale ambiguity" (they can tell relative depth but not absolute, real-world distances).

  2. Ineffective mitigation of multi-view inconsistencies: The generated images from diffusion models often lack consistency across different viewpoints, leading to "shape-appearance ambiguities" (where the reconstructed 3D shape doesn't match the generated 2D appearance from different angles) and degraded overall scene geometry.

    The paper argues that robust and accurate geometry guidance is the fundamental prerequisite for effectively harnessing the power of generative models to achieve high-quality 3D scene reconstruction, especially in previously unobserved regions.

2.2. Main Contributions / Findings

The paper introduces G4SPLAT, a novel framework that integrates accurate geometry guidance with generative priors, offering several key contributions:

  1. Plane-aware, Scale-Accurate Geometric Constraints: The method leverages the prevalence of planar structures (like walls, floors, tables) in man-made environments, consistent with the "Manhattan world assumption", to derive scale-accurate metric depth maps. Unlike prior matching-based methods that fail in non-overlapping regions, G4SPLAT exploits the extensibility of plane representations to provide reliable depth supervision for both observed and unobserved areas. This forms a robust geometric foundation for reconstruction.

  2. Geometry-Guided Generative Pipeline: G4SPLAT incorporates this geometry guidance throughout its generative refinement loop to alleviate shape-appearance ambiguities and multi-view inconsistencies. Specifically, it uses geometry to:

    • Improve "visibility mask estimation" for inpainting by constructing a 3D visibility grid from scale-accurate depth.
    • Guide "novel view selection" by using global 3D planes as object proxies to maximize coverage of complete planar structures, providing richer contextual cues.
    • Enhance "multi-view consistency" during inpainting with video diffusion models by modulating color supervision based on which view best observes a specific 3D plane.
  3. State-of-the-Art Performance and Generalizability: Extensive experiments on Replica, ScanNet++, and DeepBlending datasets demonstrate that G4SPLAT consistently surpasses existing baselines in both geometric and appearance reconstruction, showing particularly strong improvements in unobserved regions.

  4. Support for Diverse Scenarios: The method naturally supports reconstruction from single-view inputs and unposed videos, demonstrating strong generalizability across indoor and outdoor environments, highlighting its practical real-world applicability.

    These findings collectively demonstrate that by prioritizing and meticulously integrating accurate geometry, G4SPLAT effectively overcomes the limitations of prior generative 3D reconstruction methods, leading to more faithful and consistent scene representations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand G4SPLAT, familiarity with several core concepts in computer graphics and vision is essential:

  • 3D Gaussian Splatting (3DGS): A novel 3D scene representation and rendering technique introduced by Kerbl et al. (2023). Instead of voxels or meshes, it represents a 3D scene as a collection of millions of tiny, independent 3D Gaussian primitives. Each Gaussian is defined by its 3D position, covariance matrix (determining its shape and orientation), opacity (how transparent it is), and color (often represented by spherical harmonics to capture view-dependent appearance). When rendering a novel view, these Gaussians are projected onto the 2D image plane and composited using alpha blending, ordered by depth. 3DGS offers impressive photo-realistic rendering quality and real-time performance for novel view synthesis, especially when dense input views are available. However, its performance degrades significantly with sparse input views because there isn't enough information to accurately optimize the vast number of Gaussian parameters, leading to "floaters" (unwanted blurry artifacts) and poor geometry.

  • 2D Gaussian Splatting (2DGS): An extension of 3DGS (Huang et al., 2024a) that collapses the 3D volumetric Gaussians into 2D anisotropic disks. This means that instead of having 3D Gaussians in space, the primitives are essentially 2D ellipses defined on a surface (a "chart" or "surfel") and then projected. This approach aims to improve geometric accuracy and rendering from sparse views by operating more directly on 2D image-based representations. MAtCha (Guédon et al., 2025) is a method built upon 2DGS.

  • Generative Models, particularly Diffusion Models: A class of generative artificial intelligence models capable of producing new data (like images or 3D assets) that resemble the data they were trained on. "Diffusion models" (e.g., Rombach et al., 2022; Blattmann et al., 2023) work by iteratively adding noise to data and then learning to reverse this process to generate clean data from noise. They have shown remarkable success in generating high-quality, diverse images. When used as a "generative prior" in 3D reconstruction, they can "hallucinate" missing content in unobserved regions of a 3D scene, filling in gaps where actual photographic evidence is scarce. "Video diffusion models" (e.g., Ma et al., 2025) extend this capability to generate consistent sequences of images, which is crucial for maintaining "multi-view consistency" in 3D scenes.

  • Manhattan World Assumption / Planar Structures: This assumption (Coughlan & Yuille, 1999) posits that man-made environments predominantly consist of surfaces that are mutually orthogonal (like walls, floors, ceilings) or parallel. This simplifies scene understanding and reconstruction because it implies a large presence of planar structures. In computer vision, detecting and leveraging these planar structures can significantly aid in tasks like depth estimation, 3D reconstruction, and camera pose estimation, as planes provide strong geometric constraints. For instance, if you know a surface is planar, you can extrapolate its depth and normal even from partial observations.

  • Monocular Depth Estimation: The task of predicting the depth (distance from the camera) of each pixel in an image using only a single 2D image. Deep learning models have achieved impressive results in this area (e.g., Yang et al., 2024). However, monocular depth estimators inherently suffer from "scale ambiguity"; they can predict relative depths (e.g., object A is twice as far as object B) but not absolute metric depths (e.g., object A is 5 meters away). This makes them less reliable for precise 3D reconstruction without additional scale information.

  • Structure-from-Motion (SfM): A photogrammetric range imaging technique for estimating 3D structures from 2D image sequences. It works by identifying corresponding features across multiple images and then using triangulation to compute their 3D positions, as well as the camera poses (position and orientation) for each image. SfM provides a sparse 3D point cloud and camera parameters, which are crucial for initializing many 3D reconstruction pipelines. MASt3R-SfM (Duisterhof et al., 2025) is an advanced SfM method.

3.2. Previous Works

The paper categorizes related work into three main areas:

  • Sparse-View 3DGS:

    • Problem: Standard 3DGS performs poorly with few input views due to insufficient geometric and photometric supervision. This leads to floating artifacts and inaccurate geometry.
    • Existing Solutions:
      • DNGaussian (Li et al., 2024a) and FSGS (Zhu et al., 2024) incorporate depth regularization to suppress floaters in visible regions.
      • MAtCha (Guédon et al., 2025) introduces "scale-accurate depth" derived from SfM methods (MASt3R-SfM). MAtCha uses a "chart alignment procedure" that optimizes chart parameters for each input view based on SfM outputs.
    • Limitations of Previous Works: The monocular depth estimators used in some methods suffer from "scale ambiguity" (Yang et al., 2024), making their geometric supervision unreliable. MAtCha, despite using SfM, struggles to reconstruct "non-overlapping regions" between input views because its chart alignment relies on accurate image correspondences, which are absent in such areas. The paper aims to overcome this limitation by leveraging the "extensibility of plane representations".
  • Generative Prior for 3DGS:

    • Problem: While diffusion models provide powerful priors for 3D reconstruction, directly applying them to sparse-view 3DGS can still lead to issues.
    • Existing Solutions: Recent studies (e.g., Poole et al., 2022; Liu et al., 2023) demonstrate the effectiveness of diffusion models. Approaches like GenFusion (Wu et al., 2025c), Difix3D+ (Wu et al., 2025a), GuidedVD (Zhong et al., 2025) and See3D (Ma et al., 2025) employ "video diffusion models" to enhance cross-view consistency during content generation.
    • Limitations of Previous Works: These generative methods primarily rely on initial (often poor) reconstructions to apply their generative power. They still suffer from "numerous floating Gaussian artifacts" and "low-quality 3D geometry" in the inpainted regions, especially due to insufficient geometric constraints under sparse observations. This results in "shape-appearance ambiguities".
  • Plane Assumption in Reconstruction:

    • Background: The "Manhattan-world assumption" and the "plane assumption" have long been adopted in reconstructing man-made environments.
    • Existing Applications:
      • Improving matching accuracy in SfM and SLAM (Liu et al., 2024c; Guo et al., 2024).
      • Directly fitting planes to model indoor scenes (Agarwala et al., 2022; Tan et al., 2023).
      • Optimizing 3D neural implicit representations (Li et al., 2024b; Chen et al., 2024a) or 3DGS (e.g., GeoGaussian (Li et al., 2024d), IndoorGS (Ruan et al., 2025) imposing local planar constraints, PlanarSplatting (Tan et al., 2025) reconstructing planar primitives directly).
    • Differentiation: G4SPLAT uses the plane assumption not just for local constraints or direct plane modeling, but to "extract scale-accurate depth" that provides broad geometric guidance. This guidance is then used both for optimizing the Gaussian representation and for facilitating the integration of generative priors, enabling precise scene reconstruction for both planar and non-planar structures across observed and unobserved regions.

3.3. Technological Evolution

The evolution in 3D reconstruction from images has progressed from traditional SfM/Multi-View Stereo (MVS) to Neural Radiance Fields (NeRFs), and more recently to 3D Gaussian Splatting (3DGS).

  • Early Stages (SfM/MVS): Focused on reconstructing explicit 3D geometry (point clouds, meshes) from multiple images, often struggling with texture-less regions and requiring dense views.
  • Implicit Representations (NeRFs): Introduced a paradigm shift by representing scenes as neural networks, achieving unprecedented photo-realism but often slow for training and rendering.
  • Explicit Representations (3DGS): Offered a balance, providing high-quality rendering with significantly faster training and real-time rendering speeds by using explicit Gaussian primitives.
  • Sparse-View Challenges: All these methods face degraded performance with sparse input views. Initial solutions involved depth regularization or incorporating SfM-derived cues (MAtCha).
  • Generative Prior Integration: The rise of powerful generative models (diffusion models) led to attempts to fill in missing information in sparse-view scenarios by synthesizing plausible content. However, these often introduced inconsistencies and artifacts due to a lack of strong 3D geometric constraints.
  • G4SPLAT's Position: This paper fits into the latest wave by addressing the critical weakness of generative prior methods: the insufficient geometric guidance. It builds upon 2DGS and MAtCha's ideas of depth supervision but significantly enhances the geometric accuracy and consistency by specifically leveraging planar structures and integrating this robust geometry throughout the entire generative pipeline.

3.4. Differentiation Analysis

Compared to the main related works, G4SPLAT introduces several core innovations and differentiators:

  1. Robust Geometry from Planes for Unobserved Regions:

    • MAtCha: Relies on chart alignment with SfM points, which provides scale-accurate depth but fails in non-overlapping or unobserved regions due to lack of correspondences.
    • G4SPLAT: Leverages the "extensibility of plane representations". By fitting global 3D planes from observed regions, it can extrapolate reliable depth estimates to non-overlapping or even unobserved planar regions. This is a crucial distinction as it provides reliable geometric supervision where MAtCha cannot. It also uses this accurate planar depth to linearly align monocular depth for non-planar regions, providing complete and scale-accurate depth maps.
  2. Geometry-Guided Generative Pipeline (End-to-End Integration):

    • Generative Prior Methods (e.g., GenFusion, Difix3D+, GuidedVD, See3D): These methods introduce generative priors but often suffer from "shape-appearance ambiguities" and "multi-view inconsistencies" because they lack strong geometric guidance for the generation process itself. They tend to use generative models to fill in based on an existing (and often flawed) 3D reconstruction.
    • G4SPLAT: Integrates geometry guidance throughout the generative pipeline. This means geometry informs:
      • Visibility Mask Estimation: Using a 3D visibility grid derived from scale-accurate depth, instead of noisy alpha maps.
      • Novel View Selection: Guiding view selection to maximize coverage of planar structures, providing better context for generation.
      • Multi-view Consistency: Modulating color supervision from generated views based on which view best observes a 3D plane, reducing cross-view conflicts during inpainting. This is a more proactive and integrated approach to ensure consistency, rather than just post-hoc regularization.
  3. Holistic Solution for Sparse and Unobserved Areas: G4SPLAT provides a more complete solution for sparse-view 3DGS by not only improving reconstruction in observed areas but specifically targeting the faithful and consistent reconstruction of previously unobserved regions, which remains a major challenge for other methods.

    In essence, G4SPLAT's innovation lies in identifying accurate geometry as the primary driver for effective generative 3D reconstruction and then meticulously weaving this geometric understanding into every stage of the generative process, from foundational depth estimation to guiding how generative models are used to fill in missing scene parts.

4. Methodology

The core idea of G4SPLAT is that robust geometry guidance is fundamental for effectively utilizing generative models to achieve high-quality 3D scene reconstruction. The method achieves this by first leveraging planar structures to derive scale-accurate metric depth maps and then integrating this geometry guidance throughout the generative pipeline. This integration improves visibility mask estimation, guides novel view selection, and enhances multi-view consistency during inpainting with video diffusion models. The theoretical basis builds upon 2D Gaussian Splatting and incorporates principles of geometric consistency from the Manhattan world assumption and robust statistical methods like RANSAC for plane estimation.

The overall workflow of G4SPLAT consists of two main stages: an initialization stage and an iterative geometry-guided generative training loop.

The following figure illustrates the overview of G4SPLAT's workflow:


Figure 2: Overview of G4SPLAT. For each training loop (Section 3.4), we first extract global 3D planes from all training views and compute plane-aware depth maps (Section 3.2). Subsequently, we construct a visibility grid from these depth maps, select plane-aware novel views, inpaint their invisible regions, and incorporate the completed views back into the training set (Section 3.3).

4.1. Background

G4SPLAT builds upon and extends concepts from 2D Gaussian Splatting (2DGS) and MAtCha.

4.1.1. 2D Gaussian Splatting (2DGS) Rendering

In 2DGS, a 3D scene is represented by a collection of 2D anisotropic disks (Gaussians) that are collapsed from 3D volumetric Gaussians. Each 2D Gaussian is associated with an opacity $\alpha$ and a view-dependent color $\mathbf{c}$ (represented using spherical harmonics).

The value at a point $\mathbf{u} = (u, v)$ within the disk of a 2D Gaussian is defined by the Gaussian function:
$$g(\mathbf{u}) = \exp\left(-\frac{u^2 + v^2}{2}\right)$$
where $\mathbf{u}$ represents the normalized 2D coordinates within the Gaussian disk.

During rendering, Gaussians are depth-sorted and composited using front-to-back alpha blending to compute the final pixel color $\mathbf{C}(\mathbf{x})$:
$$\mathbf{C}(\mathbf{x}) = \sum_{i=1}^N \mathbf{c}_i \alpha_i g_i\big(\mathbf{u}(\mathbf{x})\big) \prod_{j=1}^{i-1} \big[1 - \alpha_j g_j\big(\mathbf{u}(\mathbf{x})\big)\big]$$
Here, $N$ is the total number of Gaussians, $\mathbf{c}_i$ is the color of the $i$-th Gaussian, $\alpha_i$ is its opacity, and $g_i(\mathbf{u}(\mathbf{x}))$ is the Gaussian function value at the projected pixel coordinate $\mathbf{u}(\mathbf{x})$. The product $\prod_{j=1}^{i-1} \big[1 - \alpha_j g_j\big(\mathbf{u}(\mathbf{x})\big)\big]$ accounts for the accumulated transparency of the preceding Gaussians.

Similarly, the depth map $d(\mathbf{x})$ can be computed by replacing colors with the $z$-buffer values of the corresponding Gaussians:
$$d(\mathbf{x}) = \sum_{i=1}^N d_i \alpha_i g_i\big(\mathbf{u}(\mathbf{x})\big) \prod_{j=1}^{i-1} \big[1 - \alpha_j g_j\big(\mathbf{u}(\mathbf{x})\big)\big]$$
Here, $d_i$ represents the $z$-buffer value (depth) of the $i$-th 2D Gaussian disk.
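To make the compositing concrete, the following NumPy sketch blends depth-sorted Gaussians for a single pixel following the two equations above. The array names (`alphas`, `g_vals`, `colors`, `depths`) are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def composite_pixel(alphas, g_vals, colors, depths):
    """Front-to-back alpha blending for one pixel.

    alphas:  (N,)  opacity of each depth-sorted Gaussian
    g_vals:  (N,)  Gaussian kernel value g_i(u(x)) at this pixel
    colors:  (N,3) view-dependent RGB of each Gaussian
    depths:  (N,)  z-buffer value of each Gaussian
    Returns the blended color C(x) and depth d(x).
    """
    transmittance = 1.0            # running product of (1 - alpha_j g_j)
    color = np.zeros(3)
    depth = 0.0
    for a, g, c, d in zip(alphas, g_vals, colors, depths):
        w = a * g * transmittance  # blending weight of this Gaussian
        color += w * c
        depth += w * d
        transmittance *= (1.0 - a * g)
    return color, depth
```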

4.1.2. Chart Representation and Alignment in MAtCha

MAtCha (Guédon et al., 2025) extends 2DGS by representing each chart (a surface patch) with a lightweight deformation model. This model uses a sparse 2D grid of learnable features in UV space, mapped to 3D deformation vectors by a small Multi-Layer Perceptron (MLP). Depth-dependent features are added to handle depth discontinuities. Each chart is initialized from a monocular depth map.

MAtCha performs an alignment stage to ensure geometric consistency across views by optimizing chart deformations using three objectives:

  1. Fitting Loss ($\mathcal{L}_{\mathrm{fit}}$): Aligns charts with sparse Structure-from-Motion (SfM) points.
$$\mathcal{L}_{\mathrm{fit}} = \frac{1}{n} \sum_{i=0}^{n-1} \sum_{k=0}^{m_i-1} C_i(u_{ik}) \|\psi_i(u_{ik}) - p_{ik}\|_1 - \alpha \sum_{i=0}^{n-1} \log(C_i)$$
Here, $\psi_i(u)$ is the deformed 3D position of a UV coordinate $u$ on chart $i$, $p_{ik}$ is the 3D position of the $k$-th SfM point visible in image $i$, and $C_i$ is a learnable confidence map that downweights unreliable SfM points. $n$ is the number of charts, and $m_i$ is the number of SfM points associated with chart $i$. The second term with $\log(C_i)$ encourages high confidence for relevant points.

  2. Structure Loss ($\mathcal{L}_{\mathrm{struct}}$): Preserves sharp geometric structures from the initial depth maps.
$$\mathcal{L}_{\mathrm{struct}} = \sum_{i=0}^{n-1} \left(1 - \boldsymbol{N}_i \cdot \boldsymbol{N}_i^{(0)}\right) + \frac{1}{4} \sum_{i=0}^{n-1} \|\boldsymbol{M}_i - \boldsymbol{M}_i^{(0)}\|_1$$
$\boldsymbol{N}_i$ and $\boldsymbol{M}_i$ are the surface normal and mean curvature of the deformed chart $i$, respectively; $\boldsymbol{N}_i^{(0)}$ and $\boldsymbol{M}_i^{(0)}$ are their counterparts derived from the initial depth maps. The first term encourages alignment of normals, and the second regularizes curvature.

  3. Mutual Alignment Loss ($\mathcal{L}_{\mathrm{align}}$): Enforces global coherence by aligning neighboring charts.
$$\mathcal{L}_{\mathrm{align}} = \sum_{i,j=0}^{n-1} \sum_{u \in V_i} \min\left(\|\psi_i(u) - \psi_j \circ P_j \circ \psi_i(u)\|_1, \tau\right)$$
$V_i$ is the set of sampled UV coordinates on chart $i$, $P_j$ is the projection from 3D space to the UV domain of chart $j$, and $\tau$ is an attraction threshold that limits the maximum alignment distance, preventing spurious alignments.

The overall optimization objective for chart alignment in MAtCha is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{fit}} + \lambda_{\mathrm{struct}} \mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}}$$
with weights $\lambda_{\mathrm{struct}} = 4$ and $\lambda_{\mathrm{align}} = 5$. While this provides coherent geometry, its accuracy is limited by image correspondences, leading to errors in sparse or non-overlapping regions.
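As an illustration of how the confidence-weighted fitting term behaves, here is a toy NumPy sketch under simplifying assumptions (per-point confidences, a hypothetical weight `alpha`); it is not MAtCha's actual code.

```python
import numpy as np

def fitting_loss(deformed_points, sfm_points, conf, alpha=0.1):
    """L_fit: confidence-weighted L1 distance between deformed chart points
    psi_i(u_ik) and their associated SfM points p_ik, plus a log-confidence
    term that discourages trivially low confidences.

    deformed_points, sfm_points, conf: lists over charts i, holding
    (m_i, 3), (m_i, 3) and (m_i,) arrays respectively.
    alpha: weight of the log-confidence regularizer (assumed value).
    """
    n = len(deformed_points)
    data_term, conf_term = 0.0, 0.0
    for psi, p, c in zip(deformed_points, sfm_points, conf):
        l1 = np.abs(psi - p).sum(axis=-1)      # per-point L1 distance
        data_term += (c * l1).sum()
        conf_term += np.log(np.clip(c, 1e-6, None)).sum()
    return data_term / n - alpha * conf_term
```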

4.1.3. Gaussian Surfel Refinement in MAtCha

MAtCha further refines the reconstruction using 2DGS with a joint loss combining photometric consistency and geometric regularization.

  1. RGB Loss ($\mathcal{L}_{\mathrm{rgb}}$): A weighted combination of L1 loss and D-SSIM (structural dissimilarity).
$$\mathcal{L}_{\mathrm{rgb}} = (1 - \lambda) \mathcal{L}_1 + \lambda \mathcal{L}_{\mathrm{D-SSIM}}$$
Here, $\mathcal{L}_1$ is the L1 loss (mean absolute difference), $\mathcal{L}_{\mathrm{D-SSIM}}$ is the D-SSIM loss, and $\lambda = 0.2$ is a weighting parameter.

  2. Regularization Loss ($\mathcal{L}_{\mathrm{reg}}$): Consists of a distortion loss and a depth-normal consistency loss.

    • Distortion Loss ($\mathcal{L}_d$): Prevents surfel drifting and enforces cross-chart consistency.
$$\mathcal{L}_d = \sum_{i,j} \omega_i \omega_j |z_i - z_j|$$
$z_i$ denotes the intersection depth of the $i$-th surfel, and $\omega_i$ is its blending weight. This term encourages Gaussians that overlap in 2D to also have similar depths.
    • Depth-Normal Consistency Loss ($\mathcal{L}_n$): Promotes alignment of surface orientations.
$$\mathcal{L}_n = \sum_i \omega_i (1 - \mathbf{n}_i^{\mathrm{T}} \mathbf{N}_p)$$
$\mathbf{n}_i$ is the normal of the surfel, and $\mathbf{N}_p$ is the normal derived from the depth gradient of the rendered view. This term encourages the surfel's normal to align with the surface normal estimated from the depth map.
The overall regularization loss is:
$$\mathcal{L}_{\mathrm{reg}} = \lambda_d \mathcal{L}_d + \lambda_n \mathcal{L}_n$$
with weights $\lambda_d = 500$ and $\lambda_n = 0.25$.

  3. Structure Loss ($\mathcal{L}_{\mathrm{struct}}$): Similar to the chart alignment stage, this loss enforces geometric consistency between the rendered Gaussians and the charts.
$$\mathcal{L}_{\mathrm{struct}} = \sum_{i=0}^{n-1} \|\bar{D}_i - D_i\|_1 + \sum_{i=0}^{n-1} (1 - \bar{N}_i \cdot N_i) + \frac{1}{4} \sum_{i=0}^{n-1} \|\bar{M}_i - M_i\|_1$$
$\bar{D}_i, \bar{N}_i$, and $\bar{M}_i$ are the depth, normal, and mean curvature rendered from the Gaussian surfels; $D_i, N_i$, and $M_i$ are the corresponding values obtained from the MAtCha charts.

The complete optimization objective for Gaussian surfel refinement is:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rgb}} + \mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{struct}}$$
G4SPLAT uses this same total loss formulation but enhances it by substituting MAtCha's chart-derived $D_i, N_i, M_i$ with its own plane-aware depth maps to provide stronger geometric constraints.
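For intuition, a minimal sketch of the photometric term $\mathcal{L}_{\mathrm{rgb}}$ is shown below, using a simplified global-statistics SSIM and the common convention D-SSIM = (1 - SSIM) / 2; real implementations use a windowed SSIM, so treat this only as an illustration.

```python
import numpy as np

def ssim_global(x, y, max_val=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM using global image statistics (real implementations
    use local windows); x, y are float arrays scaled to [0, max_val]."""
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def rgb_loss(render, gt, lam=0.2):
    """L_rgb = (1 - lam) * L1 + lam * D-SSIM, with D-SSIM = (1 - SSIM) / 2."""
    l1 = np.abs(render - gt).mean()
    d_ssim = (1.0 - ssim_global(render, gt)) / 2.0
    return (1.0 - lam) * l1 + lam * d_ssim
```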

4.2. Plane-Aware Geometry Modeling

This is the first key innovation of G4SPLAT, aiming to provide scale-accurate depth maps by leveraging planar structures.

4.2.1. Per-view 2D Plane Extraction

The method assumes that planar regions in an image have consistent normal directions, smooth geometry, and similar semantics, which is generally true for man-made environments. Plane masks are extracted from images by combining "normal maps" (which indicate surface orientation) with "instance masks" generated by the Segment Anything Model (SAM) (Kirillov et al., 2023). Specifically:

  1. K-means clustering is applied to the normal map to group pixels with similar surface orientations, identifying regions that are potentially planar.

  2. These regions are then filtered using SAM masks, which provide object-level segmentation.

  3. Only regions that are assigned the same instance label by SAM and exceed a predefined size threshold are considered valid "2D plane masks".

    The following figure (part C) shows examples of intermediate results:


Figure 3: Visualization of intermediate results. Our method addresses key issues in prior approaches: (a) MAtCha produces noticeable errors in non-overlapping regions (highlighted by circles); (b) masks derived from alpha maps contain errors in visible areas of novel views; and (c) naive novel view selection offers only local coverage, causing visible seams in the final reconstruction.
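A rough sketch of the per-view plane-mask extraction, assuming a per-pixel normal map and a SAM instance-label map are already available; the clustering details (number of normal clusters, size threshold) are hypothetical choices, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_plane_masks(normal_map, instance_map, n_clusters=8, min_pixels=5000):
    """Cluster pixels by surface normal, then keep (normal-cluster, instance)
    regions that are large enough to be treated as 2D plane masks.

    normal_map:   (H, W, 3) unit normals
    instance_map: (H, W) integer SAM instance labels
    Returns a list of boolean (H, W) plane masks.
    """
    h, w, _ = normal_map.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        normal_map.reshape(-1, 3)).reshape(h, w)
    masks = []
    for c in range(n_clusters):
        for inst in np.unique(instance_map):
            mask = (labels == c) & (instance_map == inst)
            if mask.sum() >= min_pixels:   # size threshold on the region
                masks.append(mask)
    return masks
```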

4.2.2. Global 3D Plane Estimation

2D plane masks from individual views are often fragmented. To obtain globally consistent 3D planes, G4SPLAT establishes correspondences among these local masks using the 3D scene point cloud.

  1. For each per-view 2D plane mask, the associated 3D points are collected from the scene point cloud by projecting them.
  2. Two local planes are merged into the same "global 3D plane" if their associated 3D point sets have sufficient spatial overlap and similar normal directions. This process is repeated across all views to yield a set of global point collections $\{ \mathcal{P}_k \}$, each representing a global 3D plane.
    • Merging Details (Section C.1): To address occlusion and insufficient point density, the method first renders depth maps using Gaussian surfels and back-projects them to reconstruct the 3D scene point cloud. When projecting this point cloud to a viewpoint, points whose depth deviation from the Gaussian-rendered depth exceeds 1% are discarded, ensuring accuracy and handling occlusions.
  3. Each global 3D plane is represented by the equation:
$$\Phi_k : \mathbf{n}_k^{\top} \mathbf{x} + d_k = 0$$
where $\mathbf{n}_k \in \mathbb{R}^3$ is the unit normal vector (perpendicular to the plane) and $d_k \in \mathbb{R}$ is the offset from the origin along the normal direction.
  4. To robustly estimate the plane parameters ($\mathbf{n}_k, d_k$), a subset of high-confidence points $\mathcal{P}_k^{\mathrm{conf}} \subset \mathcal{P}_k$ (points observed in at least three views) is selected. The parameters are then optimized using RANSAC (Random Sample Consensus; Fischler & Bolles, 1981) by minimizing the sum of squared distances from the points to the plane:
$$\min_{\mathbf{n}_k, d_k} \sum_{\mathbf{p} \in \mathcal{P}_k^{\mathrm{conf}}} (\mathbf{n}_k^{\top} \mathbf{p} + d_k)^2 , \quad \mathrm{s.t.~} \|\mathbf{n}_k\| = 1$$
Here, $\mathbf{p}$ denotes a 3D point in the high-confidence set. This procedure yields geometrically accurate and cross-view consistent 3D plane estimates.
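The RANSAC fit of the objective above can be sketched as follows; the iteration count and inlier threshold are assumed values, and the final refit uses the standard SVD-based least-squares plane fit.

```python
import numpy as np

def fit_plane_ransac(points, n_iters=500, inlier_thresh=0.02, rng=None):
    """Estimate plane parameters (n, d) with n^T x + d = 0 and ||n|| = 1,
    via RANSAC followed by a least-squares refit on the inliers.

    points: (M, 3) high-confidence 3D points of one global plane.
    inlier_thresh: point-to-plane distance threshold (assumed, in metres).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-8:          # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # least-squares refit: normal = smallest singular vector of centred inliers
    p = points[best_inliers]
    centroid = p.mean(axis=0)
    _, _, vt = np.linalg.svd(p - centroid)
    n = vt[-1]
    return n, -n @ centroid
```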

4.2.3. Plane-Aware Depth Map Extraction

With the estimated global 3D planes, a "plane-aware depth map" $D^v$ is extracted for each view $v$.

  1. For each pixel $\mathbf{u}$ within a 2D plane mask $P_i^v$ (associated with global 3D plane $\Phi_{k_i}$), its depth is computed by intersecting the camera ray with the global 3D plane:
$$D_i^v(\mathbf{u}) = \frac{-\mathbf{n}_{k_i}^{\top} \mathbf{o}^v - d_{k_i}}{\mathbf{n}_{k_i}^{\top} \mathbf{r}^v(\mathbf{u})}$$
where $\mathbf{o}^v$ is the camera center for view $v$, and $\mathbf{r}^v(\mathbf{u})$ is the ray direction from the camera center through pixel $\mathbf{u}$. This provides accurate, metric-scale depth for planar regions.
  2. For non-planar regions, a pre-trained monocular depth estimator (Yang et al., 2024) provides a relative depth map $\hat{D}^v$. To convert this to metric scale and align it with the accurate planar depths, a linear transformation is applied:
$$D^v(\mathbf{u}) = a_v \hat{D}^v(\mathbf{u}) + b_v$$
The scale $a_v$ and offset $b_v$ are estimated by least-squares fitting over pixels belonging to the already-computed planar regions. This effectively calibrates the monocular depth using the known scale of the planes.

The resulting $D^v$ integrates geometry-consistent plane depths with refined monocular predictions, yielding a complete, scale-accurate, plane-aware depth map for each view and alleviating errors in non-overlapping regions (Fig. 3a).
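A minimal sketch of both steps, assuming camera rays and plane masks are already computed; the helper names are illustrative, not the authors' API.

```python
import numpy as np

def plane_depth(plane_n, plane_d, cam_center, ray_dirs):
    """Depth along each ray where it intersects the plane n^T x + d = 0.
    cam_center: (3,) camera origin o^v; ray_dirs: (H, W, 3) ray directions."""
    denom = ray_dirs @ plane_n                         # n^T r(u) per pixel
    return (-plane_n @ cam_center - plane_d) / denom

def align_monocular_depth(mono_depth, plane_depth_map, plane_mask):
    """Least-squares scale/offset (a_v, b_v) so that a_v * mono + b_v matches
    the plane-derived metric depth on planar pixels; applied to all pixels."""
    x = mono_depth[plane_mask]
    y = plane_depth_map[plane_mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * mono_depth + b
```

The final $D^v$ then takes the plane-intersection depth on plane-mask pixels and the aligned monocular depth everywhere else.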

4.3. Geometry-Guided Generative Pipeline

Beyond providing an improved geometry basis, G4SPLAT integrates this geometry guidance directly into the generative refinement loop.

4.3.1. Geometry-Guided Visibility

Existing methods often rely on inpainting masks derived from "alpha maps" (transparency maps), which can be noisy and contain errors, especially in visible regions (Fig. 3b). G4SPLAT uses scale-accurate plane-aware depth to model scene visibility more reliably.

  1. 3D Boundaries and Voxel Grid: The 3D boundaries of the scene are determined from the depth maps of all training views, and the scene is discretized into a voxel grid $\mathcal{G}$.
  2. Voxel Visibility: For each voxel, visibility is determined by projecting its center into each training view and checking whether it falls within the valid depth range. A voxel is marked as visible (visibility value = 1) if it is observable in at least one view. This is done in parallel for efficiency.
  3. Pixel-wise Visibility Map ($V^v(\mathbf{u})$): For a novel view $v$, the visibility map is rendered using the corresponding GS-rendered depth map. A ray is cast from the camera center through each pixel $\mathbf{u}$, and $Q$ points are uniformly sampled along the ray up to the rendered depth. The visibility value $v_q$ of each sample point is obtained via nearest-neighbor interpolation from the 3D visibility grid $\mathcal{G}$. The final per-pixel visibility is computed as:
$$V^v(\mathbf{u}) = \prod_{q=1}^Q v_q$$
That is, a pixel is considered visible only if all $Q$ sampled points along its viewing ray are marked as visible in $\mathcal{G}$. This yields a much more accurate inpainting mask.
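A compact sketch of the per-pixel visibility test, assuming the binary voxel grid has already been built from the training views; the number of ray samples `num_samples` is an assumed value.

```python
import numpy as np

def pixel_visibility(grid, grid_origin, voxel_size, cam_center, ray_dirs,
                     rendered_depth, num_samples=32):
    """Per-pixel visibility V^v(u): sample Q points along each ray up to the
    GS-rendered depth and look up nearest-neighbour visibility in the grid.

    grid: (X, Y, Z) binary array, 1 = voxel observed in >= 1 training view.
    Returns an (H, W) mask that is 1 only if all samples are visible.
    """
    t = np.linspace(1.0 / num_samples, 1.0, num_samples)          # (Q,)
    # sample points along each ray: (H, W, Q, 3)
    pts = cam_center + ray_dirs[..., None, :] * \
        (rendered_depth[..., None, None] * t[:, None])
    idx = np.round((pts - grid_origin) / voxel_size).astype(int)  # nearest voxel
    idx = np.clip(idx, 0, np.array(grid.shape) - 1)
    vis = grid[idx[..., 0], idx[..., 1], idx[..., 2]]             # (H, W, Q)
    return vis.prod(axis=-1)
```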

4.3.2. Plane-Aware Novel View Selection

Naive novel view selection (e.g., elliptical trajectories) often provides limited local coverage and can lead to inconsistencies (Fig. 3c). G4SPLAT proposes a "plane-aware view selection strategy" to ensure complete object coverage for selected views.

  1. Object Proxies: Global 3D planes are used as object proxies, providing structural and textural cues.
  2. Selection Criteria: For each global 3D plane, its centroid is used as the look-at target. The method searches for a camera center $\mathbf{c}$ among the visible grid centers $\mathcal{C}$ of the visibility grid. The selection is guided by three objectives (Section C.2):
    • Maximizing coverage $R(\mathbf{c})$: the ratio of plane points visible from $\mathbf{c}$ to the total number of plane points.
    • Minimizing the distance $D(\mathbf{c}, \mathbf{p}, \mathbf{n})$ from the camera center to the look-at point $\mathbf{p}$ on the plane.
    • Aligning the viewing direction with the plane normal via $|\cos \theta(\mathbf{c}, \mathbf{p}, \mathbf{n})|$, where $\theta$ is the angle between the viewing direction $(\mathbf{p} - \mathbf{c})$ and the plane normal $\mathbf{n}$.
The camera center $\mathbf{c}^*$ is chosen by maximizing a composite score:
$$\mathbf{c}^* = \operatorname*{arg\,max}_{\mathbf{c} \in \mathcal{C}} \Big( R(\mathbf{c}) + \big|\cos \theta(\mathbf{c}, \mathbf{p}, \mathbf{n})\big| - D(\mathbf{c}, \mathbf{p}, \mathbf{n}) \Big)$$
Additionally, an elliptical trajectory around the scene center is incorporated to enhance view diversity.
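The composite score can be evaluated with a simple search over candidate centers, as sketched below; `visibility_fn` is a hypothetical helper returning the coverage ratio $R(\mathbf{c})$, and in practice the distance term may be normalized.

```python
import numpy as np

def select_view(candidate_centers, plane_centroid, plane_normal, visibility_fn):
    """Pick the camera centre c* maximising R(c) + |cos(theta)| - D(c, p, n).

    candidate_centers: (M, 3) visible voxel centres to search over.
    visibility_fn(c):  fraction of the plane's points visible from c.
    """
    best_score, best_c = -np.inf, None
    for c in candidate_centers:
        view_dir = plane_centroid - c               # (p - c)
        dist = np.linalg.norm(view_dir)
        cos_theta = abs(view_dir @ plane_normal) / (dist + 1e-8)
        score = visibility_fn(c) + cos_theta - dist
        if score > best_score:
            best_score, best_c = score, c
    return best_c
```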

4.3.3. Geometry-Guided Inpainting

For each selected novel view $v$, the raw RGB map $\tilde{I}^v$ and the geometry-guided visibility mask $V^v$ are rendered. A pre-trained video diffusion model (e.g., See3D by Ma et al., 2025) is then used: it takes the reference input images $\{I^i\}$ and the rendered $\tilde{I}^v, V^v$ as input to jointly inpaint the occluded regions across all generated views, producing completed images $\{\hat{I}^v\}$.

4.3.4. Mitigating Multi-view Inconsistencies

Despite joint inference by video diffusion models, inpainted results often show multi-view inconsistencies (Fig. A3). To mitigate this, G4SPLAT introduces a strategy (Section C.3):

  1. Depth Projection for Correspondence: Correspondences across views are established by depth projection before training the Gaussian representation.
  2. Modulated Color Supervision: For each region, color supervision primarily comes from a single, most reliable view:
    • Planar Regions: If a region lies on a 3D plane, the view that observes that plane most completely is selected for color supervision.
    • Non-planar Regions: The first view where the region is observed is selected.
These operations are parallelized via geometric projection and act as a pre-processing step, minimizing computational overhead. This strategy significantly reduces cross-view conflicts.
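A small sketch of the view-assignment rule described above, assuming per-view coverage ratios for each plane and a first-observation index for non-planar regions have been precomputed; the data layout is illustrative.

```python
import numpy as np

def choose_supervising_views(plane_coverage, first_seen_view):
    """Assign one supervising view per region to avoid cross-view conflicts.

    plane_coverage:  {plane_id: (V,) array, fraction of the plane visible in each view}
    first_seen_view: {region_id: index of the first view observing the region}
    """
    planar = {pid: int(np.argmax(cov)) for pid, cov in plane_coverage.items()}
    non_planar = dict(first_seen_view)  # first observation wins for non-planar regions
    return planar, non_planar
```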

4.4. Overall Training Strategy

G4SPLAT's training pipeline is structured in two stages:

  1. Initialization Stage:

    • MAtCha's chart alignment is used to obtain initial depth maps for input views.
    • These depth maps are then used to estimate global 3D planes and compute plane-aware depth maps (as described in Section 4.2).
    • Gaussian parameters are initialized from the point cloud derived from these depth maps.
    • The Gaussians are trained using the plane-aware depth maps, yielding a baseline model with accurate geometry in observed regions.
  2. Geometry-Guided Generative Training Loop: This stage iteratively refines and extends the reconstruction. Each loop involves:

    • Visibility Grid Construction: From the current set of training views.

    • Novel View Selection: Using the plane-aware strategy.

    • Inpainting: Invisible regions of the selected novel views are inpainted using a video diffusion model (geometry-guided).

    • Training Set Augmentation: The inpainted novel views are merged into the training set.

    • Recomputation and Fine-tuning: Global 3D planes and plane-aware depths are recomputed based on the augmented training set. The Gaussians are then fine-tuned with this updated supervision.

      This iterative process progressively recovers unseen regions and corrects geometric misalignments. The total loss formulation used during 2DGS training is the same as MAtCha's Ltotal\mathcal{L}_{\mathrm{total}} (Eq. B12), but critically, the structure loss term (Di,Ni,MiD_i, N_i, M_i) now leverages the more robust "plane-aware depth maps" instead of MAtCha's chart-derived values. The authors use three generative training loops, with each initialization or loop running for 7000 iterations.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on both synthetic and real-world datasets to ensure comprehensive evaluation.

  • Replica (Straub et al., 2019): A synthetic dataset comprising 8 diverse indoor scenes. These scenes offer high-quality ground truth geometry and photorealistic textures, making them ideal for quantitative evaluation.
    • Example of a data sample: The Replica dataset contains high-fidelity 3D models of indoor environments, including furniture, objects, and architectural elements, rendered from various camera viewpoints. Images would show typical indoor room settings (e.g., a living room, office, bedroom) with realistic lighting and textures.
  • ScanNet++ (Yeshwanth et al., 2023): A real-world dataset containing 6 indoor scenes. This dataset provides real-world complexity, including varied textures, lighting, and object arrangements, which is crucial for assessing practical applicability.
    • Example of a data sample: ScanNet++ scenes are typically captured using 3D scanning equipment, providing RGB-D images and corresponding camera poses. Images would be similar to real photographs of indoor spaces, potentially showing noise or imperfections inherent in real-world data.
  • DeepBlending (Hedman et al., 2018): A real-world dataset including 3 scenes. This dataset is known for its challenging scenarios with complex lighting and potentially sparse captures, which helps evaluate robustness.
    • Example of a data sample: DeepBlending images often feature complex lighting effects, reflections, and fine details. Scenes might include museum exhibits or intricate sculptures, posing challenges for accurate photometric and geometric reconstruction.

Dataset Usage Details:

  • For each scene, 100 images are uniformly sampled.
  • For ScanNet++ and DeepBlending, 5 images are randomly selected as input views, and the remaining 95 images are used for testing.
  • For Replica, experiments are performed with 5, 10, and 15 input views to evaluate performance under varying sparsity levels. In all Replica experiments, a consistent set of 85 test views is used.

5.2. Evaluation Metrics

The paper uses a comprehensive set of metrics to evaluate both geometric reconstruction quality and rendering quality.

The following metric definitions are taken from Table A2 of the original paper:

| Metric | Definition |
| --- | --- |
| Chamfer Distance (CD) | |
| Accuracy | $\mathrm{mean}_{\mathbf{p} \in P}\left(\min_{\mathbf{p}^* \in P^*} \Vert\mathbf{p} - \mathbf{p}^*\Vert_1\right)$ |
| Completeness | $\mathrm{mean}_{\mathbf{p}^* \in P^*}\left(\min_{\mathbf{p} \in P} \Vert\mathbf{p} - \mathbf{p}^*\Vert_1\right)$ |
| F-score | $\frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ |
| Precision | $\mathrm{mean}_{\mathbf{p} \in P}\left(\min_{\mathbf{p}^* \in P^*} \Vert\mathbf{p} - \mathbf{p}^*\Vert_1 < 0.05\right)$ |
| Recall | $\mathrm{mean}_{\mathbf{p}^* \in P^*}\left(\min_{\mathbf{p} \in P} \Vert\mathbf{p} - \mathbf{p}^*\Vert_1 < 0.05\right)$ |
| Normal Accuracy | $\mathrm{mean}_{\mathbf{p} \in P}\left(\mathbf{n}_{\mathbf{p}}^{\mathrm{T}} \mathbf{n}_{\mathbf{p}^*}\right) \mathrm{~s.t.~} \mathbf{p}^* = \operatorname{argmin}_{\mathbf{q}^* \in P^*} \Vert\mathbf{p} - \mathbf{q}^*\Vert_1$ |
| Normal Completeness | $\mathrm{mean}_{\mathbf{p}^* \in P^*}\left(\mathbf{n}_{\mathbf{p}}^{\mathrm{T}} \mathbf{n}_{\mathbf{p}^*}\right) \mathrm{~s.t.~} \mathbf{p} = \operatorname{argmin}_{\mathbf{q} \in P} \Vert\mathbf{q} - \mathbf{p}^*\Vert_1$ |

Reconstruction Metrics: These metrics quantify how well the reconstructed 3D geometry matches the ground truth.

  • Chamfer Distance (CD) ($\downarrow$): Measures the average closest-point distance between two point clouds. A lower CD indicates better geometric alignment.

    • Conceptual Definition: CD quantifies the overall difference in shape and position between two point clouds, typically a predicted point cloud $P$ and a ground-truth point cloud $P^*$. It averages the distances from each point in one cloud to its nearest neighbor in the other, and vice versa.
    • Mathematical Formula: The paper defines CD through its Accuracy and Completeness components.
      • Accuracy: $\mathrm{mean}_{\mathbf{p} \in P}\left(\min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_1\right)$
      • Completeness: $\mathrm{mean}_{\mathbf{p}^* \in P^*}\left(\min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_1\right)$
      The Chamfer Distance is typically the sum of these two terms.
    • Symbol Explanation:
      • $P$: The point cloud sampled from the predicted (reconstructed) mesh.
      • $P^*$: The point cloud sampled from the ground-truth mesh.
      • $\mathbf{p}$: A point in the predicted point cloud $P$.
      • $\mathbf{p}^*$: A point in the ground-truth point cloud $P^*$.
      • $\|\mathbf{p} - \mathbf{p}^*\|_1$: The L1 norm (Manhattan distance) between the two points.
      • $\min_{\mathbf{p}^* \in P^*}(\dots)$: The minimum distance from point $\mathbf{p}$ to any point in $P^*$.
      • $\mathrm{mean}_{\mathbf{p} \in P}(\dots)$: The average of these distances over all points in $P$.
  • F-Score ($\uparrow$): The harmonic mean of precision and recall, commonly used to evaluate the similarity between two shapes or point clouds under a distance threshold. A higher F-Score indicates a better match.

    • Conceptual Definition: The F-Score assesses the overlap between two point clouds by considering both how many predicted points are close to ground-truth points (Precision) and how many ground-truth points are covered by predicted points (Recall). It is particularly useful when a threshold is applied to define "closeness".
    • Mathematical Formula:
      • $F_{\text{score}} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
      • Precision: $\mathrm{mean}_{\mathbf{p} \in P}\left(\min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_1 < 0.05\right)$
      • Recall: $\mathrm{mean}_{\mathbf{p}^* \in P^*}\left(\min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_1 < 0.05\right)$
    • Symbol Explanation:
      • $P$: The point cloud sampled from the predicted mesh.
      • $P^*$: The point cloud sampled from the ground-truth mesh.
      • $\mathbf{p}$: A point in the predicted point cloud $P$.
      • $\mathbf{p}^*$: A point in the ground-truth point cloud $P^*$.
      • $\|\mathbf{p} - \mathbf{p}^*\|_1$: The L1 norm (Manhattan distance) between the two points.
      • $\min_{\mathbf{p}^* \in P^*}(\dots)$: The minimum distance from point $\mathbf{p}$ to any point in $P^*$.
      • $< 0.05$: A threshold (5 cm, as stated in the paper) defining what constitutes a match; points closer than 5 cm are considered matching.
      • $\mathrm{mean}_{\mathbf{p} \in P}(\dots)$: The average of the boolean condition (true = 1, false = 0) over all points, i.e., the proportion of points satisfying it.
  • Normal Consistency (NC) ($\uparrow$): Measures the similarity of surface normals between the reconstructed and ground-truth models. A higher NC indicates better geometric accuracy, especially regarding surface orientation.

    • Conceptual Definition: NC evaluates how well the orientation of the reconstructed surface matches the ground-truth surface. It is computed by comparing the normal vectors at corresponding (nearest-neighbor) points.
    • Mathematical Formula: The paper defines NC through its Accuracy and Completeness components.
      • Normal Accuracy: $\mathrm{mean}_{\mathbf{p} \in P}\left(\mathbf{n}_{\mathbf{p}}^{\mathrm{T}} \mathbf{n}_{\mathbf{p}^*}\right) \mathrm{~s.t.~} \mathbf{p}^* = \operatorname{argmin}_{\mathbf{q}^* \in P^*} \|\mathbf{p} - \mathbf{q}^*\|_1$
      • Normal Completeness: $\mathrm{mean}_{\mathbf{p}^* \in P^*}\left(\mathbf{n}_{\mathbf{p}}^{\mathrm{T}} \mathbf{n}_{\mathbf{p}^*}\right) \mathrm{~s.t.~} \mathbf{p} = \operatorname{argmin}_{\mathbf{q} \in P} \|\mathbf{q} - \mathbf{p}^*\|_1$
    • Symbol Explanation:
      • $P$: The point cloud sampled from the predicted mesh.
      • $P^*$: The point cloud sampled from the ground-truth mesh.
      • $\mathbf{p}$: A point in the predicted point cloud $P$.
      • $\mathbf{p}^*$: A point in the ground-truth point cloud $P^*$.
      • $\mathbf{n}_{\mathbf{p}}$: The normal vector at point $\mathbf{p}$.
      • $\mathbf{n}_{\mathbf{p}^*}$: The normal vector at point $\mathbf{p}^*$.
      • $\mathbf{n}_{\mathbf{p}}^{\mathrm{T}} \mathbf{n}_{\mathbf{p}^*}$: The dot product of the two normal vectors. Since normals are unit vectors, this value ranges from -1 (opposite direction) to 1 (same direction), quantifying their alignment.
      • $\operatorname{argmin}_{\mathbf{q}^* \in P^*} \|\mathbf{p} - \mathbf{q}^*\|_1$: The point $\mathbf{p}^*$ in $P^*$ that is closest to $\mathbf{p}$.
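The point-cloud metrics above can be computed with nearest-neighbor queries, as in the sketch below; note that it uses Euclidean nearest neighbors for simplicity, whereas the table's definitions are written with the L1 norm.

```python
import numpy as np
from scipy.spatial import cKDTree

def geometry_metrics(pred_pts, gt_pts, pred_normals, gt_normals, thresh=0.05):
    """Chamfer-style metrics between predicted and ground-truth point clouds,
    following the per-point nearest-neighbour definitions above."""
    d_pred2gt, idx_gt = cKDTree(gt_pts).query(pred_pts)    # accuracy side
    d_gt2pred, idx_pred = cKDTree(pred_pts).query(gt_pts)  # completeness side

    accuracy, completeness = d_pred2gt.mean(), d_gt2pred.mean()
    chamfer = accuracy + completeness
    precision = (d_pred2gt < thresh).mean()
    recall = (d_gt2pred < thresh).mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)

    normal_acc = (pred_normals * gt_normals[idx_gt]).sum(-1).mean()
    normal_comp = (gt_normals * pred_normals[idx_pred]).sum(-1).mean()
    return dict(chamfer=chamfer, fscore=fscore,
                normal_consistency=(normal_acc + normal_comp) / 2)
```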

Rendering Metrics: These metrics evaluate the perceptual quality of the rendered novel views compared to ground-truth images.

  • Peak Signal-to-Noise Ratio (PSNR) ($\uparrow$): A common metric for image quality, typically used to measure the fidelity of lossy compression or image-processing results. Higher PSNR indicates better image quality.

    • Conceptual Definition: PSNR is a quantitative measure of image reconstruction quality. It compares a reconstructed image to its ground truth by computing the ratio between the maximum possible signal power and the power of the corrupting noise. A higher PSNR generally indicates a higher-quality (less noisy) reconstruction.
    • Mathematical Formula: $\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$, where $\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$
    • Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for 8-bit images).
      • $\mathrm{MSE}$: The mean squared error between the original image $I$ and the reconstructed image $K$.
      • I(i,j): The pixel value at coordinates (i,j) in the original image.
      • K(i,j): The pixel value at coordinates (i,j) in the reconstructed image.
      • m, n: The dimensions (height and width) of the image.
  • Structural Similarity Index Measure (SSIM) ($\uparrow$): A perception-based metric that models image degradation as a perceived change in structural information, while also incorporating luminance and contrast masking. Higher SSIM indicates better perceptual quality.

    • Conceptual Definition: SSIM measures the perceptual similarity between two images, typically a distorted image and a reference. It accounts for three factors believed to be crucial for human perception of image quality: luminance, contrast, and structure. A value of 1 implies identical images.
    • Mathematical Formula: $\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
    • Symbol Explanation:
      • $x, y$: The two image patches being compared.
      • $\mu_x, \mu_y$: The means of patches $x$ and $y$, respectively.
      • $\sigma_x^2, \sigma_y^2$: The variances of patches $x$ and $y$, respectively.
      • $\sigma_{xy}$: The covariance of patches $x$ and $y$.
      • $c_1 = (K_1 L)^2, c_2 = (K_2 L)^2$: Small constants that prevent division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $K_1 \ll 1, K_2 \ll 1$ are small default constants (e.g., $K_1 = 0.01, K_2 = 0.03$).
  • Learned Perceptual Image Patch Similarity (LPIPS) ($\downarrow$): A metric that uses a pre-trained deep neural network (VGG in this case; Simonyan & Zisserman, 2014) to compute a perceptual distance between two images. Lower LPIPS indicates higher perceptual similarity.

    • Conceptual Definition: LPIPS (sometimes called "perceptual loss") quantifies the perceptual similarity between two images in a way that aligns better with human judgment than traditional metrics such as PSNR or SSIM. It computes the distance between feature representations of the images extracted from a pre-trained deep convolutional network (such as VGG), since deep features capture more abstract, perceptually relevant information.
    • Mathematical Formula: While LPIPS does not have a single simple closed form like PSNR or SSIM, its core computation is: $\mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \|w_l \odot (\phi_l(x) - \phi_l(y))\|_2^2$
    • Symbol Explanation:
      • $x, y$: The two images being compared.
      • $\phi_l(\cdot)$: The feature stack (output) of the $l$-th layer of the pre-trained network (e.g., VGG).
      • $w_l$: A learned per-channel weight for the feature map at layer $l$.
      • $\odot$: Element-wise multiplication.
      • $\|\cdot\|_2^2$: The squared L2 norm.
      • $H_l, W_l$: The height and width of the feature map at layer $l$.
      • $\sum_l$: A sum over the selected layers of the network.
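Of the rendering metrics, PSNR is simple enough to compute directly (a minimal sketch follows); SSIM and LPIPS are typically taken from standard library implementations, since LPIPS additionally requires a pre-trained VGG network.

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """PSNR in dB for float images scaled to [0, max_val]."""
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```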

5.3. Baselines

The paper compares G4SPLAT against a variety of representative baselines, categorized into classical Gaussian Splatting and methods integrating diffusion models. All baselines are augmented with MASt3R-SfM (Duisterhof et al., 2025) to improve robustness in sparse-view scenarios, ensuring a fair comparison.

Classical Gaussian Splatting Approaches:

  • 3DGS (Kerbl et al., 2023): The original 3D Gaussian Splatting method. Known for high-quality rendering with dense views but struggles with sparse inputs.
  • 2DGS (Huang et al., 2024a): An extension of 3DGS that uses 2D anisotropic disks, aiming to improve geometric accuracy from sparse views.
  • FSGS (Zhu et al., 2024): "Few-Shot Gaussian Splatting," a sparse-view 3DGS method that incorporates depth regularization.
  • InstantSplat (Fan et al., 2024): A fast sparse-view Gaussian Splatting method.
  • MAtCha (Guédon et al., 2025): Builds on 2DGS, introducing chart alignment using SfM-derived scale-accurate depth, specifically designed for sparse views. This is a key baseline as G4SPLAT builds upon its ideas.

Approaches Integrating Diffusion Models:

  • GenFusion (Wu et al., 2025c): A method that closes the loop between reconstruction and generation using video diffusion models for sparse-view 3DGS.
  • Difix3D+ (Wu et al., 2025a): Improves 3D reconstructions using single-step diffusion models.
  • GuidedVD (Zhong et al., 2025): Tames video diffusion priors with scene-grounding guidance for 3D Gaussian Splatting from sparse inputs.
  • See3D (Ma et al., 2025): A video diffusion model used for 3D creation from pose-free videos. The paper also implements a variant of 2DGS augmented with See3D for comparison.

6. Results & Analysis

The experimental results demonstrate G4SPLAT's superior performance in both geometry and appearance reconstruction, especially in unobserved regions and across diverse scenarios.

6.1. Core Results Analysis

6.1.1. Novel View Synthesis Quality

G4SPLAT consistently achieves better appearance and geometry reconstruction than the baselines, producing fewer Gaussian floaters (spurious Gaussians detached from surfaces that appear as blurry, view-inconsistent artifacts) in both observed and unobserved regions.

The following figure illustrates a qualitative comparison from 5 input views:

Figure 4: Qualitative comparison from 5 input views. Our approach achieves better appearance and geometry reconstruction with fewer Gaussian floaters in both observed and unobserved regions.

The following figure shows more qualitative results:

Figure A5: More qualitative results, comparing MAtCha, GenFusion, Difix3D+, GuidedVD, See3D, and Ours against the ground-truth scene (GT) across multiple views, showing color images together with the corresponding depth/geometry.

Comparison with Baselines:

  • Diffusion-based methods (GenFusion, See3D, GuidedVD): While capable of hallucinating content in unobserved regions, their completions are often blurry and compromised by severe floaters. They can even degrade reconstruction quality in observed regions due to multi-view inconsistencies.
  • Difix3D+: Shows relatively good quality in observed regions but struggles with unobserved areas.
  • G4SPLAT: Maintains high fidelity across both observed and unobserved regions, effectively leveraging the generative prior while ensuring consistency.

6.1.2. Geometry Reconstruction Quality

Diffusion-based methods often suffer from severe "shape-appearance ambiguities", meaning their rendered views might appear plausible, but the underlying 3D geometry is poor. G4SPLAT overcomes this by yielding more accurate geometry in unobserved regions and producing smoother, floater-free reconstructions in observed regions. This improvement is attributed to its "plane-aware geometry modeling", which provides reliable depth supervision for both observed and unobserved areas, ensuring scene-wide consistency.

Quantitatively, G4SPLAT significantly outperforms all baselines on all reconstruction metrics (Chamfer Distance, F-Score, Normal Consistency) across multiple datasets, as detailed in Table 1.

The following are the results from Table 1 of the original paper:

Reconstruction metrics: CD↓, F-Score↑, NC↑; rendering metrics: PSNR↑, SSIM↑, LPIPS↓.

| Dataset | Method | CD↓ | F-Score↑ | NC↑ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|---|
| Replica | 3DGS | 16.61 | 27.72 | 64.34 | 18.29 | 0.744 | 0.254 |
| Replica | 2DGS | 14.64 | 48.01 | 74.14 | 18.43 | 0.735 | 0.306 |
| Replica | FSGS | 18.17 | 26.87 | 64.16 | 19.19 | 0.766 | 0.259 |
| Replica | InstantSplat | 21.00 | 19.67 | 62.01 | 19.39 | 0.762 | 0.255 |
| Replica | MAtCha | 10.12 | 60.90 | 79.33 | 17.81 | 0.752 | 0.228 |
| Replica | See3D | 12.74 | 45.27 | 73.98 | 19.22 | 0.735 | 0.328 |
| Replica | GenFusion | 13.05 | 41.60 | 69.33 | 20.14 | 0.801 | 0.258 |
| Replica | Difix3D+ | 13.71 | 43.11 | 65.34 | 19.42 | 0.779 | 0.231 |
| Replica | GuidedVD | 27.87 | 17.29 | 61.64 | 22.51 | 0.822 | 0.260 |
| Replica | Ours | 6.61 | 65.14 | 83.98 | 23.90 | 0.836 | 0.199 |
| ScanNet++ | 3DGS | 16.60 | 31.92 | 65.35 | 14.28 | 0.696 | 0.372 |
| ScanNet++ | 2DGS | 14.34 | 51.97 | 70.01 | 13.91 | 0.661 | 0.429 |
| ScanNet++ | FSGS | 23.80 | 27.86 | 64.53 | 14.80 | 0.731 | 0.362 |
| ScanNet++ | InstantSplat | 21.32 | 25.44 | 60.67 | 15.02 | 0.742 | 0.355 |
| ScanNet++ | MAtCha | 11.55 | 62.98 | 73.61 | 13.58 | 0.677 | 0.351 |
| ScanNet++ | See3D | 13.03 | 53.65 | 70.39 | 14.76 | 0.684 | 0.426 |
| ScanNet++ | GenFusion | 10.68 | 47.15 | 66.27 | 16.12 | 0.726 | 0.347 |
| ScanNet++ | Difix3D+ | 13.15 | 53.91 | 67.30 | 14.09 | 0.701 | 0.340 |
| ScanNet++ | GuidedVD | 25.35 | 16.67 | 60.48 | 17.90 | 0.807 | 0.336 |
| ScanNet++ | Ours | 6.34 | 67.12 | 77.45 | 18.69 | 0.792 | 0.314 |
| DeepBlending | 3DGS | 31.44 | 20.02 | 55.39 | 15.33 | 0.571 | 0.489 |
| DeepBlending | 2DGS | 25.60 | 23.81 | 63.82 | 14.89 | 0.556 | 0.506 |
| DeepBlending | FSGS | 31.45 | 19.66 | 57.38 | 15.72 | 0.602 | 0.476 |
| DeepBlending | InstantSplat | 33.78 | 17.99 | 57.91 | 15.00 | 0.569 | 0.483 |
| DeepBlending | MAtCha | 22.36 | 26.80 | 67.92 | 14.74 | 0.558 | 0.465 |
| DeepBlending | See3D | 31.34 | 22.68 | 63.18 | 15.00 | 0.552 | 0.537 |
| DeepBlending | GenFusion | 30.70 | 22.37 | 58.70 | 16.20 | 0.626 | 0.468 |
| DeepBlending | Difix3D+ | 32.70 | 21.94 | 58.08 | 15.18 | 0.583 | 0.450 |
| DeepBlending | GuidedVD | 43.28 | 15.95 | 59.21 | 16.32 | 0.618 | 0.481 |
| DeepBlending | Ours | 20.72 | 28.02 | 72.04 | 16.76 | 0.645 | 0.440 |
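
To make the reconstruction metrics in Table 1 concrete, the following is a minimal sketch of how Chamfer Distance and F-Score can be computed between point clouds sampled from the predicted and ground-truth surfaces. The distance threshold and any unit scaling here are illustrative assumptions, not the paper's exact evaluation protocol; Normal Consistency is analogous, averaging the absolute dot product between nearest-neighbor surface normals.

```python
# Illustrative Chamfer Distance and F-Score between two sampled point clouds.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred_pts, gt_pts, thresh=0.05):
    """pred_pts: (N, 3) samples from the reconstruction; gt_pts: (M, 3) GT samples."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # accuracy: pred -> GT distances
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # completeness: GT -> pred distances
    chamfer = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    precision = (d_pred_to_gt < thresh).mean()
    recall = (d_gt_to_pred < thresh).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore

# Example with random stand-in clouds.
chamfer, fscore = chamfer_and_fscore(np.random.rand(10000, 3), np.random.rand(10000, 3))
```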

6.1.3. Any-View Scene Reconstruction

G4SPLAT demonstrates robust performance and strong generalizability across a wide range of scenarios.

The following figure shows any-view scene reconstruction examples:

Figure 5: Any-view scene reconstruction. Our method demonstrates strong generalization across diverse scenarios, including indoor and outdoor scenes, unposed scenes and even single-view scenes.

  • Diverse Scenarios: It performs well in indoor and outdoor scenes, from single-view inputs, and with unposed videos (like YouTube videos).

  • Varying Number of Input Views: As shown in Table A1, G4SPLAT consistently outperforms baselines regardless of the number of input views (5, 10, or 15 views).

  • Complex Lighting: For scenes with complex lighting (where baselines struggle even with dense views), G4SPLAT's accurate geometry guidance effectively suppresses errors caused by significant brightness variations across viewpoints, leading to high-quality reconstructions (Fig. A2).

  • Non-planar Geometry: While emphasizing planar regions, the method also faithfully reconstructs non-planar geometry, as exemplified by "The Museum" (Fig. 1) and "The Cat" (Fig. 5).

    The following figure shows single-view reconstruction results:

Figure A1: More results of single-view reconstruction. Our method achieves high-quality geometry reconstruction (i.e., geometry view) and realistic texture recovery (i.e., appearance view). The appearance view is obtained by rendering the exported colored mesh in Blender.

The following figure shows dense view reconstruction comparison:

Figure A2: Dense view reconstruction comparison of 3DGS, 2DGS, MAtCha, and Ours, showing input views alongside each method's rendered views and the corresponding depth/geometry reconstructions.

The following are the results from Table A1 of the original paper:

| Method | CD↓ / NC↑ (5 views) | CD↓ / NC↑ (10 views) | CD↓ / NC↑ (15 views) | PSNR↑ / LPIPS↓ (5 views) | PSNR↑ / LPIPS↓ (10 views) | PSNR↑ / LPIPS↓ (15 views) |
|---|---|---|---|---|---|---|
| 2DGS | 14.64 / 74.14 | 9.37 / 81.24 | 7.17 / 85.33 | 18.43 / 0.306 | 21.79 / 0.207 | 25.30 / 0.139 |
| FSGS | 18.17 / 64.16 | 13.97 / 68.76 | 12.64 / 71.22 | 19.19 / 0.259 | 22.50 / 0.179 | 25.84 / 0.127 |
| MAtCha | 11.10 / 81.20 | 7.35 / 83.99 | 6.03 / 85.67 | 17.85 / 0.228 | 21.26 / 0.153 | 25.00 / 0.109 |
| See3D | 12.74 / 73.98 | 9.22 / 80.44 | 7.40 / 84.22 | 19.22 / 0.328 | 22.73 / 0.240 | 25.56 / 0.183 |
| GenFusion | 13.05 / 69.33 | 10.04 / 74.03 | 8.88 / 76.57 | 20.14 / 0.258 | 23.90 / 0.183 | 26.48 / 0.138 |
| Difix3D+ | 13.71 / 65.34 | 10.15 / 68.43 | 7.97 / 70.73 | 19.42 / 0.231 | 22.68 / 0.165 | 26.04 / 0.122 |
| GuidedVD | 27.87 / 61.64 | 20.30 / 64.53 | 16.62 / 68.54 | 22.51 / 0.260 | 25.63 / 0.205 | 27.91 / 0.163 |
| Ours | 6.61 / 83.98 | 4.88 / 85.49 | 3.98 / 87.25 | 23.90 / 0.199 | 27.48 / 0.140 | 30.22 / 0.094 |

6.1.4. Multi-view Inconsistencies

The following figure illustrates the issue of inconsistent multi-view inpainting and how G4SPLAT addresses it:

Figure A3: Training results with inconsistent (Incon.) vs. consistent (Con.) multi-view inpainting. Training Gaussian representations with inconsistent inpainting leads to black shadows in non-consistent regions, while our method significantly reduces such artifacts, yielding sharper and cleaner renderings.

As shown in Fig. A3, training Gaussian representations with multi-view inconsistent inpainting results in black shadows. G4SPLAT's approach of incorporating scale-accurate geometry guidance and modulating color supervision effectively mitigates these inconsistencies, producing sharper and cleaner renderings.
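
As an illustration only (not the paper's exact formulation), "modulating color supervision" can be understood as down-weighting the photometric loss wherever the generated supervision is judged inconsistent; the consistency_weight input below is a hypothetical per-pixel weight supplied by such a consistency check.

```python
# Illustrative sketch: weight the per-pixel photometric loss by a consistency
# estimate so that pixels where the inpainted views disagree contribute less.
import torch

def modulated_color_loss(rendered, inpainted, consistency_weight):
    """
    rendered:            (3, H, W) image rendered from the Gaussian scene.
    inpainted:           (3, H, W) generated (inpainted) supervision image for this view.
    consistency_weight:  (1, H, W) weights in [0, 1], low in inconsistent regions.
    """
    per_pixel_l1 = (rendered - inpainted).abs().mean(dim=0, keepdim=True)
    return (consistency_weight * per_pixel_l1).mean()
```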

6.2. Ablation Studies / Parameter Analysis

Ablation experiments were conducted on the Replica dataset to evaluate the contribution of each component:

  • GP: Generative Prior (from diffusion models).

  • PM: Plane-aware Geometry Modeling.

  • PP: Geometry-Guided Generative Pipeline.

    The following are the results from Table 2 of the original paper:

| GP | PM | PP | CD↓ | F-Score↑ | NC↑ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|---|---|
| × | × | × | 10.60 | 59.17 | 79.95 | 17.85 | 0.751 | 0.228 |
| ✓ | × | × | 9.46 | 56.99 | 77.58 | 19.63 | 0.740 | 0.295 |
| × | ✓ | × | 8.73 | 64.96 | 80.55 | 17.63 | 0.752 | 0.219 |
| ✓ | ✓ | × | 7.56 | 62.36 | 80.89 | 21.88 | 0.810 | 0.221 |
| ✓ | ✓ | ✓ | 6.61 | 65.14 | 83.98 | 23.90 | 0.836 | 0.199 |

Key Observations from Ablation:

  1. Generative Prior (GP) alone: Incorporating GP (row 2) improves rendering quality (higher PSNR, SSIM) compared to the baseline (row 1), but leads to limited gains in geometry reconstruction. F-Score and NC even decline, indicating that GP alone tends to produce averaged, blurry results for unseen areas without proper geometric grounding. This confirms that directly introducing generative prior can lead to "shape-appearance ambiguities".
  2. Plane-Aware Geometry Modeling (PM):
    • Adding PM alone (row 3) significantly improves geometry reconstruction (lower CD, higher F-Score, NC).
    • When PM is combined with GP (row 4), it brings notable gains in rendering quality (PSNR, SSIM) compared to GP alone. This suggests that accurate geometry guidance provides a clean geometric basis, effectively suppressing Gaussian floaters and allowing the generative model to perform as intended.
  3. Geometry-Guided Generative Pipeline (PP): Incorporating the full PP (row 5, i.e., the complete G4SPLAT method) further enhances both rendering fidelity and geometric accuracy, achieving the best results across all metrics. This indicates that geometry guidance contributes through more accurate visibility masks, novel views with broader plane coverage, and consistent color supervision, mitigating multi-view inconsistencies in the generative process.

6.3. Discussion on Plane Representation

The authors justify their choice of a plane representation for scale-accurate geometry guidance for three main reasons:

  1. Prevalence: Planar structures (floors, walls, tabletops) are common and occupy large portions of images in man-made environments.

  2. Generalization: Global 3D planes can be reliably estimated from accurate point clouds in overlapping regions and then extended to non-overlapping or unobserved planar regions (a minimal plane-fitting sketch appears at the end of this subsection). This also enables linear alignment of monocular depth for non-planar areas.

  3. Efficiency: Plane representation is simple, computationally efficient, and memory-friendly.

    The following are the results from Table 3 of the original paper:

| Method | CD↓ | PSNR↑ | Time (min)↓ |
|---|---|---|---|
| MAtCha | 11.57 | 11.56 | 32.4 |
| See3D | 15.42 | 14.50 | 58.6 |
| GenFusion | 15.66 | 16.49 | 41.4 |
| Difix3D+ | 17.79 | 12.94 | 68.7 |
| GuidedVD | 24.29 | 19.02 | 141.4 |
| Ours | 8.77 | 20.26 | 73.3 |
| Ours (DS) | 9.33 | 19.36 | 43.5 |

As shown in Table 3, G4SPLAT achieves high-quality reconstruction with a runtime comparable to other approaches that employ generative priors. For instance, training a single scene on Replica with 5 input views takes slightly over one hour (73.3 minutes). A variant with downsampled Gaussians (Ours (DS)) further accelerates training to 43.5 minutes while maintaining competitive performance. This demonstrates that the geometric guidance, despite its complexity, does not introduce prohibitive computational overhead relative to other state-of-the-art methods employing generative priors.
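
As a concrete illustration of point 2 in the list above (estimating global 3D planes from fused point clouds), the following is a minimal NumPy sketch of RANSAC plane fitting; the threshold, iteration count, and SVD refinement step are illustrative assumptions rather than the paper's actual settings.

```python
# Illustrative RANSAC plane fit on a 3D point cloud (NumPy only).
import numpy as np

def fit_plane_ransac(points, n_iters=1000, inlier_thresh=0.02, rng=None):
    """points: (N, 3) array. Returns ((normal, offset), inlier_mask) with n·x + d = 0."""
    rng = np.random.default_rng(rng)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iters):
        # Sample 3 distinct points and form a candidate plane.
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-8:          # degenerate (near-collinear) sample
            continue
        n = n / norm
        d = -np.dot(n, p0)
        # Point-to-plane distances and inlier count.
        dist = np.abs(points @ n + d)
        inliers = dist < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    # Least-squares refinement of the best plane on its inliers via SVD.
    if best_plane is not None:
        pts = points[best_inliers]
        centroid = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - centroid)
        n = vt[-1]
        best_plane = (n, -np.dot(n, centroid))
    return best_plane, best_inliers
```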

7. Conclusion & Reflections

7.1. Conclusion Summary

G4SPLAT presents a significant advancement in 3D scene reconstruction from sparse views by effectively integrating geometry guidance with generative priors. The core contribution lies in its novel approach to deriving scale-accurate geometric constraints using plane representations, which are then meticulously incorporated throughout the entire generative pipeline. This geometry-guided strategy addresses critical limitations of previous methods, such as unreliable geometric supervision and multi-view inconsistencies in generated content. Extensive experiments demonstrate that G4SPLAT consistently achieves state-of-the-art performance in both geometric and appearance reconstruction, particularly excelling in unobserved regions. Furthermore, its ability to handle single-view inputs and unposed videos highlights its strong generalizability and practical applicability in diverse real-world scenarios.

7.2. Limitations & Future Work

The authors acknowledge several limitations of G4SPLAT and suggest directions for future work:

  1. Current Video Diffusion Model Capabilities: The method's performance is partially limited by the imperfections of existing video diffusion models. These models can struggle with ensuring exact color consistency in completed regions, leading to subtle inconsistencies that can affect rendering, even if the geometry is accurate (Fig. A4a).
  2. Heavily Occluded Regions: G4SPLAT struggles with reconstructing "heavily occluded regions" where objects are very close together (e.g., a chair partially blocked by a table, Fig. A4b). Generating a reasonable novel camera view for such tightly occluded areas remains challenging.
    • Future Work: The authors hypothesize that incorporating "object-level prior" (e.g., Ni et al., 2025; Yang et al., 2025) could help improve reconstruction in such severely occluded regions.
  3. Monocular Depth for Non-Plane Regions: While the current method uses a linear transformation to scale monocular depth for non-planar regions (see the alignment sketch after this list), the accuracy ultimately depends on the monocular estimator.
    • Future Work: Adopting a more general surface representation that can naturally model both planar and non-planar regions could yield more accurate depth, especially in non-plane areas. Although potentially less computationally efficient than plane-based approaches, it could improve overall reconstruction quality.
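
As a concrete illustration of the linear transformation mentioned in point 3, the following is a minimal sketch of least-squares scale/shift alignment of monocular depth against metric depth available in plane-covered pixels; the variable names and masking scheme are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal least-squares scale/shift alignment of a relative depth map to metric depth.
import numpy as np

def align_depth(mono_depth, metric_depth, valid_mask):
    """
    mono_depth:   (H, W) relative depth from a monocular estimator.
    metric_depth: (H, W) metric depth, reliable only where valid_mask is True
                  (e.g., pixels covered by fitted planes).
    Returns the monocular depth mapped to metric scale: s * mono_depth + t.
    """
    d = mono_depth[valid_mask].ravel()
    z = metric_depth[valid_mask].ravel()
    # Solve min_{s,t} || s*d + t - z ||^2 as a 2-parameter linear least-squares problem.
    A = np.stack([d, np.ones_like(d)], axis=1)       # (M, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s * mono_depth + t
```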

7.3. Personal Insights & Critique

G4SPLAT represents a highly insightful and practical approach to a challenging problem. The paper's core argument that "accurate geometry is the fundamental prerequisite" for effective generative 3D reconstruction is compelling and well-supported by the empirical results.

Inspirations and Applications:

  • Synergy of Priors: The method brilliantly leverages two distinct but complementary priors: the structural regularity of planar geometry and the content generation capabilities of diffusion models. This hybrid approach is likely to be a powerful paradigm for future 3D reconstruction tasks.
  • Practical Relevance: The ability to reconstruct 3D scenes from single-view inputs and unposed videos opens up significant practical applications in areas like embodied AI (for agents to navigate and interact with reconstructed environments), robotics (for scene understanding and manipulation), and content creation (for quickly generating 3D assets from limited inputs).
  • Robustness to Sparsity: The method's strong performance even with very sparse inputs (e.g., 5 views or single view) is a major breakthrough, as dense data collection is often costly and impractical in real-world scenarios.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Dependence on Planar Structures: While effective for man-made environments, the strong reliance on planar structures might limit applicability to highly organic, natural, or unstructured scenes (e.g., forests, rock formations, or complex sculptures without clear planar elements). The method might require extensions or adaptations for such scenarios.

  • Complexity of Pipeline: While effective, the pipeline is quite intricate, involving multiple stages of plane extraction, RANSAC, depth map fusion, visibility grid construction, novel view selection, video diffusion model inpainting, and iterative Gaussian training. The interplay of these components, though robust, might be complex to implement and fine-tune for new domains.

  • Computational Cost: Although the paper states runtime is comparable to other generative prior methods, the 73.3 minutes for a single scene with 5 views (and 43.5 minutes for the downsampled variant) is still significant. For large-scale reconstruction or real-time applications beyond rendering, further optimizations in training time might be needed.

  • Generalizability of Plane Detection: The initial plane extraction relies on normal maps, SAM masks, and K-means. The robustness of this step to extremely noisy input, diverse textures, or complex non-planar objects being misidentified as planar, could influence the downstream plane-aware depth maps.

  • Uncertainty Quantification: The method provides accurate geometry, but doesn't explicitly quantify the uncertainty in regions that are heavily inpainted or far from observed views. Such information could be valuable for downstream tasks in robotics or autonomous systems.

    Overall, G4SPLAT provides a compelling argument for the indispensable role of explicit geometric guidance in the era of powerful generative models for 3D reconstruction, pushing the boundaries of what's possible with sparse and challenging input data.
