
G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

Published: 2025-10-14


This analysis was generated by AI and may not be fully accurate; the original paper is the authoritative source.

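TL;DR

This paper proposes G4Splat, which combines geometry guidance from planar structures with a generative prior and uses metric-scale depth maps as accurate supervision. It targets two problems in 3D scene reconstruction: multi-view inconsistency and low reconstruction quality. Experiments show that it outperforms existing methods in both observed and unobserved regions, supports single-view inputs and unposed videos, and generalizes well.

Abstract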

Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.


In-Depth Analysis


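1. Basic Information

1.1. Title

G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

1.2. Authors

Junfeng Ni*, Yixin Chen, Zhifei Yang, Yu Liu, Ruijie Lu, Song-Chun Zhu, Siyuan Huang (* work done during an internship at BIGAI). Affiliations: Tsinghua University; State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI); Peking University.

1.3. Venue

The paper is currently available as a preprint on arXiv, posted on 2025-10-14 (03:06:28 UTC). arXiv is an open-access preprint server that researchers, particularly in computer science, use to share new results. As a preprint, the paper may not yet have undergone peer review.

1.4. Year

2025.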

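1.5. Abstract

Although recent work has made progress in using the generative prior of pre-trained diffusion models for 3D scene reconstruction, existing methods still face two key limitations. First, lacking reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone unobserved ones. Second, they have no effective mechanism to mitigate multi-view inconsistencies in generated images, which leads to severe shape-appearance ambiguities and degraded scene geometry.

The paper argues that accurate geometry is the fundamental prerequisite for using generative models to improve 3D scene reconstruction. The authors first exploit the prevalence of planar structures to derive accurate metric-scale depth maps, which provide reliable supervision in both observed and unobserved regions. They then integrate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and strengthen multi-view consistency when inpainting with video diffusion models, yielding accurate and consistent scene completion.

Extensive experiments on Replica, ScanNet++, and DeepBlending show that the method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly in unobserved regions. The method also naturally supports single-view inputs and unposed videos and generalizes well to indoor and outdoor scenes, which suggests practical real-world applicability. The project page is at https://dali-jack.github.io/g4splat-web/.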

1.6. Original Links

Paper: https://arxiv.org/abs/2510.12099 PDF: https://arxiv.org/pdf/2510.12099v1.pdf Status: preprint (arXiv)

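2. Overview

2.1. Background and Motivation

2.1.1. Core Problems and Challenges

3D scene reconstruction, especially with 3D Gaussian Splatting (3DGS), still struggles under sparse-view conditions. 3DGS performs well with dense views, but with sparse views its quality drops sharply because geometric and photometric supervision are insufficient. The paper identifies two key limitations of existing methods:

Poor reconstruction quality caused by missing geometric supervision: Existing generative reconstruction methods, even with the generative prior of diffusion models, struggle to produce high-quality reconstructions in observed regions, let alone unobserved ones, because they lack reliable geometric supervision. Some methods add depth regularization, but the inherent scale ambiguity of monocular depth estimators limits how reliable that supervision can be.
Shape-appearance ambiguity caused by multi-view inconsistency: Existing methods have no effective mechanism to mitigate multi-view inconsistencies in generated images. Content generated from different viewpoints may disagree, which causes shape-appearance ambiguities, degrades the reconstructed 3D geometry, and produces artifacts such as floating Gaussians.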

2.1.2. Importance and Research Gap

3D scene reconstruction is a foundational task for embodied AI, robotics, and extended reality. In practice, dense view coverage is often unavailable, for example with single-view inputs or unposed videos, so high-quality and consistent reconstruction from sparse views has clear practical value.

With sparse views, existing methods tend to hallucinate content in unobserved regions, but that content usually has poor geometry and appearance and is inconsistent with the observed regions. The paper stresses that accurate geometry is the fundamental prerequisite for using generative models to improve 3D scene reconstruction: without a solid geometric basis, a generative model can still produce content, but it cannot guarantee the 3D shape or cross-view consistency of that content.

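2.1.3. Entry Point and Key Ideas

G4Splat's entry point is to make accurate geometry guidance the foundation on which generative models improve 3D scene reconstruction.

The key ideas are:

Exploiting planar structures: The method detects the planar structures that are common in man-made environments (consistent with the Manhattan world assumption) and uses them to derive metric-scale depth maps. 3D planes can be inferred reliably from local depth observations and then extended into unobserved regions, which provides reliable geometric supervision.
Geometry guidance throughout the generative pipeline: Geometry is used not only at initialization but at every stage of the generative pipeline:
- Better visibility mask estimation: Scale-accurate depth is used to build a 3D visibility grid, which yields more reliable visibility masks and avoids the errors of conventional alpha maps.
- Guided novel view selection: Global 3D planes act as object proxies that guide the choice of novel viewpoints, maximizing coverage of complete planar structures and giving the inpainting model richer contextual cues.
- Stronger multi-view consistency: When inpainting with video diffusion models, global 3D planes modulate the color supervision, which reduces cross-view conflicts.
Support for diverse inputs: The method naturally supports single-view inputs and unposed videos, which broadens its practical scope.

With these ideas, G4Splat aims to overcome the geometry and appearance limitations of existing sparse-view methods and achieve accurate, consistent scene completion.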

2.2. Contributions and Main Findings

2.2.1. Main Contributions

A new method that uses a planar representation to derive scale-accurate geometric constraints, which markedly improves 3D scene reconstruction even in unobserved regions. 3D planes are estimated from local depth observations and extended across the whole surface, which sidesteps the scale ambiguity of monocular depth estimators.
Integration of geometry guidance into the generative pipeline, which improves visibility mask estimation and novel view selection and strengthens the multi-view consistency of video diffusion models, yielding reliable and consistent scene completion.
State-of-the-art performance on several datasets, including Replica, ScanNet++, and DeepBlending, with support for indoor, outdoor, single-view, and unposed-video reconstruction.

2.2.2. Key Conclusions and Findings

Accurate geometry underpins the usefulness of generative models: The experiments indicate that a reliable geometric basis is what makes generative models effective for 3D scene reconstruction. Without it, the generative prior tends to produce shape-appearance ambiguities and low-quality reconstructions.
Planar structures provide strong geometric supervision: The planar structures common in man-made environments make it possible to derive scale-accurate depth, which supervises both observed and unobserved regions.
Geometry guidance matters throughout generation: Using geometry guidance for visibility masks, novel view selection, and multi-view consistency markedly improves the generative model's output, giving more accurate and more consistent scene completion.
Broad applicability and generalization: G4Splat performs well with sparse views, also improves results with dense views, and handles challenging inputs such as single views and unposed videos.
Ablations confirm each component: The ablation studies show that the generative prior, plane-aware geometry modeling, and the geometry-guided generative pipeline each contribute to reconstruction quality, with plane-aware geometry modeling being especially important.

3. Preliminaries and Related Work

3.1. Basic Concepts

3.1.1. 3D Gaussian Splatting (3DGS)

3DGS is a method for real-time radiance field rendering and 3D scene reconstruction. It represents a scene as a set of 3D Gaussians, each an ellipsoid with a position, a covariance (shape and orientation), an opacity (alpha), and spherical harmonics coefficients that encode view-dependent color. At render time the Gaussians are projected onto the 2D image plane and composited with differentiable alpha blending. 3DGS is known for its rendering quality and training efficiency.

3.1.2. 2D Gaussian Splatting (2DGS)

2DGS is a variant of 3DGS that collapses each 3D Gaussian into a 2D anisotropic disk. It reconstructs a scene from multi-view images by optimizing the parameters of these disks and targets photorealistic novel view synthesis while keeping the efficient rendering of 3DGS. Because the primitives are flat, they align naturally with surfaces, which is why 2DGS is a common choice when accurate geometry matters.

3.1.3. Diffusion Models

Diffusion models are generative models that define a forward process which gradually adds noise to data and learn a reverse process that recovers data from noise. In 3D scene reconstruction they usually serve as a generative prior that fills in missing content or detail for sparse views and unobserved regions, improving the quality and completeness of the reconstruction. Video diffusion models generate temporally coherent image sequences, which is particularly useful for keeping novel views consistent with one another.

3.1.4. Manhattan World Assumption

The Manhattan world assumption is a common prior in computer vision and 3D reconstruction. It assumes that most surfaces in man-made environments are aligned with three mutually orthogonal global directions (for example, two horizontal axes and the vertical axis). Walls, floors, ceilings, and many furniture surfaces are then planar and either parallel or perpendicular to one another. The assumption simplifies geometric reasoning and improves the accuracy and robustness of 3D structure estimation, especially with sparse views or in low-texture regions.

3.1.5. Monocular Depth Estimators

A monocular depth estimator is a deep learning model that predicts a depth map of the scene from a single 2D image. These models are typically trained on large collections of image-depth pairs. Their predictions are usually in relative scale rather than absolute metric scale: they capture which objects are nearer or farther, but not the true physical distance from the camera. This scale ambiguity is the main obstacle when metric-scale geometry is needed, as in 3D reconstruction.

3.1.6. Structure-from-Motion (SfM)

SfM is a computer vision technique that estimates 3D structure and camera poses from a set of 2D images. It identifies corresponding points across images and then uses triangulation and bundle adjustment to compute the 3D coordinates of those points together with the position and orientation of each camera. SfM yields 3D point clouds with a single, consistent scale across views (the sense in which this analysis calls them scale-accurate), but it needs enough image correspondences and can fail in low-texture or non-overlapping regions.

3.2. Prior Work

3.2.1. Sparse-View 3DGS

3DGS performs well with dense views but degrades sharply with sparse views. Several lines of work address this:

Depth regularization: DNGaussian (Li et al., 2024a) and FSGS (Zhu et al., 2024) introduce depth regularization to suppress floaters in visible regions. As noted above, the inherent scale ambiguity of monocular depth estimators limits how reliable this supervision can be.
Scale-accurate depth: MAtCha (Guédon et al., 2025) obtains scale-accurate depth from the SfM method MASt3R-SfM (Duisterhof et al., 2025) to overcome scale ambiguity. MAtCha still struggles in non-overlapping regions because SfM depends on image correspondences. G4Splat addresses this by exploiting the extensibility of planes: accurate depth estimates are propagated from overlapping regions to non-overlapping and even unobserved regions.

3.2.2. Generative Prior for 3DGS

In recent years diffusion models have been widely used in 3D reconstruction as a strong generative prior that complements sparse observations.

Diffusion-based 3D reconstruction: Poole et al., 2022, Xiong et al., 2023, and Weber et al., 2024 show that diffusion models are effective priors for 3D reconstruction.
Video diffusion models: Liu et al., 2024a;b;d, Zhao et al., 2025b, Gao et al., 2024, Zhou et al., 2025, Bao et al., 2025, Wu et al., 2025a;c, Fischer et al., 2025, Yin et al., 2025, and Zhong et al., 2025 further use video diffusion models to strengthen cross-view consistency. These methods usually build on an initial reconstruction, so with sparse observations and weak geometric constraints they still produce many floating-Gaussian artifacts and low-quality 3D geometry, especially in inpainted regions. G4Splat addresses this by providing scale-accurate depth supervision and by integrating geometry guidance into the video diffusion stage, which improves both geometry and appearance.

3.2.3. Plane Assumption in Reconstruction

The Manhattan world assumption, and the plane assumption in particular, is widely used for reconstructing man-made environments.

In SfM and SLAM: Liu et al., 2024c, Guo et al., 2024, Mazur et al., 2024, Liu et al., 2025b, and Pataki et al., 2025 use the plane assumption to improve matching accuracy in SfM and SLAM.
Direct plane fitting: Liu et al., 2019, Agarwala et al., 2022, Xie et al., 2022, Tan et al., 2023, Watson et al., 2024, Ye et al., 2025, and Liu et al., 2025a model indoor scenes by directly fitting a set of planes.
In neural implicit representations: Guo et al., 2022, Li et al., 2024b, Chen et al., 2024a, and Shi et al., 2025 integrate the plane assumption into 3D neural implicit representations. On the Gaussian side, GeoGaussian (Li et al., 2024d) and IndoorGS (Ruan et al., 2025) impose local planar constraints that regulate how Gaussians split and move, and PlanarSplatting (Tan et al., 2025) reconstructs 3D scenes as planar primitives directly from multi-view images. G4Splat differs from these methods: it uses the plane assumption to extract scale-accurate depth, which both guides the optimization of the Gaussian representation and makes it easier to integrate a generative prior, leading to accurate reconstruction of planar and non-planar structures in observed and unobserved regions.

3.3. Technical Evolution

The field has evolved roughly through the following stages:

The arrival of 3DGS: Kerbl et al., 2023 introduced 3DGS, which stood out in dense-view novel view synthesis for its rendering quality and training efficiency.
Sparse-view challenges and first solutions: As 3DGS spread, researchers turned to its degradation under sparse views. Early solutions such as DNGaussian and FSGS added depth regularization but were limited by the scale ambiguity of monocular depth estimators.
Geometric alignment: MAtCha uses chart alignment to obtain scale-accurate depth from SfM, but its limitations in non-overlapping regions became apparent.
Generative priors: Diffusion models opened new options for sparse-view 3D reconstruction. GenFusion, Difix3D+, GuidedVD, See3D, and similar methods use a generative prior to inpaint missing regions, but they often have problems with geometry and multi-view consistency.
A return to geometry guidance: G4Splat treats accurate geometry as the basis for generative models. It introduces the plane assumption to obtain scale-accurate depth and applies that geometry throughout the generative pipeline, combining geometric rigor with generative capability to address shape-appearance ambiguities and multi-view inconsistencies.

3.4. Differentiation

The main differences between G4Splat and related work are as follows.

3.4.1. Compared with Sparse-View 3DGS Methods (e.g., MAtCha)

Limitations of MAtCha: MAtCha relies on chart alignment and image correspondences to obtain scale-accurate depth. In non-overlapping or low-texture regions, correspondences are hard to establish, so its reconstruction shows significant errors there.
Advantages of G4Splat: G4Splat exploits the extensibility of the plane representation. Once 3D planes are estimated accurately from observed regions, their depth can be extrapolated to non-overlapping and even unobserved regions. This gives more robust and more complete scale-accurate depth supervision and therefore more accurate reconstruction in these difficult regions. G4Splat also improves depth in non-planar regions by linearly rescaling monocular depth.

3.4.2. Compared with 3DGS Methods That Use a Generative Prior (e.g., GenFusion, Difix3D+, GuidedVD, See3D)

Limitations of existing generative methods: These methods rely mainly on the generative prior of diffusion models to fill unobserved regions. Without reliable geometric constraints, they are prone to:
- Low-quality geometry: Even when the generated content looks plausible, its 3D geometry is often inaccurate and full of floating-Gaussian artifacts.
- Multi-view inconsistency: Novel views generated by the diffusion model may disagree in appearance and shape, which causes shape-appearance ambiguities and harms the final 3D reconstruction.
What G4Splat does differently: G4Splat states explicitly that accurate geometry is the basis for an effective generative model and integrates geometry guidance into several stages of the generative pipeline:
- Reliable geometric basis: Plane-aware geometry modeling provides scale-accurate depth, giving the generative model a solid 3D foundation.
- Geometry-guided visibility: Accurate geometry produces more reliable visibility masks that guide inpainting.
- Geometry-guided novel view selection: A plane-aware strategy selects novel views that maximize coverage of key geometric structures and give the inpainting model better context.
- Geometry-guided multi-view consistency: Global 3D planes modulate the color supervision from the video diffusion model, which reduces cross-view conflicts. Together, these integrations let G4Splat produce more accurate and more consistent geometry and appearance, especially in unobserved regions.

4. Methodology

4.1. Principle

G4Splat's core principle is that accurate geometry is the fundamental prerequisite for using generative models to improve 3D scene reconstruction. By detecting and exploiting the planar structures common in man-made environments, it compensates for the lack of geometric supervision under sparse views and counters the multi-view inconsistencies introduced by generative models.

The overall idea is:

Build a reliable geometric basis: Extract 2D plane masks from the input views and merge them into global 3D planes. Combine these global 3D planes with a monocular depth estimator to produce scale-accurate, plane-aware depth maps, which supervise geometry reliably in both observed and unobserved regions.
Guide generation with geometry: Integrate this geometric information into several stages of the generative pipeline:
- Build a visibility grid from scale-accurate depth to obtain more accurate visibility masks.
- Select novel views based on the global 3D planes to maximize coverage of key geometric structures.
- When inpainting with video diffusion models, modulate the color supervision with the global 3D planes to strengthen multi-view consistency.
Iteratively optimize the Gaussian representation: A two-stage training strategy first initializes the Gaussians from the initial geometry and then refines and extends the Gaussian representation through a geometry-guided generative training loop, progressively recovering unseen regions and correcting geometric misalignments.

This addresses the shape-appearance ambiguities and degraded scene geometry of existing generative methods and yields accurate, consistent 3D scene reconstruction that generalizes well.

4.2. Method Details

4.2.1. Background (Section 3.1 of the paper)

4.2.1.1. 2D Gaussian Splatting (2DGS)

2DGS is a variant of 3DGS that collapses 3D volumetric Gaussians into 2D anisotropic disks. Each 2D Gaussian carries an opacity $\alpha$ and a view-dependent color $\mathbf{c}$ represented with spherical harmonics.

Gaussian function: In 2DGS, the value at a point $\mathbf{u} = (u, v)$ on a 2D Gaussian disk is $g(\mathbf{u}) = \exp\left(-\frac{u^2 + v^2}{2}\right)$, where $u$ and $v$ are coordinates relative to the Gaussian center.

Alpha rendering: During rasterization, the Gaussians are sorted by depth and composited into the final image with front-to-back alpha blending. Given $N$ Gaussians, the color $\mathbf{c}(\mathbf{x})$ of pixel $\mathbf{x}$ is $\mathbf{c}(\mathbf{x}) = \sum_{i=1}^{N} \mathbf{c}_i \alpha_i g_i\big(\mathbf{u}(\mathbf{x})\big) \prod_{j=1}^{i-1} \big[1 - \alpha_j g_j\big(\mathbf{u}(\mathbf{x})\big)\big]$, where:

$\mathbf{c}_i$: view-dependent color of the $i$-th Gaussian.
$\alpha_i$: opacity of the $i$-th Gaussian.
$g_i(\mathbf{u}(\mathbf{x}))$: value of the $i$-th Gaussian at pixel $\mathbf{x}$.
$\mathbf{u}(\mathbf{x})$: the $(u, v)$ coordinate at which the camera ray through pixel $\mathbf{x}$ intersects the 2D Gaussian disk.
$\prod_{j=1}^{i-1} \big[1 - \alpha_j g_j\big(\mathbf{u}(\mathbf{x})\big)\big]$: accumulated transmittance, i.e., the fraction of light that the first $i-1$ Gaussians leave unblocked at this pixel.

Depth rendering: The depth map $d(\mathbf{x})$ is computed the same way, with each color replaced by the Z-buffer value of the corresponding Gaussian: $d(\mathbf{x}) = \sum_{i=1}^{N} d_i \alpha_i g_i\big(\mathbf{u}(\mathbf{x})\big) \prod_{j=1}^{i-1} \big[1 - \alpha_j g_j\big(\mathbf{u}(\mathbf{x})\big)\big]$, where $d_i$ is the Z-buffer value of the $i$-th 2D Gaussian disk.
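The color and depth formulas share the same blending weights, which a short per-pixel sketch makes concrete. The code below is an illustrative pure-Python version, not the rasterizer used by 2DGS or G4Splat; the dictionary layout of a splat and the function names are assumptions made for this example.

```python
import math


def gaussian_weight(u: float, v: float) -> float:
    """Unit 2D Gaussian g(u) = exp(-(u^2 + v^2) / 2) in disk coordinates."""
    return math.exp(-(u * u + v * v) / 2.0)


def composite(splats):
    """Front-to-back alpha blending for a single pixel.

    `splats` must already be sorted near-to-far. Each entry is a dict with
    keys 'color' (r, g, b), 'alpha', 'uv' (u, v) and 'depth' (hypothetical
    layout chosen for this sketch).
    Returns (color, depth, accumulated_opacity).
    """
    color = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # product of (1 - alpha_j * g_j) over splats in front
    for s in splats:
        w = s["alpha"] * gaussian_weight(*s["uv"])
        contribution = w * transmittance
        for channel in range(3):
            color[channel] += contribution * s["color"][channel]
        depth += contribution * s["depth"]
        transmittance *= 1.0 - w
    return tuple(color), depth, 1.0 - transmittance
```

An opaque splat hit at its center drives the transmittance to zero, so everything behind it contributes nothing to either color or depth.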

4.2.1.2. Chart Representation and Alignment in MAtCha

MAtCha (Guédon et al., 2025) represents a 3D scene with charts. Each chart is parameterized by a lightweight deformation model that balances flexibility and efficiency. The model is defined by a sparse 2D grid of learnable features in UV space and a small MLP that maps these features to 3D deformation vectors. To handle depth discontinuities at object boundaries, the model also includes depth-dependent features.

MAtCha jointly optimizes the chart deformations in an alignment stage to ensure geometric consistency. The objective of this stage includes:

Fitting loss: Aligns the charts with sparse SfM points by minimizing the distance from the SfM points to the deformed charts: $\mathcal{L}_{\mathrm{fit}} = \frac{1}{n} \sum_{i=0}^{n-1} \sum_{k=0}^{m_i - 1} C_i(u_{ik}) \| \psi_i(u_{ik}) - p_{ik} \|_1 - \alpha \sum_{i=0}^{n-1} \log(C_i)$, where:
- $n$: number of charts.
- $m_i$: number of SfM points visible in image $i$.
- $C_i(u_{ik})$: learnable confidence map that down-weights unreliable SfM points.
- $\psi_i(u_{ik})$: 3D position of UV coordinate $u_{ik}$ on chart $i$ after deformation.
- $p_{ik}$: 3D position of the SfM point.
- $\| \cdot \|_1$: $L_1$ norm.
- $\alpha$: weight of the confidence regularizer (unrelated to the opacity $\alpha$ in 2DGS).
Structure loss: Preserves the sharp geometric structures captured by the initial depth maps by regularizing the surface normals and mean curvature of the deformed charts toward their initial values: $\mathcal{L}_{\mathrm{struct}} = \sum_{i=0}^{n-1} \left(1 - \boldsymbol{N}_i \cdot \boldsymbol{N}_i^{(0)}\right) + \frac{1}{4} \sum_{i=0}^{n-1} \| \boldsymbol{M}_i - \boldsymbol{M}_i^{(0)} \|_1$, where:
- $\boldsymbol{N}_i$ and $\boldsymbol{M}_i$: surface normal and mean curvature of deformed chart $i$.
- $\boldsymbol{N}_i^{(0)}$ and $\boldsymbol{M}_i^{(0)}$: the corresponding values derived from the initial depth maps.
- $\cdot$: dot product.
- $\| \cdot \|_1$: $L_1$ norm.
Mutual alignment loss: Enforces global coherence by encouraging neighboring charts to align, minimizing the distance between projected overlapping points: $\mathcal{L}_{\mathrm{align}} = \sum_{i,j=0}^{n-1} \sum_{u \in V_i} \min\left(\| \psi_i(u) - \psi_j \circ P_j \circ \psi_i(u) \|_1, \tau\right)$, where:
- $V_i$: set of UV coordinates sampled on chart $i$.
- $P_j$: projection from 3D space to the UV domain of chart $j$.
- $\tau$: attraction threshold that caps the alignment distance.

Final objective: The total objective combines the three losses: $\mathcal{L} = \mathcal{L}_{\mathrm{fit}} + \lambda_{\mathrm{struct}} \mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}}$, with weights $\lambda_{\mathrm{struct}} = 4$ and $\lambda_{\mathrm{align}} = 5$. MAtCha depends heavily on accurate image correspondences and shows significant errors where matches are poor or missing, which motivates the plane-aware geometry modeling in G4Splat.
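To make the roles of the confidence weight, the threshold $\tau$, and the loss weights concrete, here is a minimal pure-Python sketch of the three-term objective. It is not MAtCha's implementation: the function names are hypothetical, points are plain 3-tuples, and the confidence penalty is applied per sampled point, which is one assumed reading of the $\log(C_i)$ term.

```python
import math


def l1_distance(a, b):
    """L1 distance between two 3D points given as tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def fitting_loss(chart_points, sfm_points, confidences, alpha):
    """Confidence-weighted L1 fit plus a -alpha * log(confidence) penalty.

    Assumption: the log penalty is summed over the sampled confidences.
    The penalty keeps the optimizer from driving every confidence to zero.
    """
    data = sum(c * l1_distance(q, p)
               for q, p, c in zip(chart_points, sfm_points, confidences))
    penalty = -alpha * sum(math.log(c) for c in confidences)
    return data + penalty


def alignment_loss(points_on_i, reprojected_on_j, tau):
    """Truncated L1 distance: pairs farther apart than tau contribute only tau."""
    return sum(min(l1_distance(a, b), tau)
               for a, b in zip(points_on_i, reprojected_on_j))


def total_loss(l_fit, l_struct, l_align, lam_struct=4.0, lam_align=5.0):
    """Weighted sum with the weights quoted in the text."""
    return l_fit + lam_struct * l_struct + lam_align * l_align
```

The truncation in `alignment_loss` means that a pair of points far apart (for instance on different surfaces) stops pulling the charts together once its distance exceeds `tau`.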

MAtCha then adds a Gaussian surfel refinement stage based on 2DGS to improve the reconstruction of fine-grained scene structures. In this stage the surfels are optimized iteratively with a joint loss that combines photometric consistency and geometric regularization terms.

RGB loss: The photometric loss is a weighted combination of an L1 loss and D-SSIM: $\mathcal{L}_{\mathrm{rgb}} = (1 - \lambda) \mathcal{L}_1 + \lambda \mathcal{L}_{\mathrm{D\text{-}SSIM}}$, with $\lambda = 0.2$.

Regularization loss: The regularization loss consists of the distortion loss and the depth-normal consistency loss, both introduced in 2DGS.

Distortion loss: Prevents surfel drift and enforces cross-chart consistency: $\mathcal{L}_d = \sum_{i,j} \omega_i \omega_j | z_i - z_j |$, where $z_i$ is the intersection depth of the $i$-th surfel and $\omega_i$ is its blending weight.
Depth-normal consistency loss: Encourages surface orientations to agree across charts: $\mathcal{L}_n = \sum_i \omega_i (1 - \mathbf{n}_i^{\top} \mathbf{N}_p)$, where $\mathbf{n}_i$ is the surfel normal and $\mathbf{N}_p$ is the normal derived from the depth gradient.

Overall regularization loss: $\mathcal{L}_{\mathrm{reg}} = \lambda_d \mathcal{L}_d + \lambda_n \mathcal{L}_n$, with weights $\lambda_d = 500$ and $\lambda_n = 0.25$.
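The two regularizers are easy to evaluate for a single ray, which helps show what each one penalizes. The following is a small pure-Python sketch under the assumption that the blending weights, intersection depths, and normals along one ray are already available; the function names are made up for this example.

```python
def distortion_loss(weights, depths):
    """Sum over all pairs of w_i * w_j * |z_i - z_j| along one ray.

    It is zero when all weighted surfels sit at the same depth and grows
    when the weight is spread over surfels at different depths.
    """
    total = 0.0
    for w_i, z_i in zip(weights, depths):
        for w_j, z_j in zip(weights, depths):
            total += w_i * w_j * abs(z_i - z_j)
    return total


def normal_consistency_loss(weights, surfel_normals, depth_normal):
    """Sum of w_i * (1 - n_i . N_p), with unit normals given as 3-tuples."""
    total = 0.0
    for w, n in zip(weights, surfel_normals):
        cosine = sum(a * b for a, b in zip(n, depth_normal))
        total += w * (1.0 - cosine)
    return total


def regularization_loss(weights, depths, surfel_normals, depth_normal,
                        lam_d=500.0, lam_n=0.25):
    """Weighted sum using the weights quoted in the text."""
    return (lam_d * distortion_loss(weights, depths)
            + lam_n * normal_consistency_loss(weights, surfel_normals, depth_normal))
```

With two equally weighted surfels 0.2 apart in depth and normals that already match the depth-derived normal, only the distortion term is non-zero.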

Structure loss: The structure loss follows a formulation similar to Eq. (A5): $\mathcal{L}_{\mathrm{struct}} = \sum_{i=0}^{n-1} \lVert \bar{D}_i - D_i \rVert_1 + \sum_{i=0}^{n-1} \left(1 - \bar{N}_i \cdot N_i\right) + \frac{1}{4} \sum_{i=0}^{n-1} \lVert \bar{M}_i - M_i \rVert_1$, where:

$\bar{D}_i, \bar{N}_i, \bar{M}_i$: depth, normal, and mean curvature rendered from the Gaussian surfels.
$D_i, N_i, M_i$: the corresponding values obtained from the charts.

Total refinement loss: $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rgb}} + \mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{struct}}$. In every round of 2DGS training, G4Splat uses the same total loss as MAtCha but computes $D_i, N_i, M_i$ in the structure loss from its plane-aware geometry modeling. This introduces stronger geometric constraints and gives more accurate and more consistent reconstruction.

4.2.2. Plane-Aware Geometry Modeling (Section 3.2 of the paper)

One core element of G4Splat is building plane-aware, scale-accurate geometry. This takes three main steps.

4.2.2.1. Per-View 2D Plane Extraction

Following prior work, the authors assume that planar regions in an image have consistent normal directions, smooth geometry, and similar semantics.

Normal map clustering: Run K-means clustering on the normal map to obtain regions with coherent surface orientation.
SAM mask filtering: Filter these regions with instance masks generated by SAM (Segment Anything Model) (Kirillov et al., 2023).
Valid plane mask identification: Only regions that share the same instance label and exceed a predefined size threshold are kept as valid 2D plane masks.

As shown in Figure 4 (Figure 3a in the paper), this process extracts local planes from images accurately.
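The filtering step can be sketched once the two label maps exist. The pure-Python example below assumes that K-means over normals and SAM have already produced per-pixel integer labels; it only shows how the two labelings are intersected and thresholded by size. The function name, the `-1` convention for unlabeled pixels, and the omission of a connectivity check are simplifications made for this sketch.

```python
from collections import defaultdict


def extract_plane_masks(cluster_ids, instance_ids, min_pixels):
    """Intersect normal clusters with instance masks and keep large regions.

    cluster_ids, instance_ids: H x W nested lists of ints, where -1 marks a
    pixel without a valid label (assumed convention).
    Returns a list of masks, each a list of (row, col) pixel coordinates.
    """
    groups = defaultdict(list)
    for y, (cluster_row, instance_row) in enumerate(zip(cluster_ids, instance_ids)):
        for x, (cluster, instance) in enumerate(zip(cluster_row, instance_row)):
            if cluster >= 0 and instance >= 0:
                # Same normal cluster AND same instance -> same plane candidate.
                groups[(cluster, instance)].append((y, x))
    return [pixels for pixels in groups.values() if len(pixels) >= min_pixels]
```

A production version would also split each group into connected components, since two separate patches can share both a normal cluster and an instance label.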

4.2.2.2. Global 3D Plane Estimation

The 2D plane masks extracted from individual views are usually over-segmented and lack global consistency, so the same 3D plane appears fragmented across views. G4Splat therefore uses the 3D scene point cloud to establish correspondences between local masks and merges them into globally consistent 3D planes.

Associating 3D points: For each per-view 2D plane mask, collect the associated 3D points from the scene point cloud by projection.
Merging criteria: Two local planes are merged into the same global 3D plane if their associated 3D point sets overlap sufficiently in space and have similar normal directions.
Improving point cloud accuracy: To handle the occlusion and sparsity problems of point cloud projection (see Section C.1 of the paper):
- First render depth maps for all views from the Gaussian surfels.
- Then back-project these depth maps to rebuild the 3D scene point cloud, so that every pixel has a valid surface point.
- When projecting back into a given view, require the depth of a point to deviate from the Gaussian-rendered depth by at most $1\%$ (relative), which handles occlusion.
Robust parameter estimation: Repeating this process over all views yields a set of global point collections $\{\mathcal{P}_k\}$, each representing one global 3D plane. Each global 3D plane $\Phi_k$ is written as $\Phi_k: \mathbf{n}_k^{\top} \mathbf{x} + d_k = 0$, where:
- $\mathbf{n}_k \in \mathbb{R}^3$: unit normal vector.
- $d_k \in \mathbb{R}$: offset.
To estimate the plane parameters robustly, a high-confidence subset $\mathcal{P}_k^{\mathrm{conf}} \subset \mathcal{P}_k$ is selected (points observed in at least three views), and RANSAC (Fischler & Bolles, 1981) is used to estimate the plane parameters on $\mathcal{P}_k^{\mathrm{conf}}$ by minimizing $\min_{\mathbf{n}_k, d_k} \sum_{\mathbf{p} \in \mathcal{P}_k^{\mathrm{conf}}} (\mathbf{n}_k^{\top} \mathbf{p} + d_k)^2, \quad \mathrm{s.t.}\ \| \mathbf{n}_k \| = 1$. This produces geometrically accurate and cross-view consistent 3D plane estimates, which give the later optimization a reliable geometric basis.
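As an illustration of the robust estimation step, here is a minimal pure-Python RANSAC plane fit. It is a sketch under stated assumptions rather than the paper's procedure: the inlier threshold, iteration count, and seed are arbitrary example values, and the final least-squares refit on the inliers (which would minimize the objective above) is omitted to keep the example dependency-free.

```python
import random


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_from_three(p0, p1, p2):
    """Plane n.x + d = 0 through three points, or None if they are collinear."""
    n = _cross(_sub(p1, p0), _sub(p2, p0))
    length = _dot(n, n) ** 0.5
    if length < 1e-12:
        return None
    n = tuple(c / length for c in n)
    return n, -_dot(n, p0)


def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Return ((normal, offset), inliers) for the plane with the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = plane_from_three(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        # Point-to-plane distance is |n.p + d| because n has unit length.
        inliers = [p for p in points if abs(_dot(n, p) + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

Restricting the input to points seen in at least three views, as the text describes, would happen before calling `ransac_plane`.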

4.2.2.3. Plane-Aware Depth Map Extraction

With the estimated global 3D planes, a plane-aware depth map $D^v$ is extracted for each view $v$.

Depth for planar regions: Suppose view $v$ contains $M$ 2D plane masks $\{P_i^v\}_{i=1}^{M}$, each associated with a global 3D plane $\Phi_{k_i}$. For each pixel $\mathbf{u}$ in mask $P_i^v$, cast a ray from the camera center $\mathbf{o}^v$ along the ray direction $\mathbf{r}^v(\mathbf{u})$ and compute its depth by intersecting it with $\Phi_{k_i}$: $D_i^v(\mathbf{u}) = \frac{-\mathbf{n}_{k_i}^{\top} \mathbf{o}^v - d_{k_i}}{\mathbf{n}_{k_i}^{\top} \mathbf{r}^v(\mathbf{u})}$, where:
- $\mathbf{n}_{k_i}$: unit normal vector of the global 3D plane $\Phi_{k_i}$.
- $d_{k_i}$: offset of the global 3D plane $\Phi_{k_i}$.
- $\mathbf{o}^v$: camera center of view $v$.
- $\mathbf{r}^v(\mathbf{u})$: direction of the ray from the camera center through pixel $\mathbf{u}$.
Depth adjustment for non-planar regions: For the non-planar regions of view $I^v$, a pre-trained monocular depth estimator (Yang et al., 2024) predicts a relative depth map $\hat{D}^v$, which is then converted to metric scale with a linear transform: $D^v(\mathbf{u}) = a_v \hat{D}^v(\mathbf{u}) + b_v$, where the scale $a_v$ and offset $b_v$ are estimated by least-squares fitting on the pixels of the planar regions. The final depth map $D^v$ combines the geometry-consistent plane depths with the rescaled monocular predictions, giving each view a complete, scale-accurate, plane-aware depth representation. This markedly reduces the errors of MAtCha in non-overlapping regions (see Figure 4, Figure 3a in the paper).
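Both depth computations are closed-form, so they can be checked by hand. The sketch below is an illustrative pure-Python version; the function names are assumptions, and the ray-plane result is a distance along the ray in units of the ray vector, so it equals z-depth only if the ray direction is scaled to have a unit z-component in camera coordinates.

```python
def plane_depth(normal, offset, cam_center, ray_dir):
    """Ray parameter t where o + t * r meets the plane n.x + d = 0.

    Returns None when the ray is (nearly) parallel to the plane or the
    intersection lies behind the camera.
    """
    denom = sum(n * r for n, r in zip(normal, ray_dir))
    if abs(denom) < 1e-9:
        return None
    t = (-sum(n * o for n, o in zip(normal, cam_center)) - offset) / denom
    return t if t > 0.0 else None


def fit_scale_offset(relative_depths, plane_depths):
    """Least-squares a, b minimizing sum((a * r + b - d)^2) over planar pixels."""
    count = len(relative_depths)
    mean_r = sum(relative_depths) / count
    mean_d = sum(plane_depths) / count
    variance = sum((r - mean_r) ** 2 for r in relative_depths)
    if variance == 0.0:
        raise ValueError("relative depths are constant; scale is undetermined")
    covariance = sum((r - mean_r) * (d - mean_d)
                     for r, d in zip(relative_depths, plane_depths))
    a = covariance / variance
    return a, mean_d - a * mean_r


def to_metric(relative_depth, a, b):
    """Apply D = a * D_hat + b to a non-planar pixel."""
    return a * relative_depth + b
```

In use, `fit_scale_offset` would be called once per view with the monocular predictions and the plane-derived depths at planar pixels, and `to_metric` would then be applied to the remaining pixels of that view.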

4.2.3. Geometry-Guided Generative Pipeline (Section 3.3 of the paper)

With the improved geometric basis in place, G4Splat integrates geometry guidance into the generative refinement loop to reduce shape-appearance ambiguities.

4.2.3.1. Geometry-Guided Visibility

Existing methods derive inpainting masks from alpha maps, which often introduce errors in visible regions and lower the quality of the inpainting results. G4Splat instead uses the scale-accurate, plane-aware depth to model scene visibility with a visibility grid.

Determining 3D boundaries: First determine the 3D bounds of the scene from the depth maps of all training views.
Scene discretization: Discretize the scene into a voxel grid $\mathcal{G}$.
Voxel visibility assessment: For each voxel, project its center into every training view and check whether it falls within the valid depth range. A voxel that lies within the observable depth range of at least one view is marked visible (visibility value 1).
Pixel-wise visibility computation: Using the visibility grid and the corresponding GS-rendered depth map, render a visibility map for each novel view. For each pixel, cast a ray from the camera center through the pixel and sample $Q$ points uniformly along the ray up to the rendered depth. The visibility value $v_q$ of each sample $q$ is obtained by nearest-neighbor interpolation on the visibility grid $\mathcal{G}$. The per-pixel visibility $V^v(\mathbf{u})$ is then $V^v(\mathbf{u}) = \prod_{q=1}^{Q} v_q$, where:
- $Q$: number of points sampled along the ray.
- $v_q$: visibility value of the $q$-th sample.
A pixel therefore counts as visible only if all $Q$ samples along its viewing ray are marked visible in $\mathcal{G}$. This yields more reliable inpainting masks and avoids the errors of alpha maps (see Figure 4, Figure 3b in the paper).
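A compact sketch of the visibility grid may help here. The pure-Python code below is illustrative only: the grid is stored as a set of integer voxel indices, each training view is modeled as a hypothetical callable that returns the depth of a 3D point and the observed surface depth at its projection, and "within the valid depth range" is interpreted as "in front of the camera and not behind the observed surface", which is an assumption.

```python
import math


def mark_visible(voxel_centers, views):
    """Return the set of voxel indices seen by at least one training view.

    voxel_centers: dict mapping voxel index (i, j, k) -> center (x, y, z).
    views: callables mapping a 3D point to (point_depth, surface_depth),
           or None when the point projects outside the image (assumed API).
    """
    visible = set()
    for index, center in voxel_centers.items():
        for view in views:
            hit = view(center)
            if hit is not None and 0.0 < hit[0] <= hit[1]:
                visible.add(index)
                break
    return visible


def voxel_index(point, origin, voxel_size):
    """Index of the voxel containing `point` (nearest-neighbor lookup)."""
    return tuple(int(math.floor((p - o) / voxel_size))
                 for p, o in zip(point, origin))


def pixel_visibility(cam_center, ray_dir, rendered_depth, visible_voxels,
                     origin, voxel_size, num_samples=8):
    """Product of sample visibilities: 1 only if every sample is visible."""
    for q in range(1, num_samples + 1):
        t = rendered_depth * q / num_samples
        point = tuple(c + t * r for c, r in zip(cam_center, ray_dir))
        if voxel_index(point, origin, voxel_size) not in visible_voxels:
            return 0
    return 1
```

Pixels with visibility 0 form the inpainting mask for a novel view, since at least part of the ray up to the rendered surface was never observed.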

4.2.3.2. Plane-Aware Novel View Selection

Conventional novel view selection strategies, such as an elliptical trajectory around the scene center, usually provide only limited local coverage, so the inpainting results introduce visible artifacts into the final reconstruction. G4Splat proposes a plane-aware view selection strategy that uses global 3D planes as object proxies to ensure that the selected views cover objects completely. Global 3D planes usually provide enough structural and textural cues for reliable inpainting.

Objective: For each global 3D plane, its centroid serves as the look-at target, and the camera center is searched among the visible grid centers of the visibility grid. The selection is guided by three goals:

Maximize the coverage of plane points.
Minimize the distance to the plane.
Encourage the viewing direction to align with the plane normal.

The camera center $\mathbf{c}^*$ is chosen as $\mathbf{c}^* = \arg\max_{\mathbf{c} \in \mathcal{C}} \Big( R(\mathbf{c}) + \big| \cos\theta(\mathbf{c}, \mathbf{p}, \mathbf{n}) \big| - D(\mathbf{c}, \mathbf{p}, \mathbf{n}) \Big)$, where:

$\mathcal{C}$: set of visible voxel centers in the visibility grid.
$R(\mathbf{c})$: ratio of the number of plane points visible from $\mathbf{c}$ to the total number of plane points.
$\theta(\mathbf{c}, \mathbf{p}, \mathbf{n})$: angle between the viewing direction $(\mathbf{p} - \mathbf{c})$ and the plane normal $\mathbf{n}$.
$\mathbf{p}$: center of the 3D plane (the look-at target).
$\mathbf{n}$: normal of the plane.
$D(\mathbf{c}, \mathbf{p}, \mathbf{n})$: distance from the camera center $\mathbf{c}$ to the plane.

除了这种 平面感知 (plane-aware) 策略外，还结合了围绕场景中心的 椭圆轨迹 (elliptical trajectory) (Wu et al., 2025c)，以进一步增加视图多样性。
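The argmax over candidate camera centers can be illustrated with a short NumPy sketch. It assumes the coverage ratio $R(\mathbf{c})$ has already been computed for every candidate (in practice that would need a visibility test against the grid), takes $D$ to be the point-to-plane distance, and combines the three terms unweighted as in the formula. The function name `select_camera_center` and its arguments are hypothetical.

```python
import numpy as np

def select_camera_center(candidates, coverage, plane_center, plane_normal):
    """Pick argmax_c R(c) + |cos(theta)| - D over candidate centers.

    candidates   : (N, 3) visible voxel centers (the set C)
    coverage     : (N,) precomputed coverage ratio R(c) in [0, 1] (assumed given)
    plane_center : (3,) plane centroid p, used as the look-at target
    plane_normal : (3,) plane normal n (normalized here)
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    view = plane_center[None, :] - candidates            # viewing directions p - c
    view_len = np.linalg.norm(view, axis=1)
    cos_theta = np.abs(view @ n) / np.maximum(view_len, 1e-12)
    dist = np.abs((candidates - plane_center[None, :]) @ n)  # point-to-plane distance
    score = coverage + cos_theta - dist
    best = int(np.argmax(score))
    return candidates[best], score
```

Since $R$ and $|\cos\theta|$ are bounded by 1 while $D$ is expressed in scene units, the distance term dominates for far-away candidates; whether the paper rescales or weights the terms is not stated in this summary.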

4.2.3.3. Geometry-Guided Inpainting

For each novel view $v$, the raw RGB map $\tilde{I}^v$ (Eq. A2, i.e., the alpha-rendered output of 2DGS) and the visibility mask $V^v$ (Eq. 6) are rendered. A pretrained video diffusion model (Ma et al., 2025; e.g., See3D or Stable Virtual Camera) then takes the reference images $\{I^{i}\}$ together with $\{\tilde{I}^{v}, V^{v}\}$ as input and jointly inpaints the occluded regions in all views, producing the completed images $\{\hat{I}^{v}\}$.

To mitigate multi-view inconsistencies:

Color Supervision Modulation: For each global 3D plane, color supervision relies mainly on the view that observes the plane most completely, which reduces cross-view conflicts.
Depth Projection Correspondence: Before training the Gaussian representation, correspondences between views are established by depth projection.
Selecting Color Supervision View: If a region lies on a 3D plane, the view that observes that plane most completely is selected; for non-planar regions, the view in which the region is first observed is selected.

These operations are executed in parallel via geometric projection and run only once as preprocessing before Gaussian training, so the computational overhead is small. As shown in Figure 9 (Figure A3 in the paper), this markedly reduces the effect of multi-view inconsistencies and yields cleaner, sharper renderings.

4.2.4. Overall Training Strategy (Section 3.4 of the paper)

The G4Splat training pipeline has two stages: an initialization stage and a geometry-guided generative training loop.

4.2.4.1. Initialization Stage

Initial Depth Maps: The chart alignment of MAtCha is applied first to obtain an initial depth map for each input view.
Global 3D Planes and Plane-Aware Depth Maps: Global 3D planes are estimated from these depth maps, and plane-aware depth maps are computed (as described in Section 3.2).
Gaussian Initialization and Training: The Gaussian parameters are initialized from the point cloud of the depth maps and trained with the plane-aware depth maps. This produces a baseline model with accurate geometry in the regions observed by the input views.

4.2.4.2. Geometry-Guided Generative Training Loop

The second stage refines and extends the reconstruction iteratively (as described in Section 3.3). As shown in Figure 3 (Figure 2 in the paper), each loop consists of:

Constructing Visibility Grid: Build the visibility grid from the current training views.
Selecting Novel Viewpoints and Inpainting: Select novel viewpoints and inpaint their invisible regions.
Merging into Training Set: Add the inpainted novel views to the training set.
Recomputing Planes and Depths: Recompute the global 3D planes and plane-aware depths.
Gaussian Fine-tuning: Fine-tune the Gaussians with the updated supervision.

Repeating this loop progressively recovers unseen regions and corrects geometric misalignments.
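The control flow of the loop can be written down as a schematic. The sketch below only fixes the order of the five steps; every step is a user-supplied callable, and the dictionary keys, the `state` object, and the function name `generative_training_loop` are hypothetical placeholders rather than anything from the G4Splat code. The default of three loops follows the number the authors report using in their experiments.

```python
from typing import Any, Callable, Dict

def generative_training_loop(state: Any,
                             steps: Dict[str, Callable],
                             num_loops: int = 3) -> Any:
    """Run the five-step loop `num_loops` times and return the final state.

    `state` bundles whatever the steps need (training views, planes,
    Gaussians); its structure is left to the caller.
    """
    for _ in range(num_loops):
        # 1. Visibility grid from the current training views
        grid = steps["build_visibility_grid"](state)
        # 2. Choose novel viewpoints and inpaint their invisible regions
        views = steps["select_and_inpaint"](state, grid)
        # 3. Add the inpainted views to the training set
        state = steps["merge_views"](state, views)
        # 4. Re-estimate global 3D planes and plane-aware depths
        state = steps["recompute_planes_and_depths"](state)
        # 5. Fine-tune the Gaussians with the updated supervision
        state = steps["finetune_gaussians"](state)
    return state
```

Passing the steps in as callables keeps the sketch independent of any particular plane estimator or diffusion model.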

Loss Function: In every round of 2DGS training, G4Splat uses the same total loss as MAtCha (Eq. (1)), but the chart depth maps are enhanced by plane-aware geometry modeling, which imposes stronger geometric constraints and leads to more accurate and consistent reconstruction. Concretely, in MAtCha's total refinement loss $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rgb}} + \mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{struct}}$, G4Splat computes the $D_i, N_i, M_i$ terms of the structure loss $\mathcal{L}_{\mathrm{struct}}$ from its plane-aware geometry model; that is, the plane-aware depth, normal, and mean curvature replace the corresponding values that MAtCha obtains from its charts.

In the experiments, the authors use three generative training loops.

4.3. Plane-Aware Novel View Selection (Appendix C.2 of the paper)

As described in Section 3.3, G4Splat formulates camera center selection in plane-aware novel view selection as a search problem. The goal is to maximize the coverage of plane points, minimize the distance from the camera to the plane, and encourage the camera viewing direction to align with the corresponding plane normal.

Optimization Problem for Camera Center Selection: Let the camera center be $\mathbf{c}$, the look-at point be $\mathbf{p}$ (the center of the 3D plane), and the plane normal be $\mathbf{n}$. The optimization problem is: $\mathbf{c}^{*} = \arg\max_{\mathbf{c} \in \mathcal{C}} \Big( R(\mathbf{c}) + \big|\cos\theta(\mathbf{c}, \mathbf{p}, \mathbf{n})\big| - D(\mathbf{c}, \mathbf{p}, \mathbf{n}) \Big)$ where:

$\mathcal{C}$ : the set of visible voxel centers in the visibility grid.
$R(\mathbf{c})$ : the ratio of the number of plane points visible from $\mathbf{c}$ to the total number of plane points. This term maximizes the view's coverage of the plane.
$\theta(\mathbf{c}, \mathbf{p}, \mathbf{n})$ : the angle between the viewing direction $(\mathbf{p} - \mathbf{c})$ and the plane normal $\mathbf{n}$. The $|\cos\theta|$ term encourages the viewing direction to align with the plane normal (the camera faces the plane head-on), which helps capture the plane's frontal appearance.
$D(\mathbf{c}, \mathbf{p}, \mathbf{n})$ : the distance from the camera center to the plane. The term enters with a negative sign, so the distance is minimized and the camera stays close to the plane for a more detailed view.

In addition to this plane-aware strategy, the authors also use an elliptical trajectory around the scene center to further increase view diversity, similar to prior work (Wu et al., 2025c).

4.4. Geometry-Guided Inpainting (Appendix C.3 of the paper)

Multi-view inconsistent inpainting results cause artifacts such as black shadows when training Gaussian representations (see Figure 9, which corresponds to Figure A3 in the paper); the same effect is observed in methods such as GuidedVD (Zhong et al., 2025). GuidedVD tries to preserve observed regions by constraining the diffusion denoising process, which reduces inconsistencies near those regions, but it performs poorly on large missing regions and trains slowly.

G4Splat mitigates multi-view inconsistencies by integrating scale-accurate geometry guidance throughout the generative pipeline (as described in Section 3.3). Within this pipeline, it also introduces a strategy that modulates color supervision based on scale-accurate depth to reduce inconsistencies in the inpainting results.

Strategy Details:

Correspondence Across Views: Before training the Gaussian representation, correspondences between views are first established by depth projection.
Selecting Dominant Color Supervision View: For each region, color supervision relies mainly on a single view:
- Planar Regions: If the region lies on a 3D plane, the view that observes that plane most completely is selected, which keeps the inpainting consistent.
- Non-planar Regions: For non-planar regions, the view in which the region is first observed is selected.

These operations are executed in parallel via geometric projection and run only once as a preprocessing step before Gaussian training, so the computational overhead is minimal. As shown in Figure 9 (Figure A3 in the paper), the approach markedly reduces the effect of multi-view inconsistencies and yields cleaner, sharper renderings.
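The view-selection rule for color supervision can be sketched as follows. The sketch assumes that depth projection has already produced, for every view and every region, a count of how many of the region's points that view observes; the array layout, the ordering of views by the time they were added, and the name `dominant_view_per_region` are illustrative assumptions, not the paper's data structures.

```python
import numpy as np

def dominant_view_per_region(observed, is_planar):
    """Choose one supervising view per region.

    observed  : (V, R) array; observed[v, r] = number of points of region r
                seen in view v, with views ordered by when they were added
    is_planar : (R,) boolean array; True if region r lies on a 3D plane
    Returns an (R,) integer array of view indices (-1 if never observed).
    """
    observed = np.asarray(observed)
    is_planar = np.asarray(is_planar, dtype=bool)
    seen = observed > 0
    # Planar regions: the view with the most complete observation
    most_complete = np.argmax(observed, axis=0)
    # Non-planar regions: the first view in which the region appears
    first_seen = np.argmax(seen, axis=0)
    choice = np.where(is_planar, most_complete, first_seen)
    return np.where(seen.any(axis=0), choice, -1)
```

Because the selection is a pair of column-wise reductions, it runs once before training and adds little cost, in line with the overhead claim above.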

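5. Experimental Setup

5.1. Datasets

The experiments are evaluated on the following three datasets:

Replica (Straub et al., 2019): A synthetic indoor dataset with 8 indoor scenes. It provides high-quality RGB-D images and 3D scene geometry and is widely used for 3D reconstruction and novel view synthesis.
ScanNet++ (Yeshwanth et al., 2023): A real-world indoor dataset, of which the paper uses 6 scenes. ScanNet++ offers more realistic and complex indoor environments, with RGB-D scans and semantic annotations.
DeepBlending (Hedman et al., 2018): A real-world outdoor dataset, of which the paper uses 3 scenes. DeepBlending is known for challenging scenes with complex lighting and reflections.

View sampling strategy:

For each scene, 100 images are sampled uniformly.
ScanNet++ and DeepBlending: 5 images are randomly selected as input views, and the remaining 95 images serve as test views.
Replica: Three sets of experiments use 5, 10, and 15 images as input views to evaluate performance under different levels of view sparsity. All three settings share the same 85 test views so that the evaluation set stays consistent.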

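5.2. Evaluation Metrics

The paper uses reconstruction metrics and rendering metrics to evaluate the model comprehensively.

5.2.1. Rendering Metrics

These metrics assess the image quality of novel view synthesis.

PSNR (Peak Signal-to-Noise Ratio):
1. Definition: PSNR is an objective image quality measure that quantifies the noise level from the pixel-wise difference between the original and the processed image. A higher PSNR means less distortion and better quality. It is commonly used to evaluate lossy compression and reconstruction algorithms.
2. Formula: $\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{\mathrm{MSE}}\right)$ where MSE (Mean Squared Error) is defined as: $\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$
3. Symbols:
  - $MAX_I$ : the maximum possible pixel value of the image; 255 for 8-bit images.
  - MSE: the mean squared error between the original image $I$ and the reconstructed image $K$.
  - m, n: the number of rows (height) and columns (width) of the image.
  - I(i,j): the pixel value of the original image at coordinate (i,j).
  - K(i,j): the pixel value of the reconstructed image at coordinate (i,j).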
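The PSNR formula translates directly into NumPy. This is a generic sketch of the standard definition, not the evaluation script used in the paper; the function name `psnr` and the `max_value` default for 8-bit images are choices made here.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    # Mean squared error over all pixels (and channels, if present)
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

For images stored as floats in [0, 1], `max_value` would be 1.0 instead of 255.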
SSIM (Structural Similarity Index Measure):
1. Definition: SSIM measures the similarity of two images in terms of luminance, contrast, and structure, which matches human visual perception better than pixel-wise error. An SSIM close to 1 means the two images are very similar and the quality is high.
2. Formula: $\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
3. Symbols:
  - x, y: the pixel values of the two images (or of local windows of the images).
  - $\mu_x, \mu_y$ : the mean pixel values of $x$ and $y$.
  - $\sigma_x^2, \sigma_y^2$ : the variances of $x$ and $y$.
  - $\sigma_{xy}$ : the covariance of $x$ and $y$.
  - $c_1 = (k_1 L)^2, c_2 = (k_2 L)^2$ : small constants that stabilize the division, where $L$ is the dynamic range of the pixel values (255 for 8-bit images) and $k_1, k_2$ are small constants (typically $k_1=0.01, k_2=0.03$).
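To make the formula concrete, the sketch below evaluates it once over the whole image. This is a simplification: SSIM as normally reported is computed over local windows and averaged, so the numbers from this function will not match a windowed implementation. The name `ssim_global` is chosen here for illustration.

```python
import numpy as np

def ssim_global(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM formula applied to global image statistics (single window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Identical inputs give exactly 1, and the value drops (and can become negative) as the covariance between the two images falls.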
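LPIPS (Learned Perceptual Image Patch Similarity):
1. Definition: LPIPS is a deep-learning-based image similarity measure. It extracts image features with a pretrained neural network (usually VGG or AlexNet) and computes the distance between those features. LPIPS is designed to match human perceptual judgments; a lower value means the two images are perceptually more similar. The paper uses the VGG network.
2. Formula: The paper does not give the LPIPS formula explicitly; the core idea is to measure the distance between two images in deep feature space. $\mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \|w_l \odot (\phi_l(x)_{hw} - \phi_l(y)_{hw})\|_2^2$
3. Symbols:
  - x, y: the two input images.
  - $\phi_l$ : the layer-$l$ activations of the pretrained network, unit-normalized along the channel dimension.
  - $w_l$ : the learned per-channel weight vector of layer $l$.
  - $H_l, W_l$ : the spatial dimensions of the layer-$l$ feature map.
  - $\odot$ : element-wise multiplication.
  - $\|\cdot\|_2$ : the $L_2$ norm.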

5.2.2. Reconstruction Metrics

These metrics assess the geometric accuracy of 3D scene reconstruction. Table A2 of the paper gives the detailed definitions.

Chamfer Distance (CD):
1. Definition: Chamfer Distance measures how similar two point sets are in shape. It averages the squared distance from each point in one set to its nearest point in the other set, does the same in the opposite direction, and combines the two. A lower CD means the two point clouds are more similar and the reconstructed geometry is more accurate.
2. Formula: Accuracy: $\mathrm{Accuracy} = \frac{1}{|P|} \sum_{\mathbf{p} \in P} \min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_2^2$ Completeness: $\mathrm{Completeness} = \frac{1}{|P^*|} \sum_{\mathbf{p}^* \in P^*} \min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_2^2$ CD is the mean of Accuracy and Completeness: $\mathrm{CD} = \frac{1}{2} (\mathrm{Accuracy} + \mathrm{Completeness})$
3. Symbols:
  - $P$ : the point cloud sampled from the predicted mesh.
  - $P^*$ : the point cloud sampled from the ground truth mesh.
  - $|\cdot|$ : the number of points in a point cloud.
  - $\mathbf{p} \in P$ : a point of the predicted point cloud.
  - $\mathbf{p}^* \in P^*$ : a point of the ground truth point cloud.
  - $\min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_2^2$ : the squared Euclidean distance from the predicted point $\mathbf{p}$ to its nearest point in the ground truth point cloud $P^*$.
  - $\min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_2^2$ : the squared Euclidean distance from the ground truth point $\mathbf{p}^*$ to its nearest point in the predicted point cloud $P$.
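The three quantities can be computed with a brute-force nearest-neighbor search, as in the sketch below. It follows the formulas as written here and is meant for small point sets only; for dense clouds a spatial index such as a KD-tree would replace the pairwise distance matrix. The function names are illustrative.

```python
import numpy as np

def _nearest_sq_dist(src, dst):
    """Squared Euclidean distance from each src point to its nearest dst point."""
    diff = src[:, None, :] - dst[None, :, :]          # (|src|, |dst|, 3)
    return np.min(np.sum(diff ** 2, axis=2), axis=1)

def chamfer_distance(pred, gt):
    """Return (accuracy, completeness, CD) for (N, 3) and (M, 3) point arrays."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    accuracy = _nearest_sq_dist(pred, gt).mean()       # prediction -> ground truth
    completeness = _nearest_sq_dist(gt, pred).mean()   # ground truth -> prediction
    return float(accuracy), float(completeness), float(0.5 * (accuracy + completeness))
```

Accuracy penalizes predicted points that lie far from the true surface, while completeness penalizes true surface that the prediction fails to cover.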
F-Score:
1. Definition: F-Score evaluates 3D reconstruction quality by combining Precision and Recall. It measures how much the predicted geometry overlaps with the ground truth geometry; a higher F-Score means a better reconstruction. A distance threshold is set, and only points within the threshold count as matched.
2. Formula: $\text{F-score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ Precision: $\mathrm{Precision} = \frac{1}{|P|} \sum_{\mathbf{p} \in P} \mathbb{I}(\min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_1 < \tau)$ Recall: $\mathrm{Recall} = \frac{1}{|P^*|} \sum_{\mathbf{p}^* \in P^*} \mathbb{I}(\min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_1 < \tau)$
3. Symbols:
  - $\mathbb{I}(\cdot)$ : the indicator function, 1 if the condition holds and 0 otherwise.
  - $\tau$ : the distance threshold, set to 5 cm in the paper.
  - $\|\cdot\|_1$ : the $L_1$ norm.
  - The remaining symbols are the same as for CD.
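A brute-force sketch of the thresholded F-score follows. It uses the $L_1$ distance and the 5 cm threshold exactly as the formulas above state them, and it assumes coordinates are in meters so that `tau=0.05` corresponds to 5 cm. The function name `f_score` is illustrative.

```python
import numpy as np

def _nearest_l1_dist(src, dst):
    """L1 distance from each src point to its nearest dst point."""
    diff = np.abs(src[:, None, :] - dst[None, :, :])   # (|src|, |dst|, 3)
    return np.min(np.sum(diff, axis=2), axis=1)

def f_score(pred, gt, tau=0.05):
    """Return (precision, recall, F-score) at distance threshold tau."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    precision = np.mean(_nearest_l1_dist(pred, gt) < tau)  # matched predicted points
    recall = np.mean(_nearest_l1_dist(gt, pred) < tau)     # matched ground truth points
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f = 2.0 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f)
```

The tables in this article report F-Score on a 0 to 100 scale, which corresponds to multiplying the returned value by 100.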
Normal Consistency (NC):
1. Definition: Normal Consistency measures how well the surface normals of the predicted geometry align with those of the ground truth geometry. It is evaluated through the dot product of the normal vectors at corresponding points. A higher NC means better normal alignment and more accurate geometric detail.
2. Formula: Normal Accuracy: $\mathrm{Normal\ Accuracy} = \frac{1}{|P|} \sum_{\mathbf{p} \in P} (\mathbf{n}_{\mathbf{p}}^\top \mathbf{n}_{\mathbf{p}^*}) \quad \text{s.t. } \mathbf{p}^* = \arg\min_{\mathbf{q} \in P^*} \|\mathbf{p} - \mathbf{q}\|_1$ Normal Completeness: $\mathrm{Normal\ Completeness} = \frac{1}{|P^*|} \sum_{\mathbf{p}^* \in P^*} (\mathbf{n}_{\mathbf{p}}^\top \mathbf{n}_{\mathbf{p}^*}) \quad \text{s.t. } \mathbf{p} = \arg\min_{\mathbf{q} \in P} \|\mathbf{p}^* - \mathbf{q}\|_1$ NC is the mean of Normal Accuracy and Normal Completeness.
3. Symbols:
  - $\mathbf{n}_{\mathbf{p}}$ : the normal vector at point $\mathbf{p}$.
  - $\mathbf{n}_{\mathbf{p}^*}$ : the normal vector at point $\mathbf{p}^*$.
  - $\mathbf{p}^* = \arg\min_{\mathbf{q} \in P^*} \|\mathbf{p} - \mathbf{q}\|_1$ : the nearest point to the predicted point $\mathbf{p}$ in the ground truth point cloud $P^*$.
  - $\mathbf{p} = \arg\min_{\mathbf{q} \in P} \|\mathbf{p}^* - \mathbf{q}\|_1$ : the nearest point to the ground truth point $\mathbf{p}^*$ in the predicted point cloud $P$.
  - $\top$ : vector transpose, used to form the dot product.
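The two directional terms can be sketched with the same brute-force matching. The sketch assumes unit-length normals are supplied alongside each point cloud and uses the $L_1$ nearest neighbor as in the formulas above; the function name `normal_consistency` is illustrative.

```python
import numpy as np

def _nearest_l1_index(src, dst):
    """Index of the L1-nearest dst point for each src point."""
    diff = np.abs(src[:, None, :] - dst[None, :, :])
    return np.argmin(np.sum(diff, axis=2), axis=1)

def normal_consistency(pred, pred_normals, gt, gt_normals):
    """Return (normal accuracy, normal completeness, NC)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    pred_normals = np.asarray(pred_normals, dtype=np.float64)
    gt_normals = np.asarray(gt_normals, dtype=np.float64)
    # Predicted point -> nearest ground truth point, dot product of normals
    to_gt = _nearest_l1_index(pred, gt)
    accuracy = np.mean(np.sum(pred_normals * gt_normals[to_gt], axis=1))
    # Ground truth point -> nearest predicted point, dot product of normals
    to_pred = _nearest_l1_index(gt, pred)
    completeness = np.mean(np.sum(pred_normals[to_pred] * gt_normals, axis=1))
    return float(accuracy), float(completeness), float(0.5 * (accuracy + completeness))
```

With unit normals each dot product lies in [-1, 1], so an NC of 1 means the matched normals agree everywhere.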

5.3. Baselines

The paper compares G4Splat with the following representative baselines:

Classic Gaussian Splatting methods:
- 3DGS (Kerbl et al., 2023): the original 3D Gaussian Splatting method.
- 2DGS (Huang et al., 2024a): a variant of 3DGS that represents the scene with oriented 2D Gaussian disks to obtain more accurate surface geometry.
State-of-the-art sparse-view 3DGS methods:
- FSGS (Zhu et al., 2024): Real-time few-shot view synthesis using gaussian splatting.
- InstantSplat (Fan et al., 2024): Sparse-view gaussian splatting in seconds.
- MAtCha (Guédon et al., 2025): Atlas of charts for high-quality geometry and photorealism from sparse views.
Sparse-view 3DGS methods that use diffusion models:
- GenFusion (Wu et al., 2025c): Closing the loop between reconstruction and generation via videos.
- Difix3D+ (Wu et al., 2025a): Improving 3d reconstructions with single-step diffusion models.
- GuidedVD (Zhong et al., 2025): Taming video diffusion prior with scene-grounding guidance for 3d gaussian splatting from sparse inputs.
- See3D (Ma et al., 2025): a 2DGS-augmented version of See3D (Learning 3d creation on pose-free videos at scale).

All baselines are augmented with MASt3R-SfM (Duisterhof et al., 2025) to improve robustness in sparse-view settings. This keeps the comparison fair, since MASt3R-SfM gives these methods more stable camera poses and initial geometry.

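6. Results and Analysis

6.1. Main Results

6.1.1. Novel View Synthesis Quality

By integrating geometry guidance into the generative prior, G4Splat renders unobserved regions more accurately and produces fewer artifacts in observed regions.

Comparison with other generative methods: Other methods that exploit a generative prior show clear limitations.
- Difix3D+ achieves relatively good quality in observed regions but handles unobserved areas poorly.
- GenFusion, See3D, and GuidedVD can hallucinate content in unobserved regions, but their completions are blurry and suffer from severe floaters, which can even degrade the reconstruction quality of observed regions.
Advantages of G4Splat: As shown in Figure 5 (Figure 4 in the paper) and Figure 10 (Figure A5 in the paper), G4Splat preserves high fidelity in observed regions and markedly improves rendering quality in unobserved regions. The quantitative results in Table 1 agree: G4Splat has the highest PSNR and the lowest LPIPS on all three datasets, and the highest SSIM on Replica and DeepBlending (on ScanNet++, GuidedVD's SSIM is slightly higher, 0.807 vs. 0.792). This highlights the effectiveness of the geometry-guided generative pipeline in maintaining high fidelity.

6.1.2. Geometry Reconstruction Quality

Difix3D+, GenFusion, See3D, and GuidedVD all suffer from severe shape-appearance ambiguities: even when the rendered views look plausible, the reconstructed geometry is poor.

Advantages of G4Splat: In contrast, G4Splat produces more accurate geometry in unobserved regions and smoother, floater-free reconstructions in observed regions. The improvement comes from plane-aware geometry modeling, which provides reliable depth supervision for both observed and unobserved areas and keeps the whole scene consistent.
Quantitative results: Table 1 shows that G4Splat clearly outperforms all baselines on every reconstruction metric (CD, F-Score, NC) on every dataset. On Replica, for example, G4Splat's CD (6.61) is far below that of the best baseline, MAtCha (10.12), and its F-Score (65.14) and NC (83.98) are also clearly higher.

6.1.3. Any-View Scene Reconstruction

G4Splat is robust across a wide range of settings, including indoor and outdoor scenes, single-view inputs, and unposed videos.

Support for diverse scenarios: As shown in Figure 1, Figure 6 (Figure 5 in the paper), and Figure 7 (Figure A1 in the paper), G4Splat handles diverse input scenarios.
Robustness to the number of input views: Table A1 confirms this further: G4Splat consistently outperforms all baselines with 5, 10, or 15 input views.
Performance under complex lighting: Under complex lighting (see Figure 8, which corresponds to Figure A2 in the paper), existing baselines struggle to reconstruct accurately even from dense views, because specularities and reflections make colors vary significantly across viewpoints. G4Splat uses accurate geometry guidance to suppress the errors caused by strong brightness variations and produces high-quality reconstructions.
Handling arbitrary 3D structures: Besides producing smoother planar regions, G4Splat also faithfully reconstructs non-planar geometry, such as the Museum scene in Figure 1 and the Cat in Figure 6 (Figure 5 in the paper).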

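6.2. Data Presentation (Tables)

The tables below reproduce the results reported in the paper.

Table 1 of the paper (CD, F-Score, and NC are reconstruction metrics; PSNR, SSIM, and LPIPS are rendering metrics):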

Dataset	Method	CD↓	F-Score↑	NC↑	PSNR↑	SSIM↑	LPIPS↓
Replica	3DGS	16.61	27.72	64.34	18.29	0.744	0.254
Replica	2DGS	14.64	48.01	74.14	18.43	0.735	0.306
Replica	FSGS	18.17	26.87	64.16	19.19	0.766	0.259
Replica	InstantSplat	21.00	19.67	62.01	19.39	0.762	0.255
Replica	MAtCha	10.12	60.9	79.33	17.81	0.752	0.228
Replica	See3D	12.74	45.27	73.98	19.22	0.735	0.328
Replica	GenFusion	13.05	41.60	69.33	20.14	0.801	0.258
Replica	Difix3D+	13.71	43.11	65.34	19.42	0.779	0.231
Replica	GuidedVD	27.87	17.29	61.64	22.51	0.822	0.260
Replica	Ours	6.61	65.14	83.98	23.90	0.836	0.199
ScanNet++	3DGS	16.60	31.92	65.35	14.28	0.696	0.372
ScanNet++	2DGS	14.34	51.97	70.01	13.91	0.661	0.429
ScanNet++	FSGS	23.80	27.86	64.53	14.80	0.731	0.362
ScanNet++	InstantSplat	21.32	25.44	60.67	15.02	0.742	0.355
ScanNet++	MAtCha	11.55	62.98	73.61	13.58	0.677	0.351
ScanNet++	See3D	13.03	53.65	70.39	14.76	0.684	0.426
ScanNet++	GenFusion	10.68	47.15	66.27	16.12	0.726	0.347
ScanNet++	Difix3D+	13.15	53.91	67.30	14.09	0.701	0.340
ScanNet++	GuidedVD	25.35	16.67	60.48	17.90	0.807	0.336
ScanNet++	Ours	6.34	67.12	77.45	18.69	0.792	0.314
DeepBlending	3DGS	31.44	20.02	55.39	15.33	0.571	0.489
DeepBlending	2DGS	25.60	23.81	63.82	14.89	0.556	0.506
DeepBlending	FSGS	31.45	19.66	57.38	15.72	0.602	0.476
DeepBlending	InstantSplat	33.78	17.99	57.91	15.00	0.569	0.483
DeepBlending	MAtCha	22.36	26.80	67.92	14.74	0.558	0.465
DeepBlending	See3D	31.34	22.68	63.18	15.00	0.552	0.537
DeepBlending	GenFusion	30.70	22.37	58.70	16.20	0.626	0.468
DeepBlending	Difix3D+	32.70	21.94	58.08	15.18	0.583	0.450
DeepBlending	GuidedVD	43.28	15.95	59.21	16.32	0.618	0.481
DeepBlending	Ours	20.72	28.02	72.04	16.76	0.645	0.440

Table 2 of the paper (GP: generative prior; PM: plane-aware geometry modeling; PP: geometry-guided generative pipeline):

GP	PM	PP	CD↓	F-Score↑	NC↑	PSNR↑	SSIM↑	LPIPS↓
×	×	×	10.60	59.17	79.95	17.85	0.751	0.228
✓	×	×	9.46	56.99	77.58	19.63	0.740	0.295
×	✓	×	8.73	64.96	80.55	17.63	0.752	0.219
✓	✓	×	7.56	62.36	80.89	21.88	0.810	0.221
✓	✓	✓	6.61	65.14	83.98	23.90	0.836	0.199

Table 3 of the paper:

Method	CD↓	PSNR↑	Time (min)↓
MAtCha	11.57	11.56	32.4
See3D	15.42	14.50	58.6
GenFusion	15.66	16.49	41.4
Difix3D+	17.79	12.94	68.7
GuidedVD	24.29	19.02	141.4
Ours	8.77	20.26	73.3
Ours (DS)	9.33	19.36	43.5

Table A1 of the paper:

Method	Reconstruction (CD↓ / NC↑)			Rendering (PSNR↑/LPIPS↓)
Method	5 views	10 views	15 views	5 views	10 views	15 views
2DGS	14.64 / 74.14	9.37 / 81.24	7.17 / 85.33	18.43 / 0.306	21.79 / 0.207	25.30 / 0.139
FSGS	18.17 / 64.16	13.97 / 68.76	12.64 / 71.22	19.19 / 0.259	22.50 / 0.179	25.84 / 0.127
MAtCha	11.10 / 81.20	7.35 / 83.99	6.03 / 85.67	17.85 / 0.228	21.26 / 0.153	25.00 / 0.109
See3D	12.74 / 73.98	9.22 / 80.44	7.40 / 84.22	19.22 / 0.328	22.73 / 0.240	25.56 / 0.183
GenFusion	13.05 / 69.33	10.04 / 74.03	8.88 / 76.57	20.14 / 0.258	23.90 / 0.183	26.48 / 0.138
Difix3D+	13.71 / 65.34	10.15 / 68.43	7.97 / 70.73	19.42 / 0.231	22.68 / 0.165	26.04 / 0.122
GuidedVD	27.87 / 61.64	20.30 / 64.53	16.62 / 68.54	22.51 / 0.260	25.63 / 0.205	27.91 / 0.163
Ours	6.61 / 83.98	4.88 / 85.49	3.98 / 87.25	23.90 / 0.199	27.48 / 0.140	30.22 / 0.094

Table A2 of the paper:

Metric	Definition
Chamfer Distance (CD)
Accuracy	$\frac{1}{|P|} \sum_{\mathbf{p} \in P} \min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_2^2$
Completeness	$\frac{1}{|P^*|} \sum_{\mathbf{p}^* \in P^*} \min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_2^2$
F-score
Precision	$\frac{1}{|P|} \sum_{\mathbf{p} \in P} \mathbb{I}(\min_{\mathbf{p}^* \in P^*} \|\mathbf{p} - \mathbf{p}^*\|_1 < 0.05)$
Recall	$\frac{1}{|P^*|} \sum_{\mathbf{p}^* \in P^*} \mathbb{I}(\min_{\mathbf{p} \in P} \|\mathbf{p} - \mathbf{p}^*\|_1 < 0.05)$
Normal Consistency (NC)
Normal Accuracy	$\frac{1}{|P|} \sum_{\mathbf{p} \in P} (\mathbf{n}_{\mathbf{p}}^\top \mathbf{n}_{\mathbf{p}^*}) \quad \text{s.t. } \mathbf{p}^* = \arg\min_{\mathbf{q} \in P^*} \|\mathbf{p} - \mathbf{q}\|_1$
Normal Completeness	$\frac{1}{|P^*|} \sum_{\mathbf{p}^* \in P^*} (\mathbf{n}_{\mathbf{p}}^\top \mathbf{n}_{\mathbf{p}^*}) \quad \text{s.t. } \mathbf{p} = \arg\min_{\mathbf{q} \in P} \|\mathbf{p}^* - \mathbf{q}\|_1$

6.3. Ablation Study / Parameter Analysis

The paper runs ablation experiments on Replica to assess the individual contributions of the generative prior (GP), plane-aware geometry modeling (PM), and the geometry-guided generative pipeline (PP). Table 2 shows the main observations:

Adding the generative prior (GP) alone:
- Effect: GP alone improves rendering quality in terms of PSNR (from 17.85 to 19.63).
- Limitation: The gain in geometry is limited. CD improves only modestly (from 10.60 to 9.46), while F-Score (59.17 to 56.99) and NC (79.95 to 77.58) drop and LPIPS worsens (0.228 to 0.295).
- Analysis: On its own, GP tends to produce averaged, blurry results for unseen areas and causes shape-appearance ambiguities. This supports the paper's central claim that naively adding a generative prior does not deliver the expected benefit.
Adding plane-aware geometry modeling (PM):
- Effect: Whether added alone or together with GP, PM clearly improves geometry reconstruction. For example, adding PM to the second row (giving the fourth row) lowers CD from 9.46 to 7.56 and raises both F-Score and NC.
- Analysis: Accurate geometry guidance provides a clean geometric basis and suppresses Gaussian floaters, which lets the generative model do its job. When PM is combined with GP, rendering quality also improves markedly: PSNR rises from 19.63 to 21.88 and LPIPS falls from 0.295 to 0.221.
Integrating the geometry-guided generative pipeline (PP):
- Effect: On top of PM and GP, integrating PP (the fifth row) further improves rendering fidelity and geometric accuracy. CD falls from 7.56 to 6.61, PSNR rises from 21.88 to 23.90, and LPIPS falls from 0.221 to 0.199.
- Analysis: Geometry guidance provides more accurate visibility masks, novel views with broader plane coverage, and consistent color supervision, and thereby improves the generative process by mitigating multi-view inconsistencies.

In summary, the ablation results support the design of G4Splat: a generative prior reaches its full potential only when combined with strong geometric guidance. Plane-aware geometry modeling supplies the geometric basis, and the geometry-guided generative pipeline ensures that geometric information is used effectively during generation and that multi-view consistency is preserved.

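7. Summary and Reflections

7.1. Conclusions

G4Splat proposes a geometry-guided generative framework for 3D scene reconstruction. Its core idea is that accurate geometry is a prerequisite for exploiting generative models effectively.

The main contributions are:

Scale-accurate geometric constraints: By exploiting the planar structures that are common in man-made environments, G4Splat derives scale-accurate geometric constraints from sparse views. It estimates 3D planes from local observations and extrapolates them to unobserved regions, which resolves the scale ambiguity of monocular depth estimators.
Geometry-guided generative pipeline: G4Splat integrates these geometric constraints throughout the generative pipeline. Specifically, it:
- improves visibility mask estimation, using a 3D visibility grid to obtain more reliable inpainting regions;
- guides plane-aware novel view selection, ensuring maximal coverage of the key geometric structures;
- enhances multi-view consistency when inpainting with video diffusion models, modulating color supervision through global 3D planes to reduce cross-view conflicts.
Strong performance and generalization: Extensive experiments on Replica, ScanNet++, and DeepBlending show that G4Splat consistently outperforms existing methods in both geometry and appearance reconstruction, with the largest gains in unobserved regions. The method also naturally supports unposed videos and single-view inputs, showing strong generalization and practical potential in both indoor and outdoor scenes.

7.2. Limitations and Future Work

The authors point out the following limitations of G4Splat and suggest directions for future research:

Limitations of video diffusion models: G4Splat's performance is partly bounded by the capability of current video diffusion models. Models such as See3D still struggle to match the colors of completed regions exactly to the original scene, which can cause inconsistencies when training the Gaussians and affect how well the rendered output aligns with the visible surroundings. Although scale-accurate geometry supervision lets G4Splat reconstruct geometry precisely even under inconsistent completions (see Figure 10, which corresponds to Figure A4(a) in the paper), the quality of the generative model itself remains a bottleneck.
- Future work: Improving the color consistency of video diffusion models would directly improve G4Splat.
Heavily occluded regions: The method still struggles with heavily occluded regions, such as a chair partly hidden by a table (see Figure 10, which corresponds to Figure A4(b) in the paper). Because the table and the occluded part of the chair are very close together, it is hard to generate a reasonable novel camera view that observes the occluded region.
- Future work: Introducing an object-level prior (Ni et al., 2025; Yang et al., 2025) may help reconstruct such heavily occluded regions. This would require the model to understand objects in the scene at a higher level, not only planes.
Generality of the plane representation: The plane representation is effective in man-made environments, and the monocular depth estimator used for non-planar regions performs satisfactorily, but there is still room for improvement there.
- Future work: A more general surface representation that naturally models both planar and non-planar regions could yield more accurate depth, especially in non-planar areas. Such a representation may be less computationally efficient than the plane-based approach, but it is expected to improve overall reconstruction quality.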

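7.3. Personal Takeaways and Critique

7.3.1. Takeaways

Geometry is King: The central claim of G4Splat, that accurate geometry is the fundamental prerequisite for effectively exploiting generative models, is an important insight. With generative AI developing rapidly, much work focuses on making generative models more capable. G4Splat is a reminder that in tasks such as 3D reconstruction, which demand a precise understanding of the physical world, pure generative capability must be anchored to a solid geometric foundation; otherwise hallucination and inconsistency are hard to avoid. This underlines the value of combining domain knowledge (such as planar structures) with general-purpose generative models.
The value of fine-grained geometry guidance: G4Splat does not use geometric information only at initialization. It carries it through every key step of the generative pipeline (visibility masks, novel view selection, multi-view color supervision). Applying geometry guidance throughout the pipeline, instead of treating it as one extra constraint, is key to its success. It offers a template for combining the mature geometric reasoning of traditional computer vision with the generative power of deep learning.
Balancing trade-offs in engineering practice: G4Splat exploits the prevalence of planar structures, a choice that is both efficient and scalable because it reduces complex 3D scenes to tractable planar primitives. It shows how a prior assumption about a specific kind of scene (such as a Manhattan world) can be very valuable for efficiency and robustness.

7.3.2. Critique

Limited generality: The plane assumption works well in man-made environments, but planar structures are not common in natural scenes (mountains, forests, irregular rocks), so G4Splat's performance there may be limited. The paper does handle non-planar regions through monocular depth, but the accuracy of that depth still depends on calibration against planes. The reliability of its scale-accurate depth is therefore likely to drop in highly unstructured environments.
Dependence on diffusion model quality: Although G4Splat works to mitigate the inconsistencies introduced by diffusion models, the visual quality of the inpainted regions (color, texture detail) is still capped by the video diffusion model in use. If the diffusion model generates low-quality content or hallucinates heavily, the overall appearance may be unsatisfactory no matter how accurate the geometry is.
Computational cost: The paper reports a runtime comparable to other methods that use a generative prior (Table 3), but global 3D plane estimation, visibility grid construction, and plane-aware novel view selection may add significant overhead, especially in large and complex scenes. Future work should keep optimizing the efficiency of these geometric steps.
Dependence on the initial SfM/MAtCha quality: The initialization stage relies on MAtCha for the initial depth maps, and MAtCha in turn relies on the sparse points provided by SfM. If SfM fails with extremely sparse views or in textureless regions, or if MAtCha itself performs poorly in some cases, the subsequent plane estimation and the overall reconstruction quality may suffer.

Overall, G4Splat is a strong solution for sparse-view 3D reconstruction that combines the strengths of geometric priors and generative models. Future research still needs to find a better balance among generality, generative model performance, and computational efficiency.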
