
Look at the Sky: Sky-aware Efficient 3D Gaussian Splatting in the Wild

Published: 2025-03-07
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a sky-aware 3D Gaussian Splatting framework for efficient scene reconstruction from unconstrained photo collections. It leverages a greedy supervision strategy and pseudo masks from a pre-trained segmentation network, improving efficiency and rendering quality.

Abstract

Photos taken in unconstrained tourist environments often present challenges for accurate 3D scene reconstruction due to variable appearances and transient occlusions, which can introduce artifacts in novel view synthesis. Recently, in-the-wild 3D scene reconstruction has achieved realistic rendering with Neural Radiance Fields (NeRFs). With the advancement of 3D Gaussian Splatting (3DGS), some methods also attempt to reconstruct 3D scenes from unconstrained photo collections and achieve real-time rendering. However, the rapid convergence of 3DGS is misaligned with the slower convergence of the neural network-based appearance encoder and transient mask predictor, hindering reconstruction efficiency. To address this, we propose a novel sky-aware framework for scene reconstruction from unconstrained photo collections using 3DGS. Firstly, we observe that the learnable per-image transient mask predictor in previous work is unnecessary. By introducing a simple yet efficient greedy supervision strategy, we directly utilize the pseudo mask generated by a pre-trained semantic segmentation network as the transient mask, thereby achieving more efficient and higher-quality in-the-wild 3D scene reconstruction. Secondly, we find that separately estimating appearance embeddings for the sky and buildings significantly improves reconstruction efficiency and accuracy. We analyze the underlying reasons and introduce a neural sky module to generate diverse skies from latent sky embeddings extracted from unconstrained images. Finally, we propose a mutual distillation learning strategy to constrain sky and building appearance embeddings within the same latent space, further enhancing reconstruction efficiency and quality. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing methods in novel view and appearance synthesis, offering superior rendering quality with faster convergence and rendering speed.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Look at the Sky: Sky-aware Efficient 3D Gaussian Splatting in the Wild

1.2. Authors

Yuze Wang, Junyi Wang, Ruicheng Gao, Yansong Qu, Wantong Duan, Shuo Yang, Yue Qi

1.3. Journal/Conference

The paper is marked as "Published at (UTC): 2025-03-07T00:00:00.000Z," which suggests it is a forthcoming publication or a preprint accepted for a future venue. Given the topic and the authors' affiliations (likely research institutions or universities), it is expected to be published in a reputable computer graphics or computer vision conference (e.g., CVPR, ICCV, ECCV, SIGGRAPH) or journal. These venues are highly influential in the relevant fields, known for publishing cutting-edge research in 3D reconstruction, neural rendering, and computer vision.

1.4. Publication Year

2025 (Based on the publication date provided: 2025-03-07)

1.5. Abstract

This paper addresses the challenges of accurate 3D scene reconstruction from unconstrained photo collections (photos taken in diverse, real-world conditions like tourist environments), which often suffer from variable appearances (e.g., changing lighting, weather) and transient occlusions (e.g., people, cars). While Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have advanced in-the-wild reconstruction, current 3DGS-based methods face efficiency issues because their fast convergence is misaligned with the slower convergence of neural networks used for appearance encoding and transient mask prediction.

To overcome this, the authors propose a novel sky-aware framework. First, they simplify the transient mask prediction by observing that a learnable per-image transient mask predictor is unnecessary. Instead, they introduce a greedy supervision strategy that directly uses pseudo masks generated by a pre-trained semantic segmentation network, leading to more efficient and higher-quality reconstruction. Second, they find that separately estimating appearance embeddings for the sky and buildings significantly improves efficiency and accuracy. They introduce a neural sky module to generate diverse skies from latent sky embeddings. Finally, they propose a mutual distillation learning strategy to constrain sky and building appearance embeddings within the same latent space, further enhancing efficiency and quality. Extensive experiments demonstrate that their framework outperforms existing methods in novel view and appearance synthesis, offering superior rendering quality with faster convergence and rendering speed.

/files/papers/6919d53c110b75dcc59ae2a4/paper.pdf (This link points to a local file path, indicating it was likely provided as part of a dataset or internal system. Its publication status is likely a preprint or an internally hosted version awaiting official publication.)

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the efficient and high-quality 3D scene reconstruction from unconstrained photo collections using 3D Gaussian Splatting (3DGS). Unconstrained photo collections refer to images taken in real-world scenarios, often by tourists, exhibiting significant variations in viewpoint, camera settings, lighting conditions, and the presence of transient occluders (e.g., people, vehicles, temporary objects).

This problem is highly important in computer graphics and computer vision due to its applications in virtual reality (VR), scene editing, and digital twins of real-world landmarks. Prior research, particularly with Neural Radiance Fields (NeRFs) and more recently with 3DGS, has made strides in reconstructing in-the-wild scenes. However, existing 3DGS-based methods that adapt to unconstrained conditions introduce neural networks to handle appearance variations (e.g., changing illumination, time of day) and transient occlusions. The key challenge or gap is a mismatch in convergence speed: 3DGS itself converges rapidly, but the concurrently trained neural networks for appearance encoding and transient mask prediction require significantly more iterations to converge. This disparity creates a training bottleneck, hindering the overall efficiency and sometimes the quality of the reconstruction.

The paper's innovative idea stems from a critical reconsideration: Can we simplify or omit certain modules, or leverage image intrinsics and semantics as priors to improve reconstruction efficiency, or even enhance reconstruction quality? The authors identify that treating the sky and buildings differently, and simplifying transient mask prediction, can significantly improve the process.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Novel Transient Mask Prediction: They demonstrate that a learnable per-image transient mask predictor is unnecessary. Instead, they propose a simple yet efficient greedy supervision strategy that directly leverages pseudo masks generated by a pre-trained Large Vision Model (LVM) for semantic segmentation. This approach reduces trainable parameters, resolves ambiguity between the 3D radiance field and 2D transient mask, and improves training efficiency and quality.

  • Sky-aware Framework with Separate Appearance Learning: They introduce a sky-aware framework that recognizes the distinct radiance properties of the sky and buildings. They segment these regions during preprocessing and encode their appearances separately using two distinct encoders. This differential treatment enhances reconstruction efficiency and accuracy.

  • Neural Sky Module for Explicit Cubemap Generation: A novel neural sky module is proposed, which generates explicit cubemaps for diverse skies from latent sky embeddings. This module explicitly handles location-dependent sky details and mitigates catastrophic forgetting issues often associated with simple Multi-Layer Perceptrons (MLPs).

  • Mutual Distillation Learning Strategy: To align the sky and building appearance embeddings within the same latent space, they propose a mutual distillation learning strategy. This alignment allows the sky embeddings to guide and constrain building appearance optimization, and potentially infer sky appearance even when it's not present in the input image.

  • State-of-the-Art Performance: Extensive experiments on the Photo Tourism (PT) and NeRF-OSR datasets demonstrate that their framework achieves state-of-the-art (SOTA) performance in novel view synthesis and novel appearance synthesis, offering superior rendering quality with faster convergence and rendering speed compared to existing methods.

    These findings collectively address the efficiency bottleneck in 3DGS-based in-the-wild reconstruction by simplifying mask prediction, introducing a specialized and guided sky representation, and aligning appearance embeddings, ultimately leading to higher quality and faster processing.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice should be familiar with the following foundational concepts:

3.1.1. 3D Gaussian Splatting (3DGS)

3D Gaussian Splatting (3DGS) [21] is a novel explicit 3D scene representation and rendering technique that has recently emerged as a highly efficient alternative to Neural Radiance Fields (NeRFs). Instead of representing a scene implicitly with a neural network, 3DGS explicitly represents a 3D scene using millions of anisotropic 3D Gaussians.

  • 3D Gaussian: Each 3D Gaussian is a learnable primitive characterized by:
    • 3D Position (μ\mu): The center of the Gaussian in 3D space.
    • Covariance Matrix (Σ\Sigma): Defines the shape and orientation of the Gaussian ellipsoid. It can be decomposed into a scaling matrix (SS) which controls the size along its principal axes, and a rotation matrix (RR) which controls its orientation.
    • Opacity (α\alpha): Determines how transparent or opaque the Gaussian is.
    • Color (Spherical Harmonics coefficients CC): Represents the color of the Gaussian, which can vary with viewing direction. Spherical Harmonics (SH) are functions used to approximate complex directional light distributions, allowing for view-dependent appearance.
  • Projection to 2D: During rendering, these 3D Gaussians are projected onto the 2D image plane of a virtual camera. The covariance matrix of the 3D Gaussian is transformed into a 2D covariance matrix on the image plane.
  • Rasterization and $\alpha$-Blending: The projected 2D Gaussians are then rasterized (drawn) onto the image. Their colors and opacities are combined using alpha blending (also known as volumetric rendering). This process blends the contributions of Gaussians sorted by depth, from front to back, to produce the final pixel color (a minimal numerical sketch of this compositing loop is given after this list). The paper uses the following formula for volumetric rendering: $ \widetilde{I}(x, d) = \sum_{i \in M} T_i \alpha_i' C_i(d), \quad \text{where } T_i = \prod_{j=1}^{i-1} (1 - \alpha_j') $ Where:
    • I~(x,d)\widetilde { I } ( x , d ): The final rendered color of a pixel along a ray from viewpoint xx in direction dd.
    • MM: The set of 3D Gaussians sampled along the ray.
    • TiT_i: The accumulated transmittance (how much light reaches Gaussian ii without being blocked by previous Gaussians) along the ray up to Gaussian ii. It's the product of (1αj)(1 - \alpha_j') for all preceding Gaussians jj.
    • αi\alpha_i': The effective opacity of Gaussian ii after considering its 2D projection and density. It's computed as α=αe12(yμ)TΣ(yμ)\alpha ^ { \prime } = \alpha e ^ { - \frac { 1 } { 2 } ( y - \mu ^ { \prime } ) ^ { T } \Sigma ^ { \prime } ( y - \mu ^ { \prime } ) }, where α\alpha is the base opacity, yy are the coordinates in the projected space, μ\mu' is the projected 2D mean, and Σ\Sigma' is the projected 2D covariance.
    • Ci(d)C_i(d): The color of Gaussian ii as seen from direction dd, determined by its Spherical Harmonic coefficients.
  • Advantages: 3DGS offers significant advantages in convergence speed and rendering efficiency compared to NeRFs, often achieving real-time rendering.
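
Below is a minimal NumPy sketch (not the authors' code) of this front-to-back compositing for a single pixel. The function name `composite_pixel` and the input arrays are illustrative: `alphas` stands for the projected effective opacities $\alpha_i'$ of the depth-sorted Gaussians and `colors` for their view-dependent colors $C_i(d)$.

```python
import numpy as np

def composite_pixel(alphas, colors):
    """Front-to-back alpha compositing of depth-sorted Gaussian splats.

    alphas: (M,) effective opacities alpha'_i of the Gaussians covering the pixel.
    colors: (M, 3) view-dependent RGB colors C_i(d) of those Gaussians.
    Returns the composited RGB value and the accumulated opacity.
    """
    rgb = np.zeros(3)
    transmittance = 1.0                        # T_1 = 1 (nothing blocks the first Gaussian)
    for a, c in zip(alphas, colors):
        rgb += transmittance * a * c           # add T_i * alpha'_i * C_i(d)
        transmittance *= (1.0 - a)             # T_{i+1} = T_i * (1 - alpha'_i)
    accumulated_opacity = 1.0 - transmittance  # equals sum_i T_i * alpha'_i
    return rgb, accumulated_opacity

# Example: a red splat in front of a blue one
rgb, opacity = composite_pixel(np.array([0.6, 0.8]),
                               np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
```

The accumulated opacity returned here is the same quantity the paper later denotes O(x, d) when blending the building rendering with the sky cubemap.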

3.1.2. Neural Radiance Fields (NeRFs)

Neural Radiance Fields (NeRFs) [34] represent a 3D scene as a continuous volumetric function using a Multi-Layer Perceptron (MLP).

  • Implicit Representation: A NeRF takes a 3D coordinate (x, y, z) and a 2D viewing direction (θ,ϕ)(\theta, \phi) as input and outputs a color (RGB) and a volume density (σ\sigma). The density represents the probability of a ray terminating at that point.
  • Volume Rendering: To render an image, rays are cast from the camera through each pixel. Points are sampled along each ray, and their predicted colors and densities are combined using volume rendering techniques to produce the final pixel color.
  • Limitations for in-the-wild: Vanilla NeRF assumes static scenes and consistent lighting. When applied to unconstrained photo collections (in-the-wild), it struggles with appearance variations (e.g., changes in illumination, weather, camera parameters) and transient occlusions (moving objects like people or cars).

3.1.3. Semantic Segmentation

Semantic segmentation is a computer vision task that involves classifying each pixel in an image into a predefined semantic class (e.g., "sky," "building," "person," "car").

  • Pseudo Masks: In this paper, pseudo masks refer to segmentation masks generated automatically by a pre-trained semantic segmentation network (like LSeg) without human annotation or fine-tuning on the specific dataset. These masks act as a "best guess" for the semantic regions in an image.
  • Large Vision Models (LVMs): LVMs are powerful deep learning models, often pre-trained on vast amounts of image and text data, capable of performing various vision tasks including highly accurate semantic segmentation. LSeg [24], Grounded-SAM [45], and SEEM [76] are examples mentioned in the paper, which can perform segmentation based on text prompts (e.g., "sky," "building").

3.1.4. Appearance Embeddings

Appearance embeddings are compact vector representations (learned features) that capture the visual characteristics or "style" of an image or a specific region within an image. In in-the-wild scenarios, these embeddings are learned for each input image to account for variations in lighting, exposure, color grading, and other factors that change the scene's appearance across different photos. When injected into a rendering model (like NeRF or 3DGS), they allow the model to synthesize novel views that match the appearance of a specific input image.

3.1.5. Catastrophic Forgetting

Catastrophic forgetting (or catastrophic interference) is a phenomenon in artificial neural networks where learning new information causes the network to forget previously learned information. This is particularly relevant when training models on sequential data or when different parts of the model are responsible for distinct features. In the context of NeRFs and similar models, it can mean that a network might struggle to represent very different appearances or details if it's forced to learn them all simultaneously in a single, undifferentiated representation.

3.2. Previous Works

The paper builds upon and differentiates itself from prior work in 3D scene representation and in-the-wild reconstruction using both NeRFs and 3DGS.

3.2.1. 3D Scene Representation

  • Traditional Methods: Before neural rendering, 3D scenes were typically represented using meshes [16], point clouds [40], or volumetric models [41]. These methods often struggle with photorealism and efficient rendering of complex scenes.
  • Neural Radiance Fields (NeRFs) and Extensions:
    • NeRF [34] revolutionized view synthesis by representing scenes as implicit radiance fields.
    • Extensions focused on accelerated training [26, 35, 43, 57] (e.g., Instant-NGP [35] using multi-resolution hash encoding), faster rendering [9, 14, 44], scene editing [60, 61, 67, 70], and dynamic scenes [37, 39, 51].
  • 3D Gaussian Splatting (3DGS) and Extensions:
    • 3DGS [21] introduced an explicit representation using millions of 3D Gaussians, significantly accelerating modeling and rendering.
    • Extensions include surface reconstruction [6, 18, 58], SLAM [31, 56, 66], AIGC [27, 28], and scene understanding [17, 42, 49, 75].

3.2.2. NeRF from Unconstrained Photo Collections

This line of work addresses the challenges of appearance variation and transient occlusions in in-the-wild datasets:

  • NeRF-W [30]: The pioneering work. It introduced learnable appearance and transient embeddings for each image via Generative Latent Optimization (GLO). These embeddings allowed the NeRF to adapt its output for different lighting and occlusions. The core idea is to learn a latent code for each image that influences the NeRF's output (color and density).
  • Ha-NeRF [7] and CR-NeRF [68]: Extended NeRF-W by replacing GLO with a Convolutional Neural Network (CNN)-based appearance encoder. This allowed for better generalization and style transfer. CR-NeRF also fine-tuned a pre-trained segmentation network to predict transient masks.
  • K-Planes [12] and RefinedFields [20]: Adopted planar factorization in NeRF, representing the scene using explicit feature planes rather than a single large MLP. This helped mitigate catastrophic forgetting and improved training speed, though at the cost of increased storage.

3.2.3. 3DGS from Unconstrained Photo Collections

More recent work adapted 3DGS to in-the-wild scenarios:

  • GS-W [72]: Proposed an adaptive sampling strategy to capture dynamic appearances from multiple feature maps.
  • SWAG [11]: Predicted image-dependent opacity and appearance variations for each 3D Gaussian.
  • WE-GS [59]: Introduced a plug-and-play, lightweight spatial attention module to simultaneously predict appearance embeddings and transient masks.
  • WildGaussians [23]: Extracted features from images using DINO [36] (a self-supervised vision transformer) and used a trainable affine transformation for transient occluder prediction. This method explicitly attempts to model the sky by filling the periphery of the bounding box with 3D Gaussians during initialization, which the current paper criticizes for ellipsoidal noise in the sky due to a shared appearance encoder.
  • Wild-GS [65]: Proposed a hierarchical appearance decomposition and explicit local appearance modeling.
  • Splatfacto-W [64]: A NeRFstudio [53] implementation of 3DGS for unconstrained photo collections, essentially an engineering adaptation.

3.3. Technological Evolution

The evolution of 3D scene reconstruction has moved from explicit, geometric-based representations (meshes, point clouds) to implicit, neural network-based representations (NeRFs), and most recently to explicit, primitive-based neural representations (3DGS).

  1. Early 3D Reconstruction: Focused on geometric accuracy but often lacked photorealism and struggled with texture synthesis and varying lighting.

  2. NeRF Era: Introduced photorealistic novel view synthesis by implicitly representing scenes as radiance fields, revolutionizing the field. However, vanilla NeRF was limited to static, well-lit scenes.

  3. NeRF in-the-wild: Extensions like NeRF-W addressed varying appearances and transient occlusions by introducing per-image latent embeddings and transient masks. These often involved training complex neural networks from scratch.

  4. 3DGS Era: Accelerated both training and rendering significantly while maintaining or improving quality. It offered an explicit representation that was more amenable to editing and faster processing.

  5. 3DGS in-the-wild: The latest wave of research, including the current paper, adapts 3DGS to unconstrained conditions. These methods largely followed the NeRF-W paradigm by adding neural modules for appearance and transient handling.

    This paper fits into the latest stage of 3DGS in-the-wild reconstruction.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Transient Mask Prediction Simplification:
    • Previous NeRF-W/CR-NeRF/WE-GS/WildGaussians: Typically learn per-image transient masks from scratch using MLPs or CNNs, or fine-tune segmentation networks. This process is time-consuming and can lead to ambiguity between radiance field changes and transient occluders.
    • This Paper: Proposes a greedy supervision strategy using pseudo masks from a pre-trained, off-the-shelf semantic segmentation network (LSeg) without fine-tuning. This is a significant simplification, eliminating a complex, slow-converging neural module and resolving ambiguity.
  • Sky-aware Reconstruction:
    • Previous 3DGS-in-the-wild (e.g., WildGaussians): Often treat the entire scene, including sky and buildings, with a shared appearance encoder or represent the sky with generic 3D Gaussians. This can lead to ellipsoidal noise in the sky and inefficiency due to differing radiance properties.
    • This Paper: Explicitly segments the sky and buildings. It encodes their appearances separately and introduces a dedicated neural sky module to generate explicit cubemaps for the sky, rather than modeling it with Gaussians. This tailored approach exploits the distinct characteristics of sky radiance (more uniform, predictable) versus building radiance (complex, diverse).
  • Mutual Distillation for Appearance Embeddings:
    • Previous Methods: Learn appearance embeddings for the whole image or specific regions, but generally don't explicitly align the latent spaces of different scene components.
    • This Paper: Introduces a mutual distillation learning strategy to align the sky and building appearance embeddings in the same latent space. This allows for cross-guidance, leveraging the simpler sky appearance to constrain and improve building appearance learning, and enabling sky inference even when not visible.
  • Efficiency and Quality Focus: By simplifying the mask prediction and differentiating sky/building modeling, the paper explicitly targets faster convergence and higher reconstruction quality, addressing the convergence speed mismatch bottleneck identified in prior 3DGS-in-the-wild methods.

4. Methodology

The proposed sky-aware method aims to efficiently reconstruct 3D scenes from unconstrained photo collections using 3D Gaussian Splatting (3DGS), offering fast convergence, real-time novel view synthesis, and novel appearance synthesis.

4.1. Problem Definition and Method Overview

Given a set of posed unconstrained images I={I1,I2,...,IK}I = \left\{ I _ { 1 } , I _ { 2 } , . . . , I _ { K } \right\}, which includes varying illumination, post-processing effects, and transient occlusions, the objective is to reconstruct the 3D scene efficiently with capabilities for real-time novel view and appearance synthesis.

The overall pipeline of the method, as illustrated in Fig. 4 (from the original paper), begins by utilizing a pre-trained semantic segmentation model to predict pseudo masks for each unconstrained image. These masks distinguish between building, sky, and transient occluder regions. Based on these masks, separate appearance embeddings are extracted for the building and sky. A neural sky module then generates explicit cubemaps from the sky appearance embeddings, while a neural in-the-wild 3D Gaussian representation handles the building by predicting residual spherical harmonic coefficients for 3D Gaussians. Finally, a mutual distillation learning strategy aligns the sky and building embeddings in the same latent space, and the entire framework is optimized end-to-end.

The following figure (Figure 4 from the original paper) illustrates the overall pipeline of the proposed method:


Fig. 4: An overview of our framework. Given an unconstrained image $I_i$, the pre-trained segmentation model generates pseudo masks for the building $\widehat{M_i^b}$, sky $\widehat{M_i^s}$, and transient occluders $\widehat{M_i^t}$. Guided by these masks, a building encoder and a sky encoder generate building embeddings $l_i^b$ and sky embeddings $l_i^s$; novel appearance synthesis is then supported at the level of the unconstrained image, implicit embeddings, and the explicit cubemap.

4.2. Pseudo Mask Extraction for Unconstrained Image

Unconstrained photos often contain transient occlusions (e.g., tourists, cars) that previous methods attempted to handle by training per-image 2D mask predictors (MLPs or CNNs). This training is time-consuming and can ambiguously attribute significant appearance changes to transient occluders, slowing down 3D Gaussian optimization.

This paper proposes a simple yet efficient greedy supervision strategy that leverages Large Vision Models (LVMs) for semantic segmentation without any fine-tuning.

  • Semantic Segmentation Model: The LSeg [24] model is used to generate 2D masks for the sky, building, and transient occlusions. Other segmentation networks like Grounded-SAM [45] and SEEM [76] are also considered viable alternatives.
  • Sky and Building Pseudo Masks: For the sky, a text prompt p_sky (e.g., 'sky') is given to the LVM segmentation model to obtain the 2D pseudo mask Mis^\widehat { M _ { i } ^ { s } } for image IiI_i. Similarly, for the building, a text prompt p_building (e.g., 'building') is used to get the 2D pseudo mask Mib^\widehat { M _ { i } ^ { b } }. The formulas for these are: $ \widehat { M _ { i } ^ { s } } = L V M S e g ( I _ { i } , p _ { s k y } ) $ And $ \widehat { M _ { i } ^ { b } } = L V M S e g ( I _ { i } , p _ { b u i l d i n g } ) $ Where:
    • Mis^\widehat { M _ { i } ^ { s } }: The 2D pseudo mask for the sky in image IiI_i.
    • Mib^\widehat { M _ { i } ^ { b } }: The 2D pseudo mask for the building in image IiI_i.
    • LVMSeg: Denotes the pre-trained Large Vision Model segmentation model (e.g., LSeg).
    • IiI_i: The ii-th unconstrained input image.
    • pskyp_{sky}: The text prompt 'sky'.
    • pbuildingp_{building}: The text prompt 'building'.
  • Transient Pseudo Mask: The remaining area in the image, after segmenting out the sky and building, is considered the pseudo transient mask Mit^\widehat { M _ { i } ^ { t } }. This is defined as: $ \widehat { M _ { i } ^ { t } } = \overline { { \widehat { M _ { i } ^ { s } } \cup \widehat { M _ { i } ^ { b } } } } . $ Where:
    • Mit^\widehat { M _ { i } ^ { t } }: The 2D pseudo mask for transient occluders in image IiI_i.
    • A\overline{A}: Denotes the complement of set AA.
    • \cup: Denotes the union of sets. This means any pixel not classified as sky or building is considered a transient occluder.
  • Appearance Embedding Extraction: With these pseudo masks, two separate Convolutional Neural Networks (CNNs), Encθ1sEnc_{\theta_1}^s and Encθ2bEnc_{\theta_2}^b, are used to extract sky embeddings lisl_i^s and building embeddings libl_i^b from each unconstrained image IiI_i. The masks act as an attention mechanism, guiding the CNNs to focus on the relevant regions. $ l _ { i } ^ { s } = E n c _ { \theta _ { 1 } } ^ { s } ( I _ { i } \odot \widehat { M _ { i } ^ { s } } ) $ And $ l _ { i } ^ { b } = E n c _ { \theta _ { 2 } } ^ { b } ( I _ { i } \odot \widehat { M _ { i } ^ { b } } ) $ Where:
    • lisl_i^s: The sky appearance embedding for image IiI_i.
    • libl_i^b: The building appearance embedding for image IiI_i.
    • Encθ1sEnc_{\theta_1}^s: The CNN-based sky encoder with parameters θ1\theta_1.
    • Encθ2bEnc_{\theta_2}^b: The CNN-based building encoder with parameters θ2\theta_2.
    • \odot: Denotes the Hadamard product (element-wise multiplication). This operation effectively masks the image, allowing the encoder to process only the sky or building region.
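
The mask-extraction and masked-encoding steps of this subsection can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' implementation: `segment_fn` is a stand-in for a frozen LVM such as LSeg (whose real API differs), and `MaskedAppearanceEncoder` is a toy CNN with an arbitrary embedding size.

```python
import torch
import torch.nn as nn

def extract_pseudo_masks(image, segment_fn):
    """Greedy pseudo-mask extraction with a frozen segmentation model.

    `segment_fn(image, prompt)` stands in for a pre-trained LVM (e.g., LSeg)
    and is assumed to return a binary (H, W) mask for the prompted class.
    """
    m_sky = segment_fn(image, "sky")              # \hat{M}_i^s
    m_building = segment_fn(image, "building")    # \hat{M}_i^b
    # Everything that is neither sky nor building is treated as transient.
    m_transient = 1.0 - torch.clamp(m_sky + m_building, max=1.0)
    return m_sky, m_building, m_transient

class MaskedAppearanceEncoder(nn.Module):
    """Toy CNN encoder; the mask acts as hard attention on the input image."""
    def __init__(self, embed_dim=48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, image, mask):
        # image: (N, 3, H, W), mask: (N, 1, H, W).
        # The Hadamard product restricts the encoder to the sky / building region.
        return self.net(image * mask)             # l_i^s or l_i^b
```

The same encoder architecture can be instantiated twice (as the sky encoder and the building encoder) so that $l_i^s$ and $l_i^b$ are produced by separate networks, as the paper prescribes.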

4.3. Neural Sky

The neural sky module is designed to generate diverse and realistic skies. Instead of using 3D Gaussians for the sky, an explicit cubemap representation is learned.

  • Learnable 4D Tensor for Implicit Cubemap: A learnable 4D tensor TskyR6×C×L×LT_{sky} \in \mathbb { R } ^ { 6 \times C \times L \times L } is introduced. This tensor represents an implicit cubemap, where:
    • 6: Corresponds to the six faces of a cubemap (e.g., front, back, left, right, top, bottom).
    • CC: Denotes the number of feature channels per pixel on each cubemap face.
    • L×LL \times L: Represents the width and height of each cubemap face. This implicit cubemap effectively captures location-dependent details and helps mitigate catastrophic forgetting associated with simpler MLPs.
  • Explicit Cubemap Generation: An MLP (MLPγMLP_\gamma) is used to generate the color for each pixel on the explicit cubemap based on the features from the implicit cubemap and the extracted sky embeddings lisl_i^s: $ C _ { s k y } ( k , u , \nu ) = M L P _ { \gamma } ( l _ { i } ^ { s } , T _ { s k y } ( k , u , \nu ) ) $ Where:
    • Csky(k,u,ν)C_{sky}(k, u, \nu): The color of a pixel at coordinates (u,ν)(u, \nu) on the kk-th face of the explicit sky cubemap.
    • MLPγMLP_\gamma: A Multi-Layer Perceptron that takes the sky embedding and the implicit feature as input.
    • lisl_i^s: The sky appearance embedding for the ii-th image.
    • Tsky(k,u,ν)T_{sky}(k, u, \nu): The corresponding implicit feature for the kk-th side of the cubemap at pixel (u,ν)\left( u , \nu \right). This process generates an explicit sky map CskyR6×3×L×LC_{sky} \in \mathbb { R } ^ { 6 \times 3 \times L \times L } (where 3 is for RGB channels).
  • Total Variation (TV) Loss: To smooth the features in the implicit cubemap and prevent high-frequency noise, a Total Variation (TV) loss [46] is applied: $ \mathcal{L}_{TV} = \sum_{k=0}^{5} \sum_{u=0}^{L-1} \sum_{\nu=0}^{L-1} \left\| T_{\mathrm{sky}}(k, u+1, \nu+1) - T_{\mathrm{sky}}(k, u, \nu) \right\|^2 . $ Where:
    • LTV\mathcal{L}_{TV}: The Total Variation loss.
    • Tsky(k,u,ν)T_{sky}(k, u, \nu): The feature at pixel (u,ν)(u, \nu) on cubemap face kk. This loss encourages neighboring pixels in the implicit cubemap to have similar features, leading to smoother sky generation.
  • Fine Sky Encoder: A lightweight CNN (Encθ3fEnc_{\theta_3}^f) is introduced as a fine sky encoder to extract fine sky embeddings lisfl_i^{sf} at the sky cubemap level. These embeddings are a concatenation of the features extracted from the explicit cubemap and the image-level sky embeddings lisl_i^s: $ l _ { i } ^ { s f } = [ E n c _ { \theta _ { 3 } } ^ { f } ( C _ { s k y } ) ; l _ { i } ^ { s } ] $ Where:
    • lisfl_i^{sf}: The fine sky embedding for image IiI_i.
    • Encθ3fEnc_{\theta_3}^f: The fine sky encoder (CNN) with parameters θ3\theta_3.
    • CskyC_{sky}: The explicit sky cubemap generated by MLPγMLP_\gamma.
    • [.;.]: Denotes the concatenation operation. The fine sky embeddings (lisfl_i^{sf}) are then passed into the neural in-the-wild 3D Gaussians to condition the appearance of the building, allowing for a more nuanced interaction between sky and building appearances.
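
A compact PyTorch sketch of the neural sky module is given below, under several assumptions not fixed by the text: feature and embedding sizes, the MLP depth, and a standard horizontal/vertical total-variation formulation. It only illustrates how a learnable implicit cubemap $T_{sky}$ and a small MLP conditioned on the sky embedding $l_i^s$ can produce an explicit RGB cubemap.

```python
import torch
import torch.nn as nn

class NeuralSky(nn.Module):
    """Sketch: learnable implicit cubemap T_sky (6 x C x L x L) decoded by an
    MLP, conditioned on the sky embedding, into an explicit cubemap (6 x 3 x L x L)."""
    def __init__(self, feat_dim=8, res=256, embed_dim=48, hidden=64):
        super().__init__()
        self.t_sky = nn.Parameter(torch.zeros(6, feat_dim, res, res))  # implicit cubemap
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),      # RGB in [0, 1]
        )

    def forward(self, sky_embedding):
        # sky_embedding: (embed_dim,) vector l_i^s for one image.
        six, c, h, w = self.t_sky.shape
        feats = self.t_sky.permute(0, 2, 3, 1).reshape(-1, c)   # one feature per texel
        cond = sky_embedding.expand(feats.shape[0], -1)         # broadcast l_i^s to every texel
        rgb = self.mlp(torch.cat([feats, cond], dim=-1))
        return rgb.reshape(six, h, w, 3).permute(0, 3, 1, 2)    # explicit cubemap C_sky

    def tv_loss(self):
        # Total-variation smoothness on the implicit cubemap features.
        dh = (self.t_sky[..., 1:, :] - self.t_sky[..., :-1, :]).pow(2).sum()
        dw = (self.t_sky[..., :, 1:] - self.t_sky[..., :, :-1]).pow(2).sum()
        return dh + dw
```

The fine sky encoder $Enc_{\theta_3}^f$ would then run a lightweight CNN over the returned explicit cubemap and concatenate its output with $l_i^s$ to form the fine sky embedding $l_i^{sf}$.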

4.4. Neural in-the-wild 3D Gaussian

For building modeling, a novel explicit-implicit hybrid representation called neural in-the-wild 3D Gaussian is proposed. This representation adapts vanilla 3DGS to unconstrained images by injecting sky and building embeddings into each neural 3D Gaussian.

  • Learnable Parameters: Each neural in-the-wild 3D Gaussian has the following learnable parameters, extending those of vanilla 3DGS:
    • 3D mean position (μ\mu): The center of the Gaussian.
    • Opacity (α\alpha): Transparency/opaqueness.
    • Rotation (RR): Orientation.
    • Scaling factor (SS): Size.
    • Base color (CC): Represented using Spherical Harmonic coefficients. This is the intrinsic color without appearance variation.
    • Unconstrained radiance feature (FF): A feature vector initialized by applying Positional Encoding (PE) [34] to the 3D mean position of each 3D Gaussian. Positional Encoding helps the MLP capture high-frequency details by mapping input coordinates to higher-dimensional space.
  • Per-Image Translated Color: Given the fine sky embeddings lisfl_i^{sf} (from the neural sky module) and the building embeddings libl_i^b (from the building encoder), an MLP (MLPωMLP_\omega) learns a per-image translated color CiC_i' for each Gaussian: $ C _ { i } ^ { \prime } = M L P _ { \omega } ( C , F , l _ { i } ^ { s f } , l _ { i } ^ { b } ) + C . $ Where:
    • CiC_i': The per-image translated color for a specific Gaussian in image IiI_i. This is the final color used for rendering, which varies based on the appearance.
    • MLPωMLP_\omega: A Multi-Layer Perceptron that takes the base color (CC), unconstrained radiance feature (FF), fine sky embedding (lisfl_i^{sf}), and building embedding (libl_i^b) as input.
    • CC: The base color (Spherical Harmonic coefficients) of the Gaussian.
    • FF: The unconstrained radiance feature of the Gaussian.
    • lisfl_i^{sf}: The fine sky embedding for image IiI_i.
    • libl_i^b: The building embedding for image IiI_i. The MLPωMLP_\omega effectively predicts a residual (or translation) to the base color CC, allowing the Gaussian's appearance to adapt to the specific appearance embeddings of the sky and building for image IiI_i.
  • Conversion to Standard 3DGS: These per-Gaussian radiance (CiC_i') are "baked" back into the neural in-the-wild 3D Gaussians, allowing them to be seamlessly converted into the standard explicit 3DGS representation. This means that after the appearance is determined for a given image, the Gaussians can be treated as regular 3D Gaussians and fed into the vanilla 3DGS rasterization process for efficient rendering. This design allows the method to integrate into any downstream 3DGS tasks, like scene understanding or editing.
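
The following sketch illustrates, in PyTorch and with assumed dimensions, the two ingredients described above: a NeRF-style positional encoding used to initialise the unconstrained radiance feature F, and a residual MLP (standing in for $MLP_\omega$) that translates the base color given the fine sky and building embeddings. It is a simplification of the paper's design, not its implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=4):
    """NeRF-style PE of 3D Gaussian means x: (N, 3) -> (N, 3 * 2 * num_freqs).
    The frequency count is an assumption; the paper only states PE is used."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    enc = [fn(x[..., None] * freqs) for fn in (torch.sin, torch.cos)]
    return torch.cat(enc, dim=-1).flatten(start_dim=-2)

class AppearanceMLP(nn.Module):
    """Stand-in for MLP_omega: predicts a per-image residual on top of the base
    SH colour C, conditioned on F, the fine sky embedding l^{sf} and l^b."""
    def __init__(self, color_dim, feat_dim, sky_dim, build_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(color_dim + feat_dim + sky_dim + build_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, color_dim),
        )

    def forward(self, base_color, radiance_feat, l_sf, l_b):
        # base_color: (N, color_dim), radiance_feat: (N, feat_dim); embeddings are per image.
        n = base_color.shape[0]
        cond = torch.cat([l_sf, l_b], dim=-1).expand(n, -1)
        residual = self.net(torch.cat([base_color, radiance_feat, cond], dim=-1))
        return base_color + residual              # C'_i = MLP_omega(...) + C
```

Because the output is an ordinary per-Gaussian color, the translated values can be written back into a standard 3DGS representation and rendered with the vanilla rasterizer, as the paper notes.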

4.5. Optimization

The entire framework is optimized end-to-end, including the parameters of the neural in-the-wild 3D Gaussians, the neural sky module, and the building and sky encoders.

4.5.1. Handling Sky

The final pixel color is computed by combining the generated explicit cubemap of the sky and the explicit 3DGS of the building using alpha-blending (a short sketch of this blending and of the sky opacity loss is given at the end of this subsection).

  • Final Rendered Image: $ \tilde { I _ { i } ^ { f } } ( x , d ) = \tilde { I _ { i } } ( x , d ) + ( 1 - O ( x , d ) ) C _ { s k y } ( d ) $ And $ O ( x , d ) = \sum _ { i = 1 } ^ { N } T _ { i } \alpha _ { i } $ Where:
    • Iif~(x,d)\tilde { I _ { i } ^ { f } } ( x , d ): The final rendered image pixel color for view ii at position xx and direction dd.
    • Ii~(x,d)\tilde { I _ { i } } ( x , d ): The rendered color from the neural in-the-wild 3D Gaussians (representing the building).
    • O(x,d)O ( x , d ): The integrated opacity of the building Gaussians along the ray, calculated as the sum of transmittance TiT_i multiplied by opacity αi\alpha_i for NN Gaussians. This represents how much the building blocks the view.
    • Csky(d)C_{sky}(d): The color sampled from the explicit sky cubemap in direction dd. This formula essentially blends the building rendering with the sky background, where the sky is visible only where the building is not fully opaque (1O(x,d)1 - O(x, d)).
  • Anti-aliasing: During training, random perturbations are introduced to the ray direction dd within its unit pixel length to enhance anti-aliasing (reducing jagged edges).
  • Sky Opacity Constraint: To prevent neural in-the-wild 3D Gaussians from appearing in the sky region (i.e., making the sky opaque with Gaussians), the integrated opacity in this region is constrained to approach zero. This is done with the following loss: $ \mathcal { L } _ { o } = - O \cdot \log O - \widehat { M } _ { i } ^ { s } \cdot \log ( 1 - O ) $ Where:
    • Lo\mathcal{L}_o: The opacity loss for the sky region.
    • OO: The rendered opacity map (representing the integrated opacity of Gaussians).
    • $\widehat{M}_i^s$: The pseudo mask for the sky in image $I_i$. The first term is an entropy penalty that pushes the rendered opacity toward binary values, while the second term drives $O$ toward zero wherever the sky mask is active ($\widehat{M}_i^s = 1$), preventing Gaussians from accumulating in the sky region.
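
The two equations above can be sketched directly, as shown below; tensor shapes and the reduction (mean over pixels) are assumptions, and `eps` simply guards the logarithms.

```python
import torch

def composite_with_sky(building_rgb, building_opacity, sky_rgb):
    """Blend the rendered building (3DGS output) with the explicit sky cubemap:
    I^f = I_building + (1 - O) * C_sky(d).
    building_rgb / sky_rgb: (H, W, 3); building_opacity: (H, W, 1)."""
    return building_rgb + (1.0 - building_opacity) * sky_rgb

def sky_opacity_loss(opacity, sky_mask, eps=1e-6):
    """L_o as written in the paper: an entropy term on the rendered opacity map
    plus a term driving O -> 0 inside the sky pseudo-mask."""
    o = opacity.clamp(eps, 1.0 - eps)
    return (-o * torch.log(o) - sky_mask * torch.log(1.0 - o)).mean()
```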

4.5.2. Handling Transient Occluders

Instead of a learning-based approach, a greedy masking strategy is adopted for transient occluders.

  • Greedy Masking: If a pixel's semantic segmentation result does not classify it as sky or building (i.e., it falls into the pseudo transient mask Mit^\widehat { M _ { i } ^ { t } }), it is considered a transient occluder.
  • Loss Function: The transient occluders are masked out, and the rendering is supervised using a combination of L1 loss and Structural Similarity Index (SSIM) [62] loss (a small sketch of this masked loss follows this list): $ \mathcal{L}_c = \lambda_1 \mathcal{L}_1\big((1 - \widehat{M_i^t}) \odot \tilde{I_i^f},\ (1 - \widehat{M_i^t}) \odot I_i\big) + \lambda_2 \mathcal{L}_{SSIM}\big((1 - \widehat{M_i^t}) \odot \tilde{I_i^f},\ (1 - \widehat{M_i^t}) \odot I_i\big), $ Where:
    • Lc\mathcal{L}_c: The rendering loss that focuses on non-transient regions.
    • L1\mathcal{L}_1: The L1 loss (mean absolute error).
    • LSSIM\mathcal{L}_{SSIM}: The Structural Similarity Index (SSIM) loss.
    • Iif~\tilde { I _ { i } ^ { f } } : The final rendered image.
    • IiI_i: The ground truth input image.
    • Mit^\widehat { M _ { i } ^ { t } } : The pseudo transient mask.
    • \odot: The Hadamard product. The term (1Mit^)(1 - \widehat { M _ { i } ^ { t } }) creates a mask that is 0 for transient regions and 1 for non-transient regions, effectively ignoring transient occluders during loss calculation.
    • λ1,λ2\lambda_1, \lambda_2: Hyperparameters balancing the L1 and SSIM components.
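
A minimal sketch of this masked supervision is shown below. `ssim_fn` is a placeholder for any differentiable SSIM implementation (the paper does not specify one here), and the weights `lam1`/`lam2` are illustrative rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def masked_photometric_loss(pred, target, transient_mask, ssim_fn,
                            lam1=0.8, lam2=0.2):
    """L_c: L1 + SSIM supervision restricted to non-transient pixels.

    pred / target: (1, 3, H, W); transient_mask: (1, 1, H, W) with 1 on occluders.
    """
    keep = 1.0 - transient_mask                       # 1 on static (sky/building) pixels
    l1 = F.l1_loss(keep * pred, keep * target)
    l_ssim = 1.0 - ssim_fn(keep * pred, keep * target)  # SSIM similarity -> loss
    return lam1 * l1 + lam2 * l_ssim
```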

4.5.3. Sky and Building Encoders Mutual Distillation

To align the appearance embeddings of the sky and building within the same latent space, a mutual distillation strategy is employed. This alignment helps guide building appearance optimization and allows for estimating the sky cubemap even when the sky is not captured in an image.

  • Manhattan Distance Loss: The alignment is enforced using the Manhattan distance (L1 distance) loss between the sky embeddings lisl_i^s and building embeddings libl_i^b: $ \mathcal { L } _ { m d } = M a n h a t ( l _ { i } ^ { s } , l _ { i } ^ { b } ) $ Where:
    • Lmd\mathcal{L}_{md}: The mutual distillation loss.
    • Manhat(.,.): The Manhattan distance function, which calculates the sum of the absolute differences between corresponding elements of two vectors. This loss encourages the latent representations of sky and building appearances to be similar.
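
Assuming the sky and building embeddings share the same dimensionality (required for the distance to be defined) and that the distance is summed over elements, the loss reduces to a one-liner:

```python
def mutual_distillation_loss(l_sky, l_building):
    """L_md: Manhattan (L1) distance between the sky and building embeddings,
    pulling both encoders toward a shared latent space."""
    return (l_sky - l_building).abs().sum()
```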

4.5.4. Training and Rendering

  • Final Loss Function: The entire framework is optimized end-to-end using the following combined loss function: $ \mathcal { L } = \mathcal { L } _ { c } + \lambda _ { 3 } \mathcal { L } _ { o } + \lambda _ { 4 } \mathcal { L } _ { T V } + \lambda _ { 5 } \mathcal { L } _ { m d } $ Where:

    • L\mathcal{L}: The total loss function.
    • Lc\mathcal{L}_c: The rendering loss for non-transient regions (Eq. 11).
    • Lo\mathcal{L}_o: The opacity loss for the sky region (Eq. 9).
    • LTV\mathcal{L}_{TV}: The Total Variation loss for the implicit sky cubemap (Eq. 6).
    • Lmd\mathcal{L}_{md}: The mutual distillation loss for sky and building embeddings (Eq. 12).
    • λ3,λ4,λ5\lambda_3, \lambda_4, \lambda_5: Hyperparameters to balance the contributions of the respective loss functions.
  • Novel Appearance Synthesis: After training, the model supports three types of novel appearance synthesis:

    1. Unconstrained Image Level: Given a new unconstrained image, its sky and building embeddings are generated (as in training) to produce explicit 3DGS for buildings and an explicit cubemap for the sky, allowing novel views under that specific appearance.
    2. Implicit Embeddings Level: For two input images, their sky and building embeddings are linearly interpolated to create new intermediate embeddings. These intermediate embeddings are then used to synthesize novel appearances that smoothly transition between the two source appearances.
    3. Explicit Cubemap Level: Leveraging the fine sky encoder, an external explicit cubemap can be directly used as input. This cubemap influences the fine sky embeddings and thus the neural in-the-wild 3D Gaussians of the building, enabling explicit sky editing and affecting the building's appearance accordingly.
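
Two small sketches tie this subsection together: the weighted total loss (with placeholder weights, not the paper's hyperparameters) and the embedding-level interpolation used for the second type of novel appearance synthesis.

```python
def total_loss(l_c, l_o, l_tv, l_md, lam3=0.1, lam4=0.01, lam5=0.01):
    """L = L_c + lam3 * L_o + lam4 * L_TV + lam5 * L_md; lambdas are placeholders."""
    return l_c + lam3 * l_o + lam4 * l_tv + lam5 * l_md

def interpolate_embeddings(l_a, l_b, t):
    """Implicit-embedding-level appearance synthesis: linearly blend the
    embeddings extracted from two source images (t in [0, 1])."""
    return (1.0 - t) * l_a + t * l_b
```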

5. Experimental Setup

5.1. Datasets

The proposed method was evaluated on two challenging datasets focused on outdoor scene reconstruction in the wild.

5.1.1. Photo Tourism (PT) Dataset [19]

  • Source & Characteristics: This dataset consists of multiple scenes of well-known monuments (e.g., Brandenburg Gate, Sacré Coeur, Trevi Fountain). It features collections of user-uploaded images which are highly unconstrained. These images vary significantly in:
    • Time and Date: Different times of day, seasons, and weather conditions.
    • Camera Settings: Diverse camera models, exposure levels, white balance, and post-processing effects.
    • Transient Occluders: Presence of people, cars, and other temporary objects.
  • Scale: Each scene contains 800 to 1500 unconstrained images.
  • Purpose: It's a standard benchmark for evaluating in-the-wild 3D reconstruction methods, specifically designed to test robustness against appearance variations and occlusions.

5.1.2. NeRF-OSR Dataset [47]

  • Source & Characteristics: This is an outdoor scene relighting benchmark. It includes multiple sequences (likely videos or dense image sets) captured at different times, often featuring transient occluders such as pedestrians on the street.

  • Scale: Each scene contains 300 to 400 images.

  • Purpose: It's used to evaluate methods that aim to reconstruct scenes under varying illumination conditions and to understand how well they can handle dynamic elements for relighting purposes.

    These datasets were chosen because they represent the "wild" conditions that the paper aims to address, providing diverse challenges related to appearance, lighting, and occlusions, making them effective for validating the method's performance in real-world scenarios.

5.2. Evaluation Metrics

The paper uses standard metrics for evaluating the quality of novel view synthesis.

5.2.1. Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a commonly used metric to quantify the quality of reconstruction of lossy compression codecs or, in this context, the quality of a rendered image compared to its ground truth. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. A higher PSNR generally indicates higher quality.
  • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) $ Where:
    • MAXIMAX_I: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
    • MSE\mathrm{MSE}: The Mean Squared Error between the rendered image and the ground truth image. $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $ Where:
      • II: The ground truth image.
      • KK: The rendered image.
      • M, N: The dimensions (height and width) of the images.
      • I(i,j), K(i,j): The pixel values at coordinates (i,j) in the ground truth and rendered images, respectively.
  • Symbol Explanation:
    • PSNR\mathrm{PSNR}: Peak Signal-to-Noise Ratio, measured in decibels (dB).
    • MAXIMAX_I: Maximum possible pixel intensity value (e.g., 255 for 8-bit images).
    • MSE\mathrm{MSE}: Mean Squared Error.
    • I(i,j): Pixel value of the ground truth image at row ii, column jj.
    • K(i,j): Pixel value of the rendered image at row ii, column jj.
    • MM: Number of rows (height) in the image.
    • NN: Number of columns (width) in the image.
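
For reference, PSNR can be computed in a few lines of NumPy; this is a generic implementation, not tied to the paper's evaluation code.

```python
import numpy as np

def psnr(gt, pred, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of identical shape."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```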

5.2.2. Structural Similarity Index Measure (SSIM)

  • Conceptual Definition: SSIM is a perceptual metric that quantifies the perceived structural similarity between two images. Unlike PSNR, which focuses on absolute error, SSIM considers image degradation as a perceived change in structural information, taking into account luminance, contrast, and structural components. Values range from -1 to 1, with 1 indicating perfect similarity. Higher SSIM values are better.
  • Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $ Where:
    • μx\mu_x: The average of image xx.
    • μy\mu_y: The average of image yy.
    • σx2\sigma_x^2: The variance of image xx.
    • σy2\sigma_y^2: The variance of image yy.
    • σxy\sigma_{xy}: The covariance of images xx and yy.
    • C1=(K1L)2C_1 = (K_1 L)^2 and C2=(K2L)2C_2 = (K_2 L)^2: Two constants to avoid division by a weak denominator, where LL is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and K11,K21K_1 \ll 1, K_2 \ll 1 (e.g., K1=0.01,K2=0.03K_1=0.01, K_2=0.03).
  • Symbol Explanation:
    • SSIM(x,y)\mathrm{SSIM}(x, y): Structural Similarity Index between image xx (ground truth) and image yy (rendered).
    • μx\mu_x: Mean pixel value of image xx.
    • μy\mu_y: Mean pixel value of image yy.
    • σx2\sigma_x^2: Variance of pixel values in image xx.
    • σy2\sigma_y^2: Variance of pixel values in image yy.
    • σxy\sigma_{xy}: Covariance between pixel values of image xx and image yy.
    • C1,C2C_1, C_2: Small constants to stabilize the division.
    • K1,K2K_1, K_2: Small constants (e.g., 0.01, 0.03).
    • LL: Dynamic range of pixel values (e.g., 255).

5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)

  • Conceptual Definition: LPIPS is a perceptual similarity metric that assesses the difference between two images based on the activations of a pre-trained deep convolutional neural network (often VGG or AlexNet). It is designed to correlate better with human perception of image quality than traditional metrics like PSNR or SSIM. A lower LPIPS score indicates higher perceptual similarity (better quality).
  • Mathematical Formula: The LPIPS calculation involves extracting features from intermediate layers of a pre-trained network (e.g., AlexNet or VGG) for both the reference and test images, scaling these features, and then computing the L2 distance between them. $ \mathrm{LPIPS}(x, y) = \sum_l w_l \cdot ||\phi_l(x) - \phi_l(y)||_2^2 $ Where:
    • xx: The ground truth image.
    • yy: The rendered image.
    • ϕl\phi_l: Features extracted from layer ll of a pre-trained deep network (e.g., AlexNet).
    • wlw_l: A learnable scalar weight for each layer ll.
    • 22||\cdot||_2^2: The squared L2 norm (Euclidean distance).
  • Symbol Explanation:
    • LPIPS(x,y)\mathrm{LPIPS}(x, y): Learned Perceptual Image Patch Similarity between image xx and image yy.

    • ϕl(x)\phi_l(x): Feature stack from layer ll of the pre-trained network for image xx.

    • ϕl(y)\phi_l(y): Feature stack from layer ll of the pre-trained network for image yy.

    • wlw_l: Weight for the difference in features at layer ll.

      The paper also reports average training time in GPU hours and rendering times in Frames Per Second (FPS) to evaluate efficiency.
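
In practice LPIPS is usually computed with the reference `lpips` PyPI package rather than reimplemented; a typical usage (assuming an AlexNet backbone and inputs scaled to [-1, 1]) looks like this:

```python
# Assumes the `lpips` PyPI package (Zhang et al.); inputs are torch tensors
# of shape (N, 3, H, W) scaled to [-1, 1].
import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')        # AlexNet features, as commonly reported
img_gt = torch.rand(1, 3, 256, 256) * 2 - 1
img_pred = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(img_gt, img_pred)     # lower = perceptually closer
```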

5.3. Baselines

The proposed method is compared against a comprehensive set of state-of-the-art methods for in-the-wild 3D scene reconstruction, including both NeRF-based and 3DGS-based approaches. These baselines are representative of the current research landscape in this field.

5.3.1. NeRF-based Methods

  • NeRF-W [30]: The foundational NeRF-in-the-wild method, which introduced appearance and transient embeddings.
  • HA-NeRF [7]: An extension of NeRF-W using a CNN-based appearance encoder.
  • CR-NeRF [68]: Further improved NeRF-W with a CNN-based encoder and fine-tuned transient masks.
  • K-Planes [12]: A NeRF variant employing planar factorization for efficiency and quality.
  • RefinedFields [20]: Another NeRF method focusing on refinement for unconstrained scenes.

5.3.2. 3DGS-based Methods

  • SWAG [11]: One of the early 3DGS adaptations for in-the-wild scenes, predicting image-dependent opacity and appearance variations.
  • Splatfacto-W [64]: A NeRFstudio implementation of 3DGS for unconstrained collections.
  • WE-GS [59]: Introduces a lightweight spatial attention module for appearance embeddings and transient masks.
  • GS-W [72]: Employs an adaptive sampling strategy for dynamic appearances.
  • WildGaussians [23]: Uses DINO features and a trainable affine transformation for transient occluders, and attempts to model the sky with 3D Gaussians.
  • 3DGS [21]: The vanilla 3D Gaussian Splatting method, included to show the performance degradation without in-the-wild adaptations.

6. Results & Analysis

6.1. Core Results Analysis

The experiments demonstrate that the proposed sky-aware framework consistently outperforms existing methods in novel view synthesis and novel appearance synthesis, while also offering superior efficiency in terms of convergence and rendering speed.

6.1.1. Performance on Photo Tourism (PT) Dataset

The following are the results from Table 1 of the original paper:

| Method | GPU hrs. / FPS | Brandenburg Gate (PSNR↑ / SSIM↑ / LPIPS↓) | Sacre Coeur (PSNR↑ / SSIM↑ / LPIPS↓) | Trevi Fountain (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|---|---|
| NeRF-W [30] | – / <1 | 24.17 / 0.890 / 0.167 | 19.20 / 0.807 / 0.191 | 18.97 / 0.698 / 0.265 |
| HA-NeRF [7] | 35.1 / <1 | 24.04 / 0.877 / 0.139 | 20.02 / 0.801 / 0.171 | 20.18 / 0.690 / 0.222 |
| CR-NeRF [68] | 31.0 / <1 | 26.53 / 0.900 / 0.106 | 22.07 / 0.823 / 0.152 | 21.48 / 0.711 / 0.206 |
| K-Planes [12] | 0.3 / <1 | 25.49 / 0.924 / – | 20.61 / 0.852 / – | 22.67 / 0.856 / – |
| RefinedFields [20] | 11.8 / <1 | 26.64 / 0.886 / – | 22.26 / 0.817 / – | 23.42 / 0.737 / – |
| SWAG [11]* | 0.8 / 15 | 26.33 / 0.929 / 0.139 | 21.16 / 0.860 / 0.185 | 23.10 / 0.815 / 0.208 |
| Splatfacto-W [64]† | 1.1 / 40 | 26.87 / 0.932 / 0.124 | 22.66 / 0.769 / 0.224 | 22.53 / 0.876 / 0.158 |
| WE-GS [59] | 1.8 / 181 | 27.74 / 0.933 / 0.128 | 23.62 / 0.890 / 0.148 | 23.63 / 0.823 / 0.153 |
| GS-W [72] | 2.0 / 50 | 28.48 / 0.929 / 0.086 | 23.15 / 0.859 / 0.130 | 22.97 / 0.773 / 0.167 |
| WildGaussians [23] | 5.0 / 29 | 28.26 / 0.935 / 0.083 | 23.51 / 0.875 / 0.124 | 24.37 / 0.780 / 0.137 |
| 3DGS [21] | 0.4 / 181 | 20.72 / 0.889 / 0.152 | 17.57 / 0.839 / 0.190 | 17.04 / 0.690 / 0.265 |
| Ours | 1.6 / 217 | 27.79 / 0.936 / 0.081 | 23.51 / 0.892 / 0.111 | 24.61 / 0.824 / 0.130 |

As shown in Table 1, the proposed method Ours achieves competitive or superior performance across the PT dataset scenes, especially for Trevi Fountain and Sacre Coeur.

  • PSNR: Our method achieves the highest PSNR for Trevi Fountain (24.61) and matches WildGaussians on Sacre Coeur (23.51), just below WE-GS (23.62). For Brandenburg Gate, it reaches 27.79, slightly lower than GS-W (28.48) and WildGaussians (28.26), but still significantly better than the remaining methods.

  • SSIM: Our method consistently achieves high SSIM values, indicating excellent structural similarity. For Brandenburg Gate (0.936) and Sacre Coeur (0.892), it is among the top performers.

  • LPIPS: Our method demonstrates the lowest (best) LPIPS scores across all three scenes (Brandenburg Gate 0.081, Sacre Coeur 0.111, Trevi Fountain 0.130), indicating superior perceptual quality.

  • Efficiency (FPS): With 217 FPS, Ours achieves the highest rendering speed among all methods, including vanilla 3DGS (181 FPS) and WE-GS (181 FPS). This confirms the efficiency gains from avoiding complex sky modeling with Gaussians.

  • Training Time (GPU hrs): At 1.6 GPU hours, Ours is efficient, comparable to WE-GS (1.8), GS-W (2.0), SWAG (0.8), and faster than WildGaussians (5.0) and most NeRF-based methods (e.g., CR-NeRF 31.0, HA-NeRF 35.1).

    The visual comparisons in Fig. 5 (from the original paper) further support these quantitative results, highlighting that Ours produces more accurate and visually pleasing reconstructions, especially in sky regions, compared to baselines like WildGaussians which often show ellipsoidal noise in the sky.

The following figure (Figure 5 from the original paper) shows a visual comparison of reconstruction quality on the PT dataset:

Fig. 5: Qualitative comparison of scene reconstruction on the PT dataset, showing GT, Ours, 3DGS, WildGaussians, and GS-W under different conditions. Red insets highlight the regions where differences in quality are most apparent.

6.1.2. Performance on NeRF-OSR Dataset

The following are the results from Table 2 of the original paper:

| Method | europa (PSNR↑ / SSIM↑ / LPIPS↓) | lwp (PSNR↑ / SSIM↑ / LPIPS↓) | st (PSNR↑ / SSIM↑ / LPIPS↓) | stjohann (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|---|---|
| NeRF [34] | 17.49 / 0.551 / 0.503 | 11.51 / 0.468 / 0.574 | 17.20 / 0.514 / 0.502 | 14.89 / 0.432 / 0.639 |
| NeRF-W [30] | 20.00 / 0.699 / 0.340 | 19.61 / 0.616 / 0.445 | 20.31 / 0.607 / 0.438 | 21.23 / 0.670 / 0.426 |
| HA-NeRF [7] | 17.79 / 0.632 / 0.421 | 20.03 / 0.685 / 0.365 | 17.30 / 0.538 / 0.483 | 17.19 / 0.686 / 0.331 |
| CR-NeRF [68] | 21.03 / 0.721 / 0.294 | 21.90 / 0.719 / 0.336 | 20.68 / 0.630 / 0.402 | 22.84 / 0.793 / 0.235 |
| SWAG [11] | 23.91 / 0.864 / 0.172 | 22.07 / 0.783 / 0.303 | 22.29 / 0.713 / 0.364 | 23.74 / 0.845 / 0.242 |
| WE-GS [59]* | 24.74 / 0.873 / 0.157 | 24.33 / 0.821 / 0.197 | 22.45 / 0.720 / 0.341 | 24.12 / 0.858 / 0.202 |
| GS-W [72] | 24.70 / 0.879 / 0.144 | 24.50 / 0.817 / 0.201 | 23.32 / 0.740 / 0.321 | 24.20 / 0.849 / 0.221 |
| WildGaussians [23] | 23.97 / 0.861 / 0.174 | 22.12 / 0.791 / 0.310 | 27.16 / 0.709 / 0.366 | 22.84 / 0.827 / 0.274 |
| 3DGS [21] | 20.18 / 0.782 / 0.252 | 11.76 / 0.609 / 0.414 | 18.10 / 0.629 / 0.406 | 18.57 / 0.741 / 0.268 |
| Ours | 24.71 / 0.879 / 0.141 | 24.57 / 0.826 / 0.189 | 22.65 / 0.742 / 0.320 | 24.61 / 0.867 / 0.193 |

On the NeRF-OSR dataset, our method also achieves state-of-the-art performance across several metrics, as shown in Table 2.

  • Ours achieves the highest PSNR for lwp (24.57) and stjohann (24.61), and is competitive for europa (24.71) with WE-GS and GS-W. Notably, it improves the average PSNR by 7.7 dB over vanilla 3DGS on this dataset, indicating its strong adaptation to complex outdoor conditions.

  • SSIM scores are consistently high (e.g., europa 0.879, lwp 0.826, stjohann 0.867), demonstrating excellent structural preservation.

  • LPIPS scores are among the lowest, confirming superior perceptual quality across scenes.

  • The results show that Ours not only reconstructs the sky with greater accuracy but also captures building details more precisely.

    The visual comparisons in Fig. 6 (from the original paper) further illustrate the quality improvements, especially in detailed building structures and clear sky rendering.

Fig. 6: Qualitative comparison on the NeRF-OSR dataset. Non-obvious differences in quality are highlighted by insets.

6.2. Ablation Studies / Parameter Analysis

Ablation studies were conducted on three scenes from the PT dataset to validate the design choices.

The following are the results from Table 3 of the original paper:

| Ablation | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| (1) w/o pseudo labels | 24.91 | 0.867 | 0.114 |
| (2) w/o sky encoder | 24.39 | 0.856 | 0.117 |
| (3) Neural sky size 16 × 256 × 256 | 25.21 | 0.880 | 0.112 |
| (4) Neural sky size 8 × 512 × 512 | 25.31 | 0.884 | 0.106 |
| (5) w/o sky embeddings | 23.81 | 0.849 | 0.123 |
| (6) w/o implicit cubemap | 23.76 | 0.843 | 0.123 |
| (7) w/o $\mathcal{L}_{TV}$ | 25.09 | 0.880 | 0.113 |
| (8) w/o fine sky encoder | 25.29 | 0.885 | 0.109 |
| (9) w/o mutual distillation | 25.17 | 0.881 | 0.110 |
| (10) w/o neural feature F | 24.96 | 0.870 | 0.112 |
| (11) w/o PE F init. | 25.24 | 0.882 | 0.109 |
| (12) Complete model | 25.30 | 0.884 | 0.107 |

6.2.1. Contribution to Rendering Quality

  • (1) w/o pseudo labels (using fine-tuned transient mask predictor): PSNR drops from 25.30 to 24.91. This indicates that the proposed greedy supervision strategy with pseudo masks is more efficient and provides better (or at least comparable) results than training a per-image transient mask predictor. The authors conclude that the per-image transient mask predictor is unnecessary.
  • (2) w/o sky encoder (sky and building reconstructed by neural in-the-wild 3D Gaussians): PSNR drops significantly to 24.39. This highlights the importance of separately encoding sky and building appearances, and the benefit of the dedicated neural sky module.
  • (10) w/o neural feature $F$ (using position directly as input to $MLP_\omega$): PSNR drops to 24.96. The unconstrained radiance feature $F$ (initialized with positional encoding) plays a role in enhancing the translated radiance.
  • (11) w/o PE init. of $F$ (initializing $F$ without positional encoding): PSNR is 25.24, slightly lower than the complete model (25.30). This suggests that positional-encoding initialization of $F$ is beneficial for capturing high-frequency details.
  • (12) Complete model: Achieves the best overall performance (PSNR 25.30, SSIM 0.884, LPIPS 0.107), confirming that all proposed modules contribute to the final quality.
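
The greedy supervision strategy amounts to gating the photometric loss with the pseudo transient mask, so no per-image mask predictor needs to be optimized. Below is a minimal PyTorch sketch, assuming a boolean mask from the pre-trained segmentation network; the normalization and the absence of an SSIM term are simplifications, not the paper's exact loss.

```python
import torch

def masked_photometric_loss(rendered, gt, transient_mask):
    """L1 loss restricted to pixels the pseudo mask labels as static.

    rendered, gt:   (3, H, W) tensors in [0, 1]
    transient_mask: (H, W) bool tensor, True where the segmentation network
                    detected a transient object (person, car, ...)
    """
    static = (~transient_mask).float().unsqueeze(0)   # (1, H, W); 1 = supervise
    l1 = (rendered - gt).abs() * static               # greedily drop all masked pixels
    return l1.sum() / (3.0 * static.sum() + 1e-8)     # normalize by supervised pixel count
```

Being greedy means over-masking is cheap: a few static pixels wrongly dropped only shrink the supervision set, which is consistent with the noise ablation in Section 6.3.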

6.2.2. Influence of Neural Sky Module

  • (5) w/o sky embeddings (neural sky replaced by a learnable explicit cubemap without image-specific sky embeddings): PSNR drops to 23.81. Without sky embeddings, different unconstrained images would render the same sky, demonstrating the importance of sky embeddings for diverse novel appearance synthesis. As shown in Fig. 7 (a), the predicted cubemap remains static across different appearances.

  • (6) w/o implicit cubemap (8-layer coordinate-based MLP with only sky embeddings as input): PSNR drops to 23.76, the lowest score. This variant struggles to produce desired results, with generated skies tending to exhibit uniform color. This emphasizes that the implicit cubemap within the neural sky module is crucial for mitigating catastrophic forgetting and capturing complex, location-dependent sky details.

  • (7) w/o $L_{TV}$ (without the total variation loss): PSNR drops to 25.09. Removing the total variation loss leads to higher-frequency noise in the predicted cubemap, which degrades sky rendering quality; Fig. 7 (b) visually confirms this. A minimal sketch of the cubemap decoding and the TV regularizer appears after Fig. 7 below.

  • (3) & (4) Neural sky size (16×256×256 vs. 8×512×512): Increasing the feature dimension (16×256×256, PSNR 25.21) does not improve quality over the complete model (25.30), while increasing the resolution (8×512×512, PSNR 25.31) yields only a marginal gain at the cost of additional storage.

  • (8) w/o fine sky encoder: PSNR is 25.29, very close to the complete model's 25.30. While it doesn't significantly improve metrics, its purpose is to extract features from the cubemap to condition the neural in-the-wild 3D Gaussians, allowing for more flexible applications in novel appearance and view synthesis, as further illustrated in Fig. 10.

    The following figure (Figure 7 from the original paper) shows ablation studies on the neural sky module and TV loss:

Fig. 7: Ablation studies on the neural sky module and TV loss. (a) Without the neural sky module, the predicted cubemap is learned across scenes and remains the same regardless of different appearances. (b) Without TV loss, the frequency of the predicted cubemap is too high, which adversely affects the rendering quality of the sky.
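
To make the shape of the neural sky module and the $L_{TV}$ regularizer concrete, here is a minimal PyTorch sketch. The feature dimension and resolution (8 × 256 × 256, presumably the default given ablations (3) and (4)), the embedding size, and the MLP widths are assumptions; this illustrates the idea rather than reproducing the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NeuralSky(nn.Module):
    """Sketch: shared implicit cubemap features + per-image sky embedding -> explicit RGB cubemap."""

    def __init__(self, embed_dim=32, feat_dim=8, res=256):
        super().__init__()
        # Learnable feature cubemap shared across all images: 6 faces x feat_dim x res x res.
        self.feat = nn.Parameter(0.01 * torch.randn(6, feat_dim, res, res))
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, sky_embedding):
        """sky_embedding: (embed_dim,) latent sky code of one unconstrained image."""
        faces, c, h, w = self.feat.shape
        feat = self.feat.permute(0, 2, 3, 1).reshape(-1, c)        # (6*h*w, feat_dim)
        emb = sky_embedding.view(1, -1).expand(feat.shape[0], -1)  # broadcast the embedding
        rgb = self.mlp(torch.cat([feat, emb], dim=-1))             # decode every texel
        return rgb.reshape(faces, h, w, 3)                         # explicit cubemap

def tv_loss(cubemap):
    """Total-variation regularizer that discourages high-frequency noise in the cubemap."""
    dh = (cubemap[:, 1:, :, :] - cubemap[:, :-1, :, :]).abs().mean()
    dw = (cubemap[:, :, 1:, :] - cubemap[:, :, :-1, :]).abs().mean()
    return dh + dw
```

Two different sky embeddings yield two different explicit cubemaps, which is the behavior lost in ablation (5), while the shared feature grid is what ablation (6) removes.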

6.2.3. Mutual Distillation

  • (9) w/o mutual distillation: PSNR drops to 25.17. The mutual distillation learning strategy to align sky and building appearance embeddings contributes to enhancing overall rendering quality and efficiency.
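
One plausible way to read the mutual distillation strategy is as a symmetric loss that pulls an image's sky and building embeddings toward a shared latent space. The stop-gradient formulation, the MSE distance, and the 0.5 weighting below are assumptions for illustration, not the paper's exact objective.

```python
import torch.nn.functional as F

def mutual_distillation_loss(z_building, z_sky):
    """Symmetric distillation between the two per-image appearance embeddings.

    Each direction detaches its 'teacher' so the embeddings distill into each
    other instead of collapsing through a single gradient path.
    """
    loss_b = F.mse_loss(z_building, z_sky.detach())   # sky teaches building
    loss_s = F.mse_loss(z_sky, z_building.detach())   # building teaches sky
    return 0.5 * (loss_b + loss_s)
```

Aligning the two embeddings in this way is what makes it possible to infer a plausible sky even from an image whose sky is not visible (Section 6.4.1).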

6.3. Ablation Study on Pseudo Semantic Mask Quality

To evaluate the robustness of the greedy supervision strategy to pseudo semantic mask inaccuracies, noise was intentionally introduced into the masks.

The following are the results from Table 4 of the original paper:

| Setting | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Add 5% noise to transient masks | 27.77 | 0.934 | 0.082 |
| Add 10% noise to transient masks | 27.78 | 0.933 | 0.084 |
| Add 5% noise to sky and building masks | 27.65 | 0.931 | 0.089 |
| Add 10% noise to sky and building masks | 26.97 | 0.922 | 0.101 |
| w/ pseudo masks | 27.79 | 0.936 | 0.081 |
  • Noise in Transient Masks: Adding 5% or 10% noise to the transient masks (meaning some non-transient pixels are mistakenly identified as transient) has a minimal impact on reconstruction accuracy (PSNR 27.77-27.78 vs. 27.79 for the baseline). This is attributed to the greedy supervision strategy, where misidentifying building as transient mainly affects the number of samples used for supervision, but enough correct samples remain for robust reconstruction.
  • Noise in Sky and Building Masks: Introducing noise (5% or 10%) into the sky and building masks (meaning some transient pixels are mistakenly identified as sky or building, or vice-versa) leads to a more significant drop in accuracy (PSNR 27.65 for 5% noise, 26.97 for 10% noise). This is because misclassifying transient masks as building or sky forces the model to reconstruct transient objects as part of the static scene, which severely impacts accuracy. This highlights the importance of the semantic segmentation accurately delineating the static scene components.
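
The noise-injection protocol can be reproduced in spirit by flipping a random fraction of mask pixels; the paper does not specify its exact noise model, so the uniform pixel flipping below is an assumption.

```python
import torch

def add_mask_noise(mask, flip_ratio=0.05, generator=None):
    """Flip a random `flip_ratio` fraction of pixels in a boolean (H, W) mask."""
    noise = torch.rand(mask.shape, generator=generator) < flip_ratio
    return mask ^ noise  # XOR flips the selected pixels
```

Applying this to the transient masks only shrinks or slightly perturbs the supervision set, whereas applying it to the sky and building masks leaks transient pixels into the static supervision, matching the sensitivity pattern in Table 4.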

6.4. Applications

The proposed framework offers flexible novel appearance synthesis capabilities.

6.4.1. Novel Appearance Synthesis from an Unconstrained Image

Given a single unconstrained image, the method can infer its building and sky embeddings, which are then used to generate a novel explicit standard 3DGS representation for buildings and an explicit cubemap for the sky. This allows for rendering novel views under the appearance of the input image. Fig. 8 (from the original paper) showcases this capability, demonstrating successful inference of the novel appearance of an entire building even when partially visible in the input, and plausible sky cubemap inference even when the sky is absent from the input image (due to the mutual distillation strategy).
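The inference flow described above can be sketched as follows; `building_encoder`, `sky_encoder`, `gaussians`, and `neural_sky` are hypothetical stand-ins for the paper's modules, and the render call is schematic.

```python
import torch

@torch.no_grad()
def synthesize_from_image(image, building_encoder, sky_encoder, gaussians, neural_sky, cameras):
    """Render novel views under the appearance of a single unconstrained image (sketch)."""
    z_building = building_encoder(image)   # per-image building appearance embedding
    z_sky = sky_encoder(image)             # plausible even if the sky is cropped out,
                                           # thanks to the shared latent space
    scene = gaussians(z_building)          # bake an explicit, standard 3DGS model
    cubemap = neural_sky(z_sky)            # bake an explicit sky cubemap
    return [scene.render(cam, background=cubemap) for cam in cameras]
```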

The following figure (Figure 8 from the original paper) shows appearance modeling from an unconstrained image:

The image is a schematic showing transitions between source views and target appearances from different viewpoints, with buildings rendered under different environmental conditions, illustrating the effectiveness of the proposed framework for 3D scene reconstruction.

Fig. 8: Appearance modeling from an unconstrained image, showcasing multiple novel views under the novel appearance.

6.4.2. Novel Appearance Synthesis from Appearance Embeddings

The method allows for linear interpolation of implicit appearance embeddings between any two unconstrained images. This creates intermediate embeddings that, when injected into the neural in-the-wild 3DGS and neural sky, result in novel appearances with smooth transitions. Fig. 9 (from the original paper) illustrates this for variations in weather, time, and camera parameters (e.g., exposure changes), demonstrating a rich and smooth range of appearance transformations.
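Smooth appearance transitions reduce to a linear interpolation in embedding space; a minimal sketch (the embeddings and the downstream renderer are assumed to come from the trained model):

```python
import torch

def interpolate_appearance(z_a, z_b, steps=8):
    """Linearly interpolate between two per-image appearance embeddings."""
    return [(1.0 - t) * z_a + t * z_b for t in torch.linspace(0.0, 1.0, steps)]
```

Each intermediate embedding is injected into both the neural in-the-wild 3DGS and the neural sky module, so building and sky appearance change in a synchronized way.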

The following figure (Figure 9 from the original paper) illustrates appearance synthesis from interpolated embeddings:

The image is a schematic divided into three parts, (a), (b), and (c), showing scene reconstruction results between source and target appearances and the gradual transitions between them.

Fig. 9: Novel appearance synthesis from interpolated appearance embeddings, showing smooth transitions between the source and target views.

6.4.3. Novel Appearance Synthesis from Explicit Cubemap

By leveraging the fine sky encoder, the approach enables novel appearance synthesis through direct editing or interpolation of the explicit cubemap. This means an external sky cubemap can be directly fed into the system. As seen in Fig. 10 (from the original paper), interpolating an explicit sky cubemap produces natural and realistic results that are distinct from interpolating implicit embeddings. Furthermore, the sky can be explicitly edited (e.g., replacing it with a fantastical sky) to create various virtual scenes, offering enhanced control over scene editing.
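Because the fine sky encoder accepts an explicit cubemap, sky editing can operate directly on cubemap images. A minimal sketch, where `fine_sky_encoder` and the conditioned render call are hypothetical placeholders for the paper's modules:

```python
import torch

def blend_cubemaps(cubemap_a, cubemap_b, t):
    """Pixel-wise blend of two explicit (6, H, W, 3) sky cubemaps."""
    return (1.0 - t) * cubemap_a + t * cubemap_b

def render_with_external_sky(cubemap, fine_sky_encoder, gaussians, camera):
    """Condition the building Gaussians on an edited or external sky cubemap (sketch)."""
    sky_feat = fine_sky_encoder(cubemap)   # features extracted from the explicit cubemap
    return gaussians.render(camera, sky_condition=sky_feat, background=cubemap)
```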

The following figure (Figure 10 from the original paper) illustrates novel appearance synthesis from explicit cubemaps:


Fig. 10: Novel appearance synthesis from an explicit sky cubemap, showing transitions between source and target appearances via cubemap interpolation and direct sky editing.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces an efficient, sky-aware framework for reconstructing 3D scenes from unconstrained photo collections using 3D Gaussian Splatting (3DGS). The framework's key contributions include:

  1. Eliminating Per-Image Transient Mask Predictors: By utilizing pseudo masks from a pre-trained semantic segmentation network with a greedy supervision strategy, the method streamlines the process, removing the need for slow-converging per-image transient mask predictors and enhancing efficiency.

  2. Separate Sky and Building Appearance Learning: The framework recognizes the distinct properties of sky and building radiance, enabling more efficient and accurate reconstruction by encoding their appearances separately.

  3. Neural Sky Module for Explicit Cubemap Generation: A novel neural sky module generates diverse explicit cubemaps for the sky from latent sky embeddings, which helps capture location-dependent details and mitigates catastrophic forgetting.

  4. Mutual Distillation for Latent Space Alignment: A mutual distillation strategy aligns sky and building appearance embeddings in the same latent space, fostering inter-component guidance and enabling plausible sky inference even when sky data is missing.

    Extensive experiments on the Photo Tourism and NeRF-OSR datasets demonstrate that the proposed method achieves state-of-the-art performance in novel view and appearance synthesis, delivering superior rendering quality with faster convergence and rendering speed compared to existing in-the-wild 3DGS and NeRF methods.

7.2. Limitations & Future Work

The authors acknowledge several limitations of their proposed method:

  • Complex Lighting Environments: The method relies on a neural-appearance representation without strict physical constraints and struggles with accurately reconstructing highly complex lighting environments. This is partly because it cannot segment the transient objects' shadows, which are crucial for consistent lighting.
  • Lack of Physical Constraints for Scene Editing: While the neural sky module allows for controllable scene editing (e.g., explicit cubemap interpolation), both implicit and explicit cubemap interpolation approaches lack physical constraints related to changes in sunlight position or weather conditions. This means achieving precise outdoor scene relighting remains an open challenge.
  • Indoor Scenes: The current method specifically focuses on outdoor scene reconstruction in the wild. Its applicability and performance for in-the-wild reconstruction of indoor scenes have not been addressed and are left as future work.

7.3. Personal Insights & Critique

The paper presents a very intuitive and practical approach to improve 3DGS in-the-wild performance. The core idea of differentiating sky and building components is a clever way to leverage their distinct characteristics; the sky is generally simpler and more uniform than complex building structures with diverse materials. This sky-aware design, combined with the simplification of transient mask prediction, directly addresses the efficiency and quality bottlenecks prevalent in previous 3DGS-in-the-wild methods.

Inspirations and Transferability:

  • Semantic Priors for Efficiency: The success of using pseudo masks from LVMs without fine-tuning is a significant takeaway. It demonstrates that off-the-shelf, powerful semantic priors can be integrated with minimal overhead to solve complex problems in other domains. This could be applied to other reconstruction tasks where specific object categories or background elements behave predictably. For example, segmenting out water bodies, foliage, or road surfaces and treating them with specialized, more efficient models could improve performance in large-scale urban scene reconstruction.
  • Component-wise Modeling: The idea of breaking down a complex scene into components (sky, building) and modeling them with tailored representations (explicit cubemap vs. neural Gaussians) is powerful. This divide-and-conquer strategy could be extended to other scene components or even to different levels of detail within a component. For instance, fine-grained segmentation of different building materials (glass, concrete, metal) could allow for more physically accurate or specialized appearance modeling for each, enhancing realism.
  • Mutual Distillation for Latent Alignment: The mutual distillation strategy is elegant. The concept of aligning latent spaces of related but distinct components could be beneficial in other multi-modal or multi-component learning scenarios, allowing knowledge transfer and robust inference even when one component's data is sparse or missing.

Potential Issues and Areas for Improvement:

  • Reliance on LVM Accuracy: While the greedy supervision strategy is robust to some noise in transient masks, its performance on sky and building masks is more sensitive (as shown in ablation studies). The quality of the pseudo masks from the LVM is a hard dependency. If the LVM misclassifies critical parts of a building as sky (or vice-versa), or fails entirely on novel scene types, the overall reconstruction quality could suffer. Further robustness might involve uncertainty quantification from the LVM or a semi-supervised refinement step.

  • Limited Physical Realism: The acknowledged limitation regarding complex lighting environments and relighting is a significant one for in-the-wild applications. The neural-appearance representation is powerful for reproducing observed appearances but less so for physically-based simulation. Integrating physically-based rendering (PBR) principles or explicit light source modeling (e.g., sun position, sky radiance models) could address this, moving beyond purely data-driven appearance synthesis.

  • Generalization to Diverse Skies: While the neural sky module generates diverse skies, its ability to generalize to truly novel or extreme sky conditions (e.g., highly unusual cloud formations, aurora borealis) not well-represented in the training data might be limited by the implicit cubemap and sky embeddings.

  • Computational Overhead of LVMs: Although the LVM is pre-trained and not fine-tuned during reconstruction, running a large semantic segmentation model on potentially thousands of images during preprocessing can still be computationally intensive for very large datasets, even if it's a one-time cost. This should be considered for truly large-scale applications.

    Overall, this paper makes a valuable contribution by pragmatically addressing the efficiency and quality trade-offs in in-the-wild 3DGS reconstruction. Its focus on intelligent scene decomposition and leveraging existing powerful tools (LVMs) offers a practical path forward for more robust and performant neural rendering.
