Look at the Sky: Sky-aware Efficient 3D Gaussian Splatting in the Wild
TL;DR Summary
This paper introduces a sky-aware 3D Gaussian Splatting framework for efficient scene reconstruction from unconstrained photo collections. It leverages a greedy supervision strategy and pseudo masks from a pre-trained segmentation network, improving efficiency and rendering quality.
Abstract
Photos taken in unconstrained tourist environments often present challenges for accurate 3D scene reconstruction due to variable appearances and transient occlusions, which can introduce artifacts in novel view synthesis. Recently, in-the-wild 3D scene reconstruction has achieved realistic rendering with Neural Radiance Fields (NeRFs). With the advancement of 3D Gaussian Splatting (3DGS), some methods also attempt to reconstruct 3D scenes from unconstrained photo collections and achieve real-time rendering. However, the rapid convergence of 3DGS is misaligned with the slower convergence of the neural network-based appearance encoder and transient mask predictor, hindering reconstruction efficiency. To address this, we propose a novel sky-aware framework for scene reconstruction from unconstrained photo collections using 3DGS. Firstly, we observe that the learnable per-image transient mask predictor in previous work is unnecessary. By introducing a simple yet efficient greedy supervision strategy, we directly utilize the pseudo mask generated by a pre-trained semantic segmentation network as the transient mask, thereby achieving more efficient and higher-quality in-the-wild 3D scene reconstruction. Secondly, we find that separately estimating appearance embeddings for the sky and building significantly improves reconstruction efficiency and accuracy. We analyze the underlying reasons and introduce a neural sky module to generate diverse skies from latent sky embeddings extracted from unconstrained images. Finally, we propose a mutual distillation learning strategy to constrain sky and building appearance embeddings within the same latent space, further enhancing reconstruction efficiency and quality. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing methods in novel view and appearance synthesis, offering superior rendering quality with faster convergence and rendering speed.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Look at the Sky: Sky-aware Efficient 3D Gaussian Splatting in the Wild
1.2. Authors
Yuze Wang, Junyi Wang, Ruicheng Gao, Yansong Qu, Wantong Duan, Shuo Yang, Yue Qi
1.3. Journal/Conference
The paper is marked as "Published at (UTC): 2025-03-07T00:00:00.000Z," which suggests it is a forthcoming publication or a preprint accepted for a future venue. Given the topic and the authors' affiliations (likely research institutions or universities), it is expected to be published in a reputable computer graphics or computer vision conference (e.g., CVPR, ICCV, ECCV, SIGGRAPH) or journal. These venues are highly influential in the relevant fields, known for publishing cutting-edge research in 3D reconstruction, neural rendering, and computer vision.
1.4. Publication Year
2025 (Based on the publication date provided: 2025-03-07)
1.5. Abstract
This paper addresses the challenges of accurate 3D scene reconstruction from unconstrained photo collections (photos taken in diverse, real-world conditions like tourist environments), which often suffer from variable appearances (e.g., changing lighting, weather) and transient occlusions (e.g., people, cars). While Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have advanced in-the-wild reconstruction, current 3DGS-based methods face efficiency issues because their fast convergence is misaligned with the slower convergence of neural networks used for appearance encoding and transient mask prediction.
To overcome this, the authors propose a novel sky-aware framework. First, they simplify the transient mask prediction by observing that a learnable per-image transient mask predictor is unnecessary. Instead, they introduce a greedy supervision strategy that directly uses pseudo masks generated by a pre-trained semantic segmentation network, leading to more efficient and higher-quality reconstruction. Second, they find that separately estimating appearance embeddings for the sky and buildings significantly improves efficiency and accuracy. They introduce a neural sky module to generate diverse skies from latent sky embeddings. Finally, they propose a mutual distillation learning strategy to constrain sky and building appearance embeddings within the same latent space, further enhancing efficiency and quality. Extensive experiments demonstrate that their framework outperforms existing methods in novel view and appearance synthesis, offering superior rendering quality with faster convergence and rendering speed.
1.6. Original Source Link
/files/papers/6919d53c110b75dcc59ae2a4/paper.pdf (This link points to a local file path, indicating it was likely provided as part of a dataset or internal system. Its publication status is likely a preprint or an internally hosted version awaiting official publication.)
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the efficient and high-quality 3D scene reconstruction from unconstrained photo collections using 3D Gaussian Splatting (3DGS). Unconstrained photo collections refer to images taken in real-world scenarios, often by tourists, exhibiting significant variations in viewpoint, camera settings, lighting conditions, and the presence of transient occluders (e.g., people, vehicles, temporary objects).
This problem is highly important in computer graphics and computer vision due to its applications in virtual reality (VR), scene editing, and digital twins of real-world landmarks. Prior research, particularly with Neural Radiance Fields (NeRFs) and more recently with 3DGS, has made strides in reconstructing in-the-wild scenes. However, existing 3DGS-based methods that adapt to unconstrained conditions introduce neural networks to handle appearance variations (e.g., changing illumination, time of day) and transient occlusions. The key challenge or gap is a mismatch in convergence speed: 3DGS itself converges rapidly, but the concurrently trained neural networks for appearance encoding and transient mask prediction require significantly more iterations to converge. This disparity creates a training bottleneck, hindering the overall efficiency and sometimes the quality of the reconstruction.
The paper's innovative idea stems from a critical reconsideration: Can we simplify or omit certain modules, or leverage image intrinsics and semantics as priors to improve reconstruction efficiency, or even enhance reconstruction quality? The authors identify that treating the sky and buildings differently, and simplifying transient mask prediction, can significantly improve the process.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel Transient Mask Prediction: They demonstrate that a learnable per-image transient mask predictor is unnecessary. Instead, they propose a simple yet efficient greedy supervision strategy that directly leverages pseudo masks generated by a pre-trained Large Vision Model (LVM) for semantic segmentation. This approach reduces trainable parameters, resolves ambiguity between the 3D radiance field and the 2D transient mask, and improves training efficiency and quality.
- Sky-aware Framework with Separate Appearance Learning: They introduce a sky-aware framework that recognizes the distinct radiance properties of the sky and buildings. They segment these regions during preprocessing and encode their appearances separately using two distinct encoders. This differential treatment enhances reconstruction efficiency and accuracy.
- Neural Sky Module for Explicit Cubemap Generation: A novel neural sky module is proposed, which generates explicit cubemaps for diverse skies from latent sky embeddings. This module explicitly handles location-dependent sky details and mitigates catastrophic forgetting issues often associated with simple Multi-Layer Perceptrons (MLPs).
- Mutual Distillation Learning Strategy: To align the sky and building appearance embeddings within the same latent space, they propose a mutual distillation learning strategy. This alignment allows the sky embeddings to guide and constrain building appearance optimization, and potentially infer sky appearance even when it is not present in the input image.
- State-of-the-Art Performance: Extensive experiments on the Photo Tourism (PT) and NeRF-OSR datasets demonstrate that their framework achieves state-of-the-art (SOTA) performance in novel view synthesis and novel appearance synthesis, offering superior rendering quality with faster convergence and rendering speed compared to existing methods.

These findings collectively address the efficiency bottleneck in 3DGS-based in-the-wild reconstruction by simplifying mask prediction, introducing a specialized and guided sky representation, and aligning appearance embeddings, ultimately leading to higher quality and faster processing.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice should be familiar with the following foundational concepts:
3.1.1. 3D Gaussian Splatting (3DGS)
3D Gaussian Splatting (3DGS) [21] is a novel explicit 3D scene representation and rendering technique that has recently emerged as a highly efficient alternative to Neural Radiance Fields (NeRFs). Instead of representing a scene implicitly with a neural network, 3DGS explicitly represents a 3D scene using millions of anisotropic 3D Gaussians.
- 3D Gaussian: Each 3D Gaussian is a learnable primitive characterized by:
  - 3D Position ($\mu$): The center of the Gaussian in 3D space.
  - Covariance Matrix ($\Sigma$): Defines the shape and orientation of the Gaussian ellipsoid. It can be decomposed into a scaling matrix ($S$), which controls the size along its principal axes, and a rotation matrix ($R$), which controls its orientation.
  - Opacity ($\alpha$): Determines how transparent or opaque the Gaussian is.
  - Color (Spherical Harmonics coefficients): Represents the color of the Gaussian, which can vary with viewing direction. Spherical Harmonics (SH) are functions used to approximate complex directional light distributions, allowing for view-dependent appearance.
- Projection to 2D: During rendering, these 3D Gaussians are projected onto the 2D image plane of a virtual camera. The covariance matrix of the 3D Gaussian is transformed into a 2D covariance matrix on the image plane.
- Rasterization and $\alpha$-Blending: The projected 2D Gaussians are then rasterized (drawn) onto the image. Their colors and opacities are combined using alpha blending (also known as volumetric rendering). This process blends the contributions of Gaussians sorted by depth, from back to front, to produce the final pixel color (see the compositing sketch after this list). The paper uses the following formula for volumetric rendering:
  $ \widetilde{I}(x, d) = \sum_{i \in M} T_i \, \alpha_i' \, C_i(d), \quad \text{where } T_i = \prod_{j=1}^{i-1} (1 - \alpha_j') $
  Where:
  - $\widetilde{I}(x, d)$: The final rendered color of a pixel along a ray from viewpoint $x$ in direction $d$.
  - $M$: The set of 3D Gaussians sampled along the ray.
  - $T_i$: The accumulated transmittance (how much light reaches Gaussian $i$ without being blocked by previous Gaussians) along the ray up to Gaussian $i$. It is the product of $(1 - \alpha_j')$ over all preceding Gaussians $j < i$.
  - $\alpha_i'$: The effective opacity of Gaussian $i$ after considering its 2D projection and density. It is computed from the base opacity, evaluated at the pixel's coordinates in the projected space using the projected 2D mean and the projected 2D covariance.
  - $C_i(d)$: The color of Gaussian $i$ as seen from direction $d$, determined by its Spherical Harmonic coefficients.
- Advantages: 3DGS offers significant advantages in convergence speed and rendering efficiency compared to NeRFs, often achieving real-time rendering.
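The α-blending above is simple to express in code. The following is a minimal sketch (not the paper's implementation) of front-to-back compositing for a single pixel, assuming the covering Gaussians are already depth-sorted and their effective opacities $\alpha_i'$ and view-dependent colors $C_i(d)$ have been evaluated; function and variable names are illustrative.

```python
import torch

def composite_pixel(alphas: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing for one pixel.

    alphas: (M,) effective opacities of the depth-sorted Gaussians covering the pixel.
    colors: (M, 3) view-dependent colors C_i(d) of those Gaussians.
    Returns sum_i T_i * alpha_i * C_i with T_i = prod_{j<i} (1 - alpha_j).
    """
    # Transmittance before each Gaussian: T_1 = 1, T_i = prod_{j<i} (1 - alpha_j).
    one_minus = 1.0 - alphas
    T = torch.cumprod(torch.cat([torch.ones(1), one_minus[:-1]]), dim=0)
    weights = T * alphas                              # per-Gaussian blending weights
    return (weights.unsqueeze(-1) * colors).sum(dim=0)

# Toy usage: three Gaussians sorted near-to-far.
rgb = composite_pixel(torch.tensor([0.6, 0.5, 0.9]),
                      torch.tensor([[1.0, 0.0, 0.0],
                                    [0.0, 1.0, 0.0],
                                    [0.0, 0.0, 1.0]]))
```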
3.1.2. Neural Radiance Fields (NeRFs)
Neural Radiance Fields (NeRFs) [34] represent a 3D scene as a continuous volumetric function using a Multi-Layer Perceptron (MLP).
- Implicit Representation: A NeRF takes a 3D coordinate `(x, y, z)` and a 2D viewing direction as input, and outputs a color (RGB) and a volume density ($\sigma$). The density represents the probability of a ray terminating at that point.
- Volume Rendering: To render an image, rays are cast from the camera through each pixel. Points are sampled along each ray, and their predicted colors and densities are combined using volume rendering techniques to produce the final pixel color.
- Limitations for in-the-wild: Vanilla NeRF assumes static scenes and consistent lighting. When applied to unconstrained photo collections (in-the-wild), it struggles with appearance variations (e.g., changes in illumination, weather, camera parameters) and transient occlusions (moving objects like people or cars).
3.1.3. Semantic Segmentation
Semantic segmentation is a computer vision task that involves classifying each pixel in an image into a predefined semantic class (e.g., "sky," "building," "person," "car").
- Pseudo Masks: In this paper, pseudo masks refer to segmentation masks generated automatically by a pre-trained semantic segmentation network (like LSeg) without human annotation or fine-tuning on the specific dataset. These masks act as a "best guess" for the semantic regions in an image.
- Large Vision Models (LVMs): LVMs are powerful deep learning models, often pre-trained on vast amounts of image and text data, capable of performing various vision tasks including highly accurate semantic segmentation. LSeg [24], Grounded-SAM [45], and SEEM [76] are examples mentioned in the paper, which can perform segmentation based on text prompts (e.g., "sky," "building").
3.1.4. Appearance Embeddings
Appearance embeddings are compact vector representations (learned features) that capture the visual characteristics or "style" of an image or a specific region within an image. In in-the-wild scenarios, these embeddings are learned for each input image to account for variations in lighting, exposure, color grading, and other factors that change the scene's appearance across different photos. When injected into a rendering model (like NeRF or 3DGS), they allow the model to synthesize novel views that match the appearance of a specific input image.
3.1.5. Catastrophic Forgetting
Catastrophic forgetting (or catastrophic interference) is a phenomenon in artificial neural networks where learning new information causes the network to forget previously learned information. This is particularly relevant when training models on sequential data or when different parts of the model are responsible for distinct features. In the context of NeRFs and similar models, it can mean that a network might struggle to represent very different appearances or details if it's forced to learn them all simultaneously in a single, undifferentiated representation.
3.2. Previous Works
The paper builds upon and differentiates itself from prior work in 3D scene representation and in-the-wild reconstruction using both NeRFs and 3DGS.
3.2.1. 3D Scene Representation
- Traditional Methods: Before neural rendering, 3D scenes were typically represented using meshes [16], point clouds [40], or volumetric models [41]. These methods often struggle with photorealism and efficient rendering of complex scenes.
- Neural Radiance Fields (NeRFs) and Extensions: NeRF [34] revolutionized view synthesis by representing scenes as implicit radiance fields. Extensions focused on accelerated training [26, 35, 43, 57] (e.g., Instant-NGP [35] using multi-resolution hash encoding), faster rendering [9, 14, 44], scene editing [60, 61, 67, 70], and dynamic scenes [37, 39, 51].
- 3D Gaussian Splatting (3DGS) and Extensions: 3DGS [21] introduced an explicit representation using millions of 3D Gaussians, significantly accelerating modeling and rendering. Extensions include surface reconstruction [6, 18, 58], SLAM [31, 56, 66], AIGC [27, 28], and scene understanding [17, 42, 49, 75].
3.2.2. NeRF from Unconstrained Photo Collections
This line of work addresses the challenges of appearance variation and transient occlusions in in-the-wild datasets:
- NeRF-W [30]: The pioneering work. It introduced learnable appearance and transient embeddings for each image via Generative Latent Optimization (GLO). These embeddings allowed the NeRF to adapt its output for different lighting and occlusions. The core idea is to learn a latent code for each image that influences the NeRF's output (color and density).
- Ha-NeRF [7] and CR-NeRF [68]: Extended NeRF-W by replacing GLO with a Convolutional Neural Network (CNN)-based appearance encoder. This allowed for better generalization and style transfer. CR-NeRF also fine-tuned a pre-trained segmentation network to predict transient masks.
- K-Planes [12] and RefinedFields [20]: Adopted planar factorization in NeRF, representing the scene using explicit feature planes rather than a single large MLP. This helped mitigate catastrophic forgetting and improved training speed, though at the cost of increased storage.
3.2.3. 3DGS from Unconstrained Photo Collections
More recent work adapted 3DGS to in-the-wild scenarios:
- GS-W [72]: Proposed an adaptive sampling strategy to capture dynamic appearances from multiple feature maps.
- SWAG [11]: Predicted image-dependent opacity and appearance variations for each 3D Gaussian.
- WE-GS [59]: Introduced a plug-and-play, lightweight spatial attention module to simultaneously predict appearance embeddings and transient masks.
- WildGaussians [23]: Extracted features from images using DINO [36] (a self-supervised vision transformer) and used a trainable affine transformation for transient occluder prediction. This method explicitly attempts to model the sky by filling the periphery of the bounding box with 3D Gaussians during initialization, which the current paper criticizes for ellipsoidal noise in the sky due to a shared appearance encoder.
- Wild-GS [65]: Proposed a hierarchical appearance decomposition and explicit local appearance modeling.
- Splatfacto-W [64]: A NeRFstudio [53] implementation of 3DGS for unconstrained photo collections, essentially an engineering adaptation.
3.3. Technological Evolution
The evolution of 3D scene reconstruction has moved from explicit, geometric-based representations (meshes, point clouds) to implicit, neural network-based representations (NeRFs), and most recently to explicit, primitive-based neural representations (3DGS).
- Early 3D Reconstruction: Focused on geometric accuracy but often lacked photorealism and struggled with texture synthesis and varying lighting.
- NeRF Era: Introduced photorealistic novel view synthesis by implicitly representing scenes as radiance fields, revolutionizing the field. However, vanilla NeRF was limited to static, well-lit scenes.
- NeRF in-the-wild: Extensions like NeRF-W addressed varying appearances and transient occlusions by introducing per-image latent embeddings and transient masks. These often involved training complex neural networks from scratch.
- 3DGS Era: Accelerated both training and rendering significantly while maintaining or improving quality. It offered an explicit representation that was more amenable to editing and faster processing.
- 3DGS in-the-wild: The latest wave of research, including the current paper, adapts 3DGS to unconstrained conditions. These methods largely followed the NeRF-W paradigm by adding neural modules for appearance and transient handling.

This paper fits into the latest stage of 3DGS in-the-wild reconstruction.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Transient Mask Prediction Simplification:
  - Previous NeRF-W/CR-NeRF/WE-GS/WildGaussians: Typically learn per-image transient masks from scratch using MLPs or CNNs, or fine-tune segmentation networks. This process is time-consuming and can lead to ambiguity between radiance field changes and transient occluders.
  - This Paper: Proposes a greedy supervision strategy using pseudo masks from a pre-trained, off-the-shelf semantic segmentation network (LSeg) without fine-tuning. This is a significant simplification, eliminating a complex, slow-converging neural module and resolving ambiguity.
- Sky-aware Reconstruction:
  - Previous 3DGS-in-the-wild methods (e.g., WildGaussians): Often treat the entire scene, including sky and buildings, with a shared appearance encoder or represent the sky with generic 3D Gaussians. This can lead to ellipsoidal noise in the sky and inefficiency due to differing radiance properties.
  - This Paper: Explicitly segments the sky and buildings. It encodes their appearances separately and introduces a dedicated neural sky module to generate explicit cubemaps for the sky, rather than modeling it with Gaussians. This tailored approach exploits the distinct characteristics of sky radiance (more uniform, predictable) versus building radiance (complex, diverse).
- Mutual Distillation for Appearance Embeddings:
  - Previous Methods: Learn appearance embeddings for the whole image or specific regions, but generally do not explicitly align the latent spaces of different scene components.
  - This Paper: Introduces a mutual distillation learning strategy to align the sky and building appearance embeddings in the same latent space. This allows for cross-guidance, leveraging the simpler sky appearance to constrain and improve building appearance learning, and enabling sky inference even when the sky is not visible.
- Efficiency and Quality Focus: By simplifying the mask prediction and differentiating sky/building modeling, the paper explicitly targets faster convergence and higher reconstruction quality, addressing the convergence speed mismatch bottleneck identified in prior 3DGS-in-the-wild methods.
4. Methodology
The proposed sky-aware method aims to efficiently reconstruct 3D scenes from unconstrained photo collections using 3D Gaussian Splatting (3DGS), offering fast convergence, real-time novel view synthesis, and novel appearance synthesis.
4.1. Problem Definition and Method Overview
Given a set of posed unconstrained images $\{I_i\}$, which includes varying illumination, post-processing effects, and transient occlusions, the objective is to reconstruct the 3D scene efficiently with capabilities for real-time novel view and appearance synthesis.
The overall pipeline of the method, as illustrated in Fig. 4 (from the original paper), begins by utilizing a pre-trained semantic segmentation model to predict pseudo masks for each unconstrained image. These masks distinguish between building, sky, and transient occluder regions. Based on these masks, separate appearance embeddings are extracted for the building and sky. A neural sky module then generates explicit cubemaps from the sky appearance embeddings, while a neural in-the-wild 3D Gaussian representation handles the building by predicting residual spherical harmonic coefficients for 3D Gaussians. Finally, a mutual distillation learning strategy aligns the sky and building embeddings in the same latent space, and the entire framework is optimized end-to-end.
The following figure (Figure 4 from the original paper) illustrates the overall pipeline of the proposed method:

This image is a comparison figure showing the reconstruction results of different methods on the Brandenburg Gate, Sacre Coeur, and Trevi Fountain scenes. Each column shows the output of GT (ground truth), Ours, 3DGS, WildGaussians, GS-W, K-Planes, and CR-NeRF, with red boxes highlighting differences in reconstruction quality among the methods.
Fig. 4: An overview of our framework. Given an unconstrained image $I_i$, the pre-trained segmentation model generates pseudo masks for the building, sky, and transient occluders. With the guidance of these masks, a building encoder and a sky encoder generate building embeddings and sky embeddings, respectively. The framework supports novel appearance synthesis at the level of the unconstrained image, implicit embeddings, and the explicit cubemap.
4.2. Pseudo Mask Extraction for Unconstrained Image
Unconstrained photos often contain transient occlusions (e.g., tourists, cars) that previous methods attempted to handle by training per-image 2D mask predictors (MLPs or CNNs). This training is time-consuming and can ambiguously attribute significant appearance changes to transient occluders, slowing down 3D Gaussian optimization.
This paper proposes a simple yet efficient greedy supervision strategy that leverages Large Vision Models (LVMs) for semantic segmentation without any fine-tuning.
- Semantic Segmentation Model: The LSeg [24] model is used to generate 2D masks for the sky, building, and transient occlusions. Other segmentation networks like Grounded-SAM [45] and SEEM [76] are also considered viable alternatives.
- Sky and Building Pseudo Masks: For the sky, a text prompt $p_{sky}$ (e.g., 'sky') is given to the LVM segmentation model to obtain the 2D pseudo mask for image $I_i$. Similarly, for the building, a text prompt $p_{building}$ (e.g., 'building') is used to get the 2D pseudo mask:
  $ \widehat{M_i^s} = \mathrm{LVMSeg}(I_i, p_{sky}) $
  and
  $ \widehat{M_i^b} = \mathrm{LVMSeg}(I_i, p_{building}) $
  Where:
  - $\widehat{M_i^s}$: The 2D pseudo mask for the sky in image $I_i$.
  - $\widehat{M_i^b}$: The 2D pseudo mask for the building in image $I_i$.
  - $\mathrm{LVMSeg}$: The pre-trained Large Vision Model segmentation model (e.g., LSeg).
  - $I_i$: The $i$-th unconstrained input image.
  - $p_{sky}$: The text prompt 'sky'.
  - $p_{building}$: The text prompt 'building'.
- Transient Pseudo Mask: The remaining area in the image, after segmenting out the sky and building, is considered the pseudo transient mask:
  $ \widehat{M_i^t} = \overline{\widehat{M_i^s} \cup \widehat{M_i^b}} $
  Where:
  - $\widehat{M_i^t}$: The 2D pseudo mask for transient occluders in image $I_i$.
  - $\overline{\,\cdot\,}$: The complement of a set.
  - $\cup$: The union of sets. Any pixel not classified as sky or building is considered a transient occluder.
- Appearance Embedding Extraction: With these pseudo masks, two separate Convolutional Neural Networks (CNNs), $Enc_{\theta_1}^s$ and $Enc_{\theta_2}^b$, are used to extract sky embeddings and building embeddings from each unconstrained image $I_i$. The masks act as an attention mechanism, guiding the CNNs to focus on the relevant regions (see the sketch after this list):
  $ l_i^s = Enc_{\theta_1}^s(I_i \odot \widehat{M_i^s}) $
  and
  $ l_i^b = Enc_{\theta_2}^b(I_i \odot \widehat{M_i^b}) $
  Where:
  - $l_i^s$: The sky appearance embedding for image $I_i$.
  - $l_i^b$: The building appearance embedding for image $I_i$.
  - $Enc_{\theta_1}^s$: The CNN-based sky encoder with parameters $\theta_1$.
  - $Enc_{\theta_2}^b$: The CNN-based building encoder with parameters $\theta_2$.
  - $\odot$: The Hadamard product (element-wise multiplication). This operation effectively masks the image, allowing the encoder to process only the sky or building region.
4.3. Neural Sky
The neural sky module is designed to generate diverse and realistic skies. Instead of using 3D Gaussians for the sky, an explicit cubemap representation is learned.
- Learnable 4D Tensor for Implicit Cubemap: A learnable 4D tensor $T_{sky} \in \mathbb{R}^{6 \times C \times L \times L}$ is introduced. This tensor represents an implicit cubemap, where:
  - $6$: Corresponds to the six faces of a cubemap (front, back, left, right, top, bottom).
  - $C$: Denotes the number of feature channels per pixel on each cubemap face.
  - $L$: Represents the width and height of each cubemap face.
  This implicit cubemap effectively captures location-dependent details and helps mitigate catastrophic forgetting associated with simpler MLPs.
- Explicit Cubemap Generation: An MLP ($MLP_\gamma$) is used to generate the color for each pixel on the explicit cubemap based on the features from the implicit cubemap and the extracted sky embeddings:
  $ C_{sky}(k, u, v) = MLP_\gamma(l_i^s, T_{sky}(k, u, v)) $
  Where:
  - $C_{sky}(k, u, v)$: The color of a pixel at coordinates $(u, v)$ on the $k$-th face of the explicit sky cubemap.
  - $MLP_\gamma$: A Multi-Layer Perceptron that takes the sky embedding and the implicit feature as input.
  - $l_i^s$: The sky appearance embedding for the $i$-th image.
  - $T_{sky}(k, u, v)$: The corresponding implicit feature for the $k$-th face of the cubemap at pixel $(u, v)$.
  This process generates an explicit sky map (where 3 RGB channels are produced per cubemap pixel).
- Total Variation (TV) Loss: To smooth the features in the implicit cubemap and prevent high-frequency noise, a Total Variation (TV) loss [46] is applied:
  $ \mathcal{L}_{TV} = \sum_{k=0}^{5} \sum_{u=0}^{L-1} \sum_{v=0}^{L-1} \left\| T_{sky}(k, u+1, v+1) - T_{sky}(k, u, v) \right\|^2 $
  Where:
  - $\mathcal{L}_{TV}$: The Total Variation loss.
  - $T_{sky}(k, u, v)$: The feature at pixel $(u, v)$ on cubemap face $k$.
  This loss encourages neighboring pixels in the implicit cubemap to have similar features, leading to smoother sky generation.
- Fine Sky Encoder: A lightweight CNN ($Enc_{\theta_3}^f$) is introduced as a fine sky encoder to extract fine sky embeddings at the sky cubemap level. These embeddings are a concatenation of the features extracted from the explicit cubemap and the image-level sky embeddings:
  $ l_i^{sf} = [Enc_{\theta_3}^f(C_{sky}); \, l_i^s] $
  Where:
  - $l_i^{sf}$: The fine sky embedding for image $I_i$.
  - $Enc_{\theta_3}^f$: The fine sky encoder (CNN) with parameters $\theta_3$.
  - $C_{sky}$: The explicit sky cubemap generated by $MLP_\gamma$.
  - $[\,\cdot\,;\,\cdot\,]$: The concatenation operation.
  The fine sky embeddings ($l_i^{sf}$) are then passed into the neural in-the-wild 3D Gaussians to condition the appearance of the building, allowing for a more nuanced interaction between sky and building appearances. A module-level sketch follows this list.
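The sketch below illustrates, under assumed dimensions, how an implicit cubemap plus a small decoding MLP could be wired up. The layer widths, the sigmoid output, the initialization, and the axis-wise form of the TV term are illustrative choices rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class NeuralSky(nn.Module):
    """Sketch of the neural sky module: a learnable implicit cubemap T_sky plus a
    small MLP that decodes per-texel features and the image's sky embedding into RGB."""
    def __init__(self, channels: int = 8, size: int = 256, embed_dim: int = 48):
        super().__init__()
        # Implicit cubemap T_sky: 6 faces x C channels x L x L texels (small random init).
        self.T_sky = nn.Parameter(0.01 * torch.randn(6, channels, size, size))
        self.mlp = nn.Sequential(
            nn.Linear(channels + embed_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, l_s: torch.Tensor) -> torch.Tensor:
        """l_s: (embed_dim,) sky embedding -> explicit cubemap of shape (6, 3, L, L).
        Decodes every texel for clarity; a renderer would only query visible directions."""
        faces, C, L, _ = self.T_sky.shape
        feats = self.T_sky.permute(0, 2, 3, 1).reshape(-1, C)   # (6*L*L, C)
        cond = l_s.expand(feats.shape[0], -1)                   # broadcast the embedding
        rgb = torch.sigmoid(self.mlp(torch.cat([feats, cond], dim=-1)))
        return rgb.reshape(faces, L, L, 3).permute(0, 3, 1, 2)

    def tv_loss(self) -> torch.Tensor:
        """Axis-wise total-variation smoothness on the implicit cubemap features."""
        du = self.T_sky[..., 1:, :] - self.T_sky[..., :-1, :]
        dv = self.T_sky[..., :, 1:] - self.T_sky[..., :, :-1]
        return (du ** 2).mean() + (dv ** 2).mean()
```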
4.4. Neural in-the-wild 3D Gaussian
For building modeling, a novel explicit-implicit hybrid representation called neural in-the-wild 3D Gaussian is proposed. This representation adapts vanilla 3DGS to unconstrained images by injecting sky and building embeddings into each neural 3D Gaussian.
- Learnable Parameters: Each neural in-the-wild 3D Gaussian has the following learnable parameters, extending those of vanilla 3DGS:
  - 3D mean position ($\mu$): The center of the Gaussian.
  - Opacity ($\alpha$): Transparency/opaqueness.
  - Rotation ($r$): Orientation.
  - Scaling factor ($s$): Size.
  - Base color ($C$): Represented using Spherical Harmonic coefficients. This is the intrinsic color without appearance variation.
  - Unconstrained radiance feature ($F$): A feature vector initialized by applying Positional Encoding (PE) [34] to the 3D mean position of each 3D Gaussian. Positional Encoding helps the MLP capture high-frequency details by mapping input coordinates to a higher-dimensional space.
- Per-Image Translated Color: Given the fine sky embeddings $l_i^{sf}$ (from the neural sky module) and the building embeddings $l_i^b$ (from the building encoder), an MLP ($MLP_\omega$) learns a per-image translated color for each Gaussian (see the sketch after this list):
  $ C_i' = MLP_\omega(C, F, l_i^{sf}, l_i^b) + C $
  Where:
  - $C_i'$: The per-image translated color for a specific Gaussian in image $I_i$. This is the final color used for rendering, which varies with the appearance.
  - $MLP_\omega$: A Multi-Layer Perceptron that takes the base color ($C$), unconstrained radiance feature ($F$), fine sky embedding ($l_i^{sf}$), and building embedding ($l_i^b$) as input.
  - $C$: The base color (Spherical Harmonic coefficients) of the Gaussian.
  - $F$: The unconstrained radiance feature of the Gaussian.
  - $l_i^{sf}$: The fine sky embedding for image $I_i$.
  - $l_i^b$: The building embedding for image $I_i$.
  The $MLP_\omega$ effectively predicts a residual (or translation) to the base color $C$, allowing the Gaussian's appearance to adapt to the specific appearance embeddings of the sky and building for image $I_i$.
- Conversion to Standard 3DGS: The per-Gaussian radiance ($C_i'$) is "baked" back into the neural in-the-wild 3D Gaussians, allowing them to be seamlessly converted into the standard explicit 3DGS representation. This means that after the appearance is determined for a given image, the Gaussians can be treated as regular 3D Gaussians and fed into the vanilla 3DGS rasterization process for efficient rendering. This design allows the method to integrate into any downstream 3DGS tasks, like scene understanding or editing.
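A minimal sketch of the residual color prediction is given below; all dimensions and layer sizes are assumptions for illustration, and in the actual method the translated SH coefficients are fed into the standard 3DGS rasterizer.

```python
import torch
import torch.nn as nn

class AppearanceMLP(nn.Module):
    """Sketch of MLP_omega: predicts a residual ("translated") color for every
    Gaussian from its base SH color C, its radiance feature F, and the image's
    fine sky / building embeddings. Dimensions are illustrative, not the paper's."""
    def __init__(self, sh_dim: int = 48, feat_dim: int = 63, embed_dim: int = 48):
        super().__init__()
        in_dim = sh_dim + feat_dim + 2 * embed_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, sh_dim))

    def forward(self, C, F, l_sf, l_b):
        # C: (N, sh_dim) base SH colors, F: (N, feat_dim) per-Gaussian features,
        # l_sf / l_b: per-image embeddings broadcast to all N Gaussians.
        n = C.shape[0]
        cond = torch.cat([l_sf.expand(n, -1), l_b.expand(n, -1)], dim=-1)
        residual = self.net(torch.cat([C, F, cond], dim=-1))
        return C + residual   # C'_i = MLP_omega(C, F, l_i^sf, l_i^b) + C
```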
4.5. Optimization
The entire framework is optimized end-to-end, including the parameters of the neural in-the-wild 3D Gaussians, the neural sky module, and the building and sky encoders.
4.5.1. Handling Sky
The final pixel color is computed by combining the generated explicit cubemap of the sky and the explicit 3DGS of the building using $\alpha$-blending (a compositing sketch follows this list).

- Final Rendered Image:
  $ \tilde{I}_i^f(x, d) = \tilde{I}_i(x, d) + (1 - O(x, d)) \, C_{sky}(d) $
  and
  $ O(x, d) = \sum_{i=1}^{N} T_i \alpha_i $
  Where:
  - $\tilde{I}_i^f(x, d)$: The final rendered pixel color for view $i$ at position $x$ and direction $d$.
  - $\tilde{I}_i(x, d)$: The rendered color from the neural in-the-wild 3D Gaussians (representing the building).
  - $O(x, d)$: The integrated opacity of the building Gaussians along the ray, calculated as the sum of transmittance multiplied by opacity over the $N$ Gaussians. This represents how much the building blocks the view.
  - $C_{sky}(d)$: The color sampled from the explicit sky cubemap in direction $d$.
  This formula blends the building rendering with the sky background, so the sky is visible only where the building is not fully opaque.
- Anti-aliasing: During training, random perturbations are introduced to the ray direction within its unit pixel length to enhance anti-aliasing (reducing jagged edges).
- Sky Opacity Constraint: To prevent neural in-the-wild 3D Gaussians from appearing in the sky region (i.e., making the sky opaque with Gaussians), the integrated opacity in this region is constrained to approach zero. This is done with the following loss:
  $ \mathcal{L}_o = -O \cdot \log O - \widehat{M}_i^s \cdot \log(1 - O) $
  Where:
  - $\mathcal{L}_o$: The opacity loss for the sky region.
  - $O$: The rendered opacity map (the integrated opacity of the Gaussians).
  - $\widehat{M}_i^s$: The pseudo mask for the sky in image $I_i$.
  This loss encourages $O$ to be low in the sky region; it is a form of cross-entropy loss pushing the rendered opacity toward 0 wherever $\widehat{M}_i^s = 1$.
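The compositing step and the sky opacity constraint can be sketched as follows; the mean reduction of the loss and the tensor layouts are assumptions for illustration.

```python
import torch

def composite_with_sky(building_rgb, building_opacity, sky_rgb):
    """Alpha-blend the rasterized building with the sky cubemap background.
    building_rgb:     (H, W, 3) colors from the in-the-wild 3D Gaussians.
    building_opacity: (H, W, 1) accumulated opacity O(x, d) along each ray.
    sky_rgb:          (H, W, 3) colors sampled from the explicit cubemap C_sky(d)."""
    return building_rgb + (1.0 - building_opacity) * sky_rgb

def sky_opacity_loss(opacity, sky_mask, eps: float = 1e-6):
    """L_o = -O log O - M_sky log(1 - O): pushes the accumulated Gaussian opacity
    towards zero inside the pseudo sky mask so no Gaussians float in the sky."""
    o = opacity.clamp(eps, 1.0 - eps)
    return (-o * o.log() - sky_mask * (1.0 - o).log()).mean()
```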
4.5.2. Handling Transient Occluders
Instead of a learning-based approach, a greedy masking strategy is adopted for transient occluders.
- Greedy Masking: If a pixel's semantic segmentation result does not classify it as sky or building (i.e., it falls into the pseudo transient mask), it is considered a transient occluder.
- Loss Function: The transient occluders are masked out, and the rendering is supervised using a combination of L1 loss and Structural Similarity Index (SSIM) [62] loss (see the sketch after this list):
  $ \mathcal{L}_c = \lambda_1 \mathcal{L}_1\big((1 - \widehat{M_i^t}) \odot \tilde{I}_i^f, \ (1 - \widehat{M_i^t}) \odot I_i\big) + \lambda_2 \mathcal{L}_{SSIM}\big((1 - \widehat{M_i^t}) \odot \tilde{I}_i^f, \ (1 - \widehat{M_i^t}) \odot I_i\big) $
  Where:
  - $\mathcal{L}_c$: The rendering loss that focuses on non-transient regions.
  - $\mathcal{L}_1$: The L1 loss (mean absolute error).
  - $\mathcal{L}_{SSIM}$: The Structural Similarity Index (SSIM) loss.
  - $\tilde{I}_i^f$: The final rendered image.
  - $I_i$: The ground truth input image.
  - $\widehat{M_i^t}$: The pseudo transient mask.
  - $\odot$: The Hadamard product. The term $(1 - \widehat{M_i^t})$ creates a mask that is 0 for transient regions and 1 for non-transient regions, effectively ignoring transient occluders during loss calculation.
  - $\lambda_1, \lambda_2$: Hyperparameters balancing the L1 and SSIM components.
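A sketch of this masked supervision is shown below; `ssim_fn` is a placeholder for any differentiable SSIM implementation, and the λ weights shown are illustrative defaults, not the paper's values.

```python
import torch
import torch.nn.functional as F

def masked_photometric_loss(render, target, transient_mask, ssim_fn,
                            lam1: float = 0.8, lam2: float = 0.2):
    """L_c: L1 + SSIM supervision restricted to non-transient pixels.
    render, target: (1, 3, H, W); transient_mask: (1, 1, H, W) pseudo mask M^t.
    ssim_fn is any differentiable SSIM returning a similarity in [0, 1]."""
    keep = 1.0 - transient_mask          # 1 on static (sky/building) pixels
    l1 = F.l1_loss(keep * render, keep * target)
    l_ssim = 1.0 - ssim_fn(keep * render, keep * target)
    return lam1 * l1 + lam2 * l_ssim
```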
4.5.3. Sky and Building Encoders Mutual Distillation
To align the appearance embeddings of the sky and building within the same latent space, a mutual distillation strategy is employed. This alignment helps guide building appearance optimization and allows for estimating the sky cubemap even when the sky is not captured in an image.
- Manhattan Distance Loss: The alignment is enforced using the Manhattan distance (L1 distance) between the sky embeddings and building embeddings (a minimal sketch follows):
  $ \mathcal{L}_{md} = \mathrm{Manhat}(l_i^s, l_i^b) $
  Where:
  - $\mathcal{L}_{md}$: The mutual distillation loss.
  - $\mathrm{Manhat}(\cdot, \cdot)$: The Manhattan distance function, which calculates the sum of the absolute differences between corresponding elements of two vectors.
  This loss encourages the latent representations of sky and building appearances to be similar.
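In code, the mutual distillation term reduces to an L1 distance between the two embedding vectors; the batch-mean reduction below is an assumption.

```python
import torch

def mutual_distillation_loss(l_s: torch.Tensor, l_b: torch.Tensor) -> torch.Tensor:
    """L_md = Manhat(l_i^s, l_i^b): Manhattan (L1) distance between the sky and
    building appearance embeddings, pulling them into a shared latent space."""
    return (l_s - l_b).abs().sum(dim=-1).mean()
```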
4.5.4. Training and Rendering
- Final Loss Function: The entire framework is optimized end-to-end using the following combined loss function:
  $ \mathcal{L} = \mathcal{L}_c + \lambda_3 \mathcal{L}_o + \lambda_4 \mathcal{L}_{TV} + \lambda_5 \mathcal{L}_{md} $
  Where:
  - $\mathcal{L}$: The total loss function.
  - $\mathcal{L}_c$: The rendering loss for non-transient regions (Eq. 11).
  - $\mathcal{L}_o$: The opacity loss for the sky region (Eq. 9).
  - $\mathcal{L}_{TV}$: The Total Variation loss for the implicit sky cubemap (Eq. 6).
  - $\mathcal{L}_{md}$: The mutual distillation loss for sky and building embeddings (Eq. 12).
  - $\lambda_3, \lambda_4, \lambda_5$: Hyperparameters to balance the contributions of the respective loss functions.
-
Novel Appearance Synthesis: After training, the model supports three types of novel appearance synthesis:
- Unconstrained Image Level: Given a new unconstrained image, its sky and building embeddings are generated (as in training) to produce explicit 3DGS for buildings and an explicit cubemap for the sky, allowing novel views under that specific appearance.
- Implicit Embeddings Level: For two input images, their sky and building embeddings are linearly interpolated to create new intermediate embeddings. These intermediate embeddings are then used to synthesize novel appearances that smoothly transition between the two source appearances.
- Explicit Cubemap Level: Leveraging the fine sky encoder, an external explicit cubemap can be directly used as input. This cubemap influences the fine sky embeddings and thus the neural in-the-wild 3D Gaussians of the building, enabling explicit sky editing and affecting the building's appearance accordingly. (An embedding-interpolation sketch follows this list.)
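The implicit-embeddings level of appearance synthesis amounts to simple linear interpolation between two per-image embeddings, sketched below with hypothetical embedding names.

```python
import torch

def interpolate_appearance(emb_a: torch.Tensor, emb_b: torch.Tensor, steps: int = 5):
    """Linearly interpolate two per-image appearance embeddings (sky or building)
    to obtain intermediate appearances; each result is fed to the neural sky
    module / appearance MLP exactly like a real embedding."""
    ts = torch.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * emb_a + t * emb_b for t in ts]

# e.g. blended = interpolate_appearance(l_b_day, l_b_night, steps=7)
# l_b_day / l_b_night are hypothetical building embeddings of two input photos.
```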
5. Experimental Setup
5.1. Datasets
The proposed method was evaluated on two challenging datasets focused on outdoor scene reconstruction in the wild.
5.1.1. Photo Tourism (PT) Dataset [19]
- Source & Characteristics: This dataset consists of multiple scenes of well-known monuments (e.g., Brandenburg Gate, Sacré Coeur, Trevi Fountain). It features collections of user-uploaded images which are highly unconstrained. These images vary significantly in:
  - Time and Date: Different times of day, seasons, and weather conditions.
  - Camera Settings: Diverse camera models, exposure levels, white balance, and post-processing effects.
  - Transient Occluders: Presence of people, cars, and other temporary objects.
- Scale: Each scene contains 800 to 1500 unconstrained images.
- Purpose: It is a standard benchmark for evaluating in-the-wild 3D reconstruction methods, specifically designed to test robustness against appearance variations and occlusions.
5.1.2. NeRF-OSR Dataset [47]
- Source & Characteristics: This is an outdoor scene relighting benchmark. It includes multiple sequences (likely videos or dense image sets) captured at different times, often featuring transient occluders such as pedestrians on the street.
- Scale: Each scene contains 300 to 400 images.
- Purpose: It is used to evaluate methods that aim to reconstruct scenes under varying illumination conditions and to understand how well they can handle dynamic elements for relighting purposes.

These datasets were chosen because they represent the "wild" conditions that the paper aims to address, providing diverse challenges related to appearance, lighting, and occlusions, making them effective for validating the method's performance in real-world scenarios.
5.2. Evaluation Metrics
The paper uses standard metrics for evaluating the quality of novel view synthesis.
5.2.1. Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR is a commonly used metric to quantify the quality of reconstruction of lossy compression codecs or, in this context, the quality of a rendered image compared to its ground truth. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. A higher PSNR generally indicates higher quality.
- Mathematical Formula:
  $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) $
  Where:
  - $MAX_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
  - $\mathrm{MSE}$: The Mean Squared Error between the rendered image and the ground truth image:
  $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $
  Where:
  - $I$: The ground truth image.
  - $K$: The rendered image.
  - $M, N$: The dimensions (height and width) of the images.
  - $I(i,j), K(i,j)$: The pixel values at coordinates $(i,j)$ in the ground truth and rendered images, respectively.
- Symbol Explanation:
  - $\mathrm{PSNR}$: Peak Signal-to-Noise Ratio, measured in decibels (dB).
  - $MAX_I$: Maximum possible pixel intensity value (e.g., 255 for 8-bit images).
  - $\mathrm{MSE}$: Mean Squared Error.
  - $I(i,j)$: Pixel value of the ground truth image at row $i$, column $j$.
  - $K(i,j)$: Pixel value of the rendered image at row $i$, column $j$.
  - $M$: Number of rows (height) in the image.
  - $N$: Number of columns (width) in the image.
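For reference, PSNR is only a few lines of code; the sketch below assumes images normalized to [0, 1].

```python
import torch

def psnr(rendered: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE) for images scaled to [0, max_val]."""
    mse = torch.mean((rendered - target) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```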
5.2.2. Structural Similarity Index Measure (SSIM)
- Conceptual Definition: SSIM is a perceptual metric that quantifies the perceived structural similarity between two images. Unlike PSNR, which focuses on absolute error, SSIM considers image degradation as a perceived change in structural information, taking into account luminance, contrast, and structural components. Values range from -1 to 1, with 1 indicating perfect similarity. Higher SSIM values are better.
- Mathematical Formula:
  $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
  Where:
  - $\mu_x$: The average of image $x$.
  - $\mu_y$: The average of image $y$.
  - $\sigma_x^2$: The variance of image $x$.
  - $\sigma_y^2$: The variance of image $y$.
  - $\sigma_{xy}$: The covariance of images $x$ and $y$.
  - $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$: Two constants to avoid division by a weak denominator, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1, k_2$ are small constants (e.g., $k_1 = 0.01$, $k_2 = 0.03$).
- Symbol Explanation:
  - $\mathrm{SSIM}(x, y)$: Structural Similarity Index between image $x$ (ground truth) and image $y$ (rendered).
  - $\mu_x$: Mean pixel value of image $x$.
  - $\mu_y$: Mean pixel value of image $y$.
  - $\sigma_x^2$: Variance of pixel values in image $x$.
  - $\sigma_y^2$: Variance of pixel values in image $y$.
  - $\sigma_{xy}$: Covariance between pixel values of image $x$ and image $y$.
  - $C_1, C_2$: Small constants to stabilize the division.
  - $k_1, k_2$: Small constants (e.g., 0.01, 0.03).
  - $L$: Dynamic range of pixel values (e.g., 255).
5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)
- Conceptual Definition: LPIPS is a perceptual similarity metric that assesses the difference between two images based on the activations of a pre-trained deep convolutional neural network (often VGG or AlexNet). It is designed to correlate better with human perception of image quality than traditional metrics like PSNR or SSIM. A lower LPIPS score indicates higher perceptual similarity (better quality).
- Mathematical Formula: The LPIPS calculation involves extracting features from intermediate layers of a pre-trained network (e.g., AlexNet or VGG) for both the reference and test images, scaling these features, and then computing the L2 distance between them:
  $ \mathrm{LPIPS}(x, y) = \sum_l w_l \cdot \|\phi_l(x) - \phi_l(y)\|_2^2 $
  Where:
  - $x$: The ground truth image.
  - $y$: The rendered image.
  - $\phi_l(\cdot)$: Features extracted from layer $l$ of a pre-trained deep network (e.g., AlexNet).
  - $w_l$: A learnable scalar weight for each layer $l$.
  - $\|\cdot\|_2^2$: The squared L2 norm (Euclidean distance).
- Symbol Explanation:
  - $\mathrm{LPIPS}(x, y)$: Learned Perceptual Image Patch Similarity between image $x$ and image $y$.
  - $\phi_l(x)$: Feature stack from layer $l$ of the pre-trained network for image $x$.
  - $\phi_l(y)$: Feature stack from layer $l$ of the pre-trained network for image $y$.
  - $w_l$: Weight for the difference in features at layer $l$.

The paper also reports average training time in GPU hours and rendering times in Frames Per Second (FPS) to evaluate efficiency.
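In practice LPIPS is usually computed with the reference `lpips` package rather than re-implemented; the snippet below shows typical usage (AlexNet backbone, inputs scaled to [-1, 1]). Whether this is the exact configuration used in the paper is not stated.

```python
import torch
import lpips  # pip install lpips  (reference implementation of Zhang et al.)

# AlexNet backbone is the common choice for reporting LPIPS.
loss_fn = lpips.LPIPS(net='alex')

# Inputs are (N, 3, H, W) tensors scaled to [-1, 1].
img_pred = torch.rand(1, 3, 256, 256) * 2.0 - 1.0
img_gt = torch.rand(1, 3, 256, 256) * 2.0 - 1.0
score = loss_fn(img_pred, img_gt)   # lower = perceptually closer
```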
5.3. Baselines
The proposed method is compared against a comprehensive set of state-of-the-art methods for in-the-wild 3D scene reconstruction, including both NeRF-based and 3DGS-based approaches. These baselines are representative of the current research landscape in this field.
5.3.1. NeRF-based Methods
- NeRF-W [30]: The foundational NeRF-in-the-wild method, which introduced appearance and transient embeddings.
- HA-NeRF [7]: An extension of NeRF-W using a CNN-based appearance encoder.
- CR-NeRF [68]: Further improved NeRF-W with a CNN-based encoder and fine-tuned transient masks.
- K-Planes [12]: A NeRF variant employing planar factorization for efficiency and quality.
- RefinedFields [20]: Another NeRF method focusing on refinement for unconstrained scenes.
5.3.2. 3DGS-based Methods
- SWAG [11]: One of the early 3DGS adaptations for in-the-wild scenes, predicting image-dependent opacity and appearance variations.
- Splatfacto-W [64]: A NeRFstudio implementation of 3DGS for unconstrained collections.
- WE-GS [59]: Introduces a lightweight spatial attention module for appearance embeddings and transient masks.
- GS-W [72]: Employs an adaptive sampling strategy for dynamic appearances.
- WildGaussians [23]: Uses DINO features and a trainable affine transformation for transient occluders, and attempts to model the sky with 3D Gaussians.
- 3DGS [21]: The vanilla 3D Gaussian Splatting method, included to show the performance degradation without in-the-wild adaptations.
6. Results & Analysis
6.1. Core Results Analysis
The experiments demonstrate that the proposed sky-aware framework consistently outperforms existing methods in novel view synthesis and novel appearance synthesis, while also offering superior efficiency in terms of convergence and rendering speed.
6.1.1. Performance on Photo Tourism (PT) Dataset
The following are the results from Table 1 of the original paper:
The metric columns are grouped per scene, in the order Brandenburg Gate, Sacre Coeur, Trevi Fountain.

| Method | GPU hrs./FPS | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| NeRF-W [30] | -/<1 | 24.17 | 0.890 | 0.167 | 19.20 | 0.807 | 0.191 | 18.97 | 0.698 | 0.265 |
| HA-NeRF [7] | 35.1/<1 | 24.04 | 0.877 | 0.139 | 20.02 | 0.801 | 0.171 | 20.18 | 0.690 | 0.222 |
| CR-NeRF [68] | 31.0/<1 | 26.53 | 0.900 | 0.106 | 22.07 | 0.823 | 0.152 | 21.48 | 0.711 | 0.206 |
| K-Planes [12] | 0.3/<1 | 25.49 | 0.924 | - | 20.61 | 0.852 | - | 22.67 | 0.856 | - |
| RefinedFields [20] | 11.8/<1 | 26.64 | 0.886 | - | 22.26 | 0.817 | - | 23.42 | 0.737 | - |
| SWAG [11]* | 0.8/15 | 26.33 | 0.929 | 0.139 | 21.16 | 0.860 | 0.185 | 23.10 | 0.815 | 0.208 |
| Splatfacto-W [64]† | 1.1/40 | 26.87 | 0.932 | 0.124 | 22.66 | 0.769 | 0.224 | 22.53 | 0.876 | 0.158 |
| WE-GS [59] | 1.8/181 | 27.74 | 0.933 | 0.128 | 23.62 | 0.890 | 0.148 | 23.63 | 0.823 | 0.153 |
| GS-W [72] | 2.0/50 | 28.48 | 0.929 | 0.086 | 23.15 | 0.859 | 0.130 | 22.97 | 0.773 | 0.167 |
| WildGaussians [23] | 5.0/29 | 28.26 | 0.935 | 0.083 | 23.51 | 0.875 | 0.124 | 24.37 | 0.780 | 0.137 |
| 3DGS [21] | 0.4/181 | 20.72 | 0.889 | 0.152 | 17.57 | 0.839 | 0.190 | 17.04 | 0.690 | 0.265 |
| Ours | 1.6/217 | 27.79 | 0.936 | 0.081 | 23.51 | 0.892 | 0.111 | 24.61 | 0.824 | 0.130 |
As shown in Table 1, the proposed method (Ours) achieves competitive or superior performance across the PT dataset scenes, especially for Trevi Fountain and Sacre Coeur.

- PSNR: Our method achieves the highest PSNR for Trevi Fountain (24.61) and matches WildGaussians on Sacre Coeur (23.51). For Brandenburg Gate it reaches 27.79, slightly lower than GS-W (28.48) and WildGaussians (28.26), but still significantly better than most other methods.
- SSIM: Our method consistently achieves high SSIM values, indicating excellent structural similarity. For Brandenburg Gate (0.936) and Sacre Coeur (0.892), it is among the top performers.
- LPIPS: Our method demonstrates the lowest (best) LPIPS scores across all three scenes (Brandenburg Gate 0.081, Sacre Coeur 0.111, Trevi Fountain 0.130), indicating superior perceptual quality.
- Efficiency (FPS): With 217 FPS, Ours achieves the highest rendering speed among all methods, including vanilla 3DGS (181 FPS) and WE-GS (181 FPS). This confirms the efficiency gains from avoiding complex sky modeling with Gaussians.
- Training Time (GPU hrs): At 1.6 GPU hours, Ours is efficient, comparable to WE-GS (1.8), GS-W (2.0), and SWAG (0.8), and faster than WildGaussians (5.0) and most NeRF-based methods (e.g., CR-NeRF 31.0, HA-NeRF 35.1).

The visual comparisons in Fig. 5 (from the original paper) further support these quantitative results, highlighting that Ours produces more accurate and visually pleasing reconstructions, especially in sky regions, compared to baselines like WildGaussians, which often show ellipsoidal noise in the sky.
The following figure (Figure 5 from the original paper) shows a visual comparison of reconstruction quality on the PT dataset:

This image compares the performance of different methods on scene reconstruction, including GT, Ours, 3DGS, WildGaussians, and GS-W under different conditions. Red boxes mark the key regions used for comparison.
Fig. 5: Visual comparison of reconstruction quality on the PT dataset. Non-obvious differences in quality are highlighted by insets.
6.1.2. Performance on NeRF-OSR Dataset
The following are the results from Table 2 of the original paper:
The metric columns are grouped per scene, in the order europa, lwp, st, stjohann.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeRF [34] | 17.49 | 0.551 | 0.503 | 11.51 | 0.468 | 0.574 | 17.20 | 0.514 | 0.502 | 14.89 | 0.432 | 0.639 |
| NeRF-W [30] | 20.00 | 0.699 | 0.340 | 19.61 | 0.616 | 0.445 | 20.31 | 0.607 | 0.438 | 21.23 | 0.670 | 0.426 |
| Ha-NeRF [7] | 17.79 | 0.632 | 0.421 | 20.03 | 0.685 | 0.365 | 17.30 | 0.538 | 0.483 | 17.19 | 0.686 | 0.331 |
| CR-NeRF [68] | 21.03 | 0.721 | 0.294 | 21.90 | 0.719 | 0.336 | 20.68 | 0.630 | 0.402 | 22.84 | 0.793 | 0.235 |
| SWAG [11] | 23.91 | 0.864 | 0.172 | 22.07 | 0.783 | 0.303 | 22.29 | 0.713 | 0.364 | 23.74 | 0.845 | 0.242 |
| WE-GS [59]* | 24.74 | 0.873 | 0.157 | 24.33 | 0.821 | 0.197 | 22.45 | 0.720 | 0.341 | 24.12 | 0.858 | 0.202 |
| GS-W [72] | 24.70 | 0.879 | 0.144 | 24.50 | 0.817 | 0.201 | 23.32 | 0.740 | 0.321 | 24.20 | 0.849 | 0.221 |
| WildGaussians [23] | 23.97 | 0.861 | 0.174 | 22.12 | 0.791 | 0.310 | 27.16 | 0.709 | 0.366 | 22.84 | 0.827 | 0.274 |
| 3DGS [21] | 20.18 | 0.782 | 0.252 | 11.76 | 0.609 | 0.414 | 18.10 | 0.629 | 0.406 | 18.57 | 0.741 | 0.268 |
| Ours | 24.71 | 0.879 | 0.141 | 24.57 | 0.826 | 0.189 | 22.65 | 0.742 | 0.320 | 24.61 | 0.867 | 0.193 |
On the NeRF-OSR dataset, our method also achieves state-of-the-art performance across several metrics, as shown in Table 2.
- Ours achieves the highest PSNR for lwp (24.57) and stjohann (24.61), and is competitive with WE-GS and GS-W on europa (24.71). Notably, it improves the average PSNR by 7.7 dB over vanilla 3DGS on this dataset, indicating its strong adaptation to complex outdoor conditions.
- SSIM scores are consistently high (e.g., europa 0.879, lwp 0.826, stjohann 0.867), demonstrating excellent structural preservation.
- LPIPS scores are among the lowest, confirming superior perceptual quality across scenes.
- The results show that Ours not only reconstructs the sky with greater accuracy but also captures building details more precisely.

The visual comparisons in Fig. 6 (from the original paper) further illustrate the quality improvements, especially in detailed building structures and clear sky rendering.
This image is a chart showing the influence of the neural sky module and total variation (TV) loss on 3D scene reconstruction. The left side shows rendered results and predicted cubemaps for different settings: (a) without the neural sky module, the cubemap stays the same across scenes; (b) without the TV loss, the cubemap contains overly high-frequency content, degrading the rendered sky; (c) the rendering of the complete model.
Non-obvious differences in quality are highlighted by insets.
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted on three scenes from the PT dataset to validate the design choices.
The following are the results from Table 3 of the original paper:
| Variant | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| (1) w/o pseudo labels | 24.91 | 0.867 | 0.114 |
| (2) w/o sky encoder | 24.39 | 0.856 | 0.117 |
| (3) Neural sky size 16 × 256 × 256 | 25.21 | 0.880 | 0.112 |
| (4) Neural sky size 8 × 512 × 512 | 25.31 | 0.884 | 0.106 |
| (5) w/o sky embeddings | 23.81 | 0.849 | 0.123 |
| (6) w/o implicit cubemap | 23.76 | 0.843 | 0.123 |
| (7) w/o LTV | 25.09 | 0.880 | 0.113 |
| (8) w/o fine sky encoder | 25.29 | 0.885 | 0.109 |
| (9) w/o mutual distillation | 25.17 | 0.881 | 0.110 |
| (10) w/o neural feature F | 24.96 | 0.870 | 0.112 |
| (11) w/o PE. F init. | 25.24 | 0.882 | 0.109 |
| (12) Complete model | 25.30 | 0.884 | 0.107 |
6.2.1. Contribution to Rendering Quality
- (1) w/o pseudo labels (using a fine-tuned transient mask predictor): PSNR drops from 25.30 to 24.91. This indicates that the proposed greedy supervision strategy with pseudo masks is more efficient and provides better (or at least comparable) results than training a per-image transient mask predictor. The authors conclude that the per-image transient mask predictor is unnecessary.
- (2) w/o sky encoder (sky and building reconstructed by neural in-the-wild 3D Gaussians): PSNR drops significantly to 24.39. This highlights the importance of separately encoding sky and building appearances, and the benefit of the dedicated neural sky module.
- (10) w/o neural feature F (using the position directly as input to $MLP_\omega$): PSNR drops to 24.96. The unconstrained radiance feature $F$ (initialized with Positional Encoding) plays a role in enhancing the translated radiance.
- (11) w/o PE. F init. (initializing $F$ without Positional Encoding): PSNR is 25.24, slightly lower than the complete model (25.30). This suggests that Positional Encoding for $F$ is beneficial for capturing high-frequency details.
- (12) Complete model: Achieves the best overall performance (PSNR 25.30, SSIM 0.884, LPIPS 0.107), confirming that all proposed modules contribute to the final quality.
6.2.2. Influence of Neural Sky Module
- (5) w/o sky embeddings (neural sky replaced by a learnable explicit cubemap without image-specific sky embeddings): PSNR drops to 23.81. Without sky embeddings, different unconstrained images would render the same sky, demonstrating the importance of sky embeddings for diverse novel appearance synthesis. As shown in Fig. 7 (a), the predicted cubemap remains static across different appearances.
- (6) w/o implicit cubemap (an 8-layer coordinate-based MLP with only sky embeddings as input): PSNR drops to 23.76, the lowest score. This variant struggles to produce the desired results, with generated skies tending to exhibit uniform color. This emphasizes that the implicit cubemap within the neural sky module is crucial for mitigating catastrophic forgetting and capturing complex, location-dependent sky details.
- (7) w/o LTV (without Total Variation loss): PSNR drops to 25.09. Removing the Total Variation loss leads to higher-frequency noise in the predicted cubemap, which adversely affects the sky rendering quality. Fig. 7 (b) visually confirms this.
- (3) & (4) Neural sky size (16 × 256 × 256 vs. 8 × 512 × 512): Increasing the feature dimension (16 × 256 × 256, PSNR 25.21) does not significantly improve rendering quality compared to increasing resolution (8 × 512 × 512, PSNR 25.31). While higher resolution slightly improves performance, it also increases storage cost.
- (8) w/o fine sky encoder: PSNR is 25.29, very close to the complete model's 25.30. While it does not significantly improve the metrics, its purpose is to extract features from the cubemap to condition the neural in-the-wild 3D Gaussians, allowing for more flexible applications in novel appearance and view synthesis, as further illustrated in Fig. 10.

The following figure (Figure 7 from the original paper) shows ablation studies on the neural sky module and TV loss:
This image is an illustration showing multiple novel views generated from a single unconstrained image, covering different buildings and sky effects. Each novel view shows a different appearance variation, with changes to the Brandenburg Gate, a church facade, and the Trevi Fountain.
Fig. 7: Ablation studies on the neural sky module and TV loss. (a) Without the neural sky module, the predicted cubemap is learned across scenes and remains the same regardless of different appearances. (b) Without TV loss, the frequency of the predicted cubemap is too high, which adversely affects the rendering quality of the sky.
6.2.3. Mutual Distillation
- (9) w/o mutual distillation: PSNR drops to 25.17. The mutual distillation learning strategy that aligns sky and building appearance embeddings contributes to enhancing overall rendering quality and efficiency.
6.3. Ablation Study on Pseudo Semantic Mask Quality
To evaluate the robustness of the greedy supervision strategy to pseudo semantic mask inaccuracies, noise was intentionally introduced into the masks.
The following are the results from Table 4 of the original paper:
| Variant | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Add 5% noise to transient masks | 27.77 | 0.934 | 0.082 |
| Add 10% noise to transient masks | 27.78 | 0.933 | 0.084 |
| Add 5% noise to sky and building masks | 27.65 | 0.931 | 0.089 |
| Add 10% noise to sky and building masks | 26.97 | 0.922 | 0.101 |
| w/ pseudo masks | 27.79 | 0.936 | 0.081 |
- Noise in Transient Masks: Adding 5% or 10% noise to the transient masks (so that some non-transient pixels are mistakenly labeled transient) has minimal impact on reconstruction accuracy (PSNR 27.77-27.78 vs. 27.79 for the baseline). This is attributed to the greedy supervision strategy: misidentifying building pixels as transient mainly reduces the number of samples used for supervision, but enough correct samples remain for robust reconstruction.
- Noise in Sky and Building Masks: Introducing 5% or 10% noise into the sky and building masks (so that some transient pixels are mistakenly labeled as sky or building, or vice versa) leads to a more significant drop in accuracy (PSNR 27.65 for 5% noise, 26.97 for 10% noise). This is because misclassifying transient pixels as building or sky forces the model to reconstruct transient objects as part of the static scene, which severely impacts accuracy. It highlights how important it is for the semantic segmentation to accurately delineate the static scene components; a small sketch of the masked supervision and noise-injection protocol follows this list.
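To make the ablation protocol concrete, the sketch below illustrates (i) a photometric loss that simply excludes pixels the pseudo mask flags as transient, in the spirit of the greedy supervision described above, and (ii) random label flipping of the kind that could implement the 5%/10% mask noise. Both functions are illustrative assumptions, not the authors' code.

```python
import torch

def masked_l1_loss(render: torch.Tensor,
                   gt: torch.Tensor,
                   transient_mask: torch.Tensor) -> torch.Tensor:
    """L1 photometric loss on non-transient pixels only.

    render, gt: (H, W, 3) images; transient_mask: (H, W) bool tensor,
    True where the pseudo mask marks a transient object. Transient
    pixels are simply excluded from supervision.
    """
    static = ~transient_mask
    return (render[static] - gt[static]).abs().mean()

def flip_mask_labels(mask: torch.Tensor, ratio: float) -> torch.Tensor:
    """Randomly flip a fraction of binary mask pixels (e.g., 0.05 or 0.10)."""
    flip = torch.rand(mask.shape, device=mask.device) < ratio
    return mask ^ flip
```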
6.4. Applications
The proposed framework offers flexible novel appearance synthesis capabilities.
6.4.1. Novel Appearance Synthesis from an Unconstrained Image
Given a single unconstrained image, the method can infer its building and sky embeddings, which are then used to generate a novel explicit standard 3DGS representation for buildings and an explicit cubemap for the sky. This allows for rendering novel views under the appearance of the input image. Fig. 8 (from the original paper) showcases this capability, demonstrating successful inference of the novel appearance of an entire building even when partially visible in the input, and plausible sky cubemap inference even when the sky is absent from the input image (due to the mutual distillation strategy).
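The inference path can be summarized in a few lines of pseudo-Python; every name below (building_encoder, sky_encoder, bake_gaussians, bake_cubemap, render) is a hypothetical placeholder for a component described in the paper, not the authors' actual API.

```python
# Hypothetical inference sketch for novel appearance synthesis from a
# single unconstrained image; all module names are placeholders.
def synthesize_from_image(image, camera_poses,
                          building_encoder, sky_encoder,
                          bake_gaussians, bake_cubemap, render):
    z_building = building_encoder(image)    # building appearance embedding
    z_sky = sky_encoder(image)              # sky embedding (still inferable if
                                            # the sky is absent, thanks to the
                                            # mutual distillation strategy)
    gaussians = bake_gaussians(z_building)  # explicit standard 3DGS scene
    cubemap = bake_cubemap(z_sky)           # explicit sky cubemap
    return [render(gaussians, cubemap, pose) for pose in camera_poses]
```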
The following figure (Figure 8 from the original paper) shows appearance modeling from an unconstrained image:

The image is a schematic showing the target results from different viewpoints. It shows the transfer from a source view to the target appearance, demonstrating how the building appears under different environmental conditions and reflecting the effectiveness and feasibility of the proposed framework for 3D scene reconstruction.
Fig. 8: Appearance modeling from an unconstrained image, showcasing multiple novel views under the novel appearance.
6.4.2. Novel Appearance Synthesis from Appearance Embeddings
The method allows for linear interpolation of implicit appearance embeddings between any two unconstrained images. This creates intermediate embeddings that, when injected into the neural in-the-wild 3DGS and neural sky, result in novel appearances with smooth transitions. Fig. 9 (from the original paper) illustrates this for variations in weather, time, and camera parameters (e.g., exposure changes), demonstrating a rich and smooth range of appearance transformations.
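The interpolation itself is simple linear blending in the embedding space; a minimal sketch (parameter names are illustrative) is shown below.

```python
import torch

def interpolate_appearance(z_src: torch.Tensor,
                           z_tgt: torch.Tensor,
                           num_steps: int = 8):
    """Linearly interpolate two per-image appearance embeddings.

    Each intermediate embedding can be injected into the neural
    in-the-wild 3DGS and neural sky to render a smoothly varying
    novel appearance between the two input images.
    """
    ts = torch.linspace(0.0, 1.0, num_steps)
    return [(1.0 - t) * z_src + t * z_tgt for t in ts]
```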
The following figure (Figure 9 from the original paper) illustrates appearance synthesis from interpolated embeddings:

The image is a schematic showing different scene reconstruction effects between the source and the target. It is divided into three parts, (a), (b), and (c), showing different visual content and transition processes, emphasizing efficient 3D scene reconstruction and accurate appearance transitions in rendering.
Fig. 9: Appearance synthesis from embeddings interpolated between the source and target views, showing smooth appearance transitions.
6.4.3. Novel Appearance Synthesis from Explicit Cubemap
By leveraging the fine sky encoder, the approach enables novel appearance synthesis through direct editing or interpolation of the explicit cubemap. This means an external sky cubemap can be directly fed into the system. As seen in Fig. 10 (from the original paper), interpolating an explicit sky cubemap produces natural and realistic results that are distinct from interpolating implicit embeddings. Furthermore, the sky can be explicitly edited (e.g., replacing it with a fantastical sky) to create various virtual scenes, offering enhanced control over scene editing.
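Because the cubemap is an explicit texture, interpolation and editing reduce to per-texel operations whose result can then be re-encoded; the sketch below assumes simple linear blending and a hypothetical fine_sky_encoder call.

```python
import torch

def blend_cubemaps(cubemap_a: torch.Tensor,
                   cubemap_b: torch.Tensor,
                   t: float) -> torch.Tensor:
    """Per-texel linear blend of two explicit sky cubemaps of shape (6, H, W, 3)."""
    return (1.0 - t) * cubemap_a + t * cubemap_b

# An edited or externally supplied cubemap (e.g., a fantastical sky) could
# then be re-encoded into conditioning features, assuming a fine_sky_encoder
# module as described in the paper:
# sky_features = fine_sky_encoder(blend_cubemaps(cubemap_a, cubemap_b, 0.5))
```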
The following figure (Figure 10 from the original paper) illustrates novel appearance synthesis from explicit cubemaps:

Fig. 10: Novel appearance synthesis from explicit cubemaps, showing the transition in visual content between the source and target appearances.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces an efficient, sky-aware framework for reconstructing 3D scenes from unconstrained photo collections using 3D Gaussian Splatting (3DGS). The framework's key contributions include:
- Eliminating Per-Image Transient Mask Predictors: By utilizing pseudo masks from a pre-trained semantic segmentation network with a greedy supervision strategy, the method streamlines the process, removing the need for slow-converging per-image transient mask predictors and enhancing efficiency.
- Separate Sky and Building Appearance Learning: The framework recognizes the distinct properties of sky and building radiance, enabling more efficient and accurate reconstruction by encoding their appearances separately.
- Neural Sky Module for Explicit Cubemap Generation: A novel neural sky module generates diverse explicit cubemaps for the sky from latent sky embeddings, which helps capture location-dependent details and mitigates catastrophic forgetting.
- Mutual Distillation for Latent Space Alignment: A mutual distillation strategy aligns the sky and building appearance embeddings in the same latent space, fostering inter-component guidance and enabling plausible sky inference even when sky data is missing.

Extensive experiments on the Photo Tourism and NeRF-OSR datasets demonstrate that the proposed method achieves state-of-the-art performance in novel view and appearance synthesis, delivering superior rendering quality with faster convergence and rendering speed compared to existing in-the-wild 3DGS and NeRF methods.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their proposed method:
- Complex Lighting Environments: The method relies on a neural-appearance representation without strict physical constraints and struggles to accurately reconstruct highly complex lighting environments. This is partly because it cannot segment the shadows of transient objects, which are crucial for consistent lighting.
- Lack of Physical Constraints for Scene Editing: While the neural sky module allows for controllable scene editing (e.g., explicit cubemap interpolation), both the implicit and explicit cubemap interpolation approaches lack physical constraints related to changes in sunlight position or weather conditions. This means achieving precise outdoor scene relighting remains an open challenge.
- Indoor Scenes: The current method focuses specifically on outdoor scene reconstruction in the wild. Its applicability and performance for in-the-wild reconstruction of indoor scenes have not been addressed and are left as future work.
7.3. Personal Insights & Critique
The paper presents a very intuitive and practical approach to improve 3DGS in-the-wild performance. The core idea of differentiating sky and building components is a clever way to leverage their distinct characteristics; the sky is generally simpler and more uniform than complex building structures with diverse materials. This sky-aware design, combined with the simplification of transient mask prediction, directly addresses the efficiency and quality bottlenecks prevalent in previous 3DGS-in-the-wild methods.
Inspirations and Transferability:
- Semantic Priors for Efficiency: The success of using pseudo masks from LVMs without fine-tuning is a significant takeaway. It demonstrates that off-the-shelf, powerful semantic priors can be integrated with minimal overhead to solve complex problems in other domains. This could be applied to other reconstruction tasks where specific object categories or background elements behave predictably. For example, segmenting out water bodies, foliage, or road surfaces and treating them with specialized, more efficient models could improve performance in large-scale urban scene reconstruction.
- Component-wise Modeling: The idea of breaking down a complex scene into components (sky, building) and modeling them with tailored representations (explicit cubemap vs. neural Gaussians) is powerful. This divide-and-conquer strategy could be extended to other scene components or even to different levels of detail within a component. For instance, fine-grained segmentation of different building materials (glass, concrete, metal) could allow for more physically accurate or specialized appearance modeling for each, enhancing realism.
- Mutual Distillation for Latent Alignment: The mutual distillation strategy is elegant. The concept of aligning the latent spaces of related but distinct components could be beneficial in other multi-modal or multi-component learning scenarios, allowing knowledge transfer and robust inference even when one component's data is sparse or missing.
Potential Issues and Areas for Improvement:
- Reliance on LVM Accuracy: While the greedy supervision strategy is robust to some noise in the transient masks, its performance is more sensitive to noise in the sky and building masks (as shown in the ablation studies). The quality of the pseudo masks from the LVM is a hard dependency: if the LVM misclassifies critical parts of a building as sky (or vice versa), or fails entirely on novel scene types, the overall reconstruction quality could suffer. Further robustness might involve uncertainty quantification from the LVM or a semi-supervised refinement step.
- Limited Physical Realism: The acknowledged limitation regarding complex lighting environments and relighting is a significant one for in-the-wild applications. The neural-appearance representation is powerful for reproducing observed appearances but less so for physically based simulation. Integrating physically based rendering (PBR) principles or explicit light source modeling (e.g., sun position, sky radiance models) could address this, moving beyond purely data-driven appearance synthesis.
- Generalization to Diverse Skies: While the neural sky module generates diverse skies, its ability to generalize to truly novel or extreme sky conditions (e.g., highly unusual cloud formations, aurora borealis) not well represented in the training data may be limited by the implicit cubemap and sky embeddings.
- Computational Overhead of LVMs: Although the LVM is pre-trained and not fine-tuned during reconstruction, running a large semantic segmentation model on potentially thousands of images during preprocessing can still be computationally intensive for very large datasets, even if it is a one-time cost. This should be considered for truly large-scale applications.

Overall, this paper makes a valuable contribution by pragmatically addressing the efficiency and quality trade-offs in in-the-wild 3DGS reconstruction. Its focus on intelligent scene decomposition and leveraging existing powerful tools (LVMs) offers a practical path forward for more robust and performant neural rendering.