3D Gaussian Splatting for Real-Time Radiance Field Rendering
TL;DR Summary
The paper presents a 3D Gaussian splatting method for real-time radiance field rendering, introducing three key components: a Gaussian scene representation, anisotropic covariance and density optimization, and a fast visibility-aware rendering algorithm, achieving ≥ 30 fps at 1080p resolution.
Abstract
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
3D Gaussian Splatting for Real-Time Radiance Field Rendering
1.2. Authors
BERNHARD KERBL*, Inria, Université Côte d'Azur, France; GEORGIOS KOPANAS*, Inria, Université Côte d'Azur, France; THOMAS LEIMKÜHLER, Max-Planck-Institut für Informatik, Germany; GEORGE DRETTAKIS, Inria, Université Côte d'Azur, France. (*Both authors contributed equally to the paper.)
1.3. Journal/Conference
ACM Transactions on Graphics (TOG) 42, 4, Article 1 (August 2023). This is a highly prestigious journal in the field of computer graphics, known for publishing groundbreaking research. Publication in ACM TOG signifies significant impact and rigorous peer review within the computer graphics community.
1.4. Publication Year
2023
1.5. Abstract
Radiance Field methods, particularly Neural Radiance Fields (NeRFs), have significantly advanced novel-view synthesis from captured scenes. However, achieving high visual quality typically involves costly neural networks for training and rendering, while faster alternatives compromise quality. Existing methods struggle to provide real-time (≥ 30 fps) 1080p rendering for complete, unbounded scenes. This paper introduces a novel approach incorporating three key elements to achieve state-of-the-art visual quality with competitive training times and, critically, real-time novel-view synthesis at 1080p resolution. First, the scene is represented using 3D Gaussians, initialized from sparse Structure-from-Motion (SfM) points. This representation maintains the desirable properties of continuous volumetric radiance fields for optimization while efficiently avoiding computation in empty space. Second, an interleaved optimization and adaptive density control strategy is employed for the 3D Gaussians, notably optimizing anisotropic covariance to accurately represent scene geometry. Third, a fast, visibility-aware rendering algorithm is developed, supporting anisotropic splatting, which both accelerates training and enables real-time rendering. The method demonstrates state-of-the-art visual quality and real-time rendering performance across several established datasets.
1.6. Original Source Link
https://arxiv.org/abs/2308.04079v1 The paper was initially published as a preprint on arXiv and later published in ACM Transactions on Graphics.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the trade-off between visual quality and rendering speed in novel-view synthesis using radiance fields. Recent advances, particularly with Neural Radiance Fields (NeRFs), have revolutionized the quality of synthesizing new views from a set of captured images. However, these methods, especially those achieving the highest visual fidelity like Mip-NeRF360, are computationally expensive, requiring extensive training times (tens of hours) and rendering times that are far from real-time (seconds per frame). While faster NeRF variants like InstantNGP and Plenoxels have emerged, they often sacrifice visual quality or still fall short of true real-time display rates (≥ 30 fps) for high-resolution (1080p) and complex, unbounded scenes. The existing challenge is to achieve state-of-the-art (SOTA) visual quality, competitive training times, and real-time rendering simultaneously, especially for diverse scene types.
The paper's entry point or innovative idea is to combine the advantages of explicit, unstructured representations (like point clouds) with the differentiable properties of volumetric radiance fields. Instead of continuous neural representations or structured grids, they propose using 3D Gaussians as the fundamental scene primitive. This allows for explicit control over scene elements, efficient rendering via splatting, and an optimization process that benefits from the volumetric nature for accurate scene reconstruction.
2.2. Main Contributions / Findings
The paper introduces three primary contributions:
- Anisotropic 3D Gaussians as a High-Quality, Unstructured Representation: The authors propose using 3D Gaussians with anisotropic covariance as a flexible and expressive scene representation. These Gaussians are initialized from sparse Structure-from-Motion (SfM) point clouds, preserving the advantages of differentiable volumetric representations for optimization while avoiding the computational overhead of empty space. This explicit, unstructured nature allows for efficient rasterization.
- Interleaved Optimization with Adaptive Density Control: A novel optimization method is presented that adjusts the properties of the 3D Gaussians (position, opacity, anisotropic covariance, and Spherical Harmonic (SH) coefficients). This optimization is interleaved with adaptive density control steps, which involve cloning Gaussians in under-reconstructed areas and splitting large Gaussians in high-variance regions. This dynamic process allows the model to accurately represent complex scenes with a compact set of Gaussians.
- Fast, Differentiable, Visibility-Aware Rendering: The paper develops a tile-based rasterizer that supports anisotropic splatting and respects visibility order through efficient GPU sorting. This rasterizer is fully differentiable, enabling fast backpropagation for training without imposing arbitrary limits on the number of Gaussians contributing gradients. This design is crucial for both accelerating training and achieving real-time rendering.

The key findings are that this method achieves:
- State-of-the-art visual quality that is competitive with or surpasses the best implicit radiance field approaches (e.g., Mip-NeRF360).
- Training speeds that are competitive with the fastest methods (e.g., InstantNGP, Plenoxels), reducing training time from tens of hours to minutes.
- Crucially, the first real-time (≥ 30 fps) high-quality novel-view synthesis at 1080p resolution across a wide variety of complex, unbounded scenes, a feat previously unachieved by any other method.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Radiance Fields
A radiance field is a function that maps any 3D point (x, y, z) and any 2D viewing direction (θ, φ) to an emitted color (R, G, B) and a volume density σ. This concept is central to novel-view synthesis because it allows for synthesizing images of a 3D scene from arbitrary viewpoints. The volume density σ represents the probability of a ray terminating at a given point, allowing for soft volumetric effects like smoke or translucent objects.
Novel-View Synthesis
Novel-view synthesis is the computer graphics problem of generating new images of a 3D scene from previously unobserved camera viewpoints, using a set of existing images of that scene. The goal is to create photorealistic and geometrically consistent images.
Neural Radiance Fields (NeRF)
Neural Radiance Fields (NeRF) are a specific type of radiance field that uses a Multi-Layer Perceptron (MLP) neural network to represent the scene. The MLP is trained to output color and volume density for any 3D coordinate and viewing direction. Rendering an image with NeRF involves ray-marching through the scene, sampling points along each ray, querying the MLP at these points, and then accumulating the sampled colors and densities using a volumetric rendering equation.
Structure-from-Motion (SfM)
Structure-from-Motion (SfM) is a photogrammetric range imaging technique for estimating the 3D structure of a scene from a set of 2D images. It also simultaneously recovers the camera parameters (pose, intrinsics) for each image. SfM outputs a sparse point cloud, which consists of 3D points corresponding to distinctive features detected across multiple images. These points are typically used as an initial geometric estimate of the scene.
Multi-view Stereo (MVS)
Multi-view Stereo (MVS) is a technique that takes the camera poses and intrinsic parameters (often derived from SfM) and a set of images to produce a dense 3D reconstruction of the scene. Unlike SfM's sparse point clouds, MVS aims to generate a comprehensive representation, often in the form of dense point clouds, meshes, or depth maps. While MVS provides more detailed geometry, it can struggle with featureless or shiny surfaces and may produce over-reconstruction (artifacts) or under-reconstruction (holes).
Spherical Harmonics (SH)
Spherical Harmonics (SH) are a set of basis functions defined on the surface of a sphere, analogous to Fourier series on a circle. In computer graphics, they are commonly used to represent spatially varying directional information, such as lighting or view-dependent appearance. For radiance fields, SH coefficients can encode how the color of a point changes when viewed from different directions, allowing for more realistic rendering of diffuse and mildly specular surfaces without using complex Bidirectional Reflectance Distribution Functions (BRDFs). The order of SH determines the complexity of the directional representation (e.g., zero-order for diffuse, higher orders for more complex effects).
Alpha Blending
Alpha blending is a technique used in computer graphics to combine an image foreground with a background image. It uses an alpha channel (opacity value) to determine how transparent or opaque each pixel is. In volumetric rendering or point-based rendering, alpha blending is crucial for accumulating color and opacity along a ray or through overlapping primitives. Objects closer to the camera obscure those further away based on their opacity. The standard volumetric rendering equation (derived from radiative transfer theory) can be re-written as an alpha blending sequence.
Splatting
Splatting is a point-based rendering technique where each 3D point (or primitive) is projected onto the image plane and "splatted" or distributed across multiple pixels, typically using a 2D kernel (e.g., a Gaussian distribution) to determine its influence and smooth the appearance. This helps to fill in gaps between sparsely projected points and reduce aliasing artifacts, creating a continuous-looking surface from discrete elements. Anisotropic splatting allows the shape of this 2D kernel to vary, adapting to the underlying geometry and projection distortions.
Gaussians
In a general sense, a Gaussian function (or normal distribution) is a bell-shaped curve often used to model probability distributions or as a smoothing kernel.
- 1D Gaussian: $ G(x) = e^{-\frac{(x - \mu)^2}{2\sigma^2}} $, a bell-shaped curve centered at the mean $\mu$ with spread $\sigma$.
- 3D Gaussian: In 3D, a Gaussian describes an ellipsoid centered at a mean $\mu$, with its shape and orientation determined by a covariance matrix $\Sigma$. The function indicates the "intensity" or "density" at any point $x$ relative to the mean $\mu$: $ G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)} $ where $\mu$ is the mean (3D position) and $\Sigma$ is the covariance matrix.
- Anisotropic Covariance: A covariance matrix that is not isotropic (i.e., not a multiple of the identity matrix) allows the Gaussian to have different spreads along different axes, resulting in an ellipsoidal shape rather than a perfect sphere. This anisotropy is crucial for representing fine structures or surfaces that are not aligned with coordinate axes, making the representation more compact and accurate.
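To make the anisotropic covariance concrete, here is a minimal NumPy sketch (not from the paper's code) that evaluates the unnormalized 3D Gaussian above at a few points; the mean and covariance values are arbitrary assumptions chosen for illustration.

```python
# A minimal sketch: evaluating an anisotropic 3D Gaussian at a point.
import numpy as np

mu = np.array([0.0, 0.0, 0.0])      # Gaussian center (mean), assumed
# Anisotropic covariance: stretched along x, thin along z (an ellipsoid).
Sigma = np.diag([1.0, 0.25, 0.01])

def gaussian_3d(x, mu, Sigma):
    """Unnormalized G(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

print(gaussian_3d(np.array([1.0, 0.0, 0.0]), mu, Sigma))  # ~0.607 (1 std along x)
print(gaussian_3d(np.array([0.0, 0.0, 0.1]), mu, Sigma))  # ~0.607 (1 std along z)
```

Because the covariance is anisotropic, a displacement of 1.0 along x and a displacement of only 0.1 along z both correspond to one standard deviation, which is exactly how a single Gaussian can model a thin, stretched surface element.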
3.2. Previous Works
Traditional Scene Reconstruction and Rendering
- Light Fields (Gortler et al. 1996; Levoy and Hanrahan 1996): Early approaches to novel-view synthesis that captured a dense grid of images from a scene. Rendering involves interpolating between these densely sampled views. While high quality, they required immense data capture and were mostly limited to static, bounded scenes.
- SfM (Snavely et al. 2006) and MVS (Goesele et al. 2007): These methods enabled reconstruction from unstructured photo collections. SfM provides sparse 3D points and camera poses, while MVS builds dense geometry (meshes, depth maps). Subsequent view synthesis methods (e.g., Chaurasia et al. 2013; Hedman et al. 2018) reprojected and blended input images based on this geometry. While effective, they suffered from over-reconstruction or under-reconstruction artifacts in challenging areas.
Neural Rendering and Radiance Fields
- Early Deep Learning for Novel-View Synthesis (Flynn et al. 2016; Hedman et al. 2018; Riegler and Koltun 2020): Applied Convolutional Neural Networks (CNNs) for tasks like estimating blending weights or texture-space solutions. These often still relied on MVS geometry, inheriting its limitations, and CNN-based rendering could lead to temporal flickering.
- Volumetric Representations (Henzler et al. 2019; Sitzmann et al. 2019): Introduced differentiable density fields, leveraging volumetric ray-marching. These were precursors to NeRFs.
- Neural Radiance Fields (NeRF) (Mildenhall et al. 2020): A breakthrough method that uses an MLP to implicitly represent a scene's color and density. It introduced positional encoding and importance sampling for high-quality results.
  - Volumetric Rendering Equation (from Section 2.3 of the paper): The color $C$ along a ray is given by:
    $
    C = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) \mathbf{c}_i \quad \mathrm{with} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j \right)
    $
    where:
    - $N$: The number of samples along the ray.
    - $T_i$: The transmittance (or accumulated transparency) from the ray origin to sample $i$. It represents the probability that the ray reaches sample $i$ without being obstructed.
    - $\sigma_i$: The volume density at sample $i$, indicating the probability of a ray terminating at this point.
    - $\delta_i$: The distance between sample $i$ and sample $i-1$ along the ray.
    - $\mathbf{c}_i$: The color at sample $i$.
    This can be rewritten using $\alpha_i$ (the opacity of sample $i$):
    $
    C = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i \quad \mathrm{with} \quad \alpha_i = \big(1 - \exp(-\sigma_i \delta_i)\big) \mathrm{~and~} T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)
    $
    - $\alpha_i$: The opacity of the $i$-th sample.
    - The product $\prod_{j=1}^{i-1} (1 - \alpha_j)$ is an alternative way to calculate transmittance, showing the accumulated transparency up to sample $i-1$.
- Mip-NeRF360 (Barron et al. 2022): The current state-of-the-art in image quality for novel-view synthesis of unbounded scenes, building upon NeRF by addressing anti-aliasing and scene bounds. Known for outstanding quality but extremely high training (up to 48 hours) and rendering times (seconds per frame).
- Faster NeRFs:
  - InstantNGP (Müller et al. 2022): Uses a multi-resolution hash grid and occupancy grid to accelerate computation, together with a smaller MLP. Significantly faster training (minutes) and interactive rendering (10-15 fps), but with a trade-off in quality compared to Mip-NeRF360.
  - Plenoxels (Fridovich-Keil and Yu et al. 2022): Represents radiance fields with a sparse voxel grid that interpolates a continuous density field, forgoing neural networks entirely. Achieves fast training and interactive rendering, but also with quality limitations compared to SOTA. Both InstantNGP and Plenoxels rely on Spherical Harmonics for directional appearance and are hindered by ray-marching for rendering.
Point-Based Rendering and Radiance Fields
- Traditional Point-Based Rendering (Pfister et al. 2000; Zwicker et al. 2001b): Renders unstructured sets of points by splatting them as larger primitives (discs, ellipsoids, surfels) to avoid holes and reduce aliasing.
- Differentiable Point-Based Rendering (Wiles et al. 2020; Yifan et al. 2019): Enabled end-to-end training of point cloud representations.
- Neural Point-Based Graphics (Aliev et al. 2020; Kopanas et al. 2021; Rückert et al. 2022 - ADOP): Augmented points with neural features and rendered using CNNs for fast view synthesis. These often depended on MVS for initial geometry, inheriting its artifacts. Pulsar (Lassner and Zollhofer 2021) achieved fast sphere rasterization, inspiring tile-based rendering, but often used order-independent transparency.
  - Point-Based $\alpha$-Blending (from Section 2.3 of the paper): A typical neural point-based approach computes the color $C$ of a pixel by blending ordered points overlapping the pixel:
    $
    C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)
    $
    where:
    - $c_i$: The color of each point (or splat).
    - $\alpha_i$: The opacity of each point, often derived from evaluating a 2D Gaussian and multiplying by a learned per-point opacity.
    This formula is essentially identical to the re-written volumetric rendering equation, highlighting the shared image formation model.
- Point-NeRF (Xu et al. 2022): Uses points to represent a radiance field with radial basis functions and employs pruning/densification. However, it still uses volumetric ray-marching and cannot achieve real-time rates.
- 3D Gaussians in specific contexts (Rhodin et al. 2015; Stoll et al. 2011; Wang et al. 2023; Lombardi et al. 2021): 3D Gaussians (or similar primitives) have been used for specialized tasks like human performance capture or isolated object reconstruction, where depth complexity is often limited.
3.3. Technological Evolution
The evolution of novel-view synthesis has progressed from explicit geometry-based methods (e.g., MVS with image blending) to implicit continuous representations (NeRFs), and then to methods focusing on improving the speed of NeRFs (e.g., InstantNGP, Plenoxels) often by introducing structured grids. Simultaneously, point-based rendering has evolved from basic splatting to differentiable versions, eventually incorporating neural features.
This paper's work (3D Gaussian Splatting) fits into the timeline by bridging the gap between explicit, GPU-friendly representations and the differentiable, high-quality optimization capabilities of volumetric radiance fields. It leverages the expressiveness of Gaussians and couples it with a highly optimized, visibility-aware rasterization pipeline to overcome the speed limitations of traditional ray-marching and the quality limitations of previous fast methods, pushing the field towards practical real-time high-fidelity rendering.
3.4. Differentiation Analysis
Compared to the main methods in related work, 3D Gaussian Splatting (3DGS) introduces several core differences and innovations:
- Vs. Mip-NeRF360 (SOTA Quality, Slow): Mip-NeRF360 achieves the highest quality, but at the cost of extremely long training times (48 hours) and very slow rendering (seconds per frame) due to its implicit continuous representation and ray-marching. 3DGS matches or surpasses this quality with training times reduced to minutes (35-45 minutes) and achieves real-time rendering (≥ 30 fps). This is a monumental leap in efficiency without sacrificing quality. The explicit nature of 3D Gaussians and the efficient rasterizer are key here.
- Vs. InstantNGP/Plenoxels (Fast Training/Interactive Rendering, Lower Quality): While InstantNGP and Plenoxels significantly reduce training time and offer interactive rendering speeds (10-15 fps), they often struggle to achieve the same peak visual quality as Mip-NeRF360, particularly in complex scenes, due to limitations of their structured grid representations and the inherent cost of ray-marching. 3DGS achieves superior visual quality while maintaining competitive training times and true real-time rendering, outperforming them in the quality-speed balance. The anisotropic Gaussians allow for more precise representation of fine details than fixed-resolution grids.
- Vs. Other Point-Based Methods (e.g., ADOP, Neural Point-Based Graphics): Many prior point-based methods either relied on MVS geometry (introducing artifacts) or used CNNs for rendering (leading to temporal instability). 3DGS avoids MVS by initializing from sparse SfM points and optimizing the Gaussians directly. Its visibility-aware rasterizer (which sorts primitives) combined with Spherical Harmonics avoids the temporal instability issues of CNN-based rendering and provides higher visual fidelity than simple point sprites. Crucially, 3DGS does not require MVS and scales to complex, unbounded scenes, unlike some object-specific point-based approaches.
- Novelty of Representation: The choice of anisotropic 3D Gaussians as a differentiable primitive is a core innovation. It offers the flexibility of an unstructured explicit representation (allowing dynamic creation/deletion/movement), the differentiability of a volumetric representation (crucial for optimization), and direct projectability to 2D for efficient splatting. This combines the "best of both worlds" from continuous NeRFs and explicit point clouds.
- Efficient Rendering Pipeline: The tile-based rasterizer with fast GPU sorting and visibility-aware alpha blending for anisotropic splats is a significant architectural innovation, enabling real-time performance. Its ability to backpropagate gradients over an arbitrary number of blended Gaussians is also critical for high-quality optimization, addressing limitations of prior fast rendering methods.
4. Methodology
4.1. Principles
The core idea behind 3D Gaussian Splatting (3DGS) is to represent a 3D scene not as an implicit neural function or a structured grid, but as a collection of discrete, explicit 3D Gaussians. Each Gaussian is a volumetric primitive defined by its 3D position (mean), its 3D anisotropic covariance (which determines its shape and orientation), its opacity (alpha), and Spherical Harmonic (SH) coefficients for its color and view-dependent appearance.
This choice of 3D Gaussians is principled for several reasons:
- Differentiability: Gaussians are inherently differentiable, allowing for direct optimization of their parameters using gradient descent, similar to NeRFs.
- Volumetric Properties: They behave like continuous volumetric representations during optimization, capable of modeling density and transparency.
- Explicit and Unstructured: Unlike implicit NeRFs that require costly ray-marching and sampling, Gaussians are explicit entities. This allows for direct manipulation (creation, deletion, movement) and avoids computation in empty space. Being unstructured, they can adapt to arbitrary scene geometry without grid artifacts.
- Efficient Projection and Rendering: 3D Gaussians can be analytically projected onto the 2D image plane, resulting in 2D elliptical splats. These splats can then be efficiently rasterized and blended using standard alpha blending techniques, leveraging highly optimized GPU pipelines.
- Anisotropy for Compactness: The anisotropic covariance allows Gaussians to stretch and orient themselves to accurately represent surfaces, thin structures, or large homogeneous regions with fewer primitives, leading to a more compact representation.

The theoretical basis combines concepts from volumetric rendering, point-based rendering, and differentiable optimization. The volumetric rendering equation dictates how colors and opacities are accumulated along rays. Point-based rendering principles guide the efficient projection and splatting of Gaussians. Differentiable optimization allows the system to learn the optimal parameters for each Gaussian from multiple input views.
4.2. Core Methodology In-depth (Layer by Layer)
The overall process can be broken down into initialization, representation, optimization with adaptive density control, and fast differentiable rendering. The overview of the method is illustrated in Fig. 2.
The image is a diagram showing the pipeline from SfM points to image generation, including initialization, 3D Gaussian representation, projection, and adaptive density control steps. Arrows indicate the flow of operations and gradients.
The Figure 2 from the original paper provides a high-level overview of the 3D Gaussian Splatting pipeline. It shows the input SfM points being used to initialize 3D Gaussians. These Gaussians then undergo optimization and adaptive density control to accurately represent the scene. Finally, a fast differentiable rasterizer projects and blends these Gaussians to render novel views, with gradients flowing back to update Gaussian parameters.
4.2.1. Initializing 3D Gaussians
The process begins with a set of input images of a static scene and their corresponding camera parameters, obtained through Structure-from-Motion (SfM) [Schönberger and Frahm 2016]. A key byproduct of SfM is a sparse point cloud.
- The 3D Gaussians are initialized directly from these SfM points. Each SfM point becomes the mean ($\mu$) of a new 3D Gaussian.
- The initial covariance matrix ($\Sigma$) for each Gaussian is set to be isotropic. This means it is spherical, with its radius determined by the mean distance to the three closest SfM points.
- Initial opacity ($\alpha$) values are typically low, and Spherical Harmonic (SH) coefficients are initialized to represent the color of the corresponding SfM point.

For specific cases like the synthetic NeRF-synthetic dataset, the method can even achieve high quality with random initialization (e.g., 100K uniformly random Gaussians within the scene bounds), which are then automatically pruned and refined.
4.2.2. Representing the Scene with 3D Gaussians
Each 3D Gaussian is characterized by several properties that are optimized during training:
- 3D Position (Mean $\mu$): A 3D vector (x, y, z) representing the center of the Gaussian.
- Opacity ($\alpha$): A scalar value in [0, 1) indicating the Gaussian's transparency.
- Anisotropic Covariance ($\Sigma$): A symmetric positive semi-definite $3 \times 3$ matrix that defines the shape, size, and orientation of the Gaussian ellipsoid in 3D space.
- Spherical Harmonic (SH) Coefficients: A set of coefficients that encode the view-dependent color ($\mathbf{c}$) of the Gaussian. The paper uses 4 bands of SH.

The 3D Gaussian function itself is defined as:
$
G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)}
$
where:
- $x$: A 3D point in space.
- $\mu$: The 3D mean (position) of the Gaussian.
- $\Sigma$: The covariance matrix.

For rendering, these 3D Gaussians need to be projected to 2D. The projection of a 3D Gaussian to image space results in a 2D Gaussian. This is achieved by computing a 2D covariance matrix ($\Sigma'$) from the 3D covariance matrix ($\Sigma$). Given a viewing transformation $W$ (which includes rotation and translation from world to camera coordinates), the covariance matrix in camera coordinates is obtained:
$
\Sigma_c = W \Sigma W^T
$
Then, this covariance matrix in camera coordinates is projected to screen space. According to Zwicker et al. [2001a], if we let $J$ be the Jacobian of the affine approximation of the projective transformation, the 2D covariance matrix in image space is:
$
\Sigma' = J W \Sigma W^T J^T
$
The paper simplifies this by stating: "if we skip the third row and column of $\Sigma'$, we obtain a $2 \times 2$ variance matrix with the same structure and properties as if we would start from planar points with normals". This effectively means they extract the upper-left $2 \times 2$ submatrix from the full projected covariance in camera/image space.
To optimize the covariance matrix $\Sigma$ using gradient descent, directly operating on $\Sigma$ is problematic because it must remain positive semi-definite. To overcome this, the authors decompose $\Sigma$ into a scaling matrix $S$ and a rotation matrix $R$:
$
\Sigma = R S S^T R^T
$
where:
- $S$: A diagonal matrix defined by a 3D vector $s$ for scaling along its principal axes. The diagonal elements of $S$ are $s_x, s_y, s_z$.
- $R$: A rotation matrix derived from a quaternion $q$.

The parameters actually stored and optimized are the 3D vector $s$ (for scaling) and the quaternion $q$ (for rotation). These are then converted to $S$ and $R$ to reconstruct $\Sigma$. This parameterization ensures that $\Sigma$ remains valid (positive semi-definite) during optimization. The quaternion $q$ is normalized to ensure it is a unit quaternion.
4.2.3. Differentiable Gradient Computation
To avoid significant overhead from automatic differentiation for all parameters, the gradients for all parameters are derived explicitly. This is crucial for the efficiency of the optimization process. The details of these derivative computations are provided in Appendix A.
In Appendix A, the gradients of the 2D covariance matrix $\Sigma'$ with respect to the scaling vector $s$ and the quaternion $q$ are derived using the chain rule.
Recall that $\Sigma$ is the world-space covariance matrix and $\Sigma'$ is the view-space (projected 2D) covariance matrix; $q$ is the quaternion for rotation and $s$ is the 3D vector for scaling. $W$ is the viewing transformation and $J$ is the Jacobian of the affine approximation of the projective transformation.
The chain rule is applied:
$
\frac{d\Sigma'}{ds} = \frac{d\Sigma'}{d\Sigma} \frac{d\Sigma}{ds}
$
and
$
\frac{d\Sigma'}{dq} = \frac{d\Sigma'}{d\Sigma} \frac{d\Sigma}{dq}
$
To simplify notation, let $U = JW$. Then $\Sigma'$ is the (symmetric) upper-left $2 \times 2$ matrix of $U \Sigma U^T$. The partial derivatives of $\Sigma'$ with respect to the elements of $\Sigma$ are:
$
\frac{\partial \Sigma'}{\partial \Sigma_{ij}} = \left( \begin{array}{ll} U_{1,i}U_{1,j} & U_{1,i}U_{2,j} \\ U_{1,j}U_{2,i} & U_{2,i}U_{2,j} \end{array} \right)
$
Next, the derivatives $\frac{d\Sigma}{ds}$ and $\frac{d\Sigma}{dq}$ are needed.
Since $\Sigma = R S S^T R^T$, let $M = RS$. Then $\Sigma = M M^T$.
The partial derivative of $\Sigma$ with respect to $M$ is:
$
\frac{d\Sigma}{dM} = 2M^T
$
For scaling, the partial derivative of $M$ with respect to the components of $s$ ($s_x, s_y, s_z$) is:
$
\frac{\partial M_{i,j}}{\partial s_k} = \begin{cases} R_{i,k} & \mathrm{if~} j=k \\ 0 & \mathrm{otherwise} \end{cases}
$
To derive gradients for rotation, the conversion from a unit quaternion $q = (q_r, q_i, q_j, q_k)$ to a rotation matrix R(q) is recalled:
$
R(q) = 2 \left( \begin{array}{ccc} \frac{1}{2} - (q_j^2 + q_k^2) & (q_i q_j - q_r q_k) & (q_i q_k + q_r q_j) \\ (q_i q_j + q_r q_k) & \frac{1}{2} - (q_i^2 + q_k^2) & (q_j q_k - q_r q_i) \\ (q_i q_k - q_r q_j) & (q_j q_k + q_r q_i) & \frac{1}{2} - (q_i^2 + q_j^2) \end{array} \right)
$
From this, the gradients of $M$ with respect to the components of $q$ are derived (noting $s_x, s_y, s_z$ are the diagonal elements of $S$):
$
\frac{\partial M}{\partial q_r} = 2 \left( \begin{array}{ccc} 0 & -s_y q_k & s_z q_j \\ s_x q_k & 0 & -s_z q_i \\ -s_x q_j & s_y q_i & 0 \end{array} \right) , \quad
\frac{\partial M}{\partial q_i} = 2 \left( \begin{array}{ccc} 0 & s_y q_j & s_z q_k \\ s_x q_j & -2 s_y q_i & -s_z q_r \\ s_x q_k & s_y q_r & -2 s_z q_i \end{array} \right)
$
$
\frac{\partial M}{\partial q_j} = 2 \left( \begin{array}{ccc} -2 s_x q_j & s_y q_i & s_z q_r \\ s_x q_i & 0 & s_z q_k \\ -s_x q_r & s_y q_k & -2 s_z q_j \end{array} \right) , \quad
\frac{\partial M}{\partial q_k} = 2 \left( \begin{array}{ccc} -2 s_x q_k & -s_y q_r & s_z q_i \\ s_x q_r & -2 s_y q_k & s_z q_j \\ s_x q_i & s_y q_j & 0 \end{array} \right)
$
Gradients for quaternion normalization are also straightforward.
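These derivations can be sanity-checked numerically. The sketch below (an illustration, not the paper's code) verifies the scale gradient of $\Sigma = RSS^TR^T$ against central finite differences, using an arbitrary fixed rotation about the z-axis.

```python
# A minimal gradient check: with M = R S, the Appendix A formula gives
# dM/ds_k with R[:, k] in column k and zeros elsewhere, so
# dSigma/ds_k = (dM/ds_k) M^T + M (dM/ds_k)^T.
import numpy as np

theta = 0.7                                   # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
s = np.array([2.0, 0.5, 0.1])                 # arbitrary scale vector

def sigma_of(s_vec):
    """Sigma = R S S^T R^T with S = diag(s_vec)."""
    M = R @ np.diag(s_vec)
    return M @ M.T

k, eps = 0, 1e-6
# Analytic gradient via the chain rule above.
dM = np.zeros((3, 3)); dM[:, k] = R[:, k]
M = R @ np.diag(s)
analytic = dM @ M.T + M @ dM.T
# Central finite difference for comparison.
sp, sm = s.copy(), s.copy(); sp[k] += eps; sm[k] -= eps
numeric = (sigma_of(sp) - sigma_of(sm)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```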
4.2.4. Optimization with Adaptive Density Control
The core of the method's learning process involves optimizing the 3D Gaussian parameters interleaved with steps to manage the density and number of Gaussians.
- Optimization Loop: The optimization uses Stochastic Gradient Descent (SGD) techniques.
  - Activation Functions: A sigmoid function is used for opacity ($\alpha$) to constrain it to the range [0, 1), providing smooth gradients. An exponential activation function is used for the scale components ($s$) to ensure they remain positive.
  - Initial Covariance: Initially, the covariance matrix is isotropic, based on the distance to the closest three SfM points.
  - Learning Rate Schedule: An exponential decay scheduling technique is used for position learning rates, similar to Plenoxels.
  - Loss Function: The training objective minimizes a combination of an $\mathcal{L}_1$ loss and a D-SSIM (Differentiable Structural Similarity Index Measure) term:
    $
    \mathcal{L} = (1 - \lambda) \mathcal{L}_1 + \lambda \mathcal{L}_{\mathrm{D-SSIM}}
    $
    where:
    - $\mathcal{L}_1$: The Mean Absolute Error (MAE) between the rendered image and the ground truth image, which encourages pixel-wise accuracy.
    - $\mathcal{L}_{\mathrm{D-SSIM}}$: A differentiable version of the SSIM metric, which measures perceptual similarity based on luminance, contrast, and structure, making the results visually more pleasing.
    - $\lambda$: A weighting factor, set to 0.2 in all experiments, balancing the contribution of $\mathcal{L}_1$ and D-SSIM.
  - Spherical Harmonics (SH) Optimization: To address the sensitivity of SH coefficients to missing angular information (e.g., in corner cases of a capture), the optimization of SH is phased. Initially, only the zero-order component (base/diffuse color) is optimized. After every 1000 iterations, an additional SH band is introduced until all 4 bands are represented.
  - Warm-up: For stability, optimization starts at a lower image resolution (4 times smaller) and is upsampled twice after 250 and 500 iterations.
- Adaptive Density Control: This crucial mechanism dynamically adjusts the number and density of 3D Gaussians to accurately represent the scene. It helps to populate empty areas (under-reconstruction) and refine regions with too few large Gaussians (over-reconstruction).
  - Trigger for Densification: Densification occurs every 100 iterations after an initial optimization warm-up.
  - Criteria for Densification: Gaussians with an average magnitude of view-space position gradients above a threshold ($\tau_{\mathrm{pos}}$, set to 0.0002 in the paper) are targeted for densification. High gradients indicate regions that are not yet well reconstructed, prompting the optimization to move Gaussians.
  - Cloning: For small Gaussians in under-reconstructed regions, a copy of the Gaussian is created (cloned) with the same size and moved slightly in the direction of its positional gradient. This helps to cover new geometry.
  - Splitting: For large Gaussians in regions with high variance (often over-reconstruction), the Gaussian is replaced by two new, smaller Gaussians. Their scales are divided by a factor of $\phi = 1.6$, and their positions are initialized by sampling from the original Gaussian's probability density function (PDF).
  - Pruning (Removal): Gaussians with opacity ($\alpha$) less than a threshold ($\epsilon_\alpha$) are removed. This cleans up transparent or unnecessary Gaussians.
  - Periodic Opacity Reset: Every 3000 iterations, the $\alpha$ values of all Gaussians are set close to zero. This forces the optimization to re-learn opacities, allowing the system to shed floaters (incorrectly placed Gaussians) and remove Gaussians that are no longer necessary.

The adaptive Gaussian densification scheme is visually summarized in Figure 4 from the paper, and a small code sketch of the clone/split rules follows the figure description below.
The image is a diagram showing the two stages of the adaptive Gaussian densification scheme. The top part shows cloning being used when small-scale geometry is insufficiently covered; the bottom part shows how a large splat in an over-reconstructed region is split to improve precision.
Figure 4 illustrates how adaptive Gaussian densification works. The top row shows cloning: when small-scale geometry (black outline) is inadequately covered, the relevant Gaussian is cloned to expand coverage. The bottom row demonstrates splitting: if a large area of small-scale geometry is represented by a single large splat, it is split into two smaller Gaussians for more detailed representation.
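As a rough illustration of the clone and split rules (the data layout and the cloning step size are assumptions; the split factor φ = 1.6 follows the paper):

```python
# A minimal sketch of the densification operations described above.
import numpy as np

PHI = 1.6  # scale divisor for splitting (paper value)

def clone_gaussian(mu, scale, grad, step=0.01):
    """Under-reconstruction: duplicate a small Gaussian and nudge the copy
    along its positional gradient direction (step size is assumed)."""
    direction = grad / (np.linalg.norm(grad) + 1e-12)
    return mu + step * direction, scale.copy()

def split_gaussian(mu, Sigma, scale, rng):
    """Over-reconstruction: replace one large Gaussian by two smaller ones,
    sampling the new means from the original Gaussian's PDF."""
    new_mus = rng.multivariate_normal(mu, Sigma, size=2)
    return new_mus, scale / PHI

rng = np.random.default_rng(0)
mus, sc = split_gaussian(np.zeros(3), np.diag([1.0, 0.2, 0.05]),
                         np.array([1.0, 0.45, 0.22]), rng)
print(mus.shape, sc)   # (2, 3) and the scales reduced by 1.6
```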
The optimization and densification algorithms are summarized in Algorithm 1 from Appendix B.
Algorithm 1 Optimization and Densification

    w, h: width and height of the training images
    M ← SfM Points                    ▷ Positions
    S, C, A ← InitAttributes()        ▷ Covariances, Colors, Opacities
    i ← 0                             ▷ Iteration Count
    while not converged do
        V, Î ← SampleTrainingView()   ▷ Camera V and Image Î
        ...
Algorithm 1: Optimization and Densification
- Input: w (width) and h (height) of the training images.
- Initialization:
  - M: Initialized from SfM points (Gaussian positions).
  - S, C, A: Initialized attributes (covariances, colors, opacities) for the Gaussians.
  - i: Iteration counter, initialized to 0.
- Main Loop: Continues while not converged.
  - V, Î: Sample a training view (camera V and image Î).
  - ... (The pseudocode is cut off here in the provided text, but it would typically involve rendering a view, calculating the loss, backpropagating gradients to update Gaussian parameters, and then applying densification/pruning steps at specified intervals; see the sketch below.)
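A minimal PyTorch-flavored sketch of the interleaved loop that Algorithm 1 summarizes; `render`, `d_ssim`, `densify_and_prune`, and `sample_view` are hypothetical stand-ins, not the paper's actual API.

```python
# A minimal sketch of one training step, assuming hypothetical helpers:
# sample_view() -> (camera, ground-truth image), render() is the
# differentiable rasterizer, d_ssim() the D-SSIM loss, and
# densify_and_prune() the adaptive density control step.
import torch

LAMBDA = 0.2          # loss weight from the paper
DENSIFY_EVERY = 100   # densification interval (paper value)

def train_step(params, optimizer, i):
    cam, gt_image = sample_view()                 # random training view
    rendered = render(cam, params)                # differentiable rasterizer
    l1 = (rendered - gt_image).abs().mean()       # L1 term
    loss = (1 - LAMBDA) * l1 + LAMBDA * d_ssim(rendered, gt_image)
    loss.backward()                               # gradients to all Gaussians
    optimizer.step()
    optimizer.zero_grad()
    if i % DENSIFY_EVERY == 0:                    # interleaved density control
        with torch.no_grad():
            densify_and_prune(params)             # clone / split / remove
```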
4.2.5. Fast Differentiable Rasterizer for Gaussians
The rasterizer is designed for speed and differentiability, allowing for real-time rendering and efficient gradient computation during training.
- Tile-Based Architecture:
  - The screen is divided into 16×16 pixel tiles.
  - Frustum Culling: 3D Gaussians are culled (removed) if their 99% confidence interval does not intersect the view frustum. A guard band is used to reject Gaussians near the near plane but far outside the view frustum, to avoid numerical instability during 2D covariance projection.
  - Gaussian Instantiation and Key Assignment: Each Gaussian that overlaps multiple tiles is instantiated (duplicated) for each tile it covers. Each instance is assigned a key that combines its view-space depth and the tile ID it overlaps.
  - Fast GPU Sorting: All Gaussian instances are sorted globally based on these keys using a single GPU Radix sort. This ensures approximate depth ordering for all splats across the entire image.
  - Per-Tile Lists: After sorting, lists of Gaussians for each tile are generated by identifying the start and end indices in the sorted array for each tile ID.
- Rasterization Pass (Forward Pass):
  - One thread block is launched for each tile.
  - Threads within a block collaboratively load packets of Gaussians into shared memory (fast on-chip memory).
  - For each pixel within the tile, color and alpha values are accumulated by traversing the per-tile Gaussian list from front-to-back, performing alpha blending.
  - Early Termination: Processing for a pixel stops when its alpha value reaches a target saturation (close to 1, specifically 0.9999). The processing of an entire tile terminates when all its pixels have saturated. This maximizes parallelism and avoids unnecessary computation for occluded regions.
- Differentiable Backward Pass:
  - Crucially, the rasterizer supports backpropagation without limiting the number of blended primitives that receive gradient updates. This is vital for learning complex scenes with varying depth complexity.
  - Reconstructing Intermediate Opacities: Instead of storing long lists of blended points per pixel (which is memory-intensive), the backward pass re-traverses the per-tile lists of Gaussians (which were already sorted in the forward pass).
  - Back-to-Front Traversal: The traversal for gradients happens back-to-front.
  - Gradient Computation: The accumulated opacity from the forward pass at each step is needed for gradient computation. This is recovered by storing only the final accumulated opacity for each point and then repeatedly dividing it by each point's alpha during the back-to-front traversal.
  - Optimized Overlap Testing: During the backward pass, a pixel only performs expensive overlap testing and processing of points if their depth is less than or equal to the depth of the last point that contributed to its color in the forward pass.

The rasterization approach is described at a high level in
Algorithm 2 from Appendix C:

Algorithm 2 GPU software rasterization of 3D Gaussians

    w, h: width and height of the image to rasterize
    M, S: Gaussian means and covariances in world space
    C, A: Gaussian colors and opacities
    V: view configuration of current camera
    function RASTERIZE(w, h, M, S, C, A, V)
        CullGaussian(p, V)                        ▷ Frustum Culling
        M′, S′ ← ScreenspaceGaussians(M, S, V)    ▷ Transform
        T ← CreateTiles(w, h)
        L, K ← DuplicateWithKeys(M′, T)           ▷ Indices and Keys
        SortByKeys(K, L)                          ▷ Globally Sort
        R ← IdentifyTileRanges(T, K)
        I ← 0                                     ▷ Init Canvas
        for all Tiles t in I do
            for all Pixels i in t do
                r ← GetTileRange(R, t)
                I[i] ← BlendInOrder(i, L, r, K, M′, S′, C, A)
Algorithm 2: GPU Software Rasterization of 3D Gaussians
- Input: w (width) and h (height) of the image; M, S (Gaussian means and covariances in world space); C, A (Gaussian colors and opacities); V (view configuration of the current camera).
- Function RASTERIZE:
  - CullGaussian(p, V): Performs frustum culling to remove Gaussians outside the view.
  - M', S' ← ScreenspaceGaussians(M, S, V): Transforms Gaussian means M and covariances S from world space to screen space based on the view configuration V, producing M' and S'.
  - T ← CreateTiles(w, h): Divides the screen into tiles (16×16 pixel blocks).
  - L, K ← DuplicateWithKeys(M', T): Duplicates Gaussians that overlap multiple tiles and assigns a key to each instance, combining its tile ID and projected depth. L refers to the list of these duplicated Gaussian instances.
  - SortByKeys(K, L): Globally sorts the list of Gaussian instances based on their keys K using a Radix sort.
  - R ← IdentifyTileRanges(T, K): Identifies the start and end indices in the globally sorted list for each tile, creating per-tile ranges R.
  - I ← 0: Initializes the canvas (output image) to black.
  - Loop over Tiles: for all Tiles t in I do: Iterates through each tile.
    - Loop over Pixels within Tile: for all Pixels i in t do: Iterates through each pixel within the current tile.
      - r ← GetTileRange(R, t): Retrieves the range of Gaussians relevant to the current tile t.
      - I[i] ← BlendInOrder(i, L, r, K, M', S', C, A): Blends the Gaussians in depth-sorted order for pixel i using their screen-space parameters (M', S'), colors (C), and opacities (A). This function accumulates color and opacity until saturation.
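The key construction and per-tile range identification can be sketched as follows (the exact key layout is an assumption: tile ID in the high bits, quantized view-space depth in the low bits; `np.argsort` stands in for the GPU Radix sort):

```python
# A minimal sketch of DuplicateWithKeys / SortByKeys / IdentifyTileRanges.
import numpy as np

def make_keys(tile_ids, depths, depth_bits=32):
    """One 64-bit key per (Gaussian instance, tile) pair: tile ID in the
    high bits, quantized depth in the low bits."""
    depth_q = (depths / depths.max() * (2**depth_bits - 1)).astype(np.uint64)
    return (tile_ids.astype(np.uint64) << np.uint64(depth_bits)) | depth_q

def tile_ranges(sorted_keys, num_tiles, depth_bits=32):
    """Start/end index of each tile's contiguous run in the sorted keys."""
    tiles = (sorted_keys >> np.uint64(depth_bits)).astype(np.int64)
    starts = np.searchsorted(tiles, np.arange(num_tiles), side="left")
    ends = np.searchsorted(tiles, np.arange(num_tiles), side="right")
    return starts, ends

tile_ids = np.array([2, 0, 2, 1, 0])      # which tile each instance covers
depths = np.array([5.0, 1.0, 2.0, 3.0, 4.0])
keys = make_keys(tile_ids, depths)
order = np.argsort(keys)                  # stands in for the GPU Radix sort
print(order, tile_ranges(keys[order], num_tiles=3))
```

Because the tile ID occupies the high bits, a single global sort simultaneously groups instances by tile and orders each group by depth, which is exactly what lets one sort serve every tile's front-to-back traversal.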
- Numerical Stability (Appendix C): To ensure numerical stability, especially during the backward pass's opacity recovery:
  - Blending updates are skipped if the alpha value is below a small epsilon (e.g., $\epsilon = 1/255$).
  - Alpha values are clamped from above at 0.99.
  - In the forward pass, front-to-back blending for a pixel stops if the accumulated opacity would exceed 0.9999, preventing division by zero or infinite values during backward-pass reconstruction.

This comprehensive methodology, combining an expressive Gaussian representation with adaptive optimization and a highly optimized rasterization pipeline, is what enables 3D Gaussian Splatting to achieve its unprecedented balance of quality, speed, and real-time performance. A small sketch of this blending logic follows below.
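A minimal sketch of per-pixel front-to-back blending with these stability measures (the ε and clamp values follow the list above; everything else is illustrative):

```python
# A minimal sketch of per-pixel front-to-back alpha blending with early
# termination and the stability epsilons listed above.
import numpy as np

def blend_pixel(colors, alphas, eps=1/255.0, alpha_max=0.99, t_stop=1e-4):
    """colors: (N, 3) sorted front-to-back; alphas: (N,) per-splat opacity."""
    C = np.zeros(3)
    T = 1.0                                # transmittance so far
    for c, a in zip(colors, alphas):
        a = min(a, alpha_max)              # clamp alpha from above at 0.99
        if a < eps:                        # skip near-transparent splats
            continue
        C += T * a * c
        T *= 1.0 - a
        if T < t_stop:                     # accumulated opacity > 0.9999: stop
            break
    return C

colors = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
print(blend_pixel(colors, np.array([0.6, 0.5, 0.9])))  # [0.6, 0.2, 0.18]
```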
4.2.6. Image Formation Model
As mentioned in the "Related Work" section, the paper highlights that point-based $\alpha$-blending and NeRF-style volumetric rendering share essentially the same image formation model. This understanding is crucial because it allows the 3D Gaussians to be optimized using principles derived from volumetric rendering while being rendered efficiently using splatting and alpha blending.
The volumetric rendering formula (Eq. 1 in the paper), used by NeRFs, is:
$
C = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) \mathbf{c}_i \quad \mathrm{with} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j \right)
$
where:
- $C$: The final accumulated color of a ray.
- $N$: Number of samples along the ray.
- $T_i$: Transmittance (or accumulated transparency) from the ray origin to sample $i$.
- $\sigma_i$: Volume density at sample $i$.
- $\delta_i$: Distance between consecutive samples.
- $\mathbf{c}_i$: Color at sample $i$.

This can be rewritten (Eq. 2 in the paper) by defining the opacity $\alpha_i$ for each sample:
$
C = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i \quad \mathrm{with} \quad \alpha_i = \big(1 - \exp(-\sigma_i \delta_i)\big) \mathrm{~and~} T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)
$
where:
- $\alpha_i$: The effective opacity of the $i$-th volumetric sample.
- The new $T_i$ is explicitly shown as a product of transparencies of preceding samples.

The point-based $\alpha$-blending approach (Eq. 3 in the paper), typical for methods that blend ordered points (or splats) overlapping a pixel, is:
$
C = \sum_{i \in N} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)
$
where:
- $N$: The set of ordered points (or splats) overlapping the pixel.
- $\mathbf{c}_i$: The color of each point.
- $\alpha_i$: The opacity of each point.
- $\prod_{j=1}^{i-1} (1 - \alpha_j)$: The accumulated transparency from previous points, effectively acting as transmittance.

The paper emphasizes that from Eq. 2 and Eq. 3 we can clearly see that the image formation model is the same. This equivalence allows 3D Gaussians (which project to 2D splats with color and opacity) to be optimized under the same volumetric principles as NeRFs, while being rendered using a highly efficient splatting approach that mimics the volumetric accumulation. This is a fundamental insight that justifies the 3D Gaussian Splatting approach.
5. Experimental Setup
5.1. Datasets
The authors evaluated their algorithm on a diverse set of established datasets to demonstrate its robustness across various scene types and capture styles.
- Mip-NeRF360 Dataset [Barron et al. 2022]: This dataset comprises 9 real-world scenes (bicycle, flowers, garden, stump, treehill, room, counter, kitchen, bonsai) and is considered the state-of-the-art benchmark for NeRF rendering quality. It includes both bounded indoor scenes and large unbounded outdoor environments. The scenes often feature complex geometry and view-dependent effects, providing a challenging test for both quality and scalability.
- Tanks&Temples Dataset [Knapitsch et al. 2017]: Two scenes (Truck and Train) from this dataset were used. This dataset is known for its large-scale outdoor environments and realistic captures, often used for benchmarking 3D reconstruction and novel-view synthesis methods.
- Hedman et al. Dataset [Hedman et al. 2018]: Two scenes (Dr Johnson and Playroom) were included. This dataset provides indoor scenes with specific challenges in image-based rendering.
- Synthetic Blender Dataset [Mildenhall et al. 2020]: This dataset (Mic, Chair, Ship, Materials, Lego, Drums, Ficus, Hotdog) consists of synthetic objects rendered in a clean environment with uniform backgrounds and well-defined camera parameters. It provides an exhaustive set of views and is bounded in size, making it suitable for evaluating fundamental reconstruction capabilities, especially with random initialization.

For all real-world datasets, a standard train/test split was used, following the methodology suggested by Mip-NeRF360, where every 8th photo was reserved for testing. This ensures consistent and meaningful comparisons with previous methods. The choice of these datasets allows for validation across varying levels of scene complexity, boundedness, lighting conditions, and capture strategies.
5.2. Evaluation Metrics
The performance of the novel-view synthesis method was evaluated using three widely accepted metrics in the literature: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).
Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR is a common metric for measuring the quality of reconstruction of lossy compression codecs or image processing techniques. It quantifies the difference between two images on a pixel-by-pixel basis. A higher PSNR value generally indicates a higher quality (less noisy) reconstructed image. It is typically expressed in decibels (dB).
- Mathematical Formula:
  $
  PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)
  $
- Symbol Explanation:
  - PSNR: Peak Signal-to-Noise Ratio in decibels (dB).
  - $MAX_I$: The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255. For color images where each channel is 8-bit, it's also typically 255.
  - MSE: Mean Squared Error between the original (ground truth) image and the reconstructed (rendered) image. It is calculated as:
    $
    MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2
    $
    where:
    - $M$: Number of rows (height) of the image.
    - $N$: Number of columns (width) of the image.
    - I(i,j): The pixel value at row $i$ and column $j$ of the original image.
    - K(i,j): The pixel value at row $i$ and column $j$ of the reconstructed image.
Structural Similarity Index Measure (SSIM)
- Conceptual Definition: SSIM is designed to measure the perceived structural similarity between two images, moving beyond simple pixel differences. It considers three key factors: luminance, contrast, and structure. It ranges from -1 to 1, with 1 indicating perfect similarity. Higher SSIM values indicate better perceptual quality.
- Mathematical Formula:
  $
  SSIM(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2+\mu_y^2+c_1)(\sigma_x^2+\sigma_y^2+c_2)}
  $
- Symbol Explanation:
  - SSIM(x,y): The Structural Similarity Index between image patches $x$ and $y$.
  - $\mu_x$: The average (mean) of image patch $x$.
  - $\mu_y$: The average (mean) of image patch $y$.
  - $\sigma_x^2$: The variance of image patch $x$.
  - $\sigma_y^2$: The variance of image patch $y$.
  - $\sigma_{xy}$: The covariance between image patches $x$ and $y$.
  - $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$: Two small constants included to avoid division by zero when the denominators are very close to zero.
  - $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit images).
  - $k_1 = 0.01$ and $k_2 = 0.03$ are typical default values.
Learned Perceptual Image Patch Similarity (LPIPS)
- Conceptual Definition: LPIPS is a metric that aims to correlate more closely with human perception of image quality compared to traditional metrics like PSNR and SSIM. It measures the perceptual distance between two images by comparing their feature representations extracted from a pre-trained deep neural network (e.g., VGG, AlexNet). A lower LPIPS score indicates higher perceptual similarity.
- Mathematical Formula: LPIPS is not a single, simple closed-form mathematical expression like PSNR or SSIM. Instead, it involves a computation pipeline:
  $
  \mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \left\| w_l \odot (f_l(x) - f_l(y)) \right\|_2^2
  $
- Symbol Explanation:
  - $\mathrm{LPIPS}(x, y)$: The Learned Perceptual Image Patch Similarity between images $x$ and $y$.
  - $\sum_l$: Summation over different layers $l$ of a pre-trained CNN.
  - $f_l(x)$: The feature stack (activation map) extracted from image $x$ at layer $l$ of the pre-trained network.
  - $f_l(y)$: The feature stack (activation map) extracted from image $y$ at layer $l$ of the pre-trained network.
  - $w_l$: A trainable scalar weight vector applied to the feature channels at layer $l$, optimizing the metric to align with human judgments.
  - $\odot$: Element-wise multiplication.
  - $\| \cdot \|_2^2$: The squared $\ell_2$ norm (Euclidean distance).
  - $H_l, W_l$: Height and width of the feature map at layer $l$. The term $\frac{1}{H_l W_l}$ normalizes the squared Euclidean distance by the number of elements in the feature map.
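For concreteness, a minimal sketch of the PSNR computation from the formula above (assuming 8-bit images, so $MAX_I = 255$):

```python
# A minimal sketch of PSNR between a reference image and a noisy copy.
import numpy as np

def psnr(img, ref, max_i=255.0):
    """PSNR = 10 * log10(MAX_I^2 / MSE), in decibels."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_i**2 / mse) if mse > 0 else np.inf

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64, 3))
noisy = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255)
print(f"{psnr(noisy, ref):.2f} dB")   # ~34 dB for sigma=5 additive noise
```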
5.3. Baselines
The paper compares its 3D Gaussian Splatting method against several leading novel-view synthesis techniques, chosen for their state-of-the-art quality or computational efficiency:
- Mip-NeRF360 [Barron et al. 2022]: This method is considered the state-of-the-art in terms of rendering quality for unbounded scenes. It serves as the primary benchmark for visual fidelity, despite its very high training and rendering costs.
- InstantNGP [Müller et al. 2022]: This method represents a significant leap in speed for NeRF-like approaches. It uses a multiresolution hash encoding and is known for its fast training (minutes) and interactive rendering. The paper compares against two configurations:
  - INGP-Base: A basic configuration run for 35K iterations.
  - INGP-Big: A slightly larger network configuration suggested by the authors, offering potentially higher quality at the cost of slightly more resources.
- Plenoxels [Fridovich-Keil and Yu et al. 2022]: This method is another fast NeRF variant that represents radiance fields with a sparse voxel grid, notably forgoing neural networks entirely. It is also known for fast training and interactive rendering.

These baselines collectively cover the spectrum from highest quality (Mip-NeRF360) to fastest performance (InstantNGP, Plenoxels), allowing 3D Gaussian Splatting to demonstrate its advantage in combining both aspects.
6. Results & Analysis
6.1. Core Results Analysis
The 3D Gaussian Splatting (3DGS) method demonstrates a significant advancement in novel-view synthesis by achieving state-of-the-art (SOTA) visual quality, competitive training times, and critically, real-time rendering at 1080p resolution.
The overall performance comparison with leading methods is presented in Table 1.
The following are the results from Table 1 of the original paper. Columns are grouped per dataset (Mip-NeRF360, Tanks&Temples, Deep Blending), each with SSIM↑, PSNR↑, LPIPS↓, training time, rendering FPS, and model memory:

| Method | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 0.626 | 23.08 | 0.463 | 25m49s | 6.79 | 2.1GB | 0.719 | 21.08 | 0.379 | 25m5s | 13.0 | 2.3GB | 0.795 | 23.06 | 0.510 | 27m49s | 11.2 | 2.7GB |
| INGP-Base | 0.671 | 25.30 | 0.371 | 5m37s | 11.7 | 13MB | 0.723 | 21.72 | 0.330 | 5m26s | 17.1 | 13MB | 0.797 | 23.62 | 0.423 | 6m31s | 3.26 | 13MB |
| INGP-Big | 0.699 | 25.59 | 0.331 | 7m30s | 9.43 | 48MB | 0.745 | 21.92 | 0.305 | 6m59s | 14.4 | 48MB | 0.817 | 24.96 | 0.390 | 8m | 2.79 | 48MB |
| M-NeRF360 | 0.792* | 27.69† | 0.237 | 48h | 0.06 | 8.6MB | 0.759 | 22.22 | 0.257 | 48h | 0.14 | 8.6MB | 0.901 | 29.40 | 0.245 | 48h | 0.09 | 8.6MB |
| Ours-7K | 0.770 | 25.60 | 0.279 | 6m25s | 160 | 523MB | 0.767 | 21.20 | 0.280 | 6m55s | 197 | 270MB | 0.875 | 27.78 | 0.317 | 4m35s | 172 | 386MB |
| Ours-30K | 0.815 | 27.21 | 0.214 | 41m33s | 134 | 734MB | 0.841 | 23.14 | 0.183 | 26m54s | 154 | 411MB | 0.903 | 29.41 | 0.243 | 36m2s | 137 | 676MB |

- SSIM↑: Higher is better.
- PSNR↑: Higher is better.
- LPIPS↓: Lower is better.
- Train: Training time.
- FPS: Frames Per Second (rendering speed).
- Mem: Memory used to store the model.
- * and † indicate numbers directly adopted from the original paper (for Mip-NeRF360).
Key Observations from Table 1:
- Quality Dominance (Ours-30K vs. Mip-NeRF360):
  - For the Mip-NeRF360 dataset, Ours-30K achieves an SSIM of 0.815 (vs. 0.792 for Mip-NeRF360), a PSNR of 27.21 (vs. 27.69 for Mip-NeRF360), and an LPIPS of 0.214 (vs. 0.237 for Mip-NeRF360). This shows 3DGS is largely on par with, and in some metrics (SSIM, LPIPS) even slightly surpasses, the SOTA quality of Mip-NeRF360.
  - Similar trends are observed for the Tanks&Temples and Deep Blending datasets, where Ours-30K consistently achieves the highest SSIM and lowest LPIPS, and very competitive PSNR.
- Unprecedented Real-Time Rendering (Ours vs. All Baselines):
  - The most striking advantage is rendering speed. Mip-NeRF360 renders at 0.06-0.14 FPS (frames per second), meaning tens of seconds per frame.
  - Plenoxels and InstantNGP achieve interactive rates (3-17 FPS), but still fall short of real-time (≥ 30 fps). Ours-7K and Ours-30K consistently achieve real-time rendering speeds, ranging from 134-197 FPS across all datasets. This is a monumental achievement, making 3DGS the first method to enable high-quality real-time novel-view synthesis.
- Competitive Training Times:
  - Mip-NeRF360 requires 48 hours of training. Plenoxels and InstantNGP train in minutes (5-28 minutes). Ours-7K trains in 4-7 minutes, matching the fastest methods for initial quality. Ours-30K (full convergence) trains in 26-41 minutes, which is competitive with or slightly longer than Plenoxels/INGP-Big, but for significantly higher quality. This is a massive reduction from Mip-NeRF360's training time.
- Memory Consumption:
  - InstantNGP models are very compact (13-48 MB). Mip-NeRF360 is also relatively compact (8.6 MB). 3DGS models are larger (270-734 MB). While larger than implicit methods, this is still manageable for GPU memory, especially considering the explicit nature of storing millions of Gaussians. The authors note potential for further memory reduction.
Visual Results:
Figure 1 provides a compelling visual summary of the performance comparison:
The image is an illustration comparing the real-time rendering performance of different methods, including InstantNGP, Plenoxels, Mip-NeRF360, the proposed method, and the ground truth. Each method is annotated with its frame rate, training time, and PSNR, showing the proposed model's clear advantage in real-time rendering. The rightmost image is the ground-truth scene, serving as the reference.
Figure 1 showcases a comparison of 3D Gaussian Splatting against other methods like InstantNGP, Plenoxels, and Mip-NeRF360. The critical takeaway is 3DGS's ability to achieve real-time rendering (137-197 FPS) at high quality, significantly outperforming Mip-NeRF360 (0.06-0.14 FPS) and offering superior quality and speed compared to InstantNGP and Plenoxels. The PSNR scores further confirm the high visual fidelity of 3DGS.
Figure 5 offers more detailed visual comparisons for specific scenes, highlighting that 3DGS can avoid artifacts present in Mip-NeRF360 (e.g., blurriness in vegetation or walls).
Figure 5 (image): side-by-side results of our method against multiple baselines, including Ground Truth, our method, Mip-NeRF360, InstantNGP, and Plenoxels. Each row shows renderings of a different scene, highlighting our method's advantage in visual quality.
Figure 5 visually compares the rendering quality of 3D Gaussian Splatting (Ours) against Ground Truth, Mip-NeRF360, InstantNGP, and Plenoxels. The results demonstrate that Ours produces high-fidelity images, often matching or exceeding the perceptual quality of Mip-NeRF360 and clearly outperforming InstantNGP and Plenoxels, especially in fine details and overall sharpness. This figure provides strong visual evidence of the method's state-of-the-art quality.
Synthetic Bounded Scenes:
For the synthetic Blender dataset, where scenes are bounded and views are exhaustive, 3DGS achieves SOTA results even with random initialization (100K points). The adaptive density control quickly prunes them to 6-10K meaningful Gaussians, and the final model reaches 200-500K Gaussians. Table 2 shows PSNR scores on this dataset.
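A minimal sketch of that workflow follows; the cube extent, initial opacity, and pruning threshold are illustrative assumptions, not the paper's exact constants:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random initialization: 100K points uniformly sampled in a cube around the scene.
positions = rng.uniform(-1.3, 1.3, size=(100_000, 3))  # cube extent: an assumption
opacities = np.full(100_000, 0.1)                       # low initial opacity

def prune(positions: np.ndarray, opacities: np.ndarray, min_opacity: float = 0.005):
    """Periodically drop Gaussians whose opacity fell below a small threshold."""
    keep = opacities > min_opacity
    return positions[keep], opacities[keep]
```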
The following are the results from Table 2 of the original paper (PSNR scores on the Synthetic NeRF dataset, starting from 100K randomly initialized points; competing metrics are taken from the respective papers):
| Method | Mic | Chair | Ship | Materials | Lego | Drums | Ficus | Hotdog | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 33.26 | 33.98 | 29.62 | 29.14 | 34.10 | 25.35 | 31.83 | 36.81 | 31.76 |
| INGP-Base | 36.22 | 35.00 | 31.10 | 29.78 | 36.39 | 26.02 | 33.51 | 37.40 | 33.18 |
| Mip-NeRF | 36.51 | 35.14 | 30.41 | 30.71 | 35.70 | 25.48 | 33.29 | 37.48 | 33.09 |
| Point-NeRF | 35.95 | 35.40 | 30.97 | 29.61 | 35.04 | 26.06 | 36.13 | 37.30 | 33.30 |
| Ours-30K | 35.36 | 35.83 | 30.80 | 30.00 | 35.78 | 26.15 | 34.87 | 37.72 | 33.32 |
For the synthetic Blender dataset, Ours-30K achieves an average PSNR of 33.32, which is the highest among all compared methods, slightly surpassing Point-NeRF (33.30) and INGP-Base (33.18). The rendering FPS for these scenes was 180-300. This confirms the method's effectiveness even when starting from a less structured initial state.
Compactness:
The anisotropic Gaussians prove to be a compact representation. When compared against the point-based models of [Zhang et al. 2022] (which use foreground masks and space carving), 3DGS surpasses their reported PSNR scores using approximately one-fourth of their point count and significantly smaller model sizes (average 3.8 MB vs. 9 MB). This demonstrates the efficiency of using anisotropic shapes to model complex geometry.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
Metrics are grouped by dataset, six columns per group: Mip-NeRF360 (left), Tanks&Temples (middle), Deep Blending (right).

| Method | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 0.626 | 23.08 | 0.463 | 25m49s | 6.79 | 2.1GB | 0.719 | 21.08 | 0.379 | 25m5s | 13.0 | 2.3GB | 0.795 | 23.06 | 0.510 | 27m49s | 11.2 | 2.7GB |
| INGP-Base | 0.671 | 25.30 | 0.371 | 5m37s | 11.7 | 13MB | 0.723 | 21.72 | 0.330 | 5m26s | 17.1 | 13MB | 0.797 | 23.62 | 0.423 | 6m31s | 3.26 | 13MB |
| INGP-Big | 0.699 | 25.59 | 0.331 | 7m30s | 9.43 | 48MB | 0.745 | 21.92 | 0.305 | 6m59s | 14.4 | 48MB | 0.817 | 24.96 | 0.390 | 8m | 2.79 | 48MB |
| M-NeRF360 | 0.792* | 27.69† | 0.237 | 48h | 0.06 | 8.6MB | 0.759 | 22.22 | 0.257 | 48h | 0.14 | 8.6MB | 0.901 | 29.40 | 0.245 | 48h | 0.09 | 8.6MB |
| Ours-7K | 0.770 | 25.60 | 0.279 | 6m25s | 160 | 523MB | 0.767 | 21.20 | 0.280 | 6m55s | 197 | 270MB | 0.875 | 27.78 | 0.317 | 4m35s | 172 | 386MB |
| Ours-30K | 0.815 | 27.21 | 0.214 | 41m33s | 134 | 734MB | 0.841 | 23.14 | 0.183 | 26m54s | 154 | 411MB | 0.903 | 29.41 | 0.243 | 36m2s | 137 | 676MB |
The following are the results from Table 2 of the original paper:
| Method | Mic | Chair | Ship | Materials | Lego | Drums | Ficus | Hotdog | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 33.26 | 33.98 | 29.62 | 29.14 | 34.10 | 25.35 | 31.83 | 36.81 | 31.76 |
| INGP-Base | 36.22 | 35.00 | 31.10 | 29.78 | 36.39 | 26.02 | 33.51 | 37.40 | 33.18 |
| Mip-NeRF | 36.51 | 35.14 | 30.41 | 30.71 | 35.70 | 25.48 | 33.29 | 37.48 | 33.09 |
| Point-NeRF | 35.95 | 35.40 | 30.97 | 29.61 | 35.04 | 26.06 | 36.13 | 37.30 | 33.30 |
| Ours-30K | 35.36 | 35.83 | 30.80 | 30.00 | 35.78 | 26.15 | 34.87 | 37.72 | 33.32 |
The following are the results from Table 4 of the original paper (SSIM scores for Mip-NeRF360 scenes):
| Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 0.496 | 0.431 | 0.6063 | 0.523 | 0.509 | 0.8417 | 0.759 | 0.648 | 0.814 |
| INGP-Base | 0.491 | 0.450 | 0.649 | 0.574 | 0.518 | 0.855 | 0.798 | 0.818 | 0.890 |
| INGP-Big | 0.512 | 0.486 | 0.701 | 0.594 | 0.542 | 0.871 | 0.817 | 0.858 | 0.906 |
| Mip-NeRF360† | 0.685 | 0.583 | 0.813 | 0.744 | 0.632 | 0.913 | 0.894 | 0.920 | 0.941 |
| Mip-NeRF360 | 0.685 | 0.584 | 0.809 | 0.745 | 0.631 | 0.910 | 0.892 | 0.917 | 0.938 |
| Ours-7k | 0.675 | 0.525 | 0.836 | 0.728 | 0.598 | 0.884 | 0.873 | 0.900 | 0.910 |
| Ours-30k | 0.771 | 0.605 | 0.868 | 0.775 | 0.638 | 0.914 | 0.905 | 0.922 | 0.938 |
The following are the results from Table 5 of the original paper (PSNR scores for Mip-NeRF360 scenes):
| Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 21.912 | 20.097 | 23.4947 | 20.661 | 22.248 | 27.594 | 23.624 | 23.420 | 24.669 |
| INGP-Base | 22.193 | 20.348 | 24.599 | 23.626 | 22.364 | 29.269 | 26.439 | 28.548 | 30.337 |
| INGP-Big | 22.171 | 20.652 | 25.069 | 23.466 | 22.373 | 29.690 | 26.691 | 29.479 | 30.685 |
| Mip-NeRF360† | 24.37 | 21.73 | 26.98 | 26.40 | 22.87 | 31.63 | 29.55 | 32.23 | 33.46 |
| Mip-NeRF360 | 24.305 | 21.649 | 26.875 | 26.175 | 22.929 | 31.467 | 29.447 | 31.989 | 33.397 |
| Ours-7k | 23.604 | 20.515 | 26.245 | 25.709 | 22.085 | 28.139 | 26.705 | 28.546 | 28.850 |
| Ours-30k | 25.246 | 21.520 | 27.410 | 26.550 | 22.490 | 30.632 | 28.700 | 30.317 | 31.980 |
The following are the results from Table 6 of the original paper (LPIPS scores for Mip-NeRF360 scenes):
| Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
|---|---|---|---|---|---|---|---|---|---|
| Plenoxels | 0.506 | 0.521 | 0.3864 | 0.503 | 0.540 | 0.4186 | 0.441 | 0.447 | 0.398 |
| INGP-Base | 0.487 | 0.481 | 0.312 | 0.450 | 0.489 | 0.301 | 0.342 | 0.254 | 0.227 |
| INGP-Big | 0.446 | 0.441 | 0.257 | 0.421 | 0.450 | 0.261 | 0.306 | 0.195 | 0.205 |
| Mip-NeRF360† | 0.301 | 0.344 | 0.170 | 0.261 | 0.339 | 0.211 | 0.204 | 0.127 | 0.176 |
| Mip-NeRF360 | 0.305 | 0.346 | 0.171 | 0.265 | 0.347 | 0.213 | 0.207 | 0.128 | 0.179 |
| Ours-7k | 0.318 | 0.417 | 0.153 | 0.287 | 0.404 | 0.272 | 0.254 | 0.161 | 0.244 |
| Ours-30k | 0.205 | 0.336 | 0.103 | 0.210 | 0.317 | 0.220 | 0.204 | 0.129 | 0.205 |
The following are the results from Table 7 of the original paper (SSIM scores for Tanks&Temples and Deep Blending scenes):
| Method | Truck | Train | Dr Johnson | Playroom |
|---|---|---|---|---|
| Plenoxels | 0.774 | 0.663 | 0.787 | 0.802 |
| INGP-Base | 0.779 | 0.666 | 0.839 | 0.754 |
| INGP-Big | 0.800 | 0.689 | 0.854 | 0.779 |
| Mip-NeRF360 | 0.857 | 0.660 | 0.901 | 0.900 |
| Ours-7k | 0.840 | 0.694 | 0.853 | 0.896 |
| Ours-30k | 0.879 | 0.802 | 0.899 | 0.906 |
The following are the results from Table 8 of the original paper (PSNR scores for Tanks&Temples and Deep Blending scenes):
| Method | Truck | Train | Dr Johnson | Playroom |
|---|---|---|---|---|
| Plenoxels | 23.221 | 18.927 | 23.142 | 22.980 |
| INGP-Base | 23.260 | 20.170 | 27.750 | 19.483 |
| INGP-Big | 23.383 | 20.456 | 28.257 | 21.665 |
| Mip-NeRF360 | 24.912 | 19.523 | 29.140 | 29.657 |
| Ours-7k | 23.506 | 18.892 | 26.306 | 29.245 |
| Ours-30k | 25.187 | 21.097 | 28.766 | 30.044 |
The following are the results from Table 9 of the original paper (LPIPS scores for Tanks&Temples and Deep Blending scenes):
| Method | Truck | Train | Dr Johnson | Playroom |
|---|---|---|---|---|
| Plenoxels | 0.335 | 0.422 | 0.521 | 0.499 |
| INGP-Base | 0.274 | 0.386 | 0.381 | 0.465 |
| INGP-Big | 0.249 | 0.360 | 0.352 | 0.428 |
| Mip-NeRF360 | 0.159 | 0.354 | 0.237 | 0.252 |
| Ours-7k | 0.209 | 0.350 | 0.343 | 0.291 |
| Ours-30k | 0.148 | 0.218 | 0.244 | 0.241 |
6.3. Ablation Studies / Parameter Analysis
The authors conducted a thorough ablation study to evaluate the contribution of each key component and design choice in 3D Gaussian Splatting. The quantitative results are summarized in Table 3.
The following are the results from Table 3 of the original paper (PSNR scores for the ablation runs on the Truck, Garden, and Bicycle scenes, evaluated at 5K and 30K iterations):
| Ablation | Truck-5K | Garden-5K | Bicycle-5K | Truck-30K | Garden-30K | Bicycle-30K | Average-5K | Average-30K |
|---|---|---|---|---|---|---|---|---|
| Limited-BW | 14.66 | 22.07 | 20.77 | 13.84 | 22.88 | 20.87 | 19.16 | 19.19 |
| Random Init | 16.75 | 20.90 | 19.86 | 18.02 | 22.19 | 21.05 | 19.17 | 20.42 |
| No-Split | 18.31 | 23.98 | 22.21 | 20.59 | 26.11 | 25.02 | 21.50 | 23.90 |
| No-SH | 22.36 | 25.22 | 22.88 | 24.39 | 26.59 | 25.08 | 23.48 | 25.35 |
| No-Clone | 22.29 | 25.61 | 22.15 | 24.82 | 27.47 | 25.46 | 23.35 | 25.91 |
| Isotropic | 22.40 | 25.49 | 22.81 | 23.89 | 27.00 | 24.81 | 23.56 | 25.23 |
| Full | 22.71 | 25.82 | 23.18 | 24.81 | 27.70 | 25.65 | 23.90 | 26.05 |
**Initialization from SfM (vs. Random Init).**
- Result: Random Init (Average-30K PSNR: 20.42) performs significantly worse than Full (Average-30K PSNR: 26.05).
- Analysis: Initializing the 3D Gaussians from SfM points (the default Full method) is crucial for performance, especially in the background and other poorly observed regions. While random initialization can work for well-constrained synthetic scenes, it leads to more floaters and poorer reconstruction in real-world captures. Figure 7 visually supports this finding, and a sketch of SfM-seeded initialization follows the figure below.

Figure 7 (image): schematic comparison of random-point-cloud initialization (top) against SfM-point initialization (bottom). The random initialization shows significant degradation, especially in background areas, while the SfM initialization yields a much higher-quality scene, underscoring the value of SfM data for robust reconstruction.
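A minimal sketch of this initialization step, assuming pycolmap as the reader for the COLMAP reconstruction (any SfM loader works); the nearest-neighbor radius heuristic follows the paper's description:

```python
import numpy as np
import pycolmap                      # assumed COLMAP reader
from scipy.spatial import cKDTree

def init_gaussians_from_sfm(model_dir: str):
    """Seed one isotropic Gaussian per SfM point, mean and color from the point."""
    rec = pycolmap.Reconstruction(model_dir)
    xyz = np.array([p.xyz for p in rec.points3D.values()])
    rgb = np.array([p.color for p in rec.points3D.values()]) / 255.0

    # Paper: initial radius = mean distance to the three nearest neighbors.
    dists, _ = cKDTree(xyz).query(xyz, k=4)  # k=4: first hit is the point itself
    radii = dists[:, 1:].mean(axis=1)
    return xyz, rgb, radii
```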
**Densification Strategies (No-Split and No-Clone).**
- Result: No-Split (Average-30K PSNR: 23.90) and No-Clone (Average-30K PSNR: 25.91) both reduce quality compared to Full (Average-30K PSNR: 26.05).
- Analysis: Both the cloning and splitting mechanisms of adaptive density control matter. Splitting large Gaussians is essential for reconstructing backgrounds and detailed geometry (No-Split shows the larger drop). Cloning small Gaussians helps cover under-reconstructed regions and accelerates convergence, particularly for thin structures (No-Clone shows a smaller but still clear drop). Figure 8 visually illustrates the impact of these strategies, and a sketch of the clone/split rule follows the figure below.

Figure 8 (image): effects of the densification strategies, comparing "No Split-5k", "No Clone-5k", and "Full-5k" crops. The No-Split crop shows artifacts where large areas are inadequately refined; the No-Clone crop struggles to cover new geometry; the Full crop shows superior detail and completeness, validating both mechanisms.
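The decision rule itself is compact. This sketch follows the paper's description (the gradient threshold 0.0002, the percent-dense factor, and the scale divisor 1.6 are from the paper; the array layout is illustrative):

```python
import numpy as np

def densify(positions, scales, view_grads, scene_extent,
            grad_threshold=0.0002, percent_dense=0.01,
            rng=np.random.default_rng(0)):
    """Adaptive density control: clone small Gaussians, split large ones.

    view_grads holds each Gaussian's average view-space positional gradient
    magnitude, accumulated since the last densification step.
    """
    hot = view_grads > grad_threshold                  # poorly reconstructed regions
    large = scales.max(axis=1) > percent_dense * scene_extent

    clone = hot & ~large   # under-reconstruction: duplicate the small Gaussian
    split = hot & large    # over-reconstruction: replace with two smaller children

    # Split children sample positions from the parent and shrink its scale by 1.6.
    parents = np.repeat(positions[split], 2, axis=0)
    jitter = rng.standard_normal(parents.shape) * np.repeat(scales[split], 2, axis=0)
    children = parents + jitter
    child_scales = np.repeat(scales[split], 2, axis=0) / 1.6

    keep = ~split          # split parents are removed; cloned parents are kept
    # (The paper additionally moves each clone along its positional gradient.)
    positions = np.concatenate([positions[keep], positions[clone], children])
    scales = np.concatenate([scales[keep], scales[clone], child_scales])
    return positions, scales
```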
**Unlimited Depth Complexity of Splats with Gradients (Limited-BW).**
- Result: Limited-BW (restricting gradient computation to a fixed number of front-most points) causes severe quality degradation (Average-30K PSNR: 19.19) compared to Full (26.05); on the Truck scene the drop is roughly 11 dB in PSNR.
- Analysis: All splats that contribute to a pixel's color must receive gradients, regardless of their depth. Limiting this (as some prior methods do for speed) destabilizes optimization and produces significant visual artifacts, especially in scenes with high depth complexity. Figure 9 vividly shows this degradation, and a compositing sketch follows the figure below.

Figure 9 (image): limiting the number of Gaussians that receive gradients to only 10 (left) causes severe visual degradation and artifacts, whereas the full method with no limit (right) is vastly superior in quality and detail.
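The blending rule at the heart of this is standard front-to-back alpha compositing; in the paper's rasterizer, every Gaussian with non-negligible contribution participates in the backward pass. A minimal per-pixel sketch, with PyTorch autograd standing in for the custom CUDA backward:

```python
import torch

def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back compositing: C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).

    Every term stays in the autograd graph, so all contributing Gaussians
    receive gradients; this is exactly what the Limited-BW ablation removes.
    """
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    return (colors * (alphas * transmittance).unsqueeze(-1)).sum(dim=0)

# Eight depth-sorted Gaussians covering one pixel:
colors = torch.rand(8, 3, requires_grad=True)
alphas = (torch.rand(8) * 0.9).requires_grad_()
composite_pixel(colors, alphas).sum().backward()
print(alphas.grad)  # non-zero for all 8 entries, not just the front-most
```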
**Anisotropic Covariance (Isotropic).**
- Result: Isotropic Gaussians (Average-30K PSNR: 25.23) yield lower quality than Full (26.05).
- Analysis: Optimizing the full anisotropic covariance matrix is critical. Anisotropy lets each Gaussian adapt its shape and orientation to align precisely with surfaces and fine structures, giving a more compact and accurate scene representation. Without it, the Gaussians are spheres, which are less efficient and less accurate for modeling most real-world geometry. Figure 10 provides a visual comparison, and the covariance construction is sketched after the figure.

Figure 10 (image): rendering quality of Ground Truth, Full (our method with anisotropic covariance), and Isotropic (spherical Gaussians). The Full rendering closely matches the Ground Truth, while the Isotropic rendering shows a noticeable drop in quality, confirming that anisotropic covariance is essential for high-fidelity representation.
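Concretely, each Gaussian's covariance is factored as Σ = R S Sᵀ Rᵀ, where R is a rotation built from a unit quaternion and S is a diagonal scale matrix; the Isotropic ablation collapses S to a single radius. A sketch:

```python
import numpy as np

def covariance_from_quat_scale(q: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Build Sigma = R S S^T R^T from a unit quaternion (w, x, y, z) and scales."""
    w, x, y, z = q / np.linalg.norm(q)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    S = np.diag(s)             # three independent radii: the anisotropic case
    return R @ S @ S.T @ R.T   # positive semi-definite by construction

sphere = covariance_from_quat_scale(np.array([1.0, 0, 0, 0]), np.array([0.1, 0.1, 0.1]))
needle = covariance_from_quat_scale(np.array([0.9, 0.1, 0.3, 0.3]), np.array([0.5, 0.02, 0.02]))
```

Optimizing the quaternion and scales directly, rather than a raw 3x3 matrix, keeps Σ valid (symmetric positive semi-definite) throughout gradient descent.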
**Spherical Harmonics (No-SH).**
- Result: No-SH (Average-30K PSNR: 25.35) reduces quality compared to Full (26.05).
- Analysis: Spherical Harmonics (SH) capture view-dependent effects, letting each Gaussian's color vary realistically with viewing direction; this contributes to higher overall PSNR and a more realistic appearance. A sketch of SH color evaluation follows below.
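To make the view-dependent color term concrete, here is a minimal sketch evaluating a degree-1 SH color (the paper uses degree 3; the constants and sign convention follow the real SH basis commonly used in open 3DGS implementations):

```python
import numpy as np

C0 = 0.28209479177387814  # 1 / (2 * sqrt(pi))
C1 = 0.4886025119029199   # sqrt(3) / (2 * sqrt(pi))

def sh_degree1_to_rgb(sh: np.ndarray, view_dir: np.ndarray) -> np.ndarray:
    """Evaluate a degree-1 SH color for one viewing direction.

    sh: (4, 3) coefficients per RGB channel; view_dir: camera-to-Gaussian vector.
    """
    x, y, z = view_dir / np.linalg.norm(view_dir)
    rgb = C0 * sh[0] - C1 * y * sh[1] + C1 * z * sh[2] - C1 * x * sh[3]
    return np.clip(rgb + 0.5, 0.0, 1.0)  # +0.5 offset, as in open implementations
```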
In summary, the ablation studies confirm that the core components (SfM initialization, clone-and-split adaptive density control, unrestricted gradient flow through the differentiable rasterizer, anisotropic covariance, and spherical harmonics) are all essential to the state-of-the-art quality and real-time performance of 3D Gaussian Splatting.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces 3D Gaussian Splatting (3DGS), a pioneering approach that redefines the landscape of novel-view synthesis. By representing scenes with anisotropic 3D Gaussians, the method successfully bridges the gap between explicit point-based representations and implicit volumetric radiance fields. The combination of a robust optimization pipeline, featuring adaptive density control and anisotropic covariance adjustment, with a highly efficient tile-based differentiable rasterizer, enables unprecedented performance. 3DGS achieves state-of-the-art visual quality comparable to or even surpassing the best NeRF-based methods (e.g., Mip-NeRF360), drastically reduces training times from hours to minutes, and, critically, delivers the first truly real-time (≥ 30 fps) high-resolution (1080p) novel-view synthesis for complex, unbounded scenes. This work demonstrates that continuous implicit representations are not strictly necessary for high-quality and fast radiance field training and rendering, advocating for the power of explicit, GPU-friendly primitives and optimized rasterization principles.
7.2. Limitations & Future Work
The authors acknowledge several limitations of 3D Gaussian Splatting:
- **Artifacts in Poorly Observed Regions:** Like other methods, 3DGS can produce artifacts in areas of the scene that are sparsely covered by training images. These can manifest as coarse, anisotropic Gaussians with low-detail visuals, or as a "splotchy" appearance. Figures 11 and 12 illustrate these failure cases.

Figure 11 (image): comparison of failure artifacts in the TRAIN scene. Left: Mip-NeRF360 exhibits "floaters" and a grainy appearance in the foreground. Right: our method produces coarse, anisotropic Gaussians that yield low-detail visuals in the background. Both methods show limitations, but with different artifact types.

Figure 12 (image): artifacts in the DrJohnson scene when test views have little overlap with the training views. Both Mip-NeRF360 (left) and our method (right) exhibit issues, indicating a common challenge for novel-view synthesis under poorly constrained viewing conditions.

- **Popping Artifacts:** Popping can occur when the optimization creates large Gaussians, particularly in regions with strong view-dependent appearance. This is attributed to the simple visibility algorithm (Gaussians may suddenly switch depth/blending order) and to the trivial rejection of Gaussians via a guard band in the rasterizer.
- **Lack of Regularization:** The current optimization employs no regularization, which could otherwise help mitigate issues in unseen regions and reduce popping artifacts.
- **Hyperparameter Sensitivity:** The same hyperparameters were used throughout the main evaluation, but reducing the position learning rate may be necessary for convergence in very large scenes (e.g., urban datasets).
- **Memory Consumption:** Compared to NeRF-based solutions, 3DGS consumes significantly more memory. Peak GPU memory can exceed 20 GB when training large scenes, and even for rendering, trained models require several hundred megabytes plus an additional 30-500 MB for the rasterizer. The authors note that this prototype implementation is unoptimized and could be reduced considerably.
Future Work: The authors suggest several promising directions for future research:
- **Improved Culling and Antialiasing:** A more principled culling approach in the rasterizer, together with antialiasing techniques, could alleviate popping artifacts and improve visual quality.
- **Regularization:** Integrating regularization into the optimization process could enhance robustness in poorly observed regions and reduce artifacts.
- **Optimization Efficiency:** Porting the remaining Python optimization code entirely to CUDA could yield further significant speedups for performance-critical applications.
- **Memory Optimization:** Applying compression techniques, well studied in the point-cloud literature, to the 3D Gaussian representation could drastically reduce the memory footprint.
- **Mesh Reconstruction:** Exploring whether the 3D Gaussians can be used for mesh reconstruction of the captured scene would clarify the method's position between volumetric and surface representations and open up new applications.
7.3. Personal Insights & Critique
This paper represents a paradigm shift in novel-view synthesis, moving away from the purely implicit representations of NeRFs and demonstrating that an explicit, yet differentiable, primitive like 3D Gaussians can achieve superior performance across the board.
Inspirations and Applications:
- **Real-Time Potential:** The most significant inspiration is the achievement of real-time, high-quality rendering. This opens doors to applications previously unattainable, such as VR/AR, interactive experiences, games, virtual tourism, and rapid content creation for digital twins. The ability to navigate complex scenes at 1080p and well above 30 FPS is transformative.
- **Hybrid Representation:** The 3D Gaussian is a brilliant choice of primitive, acting as a "learned voxel" or "learned particle" that naturally bridges discrete points and continuous volumes. It leverages the strengths of both, offering the geometric flexibility of points with the differentiable properties of continuous fields.
- **Engineering Excellence:** The emphasis on a highly optimized tile-based rasterizer and explicit gradient computation highlights the importance of classic computer-graphics engineering principles in achieving real-world performance for AI-driven tasks. It shows that algorithmic efficiency is as crucial as neural network architecture.
- **Simplicity and Adaptability:** Initializing from readily available SfM points and then dynamically adapting the Gaussian density and shape is an elegant solution to reconstruction challenges, making the method highly adaptable to diverse capture conditions.
Potential Issues, Unverified Assumptions, and Areas for Improvement:
- **Memory Footprint:** While the authors acknowledge it, the current memory footprint (hundreds of MBs for inference, GBs for training) may still be a bottleneck for deployment on highly constrained devices such as mobile AR/VR headsets with limited VRAM. Aggressive compression techniques will be vital.
- **Interpretability and Editability:** Although 3D Gaussians are explicit, directly editing individual Gaussians or understanding their precise contribution to a semantic part of the scene remains complex. Future work on semantic segmentation or hierarchical structures over Gaussians could improve editability and interpretability.
- **Generalization to Unseen Conditions:** The artifacts in poorly observed regions suggest that, while 3DGS is robust, it still relies heavily on sufficient multi-view coverage. Further research into inpainting or generative models to intelligently fill gaps in sparse regions could enhance robustness.
- **Anisotropic Artifacts:** The "splotchy" or elongated artifacts suggest that anisotropy, while powerful, can sometimes be over-optimized or misaligned, leading to unnatural appearances. Regularization or geometric constraints could help.
- **Dynamic Scenes:** The current method targets static scenes. Extending 3D Gaussian Splatting to dynamic environments (e.g., deformable objects, moving scenes) would be a significant but challenging next step, requiring tracking and optimization of many more moving parts.

Overall, 3D Gaussian Splatting is a landmark paper that delivers on a long-standing promise of radiance fields. Its practical implications are enormous, pushing the boundaries of what's possible in real-time photorealistic rendering and setting a new benchmark for future research.