
3D Gaussian Splatting for Real-Time Radiance Field Rendering

Published: 08/08/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents 3D Gaussian Splatting for real-time radiance field rendering, built on three key components: an explicit 3D Gaussian scene representation, interleaved optimization and adaptive density control of anisotropic covariances, and a fast visibility-aware rendering algorithm, achieving real-time (≥ 30 fps) novel-view synthesis at 1080p resolution.

Abstract

Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.


In-depth Reading


1. Bibliographic Information

1.1. Title

3D Gaussian Splatting for Real-Time Radiance Field Rendering

1.2. Authors

BERNHARD KERBL*, Inria, Université Côte d'Azur, France
GEORGIOS KOPANAS*, Inria, Université Côte d'Azur, France
THOMAS LEIMKÜHLER, Max-Planck-Institut für Informatik, Germany
GEORGE DRETTAKIS, Inria, Université Côte d'Azur, France
(*Both authors contributed equally to the paper.)

1.3. Journal/Conference

ACM Transactions on Graphics (TOG) 42, 4, Article 1 (August 2023). This is a highly prestigious journal in the field of computer graphics, known for publishing groundbreaking research. Publication in ACM TOG signifies significant impact and rigorous peer review within the computer graphics community.

1.4. Publication Year

2023

1.5. Abstract

Radiance Field methods, particularly Neural Radiance Fields (NeRFs), have significantly advanced novel-view synthesis from captured scenes. However, achieving high visual quality typically involves costly neural networks for training and rendering, while faster alternatives compromise quality. Existing methods struggle to provide real-time (≥ 30 fps) 1080p rendering for complete, unbounded scenes. This paper introduces a novel approach incorporating three key elements to achieve state-of-the-art visual quality with competitive training times and, critically, real-time novel-view synthesis at 1080p resolution. First, the scene is represented using 3D Gaussians, initialized from sparse Structure-from-Motion (SfM) points. This representation maintains the desirable properties of continuous volumetric radiance fields for optimization while efficiently avoiding computation in empty space. Second, an interleaved optimization and adaptive density control strategy is employed for the 3D Gaussians, notably optimizing anisotropic covariance to accurately represent scene geometry. Third, a fast, visibility-aware rendering algorithm is developed, supporting anisotropic splatting, which both accelerates training and enables real-time rendering. The method demonstrates state-of-the-art visual quality and real-time rendering performance across several established datasets.

https://arxiv.org/abs/2308.04079v1 The paper was initially published as a preprint on arXiv and later published in ACM Transactions on Graphics.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the trade-off between visual quality and rendering speed in novel-view synthesis using radiance fields. Recent advances, particularly with Neural Radiance Fields (NeRFs), have revolutionized the quality of synthesizing new views from a set of captured images. However, these methods, especially those achieving the highest visual fidelity like Mip-NeRF360, are computationally expensive, requiring extensive training times (tens of hours) and rendering times that are far from real-time (seconds per frame). While faster NeRF variants like InstantNGP and Plenoxels have emerged, they often sacrifice visual quality or still fall short of true real-time display rates (≥ 30 fps) for high-resolution (1080p) and complex, unbounded scenes. The existing challenge is to achieve state-of-the-art (SOTA) visual quality, competitive training times, and real-time rendering simultaneously, especially for diverse scene types.

The paper's entry point or innovative idea is to combine the advantages of explicit, unstructured representations (like point clouds) with the differentiable properties of volumetric radiance fields. Instead of continuous neural representations or structured grids, they propose using 3D Gaussians as the fundamental scene primitive. This allows for explicit control over scene elements, efficient rendering via splatting, and an optimization process that benefits from the volumetric nature for accurate scene reconstruction.

2.2. Main Contributions / Findings

The paper introduces three primary contributions:

  1. Anisotropic 3D Gaussians as a High-Quality, Unstructured Representation: The authors propose using 3D Gaussians with anisotropic covariance as a flexible and expressive scene representation. These Gaussians are initialized from sparse Structure-from-Motion (SfM) point clouds, preserving the advantages of differentiable volumetric representations for optimization while avoiding the computational overhead of empty space. This explicit, unstructured nature allows for efficient rasterization.

  2. Interleaved Optimization with Adaptive Density Control: A novel optimization method is presented that adjusts the properties of the 3D Gaussians (position, opacity, anisotropic covariance, and Spherical Harmonic (SH) coefficients). This optimization is interleaved with adaptive density control steps, which involve cloning Gaussians in under-reconstructed areas and splitting large Gaussians in high-variance regions. This dynamic process allows the model to accurately represent complex scenes with a compact set of Gaussians.

  3. Fast, Differentiable, Visibility-Aware Rendering: The paper develops a tile-based rasterizer that supports anisotropic splatting and respects visibility order through efficient GPU sorting. This rasterizer is fully differentiable, enabling fast backpropagation for training without imposing arbitrary limits on the number of Gaussians contributing gradients. This design is crucial for both accelerating training and achieving real-time rendering.

    The key findings are that this method achieves:

  • State-of-the-art visual quality that is competitive with or surpasses the best implicit radiance field approaches (e.g., Mip-NeRF360).
  • Training speeds that are competitive with the fastest methods (e.g., InstantNGP, Plenoxels), reducing training time from tens of hours to minutes.
  • Crucially, it provides the first real-time (≥ 30 fps) high-quality novel-view synthesis at 1080p resolution across a wide variety of complex, unbounded scenes, a feat previously unachieved by any other method.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Radiance Fields

A radiance field is a function that maps any 3D point (x, y, z) and any 2D viewing direction $(\theta, \phi)$ to an emitted color (R, G, B) and a volume density $\sigma$. This concept is central to novel-view synthesis because it allows for synthesizing images of a 3D scene from arbitrary viewpoints. The volume density $\sigma$ represents the probability of a ray terminating at a given point, allowing for soft volumetric effects like smoke or translucent objects.

Novel-View Synthesis

Novel-view synthesis is the computer graphics problem of generating new images of a 3D scene from previously unobserved camera viewpoints, using a set of existing images of that scene. The goal is to create photorealistic and geometrically consistent images.

Neural Radiance Fields (NeRF)

Neural Radiance Fields (NeRF) are a specific type of radiance field that uses a Multi-Layer Perceptron (MLP) neural network to represent the scene. The MLP is trained to output color and volume density for any 3D coordinate and viewing direction. Rendering an image with NeRF involves ray-marching through the scene, sampling points along each ray, querying the MLP at these points, and then accumulating the sampled colors and densities using a volumetric rendering equation.

Structure-from-Motion (SfM)

Structure-from-Motion (SfM) is a photogrammetric range imaging technique for estimating the 3D structure of a scene from a set of 2D images. It also simultaneously recovers the camera parameters (pose, intrinsics) for each image. SfM outputs a sparse point cloud, which consists of 3D points corresponding to distinctive features detected across multiple images. These points are typically used as an initial geometric estimate of the scene.

Multi-view Stereo (MVS)

Multi-view Stereo (MVS) is a technique that takes the camera poses and intrinsic parameters (often derived from SfM) and a set of images to produce a dense 3D reconstruction of the scene. Unlike SfM's sparse point clouds, MVS aims to generate a comprehensive representation, often in the form of dense point clouds, meshes, or depth maps. While MVS provides more detailed geometry, it can struggle with featureless or shiny surfaces and may produce over-reconstruction (artifacts) or under-reconstruction (holes).

Spherical Harmonics (SH)

Spherical Harmonics (SH) are a set of basis functions defined on the surface of a sphere, analogous to Fourier series on a circle. In computer graphics, they are commonly used to represent spatially varying directional information, such as lighting or view-dependent appearance. For radiance fields, SH coefficients can encode how the color of a point changes when viewed from different directions, allowing for more realistic rendering of diffuse and mildly specular surfaces without using complex Bidirectional Reflectance Distribution Functions (BRDFs). The order of SH determines the complexity of the directional representation (e.g., zero-order for diffuse, higher orders for more complex effects).
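To make the role of SH coefficients concrete, the following is a minimal sketch (not the paper's code) that evaluates a view-dependent color from degree-0 and degree-1 real SH basis functions; the basis constants are the standard real-SH values, while the sign ordering and the 0.5 offset/clamp follow a common convention and should be treated as assumptions.

```python
import numpy as np

# Real spherical-harmonic basis constants for degrees 0 and 1
# (the paper uses 4 SH bands; higher degrees follow the same pattern).
SH_C0 = 0.28209479177387814   # degree-0 basis (constant term)
SH_C1 = 0.4886025119029199    # degree-1 basis (linear in the view direction)

def sh_to_color(sh_coeffs, view_dir):
    """Evaluate a view-dependent RGB color from low-order SH coefficients.

    sh_coeffs: (4, 3) array, one RGB coefficient per basis function
               (1 coefficient for degree 0, 3 for degree 1).
    view_dir:  3-vector pointing from the Gaussian toward the camera.
    """
    x, y, z = view_dir / np.linalg.norm(view_dir)
    color = SH_C0 * sh_coeffs[0]                    # diffuse, view-independent term
    color += SH_C1 * (-y * sh_coeffs[1]             # degree-1 terms add a smooth
                      + z * sh_coeffs[2]            # directional variation
                      - x * sh_coeffs[3])
    return np.clip(color + 0.5, 0.0, 1.0)           # offset and clamp (a common convention)

# Example: a grey Gaussian that gets slightly brighter when viewed from +z.
coeffs = np.zeros((4, 3))
coeffs[0] = 1.0
coeffs[2] = 0.3
print(sh_to_color(coeffs, np.array([0.0, 0.0, 1.0])))
```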

Alpha Blending

Alpha blending is a technique used in computer graphics to combine an image foreground with a background image. It uses an alpha channel (opacity value) to determine how transparent or opaque each pixel is. In volumetric rendering or point-based rendering, alpha blending is crucial for accumulating color and opacity along a ray or through overlapping primitives. Objects closer to the camera obscure those further away based on their opacity. The standard volumetric rendering equation (derived from radiative transfer theory) can be re-written as an alpha blending sequence.
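As a tiny illustration of this accumulation, the snippet below composites three overlapping splats front-to-back at a single pixel; the colors and opacities are made-up values, not data from the paper.

```python
# Front-to-back alpha compositing of three overlapping splats at one pixel.
splats = [  # (color, alpha), sorted nearest-to-farthest
    ((1.0, 0.0, 0.0), 0.6),
    ((0.0, 1.0, 0.0), 0.5),
    ((0.0, 0.0, 1.0), 0.9),
]

color = [0.0, 0.0, 0.0]
transmittance = 1.0               # T_1 = 1: nothing blocks the first splat
for c, a in splats:
    for k in range(3):
        color[k] += transmittance * a * c[k]
    transmittance *= (1.0 - a)    # T_{i+1} = T_i * (1 - alpha_i)

print(color, transmittance)       # leftover transmittance = how much background still shows
```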

Splatting

Splatting is a point-based rendering technique where each 3D point (or primitive) is projected onto the image plane and "splatted" or distributed across multiple pixels, typically using a 2D kernel (e.g., a Gaussian distribution) to determine its influence and smooth the appearance. This helps to fill in gaps between sparsely projected points and reduce aliasing artifacts, creating a continuous-looking surface from discrete elements. Anisotropic splatting allows the shape of this 2D kernel to vary, adapting to the underlying geometry and projection distortions.

Gaussians

In a general sense, a Gaussian function (or normal distribution) is a bell-shaped curve often used to model probability distributions or as a smoothing kernel.

  • 1D Gaussian: $ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $
  • 3D Gaussian: In 3D, a Gaussian describes an ellipsoid centered at a mean $\mu$, with its shape and orientation determined by a covariance matrix $\Sigma$. The function indicates the "intensity" or "density" at any point $x$ relative to the mean $\mu$: $ G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)} $ where $\mu$ is the mean (3D position) and $\Sigma$ is the $3 \times 3$ covariance matrix.
  • Anisotropic Covariance: A covariance matrix that is not isotropic (i.e., not a multiple of the identity matrix) allows the Gaussian to have different spreads along different axes, resulting in an ellipsoidal shape rather than a perfect sphere. This anisotropy is crucial for representing fine structures or surfaces that are not aligned with coordinate axes, making the representation more compact and accurate.
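A minimal numpy sketch of evaluating the (unnormalized) anisotropic 3D Gaussian defined above; the covariance values are illustrative only.

```python
import numpy as np

def gaussian_3d(x, mu, cov):
    """Unnormalized anisotropic 3D Gaussian G(x) = exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

mu = np.zeros(3)
# An anisotropic covariance: elongated along x, very thin along z (a flat, ellipsoidal splat).
cov = np.diag([1.0, 0.25, 0.01])

print(gaussian_3d(np.array([1.0, 0.0, 0.0]), mu, cov))  # ~0.61: one standard deviation along x
print(gaussian_3d(np.array([0.0, 0.0, 0.1]), mu, cov))  # ~0.61: only 0.1 along z is already one sigma
```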

3.2. Previous Works

Traditional Scene Reconstruction and Rendering

  • Light Fields (Gortler et al. 1996; Levoy and Hanrahan 1996): Early approaches to novel-view synthesis that captured a dense grid of images from a scene. Rendering involves interpolating between these densely sampled views. While high quality, they required immense data capture and were mostly limited to static, bounded scenes.
  • SfM (Snavely et al. 2006) and MVS (Goesele et al. 2007): These methods enabled reconstruction from unstructured photo collections. SfM provides sparse 3D points and camera poses, while MVS builds dense geometry (meshes, depth maps). Subsequent view synthesis methods (e.g., Chaurasia et al. 2013; Hedman et al. 2018) reprojected and blended input images based on this geometry. While effective, they suffered from over-reconstruction or under-reconstruction artifacts in challenging areas.

Neural Rendering and Radiance Fields

  • Early Deep Learning for Novel-View Synthesis (Flynn et al. 2016; Hedman et al. 2018; Riegler and Koltun 2020): Applied Convolutional Neural Networks (CNNs) for tasks like estimating blending weights or texture space solutions. Often still relied on MVS geometry, inheriting its limitations, and CNN-based rendering could lead to temporal flickering.
  • Volumetric Representations (Henzler et al. 2019; Sitzmann et al. 2019): Introduced differentiable density fields, leveraging volumetric ray-marching. These were precursors to NeRFs.
  • Neural Radiance Fields (NeRF) (Mildenhall et al. 2020): A breakthrough method that uses an MLP to implicitly represent a scene's color and density. It introduced positional encoding and importance sampling for high-quality results.
    • Volumetric Rendering Equation (from Section 2.3 of the paper): The color $C$ along a ray is given by: $ C = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) \mathbf{c}_i \quad \mathrm{with} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j \right) $ where:
      • $N$: The number of samples along the ray.
      • $T_i$: The transmittance (accumulated transparency) from the ray origin to sample $i$. It represents the probability that the ray reaches sample $i$ without being obstructed.
      • $\sigma_i$: The volume density at sample $i$, indicating the probability of a ray terminating at this point.
      • $\delta_i$: The distance between sample $i$ and sample $i-1$ along the ray.
      • $\mathbf{c}_i$: The color at sample $i$. This can be rewritten using $\alpha_i$ (the opacity of sample $i$): $ C = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i $ with $ \alpha_i = 1 - \exp(-\sigma_i \delta_i) \mathrm{~and~} T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) $
      • $\alpha_i$: The opacity of the $i$-th sample.
      • The product $\prod_{j=1}^{i-1} (1 - \alpha_j)$ is an alternative way to compute transmittance: the accumulated transparency up to sample $i-1$.
    • Mip-NeRF360 (Barron et al. 2022): The current state-of-the-art in image quality for novel-view synthesis for unbounded scenes, building upon NeRF by addressing anti-aliasing and scene bounds. Known for outstanding quality but extremely high training (up to 48 hours) and rendering times (seconds per frame).
  • Faster NeRFs:
    • InstantNGP (Müller et al. 2022): Uses a multi-resolution hash grid and occupancy grid to accelerate computation and a smaller MLP. Significantly faster training (minutes) and interactive rendering (10-15 fps) but with a trade-off in quality compared to Mip-NeRF360.
    • Plenoxels (Fridovich-Keil and Yu et al. 2022): Represents radiance fields with a sparse voxel grid that interpolates a continuous density field, forgoing neural networks entirely. Achieves fast training and interactive rendering, but also with quality limitations compared to SOTA. Both InstantNGP and Plenoxels rely on Spherical Harmonics for directional appearance and are hindered by ray-marching for rendering.

Point-Based Rendering and Radiance Fields

  • Traditional Point-Based Rendering (Pfister et al. 2000; Zwicker et al. 2001b): Renders unstructured sets of points by splatting them as larger primitives (discs, ellipsoids, surfels) to avoid holes and reduce aliasing.
  • Differentiable Point-Based Rendering (Wiles et al. 2020; Yifan et al. 2019): Enabled end-to-end training of point cloud representations.
  • Neural Point-Based Graphics (Aliev et al. 2020; Kopanas et al. 2021; Rückert et al. 2022 - ADOP): Augmented points with neural features and rendered using CNNs for fast view synthesis. Often depended on MVS for initial geometry, inheriting its artifacts. Pulsar (Lassner and Zollhofer 2021) achieved fast sphere rasterization, inspiring tile-based rendering but often used order-independent transparency.
    • Point-Based $\alpha$-Blending (from Section 2.3 of the paper): A typical neural point-based approach computes the color $C$ of a pixel by blending $N$ ordered points overlapping the pixel: $ C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) $ where:
      • $c_i$: The color of each point (or splat).
      • $\alpha_i$: The opacity of each point, often derived from evaluating a 2D Gaussian and multiplying by a learned per-point opacity. This formula is essentially identical to the re-written volumetric rendering equation, highlighting the shared image formation model.
  • Point-NeRF (Xu et al. 2022): Uses points to represent a radiance field with radial basis functions and employs pruning/densification. However, it still uses volumetric ray-marching and cannot achieve real-time rates.
  • 3D Gaussians in specific contexts (Rhodin et al. 2015; Stoll et al. 2011; Wang et al. 2023; Lombardi et al. 2021): 3D Gaussians (or similar primitives) have been used for specialized tasks like human performance capture or isolated object reconstruction, where depth complexity is often limited.

3.3. Technological Evolution

The evolution of novel-view synthesis has progressed from explicit geometry-based methods (e.g., MVS with image blending) to implicit continuous representations (NeRFs), and then to methods focusing on improving the speed of NeRFs (e.g., InstantNGP, Plenoxels) often by introducing structured grids. Simultaneously, point-based rendering has evolved from basic splatting to differentiable versions, eventually incorporating neural features.

This paper's work (3D Gaussian Splatting) fits into the timeline by bridging the gap between explicit, GPU-friendly representations and the differentiable, high-quality optimization capabilities of volumetric radiance fields. It leverages the expressiveness of Gaussians and couples it with a highly optimized, visibility-aware rasterization pipeline to overcome the speed limitations of traditional ray-marching and the quality limitations of previous fast methods, pushing the field towards practical real-time high-fidelity rendering.

3.4. Differentiation Analysis

Compared to the main methods in related work, 3D Gaussian Splatting (3DGS) introduces several core differences and innovations:

  • Vs. Mip-NeRF360 (SOTA Quality, Slow): Mip-NeRF360 achieves the highest quality but at the cost of extremely long training times (48 hours) and very slow rendering (seconds per frame) due to its implicit continuous representation and ray-marching. 3DGS matches or surpasses this quality with training times reduced to minutes (35-45 minutes) and achieves real-time rendering (≥ 30 fps). This is a monumental leap in efficiency without sacrificing quality. The explicit nature of 3D Gaussians and the efficient rasterizer are key here.
  • Vs. InstantNGP/Plenoxels (Fast Training/Interactive Rendering, Lower Quality): While InstantNGP and Plenoxels significantly reduce training time and offer interactive rendering speeds (10-15 fps), they often struggle to achieve the same peak visual quality as Mip-NeRF360, particularly in complex scenes, due to limitations of their structured grid representations and the inherent cost of ray-marching. 3DGS achieves superior visual quality while maintaining competitive training times and true real-time rendering, outperforming them in the quality-speed balance. The anisotropic Gaussians allow for more precise representation of fine details than fixed-resolution grids.
  • Vs. Other Point-Based Methods (e.g., ADOP, Neural Point-Based Graphics): Many prior point-based methods either relied on MVS geometry (introducing artifacts) or used CNNs for rendering (leading to temporal instability). 3DGS avoids MVS by initializing from sparse SfM points and optimizing the Gaussians directly. Its visibility-aware rasterizer (which sorts primitives) combined with Spherical Harmonics avoids the temporal instability issues of CNN-based rendering and provides higher visual fidelity than simple point sprites. Crucially, 3DGS does not require MVS and scales to complex, unbounded scenes, unlike some object-specific point-based approaches.
  • Novelty of Representation: The choice of anisotropic 3D Gaussians as a differentiable primitive is a core innovation. It offers the flexibility of an unstructured explicit representation (allowing dynamic creation/deletion/movement), the differentiability of a volumetric representation (crucial for optimization), and direct projectability to 2D for efficient splatting. This combines the "best of both worlds" from continuous NeRFs and explicit point clouds.
  • Efficient Rendering Pipeline: The tile-based rasterizer with fast GPU sorting and visibility-aware alpha blending for anisotropic splats is a significant architectural innovation, enabling real-time performance. Its ability to backpropagate gradients over an arbitrary number of blended Gaussians is also critical for high-quality optimization, addressing limitations of prior fast rendering methods.

4. Methodology

4.1. Principles

The core idea behind 3D Gaussian Splatting (3DGS) is to represent a 3D scene not as an implicit neural function or a structured grid, but as a collection of discrete, explicit 3D Gaussians. Each Gaussian is a volumetric primitive defined by its 3D position (mean), its 3D anisotropic covariance (which determines its shape and orientation), its opacity (alpha), and Spherical Harmonic (SH) coefficients for its color and view-dependent appearance.

This choice of 3D Gaussians is principled for several reasons:

  1. Differentiability: Gaussians are inherently differentiable, allowing for direct optimization of their parameters using gradient descent, similar to NeRFs.

  2. Volumetric Properties: They behave like continuous volumetric representations during optimization, capable of modeling density and transparency.

  3. Explicit and Unstructured: Unlike implicit NeRFs, which require costly sampling via ray-marching, Gaussians are explicit entities. This allows for direct manipulation (creation, deletion, movement) and avoids computation in empty space. Being unstructured, they can adapt to arbitrary scene geometry without grid artifacts.

  4. Efficient Projection and Rendering: 3D Gaussians can be analytically projected onto the 2D image plane, resulting in 2D elliptical splats. These splats can then be efficiently rasterized and blended using standard alpha blending techniques, leveraging highly optimized GPU pipelines.

  5. Anisotropy for Compactness: The anisotropic covariance allows Gaussians to stretch and orient themselves to accurately represent surfaces, thin structures, or large homogeneous regions with fewer primitives, leading to a more compact representation.

    The theoretical basis combines concepts from volumetric rendering, point-based rendering, and differentiable optimization. The volumetric rendering equation dictates how colors and opacities are accumulated along rays. Point-based rendering principles guide the efficient projection and splatting of Gaussians. Differentiable optimization allows the system to learn the optimal parameters for each Gaussian from multiple input views.

4.2. Core Methodology In-depth (Layer by Layer)

The overall process can be broken down into initialization, representation, optimization with adaptive density control, and fast differentiable rendering. An overview of the method is illustrated in Fig. 2 of the paper.

Figure 2 (from the paper): a schematic of the pipeline from SfM points to rendered images, covering initialization, the 3D Gaussian representation, projection, and adaptive density control; arrows indicate the flow of operations and of gradients.

The Figure 2 from the original paper provides a high-level overview of the 3D Gaussian Splatting pipeline. It shows the input SfM points being used to initialize 3D Gaussians. These Gaussians then undergo optimization and adaptive density control to accurately represent the scene. Finally, a fast differentiable rasterizer projects and blends these Gaussians to render novel views, with gradients flowing back to update Gaussian parameters.

4.2.1. Initializing 3D Gaussians

The process begins with a set of input images of a static scene and their corresponding camera parameters, obtained through Structure-from-Motion (SfM) [Schönberger and Frahm 2016]. A key byproduct of SfM is a sparse point cloud.

  • The 3D Gaussians are initialized directly from these SfM points. Each SfM point becomes the mean ($\mu$) of a new 3D Gaussian.

  • The initial covariance matrix ($\Sigma$) for each Gaussian is set to be isotropic, i.e. spherical, with its radius determined by the mean distance to the three closest SfM points.

  • Initial opacity ($\alpha$) values are typically low, and Spherical Harmonic (SH) coefficients are initialized to represent the color of the corresponding SfM point.

    For specific cases like the synthetic NeRF-synthetic dataset, the method can even achieve high quality with random initialization (e.g., 100K uniformly random Gaussians within the scene bounds), which are then automatically pruned and refined.
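A rough sketch of this initialization step, assuming numpy and scipy are available; the function name, the initial opacity value, and the log-scale/SH conventions are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

SH_C0 = 0.28209479177387814  # degree-0 real SH basis constant

def init_gaussians_from_sfm(points, colors):
    """Initialize per-Gaussian parameters from a sparse SfM point cloud.

    points: (N, 3) SfM positions; colors: (N, 3) RGB in [0, 1].
    """
    tree = cKDTree(points)
    # Distance to the 3 nearest neighbours (k=4 because the first hit is the point itself).
    dists, _ = tree.query(points, k=4)
    mean_dist = dists[:, 1:].mean(axis=1)

    means = points.copy()
    scales = np.repeat(np.log(np.maximum(mean_dist, 1e-7))[:, None], 3, axis=1)  # isotropic
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (len(points), 1))                  # identity quaternions
    opacities = np.full((len(points), 1), 0.1)                                   # low initial alpha
    sh_dc = (colors - 0.5) / SH_C0                 # zeroth-order SH coefficients matching the colors
    return means, scales, rotations, opacities, sh_dc
```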

4.2.2. Representing the Scene with 3D Gaussians

Each 3D Gaussian is characterized by several properties that are optimized during training:

  1. 3D Position (Mean $\mu$): A 3D vector (x, y, z) representing the center of the Gaussian.

  2. Opacity ($\alpha$): A scalar value in [0, 1) indicating the Gaussian's transparency.

  3. Anisotropic Covariance ($\Sigma$): A $3 \times 3$ symmetric positive semi-definite matrix that defines the shape, size, and orientation of the Gaussian ellipsoid in 3D space.

  4. Spherical Harmonic (SH) Coefficients: A set of coefficients that encode the view-dependent color ($\mathbf{c}$) of the Gaussian. The paper uses 4 bands of SH.

    The 3D Gaussian function itself is defined as: $ G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)} $ where:

  • $x$: A 3D point in space.

  • $\mu$: The 3D mean (position) of the Gaussian.

  • $\Sigma$: The $3 \times 3$ covariance matrix.

    For rendering, these 3D Gaussians need to be projected to 2D. The projection of a 3D Gaussian to image space results in a 2D Gaussian. This is achieved by computing a 2D covariance matrix ($\Sigma'$) from the 3D covariance matrix ($\Sigma$). Given a viewing transformation $W$ (the rotation and translation from world to camera coordinates), the covariance matrix in camera coordinates is obtained as $ \Sigma_c = W \Sigma W^T $. This camera-space covariance is then projected to screen space: following Zwicker et al. [2001a], with $J$ the Jacobian of the affine approximation of the projective transformation, the 2D covariance matrix $\Sigma'$ in image space is $ \Sigma' = J W \Sigma W^T J^T $. The paper simplifies this by stating that "if we skip the third row and column of $\Sigma'$, we obtain a $2 \times 2$ variance matrix with the same structure and properties as if we would start from planar points with normals"; in practice this means the $2 \times 2$ upper-left submatrix of the full $3 \times 3$ projected covariance is used.

To optimize the covariance matrix $\Sigma$ using gradient descent, directly operating on $\Sigma$ is problematic because it must remain positive semi-definite. To overcome this, the authors decompose $\Sigma$ into a scaling matrix $S$ and a rotation matrix $R$: $ \Sigma = R S S^T R^T $ where:

  • $S$: A diagonal matrix defined by a 3D vector $s = (s_x, s_y, s_z)$ for scaling along the Gaussian's principal axes.

  • $R$: A rotation matrix derived from a quaternion $q$.

    The parameters actually stored and optimized are the 3D vector $s$ (for scaling) and the quaternion $q$ (for rotation). These are then converted to $S$ and $R$ to reconstruct $\Sigma$. This parameterization guarantees that $\Sigma$ remains valid (positive semi-definite) during optimization. The quaternion $q$ is normalized to ensure it is a unit quaternion.
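The following sketch builds $\Sigma = R S S^T R^T$ from a quaternion and a scale vector and then projects it to a 2D screen-space covariance with the Jacobian of a pinhole projection. It assumes the covariance has already been rotated into camera coordinates (i.e. the $W \Sigma W^T$ step is done), and the focal lengths and example values are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion q = (qr, qi, qj, qk); q is normalized first."""
    qr, qi, qj, qk = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(qj*qj + qk*qk), 2*(qi*qj - qr*qk),     2*(qi*qk + qr*qj)],
        [2*(qi*qj + qr*qk),     1 - 2*(qi*qi + qk*qk), 2*(qj*qk - qr*qi)],
        [2*(qi*qk - qr*qj),     2*(qj*qk + qr*qi),     1 - 2*(qi*qi + qj*qj)],
    ])

def covariance_3d(scale, quat):
    """Sigma = R S S^T R^T: symmetric positive semi-definite by construction."""
    M = quat_to_rotmat(quat) @ np.diag(scale)
    return M @ M.T

def covariance_2d(cov_cam, mean_cam, fx, fy):
    """Project a camera-space 3D covariance to a 2x2 screen-space covariance using
    the Jacobian J of the affine approximation of the pinhole projection."""
    tx, ty, tz = mean_cam
    J = np.array([
        [fx / tz, 0.0,     -fx * tx / tz**2],
        [0.0,     fy / tz, -fy * ty / tz**2],
    ])
    return J @ cov_cam @ J.T   # equivalent to the upper-left 2x2 block of the full projection

# Example: a flat, disc-like Gaussian five units in front of the camera.
cov = covariance_3d(np.array([0.5, 0.5, 0.01]), np.array([1.0, 0.0, 0.0, 0.0]))
print(covariance_2d(cov, mean_cam=np.array([0.0, 0.0, 5.0]), fx=1000.0, fy=1000.0))
```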

4.2.3. Differentiable Gradient Computation

To avoid significant overhead from automatic differentiation for all parameters, the gradients for all parameters are derived explicitly. This is crucial for the efficiency of the optimization process. The details of these derivative computations are provided in Appendix A.

In Appendix A, the gradients of the 2D covariance matrix $\Sigma'$ with respect to the scaling vector $s$ and quaternion $q$ are derived using the chain rule. Recall that $\Sigma$ is the world-space covariance matrix and $\Sigma'$ is the projected 2D (view-space) covariance matrix; $q$ is the rotation quaternion, $s$ is the 3D scaling vector, $W$ is the viewing transformation, and $J$ is the Jacobian of the affine approximation of the projective transformation.

The chain rule is applied:
$ \frac{d\Sigma'}{ds} = \frac{d\Sigma'}{d\Sigma} \frac{d\Sigma}{ds} \quad \mathrm{and} \quad \frac{d\Sigma'}{dq} = \frac{d\Sigma'}{d\Sigma} \frac{d\Sigma}{dq} $
To simplify notation, let $U = JW$. Then $\Sigma'$ is the (symmetric) upper-left $2 \times 2$ block of $U \Sigma U^T$, and the partial derivatives of $\Sigma'$ with respect to the elements of $\Sigma$ are
$ \frac{\partial \Sigma'}{\partial \Sigma_{ij}} = \left( \begin{array}{cc} U_{1,i}U_{1,j} & U_{1,i}U_{2,j} \\ U_{1,j}U_{2,i} & U_{2,i}U_{2,j} \end{array} \right) $
Next, the derivatives $\frac{d\Sigma}{ds}$ and $\frac{d\Sigma}{dq}$ are needed. Since $\Sigma = R S S^T R^T$, let $M = RS$ so that $\Sigma = M M^T$. The partial derivative of $\Sigma$ with respect to $M$ is
$ \frac{d\Sigma}{dM} = 2M^T $
For scaling, the partial derivative of $M$ with respect to a component $s_k$ of $s$ is
$ \frac{\partial M_{i,j}}{\partial s_k} = \begin{cases} R_{i,k} & \mathrm{if~} j = k \\ 0 & \mathrm{otherwise} \end{cases} $
To derive gradients for rotation, recall the conversion from a unit quaternion $q = (q_r, q_i, q_j, q_k)$ to a rotation matrix $R(q)$:
$ R(q) = 2 \left( \begin{array}{ccc} \frac{1}{2} - (q_j^2 + q_k^2) & (q_i q_j - q_r q_k) & (q_i q_k + q_r q_j) \\ (q_i q_j + q_r q_k) & \frac{1}{2} - (q_i^2 + q_k^2) & (q_j q_k - q_r q_i) \\ (q_i q_k - q_r q_j) & (q_j q_k + q_r q_i) & \frac{1}{2} - (q_i^2 + q_j^2) \end{array} \right) $
From this, the gradients of $M$ with respect to the components of $q$ follow (with $s_x, s_y, s_z$ the diagonal elements of $S$):
$ \frac{\partial M}{\partial q_r} = 2 \left( \begin{array}{ccc} 0 & -s_y q_k & s_z q_j \\ s_x q_k & 0 & -s_z q_i \\ -s_x q_j & s_y q_i & 0 \end{array} \right) \qquad \frac{\partial M}{\partial q_i} = 2 \left( \begin{array}{ccc} 0 & s_y q_j & s_z q_k \\ s_x q_j & -2 s_y q_i & -s_z q_r \\ s_x q_k & s_y q_r & -2 s_z q_i \end{array} \right) $
$ \frac{\partial M}{\partial q_j} = 2 \left( \begin{array}{ccc} -2 s_x q_j & s_y q_i & s_z q_r \\ s_x q_i & 0 & s_z q_k \\ -s_x q_r & s_y q_k & -2 s_z q_j \end{array} \right) \qquad \frac{\partial M}{\partial q_k} = 2 \left( \begin{array}{ccc} -2 s_x q_k & -s_y q_r & s_z q_i \\ s_x q_r & -2 s_y q_k & s_z q_j \\ s_x q_i & s_y q_j & 0 \end{array} \right) $
The gradient for quaternion normalization is also straightforward.
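As a sanity check on the scaling part of this derivation, the snippet below compares the analytic derivative $\partial\Sigma/\partial s_k$ (obtained from $\Sigma = M M^T$ with $M = RS$) against a finite difference; it is an illustrative verification, not part of the paper's code.

```python
import numpy as np

def sigma_from(scale, R):
    M = R @ np.diag(scale)
    return M @ M.T

def dsigma_ds_analytic(scale, R, k):
    """d(M M^T)/ds_k = dM M^T + M dM^T, where dM/ds_k has R[:, k] in column k and zeros elsewhere."""
    M = R @ np.diag(scale)
    dM = np.zeros((3, 3))
    dM[:, k] = R[:, k]
    return dM @ M.T + M @ dM.T

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # a random orthonormal matrix
scale = np.array([0.7, 0.3, 0.1])

k, eps = 1, 1e-6
numeric = (sigma_from(scale + eps * np.eye(3)[k], R) - sigma_from(scale, R)) / eps
print(np.max(np.abs(numeric - dsigma_ds_analytic(scale, R, k))))   # tiny: the two agree
```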

4.2.4. Optimization with Adaptive Density Control

The core of the method's learning process involves optimizing the 3D Gaussian parameters interleaved with steps to manage the density and number of Gaussians.

  • Optimization Loop: The optimization uses Stochastic Gradient Descent (SGD) techniques.

    • Activation Functions: A sigmoid function is used for opacity ($\alpha$) to constrain it to the range [0, 1) and provide smooth gradients. An exponential activation function is used for the scale components ($s_x, s_y, s_z$) to ensure they remain positive.
    • Initial Covariance: Initially, the covariance matrix is isotropic, based on the distance to the closest three SfM points.
    • Learning Rate Schedule: An exponential decay scheduling technique is used for position learning rates, similar to Plenoxels.
    • Loss Function: The training objective minimizes a combination of an $\mathcal{L}_1$ loss and a D-SSIM (Differentiable Structural Similarity Index Measure) term: $ \mathcal{L} = (1 - \lambda) \mathcal{L}_1 + \lambda \mathcal{L}_{\mathrm{D-SSIM}} $ where:
      • $\mathcal{L}_1$: The mean absolute error (MAE) between the rendered image and the ground-truth image, which encourages pixel-wise accuracy.
      • $\mathcal{L}_{\mathrm{D-SSIM}}$: A differentiable version of the SSIM metric, which measures perceptual similarity based on luminance, contrast, and structure, making the results visually more pleasing.
      • $\lambda$: A weighting factor, set to 0.2 in all experiments, balancing the contribution of the $\mathcal{L}_1$ and D-SSIM terms (a runnable sketch of this loss follows the Algorithm 1 description below).
    • Spherical Harmonics (SH) Optimization: To address the sensitivity of SH coefficients to missing angular information (e.g., in corner captures), the optimization of SH is phased. Initially, only the zero-order component (base/diffuse color) is optimized. After every 1000 iterations, an additional SH band is introduced until all 4 bands are represented.
    • Warm-up: For stability, optimization starts at a lower image resolution (4 times smaller) and is upsampled twice after 250 and 500 iterations.
  • Adaptive Density Control: This crucial mechanism dynamically adjusts the number and density of 3D Gaussians to accurately represent the scene. It helps to populate empty areas (under-reconstruction) and refine regions with too few large Gaussians (over-reconstruction).

    • Trigger for Densification: Densification occurs every 100 iterations after an initial optimization warm-up.

    • Criteria for Densification: Gaussians with an average magnitude of view-space position gradients above a threshold ($\tau_{\mathrm{pos}} = 0.0002$) are targeted for densification. High gradients indicate regions that are not yet well reconstructed, prompting the optimization to move Gaussians.

    • Cloning: For small Gaussians in under-reconstructed regions, a copy of the Gaussian is created (cloned) with the same size and moved slightly in the direction of its positional gradient. This helps to cover new geometry.

    • Splitting: For large Gaussians in regions with high variance (often over-reconstruction), the Gaussian is replaced by two new, smaller Gaussians. Their scales are divided by a factor of $\phi = 1.6$, and their positions are initialized by sampling from the original Gaussian's probability density function (PDF).

    • Pruning (Removal): Gaussians with opacity ($\alpha$) below a threshold ($\epsilon_{\alpha}$) are removed. This cleans up transparent or unnecessary Gaussians.

    • Periodic Opacity Reset: Every $N = 3000$ iterations, the $\alpha$ values of all Gaussians are set close to zero. This forces the optimization to re-learn opacities, allowing the system to shed floaters (incorrectly placed Gaussians) and remove Gaussians that are no longer needed. A minimal code sketch of this clone/split/prune logic is given after the Figure 4 description below.

      The adaptive Gaussian densification scheme is visually summarized in Figure 4 from the paper.

      Fig. 4. Our adaptive Gaussian densification scheme. Top row (under-reconstruction): When small-scale geometry (black outline) is insufficiently covered, we clone the respective Gaussian. Bottom row (over-reconstruction): If small-scale geometry is represented by one large splat, we split it in two.

Figure 4 illustrates how adaptive Gaussian densification works. The top row shows cloning: when small-scale geometry (black outline) is inadequately covered, the relevant Gaussian is cloned to expand coverage. The bottom row demonstrates splitting: if a large area of small-scale geometry is represented by a single large splat, it is split into two smaller Gaussians for more detailed representation.
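The clone/split/prune decisions described above can be summarized in a short sketch. This is a simplified, numpy-only illustration under several assumptions: parameters are plain arrays rather than optimizer-managed tensors, split children are jittered with axis-aligned noise instead of sampling the full rotated Gaussian PDF, and the size and opacity thresholds (other than $\tau_{\mathrm{pos}}$ and the 1.6 split factor) are placeholder values.

```python
import numpy as np

def densify_and_prune(means, scales, opacities, grad_accum,
                      tau_pos=0.0002, scale_thresh=0.01,
                      split_factor=1.6, eps_alpha=0.005):
    """One adaptive density-control step (illustrative).

    grad_accum: per-Gaussian average magnitude of view-space positional gradients.
    scales are raw per-axis extents, opacities have shape (N, 1).
    """
    needs_densify = grad_accum > tau_pos
    is_small = scales.max(axis=1) < scale_thresh

    clone_mask = needs_densify & is_small     # under-reconstruction: clone in place
    split_mask = needs_densify & ~is_small    # over-reconstruction: split into two

    # Clone: copy the Gaussian; the optimizer later moves the copy along its gradient.
    new_means = [means, means[clone_mask]]
    new_scales = [scales, scales[clone_mask]]
    new_opac = [opacities, opacities[clone_mask]]

    # Split: two children near the parent, with scales shrunk by the split factor.
    for _ in range(2):
        jitter = np.random.normal(scale=scales[split_mask])
        new_means.append(means[split_mask] + jitter)
        new_scales.append(scales[split_mask] / split_factor)
        new_opac.append(opacities[split_mask])

    means = np.concatenate(new_means)
    scales = np.concatenate(new_scales)
    opacities = np.concatenate(new_opac)

    keep = np.ones(len(means), dtype=bool)
    keep[:len(split_mask)] = ~split_mask             # drop the split parents
    keep &= opacities.squeeze(-1) > eps_alpha        # prune nearly transparent Gaussians
    return means[keep], scales[keep], opacities[keep]
```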

The optimization and densification algorithms are summarized in Algorithm 1 from Appendix B.

Algorithm 1 Optimization and Densification
w, h: width and height of the training images
M ← SfM Points                         ▷ Positions
S, C, A ← InitAttributes()             ▷ Covariances, Colors, Opacities
i ← 0                                  ▷ Iteration count
while not converged do
    V, I ← SampleTrainingView()        ▷ Camera V and image I
    ...

Algorithm 1: Optimization and Densification

  • Input: $w$ (width), $h$ (height) of training images.
  • Initialization:
    • M: the SfM points, giving the initial Gaussian positions.
    • S, C, A: initialized attributes of the Gaussians (covariances, colors, opacities); together with M these cover the position, covariance, color, and opacity parameters.
    • i: iteration counter, initialized to 0.
  • Main Loop: Continues while not converged.
    • V, I: Sample a Training View (Camera V and Image I).
    • ... (The pseudocode is cut off here in the provided text, but it would typically involve rendering a view, calculating loss, backpropagating gradients to update Gaussian parameters, and then applying densification/pruning steps at specified intervals.)
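Since the loop above revolves around rendering a view and minimizing the loss from Section 4.2.4, here is a runnable sketch of that loss with $\lambda = 0.2$. The SSIM term is computed globally over the image for brevity, whereas the paper uses a local, differentiable D-SSIM; the D-SSIM = (1 - SSIM)/2 convention is an assumption.

```python
import numpy as np

def l1_loss(pred, gt):
    return np.abs(pred - gt).mean()

def ssim_global(pred, gt, c1=0.01**2, c2=0.03**2):
    """Single-window SSIM over the whole image (pixel range assumed to be [0, 1])."""
    mu_x, mu_y = pred.mean(), gt.mean()
    var_x, var_y = pred.var(), gt.var()
    cov_xy = ((pred - mu_x) * (gt - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def training_loss(pred, gt, lam=0.2):
    """L = (1 - lambda) * L1 + lambda * L_D-SSIM, with lambda = 0.2 as in the paper."""
    d_ssim = (1.0 - ssim_global(pred, gt)) / 2.0
    return (1 - lam) * l1_loss(pred, gt) + lam * d_ssim

rng = np.random.default_rng(0)
gt = rng.random((64, 64, 3))
pred = np.clip(gt + rng.normal(scale=0.05, size=gt.shape), 0, 1)
print(round(float(training_loss(pred, gt)), 4))
```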

4.2.5. Fast Differentiable Rasterizer for Gaussians

The rasterizer is designed for speed and differentiability, allowing for real-time rendering and efficient gradient computation during training.

  • Tile-Based Architecture:

    • The screen is divided into 16x16 pixel tiles.
    • Frustum Culling: 3D Gaussians are culled (removed) if their 99% confidence interval does not intersect the view frustum. A guard band is used to reject Gaussians near the near plane but far outside the view frustum to avoid numerical instability during 2D covariance projection.
    • Gaussian Instantiation and Key Assignment: Each Gaussian that overlaps multiple tiles is instantiated (duplicated) for each tile it covers. Each instance is assigned a key that combines its view-space depth and the tile ID it overlaps.
    • Fast GPU Sorting: All Gaussian instances are sorted globally based on these keys using a single GPU Radix sort. This ensures approximate depth ordering for all splats across the entire image.
    • Per-Tile Lists: After sorting, lists of Gaussians for each tile are generated by identifying the start and end indices in the sorted array for each tile ID.
  • Rasterization Pass (Forward Pass):

    • One thread block is launched for each tile.
    • Threads within a block collaboratively load packets of Gaussians into shared memory (fast on-chip memory).
    • For each pixel within the tile, color and alpha values are accumulated by traversing the per-tile Gaussian list from front-to-back, performing alpha blending.
    • Early Termination: Processing for a pixel stops when its alpha value reaches a target saturation (e.g., close to 1, specifically 0.9999). The processing of an entire tile terminates when all its pixels have saturated. This maximizes parallelism and avoids unnecessary computation for occluded regions.
  • Differentiable Backward Pass:

    • Crucially, the rasterizer supports backpropagation without limiting the number of blended primitives that receive gradient updates. This is vital for learning complex scenes with varying depth complexity.

    • Reconstructing Intermediate Opacities: Instead of storing long lists of blended points per-pixel (which is memory-intensive), the backward pass re-traverses the per-tile lists of Gaussians (which were already sorted in the forward pass).

    • Back-to-Front Traversal: The traversal for gradients happens back-to-front.

    • Gradient Computation: The accumulated opacity from the forward pass at each step is needed for gradient computation. This is recovered by storing only the final accumulated opacity for each point and then repeatedly dividing it by each point's alpha during the back-to-front traversal.

    • Optimized Overlap Testing: During the backward pass, a pixel only performs expensive overlap testing and processing of points if their depth is less than or equal to the depth of the last point that contributed to its color in the forward pass.

      The rasterization approach is described at a high-level in Algorithm 2 from Appendix C.

      Algorithm 2 GPU software rasterization of 3D Gaussians
      w, h: width and height of the image to rasterize
      M, S: Gaussian means and covariances in world space
      C, A: Gaussian colors and opacities
      V: view configuration of current camera

      function RASTERIZE(w, h, M, S, C, A, V)
          CullGaussian(p, V)                        ▷ Frustum culling
          M′, S′ ← ScreenspaceGaussians(M, S, V)    ▷ Transform
          T ← CreateTiles(w, h)
          L, K ← DuplicateWithKeys(M′, T)           ▷ Indices and keys
          SortByKeys(K, L)                          ▷ Globally sort
          R ← IdentifyTileRanges(T, K)
          I ← 0                                     ▷ Init canvas
          for all tiles t in I do
              for all pixels i in t do
                  r ← GetTileRange(R, t)
                  I[i] ← BlendInOrder(i, L, r, K, M′, S′, C, A)

Algorithm 2: GPU Software Rasterization of 3D Gaussians

  • Input: $w$ (width), $h$ (height) of the image; M, S (Gaussian means and covariances in world space); C, A (Gaussian colors and opacities); $V$ (view configuration of the current camera).

  • Function RASTERIZE:

    • CullGaussian(p, V): Performs Frustum Culling to remove Gaussians outside the view.
    • M′, S′ ← ScreenspaceGaussians(M, S, V): Transforms the Gaussian means $M$ and covariances $S$ from world space to screen space based on the view configuration $V$, producing $M'$ and $S'$.
    • T ← CreateTiles(w, h): Divides the screen into tiles (e.g., $16 \times 16$ pixel blocks).
    • L, K ← DuplicateWithKeys(M′, T): Duplicates Gaussians that overlap multiple tiles and assigns a key $K$ to each instance, combining its tile ID and projected depth; $L$ is the list of these duplicated Gaussian instances.
    • SortByKeys(K, L): Globally sorts the list of Gaussian instances $L$ by their keys $K$ using a Radix sort.
    • R ← IdentifyTileRanges(T, K): Identifies the start and end indices in the globally sorted list $L$ for each tile, creating per-tile ranges $R$.
    • I ← 0: Initializes the canvas (output image) $I$ to black.
    • Loop over tiles:
      • for all tiles t in I: iterates through each tile.
      • Loop over pixels within a tile:
        • for all pixels i in t: iterates through each pixel of the current tile.
        • r ← GetTileRange(R, t): retrieves the range of Gaussians relevant to the current tile $t$.
        • I[i] ← BlendInOrder(i, L, r, K, M′, S′, C, A): blends the Gaussians in depth-sorted order for pixel $i$ using their screen-space parameters ($M'$, $S'$), colors ($C$), and opacities ($A$), accumulating color and opacity until saturation.
  • Numerical Stability (Appendix C):

    • To ensure numerical stability, especially during the backward pass's opacity recovery:
      • Blending updates are skipped if $\alpha < \epsilon$ (e.g., $\epsilon = \frac{1}{255}$).

      • Alpha values are clamped from above at 0.99.

      • In the forward pass, front-to-back blending for a pixel stops if the accumulated opacity would exceed 0.9999, preventing division by zero or infinite values during backward pass reconstruction.

        This comprehensive methodology, combining an expressive Gaussian representation with adaptive optimization and a highly optimized rasterization pipeline, is what enables 3D Gaussian Splatting to achieve its unprecedented balance of quality, speed, and real-time performance.
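Before moving on, the per-pixel part of this pipeline can be illustrated with a small Python sketch: sorting duplicated Gaussian instances by a (tile, depth) key, then blending front-to-back with the early-termination and clamping rules above. This is a toy stand-in for the CUDA tile-based rasterizer, and all values in the example are made up.

```python
import numpy as np

def sort_instances(tile_ids, depths):
    """Stand-in for the GPU radix sort: order instances by tile id, then by depth,
    which yields per-tile, front-to-back Gaussian lists."""
    return np.lexsort((depths, tile_ids))

def blend_pixel(splats, px, alpha_stop=0.9999):
    """Front-to-back alpha blending for one pixel over its tile's sorted splat list."""
    color = np.zeros(3)
    T = 1.0                                      # transmittance
    for mean2d, inv_cov2d, rgb, opacity in splats:
        d = px - mean2d
        alpha = opacity * np.exp(-0.5 * d @ inv_cov2d @ d)
        if alpha < 1.0 / 255.0:                  # skip negligible contributions
            continue
        alpha = min(alpha, 0.99)                 # clamp for numerical stability
        color += T * alpha * rgb
        T *= 1.0 - alpha
        if 1.0 - T > alpha_stop:                 # early termination: pixel saturated
            break
    return color, T

print(sort_instances(np.array([1, 0, 1, 0]), np.array([2.0, 5.0, 1.0, 3.0])))  # [3 1 2 0]

inv_cov = np.linalg.inv(np.array([[4.0, 0.0], [0.0, 1.0]]))
splats = [
    (np.array([8.0, 8.0]), inv_cov, np.array([1.0, 0.0, 0.0]), 0.8),   # near, red
    (np.array([9.0, 8.0]), inv_cov, np.array([0.0, 0.0, 1.0]), 0.9),   # far, blue
]
print(blend_pixel(splats, px=np.array([8.0, 8.0])))
```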

4.2.6. Image Formation Model

As mentioned in the "Related Work" section, the paper highlights that point-based $\alpha$-blending and NeRF-style volumetric rendering share essentially the same image formation model. This understanding is crucial because it allows the 3D Gaussians to be optimized using principles derived from volumetric rendering while being rendered efficiently using splatting and alpha blending.

The volumetric rendering formula (Eq. 1 in the paper), used by NeRFs, is: $ C = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) \mathbf{c}_i \quad \mathrm{with} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j \right) $ where:

  • $C$: The final accumulated color of a ray.

  • $N$: Number of samples along the ray.

  • $T_i$: Transmittance (accumulated transparency) from the ray origin to sample $i$.

  • $\sigma_i$: Volume density at sample $i$.

  • $\delta_i$: Distance between consecutive samples.

  • $\mathbf{c}_i$: Color at sample $i$.

    This can be rewritten (Eq. 2 in the paper) by defining the opacity $\alpha_i$ for each sample: $ C = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i $ with $ \alpha_i = 1 - \exp(-\sigma_i \delta_i) \mathrm{~and~} T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) $ where:

  • $\alpha_i$: The effective opacity of the $i$-th volumetric sample.

  • The new $T_i$ is explicitly shown as a product of transparencies $(1-\alpha_j)$ of preceding samples.

    The point-based $\alpha$-blending approach (Eq. 3 in the paper), typical for methods that blend ordered points (or splats) overlapping a pixel, is: $ C = \sum_{i \in N} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) $ where:

  • $N$: The set of ordered points (or splats) overlapping the pixel.

  • $\mathbf{c}_i$: The color of each point.

  • $\alpha_i$: The opacity of each point.

  • $\prod_{j=1}^{i-1} (1 - \alpha_j)$: The accumulated transparency from previous points, effectively acting as transmittance.

    The paper emphasizes that from Eq. 2 and Eq. 3, we can clearly see that the image formation model is the same. This equivalence allows the use of 3D Gaussians (which project to 2D splats with color and opacity) to be optimized under the same volumetric principles as NeRFs, while being rendered using a highly efficient splatting approach that mimics the volumetric accumulation. This is a fundamental insight that justifies the 3D Gaussian Splatting approach.
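A small numeric check of this equivalence: once $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$, the volumetric form (Eq. 2) and the point-based blend (Eq. 3) give the same color for the same ordered samples. The densities, step sizes, and colors below are made up.

```python
import numpy as np

sigma = np.array([0.5, 2.0, 0.1, 3.0])     # volume densities along one ray
delta = np.array([0.1, 0.1, 0.2, 0.1])     # step sizes between samples
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=float)

# Eq. 2: volumetric form with T_i = prod_{j<i} (1 - alpha_j).
alpha = 1.0 - np.exp(-sigma * delta)
T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
C_volumetric = ((T * alpha)[:, None] * colors).sum(axis=0)

# Eq. 3: point-based alpha blending over the same ordered samples.
C_blend = np.zeros(3)
trans = 1.0
for a, c in zip(alpha, colors):
    C_blend += trans * a * c
    trans *= 1.0 - a

print(np.allclose(C_volumetric, C_blend))  # True: the image formation model is the same
```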

5. Experimental Setup

5.1. Datasets

The authors evaluated their algorithm on a diverse set of established datasets to demonstrate its robustness across various scene types and capture styles.

  • Mip-NeRF360 Dataset [Barron et al. 2022]: This dataset comprises 9 real-world scenes (bicycle, flowers, garden, stump, treehill, room, counter, kitchen, bonsai) and is considered the state-of-the-art benchmark for NeRF rendering quality. It includes both bounded indoor scenes and large unbounded outdoor environments. The scenes often feature complex geometry and view-dependent effects, providing a challenging test for both quality and scalability.

  • Tanks&Temples Dataset [Knapitsch et al. 2017]: Two scenes (Truck and Train) from this dataset were used. This dataset is known for its large-scale outdoor environments and realistic captures, often used for benchmarking 3D reconstruction and novel-view synthesis methods.

  • Hedman et al. Dataset [Hedman et al. 2018]: Two scenes (Dr Johnson and Playroom) were included. This dataset provides indoor scenes with specific challenges in image-based rendering.

  • Synthetic Blender Dataset [Mildenhall et al. 2020]: This dataset (Mic, Chair, Ship, Materials, Lego, Drums, Ficus, Hotdog) consists of synthetic objects rendered in a clean environment with uniform backgrounds and well-defined camera parameters. It provides an exhaustive set of views and is bounded in size, making it suitable for evaluating fundamental reconstruction capabilities, especially with random initialization.

    For all real-world datasets, a standard train/test split was used, following the methodology suggested by Mip-NeRF360, where every 8th photo was reserved for testing. This ensures consistent and meaningful comparisons with previous methods. The choice of these datasets allows for validation across varying levels of scene complexity, boundedness, lighting conditions, and capture strategies.
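For reference, the every-8th-image holdout can be written in a few lines (an illustrative helper, not the authors' tooling):

```python
def train_test_split(image_paths):
    """Mip-NeRF360-style split: every 8th image is held out for testing."""
    test = image_paths[::8]
    train = [p for i, p in enumerate(image_paths) if i % 8 != 0]
    return train, test

paths = [f"img_{i:03d}.jpg" for i in range(20)]
train, test = train_test_split(paths)
print(len(train), len(test))   # 17 3
```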

5.2. Evaluation Metrics

The performance of the novel-view synthesis method was evaluated using three widely accepted metrics in the literature: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).

Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a common metric for measuring the quality of reconstruction of lossy compression codecs or image processing techniques. It quantifies the difference between two images on a pixel-by-pixel basis. A higher PSNR value generally indicates a higher quality (less noisy) reconstructed image. It is typically expressed in decibels (dB).
  • Mathematical Formula: $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
  • Symbol Explanation:
    • PSNR: Peak Signal-to-Noise Ratio in decibels (dB).
    • $MAX_I$: The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255; for color images with 8-bit channels, it is also typically 255.
    • MSE: Mean Squared Error between the original (ground-truth) image and the reconstructed (rendered) image, calculated as: $ MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $ where:
      • $M$: Number of rows (height) of the image.
      • $N$: Number of columns (width) of the image.
      • $I(i,j)$: The pixel value at row $i$ and column $j$ of the original image.
      • $K(i,j)$: The pixel value at row $i$ and column $j$ of the reconstructed image.
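A direct numpy implementation of this definition (for images normalized to [0, 1]; the noise example is illustrative):

```python
import numpy as np

def psnr(img_gt, img_pred, max_val=1.0):
    """PSNR in dB for images with pixel values in [0, max_val]."""
    mse = np.mean((img_gt.astype(np.float64) - img_pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.random((64, 64, 3))
noisy = np.clip(gt + rng.normal(scale=0.05, size=gt.shape), 0, 1)
print(round(psnr(gt, noisy), 2))   # roughly 26 dB for 5% Gaussian noise
```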

Structural Similarity Index Measure (SSIM)

  • Conceptual Definition: SSIM is designed to measure the perceived structural similarity between two images, moving beyond simple pixel differences. It considers three key factors: luminance, contrast, and structure. It ranges from -1 to 1, with 1 indicating perfect similarity. Higher SSIM values indicate better perceptual quality.
  • Mathematical Formula: $ SSIM(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2+\mu_y^2+c_1)(\sigma_x^2+\sigma_y^2+c_2)} $
  • Symbol Explanation:
    • $SSIM(x,y)$: The Structural Similarity Index between image patches $x$ and $y$.
    • $\mu_x$: The average (mean) of image patch $x$.
    • $\mu_y$: The average (mean) of image patch $y$.
    • $\sigma_x^2$: The variance of image patch $x$.
    • $\sigma_y^2$: The variance of image patch $y$.
    • $\sigma_{xy}$: The covariance between image patches $x$ and $y$.
    • $c_1 = (K_1 L)^2$ and $c_2 = (K_2 L)^2$: Two small constants included to avoid division by zero when the denominators are very close to zero.
      • $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit images).
      • $K_1 = 0.01$ and $K_2 = 0.03$ are typical default values.

Learned Perceptual Image Patch Similarity (LPIPS)

  • Conceptual Definition: LPIPS is a metric that aims to correlate more closely with human perception of image quality compared to traditional metrics like PSNR and SSIM. It measures the perceptual distance between two images by comparing their feature representations extracted from a pre-trained deep neural network (e.g., VGG, AlexNet). A lower LPIPS score indicates higher perceptual similarity.
  • Mathematical Formula: LPIPS is not a single, simple closed-form expression like PSNR or SSIM; instead, it involves a computation pipeline: $ \mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \left\| w_l \odot (f_l(x) - f_l(y)) \right\|_2^2 $
  • Symbol Explanation:
    • $\mathrm{LPIPS}(x, y)$: The Learned Perceptual Image Patch Similarity between images $x$ and $y$.
    • $\sum_l$: Summation over different layers $l$ of a pre-trained CNN.
    • $f_l(x)$: The feature stack (activation map) extracted from image $x$ at layer $l$ of the pre-trained network.
    • $f_l(y)$: The feature stack (activation map) extracted from image $y$ at layer $l$ of the pre-trained network.
    • $w_l$: A learned per-channel weight vector applied to the features at layer $l$, calibrated so the metric aligns with human judgments.
    • $\odot$: Element-wise multiplication.
    • $\|\cdot\|_2^2$: The squared $L_2$ norm (Euclidean distance).
    • $H_l, W_l$: Height and width of the feature map at layer $l$; the factor $\frac{1}{H_l W_l}$ averages the squared distance over the feature map's spatial positions.

5.3. Baselines

The paper compares its 3D Gaussian Splatting method against several leading novel-view synthesis techniques, chosen for their state-of-the-art quality or computational efficiency:

  • Mip-NeRF360 [Barron et al. 2022]: This method is considered the state-of-the-art in terms of rendering quality for unbounded scenes. It serves as the primary benchmark for visual fidelity, despite its very high training and rendering costs.

  • InstantNGP [Müller et al. 2022]: This method represents a significant leap in speed for NeRF-like approaches. It uses a multiresolution hash encoding and is known for its fast training (minutes) and interactive rendering. The paper compares against two configurations:

    • INGP-Base: A basic configuration run for 35K iterations.
    • INGP-Big: A slightly larger network configuration suggested by the authors, offering potentially higher quality at the cost of slightly more resources.
  • Plenoxels [Fridovich-Keil and Yu et al. 2022]: This method is another fast NeRF variant that represents radiance fields with a sparse voxel grid, notably forgoing neural networks entirely. It is also known for fast training and interactive rendering.

    These baselines collectively cover the spectrum from highest quality (Mip-NeRF360) to fastest performance (InstantNGP, Plenoxels), allowing 3D Gaussian Splatting to demonstrate its advantage in combining both aspects.

6. Results & Analysis

6.1. Core Results Analysis

The 3D Gaussian Splatting (3DGS) method demonstrates a significant advancement in novel-view synthesis by achieving state-of-the-art (SOTA) visual quality, competitive training times, and critically, real-time rendering at 1080p resolution.

The overall performance comparison with leading methods is presented in Table 1.

Table 1 (Mip-NeRF360 dataset):

| Method | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem |
|---|---|---|---|---|---|---|
| Plenoxels | 0.626 | 23.08 | 0.463 | 25m49s | 6.79 | 2.1GB |
| INGP-Base | 0.671 | 25.30 | 0.371 | 5m37s | 11.7 | 13MB |
| INGP-Big | 0.699 | 25.59 | 0.331 | 7m30s | 9.43 | 48MB |
| Mip-NeRF360 | 0.792* | 27.69† | 0.237 | 48h | 0.06 | 8.6MB |
| Ours-7K | 0.770 | 25.60 | 0.279 | 6m25s | 160 | 523MB |
| Ours-30K | 0.815 | 27.21 | 0.214 | 41m33s | 134 | 734MB |

Table 1 (Tanks&Temples dataset):

| Method | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem |
|---|---|---|---|---|---|---|
| Plenoxels | 0.719 | 21.08 | 0.379 | 25m5s | 13.0 | 2.3GB |
| INGP-Base | 0.723 | 21.72 | 0.330 | 5m26s | 17.1 | 13MB |
| INGP-Big | 0.745 | 21.92 | 0.305 | 6m59s | 14.4 | 48MB |
| Mip-NeRF360 | 0.759 | 22.22 | 0.257 | 48h | 0.14 | 8.6MB |
| Ours-7K | 0.767 | 21.20 | 0.280 | 6m55s | 197 | 270MB |
| Ours-30K | 0.841 | 23.14 | 0.183 | 26m54s | 154 | 411MB |

Table 1 (Deep Blending dataset):

| Method | SSIM↑ | PSNR↑ | LPIPS↓ | Train | FPS | Mem |
|---|---|---|---|---|---|---|
| Plenoxels | 0.795 | 23.06 | 0.510 | 27m49s | 11.2 | 2.7GB |
| INGP-Base | 0.797 | 23.62 | 0.423 | 6m31s | 3.26 | 13MB |
| INGP-Big | 0.817 | 24.96 | 0.390 | 8m | 2.79 | 48MB |
| Mip-NeRF360 | 0.901 | 29.40 | 0.245 | 48h | 0.09 | 8.6MB |
| Ours-7K | 0.875 | 27.78 | 0.317 | 4m35s | 172 | 386MB |
| Ours-30K | 0.903 | 29.41 | 0.243 | 36m2s | 137 | 676MB |

The following are the results from Table 1 of the original paper:

  • SSIM↑: Higher is better.
  • PSNR↑: Higher is better.
  • LPIPS↓: Lower is better.
  • Train: Training time.
  • FPS: Frames Per Second (rendering speed).
  • Mem: Memory used to store the model.
  • * and † indicate numbers directly adopted from the original paper (for Mip-NeRF360).

Key Observations from Table 1:

  1. Quality Dominance (Ours-30K vs. Mip-NeRF360):

    • For the Mip-NeRF360 dataset, Ours-30K achieves SSIM of 0.815 (vs. 0.792 for Mip-NeRF360), PSNR of 27.21 (vs. 27.69 for Mip-NeRF360), and LPIPS of 0.214 (vs. 0.237 for Mip-NeRF360). This shows 3DGS is largely on par with, and in some metrics (SSIM, LPIPS) even slightly surpasses, the SOTA quality of Mip-NeRF360.
    • Similar trends are observed for Tanks&Temples and Deep Blending datasets, where Ours-30K consistently achieves the highest SSIM and lowest LPIPS, and very competitive PSNR.
  2. Unprecedented Real-Time Rendering (Ours vs. All Baselines):

    • The most striking advantage is rendering speed. Mip-NeRF360 renders at only 0.06-0.14 FPS, i.e., tens of seconds per frame.
    • Plenoxels and InstantNGP achieve interactive rates (roughly 3-17 FPS), but still fall short of real-time ($\ge 30$ FPS).
    • Ours-7K and Ours-30K consistently render in real time, at 134-197 FPS across all datasets, making 3DGS the first method to deliver high-quality, real-time novel-view synthesis at 1080p for full, unbounded scenes.
  3. Competitive Training Times:

    • Mip-NeRF360 requires 48 hours of training.
    • Plenoxels and InstantNGP train in minutes (5-28 minutes).
    • Ours-7K trains in 4-7 minutes, matching the fastest methods for initial quality.
    • Ours-30K (full convergence) trains in 26-41 minutes, which is competitive with or slightly longer than Plenoxels/INGP-Big, but for significantly higher quality. This is a massive reduction from Mip-NeRF360's training time.
  4. Memory Consumption:

    • InstantNGP models are very compact (13-48 MB). Mip-NeRF360 is also relatively compact (8.6 MB).
    • 3DGS models are larger (270-734 MB). While larger than implicit methods, this is still manageable for GPU memory, especially considering the explicit nature of storing millions of Gaussians. The authors note potential for further memory reduction.
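
As a rough, back-of-the-envelope illustration of why the models occupy hundreds of megabytes (a sketch under stated assumptions, not the authors' accounting): each Gaussian stores a 3D position, a per-axis scale, a rotation quaternion, an opacity, and degree-3 spherical-harmonic coefficients for RGB, i.e., roughly 59 float32 values.

```python
# Approximate per-Gaussian storage (assumptions: 32-bit floats, degree-3 spherical
# harmonics => 16 coefficients per RGB channel; no compression or padding).
floats_per_gaussian = 3 + 3 + 4 + 1 + 3 * 16   # position + scale + rotation + opacity + SH = 59
bytes_per_gaussian = floats_per_gaussian * 4    # float32

for n_gaussians in (1_000_000, 3_000_000):
    mb = n_gaussians * bytes_per_gaussian / 1e6
    print(f"{n_gaussians:,} Gaussians ~ {mb:.0f} MB")
# Roughly 236 MB per million Gaussians, which is of the same order as the
# 270-734 MB model sizes reported in Table 1.
```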

Visual Results: Figure 1 provides a compelling visual summary of the performance comparison:

Figure 1 (image): performance comparison of InstantNGP, Plenoxels, Mip-NeRF360, our method, and the ground truth; each method is annotated with frame rate, training time, and PSNR, with the ground-truth image shown on the right as a reference.

Figure 1 showcases a comparison of 3D Gaussian Splatting against other methods like InstantNGP, Plenoxels, and Mip-NeRF360. The critical takeaway is 3DGS's ability to achieve real-time rendering (137-197 FPS) at high quality, significantly outperforming Mip-NeRF360 (0.06-0.14 FPS) and offering superior quality and speed compared to InstantNGP and Plenoxels. The PSNR scores further confirm the high visual fidelity of 3DGS.

Figure 5 offers more detailed visual comparisons for specific scenes, highlighting that 3DGS can avoid artifacts present in Mip-NeRF360 (e.g., blurriness in vegetation or walls).

Figure 5 (image): per-scene comparison of Ground Truth, our method, Mip-NeRF360, InstantNGP, and Plenoxels; each row shows a different scene.

Figure 5 visually compares the rendering quality of 3D Gaussian Splatting (Ours) against Ground Truth, Mip-NeRF360, InstantNGP, and Plenoxels. The results demonstrate that Ours produces high-fidelity images, often matching or exceeding the perceptual quality of Mip-NeRF360 and clearly outperforming InstantNGP and Plenoxels, especially in fine details and overall sharpness. This figure provides strong visual evidence of the method's state-of-the-art quality.

Synthetic Bounded Scenes: For the synthetic Blender dataset, where scenes are bounded and views are exhaustive, 3DGS achieves SOTA results even with random initialization (100K points). The adaptive density control quickly prunes them to 6-10K meaningful Gaussians, and the final model reaches 200-500K Gaussians. Table 2 shows PSNR scores on this dataset.

The following are the results from Table 2 of the original paper:

| Method | Mic | Chair | Ship | Materials | Lego | Drums | Ficus | Hotdog | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Plenoxels | 33.26 | 33.98 | 29.62 | 29.14 | 34.10 | 25.35 | 31.83 | 36.81 | 31.76 |
| INGP-Base | 36.22 | 35.00 | 31.10 | 29.78 | 36.39 | 26.02 | 33.51 | 37.40 | 33.18 |
| Mip-NeRF | 36.51 | 35.14 | 30.41 | 30.71 | 35.70 | 25.48 | 33.29 | 37.48 | 33.09 |
| Point-NeRF | 35.95 | 35.40 | 30.97 | 29.61 | 35.04 | 26.06 | 36.13 | 37.30 | 33.30 |
| Ours-30K | 35.36 | 35.83 | 30.80 | 30.00 | 35.78 | 26.15 | 34.87 | 37.72 | 33.32 |

For the synthetic Blender dataset, Ours-30K achieves an average PSNR of 33.32, which is the highest among all compared methods, slightly surpassing Point-NeRF (33.30) and INGP-Base (33.18). The rendering FPS for these scenes was 180-300. This confirms the method's effectiveness even when starting from a less structured initial state.

Compactness: The anisotropic Gaussians prove to be a compact representation. When compared against the point-based models of [Zhang et al. 2022] (which use foreground masks and space carving), 3DGS surpasses their reported PSNR scores using approximately one-fourth of their point count and significantly smaller model sizes (average 3.8 MB vs. 9 MB). This demonstrates the efficiency of using anisotropic shapes to model complex geometry.

6.2. Data Presentation (Tables)

Tables 1 and 2 of the original paper are reproduced in Section 6.1 above. The per-scene breakdowns from Tables 4-9 of the original paper follow.

The following are the results from Table 4 of the original paper (SSIM scores for Mip-NeRF360 scenes):

| Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Plenoxels | 0.496 | 0.431 | 0.6063 | 0.523 | 0.509 | 0.8417 | 0.759 | 0.648 | 0.814 |
| INGP-Base | 0.491 | 0.450 | 0.649 | 0.574 | 0.518 | 0.855 | 0.798 | 0.818 | 0.890 |
| INGP-Big | 0.512 | 0.486 | 0.701 | 0.594 | 0.542 | 0.871 | 0.817 | 0.858 | 0.906 |
| Mip-NeRF360† | 0.685 | 0.583 | 0.813 | 0.744 | 0.632 | 0.913 | 0.894 | 0.920 | 0.941 |
| Mip-NeRF360 | 0.685 | 0.584 | 0.809 | 0.745 | 0.631 | 0.910 | 0.892 | 0.917 | 0.938 |
| Ours-7k | 0.675 | 0.525 | 0.836 | 0.728 | 0.598 | 0.884 | 0.873 | 0.900 | 0.910 |
| Ours-30k | 0.771 | 0.605 | 0.868 | 0.775 | 0.638 | 0.914 | 0.905 | 0.922 | 0.938 |

The following are the results from Table 5 of the original paper (PSNR scores for Mip-NeRF360 scenes):

| Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Plenoxels | 21.912 | 20.097 | 23.4947 | 20.661 | 22.248 | 27.594 | 23.624 | 23.420 | 24.669 |
| INGP-Base | 22.193 | 20.348 | 24.599 | 23.626 | 22.364 | 29.269 | 26.439 | 28.548 | 30.337 |
| INGP-Big | 22.171 | 20.652 | 25.069 | 23.466 | 22.373 | 29.690 | 26.691 | 29.479 | 30.685 |
| Mip-NeRF360† | 24.37 | 21.73 | 26.98 | 26.40 | 22.87 | 31.63 | 29.55 | 32.23 | 33.46 |
| Mip-NeRF360 | 24.305 | 21.649 | 26.875 | 26.175 | 22.929 | 31.467 | 29.447 | 31.989 | 33.397 |
| Ours-7k | 23.604 | 20.515 | 26.245 | 25.709 | 22.085 | 28.139 | 26.705 | 28.546 | 28.850 |
| Ours-30k | 25.246 | 21.520 | 27.410 | 26.550 | 22.490 | 30.632 | 28.700 | 30.317 | 31.980 |

The following are the results from Table 6 of the original paper (LPIPS scores for Mip-NeRF360 scenes):

| Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Plenoxels | 0.506 | 0.521 | 0.3864 | 0.503 | 0.540 | 0.4186 | 0.441 | 0.447 | 0.398 |
| INGP-Base | 0.487 | 0.481 | 0.312 | 0.450 | 0.489 | 0.301 | 0.342 | 0.254 | 0.227 |
| INGP-Big | 0.446 | 0.441 | 0.257 | 0.421 | 0.450 | 0.261 | 0.306 | 0.195 | 0.205 |
| Mip-NeRF360† | 0.301 | 0.344 | 0.170 | 0.261 | 0.339 | 0.211 | 0.204 | 0.127 | 0.176 |
| Mip-NeRF360 | 0.305 | 0.346 | 0.171 | 0.265 | 0.347 | 0.213 | 0.207 | 0.128 | 0.179 |
| Ours-7k | 0.318 | 0.417 | 0.153 | 0.287 | 0.404 | 0.272 | 0.254 | 0.161 | 0.244 |
| Ours-30k | 0.205 | 0.336 | 0.103 | 0.210 | 0.317 | 0.220 | 0.204 | 0.129 | 0.205 |

The following are the results from Table 7 of the original paper (SSIM scores for Tanks&Temples and Deep Blending scenes):

| Method | Truck | Train | Dr Johnson | Playroom |
| --- | --- | --- | --- | --- |
| Plenoxels | 0.774 | 0.663 | 0.787 | 0.802 |
| INGP-Base | 0.779 | 0.666 | 0.839 | 0.754 |
| INGP-Big | 0.800 | 0.689 | 0.854 | 0.779 |
| Mip-NeRF360 | 0.857 | 0.660 | 0.901 | 0.900 |
| Ours-7k | 0.840 | 0.694 | 0.853 | 0.896 |
| Ours-30k | 0.879 | 0.802 | 0.899 | 0.906 |

The following are the results from Table 8 of the original paper (PSNR scores for Tanks&Temples and Deep Blending scenes):

| Method | Truck | Train | Dr Johnson | Playroom |
| --- | --- | --- | --- | --- |
| Plenoxels | 23.221 | 18.927 | 23.142 | 22.980 |
| INGP-Base | 23.260 | 20.170 | 27.750 | 19.483 |
| INGP-Big | 23.383 | 20.456 | 28.257 | 21.665 |
| Mip-NeRF360 | 24.912 | 19.523 | 29.140 | 29.657 |
| Ours-7k | 23.506 | 18.892 | 26.306 | 29.245 |
| Ours-30k | 25.187 | 21.097 | 28.766 | 30.044 |

The following are the results from Table 9 of the original paper (LPIPS scores for Tanks&Temples and Deep Blending scenes):

| Method | Truck | Train | Dr Johnson | Playroom |
| --- | --- | --- | --- | --- |
| Plenoxels | 0.335 | 0.422 | 0.521 | 0.499 |
| INGP-Base | 0.274 | 0.386 | 0.381 | 0.465 |
| INGP-Big | 0.249 | 0.360 | 0.352 | 0.428 |
| Mip-NeRF360 | 0.159 | 0.354 | 0.237 | 0.252 |
| Ours-7k | 0.209 | 0.350 | 0.343 | 0.291 |
| Ours-30k | 0.148 | 0.218 | 0.244 | 0.241 |

6.3. Ablation Studies / Parameter Analysis

The authors conducted a thorough ablation study to evaluate the contribution of each key component and design choice in 3D Gaussian Splatting. The quantitative results are summarized in Table 3.

The following are the results from Table 3 of the original paper (PSNR scores for the ablation runs on the Truck, Garden, and Bicycle scenes, evaluated after 5K and 30K iterations):

| Ablation | Truck-5K | Garden-5K | Bicycle-5K | Truck-30K | Garden-30K | Bicycle-30K | Average-5K | Average-30K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Limited-BW | 14.66 | 22.07 | 20.77 | 13.84 | 22.88 | 20.87 | 19.16 | 19.19 |
| Random Init | 16.75 | 20.90 | 19.86 | 18.02 | 22.19 | 21.05 | 19.17 | 20.42 |
| No-Split | 18.31 | 23.98 | 22.21 | 20.59 | 26.11 | 25.02 | 21.50 | 23.90 |
| No-SH | 22.36 | 25.22 | 22.88 | 24.39 | 26.59 | 25.08 | 23.48 | 25.35 |
| No-Clone | 22.29 | 25.61 | 22.15 | 24.82 | 27.47 | 25.46 | 23.35 | 25.91 |
| Isotropic | 22.40 | 25.49 | 22.81 | 23.89 | 27.00 | 24.81 | 23.56 | 25.23 |
| Full | 22.71 | 25.82 | 23.18 | 24.81 | 27.70 | 25.65 | 23.90 | 26.05 |

  • Initialization from SfM (vs. Random Init):

    • Result: Random Init (Average-30K PSNR: 20.42) performs significantly worse than Full (Average-30K PSNR: 26.05).

    • Analysis: Initializing 3D Gaussians from SfM points (the default Full method) is crucial for performance, especially in the background and poorly observed regions. While random initialization can work for well-constrained synthetic scenes, it leads to more floaters and poorer reconstruction in real-world scenarios. Figure 7 visually supports this finding, and a minimal initialization sketch follows it below.

      Fig. 7. Initialization with SfM points helps. Above: initialization with a random point cloud. Below: initialization using SfM points.

    Figure 7 demonstrates the impact of initialization on scene quality. The top image, initialized with a random point cloud, shows significant degradation, especially in background areas. The bottom image, initialized using SfM points, results in a much higher quality scene, proving the importance of leveraging SfM data for robust scene reconstruction.
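
As an illustration of what this initialization could look like in practice (a sketch, not the authors' code), the snippet below parses COLMAP's `points3D.txt` export, whose documented per-line layout is `POINT3D_ID X Y Z R G B ERROR TRACK...`, and uses the sparse points as Gaussian centers with their RGB values as initial colors. The helper name `load_colmap_points3d` is hypothetical.

```python
import numpy as np

def load_colmap_points3d(path):
    """Parse COLMAP's points3D.txt into (N, 3) positions and (N, 3) RGB colors in [0, 1]."""
    xyz, rgb = [], []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and empty lines
            tok = line.split()
            xyz.append([float(t) for t in tok[1:4]])        # X, Y, Z
            rgb.append([int(t) / 255.0 for t in tok[4:7]])  # R, G, B
    return np.array(xyz), np.array(rgb)

# From here, the paper describes creating one Gaussian per sparse point, with an
# initially isotropic scale derived from the distance to nearby points and a low
# starting opacity; those details are omitted in this sketch.
```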

  • Densification Strategies (No-Split and No-Clone):

    • Result: No-Split (Average-30K PSNR: 23.90) and No-Clone (Average-30K PSNR: 25.91) both reduce quality compared to Full (Average-30K PSNR: 26.05).

    • Analysis: Both cloning and splitting mechanisms in the adaptive density control are important.

      • Splitting large Gaussians is essential for reconstructing backgrounds and detailed geometry (No-Split shows a more significant drop).
      • Cloning smaller Gaussians helps to cover under-reconstructed regions and accelerate convergence, particularly for thin structures (No-Clone also shows a drop, albeit smaller).
    • Figure 8 visually illustrates the impact of these strategies; a sketch of the clone/split decision rule follows after the figure below.

      Fig. 8. Ablation of densification strategy for the two cases "clone" and "split" (Sec. 5).

    Figure 8 illustrates the effects of different densification strategies. The "No Split-5k" image shows artifacts where large areas are inadequately refined. The "No Clone-5k" image struggles with covering new geometry. The "Full-5k" image demonstrates superior detail and completeness, validating the effectiveness of both cloning and splitting mechanisms.
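
The sketch referenced above paraphrases the clone/split rule. The gradient threshold, the 1% of scene extent size test, and the scale-division factor of 1.6 follow the values reported in the paper for the reference configuration, but the data layout, field names, and the function `densify_and_prune` are illustrative assumptions; the full method additionally removes split parents and re-samples child positions inside the parent Gaussian.

```python
import numpy as np

def densify_and_prune(gaussians, grad_threshold=0.0002, scene_extent=1.0,
                      percent_dense=0.01, split_factor=1.6, min_opacity=0.005):
    """One round of adaptive density control (simplified sketch).

    gaussians: dict of arrays keyed by 'xyz' (N,3), 'scale' (N,3), 'opacity' (N,),
        and 'view_grad' (N,), the averaged view-space positional gradient magnitude.
    """
    big = gaussians["scale"].max(axis=1) > percent_dense * scene_extent
    needs_densify = gaussians["view_grad"] > grad_threshold

    clone_mask = needs_densify & ~big   # under-reconstruction: duplicate small Gaussians
    split_mask = needs_densify & big    # over-reconstruction: split large Gaussians

    clones = {k: v[clone_mask].copy() for k, v in gaussians.items()}

    splits = {k: np.repeat(v[split_mask], 2, axis=0) for k, v in gaussians.items()}
    splits["scale"] = splits["scale"] / split_factor   # children are smaller than the parent

    keep = gaussians["opacity"] > min_opacity          # prune nearly transparent Gaussians
    kept = {k: v[keep] for k, v in gaussians.items()}

    return {k: np.concatenate([kept[k], clones[k], splits[k]], axis=0) for k in gaussians}
```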

  • Unlimited Depth Complexity of Splats with Gradients (Limited-BW):

    • Result: Limited-BW (limiting gradient computation to a fixed number of front-most points) results in a severe quality degradation (Average-30K PSNR: 19.19) compared to Full (Average-30K PSNR: 26.05). For the Truck scene, this was an 11dB drop in PSNR.

    • Analysis: This demonstrates the necessity of allowing all splats that contribute to a pixel's color to receive gradients, regardless of their depth. Limiting this (as some prior methods do for speed) leads to unstable optimization and significant visual artifacts, especially in scenes with high depth complexity. Figure 9 vividly shows this degradation; the compositing pass through which these gradients flow is sketched after the figure below.

      Fig. 9. If we limit the number of points that receive gradients, the effect on visual quality is significant. Left: limit of 10 Gaussians that receive gradients. Right: our full method.

    Figure 9 highlights the crucial impact of allowing gradients to propagate through all contributing Gaussians. The left image, with a limit of only 10 Gaussians receiving gradients, exhibits significant visual degradation and artifacts. The right image, rendered with the full method (no limit), demonstrates vastly superior quality and detail, proving that restricting gradient flow severely compromises scene reconstruction.
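
For context, the quantity whose gradients are at stake is the standard front-to-back alpha blend over all depth-sorted splats overlapping a pixel. The following is a minimal sketch of that forward pass in plain Python (illustrative only, not the paper's tile-based CUDA kernel); the function name `composite_pixel` is an assumption.

```python
def composite_pixel(colors, alphas, t_min=1e-4):
    """Front-to-back alpha compositing of depth-sorted splats covering one pixel.

    colors: list of RGB tuples, nearest splat first; alphas: matching per-splat opacities.
    Every splat with non-negligible remaining transmittance contributes to the pixel,
    which is why gradients must be able to flow to an unbounded number of splats.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        w = transmittance * a                      # contribution weight of this splat
        pixel = [p + w * ci for p, ci in zip(pixel, c)]
        transmittance *= (1.0 - a)
        if transmittance < t_min:                  # early termination once the pixel saturates
            break
    return pixel
```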

  • Anisotropic Covariance (Isotropic):

    • Result: Isotropic Gaussians (Average-30K PSNR: 25.23) lead to lower quality than Full (Average-30K PSNR: 26.05).

    • Analysis: Optimizing the full anisotropic covariance matrix is critical. Anisotropy allows Gaussians to adapt their shape and orientation to precisely align with surfaces and fine structures, leading to a more compact and accurate scene representation. Without it, the Gaussians are spheres, which are less efficient and less accurate for modeling most real-world geometry. Figure 10 provides a visual comparison, and a sketch of the covariance parameterization follows it below.

      Fig. 10 (image): comparison of Ground Truth, the full anisotropic model (Full), and the isotropic ablation (Isotropic).

    Figure 10 visually compares the rendering quality between Ground Truth, Full (our method with anisotropic covariance), and Isotropic (our method with spherical Gaussians). The Full rendering closely matches the Ground Truth, demonstrating the benefit of anisotropic Gaussians in capturing detailed geometry. The Isotropic rendering shows a noticeable drop in quality, confirming that anisotropic covariance is essential for high-fidelity representation.
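
The paper parameterizes each covariance as $\Sigma = R S S^T R^T$, with $S$ a diagonal per-axis scaling and $R$ a rotation derived from a unit quaternion, which keeps $\Sigma$ positive semi-definite during optimization. The sketch below shows this construction; the function names are illustrative, not the authors' API.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a (w, x, y, z) quaternion, normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_scaling_rotation(scale, quat):
    """Anisotropic covariance Sigma = R S S^T R^T from per-axis scales and a quaternion.

    Forcing the three scale values to be equal recovers the isotropic (spherical) ablation.
    """
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    M = R @ S
    return M @ M.T
```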

  • Spherical Harmonics (No-SH):

    • Result: No-SH (Average-30K PSNR: 25.35) results in reduced quality compared to Full (Average-30K PSNR: 26.05).

    • Analysis: The use of Spherical Harmonics (SH) is beneficial for capturing view-dependent effects and accurately representing the color of Gaussians from various viewing directions, which contributes to higher overall PSNR scores and more realistic appearance.

In summary, the ablation studies confirm that each evaluated design choice (SfM initialization, the clone and split densification mechanisms, unrestricted gradient flow through the rasterizer, anisotropic covariance, and spherical harmonics) contributes measurably to the final quality; the full method achieves the best average PSNR at both 5K and 30K iterations.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces 3D Gaussian Splatting (3DGS), a pioneering approach that redefines the landscape of novel-view synthesis. By representing scenes with anisotropic 3D Gaussians, the method successfully bridges the gap between explicit point-based representations and implicit volumetric radiance fields. The combination of a robust optimization pipeline, featuring adaptive density control and anisotropic covariance adjustment, with a highly efficient tile-based differentiable rasterizer, enables unprecedented performance. 3DGS achieves state-of-the-art visual quality comparable to or even surpassing the best NeRF-based methods (e.g., Mip-NeRF360), drastically reduces training times from hours to minutes, and critically, delivers the first truly real-time ($\ge 30$ fps) high-resolution (1080p) novel-view synthesis for complex, unbounded scenes. This work demonstrates that continuous implicit representations are not strictly necessary for high-quality and fast radiance field training and rendering, advocating for the power of explicit, GPU-friendly primitives and optimized rasterization principles.

7.2. Limitations & Future Work

The authors acknowledge several limitations of 3D Gaussian Splatting:

  • Artifacts in Poorly Observed Regions: Like other methods, 3DGS can produce artifacts in areas of the scene that are sparsely covered by training images. These can manifest as coarse, anisotropic Gaussians leading to low-detail visuals, or "splotchy" appearance. Figure 11 and Figure 12 illustrate these failure cases.

    Fig. 11. Comparison of failure artifacts: Mip-NeRF360 has "floaters" and a grainy appearance (left, foreground), while our method produces coarse, anisotropic Gaussians resulting in low-detail visuals (right, background). TRAIN scene.

    Figure 11 illustrates a comparison of failure artifacts in the TRAIN scene. On the left, Mip-NeRF360 exhibits "floaters" and a grainy appearance in the foreground. On the right, our method produces coarse, anisotropic Gaussians resulting in low-detail visuals in the background. Both methods show limitations, but with different types of artifacts.

    Fig. 12. In views that have little overlap with those seen during training, our method may produce artifacts (right). Again, Mip-NeRF360 also has artifacts in these cases (left). Dr Johnson scene.

    Figure 12 shows artifacts that can arise when views have little overlap with those seen during training in the DrJohnson scene. The left image (Mip-NeRF360) and the right image (our method) both exhibit issues, indicating a common challenge for novel-view synthesis in poorly constrained viewing conditions.

  • Popping Artifacts: Occasionally, popping artifacts can occur when the optimization creates large Gaussians, particularly in regions with strong view-dependent appearance. This is attributed to the simple visibility algorithm (which might cause Gaussians to suddenly switch depth/blending order) and the trivial rejection of Gaussians via a guard band in the rasterizer.

  • Lack of Regularization: The current optimization does not employ any regularization techniques, which could potentially help mitigate issues in unseen regions and reduce popping artifacts.

  • Hyperparameter Sensitivity: While the same hyperparameters were used for the main evaluation, reducing the position learning rate might be necessary for convergence in very large scenes (e.g., urban datasets).

  • Memory Consumption: Compared to NeRF-based solutions, 3DGS has significantly higher memory consumption. During training of large scenes, peak GPU memory can exceed 20 GB. Even for rendering, trained models require several hundred megabytes, plus an additional 30-500 MB for the rasterizer. The authors note that this prototype implementation is unoptimized and could be significantly reduced.

Future Work: The authors suggest several promising directions for future research:

  • Improved Culling and Antialiasing: A more principled culling approach for the rasterizer and incorporating antialiasing techniques could alleviate popping artifacts and improve visual quality.
  • Regularization: Integrating regularization into the optimization process could enhance robustness in poorly observed regions and reduce artifacts.
  • Optimization Efficiency: Porting the remaining Python optimization code entirely to CUDA could lead to further significant speedups, especially for performance-critical applications.
  • Memory Optimization: Applying compression techniques (well-studied in point cloud literature) to the 3D Gaussian representation could drastically reduce memory footprint.
  • Mesh Reconstruction: Exploring whether the 3D Gaussians can be used to perform mesh reconstruction of the captured scene. This would clarify the method's position between volumetric and surface representations and open up new applications.

7.3. Personal Insights & Critique

This paper represents a paradigm shift in novel-view synthesis, moving away from the purely implicit representations of NeRFs and demonstrating that an explicit, yet differentiable, primitive like 3D Gaussians can achieve superior performance across the board.

Inspirations and Applications:

  • Real-Time Potential: The most significant inspiration is the achievement of real-time high-quality rendering. This opens doors for applications previously unattainable, such as VR/AR, interactive experiences, games, virtual tourism, and rapid content creation for digital twins. The ability to navigate complex scenes at 1080p and $\ge 30$ FPS is transformative.
  • Hybrid Representation: The 3D Gaussian is a brilliant choice, acting as a "learned voxel" or "learned particle" that naturally bridges the gap between discrete points and continuous volumes. It leverages the strengths of both, offering the geometric flexibility of points with the differentiable properties of continuous fields.
  • Engineering Excellence: The emphasis on a highly optimized tile-based rasterizer and explicit gradient computation highlights the importance of classic computer graphics engineering principles in achieving real-world performance for AI-driven tasks. It shows that algorithmic efficiency is as crucial as neural network architecture.
  • Simplicity and Adaptability: Initializing from readily available SfM points and then dynamically adapting the Gaussian density and shape is an elegant solution to reconstruction challenges. It makes the method highly adaptable to diverse capture conditions.

Potential Issues, Unverified Assumptions, and Areas for Improvement:

  • Memory Footprint: While the authors acknowledge it, the current memory footprint (hundreds of MBs for inference, GBs for training) might still be a bottleneck for deployment on highly constrained devices (e.g., mobile AR/VR headsets with limited VRAM). Aggressive compression techniques will be vital.

  • Interpretability and Editability: While 3D Gaussians are explicit, directly editing individual Gaussians or understanding their precise contribution to a semantic part of the scene might still be complex. Future work on semantic segmentation or hierarchical structures over Gaussians could improve editability and interpretability.

  • Generalization to Unseen Conditions: The artifacts in poorly observed regions suggest that while 3DGS is robust, it still relies heavily on sufficient multi-view coverage. Further research into inpainting or generative models to intelligently "fill in" gaps in sparse regions could enhance robustness.

  • Anisotropic Artifacts: The mention of "splotchy" or elongated artifacts suggests that while anisotropy is powerful, it can sometimes be over-optimized or misaligned, leading to unnatural appearances. Regularization or geometric constraints could help.

  • Dynamic Scenes: The current method focuses on static scenes. Extending 3D Gaussian Splatting to dynamic environments (e.g., deformable objects, moving scenes) would be a significant but challenging next step, requiring tracking and optimizing many more moving parts.

Overall, 3D Gaussian Splatting is a landmark paper that delivers on a long-standing promise of radiance fields. Its practical implications are enormous, pushing the boundaries of what's possible in real-time photorealistic rendering and setting a new benchmark for future research.
