Paper status: completed

Mip-Splatting: Alias-free 3D Gaussian Splatting

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces a 3D smoothing filter that eliminates artifacts in 3D Gaussian Splatting by constraining the size of 3D Gaussian primitives based on sampling frequency, and effectively mitigates aliasing issues using a 2D Mip filter.

Abstract

Recently, 3D Gaussian Splatting has demonstrated impressive novel view synthesis results, reaching high fidelity and efficiency. However, strong artifacts can be observed when changing the sampling rate, e.g., by changing focal length or camera distance. We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem, we introduce a 3D smoothing filter which constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views, eliminating high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such as training on single-scale images and testing on multiple scales, validates the effectiveness of our approach.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Mip-Splatting: Alias-free 3D Gaussian Splatting."

1.2. Authors

  • Zehao Yu (University of Tübingen, Tübingen AI Center)

  • Anpei Chen (University of Tübingen, Tübingen AI Center)

  • Binbin Huang (ShanghaiTech University)

  • Torsten Sattler (Czech Technical University in Prague)

  • Andreas Geiger (University of Tübingen, Tübingen AI Center)

    The authors are affiliated with universities and research centers, indicating a strong academic background in computer vision and graphics. Andreas Geiger, in particular, is a well-known researcher in the field of autonomous driving and 3D vision.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference it was published in, but a common practice for papers building upon 3D Gaussian Splatting (which was published in ACM Transactions on Graphics, a top-tier venue) is to target high-impact computer graphics or computer vision conferences/journals. Given the citation style (e.g., ICCV, CVPR) in the references, it is likely intended for a major conference in these fields.

1.4. Publication Year

The publication year is not explicitly mentioned in the provided text, but context from the abstract (e.g., "Recently, 3D Gaussian Splatting has demonstrated...") suggests it is a very recent work, likely 2023 or 2024, building on 3DGS [18] from 2023.

1.5. Abstract

The paper addresses significant artifact issues in 3D Gaussian Splatting (3DGS), specifically when changing sampling rates (e.g., varying focal length or camera distance). The authors attribute these artifacts to a lack of 3D frequency constraints and the use of a 2D dilation filter in the original 3DGS. To solve this, they introduce two main components:

  1. A 3D smoothing filter: This constrains the size of 3D Gaussian primitives based on the maximal sampling frequency derived from the input views, effectively eliminating high-frequency artifacts during zoom-in.
  2. A 2D Mip filter: This replaces the original 2D dilation filter, simulating a 2D box filter to mitigate aliasing and dilation issues, particularly during zoom-out. The evaluation, including scenarios like training on single-scale images and testing on multiple scales, demonstrates the effectiveness of their approach, termed Mip-Splatting, in producing alias-free novel view synthesis across various sampling rates.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the presence of strong artifacts in 3D Gaussian Splatting (3DGS) when the camera's sampling rate changes significantly from the training views. This includes actions like zooming in (increasing effective resolution) or zooming out (decreasing effective resolution), or changing camera distance.

This problem is highly important in the field of Novel View Synthesis (NVS) because 3DGS has emerged as a groundbreaking method due to its impressive rendering quality and real-time performance. However, practical applications in virtual reality, cinematography, or robotics often require robust performance across a wide range of camera parameters and scales. The artifacts observed in 3DGS (e.g., erosion when zooming in, dilation/aliasing when zooming out) limit its generalization and practical utility in these dynamic scenarios. Existing methods like Mip-NeRF address multi-scale issues for neural radiance fields but are not directly applicable to the explicit Gaussian representation and rasterization pipeline of 3DGS. The specific challenges are:

  1. Lack of 3D frequency constraints: 3DGS does not inherently limit the "fineness" or "detail level" of its 3D Gaussians based on the input image resolution, leading to high-frequency artifacts (erosion) when rendered at higher resolutions than trained.

  2. Problematic 2D dilation filter: The original 3DGS uses a fixed 2D dilation filter in screen space to prevent degenerate small Gaussians. While useful at training scale, this filter causes incorrect radiance spread (dilation artifacts) when zooming out and ineffective filtering (leading to erosion artifacts) when zooming in.

    The paper's entry point is to address these fundamental limitations of 3DGS by introducing principled anti-aliasing mechanisms directly into the Gaussian representation and rendering process. The innovative idea is to regularize the 3D Gaussians in 3D space based on multi-view sampling rates and to replace the problematic 2D dilation with a physically motivated 2D Mip filter.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  • Introduction of a 3D smoothing filter for 3D Gaussian primitives: This filter effectively regularizes the maximum frequency (or minimum size) of 3D Gaussians based on the maximal sampling rate available from the training images for each primitive. This prevents the Gaussians from becoming excessively small and introducing high-frequency artifacts when rendered at higher resolutions (e.g., zooming in) than those seen during training. This filter becomes an intrinsic part of the 3D scene representation.

  • Replacement of the 2D dilation filter with a 2D Mip filter: The original 2D dilation operation in 3DGS is replaced with a novel 2D Mip filter, which approximates the physical imaging process's 2D box filter (integrating light over a pixel area) using a 2D Gaussian. This effectively mitigates aliasing and dilation artifacts, particularly when rendering at lower sampling rates (e.g., zooming out).

  • Demonstration of alias-free rendering across various sampling rates with single-scale training: The proposed Mip-Splatting method enables faithful rendering at different zoom levels and camera distances, even when trained exclusively on single-scale (full-resolution) images. This highlights excellent out-of-distribution generalization.

  • Principled and simple modifications: The proposed changes are principled, grounded in sampling theory and physical imaging processes, and are simple to integrate into the existing 3DGS framework.

    The key conclusions and findings are that by constraining the 3D representation's frequency based on multi-view sampling and by implementing a more physically accurate 2D screen-space filter, Mip-Splatting effectively solves the aliasing and dilation artifacts observed in 3DGS. It significantly enhances the robustness and generalization capabilities of 3D Gaussian Splatting for novel view synthesis across different scales, making it more practical for real-world applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Novel View Synthesis (NVS)

Novel View Synthesis (NVS) is a subfield of computer graphics and computer vision focused on generating new images of a 3D scene from arbitrary camera viewpoints, given a set of existing images of that scene. The goal is to create photo-realistic images that look as if a camera was physically present at the new viewpoint. This has broad applications in virtual reality, augmented reality, 3D content creation, and robotics.

3.1.2. Neural Radiance Fields (NeRF)

Neural Radiance Fields (NeRF) [28] is a landmark technique in NVS. It represents a 3D scene using a multilayer perceptron (MLP) neural network. This MLP maps a 3D coordinate $(x, y, z)$ and a 2D viewing direction $(\theta, \phi)$ to a predicted color $(r, g, b)$ and volume density $\sigma$. To render an image, rays are cast from the camera through each pixel. For each ray, samples are taken along its path, and the MLP predicts color and density at these sample points. These are then composited using volume rendering techniques to produce the final pixel color. While NeRF produces high-quality results, its rendering speed can be slow due to the repeated MLP evaluations for each ray.

3.1.3. 3D Gaussian Splatting (3DGS)

3D Gaussian Splatting (3DGS) [18] is a more recent and highly efficient NVS method that explicitly represents a 3D scene as a collection of anisotropic 3D Gaussian primitives. Each 3D Gaussian is defined by its position, covariance (representing size and orientation), opacity, and view-dependent color (often modeled using spherical harmonics). Instead of ray tracing, 3DGS uses a splatting-based rasterization pipeline. This means that each 3D Gaussian is projected onto the 2D image plane, forming a 2D Gaussian. These 2D Gaussians are then sorted by depth and alpha-blended to produce the final image. A key feature of 3DGS is its differentiable rendering, allowing all Gaussian parameters to be optimized end-to-end using a photometric loss. It achieves real-time rendering at high resolutions, making it very appealing for practical applications.

3.1.4. Sampling Theorem (Nyquist-Shannon Sampling Theorem)

The Nyquist-Shannon Sampling Theorem [33, 45] is a fundamental principle in signal processing. It states that to perfectly reconstruct a continuous signal from its discrete samples, two conditions must be met:

  1. The continuous signal must be band-limited, meaning it contains no frequency components above a certain maximum frequency $\nu$.
  2. The sampling rate $\hat{\nu}$ must be at least twice the highest frequency present in the continuous signal, i.e., $\hat{\nu} \geq 2\nu$. This minimum sampling rate is known as the Nyquist rate. If the sampling rate is below the Nyquist rate for a given frequency component, aliasing occurs, where high-frequency components are incorrectly represented as lower-frequency components. To prevent this, low-pass filtering (also called anti-aliasing filtering) is typically applied to the signal before sampling to remove frequencies above half the sampling rate.
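
As a concrete illustration (a small NumPy sketch added for this analysis, not taken from the paper), a sinusoid sampled below its Nyquist rate becomes indistinguishable from a lower-frequency one:

```python
import numpy as np

fs = 10.0                    # sampling rate (Hz); Nyquist limit is fs / 2 = 5 Hz
t = np.arange(0.0, 1.0, 1.0 / fs)

f_high = 9.0                 # above the Nyquist limit, so it will alias
f_alias = fs - f_high        # folds down to 1 Hz

x_high = np.sin(2 * np.pi * f_high * t)
x_alias = np.sin(2 * np.pi * f_alias * t)

# At these sample times the 9 Hz signal equals the negated 1 Hz signal,
# i.e. the two are indistinguishable after sampling.
print(np.allclose(x_high, -x_alias))
```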

3.1.5. Aliasing and Anti-aliasing

Aliasing refers to the misrepresentation of a high-frequency signal as a lower-frequency signal when the signal is sampled at a rate lower than the Nyquist rate. In computer graphics, this often manifests as "jaggies" (stair-step patterns on edges), moiré patterns, or flickering artifacts when fine details are rendered. Anti-aliasing techniques aim to mitigate aliasing artifacts. This typically involves applying a low-pass filter to the signal before it is sampled or rendered. A low-pass filter removes or attenuates high-frequency components that could cause aliasing, effectively "blurring" or smoothing out fine details to represent them correctly at the given sampling resolution.

3.1.6. Low-pass Filter

A low-pass filter is a filter that passes signals with a frequency lower than a certain cutoff frequency and attenuates signals with frequencies higher than the cutoff frequency. In image processing and computer graphics, applying a low-pass filter (like a Gaussian blur) effectively smooths the image by removing fine details and high-frequency noise. This is a common technique for anti-aliasing.

3.2. Previous Works

The paper builds primarily on the foundations of Neural Radiance Fields (NeRF) and directly addresses limitations of 3D Gaussian Splatting (3DGS).

  • NeRF [28]: As described above, NeRF revolutionized NVS by using MLPs for implicit scene representation. It achieved remarkable quality but suffered from slow rendering speeds.
  • Subsequent NeRF improvements [4, 11, 19, 32, 46, 51]: Many works improved NeRF's efficiency and training, often by using explicit scene representations like feature grids or sparse voxel fields (e.g., Plenoxels [11], TensoRF [4], Instant-NGP [32]). These methods still typically rely on volume rendering or similar sampling strategies.
  • Mip-NeRF [1] and Tri-MipRF [17]: These are significant prior works in anti-aliasing for NeRF-based methods.
    • Mip-NeRF [1] introduced integrated positional encoding (IPE) and cone tracing. Instead of sampling points along a ray, it samples volumes (cones) whose size corresponds to the pixel footprint. It integrates positional encoding over these conical frustums to attenuate high-frequency details, preventing aliasing. Mip-NeRF typically requires training with multi-scale input images to learn how to render effectively at different scales.
    • Tri-MipRF [17] adapts similar multi-scale ideas to feature grid-based representations. Like Mip-NeRF, it learns multi-scale signals, typically requiring multi-scale supervision during training.
    • Differentiation from Mip-Splatting: Both Mip-NeRF and Tri-MipRF are primarily MLP-based or feature grid-based methods that rely on the neural network's ability to learn and interpolate multi-scale signals, often requiring multi-scale training data. In contrast, Mip-Splatting is based on 3DGS's explicit 3D Gaussian representation and uses closed-form modifications (filters) to the Gaussians themselves. This allows Mip-Splatting to achieve excellent multi-scale generalization even when trained on single-scale images.
  • Primitive-based Differentiable Rendering [13, 14, 38, 44, 59, 60]: This category includes methods that render geometric primitives (like points or spheres) onto the image plane.
    • Pulsar [20]: An example of efficient sphere rasterization.
    • 3DGS [18]: As discussed, it represents scenes as 3D Gaussians and uses a tile-based rasterization for real-time rendering. This paper builds directly upon 3DGS.
  • Anti-aliasing in Rendering [7, 8, 15, 31, 47, 50, 59]: Traditional computer graphics has many anti-aliasing techniques.
    • EWA splatting [59]: This method applies an elliptical weighted average (EWA) filter, which is a Gaussian low-pass filter, to projected 2D Gaussians in screen space to produce a band-limited output. It aims to limit the frequency signal's bandwidth to the rendered image.
    • Differentiation from Mip-Splatting: While EWA also uses a Gaussian low-pass filter in screen space, its filter size is often chosen empirically and is primarily designed to address the rendering problem (how to correctly draw primitives), not the reconstruction problem (how to optimize the 3D representation itself to be alias-free). Mip-Splatting, however, applies a band-limited filter in 3D space whose size is determined by the training images' sampling rates, directly constraining the 3D representation during optimization. Its 2D Mip filter is specifically designed to approximate a single pixel's box filter, mimicking the physical imaging process, which is distinct from EWA's general band-limiting. The paper claims EWA can lead to overly smooth results.

3.3. Technological Evolution

The evolution in NVS has generally moved from:

  1. Implicit Representations (NeRF): Highly realistic but computationally expensive due to MLP query per ray.

  2. Hybrid/Explicit Grid-based Representations (Plenoxels, Instant-NGP, TensoRF): Improved efficiency by structuring the scene representation in grids or tensors, but still often rely on volume rendering concepts.

  3. Explicit Primitive-based Representations (3DGS): Achieved real-time rendering by moving to a differentiable rasterization pipeline of explicit geometric primitives (3D Gaussians). This greatly boosted efficiency.

    This paper's work, Mip-Splatting, fits into the technological timeline as a crucial refinement of 3DGS. While 3DGS brought unprecedented efficiency, it inherited aliasing challenges common in rasterization and lacked principled anti-aliasing mechanisms for multi-scale viewing. Mip-Splatting addresses this gap, making 3DGS more robust and practically usable across varying camera parameters, enhancing its generalization capabilities without sacrificing its core efficiency.

3.4. Differentiation Analysis

Compared to the main methods in related work, Mip-Splatting's core differences and innovations are:

  • From NeRF/Grid-based Anti-aliasing (e.g., Mip-NeRF, Tri-MipRF):

    • Representation: Mip-Splatting works with an explicit 3D Gaussian representation and a rasterization pipeline, whereas Mip-NeRF/Tri-MipRF work with implicit MLP/feature grid representations and volume rendering.
    • Anti-aliasing Mechanism: Mip-NeRF uses integrated positional encoding and cone tracing, relying on the MLP to learn multi-scale representations. Mip-Splatting employs closed-form analytical filters (3D smoothing and 2D Mip filters) directly modifying the Gaussian properties.
    • Training Requirement: Mip-NeRF and Tri-MipRF often require multi-scale images for supervision to learn how to render at different scales. Mip-Splatting achieves excellent out-of-distribution generalization (rendering at scales different from training) even when trained on single-scale images.
  • From 3DGS [18]:

    • Addressing Core Artifacts: Mip-Splatting directly tackles the aliasing, dilation, and erosion artifacts present in original 3DGS when changing sampling rates.
    • 3D Frequency Constraint: Mip-Splatting introduces a novel 3D smoothing filter that regularizes the size of 3D Gaussians based on the Nyquist-Shannon Sampling Theorem and the actual sampling rate of training views. This is absent in original 3DGS.
    • Improved 2D Filter: Mip-Splatting replaces 3DGS's problematic fixed-size 2D dilation filter with a 2D Mip filter that approximates a physical 2D box filter for anti-aliasing, leading to more accurate radiance accumulation.
  • From EWA Splatting [59]:

    • Scope: EWA primarily addresses the rendering problem (how to filter projected primitives) without directly constraining the 3D reconstruction. Mip-Splatting tackles both the reconstruction problem (by modifying 3D Gaussians in 3D space during optimization) and the rendering problem (with the 2D Mip filter).

    • Filter Principle: EWA uses a Gaussian low-pass filter to generally limit bandwidth, often with empirically chosen sizes. Mip-Splatting's 3D filter is based on Nyquist limits from training views, and its 2D Mip filter specifically approximates a single-pixel box filter to simulate the physical imaging process. The paper notes EWA can lead to overly smooth results.

      In essence, Mip-Splatting provides a principled, analytical, and efficient solution to the multi-scale aliasing problem in 3DGS, enabling robust performance across varying camera settings with minimal training overhead compared to prior neural rendering anti-aliasing methods.

4. Methodology

The paper proposes two key modifications to the original 3D Gaussian Splatting (3DGS) model to achieve alias-free rendering across various sampling rates: a 3D smoothing filter and a 2D Mip filter. These modifications address the issues of uncontrolled 3D Gaussian sizes and problematic 2D dilation in the original 3DGS.

4.1. Principles

The core idea behind Mip-Splatting is to ensure that the 3D representation (the collection of 3D Gaussians) and its 2D projection for rendering are appropriately band-limited according to the sampling rates involved.

  1. 3D Frequency Constraint: Based on the Nyquist-Shannon Sampling Theorem, the highest frequency (or finest detail) that can be faithfully reconstructed from the input images is limited by their sampling rates. Therefore, the 3D Gaussians should not represent details finer than what can be resolved by the available input views. This is achieved by applying a 3D low-pass filter during optimization that constrains the size of 3D Gaussians relative to the maximal sampling frequency from the training views.
  2. Physically Motivated 2D Filtering: The 2D rendering process should simulate how real camera sensors work, where light is integrated over a finite pixel area. This means replacing the heuristic 2D dilation filter with one that effectively approximates a 2D box filter (representing a pixel) to correctly accumulate radiance and mitigate aliasing when projecting 3D Gaussians onto the image plane.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Review of 3D Gaussian Splatting (3DGS)

The paper first reviews the foundational aspects of 3DGS, as its method builds directly upon it. In 3DGS, a scene is represented as a set of $K$ 3D Gaussian primitives, $\{\mathcal{G}_k \mid k = 1, \dots, K\}$. Each Gaussian $\mathcal{G}_k$ is parameterized by its opacity $\alpha_k \in [0, 1]$, center position $\mathbf{p}_k \in \mathbb{R}^{3 \times 1}$, and a 3D covariance matrix $\mathbf{\Sigma}_k \in \mathbb{R}^{3 \times 3}$. The mathematical form of a 3D Gaussian at a point $\mathbf{x}$ is:

$$\mathcal{G}_k(\mathbf{x}) = e^{-\frac{1}{2}(\mathbf{x} - \mathbf{p}_k)^T \mathbf{\Sigma}_k^{-1} (\mathbf{x} - \mathbf{p}_k)}$$

where:

  • $\mathbf{x}$ is a 3D point in world space.

  • $\mathbf{p}_k$ is the 3D center of the $k$-th Gaussian.

  • $\mathbf{\Sigma}_k$ is the $3 \times 3$ covariance matrix determining the shape and orientation of the $k$-th Gaussian.

    To ensure $\mathbf{\Sigma}_k$ is a valid covariance matrix (positive semi-definite), it is parameterized using a semi-definite decomposition:

$$\mathbf{\Sigma}_k = \mathbf{O}_k \mathbf{S}_k \mathbf{S}_k^T \mathbf{O}_k^T$$

where:

  • $\mathbf{S}_k = \mathrm{diag}(\mathbf{s}_k)$ is a diagonal scaling matrix built from the scaling vector $\mathbf{s}_k$ (the lengths of the principal axes of the Gaussian ellipsoid).
  • $\mathbf{O}_k \in \mathbb{R}^{3 \times 3}$ is a rotation matrix, parameterized by a quaternion, defining the orientation of the Gaussian.
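
For concreteness, the following NumPy sketch (not the authors' code; the function names are illustrative) assembles a covariance matrix from a unit quaternion and a per-axis scale vector according to the decomposition above:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_covariance(quat, scale):
    """Sigma_k = O_k S_k S_k^T O_k^T with S_k = diag(scale); positive semi-definite by construction."""
    O = quat_to_rotmat(quat)
    S = np.diag(np.asarray(scale, dtype=float))
    return O @ S @ S.T @ O.T

# Example: a Gaussian elongated along its local x-axis.
Sigma = gaussian_covariance(quat=[1.0, 0.0, 0.0, 0.0], scale=[0.30, 0.05, 0.05])
print(np.linalg.eigvalsh(Sigma))  # eigenvalues equal the squared scales
```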

Rendering Process: To render an image from a given viewpoint (defined by camera rotation $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ and translation $\mathbf{t} \in \mathbb{R}^3$), the 3D Gaussians are first transformed into camera coordinates:

$$\mathbf{p}_k^\prime = \mathbf{R} \mathbf{p}_k + \mathbf{t}, \quad \mathbf{\Sigma}_k^\prime = \mathbf{R} \mathbf{\Sigma}_k \mathbf{R}^T$$

where:

  • $\mathbf{p}_k^\prime$ is the Gaussian center in camera coordinates.

  • $\mathbf{\Sigma}_k^\prime$ is the covariance matrix in camera coordinates.

    These camera-space Gaussians are then projected to 2D screen space (referred to as "ray space" in the paper) via an affine transformation:

$$\mathbf{\Sigma}_k^{\prime\prime} = \mathbf{J}_k \mathbf{\Sigma}_k^\prime \mathbf{J}_k^T$$

Here, $\mathbf{J}_k$ is the Jacobian of the perspective projection, i.e., a local affine approximation of the projection evaluated at the Gaussian's camera-space center $\mathbf{p}_k^\prime$. The result $\mathbf{\Sigma}_k^{\prime\prime}$ (also denoted $\mathbf{\Sigma}_k^{2D}$ in the paper) is the $2 \times 2$ covariance matrix of the projected 2D Gaussian in screen space, and the corresponding 2D Gaussian is denoted $\mathcal{G}_k^{2D}$.
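
A minimal sketch of this projection step, assuming a simple pinhole model with focal lengths `fx`, `fy` (illustrative code, not the paper's tile-based CUDA rasterizer):

```python
import numpy as np

def project_gaussian(p_cam, Sigma_cam, fx, fy):
    """Project a camera-space 3D Gaussian to a 2D screen-space Gaussian.

    Uses the local affine (Jacobian) approximation of the pinhole projection
    u = fx * x / z, v = fy * y / z, evaluated at the Gaussian center.
    """
    x, y, z = p_cam
    # Jacobian of (u, v) with respect to (x, y, z) at the center point.
    J = np.array([
        [fx / z, 0.0,    -fx * x / z**2],
        [0.0,    fy / z, -fy * y / z**2],
    ])
    mean_2d = np.array([fx * x / z, fy * y / z])
    Sigma_2d = J @ Sigma_cam @ J.T  # 2x2 screen-space covariance
    return mean_2d, Sigma_2d

# Example: a small isotropic Gaussian 4 units in front of the camera.
mean_2d, Sigma_2d = project_gaussian(np.array([0.2, -0.1, 4.0]), 0.05 * np.eye(3), fx=1000.0, fy=1000.0)
```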

Finally, 3DGS models the view-dependent color $\mathbf{c}_k$ using spherical harmonics and renders the image by alpha blending the 2D Gaussians sorted by depth:

$$\mathbf{c}(\mathbf{x}) = \sum_{k=1}^{K} \mathbf{c}_k \alpha_k \mathcal{G}_k^{2D}(\mathbf{x}) \prod_{j=1}^{k-1} \left(1 - \alpha_j \mathcal{G}_j^{2D}(\mathbf{x})\right)$$

where:

  • $\mathbf{c}(\mathbf{x})$ is the final color at pixel $\mathbf{x}$.
  • $\mathbf{c}_k$ is the view-dependent color of the $k$-th Gaussian.
  • $\alpha_k$ is the opacity of the $k$-th Gaussian.
  • $\mathcal{G}_k^{2D}(\mathbf{x})$ is the value of the 2D Gaussian at pixel $\mathbf{x}$.
  • The product term $\prod_{j=1}^{k-1} (1 - \alpha_j \mathcal{G}_j^{2D}(\mathbf{x}))$ is the accumulated transmittance of the preceding Gaussians.
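
The per-pixel compositing loop can be sketched as follows (illustrative NumPy code, assuming the Gaussians are already depth-sorted and their 2D values evaluated at the pixel):

```python
import numpy as np

def composite_pixel(colors, alphas, gauss_vals):
    """Front-to-back alpha blending of depth-sorted 2D Gaussians at one pixel.

    colors:     (K, 3) per-Gaussian RGB, already evaluated for the view direction
    alphas:     (K,)   per-Gaussian opacity
    gauss_vals: (K,)   value of each projected 2D Gaussian at this pixel
    """
    pixel = np.zeros(3)
    transmittance = 1.0
    for c_k, a_k, g_k in zip(colors, alphas, gauss_vals):
        weight = a_k * g_k
        pixel += transmittance * weight * c_k
        transmittance *= 1.0 - weight
        if transmittance < 1e-4:  # early termination, as in tile-based rasterizers
            break
    return pixel
```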

Original 3DGS Dilation: To prevent projected 2D Gaussians from becoming infinitesimally small (degenerate) in screen space, 3DGS applies a dilation operation:

$$\mathcal{G}_k^{2D}(\mathbf{x}) = e^{-\frac{1}{2}(\mathbf{x} - \mathbf{p}_k)^T (\mathbf{\Sigma}_k^{2D} + s \mathbf{I})^{-1} (\mathbf{x} - \mathbf{p}_k)}$$

where:

  • $\mathbf{I}$ is a $2 \times 2$ identity matrix.
  • $s$ is a scalar dilation hyperparameter. This operation effectively adds a fixed isotropic variance to the projected 2D Gaussian's covariance, making it larger. The paper notes this is analogous to morphological dilation but is problematic because the amount is fixed, leading to artifacts when the camera's sampling rate changes.

4.2.2. Problematic Sensitivity to Sampling Rate in 3DGS

The paper highlights that the original 3DGS suffers from ambiguities in optimization, leading to underestimation of 3D Gaussian scale. This is partly due to the implicit shrinkage bias during optimization and the fixed 2D dilation.

  • Zooming In (Higher Sampling Rate): If 3D Gaussians are too small and then dilated, when the camera zooms in, the dilated 2D Gaussians appear smaller relative to the magnified pixel size. This can lead to erosion artifacts and visible gaps between Gaussians, making objects appear thinner.
  • Zooming Out (Lower Sampling Rate): When zooming out, the fixed dilation spreads radiance over a larger area in screen space relative to the smaller effective pixel size. This can cause dilation artifacts (oversized Gaussians) and increased brightness where it's not physically accurate, blurring fine details. The paper also mentions that simply removing dilation leads to optimization challenges for complex scenes (excessive small Gaussians, GPU memory issues) and still results in aliasing when zooming out due to lack of anti-aliasing.

4.2.3. Mip-Splatting Modifications

4.2.3.1. 3D Smoothing Filter

To address the uncontrolled scale of 3D Gaussians and the resulting high-frequency artifacts during zoom-in, Mip-Splatting introduces a 3D smoothing filter. The principle is to constrain the maximal frequency of the 3D representation based on the effective sampling rate of the input views.

Multiview Frequency Bounds: The sampling rate for a 3D scene is determined by the input image resolution, camera focal length, and distance to the scene. For a pixel with a unit sampling interval in screen space, back-projected to 3D world space at depth $d$, the corresponding world-space sampling interval is $\hat{T}$, and the sampling frequency $\hat{\nu}$ is its inverse:

$$\hat{T} = \frac{1}{\hat{\nu}} = \frac{d}{f}$$

where:

  • $\hat{T}$ is the sampling interval in 3D world space.

  • $\hat{\nu}$ is the sampling frequency in 3D world space.

  • $d$ is the depth from the camera to a point in the scene.

  • $f$ is the focal length of the camera in pixel units.

    According to the Nyquist-Shannon Sampling Theorem, a reconstruction algorithm can only accurately represent signal components up to a frequency of $\frac{\hat{\nu}}{2}$ (i.e., $\frac{f}{2d}$). Thus, a 3D Gaussian primitive smaller than $2\hat{T}$ (i.e., with frequency content above $\frac{\hat{\nu}}{2}$) may cause aliasing during splatting. The paper approximates the depth $d$ using the center of the primitive $\mathbf{p}_k$.

Since a 3D Gaussian primitive can be visible in multiple training views, each with potentially different sampling rates, the paper determines the maximal sampling rate for each primitive $k$ as:

$$\hat{\nu}_k = \max\left( \left\{ \mathbb{1}_n(\mathbf{p}_k) \cdot \frac{f_n}{d_n} \right\}_{n=1}^N \right)$$

where:

  • $N$ is the total number of training images.
  • $\mathbb{1}_n(\mathbf{p}_k)$ is an indicator function that is 1 if the Gaussian center $\mathbf{p}_k$ is visible within the view frustum of the $n$-th camera, and 0 otherwise.
  • $f_n$ is the focal length of the $n$-th camera.
  • $d_n$ is the depth of the Gaussian center $\mathbf{p}_k$ from the $n$-th camera. This quantity is recomputed periodically (every $m = 100$ iterations) during training, as the 3D Gaussian centers tend to be stable. The size constraint is therefore based on the most permissive sampling rate (smallest $\hat{T}$) among all cameras that see the primitive, as illustrated in Figure 3.
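
A sketch of this per-primitive computation (illustrative code; the world-to-camera convention and the `in_frustum` visibility test are assumptions, not the paper's implementation):

```python
import numpy as np

def max_sampling_rate(p_world, rotations, translations, focals, in_frustum):
    """Compute nu_k = max_n [ 1_n(p_k) * f_n / d_n ] for one Gaussian center.

    rotations:    list of 3x3 world-to-camera rotation matrices R_n
    translations: list of 3-vectors t_n
    focals:       per-camera focal lengths in pixels
    in_frustum:   callable (n, p_cam) -> bool implementing the visibility indicator
    """
    nu_max = 0.0
    for n, (R, t, f) in enumerate(zip(rotations, translations, focals)):
        p_cam = R @ p_world + t
        depth = p_cam[2]
        if depth > 0.0 and in_frustum(n, p_cam):
            nu_max = max(nu_max, f / depth)
    return nu_max
```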

The 3D smoothing is achieved by applying a Gaussian low-pass filter $\mathcal{G}_{\mathrm{low}}$ to each 3D Gaussian primitive $\mathcal{G}_k$ before it is projected onto screen space. This is an efficient operation because convolving two Gaussians (with covariance matrices $\mathbf{\Sigma}_1$ and $\mathbf{\Sigma}_2$) results in another Gaussian with the combined covariance $\mathbf{\Sigma}_1 + \mathbf{\Sigma}_2$. The regularized 3D Gaussian $\mathcal{G}_k(\mathbf{x})_{\mathrm{reg}}$ thus becomes:

$$\mathcal{G}_k(\mathbf{x})_{\mathrm{reg}} = \sqrt{\frac{|\boldsymbol{\Sigma}_k|}{|\boldsymbol{\Sigma}_k + \frac{s}{\hat{\nu}_k} \cdot \mathbf{I}|}} \; e^{-\frac{1}{2}(\mathbf{x} - \mathbf{p}_k)^T \left(\boldsymbol{\Sigma}_k + \frac{s}{\hat{\nu}_k} \cdot \mathbf{I}\right)^{-1} (\mathbf{x} - \mathbf{p}_k)}$$

where:

  • $\mathcal{G}_k(\mathbf{x})_{\mathrm{reg}}$ is the regularized 3D Gaussian.

  • $|\cdot|$ denotes the determinant of a matrix. The square root term is a normalization factor arising from the Gaussian convolution: it rescales the peak so that the Gaussian's overall brightness (its integral) is preserved after filtering.

  • $\mathbf{p}_k$ and $\mathbf{\Sigma}_k$ are the original center and covariance of the Gaussian.

  • $\mathbf{I}$ is a $3 \times 3$ identity matrix.

  • $s$ is a scalar hyperparameter that controls the strength (size) of the 3D low-pass filter.

  • $\frac{s}{\hat{\nu}_k}$ is the variance added by the 3D low-pass filter, which is inversely proportional to the maximal sampling frequency $\hat{\nu}_k$. This makes the filter size adaptive to the sampling capabilities of the training views for that specific Gaussian.

    By employing this 3D Gaussian smoothing, the paper ensures that the highest frequency component (i.e., the smallest scale feature) of any Gaussian does not exceed half of its maximal sampling rate. Crucially, this filter becomes an intrinsic part of the 3D representation during optimization and remains constant post-training, so it doesn't incur rendering overhead.
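
A minimal sketch of the 3D smoothing filter applied to a single Gaussian, using the hyperparameter $s = 0.2$ reported later in the paper (illustrative code, not the authors' implementation):

```python
import numpy as np

def smooth_3d_gaussian(Sigma, nu_max, s=0.2):
    """Apply the 3D smoothing (low-pass) filter to one Gaussian.

    Adds an isotropic variance s / nu_max to the 3D covariance and returns the
    scalar that rescales the Gaussian so its overall brightness is preserved.
    s = 0.2 is the hyperparameter value reported in the paper.
    """
    Sigma_reg = Sigma + (s / nu_max) * np.eye(3)
    brightness_scale = np.sqrt(np.linalg.det(Sigma) / np.linalg.det(Sigma_reg))
    return Sigma_reg, brightness_scale
```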

4.2.3.2. 2D Mip Filter

Even with the 3D smoothing filter, rendering at lower sampling rates (e.g., zooming out) can still lead to aliasing if the 2D projection is not properly handled. To address this, Mip-Splatting replaces the original 3DGS screen space dilation filter with a 2D Mip filter.

This 2D Mip filter is designed to replicate the physical imaging process, where photons hitting a pixel on a camera sensor are integrated over the pixel's area. Ideally, this would involve a 2D box filter in image space, which averages radiance over the pixel area. For efficiency, the paper approximates this ideal box filter with a 2D Gaussian filter:

$$\mathcal{G}_k^{2D}(\mathbf{x})_{\mathrm{mip}} = \sqrt{\frac{|\boldsymbol{\Sigma}_k^{2D}|}{|\boldsymbol{\Sigma}_k^{2D} + s \mathbf{I}|}} \; e^{-\frac{1}{2}(\mathbf{x} - \mathbf{p}_k)^T (\boldsymbol{\Sigma}_k^{2D} + s \mathbf{I})^{-1} (\mathbf{x} - \mathbf{p}_k)}$$

where:

  • $\mathcal{G}_k^{2D}(\mathbf{x})_{\mathrm{mip}}$ is the 2D Gaussian after applying the Mip filter.

  • $\mathbf{\Sigma}_k^{2D}$ is the covariance matrix of the projected 2D Gaussian (after 3D smoothing and projection).

  • $\mathbf{I}$ is a $2 \times 2$ identity matrix.

  • $s$ is a scalar hyperparameter chosen so that the added variance covers a single pixel in screen space; it effectively adds an isotropic, pixel-sized variance to the projected 2D Gaussian.

    This Mip filter's similarity to EWA splatting [59] is acknowledged, but the underlying principles differ. Mip-Splatting's 2D Mip filter specifically targets approximating the box filter of a single pixel, aiming for accurate radiance integration. EWA, on the other hand, is generally for limiting frequency bandwidth, and its filter size is often chosen empirically (sometimes occupying a 3x3 pixel region), which can lead to overly smooth results when zooming out. The proposed 2D Mip filter, by closely mimicking physical pixel integration, provides a more principled way to handle aliasing in screen space.
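
A sketch of the 2D Mip filter for one projected Gaussian (illustrative code; note that, unlike the original dilation, the determinant ratio rescales the Gaussian so its integrated brightness is preserved):

```python
import numpy as np

def mip_filter_2d(Sigma_2d, s=0.1):
    """2D Mip filter: add roughly one pixel of isotropic variance and renormalize.

    Unlike the original dilation, which only inflates the covariance, the
    determinant ratio rescales the Gaussian so its integrated brightness stays
    unchanged. s = 0.1 is the value reported in the paper.
    """
    Sigma_mip = Sigma_2d + s * np.eye(2)
    scale = np.sqrt(np.linalg.det(Sigma_2d) / np.linalg.det(Sigma_mip))
    return Sigma_mip, scale

def eval_gaussian_2d(x, mean, Sigma, scale=1.0):
    """Evaluate a (possibly rescaled) 2D Gaussian at pixel location x."""
    d = np.asarray(x, dtype=float) - mean
    return scale * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))
```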

In combination, the 3D smoothing filter constrains the detail level of the 3D scene representation itself, preventing high-frequency artifacts upon magnification, while the 2D Mip filter ensures correct anti-aliased rendering at any screen-space resolution, particularly mitigating issues during minification (zoom-out).

5. Experimental Setup

5.1. Datasets

The experiments in the paper are conducted on two widely used benchmark datasets in Novel View Synthesis:

5.1.1. Blender Dataset

  • Source: Introduced by NeRF [28].
  • Characteristics: This is a synthetic dataset featuring eight Lambertian (non-specular) objects rendered under a hemisphere of cameras. The scenes are typically object-centric with simple backgrounds.
  • Scale: It contains images rendered from various viewpoints around the objects.
  • Domain: Synthetic, controlled environment.
  • Why chosen: It's a standard dataset for evaluating the quality of novel view synthesis methods in controlled, static environments, allowing for clear comparison of rendering fidelity.

5.1.2. Mip-NeRF 360 Dataset

  • Source: Introduced by Mip-NeRF 360 [2].
  • Characteristics: This is a more challenging dataset containing real-world, unbounded 360-degree scenes. It includes both indoor and outdoor environments, featuring complex lighting and varied textures. The scenes are often captured with a camera moving in a large trajectory, making reconstruction and generalization more difficult.
  • Scale: Contains numerous images per scene, covering a full 360-degree view.
  • Domain: Real-world, unbounded scenes.
  • Why chosen: This dataset is used to test the method's robustness in more complex and realistic scenarios, especially for handling unbounded environments and varying levels of detail, which are directly relevant to multi-scale rendering.

5.2. Evaluation Metrics

The performance of Mip-Splatting is evaluated using three standard image quality metrics:

5.2.1. PSNR (Peak Signal-to-Noise Ratio)

  • Conceptual Definition: PSNR is a quantitative metric used to measure the quality of reconstruction of lossy compression codecs or, in this case, rendered images compared to a ground truth image. It is most easily defined via the Mean Squared Error (MSE). A higher PSNR generally indicates a higher quality (less noisy) image. It is typically expressed in decibels (dB).
  • Mathematical Formula: First, the Mean Squared Error (MSE) between two images $I$ and $K$ is calculated: $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $. Then, PSNR is defined as: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
  • Symbol Explanation:
    • I(i,j) is the pixel value (e.g., intensity for grayscale or per-channel for color images) of the original image at coordinate (i,j).
    • K(i,j) is the pixel value of the reconstructed (rendered) image at coordinate (i,j).
    • $m$ is the number of rows (height) of the image.
    • $n$ is the number of columns (width) of the image.
    • $\mathrm{MAX}_I$ is the maximum possible pixel value of the image. For an 8-bit image, $\mathrm{MAX}_I = 255$.
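
A minimal NumPy implementation of PSNR for images normalized to [0, 1] (illustrative, assuming floating-point inputs):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """PSNR in dB between a rendered image and its reference (same shape, values in [0, max_val])."""
    mse = np.mean((np.asarray(img, dtype=np.float64) - np.asarray(ref, dtype=np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```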

5.2.2. SSIM (Structural Similarity Index)

  • Conceptual Definition: SSIM is a perceptual metric that quantifies the similarity between two images. Unlike PSNR, which measures absolute errors, SSIM attempts to model human visual perception, considering changes in structural information (e.g., patterns of pixels), luminance, and contrast. It ranges from -1 to 1, where 1 indicates perfect structural similarity.
  • Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  • Symbol Explanation:
    • $x$ and $y$ are the two image patches being compared.
    • $\mu_x$ is the average of $x$.
    • $\mu_y$ is the average of $y$.
    • $\sigma_x^2$ is the variance of $x$.
    • $\sigma_y^2$ is the variance of $y$.
    • $\sigma_{xy}$ is the covariance of $x$ and $y$.
    • $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are two small constants to avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit grayscale), and $k_1 = 0.01$, $k_2 = 0.03$ are common default values.
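
A simplified single-window SSIM in NumPy (the standard metric averages this statistic over small, e.g. 11×11, Gaussian-weighted windows; this global variant is only meant to make the formula concrete):

```python
import numpy as np

def ssim_global(x, y, max_val=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two grayscale images, using global statistics."""
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```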

5.2.3. LPIPS (Learned Perceptual Image Patch Similarity)

  • Conceptual Definition: LPIPS is a metric that measures the perceptual distance between two images, often referred to as "perceptual loss." It is based on the idea that deep neural networks trained for tasks like image classification learn features that are perceptually meaningful. LPIPS computes the distance between deep features extracted from two images using a pre-trained CNN (e.g., AlexNet, VGG). A lower LPIPS score indicates higher perceptual similarity.
  • Mathematical Formula: $ \mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left(\phi_l(x)_{h,w} - \phi_l(y)_{h,w}\right) \right\|_2^2 $
  • Symbol Explanation:
    • $x$ and $y$ are the two input images.
    • $\phi_l(\cdot)$ denotes the feature stack from layer $l$ of a pre-trained deep network (e.g., AlexNet).
    • $w_l$ is a learned per-channel scaling vector, trained to match human judgments.
    • $\odot$ denotes element-wise multiplication.
    • $H_l$ and $W_l$ are the height and width of the feature map at layer $l$.
    • $\|\cdot\|_2^2$ is the squared L2 norm.
    • The sum $\sum_l$ aggregates distances across different layers of the feature extractor.
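
In practice LPIPS is usually computed with a pretrained network via the reference `lpips` Python package; a minimal usage sketch (the random tensors merely stand in for rendered and ground-truth images scaled to [-1, 1]):

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')          # AlexNet backbone, the common default
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for a rendered image in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for the ground-truth image
distance = loss_fn(img0, img1)             # lower means perceptually more similar
print(distance.item())
```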

5.3. Baselines

The Mip-Splatting method is compared against a comprehensive set of state-of-the-art NVS methods, representing different paradigms:

  • NeRF [28]: The foundational Neural Radiance Field model.
  • NeRF w/o Larea [1, 28]: NeRF trained without the area-based loss reweighting used in Mip-NeRF's multi-scale benchmark, serving as a basic baseline.
  • MipNeRF [1]: An anti-aliased NeRF model that uses cone tracing and integrated positional encoding.
  • Plenoxels [11]: A grid-based explicit radiance field method, faster than NeRF.
  • TensoRF [4]: Another efficient explicit radiance field method using tensorial decompositions.
  • Instant-NGP [32]: A highly efficient hash grid-based method for NeRF, known for fast training and rendering.
  • Tri-MipRF [17]: An anti-aliased feature grid-based method, building on Mip-NeRF concepts for efficiency.
  • 3DGS [18]: The original 3D Gaussian Splatting method, which Mip-Splatting aims to improve upon.
  • 3DGS [18] + EWA [59]: A variant where the 2D dilation of 3DGS is replaced with an Elliptical Weighted Average (EWA) filter, a traditional anti-aliasing technique for splatting. This is a direct competitor for the 2D filtering aspect.
  • Mip-NeRF 360 [2]: An extension of Mip-NeRF designed for unbounded 360-degree scenes.
  • Zip-NeRF [3]: An anti-aliased grid-based neural radiance field method, a highly performant and recent NeRF variant.

5.4. Key Experimental Settings

  • Training Schedule: Models are trained for 30K iterations across all scenes, following the original 3DGS [18] schedule.

  • Loss Function: The same multi-view photometric loss function as in 3DGS [18] is used.

  • Gaussian Density Control: The adaptive adding and deleting of 3D Gaussians (density control mechanism) from 3DGS [18] is retained.

  • Sampling Rate Recomputation: The maximal sampling rate $\hat{\nu}_k$ for each 3D Gaussian primitive is recomputed every $m = 100$ iterations for efficiency, as Gaussian centers are relatively stable.

  • Hyperparameters for Mip-Splatting Filters:

    • Variance of the 2D Mip filter: $s = 0.1$, chosen to approximate a single pixel.
    • Variance of the 3D smoothing filter: $s = 0.2$.
    • Total variance in 2D projection for comparison: $0.1 + 0.2 = 0.3$. This is done for a fair comparison with 3DGS and 3DGS + EWA, which also implicitly apply a filter of some size.
  • Evaluation Scenarios:

    1. Multi-scale Training and Multi-scale Testing (Blender): Training is done with multi-scale data (sampling 40% full resolution, 20% from other resolutions), and evaluation is also done at multiple scales. This is a common setup for anti-aliasing methods.
    2. Single-scale Training and Multi-scale Testing (Blender, Zoom-Out): Models are trained on full-resolution images ($1\times$) and evaluated at lower resolutions ($1/2\times$, $1/4\times$, $1/8\times$) to simulate zoom-out effects. This tests generalization to unseen lower scales.
    3. Single-scale Training and Multi-scale Testing (Mip-NeRF 360, Zoom-In): Models are trained on data downsampled by a factor of 8 and evaluated at successively higher resolutions ($1\times$, $2\times$, $4\times$, $8\times$) to simulate zoom-in/super-resolution effects. This tests generalization to unseen higher scales.
    4. Single-scale Training and Same-scale Testing (Mip-NeRF 360): A standard in-distribution evaluation where models are trained and tested at the same scale (indoor scenes downsampled by 2, outdoor by 4). This confirms the method does not degrade performance in the standard setting.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate Mip-Splatting's effectiveness, especially in out-of-distribution scenarios (i.e., when rendering at sampling rates different from those used during training).

6.1.1. Multi-scale Training and Multi-scale Testing (Blender Dataset)

In this setting, where models are trained and evaluated with multi-scale data, Mip-Splatting achieves performance comparable to or superior to state-of-the-art methods like Mip-NeRF [1] and Tri-MipRF [17]. Crucially, it significantly outperforms the original 3DGS [18] and its EWA-filtered variant (3DGS [18] + EWA [59]) across all metrics (PSNR, SSIM, LPIPS), highlighting the benefit of its 2D Mip filter even when multi-scale information is available during training.

6.1.2. Single-scale Training and Multi-scale Testing (Blender Dataset, Zoom-Out Simulation)

This is a critical scenario testing the method's ability to generalize to lower sampling rates when only full-resolution images were available during training. Mip-Splatting significantly outperforms all existing state-of-the-art methods.

  • 3DGS [18]: Exhibits severe dilation artifacts at lower resolutions, confirming the paper's analysis of its fixed 2D dilation.
  • 3DGS [18] + EWA [59]: While performing better than plain 3DGS, it results in oversmoothed images, particularly at lower resolutions. This supports the paper's claim that EWA's general band-limiting, not tailored to single-pixel approximation, can be overly aggressive.
  • Mip-NeRF [1] and Tri-MipRF [17]: Although strong in multi-scale trained scenarios, they do not generalize as effectively when trained only on single-scale images and then asked to render at much lower scales, indicating a dependency on learned multi-scale signals.
  • Mip-Splatting: Maintains faithful image quality across various downsampled resolutions, effectively mitigating both dilation and aliasing artifacts without prior multi-scale training, demonstrating strong out-of-distribution generalization. Qualitatively, Figure 4 shows clear visual superiority.

6.1.3. Single-scale Training and Multi-scale Testing (Mip-NeRF 360 Dataset, Zoom-In Simulation)

This scenario tests generalization to higher sampling rates (super-resolution) when trained on downsampled data. Mip-Splatting performs comparably at the training scale ($1\times$) but significantly exceeds all state-of-the-art methods at higher resolutions ($2\times$, $4\times$, $8\times$).

  • Mip-NeRF 360 [2] and Zip-NeRF [3]: Show subpar performance at increased resolutions. The paper attributes this to their MLPs' inability to extrapolate to unseen higher frequencies or details not present in the lower-resolution training data. This highlights a fundamental limitation of learned implicit representations.
  • 3DGS [18]: Introduces erosion artifacts at higher resolutions due to its dilation operations, where the Gaussians become too small and effectively disappear, creating gaps.
  • 3DGS [18] + EWA [59]: Performs better than plain 3DGS but still yields pronounced high-frequency artifacts.
  • Mip-Splatting: Avoids these artifacts, producing aesthetically pleasing images that more closely resemble ground truth. This confirms the effectiveness of the 3D smoothing filter in constraining the maximal frequency of the 3D representation, preventing the generation of unresolvable fine details that would otherwise cause artifacts upon magnification. Qualitatively, Figure 5 provides visual evidence.

6.1.4. Single-scale Training and Same-scale Testing (Mip-NeRF 360 Dataset)

In the standard in-distribution setting, Mip-Splatting performs on par with 3DGS [18] and 3DGS [18] + EWA [59]. This is an important finding, as it confirms that the proposed modifications do not degrade performance in the scenario where original 3DGS is typically evaluated, demonstrating the method's versatility.

6.2. Data Presentation (Tables)

The following are the results from [Table 1] of the original paper:

| Method | PSNR↑ Full | PSNR↑ 1/2 | PSNR↑ 1/4 | PSNR↑ 1/8 | PSNR↑ Avg. | SSIM↑ Full | SSIM↑ 1/2 | SSIM↑ 1/4 | SSIM↑ 1/8 | SSIM↑ Avg. | LPIPS↓ Full | LPIPS↓ 1/2 | LPIPS↓ 1/4 | LPIPS↓ 1/8 | LPIPS↓ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NeRF w/o Larea [1, 28] | 31.20 | 30.65 | 26.25 | 22.53 | 27.66 | 0.950 | 0.956 | 0.930 | 0.871 | 0.927 | 0.055 | 0.034 | 0.043 | 0.075 | 0.052 |
| NeRF [28] | 29.90 | 32.13 | 33.40 | 29.47 | 31.23 | 0.938 | 0.959 | 0.973 | 0.962 | 0.958 | 0.074 | 0.040 | 0.024 | 0.039 | 0.044 |
| MipNeRF [1] | 32.63 | 34.34 | 35.47 | 35.60 | 34.51 | 0.958 | 0.970 | 0.979 | 0.983 | 0.973 | 0.047 | 0.026 | 0.017 | 0.012 | 0.026 |
| Plenoxels [11] | 31.60 | 32.85 | 30.26 | 26.63 | 30.34 | 0.956 | 0.967 | 0.961 | 0.936 | 0.955 | 0.052 | 0.032 | 0.045 | 0.077 | 0.051 |
| TensoRF [4] | 32.11 | 33.03 | 30.45 | 26.80 | 30.60 | 0.956 | 0.966 | 0.962 | 0.939 | 0.956 | 0.056 | 0.038 | 0.047 | 0.076 | 0.054 |
| Instant-NGP [32] | 30.00 | 32.15 | 33.31 | 29.35 | 31.20 | 0.939 | 0.961 | 0.974 | 0.963 | 0.959 | 0.079 | 0.043 | 0.026 | 0.040 | 0.047 |
| Tri-MipRF [17]* | 32.65 | 34.24 | 35.02 | 35.53 | 34.36 | 0.958 | 0.971 | 0.980 | 0.987 | 0.974 | 0.047 | 0.027 | 0.018 | 0.012 | 0.026 |
| 3DGS [18] | 28.79 | 30.66 | 31.64 | 27.98 | 29.77 | 0.943 | 0.962 | 0.972 | 0.960 | 0.960 | 0.065 | 0.038 | 0.025 | 0.031 | 0.040 |
| 3DGS [18] + EWA [59] | 31.54 | 33.26 | 33.78 | 33.48 | 33.01 | 0.961 | 0.973 | 0.979 | 0.983 | 0.974 | 0.043 | 0.026 | 0.021 | 0.019 | 0.027 |
| Mip-Splatting (ours) | 32.81 | 34.49 | 35.45 | 35.50 | 34.56 | 0.967 | 0.977 | 0.983 | 0.988 | 0.979 | 0.035 | 0.019 | 0.013 | 0.010 | 0.019 |

The following are the results from [Table 2] of the original paper:

| Method | PSNR↑ Full | PSNR↑ 1/2 | PSNR↑ 1/4 | PSNR↑ 1/8 | PSNR↑ Avg. | SSIM↑ Full | SSIM↑ 1/2 | SSIM↑ 1/4 | SSIM↑ 1/8 | SSIM↑ Avg. | LPIPS↓ Full | LPIPS↓ 1/2 | LPIPS↓ 1/4 | LPIPS↓ 1/8 | LPIPS↓ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NeRF [28] | 31.48 | 32.43 | 30.29 | 26.70 | 30.23 | 0.949 | 0.962 | 0.964 | 0.951 | 0.956 | 0.061 | 0.041 | 0.044 | 0.067 | 0.053 |
| MipNeRF [1] | 33.08 | 33.31 | 30.91 | 27.97 | 31.31 | 0.961 | 0.970 | 0.969 | 0.961 | 0.965 | 0.045 | 0.031 | 0.036 | 0.052 | 0.041 |
| TensoRF [4] | 32.53 | 32.91 | 30.01 | 26.45 | 30.48 | 0.960 | 0.969 | 0.965 | 0.948 | 0.961 | 0.044 | 0.031 | 0.044 | 0.073 | 0.048 |
| Instant-NGP [32] | 33.09 | 33.00 | 29.84 | 26.33 | 30.57 | 0.962 | 0.969 | 0.964 | 0.947 | 0.961 | 0.044 | 0.033 | 0.046 | 0.075 | 0.049 |
| Tri-MipRF [17] | 32.89 | 32.84 | 28.29 | 23.87 | 29.47 | 0.958 | 0.967 | 0.951 | 0.913 | 0.947 | 0.046 | 0.033 | 0.046 | 0.075 | 0.050 |
| 3DGS [18] | 33.33 | 26.95 | 21.38 | 17.69 | 24.84 | 0.969 | 0.949 | 0.875 | 0.766 | 0.890 | 0.030 | 0.032 | 0.066 | 0.121 | 0.063 |
| 3DGS [18] + EWA [59] | 33.51 | 31.66 | 27.82 | 24.63 | 29.40 | 0.969 | 0.971 | 0.959 | 0.940 | 0.960 | 0.032 | 0.024 | 0.033 | 0.047 | 0.034 |
| Mip-Splatting (ours) | 33.36 | 34.00 | 31.85 | 28.67 | 31.97 | 0.969 | 0.977 | 0.978 | 0.973 | 0.974 | 0.031 | 0.019 | 0.019 | 0.026 | 0.024 |

The following are the results from [Table 3] of the original paper:

| Method | PSNR↑ 1× | PSNR↑ 2× | PSNR↑ 4× | PSNR↑ 8× | PSNR↑ Avg. | SSIM↑ 1× | SSIM↑ 2× | SSIM↑ 4× | SSIM↑ 8× | SSIM↑ Avg. | LPIPS↓ 1× | LPIPS↓ 2× | LPIPS↓ 4× | LPIPS↓ 8× | LPIPS↓ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instant-NGP [32] | 26.79 | 24.76 | 24.27 | 24.27 | 25.02 | 0.746 | 0.639 | 0.626 | 0.698 | 0.677 | 0.239 | 0.367 | 0.445 | 0.475 | 0.382 |
| mip-NeRF 360 [2] | 29.26 | 25.18 | 24.16 | 24.10 | 25.67 | 0.860 | 0.727 | 0.670 | 0.706 | 0.741 | 0.122 | 0.260 | 0.370 | 0.428 | 0.295 |
| zip-NeRF [3] | 29.66 | 23.27 | 20.87 | 20.27 | 23.52 | 0.875 | 0.696 | 0.565 | 0.559 | 0.674 | 0.097 | 0.257 | 0.421 | 0.494 | 0.318 |
| 3DGS [18] | 29.19 | 23.50 | 20.71 | 19.59 | 23.25 | 0.880 | 0.740 | 0.619 | 0.619 | 0.715 | 0.107 | 0.243 | 0.394 | 0.476 | 0.305 |
| 3DGS [18] + EWA [59] | 29.30 | 25.90 | 23.70 | 22.81 | 25.43 | 0.880 | 0.775 | 0.667 | 0.643 | 0.741 | 0.114 | 0.236 | 0.369 | 0.449 | 0.292 |
| Mip-Splatting (ours) | 29.39 | 27.39 | 26.47 | 26.22 | 27.37 | 0.884 | 0.808 | 0.754 | 0.765 | 0.803 | 0.108 | 0.205 | 0.305 | 0.392 | 0.252 |

The following are the results from [Table 4] of the original paper:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| NeRF [9, 28] | 23.85 | 0.605 | 0.451 |
| mip-NeRF [1] | 24.04 | 0.616 | 0.441 |
| NeRF++ [56] | 25.11 | 0.676 | 0.375 |
| Plenoxels [11] | 23.08 | 0.626 | 0.463 |
| Instant NGP [32, 52] | 25.68 | 0.705 | 0.302 |
| mip-NeRF 360 [2, 30] | 27.57 | 0.793 | 0.234 |
| Zip-NeRF [3] | 28.54 | 0.828 | 0.189 |
| 3DGS [18] | 27.21 | 0.815 | 0.214 |
| 3DGS [18]* | 27.70 | 0.826 | 0.202 |
| 3DGS [18] + EWA [59] | 27.77 | 0.826 | 0.206 |
| Mip-Splatting (ours) | 27.79 | 0.827 | 0.203 |

6.3. Ablation Studies / Parameter Analysis

The paper conducts ablation studies to isolate and evaluate the contribution of each proposed component: the 3D smoothing filter and the 2D Mip filter.

6.3.1. Effectiveness of the 3D Smoothing Filter

This ablation is conducted in the single-scale training (downsampled by 8) and multi-scale testing (zoom-in simulation) setting on the Mip-NeRF 360 dataset.

The following are the results from [Table 5] of the original paper:

| Method | PSNR↑ 1× | PSNR↑ 2× | PSNR↑ 4× | PSNR↑ 8× | PSNR↑ Avg. | SSIM↑ 1× | SSIM↑ 2× | SSIM↑ 4× | SSIM↑ 8× | SSIM↑ Avg. | LPIPS↓ 1× | LPIPS↓ 2× | LPIPS↓ 4× | LPIPS↓ 8× | LPIPS↓ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DGS [18] | 29.19 | 23.50 | 20.71 | 19.59 | 23.25 | 0.880 | 0.740 | 0.619 | 0.619 | 0.715 | 0.107 | 0.243 | 0.394 | 0.476 | 0.305 |
| 3DGS [18] + EWA [59] | 29.30 | 25.90 | 23.70 | 22.81 | 25.43 | 0.880 | 0.775 | 0.667 | 0.643 | 0.741 | 0.114 | 0.236 | 0.369 | 0.449 | 0.292 |
| Mip-Splatting (ours) | 29.39 | 27.39 | 26.47 | 26.22 | 27.37 | 0.884 | 0.808 | 0.754 | 0.765 | 0.803 | 0.108 | 0.205 | 0.305 | 0.392 | 0.252 |
| Mip-Splatting (ours) w/o 3D smoothing filter | 29.41 | 27.09 | 25.83 | 25.38 | 26.93 | 0.881 | 0.795 | 0.722 | 0.713 | 0.778 | 0.107 | 0.214 | 0.342 | 0.424 | 0.272 |
| Mip-Splatting (ours) w/o 2D Mip filter | 29.29 | 27.22 | 26.31 | 26.08 | 27.23 | 0.882 | 0.798 | 0.742 | 0.759 | 0.795 | 0.107 | 0.214 | 0.319 | 0.407 | 0.262 |

As seen in Table 5, removing the 3D smoothing filter (Mip-Splatting (ours) w/o 3D smoothing filter) leads to a notable decline in performance, particularly at higher resolutions (e.g., PSNR drops from 27.39 to 27.09 at 2x Res., and from 26.47 to 25.83 at 4x Res.). This confirms that the 3D smoothing filter is crucial for mitigating high-frequency artifacts when rendering at resolutions higher than the training data, effectively enabling zoom-in capabilities. Qualitatively, Figure 6 further illustrates that omitting the 3D smoothing filter results in visible high-frequency artifacts. The paper also notes that excluding the 2D Mip filter causes only a slight decline in this zoom-in scenario, which is expected since its primary role is for zoom-out artifacts. Furthermore, the absence of both filters leads to memory errors due to an excessive generation of small Gaussians, emphasizing the necessity of some form of regularization.

6.3.2. Effectiveness of the 2D Mip Filter

This ablation evaluates the 2D Mip filter in the single-scale training (full resolution) and multi-scale testing (zoom-out simulation) setting on the Blender dataset.

The following are the results from [Table 6] of the original paper:

| Method | PSNR↑ Full | PSNR↑ 1/2 | PSNR↑ 1/4 | PSNR↑ 1/8 | PSNR↑ Avg. | SSIM↑ Full | SSIM↑ 1/2 | SSIM↑ 1/4 | SSIM↑ 1/8 | SSIM↑ Avg. | LPIPS↓ Full | LPIPS↓ 1/2 | LPIPS↓ 1/4 | LPIPS↓ 1/8 | LPIPS↓ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DGS [18] | 33.33 | 26.95 | 21.38 | 17.69 | 24.84 | 0.969 | 0.949 | 0.875 | 0.766 | 0.890 | 0.030 | 0.032 | 0.066 | 0.121 | 0.063 |
| 3DGS [18] + EWA [59] | 33.51 | 31.66 | 27.82 | 24.63 | 29.40 | 0.969 | 0.971 | 0.959 | 0.940 | 0.960 | 0.032 | 0.024 | 0.033 | 0.047 | 0.034 |
| 3DGS [18] - Dilation | 33.38 | 33.06 | 29.68 | 26.19 | 30.58 | 0.969 | 0.973 | 0.964 | 0.945 | 0.963 | 0.030 | 0.024 | 0.041 | 0.075 | 0.042 |
| Mip-Splatting (ours) | 33.36 | 34.00 | 31.85 | 28.67 | 31.97 | 0.969 | 0.977 | 0.978 | 0.973 | 0.974 | 0.031 | 0.019 | 0.019 | 0.026 | 0.024 |
| Mip-Splatting (ours) w/ 3D smoothing filter | 33.67 | 34.16 | 31.56 | 28.20 | 31.90 | 0.970 | 0.977 | 0.978 | 0.971 | 0.974 | 0.030 | 0.018 | 0.019 | 0.027 | 0.024 |
| Mip-Splatting (ours) w/o 2D Mip filter | 33.51 | 33.38 | 29.87 | 26.28 | 30.76 | 0.970 | 0.975 | 0.966 | 0.946 | 0.964 | 0.031 | 0.022 | 0.039 | 0.073 | 0.041 |

Table 6 shows that Mip-Splatting significantly outperforms all baseline methods, including 3DGS, 3DGS + EWA, and 3DGS with dilation removed (3DGS - Dilation). Removing the original dilation (3DGS - Dilation) improves performance over 3DGS, but it still suffers from aliasing artifacts due to the lack of any anti-aliasing. The Mip-Splatting (ours) w/o 2D Mip filter variant, which still includes the 3D smoothing filter, shows a notable decline in performance compared to the full Mip-Splatting, especially at lower resolutions (e.g., LPIPS worsens from 0.026 to 0.073 at 1/8 Res.). This validates the critical role of the 2D Mip filter in anti-aliasing and mitigating zoom-out artifacts. The row labeled Mip-Splatting (ours) w/ 3D smoothing filter performs on par with the full model, which is expected since the 3D smoothing filter's main impact is on zoom-in rather than zoom-out.

6.3.3. Single-scale Training and Multi-scale Testing (Mip-NeRF 360, Combined Zoom)

This experiment evaluates both zoom-in and zoom-out effects on the Mip-NeRF 360 dataset, training on images downsampled by a factor of 4 and evaluating at multiple resolutions ($1/4\times$, $1/2\times$, $1\times$, $2\times$, $4\times$).

The following are the results from [Table 7] of the original paper:

| Method | PSNR↑ 1/4 | PSNR↑ 1/2 | PSNR↑ 1× | PSNR↑ 2× | PSNR↑ 4× | PSNR↑ Avg. | SSIM↑ 1/4 | SSIM↑ 1/2 | SSIM↑ 1× | SSIM↑ 2× | SSIM↑ 4× | SSIM↑ Avg. | LPIPS↓ 1/4 | LPIPS↓ 1/2 | LPIPS↓ 1× | LPIPS↓ 2× | LPIPS↓ 4× | LPIPS↓ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DGS [18] | 20.85 | 24.66 | 28.01 | 25.08 | 23.37 | 24.39 | 0.681 | 0.812 | 0.834 | 0.766 | 0.735 | 0.765 | 0.203 | 0.158 | 0.275 | 0.383 | 0.331 | 0.270 |
| 3DGS [18] + EWA [59] | 27.40 | 28.39 | 28.09 | 26.43 | 25.30 | 27.12 | 0.888 | 0.871 | 0.833 | 0.774 | 0.738 | 0.821 | 0.103 | 0.126 | 0.166 | 0.276 | 0.385 | 0.212 |
| Mip-Splatting (ours) | 28.98 | 29.02 | 28.09 | 27.25 | 26.95 | 28.06 | 0.908 | 0.880 | 0.835 | 0.798 | 0.800 | 0.844 | 0.086 | 0.114 | 0.168 | 0.248 | 0.331 | 0.189 |
| Mip-Splatting (ours) w/o 3D smoothing filter | 28.69 | 28.94 | 28.05 | 27.06 | 26.61 | 27.87 | 0.905 | 0.879 | 0.833 | 0.790 | 0.780 | 0.837 | 0.088 | 0.115 | 0.168 | 0.261 | 0.359 | 0.198 |
| Mip-Splatting (ours) w/o 2D Mip filter | 26.09 | 28.04 | 28.05 | 27.27 | 27.00 | 27.29 | 0.815 | 0.856 | 0.834 | 0.798 | 0.802 | 0.821 | 0.167 | 0.132 | 0.167 | 0.249 | 0.335 | 0.210 |

The results in Table 7 further confirm the individual contributions of both filters. Mip-Splatting significantly outperforms baselines across all scales.

  • Mip-Splatting (ours) w/o 3D smoothing filter: Shows degradation, particularly at higher resolutions (zoom-in), leading to high-frequency artifacts (Figure 7).
  • Mip-Splatting (ours) w/o 2D Mip filter: Shows significant degradation at lower resolutions (zoom-out), leading to aliasing artifacts (Figure 7). This combined experiment clearly validates that both the 3D smoothing filter and the 2D Mip filter are essential for achieving robust, alias-free rendering across a wide range of sampling rates, effectively handling both zoom-in and zoom-out scenarios.

The following are the results from [Table 8] of the original paper:

PSNR ↑ SSIM ↑ LPIPS ↓
chair drums ficus hotdog lego materials mic ship Average chair drums ficus hotdog lego materials mic ship Average chair drums ficus hotdog lego materials mic ship Average
NeRF w/o L_area [1, 28] 29.92 23.27 27.15 32.00 35.64 27.75 26.30 28.40 26.46 27.66 0.944 0.891 0.942 0.959 0.926 0.934 0.958 0.861 0.927 0.035 0.069 0.032 0.028 0.041 0.045 0.031 0.095 0.052
NeRF [28] 33.39 25.87 30.37 31.65 30.18 32.60 30.09 31.23 0.971 0.932 0.971 0.979 0.965 0.967 0.980 0.900 0.958 0.028 0.059 0.032 0.028 0.041 0.045 0.031 0.085 0.044
MipNeRF [1] 37.14 27.02 33.19 39.31 35.74 32.56 38.04 33.08 34.51 0.988 0.945 0.984 0.988 0.984 0.977 0.993 0.922 0.973 0.011 0.044 0.014 0.012 0.013 0.019 0.007 0.062 0.026
Plenoxels [11] 32.79 25.25 30.28 34.65 31.26 28.33 31.53 28.59 30.34 0.968 0.929 0.972 0.976 0.964 0.959 0.979 0.892 0.955 0.040 0.070 0.032 0.037 0.038 0.055 0.036 0.104 0.051
TensoRF [4] 32.47 25.37 31.16 34.96 31.73 28.53 31.48 29.08 30.60 0.967 0.930 0.974 0.977 0.967 0.957 0.978 0.895 0.956 0.042 0.075 0.032 0.035 0.036 0.063 0.040 0.112 0.054
Instant-ngp [32] 32.95 26.43 30.41 35.87 31.83 29.31 32.58 30.23 31.20 0.971 0.940 0.973 0.979 0.966 0.959 0.981 0.904 0.959 0.035 0.066 0.029 0.028 0.040 0.051 0.032 0.095 0.047
Tri-MipRF [17]* 37.67 27.35 33.57 38.78 35.72 31.42 37.63 32.74 34.36 0.990 0.951 0.985 0.988 0.986 0.969 0.992 0.929 0.974 0.011 0.046 0.016 0.014 0.013 0.033 0.008 0.069 0.026
3DGS [18] 32.73 25.30 29.00 35.03 29.44 27.13 31.17 28.33 29.77 0.976 0.941 0.968 0.982 0.964 0.956 0.979 0.910 0.960 0.025 0.056 0.030 0.022 0.038 0.040 0.023 0.086 0.040
3DGS [18] + EWA [59] 35.77 27.14 33.65 37.74 32.75 30.21 35.21 31.63 33.01 0.986 0.958 0.988 0.988 0.979 0.972 0.990 0.929 0.974 0.017 0.039 0.013 0.016 0.024 0.026 0.011 0.070 0.027
Mip-Splatting (ours) 37.48 27.74 34.71 39.15 35.07 31.88 37.68 32.80 34.56 0.991 0.963 0.990 0.990 0.987 0.978 0.994 0.936 0.979 0.010 0.031 0.009 0.011 0.012 0.018 0.005 0.059 0.019

The following are the results from [Table 9] of the original paper:

chair drums ficus hotdog lego materials mic ship Average
PSNR NeRF [28] 31.99 25.31 30.74 34.45 30.69 28.86 31.41 28.36 30.23
MipNeRF [1] 32.89 25.58 31.80 35.40 32.24 29.46 33.26 29.88 31.31
TensoRF [4] 32.17 25.51 31.19 34.69 31.46 28.60 31.50 28.71 30.48
Instant-ngp [32] 32.18 25.05 31.32 34.85 31.53 28.59 32.15 28.84 30.57
Tri-MipRF [17] 32.48 24.01 28.41 34.45 30.41 27.82 31.19 27.02 29.47
3DGS [18] 26.81 21.17 26.02 28.80 25.36 23.10 24.39 23.05 24.84
3DGS [18] + EWA [59] 32.85 24.91 31.94 33.33 29.76 27.36 27.68 27.41 29.40
Mip-Splatting (ours) 35.69 26.50 32.99 36.18 32.76 30.01 31.66 29.98 31.97
SSIM NeRF [28] 0.968 0.936 0.976 0.977 0.963 0.964 0.980 0.887 0.956
MipNeRF [1] 0.974 0.939 0.981 0.982 0.973 0.969 0.987 0.915 0.965
TensoRF [4] 0.970 0.938 0.978 0.979 0.970 0.963 0.981 0.906 0.961
Instant-ngp [32] 0.970 0.935 0.977 0.980 0.969 0.962 0.982 0.909 0.961
Tri-MipRF [17] 0.971 0.908 0.957 0.975 0.957 0.953 0.975 0.883 0.947
3DGS [18] 0.915 0.851 0.921 0.930 0.882 0.882 0.909 0.827 0.890
3DGS [18] + EWA [59] 0.978 0.942 0.983 0.977 0.964 0.958 0.963 0.912 0.960
Mip-Splatting (ours) 0.988 0.958 0.988 0.987 0.982 0.974 0.986 0.930 0.974
LPIPS NeRF [28] 0.040 0.067 0.027 0.034 0.043 0.049 0.035 0.132 0.053
MipNeRF [1] 0.033 0.062 0.022 0.025 0.030 0.041 0.023 0.092 0.041
TensoRF [4] 0.036 0.066 0.027 0.030 0.035 0.052 0.034 0.102 0.048
Instant-ngp [32] 0.036 0.074 0.035 0.030 0.035 0.054 0.034 0.096 0.049
Tri-MipRF [17] 0.026 0.086 0.041 0.023 0.036 0.048 0.023 0.117 0.050
3DGS [18] 0.047 0.087 0.055 0.034 0.064 0.055 0.046 0.113 0.063
3DGS [18] + EWA [59] 0.023 0.051 0.017 0.018 0.033 0.027 0.024 0.077 0.034
Mip-Splatting (ours) 0.014 0.035 0.012 0.014 0.016 0.019 0.015 0.066 0.024

The following are the results from [Table 10] of the original paper:

bicycle flowers garden stump treehill room counter kitchen bonsai
PSNR NeRF [9, 28] 21.76 19.40 23.11 21.73 21.28 28.56 25.67 26.31 26.81
mip-NeRF [1] 21.69 19.31 23.16 23.10 21.21 28.73 25.59 26.47 27.13
NeRF++ [56] 22.64 20.31 24.32 24.34 22.20 28.87 26.38 27.80 29.15
Plenoxels [11] 21.91 20.10 23.49 20.661 22.25 27.59 23.62 23.42 24.67
Instant NGP [32, 52] 22.79 19.19 25.26 24.80 22.46 30.31 26.21 29.00 31.08
mip-NeRF 360 [2, 30] 24.40 21.64 26.94 26.36 22.81 31.40 29.44 32.02 33.11
Zip-NeRF [3] 25.80 22.40 28.20 27.55 23.89 32.65 29.38 32.50 34.46
3DGS [18] 25.25 21.52 27.41 26.55 22.49 30.63 28.70 30.32 31.98
3DGS [18]* 25.63 21.77 27.70 26.87 22.75 31.69 29.08 31.56 32.29
3DGS [18] + EWA [59] 25.64 21.86 27.65 26.87 22.91 31.68 29.21 31.59 32.51
Mip-Splatting (ours) 25.72 21.93 27.76 26.94 22.98 31.74 29.16 31.55 32.31
SSIM NeRF [9, 28] 0.455 0.376 0.546 0.453 0.459 0.843 0.775 0.749 0.792
mip-NeRF [1] 0.454 0.373 0.543 0.517 0.466 0.851 0.779 0.745 0.818
NeRF++ [56] 0.526 0.453 0.635 0.594 0.530 0.852 0.802 0.816 0.876
Plenoxels [11] 0.496 0.431 0.606 0.523 0.509 0.842 0.759 0.648 0.814
Instant NGP [32, 52] 0.540 0.378 0.709 0.654 0.547 0.893 0.845 0.857 0.924
mip-NeRF 360 [2, 30] 0.693 0.583 0.816 0.746 0.632 0.913 0.895 0.920 0.939
Zip-NeRF [3] 0.769 0.642 0.860 0.800 0.681 0.925 0.902 0.928 0.949
3DGS [18] 0.771 0.605 0.868 0.775 0.638 0.914 0.905 0.922 0.938
3DGS [18]* 0.777 0.622 0.873 0.783 0.652 0.928 0.916 0.933 0.948
3DGS [18] + EWA [59] 0.777 0.620 0.871 0.784 0.655 0.927 0.916 0.933 0.948
Mip-Splatting (ours) 0.780 0.623 0.875 0.786 0.655 0.928 0.916 0.933 0.948
LPIPS NeRF [9, 28] 0.536 0.529 0.415 0.551 0.546 0.353 0.394 0.335 0.398
mip-NeRF [1] 0.541 0.535 0.422 0.490 0.538 0.346 0.390 0.336 0.370
NeRF++ [56] 0.455 0.466 0.331 0.416 0.466 0.335 0.351 0.260 0.291
Plenoxels [11] 0.506 0.521 0.3864 0.503 0.540 0.419 0.441 0.447 0.398
Instant NGP [32, 52] 0.398 0.441 0.255 0.339 0.420 0.242 0.255 0.170 0.198
mip-NeRF 360 [2, 30] 0.289 0.345 0.164 0.254 0.338 0.211 0.203 0.126 0.177
Zip-NeRF [3] 0.208 0.273 0.118 0.193 0.242 0.196 0.185 0.116 0.173
3DGS [18] 0.205 0.336 0.103 0.210 0.317 0.220 0.204 0.129 0.205
3DGS [18]* 0.205 0.329 0.103 0.208 0.318 0.192 0.178 0.113 0.174
3DGS [18] + EWA [59] 0.213 0.335 0.111 0.210 0.325 0.192 0.179 0.113 0.173
Mip-Splatting (ours) 0.206 0.331 0.103 0.209 0.320 0.192 0.179 0.113 0.173

The following are the results from [Table 11] of the original paper:

bicycle flowers garden stump treehill room counter kitchen bonsai
PSNR Instant-NGP [32] 22.51 20.25 24.65 23.15 22.24 29.48 26.18 27.10 29.66
mip-NeRF 360 [2] 24.21 21.60 25.82 25.59 22.78 29.58 27.72 28.78 31.63
zip-NeRF [3] 23.05 20.05 18.07 23.94 22.53 20.51 26.08 27.37 30.05
3DGS [18] 21.34 19.43 21.94 22.63 20.91 28.10 25.33 23.68 25.89
3DGS [18] + EWA [59] 23.74 20.94 24.69 24.81 21.93 29.80 27.23 27.07 28.63
Mip-Splatting (ours) 25.26 22.02 26.78 26.65 22.92 31.56 28.87 30.73 31.49
SSIM Instant-NGP [32] 0.538 0.473 0.647 0.590 0.544 0.868 0.795 0.764 0.877
mip-NeRF 360 [2] 0.662 0.567 0.716 0.715 0.628 0.795 0.845 0.828 0.910
zip-NeRF [3] 0.640 0.521 0.548 0.661 0.590 0.655 0.784 0.800 0.865
3DGS [18] 0.638 0.536 0.675 0.662 0.591 0.878 0.826 0.789 0.838
3DGS [18] + EWA [59] 0.671 0.563 0.613 0.718 0.693 0.608 0.889 0.843 0.813
Mip-Splatting (ours) 0.738 0.786 0.776 0.659 0.921 0.897 0.903 0.933
LPIPS Instant-NGP [32] 0.500 0.486 0.372 0.469 0.511 0.270 0.310 0.286 0.229
mip-NeRF 360 [2] 0.358 0.400 0.296 0.333 0.391 0.256 0.228 0.210 0.182
zip-NeRF [3] 0.353 0.397 0.346 0.349 0.353 0.366 0.302 0.277 0.232
3DGS [18] 0.336 0.406 0.295 0.406 0.405 0.223 0.239 0.245 0.242
3DGS [18] + EWA [59] 0.322 0.395 0.281 0.334 0.217 0.231 0.216 0.227
Mip-Splatting (ours) 0.281 0.373 0.233 0.281 0.369 0.193 0.199 0.165 0.176

6.4. Visual Results

The visual results presented in the paper (Figures 2, 4, 5, 6, 7, 8, and 9) consistently support the quantitative findings. Figure 2 illustrates how Mip-Splatting renders faithful images across scales (8× resolution, full resolution, 1/4 resolution), whereas 3DGS and 3DGS + EWA show strong artifacts (erosion or blurriness). Similarly, Figure 4 (Blender) and Figure 5 (Mip-NeRF 360) demonstrate the qualitative superiority of Mip-Splatting, with cleaner textures, sharper edges, and fewer artifacts such as holes or over-smoothing across zoom levels. The bicycle wheel spokes in Figure 1, in particular, show how Mip-Splatting preserves fine details without dilation. The ablation figures (Figures 6 and 7) visually confirm the distinct roles of the 3D smoothing filter (preventing high-frequency artifacts during zoom-in) and the 2D Mip filter (preventing aliasing during zoom-out).

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Mip-Splatting, an enhanced version of 3D Gaussian Splatting, to achieve alias-free novel view synthesis across arbitrary scales. This is accomplished through two primary innovations: a 3D smoothing filter and a 2D Mip filter. The 3D smoothing filter regularizes the maximum frequency of 3D Gaussian primitives based on the actual sampling rates of the training views, thereby preventing high-frequency artifacts (e.g., erosion) during magnification. The 2D Mip filter replaces the problematic 2D dilation, approximating a physical 2D box filter to effectively mitigate aliasing and dilation artifacts, especially during minification. Experimental results on both synthetic (Blender) and real-world (Mip-NeRF 360) datasets demonstrate that Mip-Splatting is competitive with state-of-the-art methods in in-distribution settings and significantly outperforms them in out-of-distribution scenarios, particularly when changing camera focal length or distance (zoom-in/out). The method's ability to achieve robust multi-scale rendering from single-scale training data is a key strength, offering better generalization and practical applicability.

7.2. Limitations & Future Work

The authors acknowledge a few limitations and suggest avenues for future work:

  • Gaussian Approximation Error: The 2D Mip filter approximates a theoretical 2D box filter (for perfect pixel integration) with a Gaussian filter. This approximation introduces errors, especially when the Gaussian itself is small in screen space. The paper notes that these errors tend to increase with greater zoom-out factors, as evidenced in Table 2.
  • Training Overhead: The calculation of the maximal sampling rate for each 3D Gaussian (required by the 3D smoothing filter) introduces a slight increase in training overhead. Currently, this computation is performed using PyTorch, and the authors suggest that a more efficient CUDA implementation could reduce this overhead.
  • Data Structure for Sampling Rate: The sampling rate computation depends solely on camera poses and intrinsics. The authors propose that designing a better data structure for precomputing and storing this information could further improve efficiency.
  • No Additional Rendering Overhead: The 3D smoothing filter is fused with the Gaussian primitives per Equation 9, so it becomes part of the 3D representation itself; after training, rendering therefore incurs no extra cost. A minimal sketch of the sampling-rate computation and this fusion step follows this list.
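
As a rough illustration of these two training-time steps, here is a minimal PyTorch-style sketch of (1) the per-Gaussian maximal sampling rate computed from the training cameras and (2) fusing a correspondingly sized low-pass term into each primitive (cf. Equation 9). The tensor layouts, the crude visibility test, and the hyperparameter value are simplifying assumptions, not the authors' CUDA implementation.

```python
import torch

def max_sampling_rate(means: torch.Tensor, cams: list) -> torch.Tensor:
    """means: (N, 3) Gaussian centers in world space.
    cams:  list of dicts with a 4x4 world-to-camera matrix 'w2c' and focal length 'fx'.
    Returns nu: (N,) maximal sampling rate nu_k = max_n fx_n / depth_{k,n}, taken over
    cameras in which the Gaussian lies in front of the image plane (a crude visibility proxy)."""
    ones = torch.ones(means.shape[0], 1, device=means.device)
    homog = torch.cat([means, ones], dim=1)                    # (N, 4) homogeneous points
    nu = torch.zeros(means.shape[0], device=means.device)
    for cam in cams:
        depth = (homog @ cam["w2c"].T)[:, 2]                   # z in the camera frame
        rate = cam["fx"] / depth.clamp(min=1e-2)
        nu = torch.maximum(nu, torch.where(depth > 1e-2, rate, torch.zeros_like(rate)))
    return nu

def fuse_3d_smoothing(cov3d: torch.Tensor, opacity: torch.Tensor,
                      nu: torch.Tensor, s: float = 0.2):
    """cov3d: (N, 3, 3), opacity: (N,), nu: (N,). Adds an isotropic low-pass term whose
    size shrinks as the sampling rate grows (cf. the paper's Eq. 9; s is a hyperparameter)
    and rescales opacity so each Gaussian's integrated contribution is preserved."""
    eye = torch.eye(3, device=cov3d.device)
    var = (s / nu.clamp(min=1e-6))[:, None, None] * eye        # (N, 3, 3) smoothing term
    cov_reg = cov3d + var
    scale = torch.sqrt(torch.det(cov3d) / torch.det(cov_reg))
    return cov_reg, opacity * scale
```

Because the fused covariance and opacity simply replace the originals, the rasterizer is unchanged, which is exactly why no extra cost is paid at test time.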

7.3. Personal Insights & Critique

Mip-Splatting presents a highly valuable and practical improvement to 3D Gaussian Splatting. The shift from implicit (NeRF) to explicit (3DGS) representations brought immense speed benefits, but anti-aliasing remained a challenge. Mip-Splatting addresses this challenge with a principled approach, which is commendable.

  • Innovations and Strengths:

    • Principled Anti-Aliasing: The use of the Nyquist-Shannon Sampling Theorem to guide 3D Gaussian regularization is a strong theoretical foundation. This is a critical distinction from empirical filtering or learned, less interpretable anti-aliasing methods.
    • Single-Scale Training, Multi-Scale Generalization: This is perhaps the most significant practical advantage. Requiring only single-scale training data to achieve robust performance across various zoom levels simplifies data collection and training pipelines for real-world deployment. Previous methods often required multi-scale data, which is more cumbersome.
    • Closed-Form Analytical Solution: Modifying the Gaussians directly with analytical filters (instead of relying on neural networks to learn multi-scale properties) provides greater control and interpretability.
    • Direct Enhancement of 3DGS: By building directly on 3DGS, Mip-Splatting inherits its real-time rendering capabilities while mitigating its key artifactual drawbacks.
  • Potential Issues and Areas for Improvement:

    • Gaussian Approximation for Box Filter: While pragmatic for efficiency, the Gaussian approximation to the ideal box filter is a known limitation. Further research could explore more accurate yet efficient analytical approximations or hybrid approaches; a small numerical illustration of this approximation gap is sketched at the end of this section.
    • Hyperparameter Sensitivity: The scalar filter size s used by each of the two filters is chosen empirically. While effective, the optimal values might vary across scenes or application requirements; an adaptive or learned scheme for setting them could be explored.
    • Computational Overhead for Sampling Rate: Although the authors acknowledge and suggest improvements, the per-iteration recomputation of sampling rates during training could still be a bottleneck for extremely large scenes with millions of Gaussians, especially if not fully optimized in CUDA.
    • Dynamic Scenes: The current method, like 3DGS, is primarily designed for static scenes. Extending alias-free rendering to dynamic scenes, where Gaussian parameters change over time, would introduce new challenges.
  • Transferability and Future Value:

    • The core idea of constraining 3D primitive properties based on multi-view sampling rates could potentially be generalized to other primitive-based scene representations beyond Gaussians.
    • The principles of multi-scale anti-aliasing from single-scale training are highly valuable for fields like augmented reality (AR) and virtual reality (VR), where users frequently change their viewpoint and zoom level, and efficient, artifact-free rendering is paramount.
    • This work significantly advances the practicality of 3DGS, making it a more viable candidate for real-world computer vision and graphics applications that demand both quality and efficiency.
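
As a small, self-contained numerical illustration of the box-filter approximation gap noted above (not taken from the paper), the 1D sketch below compares exact pixel integration of a narrow Gaussian with the closed-form Gaussian low-pass that the 2D Mip filter corresponds to. The variance value s = 0.1 and the 1D simplification are assumptions for illustration.

```python
import math

def box_filtered_peak(sigma: float) -> float:
    """Value at the pixel center of a unit-height Gaussian exp(-x^2 / (2 sigma^2))
    convolved with a unit-width, unit-height box filter, i.e., its exact integral
    over one pixel."""
    return sigma * math.sqrt(2 * math.pi) * math.erf(0.5 / (sigma * math.sqrt(2)))

def mip_filtered_peak(sigma: float, s: float = 0.1) -> float:
    """Peak value when the box is replaced by a unit-area Gaussian of variance s,
    which is what the closed-form Mip filter corresponds to in 1D."""
    return sigma / math.sqrt(sigma ** 2 + s)

for sigma in [2.0, 1.0, 0.5, 0.2, 0.1]:
    exact, approx = box_filtered_peak(sigma), mip_filtered_peak(sigma)
    print(f"sigma={sigma:4.1f}  box={exact:.3f}  mip-approx={approx:.3f}")
```

For footprints of a pixel or more the two values are nearly identical, while for a footprint of a tenth of a pixel they differ by roughly 20%, consistent with the observation that the approximation error grows with the zoom-out factor.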
