Mip-Splatting: Alias-free 3D Gaussian Splatting
TL;DR Summary
The paper introduces a 3D smoothing filter that eliminates artifacts in 3D Gaussian Splatting by constraining the size of 3D Gaussian primitives based on sampling frequency, and effectively mitigates aliasing issues using a 2D Mip filter.
Abstract
Recently, 3D Gaussian Splatting has demonstrated impressive novel view synthesis results, reaching high fidelity and efficiency. However, strong artifacts can be observed when changing the sampling rate, e.g., by changing focal length or camera distance. We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem, we introduce a 3D smoothing filter which constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views, eliminating high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such as training on single-scale images and testing on multiple scales, validates the effectiveness of our approach.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Mip-Splatting: Alias-free 3D Gaussian Splatting."
1.2. Authors
- Zehao Yu (University of Tübingen, Tübingen AI Center)
- Anpei Chen (University of Tübingen, Tübingen AI Center)
- Binbin Huang (ShanghaiTech University)
- Torsten Sattler (Czech Technical University in Prague)
- Andreas Geiger (University of Tübingen, Tübingen AI Center)
The authors are affiliated with universities and research centers, indicating a strong academic background in computer vision and graphics. Andreas Geiger, in particular, is a well-known researcher in the field of autonomous driving and 3D vision.
1.3. Journal/Conference
The paper does not explicitly state the journal or conference it was published in, but a common practice for papers building upon 3D Gaussian Splatting (which was published in ACM Transactions on Graphics, a top-tier venue) is to target high-impact computer graphics or computer vision conferences/journals. Given the citation style (e.g., ICCV, CVPR) in the references, it is likely intended for a major conference in these fields.
1.4. Publication Year
The publication year is not explicitly mentioned in the provided text, but context from the abstract (e.g., "Recently, 3D Gaussian Splatting has demonstrated...") suggests it is a very recent work, likely 2023 or 2024, building on 3DGS [18] from 2023.
1.5. Abstract
The paper addresses significant artifact issues in 3D Gaussian Splatting (3DGS), specifically when changing sampling rates (e.g., varying focal length or camera distance). The authors attribute these artifacts to a lack of 3D frequency constraints and the use of a 2D dilation filter in the original 3DGS. To solve this, they introduce two main components:
- A 3D smoothing filter: This constrains the size of 3D Gaussian primitives based on the maximal sampling frequency derived from the input views, effectively eliminating high-frequency artifacts during zoom-in.
- A 2D Mip filter: This replaces the original 2D dilation filter, simulating a 2D box filter to mitigate aliasing and dilation issues, particularly during zoom-out. The evaluation, including scenarios like training on single-scale images and testing on multiple scales, demonstrates the effectiveness of their approach, termed Mip-Splatting, in producing alias-free novel view synthesis across various sampling rates.
1.6. Original Source Link
/files/papers/691356dc430ad52d5a9ef405/paper.pdf
This is a local file path, indicating the paper content was provided directly. Its publication status (e.g., officially published, preprint) is not specified.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the presence of strong artifacts in 3D Gaussian Splatting (3DGS) when the camera's sampling rate changes significantly from the training views. This includes actions like zooming in (increasing effective resolution) or zooming out (decreasing effective resolution), or changing camera distance.
This problem is highly important in the field of Novel View Synthesis (NVS) because 3DGS has emerged as a groundbreaking method due to its impressive rendering quality and real-time performance. However, practical applications in virtual reality, cinematography, or robotics often require robust performance across a wide range of camera parameters and scales. The artifacts observed in 3DGS (e.g., erosion when zooming in, dilation/aliasing when zooming out) limit its generalization and practical utility in these dynamic scenarios. Existing methods like Mip-NeRF address multi-scale issues for neural radiance fields but are not directly applicable to the explicit Gaussian representation and rasterization pipeline of 3DGS. The specific challenges are:
- Lack of 3D frequency constraints: 3DGS does not inherently limit the "fineness" or "detail level" of its 3D Gaussians based on the input image resolution, leading to high-frequency artifacts (erosion) when rendered at higher resolutions than trained.
- Problematic 2D dilation filter: The original 3DGS uses a fixed 2D dilation filter in screen space to prevent degenerate small Gaussians. While useful at the training scale, this filter causes incorrect radiance spread (dilation artifacts) when zooming out and ineffective filtering (leading to erosion artifacts) when zooming in.
The paper's entry point is to address these fundamental limitations of 3DGS by introducing principled anti-aliasing mechanisms directly into the Gaussian representation and rendering process. The innovative idea is to regularize the 3D Gaussians in 3D space based on multi-view sampling rates and to replace the problematic 2D dilation with a physically motivated 2D Mip filter.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:
- Introduction of a 3D smoothing filter for 3D Gaussian primitives: This filter effectively regularizes the maximum frequency (or minimum size) of 3D Gaussians based on the maximal sampling rate available from the training images for each primitive. This prevents the Gaussians from becoming excessively small and introducing high-frequency artifacts when rendered at higher resolutions (e.g., zooming in) than those seen during training. The filter becomes an intrinsic part of the 3D scene representation.
- Replacement of the 2D dilation filter with a 2D Mip filter: The original 2D dilation operation in 3DGS is replaced with a novel 2D Mip filter, which approximates the physical imaging process's 2D box filter (integrating light over a pixel area) using a 2D Gaussian. This effectively mitigates aliasing and dilation artifacts, particularly when rendering at lower sampling rates (e.g., zooming out).
- Demonstration of alias-free rendering across various sampling rates with single-scale training: The proposed Mip-Splatting method enables faithful rendering at different zoom levels and camera distances, even when trained exclusively on single-scale (full-resolution) images. This highlights excellent out-of-distribution generalization.
- Principled and simple modifications: The proposed changes are grounded in sampling theory and physical imaging processes, and are simple to integrate into the existing 3DGS framework.
The key conclusions and findings are that by constraining the 3D representation's frequency based on multi-view sampling and by implementing a more physically accurate 2D screen-space filter, Mip-Splatting effectively solves the aliasing and dilation artifacts observed in 3DGS. It significantly enhances the robustness and generalization capabilities of 3D Gaussian Splatting for novel view synthesis across different scales, making it more practical for real-world applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Novel View Synthesis (NVS)
Novel View Synthesis (NVS) is a subfield of computer graphics and computer vision focused on generating new images of a 3D scene from arbitrary camera viewpoints, given a set of existing images of that scene. The goal is to create photo-realistic images that look as if a camera was physically present at the new viewpoint. This has broad applications in virtual reality, augmented reality, 3D content creation, and robotics.
3.1.2. Neural Radiance Fields (NeRF)
Neural Radiance Fields (NeRF) [28] is a landmark technique in NVS. It represents a 3D scene using a multilayer perceptron (MLP) neural network. This MLP maps a 3D coordinate $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$ to a predicted color $\mathbf{c}$ and volume density $\sigma$. To render an image, rays are cast from the camera through each pixel. For each ray, samples are taken along its path, and the MLP predicts color and density at these sample points. These are then composited using volume rendering techniques to produce the final pixel color. While NeRF produces high-quality results, its rendering speed can be slow due to the repeated MLP evaluations along each ray.
3.1.3. 3D Gaussian Splatting (3DGS)
3D Gaussian Splatting (3DGS) [18] is a more recent and highly efficient NVS method that explicitly represents a 3D scene as a collection of anisotropic 3D Gaussian primitives. Each 3D Gaussian is defined by its position, covariance (representing size and orientation), opacity, and view-dependent color (often modeled using spherical harmonics). Instead of ray tracing, 3DGS uses a splatting-based rasterization pipeline. This means that each 3D Gaussian is projected onto the 2D image plane, forming a 2D Gaussian. These 2D Gaussians are then sorted by depth and alpha-blended to produce the final image. A key feature of 3DGS is its differentiable rendering, allowing all Gaussian parameters to be optimized end-to-end using a photometric loss. It achieves real-time rendering at high resolutions, making it very appealing for practical applications.
3.1.4. Sampling Theorem (Nyquist-Shannon Sampling Theorem)
The Nyquist-Shannon Sampling Theorem [33, 45] is a fundamental principle in signal processing. It states that to perfectly reconstruct a continuous signal from its discrete samples, two conditions must be met:
- The continuous signal must be band-limited, meaning it contains no frequency components above a certain maximum frequency $f_{\max}$.
- The sampling rate $f_s$ must be at least twice the highest frequency present in the continuous signal, i.e., $f_s \geq 2 f_{\max}$. This minimum sampling rate is known as the Nyquist rate.

If the sampling rate is below the Nyquist rate for a given frequency component, aliasing occurs: high-frequency components are incorrectly represented as lower-frequency components (see the numerical sketch below). To prevent this, low-pass filtering (also called anti-aliasing filtering) is typically applied to the signal before sampling to remove frequencies above half the sampling rate.
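A minimal numerical sketch of aliasing, assuming a pure 8 Hz sine sampled at 10 Hz, i.e., below its Nyquist rate of 16 Hz; the tone then becomes indistinguishable from a (sign-flipped) 2 Hz tone:

```python
import numpy as np

f_signal = 8.0           # signal frequency (Hz)
f_coarse = 10.0          # sampling rate below the Nyquist rate of 16 Hz
t = np.arange(0, 1, 1 / f_coarse)   # sample instants t = k / 10

undersampled = np.sin(2 * np.pi * f_signal * t)
alias = np.sin(2 * np.pi * (f_coarse - f_signal) * t)   # a 2 Hz tone

# At these sample instants the 8 Hz tone equals the mirrored 2 Hz tone:
# sin(1.6*pi*k) = sin(2*pi*k - 0.4*pi*k) = -sin(0.4*pi*k) for integer k.
assert np.allclose(undersampled, -alias)
```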
3.1.5. Aliasing and Anti-aliasing
Aliasing refers to the misrepresentation of a high-frequency signal as a lower-frequency signal when the signal is sampled at a rate lower than the Nyquist rate. In computer graphics, this often manifests as "jaggies" (stair-step patterns on edges), moiré patterns, or flickering artifacts when fine details are rendered.
Anti-aliasing techniques aim to mitigate aliasing artifacts. This typically involves applying a low-pass filter to the signal before it is sampled or rendered. A low-pass filter removes or attenuates high-frequency components that could cause aliasing, effectively "blurring" or smoothing out fine details to represent them correctly at the given sampling resolution.
3.1.6. Low-pass Filter
A low-pass filter is a filter that passes signals with a frequency lower than a certain cutoff frequency and attenuates signals with frequencies higher than the cutoff frequency. In image processing and computer graphics, applying a low-pass filter (like a Gaussian blur) effectively smooths the image by removing fine details and high-frequency noise. This is a common technique for anti-aliasing.
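A minimal sketch of low-pass filtering, assuming a 1D signal mixing a slow and a fast oscillation; convolving with a normalized Gaussian kernel attenuates the fast (high-frequency) component far more than the slow one:

```python
import numpy as np

x = np.arange(256)
signal = np.sin(2 * np.pi * x / 64) + np.sin(2 * np.pi * x / 4)

sigma = 2.0
k = np.arange(-8, 9)
kernel = np.exp(-0.5 * (k / sigma) ** 2)
kernel /= kernel.sum()                      # unit gain at frequency zero

smoothed = np.convolve(signal, kernel, mode="same")

spectrum_in = np.abs(np.fft.rfft(signal))
spectrum_out = np.abs(np.fft.rfft(smoothed))
# Slow component (period 64 -> bin 4) is nearly untouched; fast component
# (period 4 -> bin 64) is strongly attenuated.
print(spectrum_out[4] / spectrum_in[4], spectrum_out[64] / spectrum_in[64])
```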
3.2. Previous Works
The paper builds primarily on the foundations of Neural Radiance Fields (NeRF) and directly addresses limitations of 3D Gaussian Splatting (3DGS).
- NeRF [28]: As described above, NeRF revolutionized NVS by using MLPs for implicit scene representation. It achieved remarkable quality but suffered from slow rendering speeds.
- Subsequent NeRF improvements [4, 11, 19, 32, 46, 51]: Many works improved NeRF's efficiency and training, often by using explicit scene representations like feature grids or sparse voxel fields (e.g., Plenoxels [11], TensoRF [4], Instant-NGP [32]). These methods still typically rely on volume rendering or similar sampling strategies.
- Mip-NeRF [1] and Tri-MipRF [17]: These are significant prior works in anti-aliasing for NeRF-based methods.
  - Mip-NeRF [1] introduced integrated positional encoding (IPE) and cone tracing. Instead of sampling points along a ray, it samples volumes (cones) whose size corresponds to the pixel footprint. It integrates positional encoding over these conical frustums to attenuate high-frequency details, preventing aliasing. Mip-NeRF typically requires training with multi-scale input images to learn how to render effectively at different scales.
  - Tri-MipRF [17] adapts similar multi-scale ideas to feature grid-based representations. Like Mip-NeRF, it learns multi-scale signals, typically requiring multi-scale supervision during training.
  - Differentiation from Mip-Splatting: Both Mip-NeRF and Tri-MipRF are primarily MLP-based or feature grid-based methods that rely on the neural network's ability to learn and interpolate multi-scale signals, often requiring multi-scale training data. In contrast, Mip-Splatting is based on 3DGS's explicit 3D Gaussian representation and uses closed-form modifications (filters) to the Gaussians themselves. This allows Mip-Splatting to achieve excellent multi-scale generalization even when trained on single-scale images.
- Primitive-based Differentiable Rendering [13, 14, 38, 44, 59, 60]: This category includes methods that render geometric primitives (like points or spheres) onto the image plane.
- Pulsar [20]: An example of efficient sphere rasterization.
- 3DGS [18]: As discussed, it represents scenes as 3D Gaussians and uses a tile-based rasterization for real-time rendering. This paper builds directly upon 3DGS.
- Anti-aliasing in Rendering [7, 8, 15, 31, 47, 50, 59]: Traditional computer graphics has many anti-aliasing techniques.
  - EWA splatting [59]: This method applies an elliptical weighted average (EWA) filter, which is a Gaussian low-pass filter, to projected 2D Gaussians in screen space to produce a band-limited output. It aims to limit the signal's frequency bandwidth to what the rendered image can represent.
  - Differentiation from Mip-Splatting: While EWA also uses a Gaussian low-pass filter in screen space, its filter size is often chosen empirically and is primarily designed to address the rendering problem (how to correctly draw primitives), not the reconstruction problem (how to optimize the 3D representation itself to be alias-free). Mip-Splatting, however, applies a band-limiting filter in 3D space whose size is determined by the training images' sampling rates, directly constraining the 3D representation during optimization. Its 2D Mip filter is specifically designed to approximate a single pixel's box filter, mimicking the physical imaging process, which is distinct from EWA's general band-limiting. The paper claims EWA can lead to overly smooth results.
3.3. Technological Evolution
The evolution in NVS has generally moved from:
- Implicit Representations (NeRF): Highly realistic but computationally expensive due to the repeated MLP queries along each ray.
- Hybrid/Explicit Grid-based Representations (Plenoxels, Instant-NGP, TensoRF): Improved efficiency by structuring the scene representation in grids or tensors, but still often rely on volume rendering concepts.
- Explicit Primitive-based Representations (3DGS): Achieved real-time rendering by moving to a differentiable rasterization pipeline of explicit geometric primitives (3D Gaussians). This greatly boosted efficiency.
This paper's work, Mip-Splatting, fits into the technological timeline as a crucial refinement of 3DGS. While 3DGS brought unprecedented efficiency, it inherited aliasing challenges common in rasterization and lacked principled anti-aliasing mechanisms for multi-scale viewing. Mip-Splatting addresses this gap, making 3DGS more robust and practically usable across varying camera parameters, enhancing its generalization capabilities without sacrificing its core efficiency.
3.4. Differentiation Analysis
Compared to the main methods in related work, Mip-Splatting's core differences and innovations are:
- From NeRF/Grid-based Anti-aliasing (e.g., Mip-NeRF, Tri-MipRF):
  - Representation: Mip-Splatting works with an explicit 3D Gaussian representation and a rasterization pipeline, whereas Mip-NeRF/Tri-MipRF work with implicit MLP/feature grid representations and volume rendering.
  - Anti-aliasing Mechanism: Mip-NeRF uses integrated positional encoding and cone tracing, relying on the MLP to learn multi-scale representations. Mip-Splatting employs closed-form analytical filters (3D smoothing and 2D Mip filters) that directly modify the Gaussian properties.
  - Training Requirement: Mip-NeRF and Tri-MipRF often require multi-scale images for supervision to learn how to render at different scales. Mip-Splatting achieves excellent out-of-distribution generalization (rendering at scales different from training) even when trained on single-scale images.
- From 3DGS [18]:
  - Addressing Core Artifacts: Mip-Splatting directly tackles the aliasing, dilation, and erosion artifacts present in original 3DGS when changing sampling rates.
  - 3D Frequency Constraint: Mip-Splatting introduces a novel 3D smoothing filter that regularizes the size of 3D Gaussians based on the Nyquist-Shannon Sampling Theorem and the actual sampling rate of the training views. This is absent in original 3DGS.
  - Improved 2D Filter: Mip-Splatting replaces 3DGS's problematic fixed-size 2D dilation filter with a 2D Mip filter that approximates a physical 2D box filter for anti-aliasing, leading to more accurate radiance accumulation.
- From EWA Splatting [59]:
  - Scope: EWA primarily addresses the rendering problem (how to filter projected primitives) without directly constraining the 3D reconstruction. Mip-Splatting tackles both the reconstruction problem (by modifying 3D Gaussians in 3D space during optimization) and the rendering problem (with the 2D Mip filter).
  - Filter Principle: EWA uses a Gaussian low-pass filter to generally limit bandwidth, often with empirically chosen sizes. Mip-Splatting's 3D filter is based on Nyquist limits from the training views, and its 2D Mip filter specifically approximates a single-pixel box filter to simulate the physical imaging process. The paper notes EWA can lead to overly smooth results.

In essence, Mip-Splatting provides a principled, analytical, and efficient solution to the multi-scale aliasing problem in 3DGS, enabling robust performance across varying camera settings with minimal training overhead compared to prior neural rendering anti-aliasing methods.
4. Methodology
The paper proposes two key modifications to the original 3D Gaussian Splatting (3DGS) model to achieve alias-free rendering across various sampling rates: a 3D smoothing filter and a 2D Mip filter. These modifications address the issues of uncontrolled 3D Gaussian sizes and problematic 2D dilation in the original 3DGS.
4.1. Principles
The core idea behind Mip-Splatting is to ensure that the 3D representation (the collection of 3D Gaussians) and its 2D projection for rendering are appropriately band-limited according to the sampling rates involved.
- 3D Frequency Constraint: Based on the Nyquist-Shannon Sampling Theorem, the highest frequency (or finest detail) that can be faithfully reconstructed from the input images is limited by their sampling rates. Therefore, the 3D Gaussians should not represent details finer than what can be resolved by the available input views. This is achieved by applying a 3D low-pass filter during optimization that constrains the size of 3D Gaussians relative to the maximal sampling frequency from the training views.
- Physically Motivated 2D Filtering: The 2D rendering process should simulate how real camera sensors work, where light is integrated over a finite pixel area. This means replacing the heuristic 2D dilation filter with one that effectively approximates a 2D box filter (representing a pixel) to correctly accumulate radiance and mitigate aliasing when projecting 3D Gaussians onto the image plane.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Review of 3D Gaussian Splatting (3DGS)
The paper first reviews the foundational aspects of 3DGS, as its method builds directly upon it. In 3DGS, a scene is represented as a set of 3D Gaussian primitives $\{\mathcal{G}_k\}$. Each Gaussian is parameterized by its opacity $\alpha_k$, center position $\mathbf{p}_k \in \mathbb{R}^3$, and a 3D covariance matrix $\mathbf{\Sigma}_k \in \mathbb{R}^{3 \times 3}$. The mathematical form of a 3D Gaussian at a point $\mathbf{x}$ is given by:
$ \mathcal{G}_k(\mathbf{x}) = \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{p}_k)^\top \mathbf{\Sigma}_k^{-1} (\mathbf{x} - \mathbf{p}_k) \right) $
where:
- $\mathbf{x}$ is a 3D point in world space.
- $\mathbf{p}_k$ is the 3D center of the $k$-th Gaussian.
- $\mathbf{\Sigma}_k$ is the covariance matrix determining the shape and orientation of the $k$-th Gaussian.

To ensure $\mathbf{\Sigma}_k$ is a valid covariance matrix (positive semi-definite), it is parameterized using a semi-definite decomposition:
$ \mathbf{\Sigma}_k = \mathbf{R}_k \mathbf{S}_k \mathbf{S}_k^\top \mathbf{R}_k^\top $
where:
- $\mathbf{S}_k = \mathrm{diag}(\mathbf{s}_k)$ is built from a scaling vector $\mathbf{s}_k$ (representing the lengths of the principal axes of the Gaussian ellipsoid).
- $\mathbf{R}_k$ is a rotation matrix, parameterized by a quaternion, defining the orientation of the Gaussian (a sketch of this parameterization follows).
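A minimal sketch of this covariance parameterization, assuming a unit quaternion $q = (w, x, y, z)$ and a per-axis scale vector; the function names are illustrative, not from the official 3DGS codebase:

```python
import numpy as np

def quat_to_rotmat(q):
    # Standard unit-quaternion to rotation-matrix conversion.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_scaling_rotation(s, q):
    R = quat_to_rotmat(q)
    M = R @ np.diag(s)           # Sigma = R S S^T R^T
    return M @ M.T               # positive semi-definite by construction

Sigma = covariance_from_scaling_rotation(
    np.array([0.3, 0.1, 0.05]), np.array([1.0, 0.0, 0.0, 0.0]))
assert np.all(np.linalg.eigvalsh(Sigma) >= 0)
```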
Rendering Process: To render an image from a given viewpoint (defined by camera rotation $\mathbf{R}$ and translation $\mathbf{t}$), the 3D Gaussians are first transformed into camera coordinates:
$ \mathbf{p}'_k = \mathbf{R} \mathbf{p}_k + \mathbf{t}, \qquad \mathbf{\Sigma}'_k = \mathbf{R} \mathbf{\Sigma}_k \mathbf{R}^\top $
where:
- $\mathbf{p}'_k$ is the Gaussian center in camera coordinates.
- $\mathbf{\Sigma}'_k$ is the covariance matrix in camera coordinates.

These camera-space Gaussians are then projected to 2D screen space (referred to as "ray space" in the paper) via a local affine approximation of the perspective projection:
$ \hat{\mathbf{\Sigma}}_k = \left( \mathbf{J}_k \mathbf{\Sigma}'_k \mathbf{J}_k^\top \right)_{1:2,1:2} $
Here, $\mathbf{J}_k$ is the Jacobian matrix providing an affine approximation of the perspective projection, evaluated at the Gaussian's camera-space center, and the subscript $1{:}2,1{:}2$ denotes keeping the top-left $2 \times 2$ block. $\hat{\mathbf{\Sigma}}_k$ is the covariance matrix of the projected 2D Gaussian in screen space, and the corresponding 2D Gaussian is denoted $\mathcal{G}^{2D}_k$.
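A minimal sketch of this projection, assuming a pinhole camera with focal length f and a Gaussian already in camera coordinates; using a 2x3 Jacobian is equivalent to forming the full 3x3 one and keeping the top-left 2x2 block:

```python
import numpy as np

def project_gaussian_2d(p_cam, cov_cam, f):
    x, y, z = p_cam
    # Jacobian of the projection (f*x/z, f*y/z), evaluated at the center.
    J = np.array([
        [f / z, 0.0,   -f * x / z**2],
        [0.0,   f / z, -f * y / z**2],
    ])
    mean_2d = np.array([f * x / z, f * y / z])
    cov_2d = J @ cov_cam @ J.T          # 2x2 screen-space covariance
    return mean_2d, cov_2d

mean_2d, cov_2d = project_gaussian_2d(
    np.array([0.2, -0.1, 4.0]), 0.05 * np.eye(3), f=800.0)
```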
Finally, 3DGS models view-dependent color using spherical harmonics and renders the image by alpha blending the 2D Gaussians sorted by depth:
$ \mathbf{c}(\mathbf{x}) = \sum_{k} \mathbf{c}_k \, \alpha_k \, \mathcal{G}^{2D}_k(\mathbf{x}) \prod_{j=1}^{k-1} \left( 1 - \alpha_j \, \mathcal{G}^{2D}_j(\mathbf{x}) \right) $
where:
- $\mathbf{c}(\mathbf{x})$ is the final color at pixel $\mathbf{x}$.
- $\mathbf{c}_k$ is the view-dependent color of the $k$-th Gaussian.
- $\alpha_k$ is the opacity of the $k$-th Gaussian.
- $\mathcal{G}^{2D}_k(\mathbf{x})$ is the value of the 2D Gaussian at pixel $\mathbf{x}$.
- The product term represents the accumulated transmittance of the preceding Gaussians (a sketch of this blending loop follows).
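A minimal sketch of front-to-back alpha blending for a single pixel, assuming the 2D Gaussians are already sorted by depth and evaluated at pixel x:

```python
import numpy as np

def blend_pixel(colors, opacities, gauss_vals):
    # colors: (K, 3); opacities: (K,); gauss_vals: (K,) values of G_k^2D(x).
    color = np.zeros(3)
    transmittance = 1.0                    # running product of (1 - a_j G_j)
    for c, a, g in zip(colors, opacities, gauss_vals):
        weight = a * g
        color += c * weight * transmittance
        transmittance *= (1.0 - weight)
    return color

px = blend_pixel(
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
    opacities=np.array([0.8, 0.9]),
    gauss_vals=np.array([0.5, 1.0]))
```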
Original 3DGS Dilation: To prevent projected 2D Gaussians from becoming infinitesimally small (degenerate) in screen space, 3DGS applies a dilation operation:
$ \mathcal{G}^{2D}_k(\mathbf{x})_{\mathrm{dilation}} = \exp\left( -\frac{1}{2} (\mathbf{x} - \hat{\mathbf{p}}_k)^\top \left( \hat{\mathbf{\Sigma}}_k + s \cdot \mathbf{I} \right)^{-1} (\mathbf{x} - \hat{\mathbf{p}}_k) \right) $
where:
- $\mathbf{I}$ is a $2 \times 2$ identity matrix.
- $s$ is a scalar dilation hyperparameter.

This operation effectively adds a fixed isotropic variance to the projected 2D Gaussian's covariance, making it larger. The paper notes this is analogous to morphological dilation but is problematic because the added amount is fixed, leading to artifacts when the camera's sampling rate changes.
4.2.2. Problematic Sensitivity to Sampling Rate in 3DGS
The paper highlights that the original 3DGS suffers from ambiguities in optimization, leading to underestimation of 3D Gaussian scale. This is partly due to the implicit shrinkage bias during optimization and the fixed 2D dilation.
- Zooming In (Higher Sampling Rate): If 3D Gaussians are too small and then dilated, when the camera zooms in, the dilated 2D Gaussians appear smaller relative to the magnified pixel size. This can lead to erosion artifacts and visible gaps between Gaussians, making objects appear thinner.
- Zooming Out (Lower Sampling Rate): When zooming out, the fixed dilation spreads radiance over a larger area in screen space relative to the smaller effective pixel size. This can cause dilation artifacts (oversized Gaussians) and increased brightness where it is not physically accurate, blurring fine details. The paper also mentions that simply removing dilation leads to optimization challenges for complex scenes (excessive small Gaussians, GPU memory issues) and still results in aliasing when zooming out due to the lack of anti-aliasing.
4.2.3. Mip-Splatting Modifications
4.2.3.1. 3D Smoothing Filter
To address the uncontrolled scale of 3D Gaussians and the resulting high-frequency artifacts during zoom-in, Mip-Splatting introduces a 3D smoothing filter. The principle is to constrain the maximal frequency of the 3D representation based on the effective sampling rate of the input views.
Multiview Frequency Bounds: The sampling rate for a 3D scene is determined by the input image resolution, camera focal length, and distance to the scene. A pixel with unit sampling interval in screen space, when back-projected to 3D world space at depth $d$, corresponds to a world-space sampling interval $\hat{T} = \frac{d}{f}$. The sampling frequency is the inverse of this interval:
$ \hat{\nu} = \frac{f}{d} $
where:
- $\hat{T} = \frac{d}{f}$ is the sampling interval in 3D world space.
- $\hat{\nu}$ is the sampling frequency in 3D world space.
- $d$ is the depth from the camera to a point in the scene.
- $f$ is the focal length of the camera in pixel units.

According to the Nyquist-Shannon Sampling Theorem, a reconstruction algorithm can only accurately represent signal components up to a frequency of $\frac{\hat{\nu}}{2}$ (equivalently, details no smaller than $2\hat{T}$). Thus, a 3D Gaussian primitive smaller than this limit (i.e., with a frequency higher than $\frac{\hat{\nu}}{2}$) might lead to aliasing during splatting. The paper approximates the depth $d$ using the center of the primitive $\mathbf{p}_k$.
Since a 3D Gaussian primitive can be visible in multiple training views, each with potentially different sampling rates, the paper determines the maximal sampling rate for each primitive as:
$ \hat{\nu}_k = \max \left( \left\{ \mathbb{1}_n(\mathbf{p}_k) \cdot \frac{f_n}{d_n} \right\}_{n=1}^{N} \right) $
where:
- $N$ is the total number of training images.
- $\mathbb{1}_n(\mathbf{p}_k)$ is an indicator function that is 1 if the Gaussian center $\mathbf{p}_k$ is visible within the view frustum of the $n$-th camera, and 0 otherwise.
- $f_n$ is the focal length of the $n$-th camera.
- $d_n$ is the depth of the Gaussian center from the $n$-th camera.

This calculation is recomputed periodically during training, as the 3D Gaussian centers tend to be stable. Taking the maximum ensures that the size constraint is based on the most permissive sampling rate (smallest sampling interval $\hat{T}$) among all visible cameras, as illustrated in Figure 3; a sketch of this computation follows.
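A minimal sketch of the per-primitive maximal sampling rate, assuming each camera provides a world-to-camera transform, a focal length in pixels, and its image size; the dictionary keys and helper names are illustrative:

```python
import numpy as np

def in_frustum(p_cam, cam):
    # Visible if in front of the camera and projecting inside the image.
    x, y, z = p_cam
    u = cam["f"] * x / z + cam["cx"]
    v = cam["f"] * y / z + cam["cy"]
    return 0 <= u < cam["W"] and 0 <= v < cam["H"]

def max_sampling_rate(p_world, cams):
    nu_max = 0.0
    for cam in cams:
        p_cam = cam["R"] @ p_world + cam["t"]
        depth = p_cam[2]
        if depth <= 0 or not in_frustum(p_cam, cam):  # indicator 1_n(p_k)
            continue
        nu_max = max(nu_max, cam["f"] / depth)        # nu = f_n / d_n
    return nu_max
```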
The 3D smoothing is achieved by applying a Gaussian low-pass filter to each 3D Gaussian primitive before it is projected onto screen space. This is an efficient operation because convolving two Gaussians (with covariance matrices $\mathbf{\Sigma}_1$ and $\mathbf{\Sigma}_2$) results in another Gaussian with combined covariance $\mathbf{\Sigma}_1 + \mathbf{\Sigma}_2$. Thus, the regularized 3D Gaussian becomes:
$ \mathcal{G}_k(\mathbf{x})_{\mathrm{reg}} = \sqrt{\frac{|\mathbf{\Sigma}_k|}{\left| \mathbf{\Sigma}_k + \frac{s}{\hat{\nu}_k^2} \cdot \mathbf{I} \right|}} \; \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{p}_k)^\top \left( \mathbf{\Sigma}_k + \frac{s}{\hat{\nu}_k^2} \cdot \mathbf{I} \right)^{-1} (\mathbf{x} - \mathbf{p}_k) \right) $
where:
- $\mathcal{G}_k(\mathbf{x})_{\mathrm{reg}}$ is the regularized 3D Gaussian.
- $|\cdot|$ denotes the determinant of a matrix. The square-root term is a normalization factor that preserves the peak value of the convolved Gaussian.
- $\mathbf{p}_k$ and $\mathbf{\Sigma}_k$ are the original center and covariance of the Gaussian.
- $\mathbf{I}$ is a $3 \times 3$ identity matrix.
- $s$ is a scalar hyperparameter that controls the strength (size) of the 3D low-pass filter.
- $\frac{s}{\hat{\nu}_k^2}$ is the variance added by the 3D low-pass filter, which is inversely proportional to the squared maximal sampling frequency $\hat{\nu}_k$. This makes the filter size adaptive to the sampling capabilities of the training views for that specific Gaussian.

By employing this 3D Gaussian smoothing, the paper ensures that the highest frequency component (i.e., the smallest-scale feature) of any Gaussian does not exceed half of its maximal sampling rate. Crucially, this filter becomes an intrinsic part of the 3D representation during optimization and remains constant post-training, so it incurs no rendering overhead.
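A minimal sketch of the 3D smoothing filter, assuming the normalization factor is folded into the opacity (a common implementation choice, not spelled out in the text above); the default value of s is illustrative:

```python
import numpy as np

def smooth_3d(opacity, Sigma, nu_max, s=0.2):   # s value is illustrative
    # Convolving two Gaussians adds their covariances, so the filtered
    # primitive has covariance Sigma + (s / nu_max^2) * I.
    Sigma_reg = Sigma + (s / nu_max**2) * np.eye(3)
    # Normalization sqrt(|Sigma| / |Sigma_reg|) preserves the peak value of
    # the convolved Gaussian; folding it into the opacity keeps the same
    # rendered contribution.
    scale = np.sqrt(np.linalg.det(Sigma) / np.linalg.det(Sigma_reg))
    return opacity * scale, Sigma_reg
```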
4.2.3.2. 2D Mip Filter
Even with the 3D smoothing filter, rendering at lower sampling rates (e.g., zooming out) can still lead to aliasing if the 2D projection is not properly handled. To address this, Mip-Splatting replaces the original 3DGS screen space dilation filter with a 2D Mip filter.
This 2D Mip filter is designed to replicate the physical imaging process, where photons hitting a pixel on a camera sensor are integrated over the pixel's area. Ideally, this would involve a 2D box filter in image space, which averages radiance over the pixel area. For efficiency, the paper approximates this ideal box filter with a 2D Gaussian filter:
$ \mathcal{G}^{2D}_k(\mathbf{x})_{\mathrm{mip}} = \sqrt{\frac{|\hat{\mathbf{\Sigma}}_k|}{\left| \hat{\mathbf{\Sigma}}_k + s \cdot \mathbf{I} \right|}} \; \exp\left( -\frac{1}{2} (\mathbf{x} - \hat{\mathbf{p}}_k)^\top \left( \hat{\mathbf{\Sigma}}_k + s \cdot \mathbf{I} \right)^{-1} (\mathbf{x} - \hat{\mathbf{p}}_k) \right) $
where:
- $\mathcal{G}^{2D}_k(\mathbf{x})_{\mathrm{mip}}$ is the 2D Gaussian after applying the Mip filter.
- $\hat{\mathbf{p}}_k$ and $\hat{\mathbf{\Sigma}}_k$ are the center and covariance matrix of the projected 2D Gaussian (after 3D smoothing and projection).
- $\mathbf{I}$ is a $2 \times 2$ identity matrix.
- $s$ is a scalar hyperparameter specifically chosen so that the added isotropic variance covers a single pixel in screen space.

This Mip filter's similarity to EWA splatting [59] is acknowledged, but the underlying principles differ. Mip-Splatting's 2D Mip filter specifically targets approximating the box filter of a single pixel, aiming for accurate radiance integration. EWA, on the other hand, is generally for limiting frequency bandwidth, and its filter size is often chosen empirically (sometimes occupying a 3x3 pixel region), which can lead to overly smooth results when zooming out. The proposed 2D Mip filter, by closely mimicking physical pixel integration, provides a more principled way to handle aliasing in screen space.
In combination, the 3D smoothing filter constrains the detail level of the 3D scene representation itself, preventing high-frequency artifacts upon magnification, while the 2D Mip filter ensures correct anti-aliased rendering at any screen-space resolution, particularly mitigating issues during minification (zoom-out).
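A minimal sketch contrasting the two screen-space operations on a projected 2x2 covariance; both grow the footprint by s, but only the Mip filter applies the Gaussian-convolution normalization:

```python
import numpy as np

def dilate(opacity, cov2d, s):
    # 3DGS dilation: enlarge the footprint but keep the amplitude unchanged,
    # which over-brightens Gaussians that are small relative to a pixel.
    return opacity, cov2d + s * np.eye(2)

def mip_filter(opacity, cov2d, s):
    # 2D Mip filter: same covariance growth, but normalized like a true
    # Gaussian convolution, approximating a single-pixel box filter.
    cov_new = cov2d + s * np.eye(2)
    scale = np.sqrt(np.linalg.det(cov2d) / np.linalg.det(cov_new))
    return opacity * scale, cov_new
```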
5. Experimental Setup
5.1. Datasets
The experiments in the paper are conducted on two widely used benchmark datasets in Novel View Synthesis:
5.1.1. Blender Dataset
- Source: Introduced by NeRF [28].
- Characteristics: This is a synthetic dataset featuring eight Lambertian (non-specular) objects rendered under a hemisphere of cameras. The scenes are typically object-centric with simple backgrounds.
- Scale: It contains images rendered from various viewpoints around the objects.
- Domain: Synthetic, controlled environment.
- Why chosen: It's a standard dataset for evaluating the quality of novel view synthesis methods in controlled, static environments, allowing for clear comparison of rendering fidelity.
5.1.2. Mip-NeRF 360 Dataset
- Source: Introduced by Mip-NeRF 360 [2].
- Characteristics: This is a more challenging dataset containing real-world, unbounded 360-degree scenes. It includes both indoor and outdoor environments, featuring complex lighting and varied textures. The scenes are often captured with a camera moving in a large trajectory, making reconstruction and generalization more difficult.
- Scale: Contains numerous images per scene, covering a full 360-degree view.
- Domain: Real-world, unbounded scenes.
- Why chosen: This dataset is used to test the method's robustness in more complex and realistic scenarios, especially for handling unbounded environments and varying levels of detail, which are directly relevant to multi-scale rendering.
5.2. Evaluation Metrics
The performance of Mip-Splatting is evaluated using three standard image quality metrics:
5.2.1. PSNR (Peak Signal-to-Noise Ratio)
- Conceptual Definition: PSNR is a quantitative metric used to measure the quality of reconstruction of lossy compression codecs or, in this case, rendered images compared to a ground truth image. It is most easily defined via the Mean Squared Error (MSE). A higher PSNR generally indicates a higher quality (less noisy) image. It is typically expressed in decibels (dB).
- Mathematical Formula: First, the Mean Squared Error (MSE) between two images and is calculated: $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ Then, PSNR is defined as: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
- Symbol Explanation:
  - $I(i,j)$ is the pixel value (e.g., intensity for grayscale or per-channel for color images) of the original image at coordinate $(i,j)$.
  - $K(i,j)$ is the pixel value of the reconstructed (rendered) image at coordinate $(i,j)$.
  - $m$ is the number of rows (height) of the image.
  - $n$ is the number of columns (width) of the image.
  - $\mathrm{MAX}_I$ is the maximum possible pixel value of the image. For an 8-bit image, $\mathrm{MAX}_I = 255$.
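A minimal sketch of this metric, assuming images stored as float arrays scaled to [0, 1], so that MAX_I = 1.0:

```python
import numpy as np

def psnr(img, ref):
    # Mean squared error between rendered image and ground truth.
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / mse)   # MAX_I^2 = 1 for [0, 1] images
```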
5.2.2. SSIM (Structural Similarity Index)
- Conceptual Definition: SSIM is a perceptual metric that quantifies the similarity between two images. Unlike PSNR, which measures absolute errors, SSIM attempts to model human visual perception, considering changes in structural information (e.g., patterns of pixels), luminance, and contrast. It ranges from -1 to 1, where 1 indicates perfect structural similarity.
- Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
- Symbol Explanation:
  - $x$ and $y$ are the two image patches being compared.
  - $\mu_x$ is the average of $x$.
  - $\mu_y$ is the average of $y$.
  - $\sigma_x^2$ is the variance of $x$.
  - $\sigma_y^2$ is the variance of $y$.
  - $\sigma_{xy}$ is the covariance of $x$ and $y$.
  - $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are two small constants to avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit grayscale), and $k_1 = 0.01$, $k_2 = 0.03$ are common default values.
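A minimal sketch computing SSIM globally over two patches in [0, 1]; standard implementations instead evaluate it in sliding Gaussian windows and average the resulting map, but this simplification keeps the formula visible:

```python
import numpy as np

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return num / den
```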
5.2.3. LPIPS (Learned Perceptual Image Patch Similarity)
- Conceptual Definition: LPIPS is a metric that measures the perceptual distance between two images, often referred to as "perceptual loss." It is based on the idea that deep neural networks trained for tasks like image classification learn features that are perceptually meaningful. LPIPS computes the distance between deep features extracted from two images using a pre-trained CNN (e.g., AlexNet, VGG). A lower LPIPS score indicates higher perceptual similarity.
- Mathematical Formula: $ \mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{h,w} - \phi_l(y)_{h,w} \right) \right\|_2^2 $
- Symbol Explanation:
  - $x$ and $y$ are the two input images.
  - $\phi_l$ denotes the feature stack from layer $l$ of a pre-trained deep network (e.g., AlexNet).
  - $w_l$ is a learned scaling vector for each channel, trained to match human judgments.
  - $\odot$ denotes element-wise multiplication.
  - $H_l$ and $W_l$ are the height and width of the feature map at layer $l$.
  - $\|\cdot\|_2^2$ is the squared L2 norm.
  - The outer sum aggregates distances across different layers of the feature extractor.
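A minimal usage sketch with the reference `lpips` package (pip install lpips); inputs are RGB tensors of shape (N, 3, H, W) scaled to [-1, 1]:

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')        # AlexNet backbone
img0 = torch.rand(1, 3, 64, 64) * 2 - 1  # placeholder images in [-1, 1]
img1 = torch.rand(1, 3, 64, 64) * 2 - 1
distance = loss_fn(img0, img1)           # lower means perceptually closer
```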
5.3. Baselines
The Mip-Splatting method is compared against a comprehensive set of state-of-the-art NVS methods, representing different paradigms:
- NeRF [28]: The foundational Neural Radiance Field model.
- NeRF w/o $\mathcal{L}_{\text{area}}$ [1, 28]: A NeRF variant trained without the area-based loss reweighting used in Mip-NeRF's multi-scale benchmark, serving as a basic baseline.
- MipNeRF [1]: An anti-aliased NeRF model that uses cone tracing and integrated positional encoding.
- Plenoxels [11]: A grid-based explicit radiance field method, faster than NeRF.
- TensoRF [4]: Another efficient explicit radiance field method using tensorial decompositions.
- Instant-NGP [32]: A highly efficient hash grid-based method for NeRF, known for fast training and rendering.
- Tri-MipRF [17]: An anti-aliased feature grid-based method, building on Mip-NeRF concepts for efficiency.
- 3DGS [18]: The original 3D Gaussian Splatting method, which Mip-Splatting aims to improve upon.
- 3DGS [18] + EWA [59]: A variant where the 2D dilation of 3DGS is replaced with an Elliptical Weighted Average (EWA) filter, a traditional anti-aliasing technique for splatting. This is a direct competitor for the 2D filtering aspect.
- Mip-NeRF 360 [2]: An extension of Mip-NeRF designed for unbounded 360-degree scenes.
- Zip-NeRF [3]: An anti-aliased grid-based neural radiance field method, a highly performant and recent NeRF variant.
5.4. Key Experimental Settings
- Training Schedule: Models are trained for 30K iterations across all scenes, following the original 3DGS [18] schedule.
- Loss Function: The same multi-view photometric loss function as in 3DGS [18] is used.
- Gaussian Density Control: The adaptive adding and deleting of 3D Gaussians (density control mechanism) from 3DGS [18] is retained.
- Sampling Rate Recomputation: The maximal sampling rate $\hat{\nu}_k$ for each 3D Gaussian primitive is recomputed periodically for efficiency, as Gaussian centers are relatively stable.
- Hyperparameters for Mip-Splatting Filters:
  - Variance of the 2D Mip filter: chosen to approximate a single pixel in screen space.
  - Variance of the 3D smoothing filter: a fixed scalar controlling the filter strength.
  - The total variance in the 2D projection is matched to the filter size implicitly used by 3DGS and 3DGS + EWA, for a fair comparison.
- Evaluation Scenarios:
  - Multi-scale Training and Multi-scale Testing (Blender): Training is done with multi-scale data (sampling 40% full resolution, 20% from the other resolutions), and evaluation is also done at multiple scales. This is a common setup for anti-aliasing methods.
  - Single-scale Training and Multi-scale Testing (Blender, Zoom-Out): Models are trained on full-resolution images and evaluated at lower resolutions (1/2, 1/4, and 1/8) to simulate zoom-out effects. This tests generalization to unseen lower scales.
  - Single-scale Training and Multi-scale Testing (Mip-NeRF 360, Zoom-In): Models are trained on downsampled data (factor of 8) and evaluated at successively higher resolutions (1x, 2x, 4x, and 8x) to simulate zoom-in/super-resolution effects. This tests generalization to unseen higher scales.
  - Single-scale Training and Same-scale Testing (Mip-NeRF 360): A standard in-distribution evaluation where models are trained and tested at the same scale (indoor scenes downsampled by 2, outdoor by 4). This confirms the method does not degrade performance in the standard setting.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results consistently demonstrate Mip-Splatting's effectiveness, especially in out-of-distribution scenarios (i.e., when rendering at sampling rates different from those used during training).
6.1.1. Multi-scale Training and Multi-scale Testing (Blender Dataset)
In this setting, where models are trained and evaluated with multi-scale data, Mip-Splatting achieves performance comparable to or superior to state-of-the-art methods like Mip-NeRF [1] and Tri-MipRF [17]. Crucially, it significantly outperforms the original 3DGS [18] and its EWA-filtered variant (3DGS [18] + EWA [59]) across all metrics (PSNR, SSIM, LPIPS), highlighting the benefit of its 2D Mip filter even when multi-scale information is available during training.
6.1.2. Single-scale Training and Multi-scale Testing (Blender Dataset, Zoom-Out Simulation)
This is a critical scenario testing the method's ability to generalize to lower sampling rates when only full-resolution images were available during training. Mip-Splatting significantly outperforms all existing state-of-the-art methods.
- 3DGS [18]: Exhibits severe dilation artifacts at lower resolutions, confirming the paper's analysis of its fixed 2D dilation.
- 3DGS [18] + EWA [59]: While performing better than plain 3DGS, it results in oversmoothed images, particularly at lower resolutions. This supports the paper's claim that EWA's general band-limiting, not tailored to single-pixel approximation, can be overly aggressive.
- Mip-NeRF [1] and Tri-MipRF [17]: Although strong in multi-scale trained scenarios, they do not generalize as effectively when trained only on single-scale images and then asked to render at much lower scales, indicating a dependency on learned multi-scale signals.
- Mip-Splatting: Maintains faithful image quality across various downsampled resolutions, effectively mitigating both dilation and aliasing artifacts without prior multi-scale training, demonstrating strong out-of-distribution generalization. Qualitatively, Figure 4 shows clear visual superiority.
6.1.3. Single-scale Training and Multi-scale Testing (Mip-NeRF 360 Dataset, Zoom-In Simulation)
This scenario tests generalization to higher sampling rates (super-resolution) when trained on downsampled data. Mip-Splatting performs comparably at the training scale (1x) but significantly exceeds all state-of-the-art methods at higher resolutions (2x, 4x, and 8x).
- Mip-NeRF 360 [2] and Zip-NeRF [3]: Show subpar performance at increased resolutions. The paper attributes this to their MLPs' inability to extrapolate to unseen higher frequencies or details not present in the lower-resolution training data, highlighting a fundamental limitation of learned implicit representations.
- 3DGS [18]: Introduces erosion artifacts at higher resolutions due to its dilation operation, where the Gaussians become too small and effectively disappear, creating gaps.
- 3DGS [18] + EWA [59]: Performs better than plain 3DGS but still yields pronounced high-frequency artifacts.
- Mip-Splatting: Avoids these artifacts, producing aesthetically pleasing images that more closely resemble the ground truth. This confirms the effectiveness of the 3D smoothing filter in constraining the maximal frequency of the 3D representation, preventing the generation of unresolvable fine details that would otherwise cause artifacts upon magnification. Qualitatively, Figure 5 provides visual evidence.
6.1.4. Single-scale Training and Same-scale Testing (Mip-NeRF 360 Dataset)
In the standard in-distribution setting, Mip-Splatting performs on par with 3DGS [18] and 3DGS [18] + EWA [59]. This is an important finding, as it confirms that the proposed modifications do not degrade performance in the scenario where original 3DGS is typically evaluated, demonstrating the method's versatility.
6.2. Data Presentation (Tables)
The following are the results from [Table 1] of the original paper:
| PSNR ↑ | SSIM↑ | LPIPS ↓ | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | |
| NeRF w/o Larea [1, 28] | 31.20 | 30.65 | 26.25 | 22.53 | 27.66 | 0.950 | 0.956 | 0.930 | 0.871 | 0.927 | 0.055 | 0.034 | 0.043 | 0.075 | 0.052 |
| NeRF [28] | 29.90 | 32.13 | 33.40 | 29.47 | 31.23 | 0.938 | 0.959 | 0.973 | 0.962 | 0.958 | 0.074 | 0.040 | 0.024 | 0.039 | 0.044 |
| MipNeRF [1] | 32.63 | 34.34 | 35.47 | 35.60 | 34.51 | 0.958 | 0.970 | 0.979 | 0.983 | 0.973 | 0.047 | 0.026 | 0.017 | 0.012 | 0.026 |
| Plenoxels [11] | 31.60 | 32.85 | 30.26 | 26.63 | 30.34 | 0.956 | 0.967 | 0.961 | 0.936 | 0.955 | 0.052 | 0.032 | 0.045 | 0.077 | 0.051 |
| TensoRF [4] | 32.11 | 33.03 | 30.45 | 26.80 | 30.60 | 0.956 | 0.966 | 0.962 | 0.939 | 0.956 | 0.056 | 0.038 | 0.047 | 0.076 | 0.054 |
| Instant-NGP [32] | 30.00 | 32.15 | 33.31 | 29.35 | 31.20 | 0.939 | 0.961 | 0.974 | 0.963 | 0.959 | 0.079 | 0.043 | 0.026 | 0.040 | 0.047 |
| Tri-MipRF [17]* | 32.65 | 34.24 | 35.02 | 35.53 | 34.36 | 0.958 | 0.971 | 0.980 | 0.987 | 0.974 | 0.047 | 0.027 | 0.018 | 0.012 | 0.026 |
| 3DGS [18] | 28.79 | 30.66 | 31.64 | 27.98 | 29.77 | 0.943 | 0.962 | 0.972 | 0.960 | 0.960 | 0.065 | 0.038 | 0.025 | 0.031 | 0.040 |
| 3DGS [18] + EWA [59] | 31.54 | 33.26 | 33.78 | 33.48 | 33.01 | 0.961 | 0.973 | 0.979 | 0.983 | 0.974 | 0.043 | 0.026 | 0.021 | 0.019 | 0.027 |
| Mip-Splatting (ours) | 32.81 | 34.49 | 35.45 | 35.50 | 34.56 | 0.967 | 0.977 | 0.983 | 0.988 | 0.979 | 0.035 | 0.019 | 0.013 | 0.010 | 0.019 |
The following are the results from [Table 2] of the original paper:
| PSNR ↑ | SSIM ↑ | LPIPS ↓ | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | |
| NeRF [28] | 31.48 | 32.43 | 30.29 | 26.70 | 30.23 | 0.949 | 0.962 | 0.964 | 0.951 | 0.956 | 0.061 | 0.041 | 0.044 | 0.067 | 0.053 |
| MipNeRF [1] | 33.08 | 33.31 | 30.91 | 27.97 | 31.31 | 0.961 | 0.970 | 0.969 | 0.961 | 0.965 | 0.045 | 0.031 | 0.036 | 0.052 | 0.041 |
| TensoRF [4] | 32.53 | 32.91 | 30.01 | 26.45 | 30.48 | 0.960 | 0.969 | 0.965 | 0.948 | 0.961 | 0.044 | 0.031 | 0.044 | 0.073 | 0.048 |
| Instant-NGP [32] | 33.09 | 33.00 | 29.84 | 26.33 | 30.57 | 0.962 | 0.969 | 0.964 | 0.947 | 0.961 | 0.044 | 0.033 | 0.046 | 0.075 | 0.049 |
| Tri-MipRF [17] | 32.89 | 32.84 | 28.29 | 23.87 | 29.47 | 0.958 | 0.967 | 0.951 | 0.913 | 0.947 | 0.046 | 0.033 | 0.046 | 0.075 | 0.050 |
| 3DGS [18] | 33.33 | 26.95 | 21.38 | 17.69 | 24.84 | 0.969 | 0.949 | 0.875 | 0.766 | 0.890 | 0.030 | 0.032 | 0.066 | 0.121 | 0.063 |
| 3DGS [18] + EWA [59] | 33.51 | 31.66 | 27.82 | 24.63 | 29.40 | 0.969 | 0.971 | 0.959 | 0.940 | 0.960 | 0.032 | 0.024 | 0.033 | 0.047 | 0.034 |
| Mip-Splatting (ours) | 33.36 | 34.00 | 31.85 | 28.67 | 31.97 | 0.969 | 0.977 | 0.978 | 0.973 | 0.974 | 0.031 | 0.019 | 0.019 | 0.026 | 0.024 |
The following are the results from [Table 3] of the original paper:
| PSNR ↑ | SSIM↑ | LPIPS ↓ | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1× Res. | 2× Res. | 4× Res. | 8× Res. | Avg. | 1× Res. | 2× Res. | 4× Res. | 8× Res. | Avg. | 1× Res. | 2× Res. | 4× Res. | 8× Res. | Avg. | |
| Instant-NGP [32] | 26.79 | 24.76 | 24.27 | 24.27 | 25.02 | 0.746 | 0.639 | 0.626 | 0.698 | 0.677 | 0.239 | 0.367 | 0.445 | 0.475 | 0.382 |
| mip-NeRF 360 [2] | 29.26 | 25.18 | 24.16 | 24.10 | 25.67 | 0.860 | 0.727 | 0.670 | 0.706 | 0.741 | 0.122 | 0.260 | 0.370 | 0.428 | 0.295 |
| zip-NeRF [3] | 29.66 | 23.27 | 20.87 | 20.27 | 23.52 | 0.875 | 0.696 | 0.565 | 0.559 | 0.674 | 0.097 | 0.257 | 0.421 | 0.494 | 0.318 |
| 3DGS [18] | 29.19 | 23.50 | 20.71 | 19.59 | 23.25 | 0.880 | 0.740 | 0.619 | 0.619 | 0.715 | 0.107 | 0.243 | 0.394 | 0.476 | 0.305 |
| 3DGS [18] + EWA [59] | 29.30 | 25.90 | 23.70 | 22.81 | 25.43 | 0.880 | 0.775 | 0.667 | 0.643 | 0.741 | 0.114 | 0.236 | 0.369 | 0.449 | 0.292 |
| Mip-Splatting (ours) | 29.39 | 27.39 | 26.47 | 26.22 | 27.37 | 0.884 | 0.808 | 0.754 | 0.765 | 0.803 | 0.108 | 0.205 | 0.305 | 0.392 | 0.252 |
The following are the results from [Table 4] of the original paper:
| | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| NeRF [9, 28] | 23.85 | 0.605 | 0.451 |
| mip-NeRF [1] | 24.04 | 0.616 | 0.441 |
| NeRF++ [56] | 25.11 | 0.676 | 0.375 |
| Plenoxels [11] | 23.08 | 0.626 | 0.463 |
| Instant NGP [32, 52] | 25.68 | 0.705 | 0.302 |
| mip-NeRF 360 [2, 30] | 27.57 | 0.793 | 0.234 |
| Zip-NeRF [3] | 28.54 | 0.828 | 0.189 |
| 3DGS [18] | 27.21 | 0.815 | 0.214 |
| 3DGS [18]* | 27.70 | 0.826 | 0.202 |
| 3DGS [18] + EWA [59] | 27.77 | 0.826 | 0.206 |
| Mip-Splatting (ours) | 27.79 | 0.827 | 0.203 |
6.3. Ablation Studies / Parameter Analysis
The paper conducts ablation studies to isolate and evaluate the contribution of each proposed component: the 3D smoothing filter and the 2D Mip filter.
6.3.1. Effectiveness of the 3D Smoothing Filter
This ablation is conducted in the single-scale training (downsampled by 8) and multi-scale testing (zoom-in simulation) setting on the Mip-NeRF 360 dataset.
The following are the results from [Table 5] of the original paper:
| PSNR ↑ | SSIM ↑ | LPIPS ↓ | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1× Res. | 2× Res. | 4× Res. | 8× Res. | Avg. | 1× Res. | 2× Res. | 4× Res. | 8× Res. | Avg. | 1× Res. | 2× Res. | 4× Res. | 8× Res. | Avg. | |
| 3DGS [18] | 29.19 | 23.50 | 20.71 | 19.59 | 23.25 | 0.880 | 0.740 | 0.619 | 0.619 | 0.715 | 0.107 | 0.243 | 0.394 | 0.476 | 0.305 |
| 3DGS [18] + EWA [59] | 29.30 | 25.90 | 23.70 | 22.81 | 25.43 | 0.880 | 0.775 | 0.667 | 0.643 | 0.741 | 0.114 | 0.236 | 0.369 | 0.449 | 0.292 |
| Mip-Splatting (ours) | 29.39 | 27.39 | 26.47 | 26.22 | 27.37 | 0.884 | 0.808 | 0.754 | 0.765 | 0.803 | 0.108 | 0.205 | 0.305 | 0.392 | 0.252 |
| Mip-Splatting (ours) w/o 3D smoothing filter | 29.41 | 27.09 | 25.83 | 25.38 | 26.93 | 0.881 | 0.795 | 0.722 | 0.713 | 0.778 | 0.107 | 0.214 | 0.342 | 0.424 | 0.272 |
| Mip-Splatting (ours) w/o 2D Mip filter | 29.29 | 27.22 | 26.31 | 26.08 | 27.23 | 0.882 | 0.798 | 0.742 | 0.759 | 0.795 | 0.107 | 0.214 | 0.319 | 0.407 | 0.262 |
As seen in Table 5, removing the 3D smoothing filter (Mip-Splatting (ours) w/o 3D smoothing filter) leads to a notable decline in performance, particularly at higher resolutions (e.g., PSNR drops from 27.39 to 27.09 at 2x Res., and from 26.47 to 25.83 at 4x Res.). This confirms that the 3D smoothing filter is crucial for mitigating high-frequency artifacts when rendering at resolutions higher than the training data, effectively enabling zoom-in capabilities. Qualitatively, Figure 6 further illustrates that omitting the 3D smoothing filter results in visible high-frequency artifacts. The paper also notes that excluding the 2D Mip filter causes only a slight decline in this zoom-in scenario, which is expected since its primary role is for zoom-out artifacts. Furthermore, the absence of both filters leads to memory errors due to an excessive generation of small Gaussians, emphasizing the necessity of some form of regularization.
6.3.2. Effectiveness of the 2D Mip Filter
This ablation evaluates the 2D Mip filter in the single-scale training (full resolution) and multi-scale testing (zoom-out simulation) setting on the Blender dataset.
The following are the results from [Table 6] of the original paper:
| PSNR ↑ | SSIM ↑ | LPIPS ↓ | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | Full Res. | 1/2 Res. | 1/4 Res. | 1/8 Res. | Avg. | |
| 3DGS [18] | 33.33 | 26.95 | 21.38 | 17.69 | 24.84 | 0.969 | 0.949 | 0.875 | 0.766 | 0.890 | 0.030 | 0.032 | 0.066 | 0.121 | 0.063 |
| 3DGS [18] + EWA [59] | 33.51 | 31.66 | 27.82 | 24.63 | 29.40 | 0.969 | 0.971 | 0.959 | 0.940 | 0.960 | 0.032 | 0.024 | 0.033 | 0.047 | 0.034 |
| 3DGS [18] - Dilation | 33.38 | 33.06 | 29.68 | 26.19 | 30.58 | 0.969 | 0.973 | 0.964 | 0.945 | 0.963 | 0.030 | 0.024 | 0.041 | 0.075 | 0.042 |
| Mip-Splatting (ours) | 33.36 | 34.00 | 31.85 | 28.67 | 31.97 | 0.969 | 0.977 | 0.978 | 0.973 | 0.974 | 0.031 | 0.019 | 0.019 | 0.026 | 0.024 |
| Mip-Splatting (ours) w/ 3D smoothing filter | 33.67 | 34.16 | 31.56 | 28.20 | 31.90 | 0.970 | 0.977 | 0.978 | 0.971 | 0.974 | 0.030 | 0.018 | 0.019 | 0.027 | 0.024 |
| Mip-Splatting (ours) w/o 2D Mip filter | 33.51 | 33.38 | 29.87 | 26.28 | 30.76 | 0.970 | 0.975 | 0.966 | 0.946 | 0.964 | 0.031 | 0.022 | 0.039 | 0.073 | 0.041 |
Table 6 shows that Mip-Splatting significantly outperforms all baseline methods, including 3DGS, 3DGS + EWA, and 3DGS with dilation removed (3DGS - Dilation). Removing the original dilation (3DGS - Dilation) improves performance over 3DGS, but it still suffers from aliasing artifacts due to the lack of any anti-aliasing. The Mip-Splatting (ours) w/o 2D Mip filter variant, which still includes the 3D smoothing filter, shows a notable decline in performance compared to the full Mip-Splatting, especially at lower resolutions (e.g., LPIPS worsens from 0.026 to 0.073 at 1/8 Res.). This validates the critical role of the 2D Mip filter in anti-aliasing and mitigating zoom-out artifacts. The Mip-Splatting (ours) w/ 3D smoothing filter row performs on par with the full model, which is expected: the 3D smoothing filter's main impact is on zoom-in, so it contributes little in this zoom-out setting.
6.3.3. Single-scale Training and Multi-scale Testing (Mip-NeRF 360, Combined Zoom)
This experiment evaluates both zoom-in and zoom-out effects on the Mip-NeRF 360 dataset, training on images downsampled by a factor of 4 and evaluating at multiple resolutions (1/4, 1/2, 1x, 2x, and 4x).
The following are the results from [Table 7] of the original paper:
| PSNR ↑ | SSIM↑ | LPIPS ↓ | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1/4 Res. | 1/2 Res. | 1× Res. | 2× Res. | 4× Res. | Avg. | 1/4 Res. | 1/2 Res. | 1× Res. | 2× Res. | 4× Res. | Avg. | 1/4 Res. | 1/2 Res. | 1× Res. | 2× Res. | 4× Res. | Avg. | |
| 3DGS [18] | 20.85 | 24.66 | 28.01 | 25.08 | 23.37 | 24.39 | 0.681 | 0.812 | 0.834 | 0.766 | 0.735 | 0.765 | 0.203 | 0.158 | 0.275 | 0.383 | 0.331 | 0.270 |
| 3DGS [18] + EWA [59] | 27.40 | 28.39 | 28.09 | 26.43 | 25.30 | 27.12 | 0.888 | 0.871 | 0.833 | 0.774 | 0.738 | 0.821 | 0.103 | 0.126 | 0.166 | 0.276 | 0.385 | 0.212 |
| Mip-Splatting (ours) | 28.98 | 29.02 | 28.09 | 27.25 | 26.95 | 28.06 | 0.908 | 0.880 | 0.835 | 0.798 | 0.800 | 0.844 | 0.086 | 0.114 | 0.168 | 0.248 | 0.331 | 0.189 |
| Mip-Splatting (ours) w/o 3D smoothing filter | 28.69 | 28.94 | 28.05 | 27.06 | 26.61 | 27.87 | 0.905 | 0.879 | 0.833 | 0.790 | 0.780 | 0.837 | 0.088 | 0.115 | 0.168 | 0.261 | 0.359 | 0.198 |
| Mip-Splatting (ours) w/o 2D Mip filter | 26.09 | 28.04 | 28.05 | 27.27 | 27.00 | 27.29 | 0.815 | 0.856 | 0.834 | 0.798 | 0.802 | 0.821 | 0.167 | 0.132 | 0.167 | 0.249 | 0.335 | 0.210 |
The results in Table 7 further confirm the individual contributions of both filters. Mip-Splatting significantly outperforms baselines across all scales.
- Mip-Splatting (ours) w/o 3D smoothing filter: Shows degradation, particularly at higher resolutions (zoom-in), leading to high-frequency artifacts (Figure 7).
- Mip-Splatting (ours) w/o 2D Mip filter: Shows significant degradation at lower resolutions (zoom-out), leading to aliasing artifacts (Figure 7).

This combined experiment clearly validates that both the 3D smoothing filter and the 2D Mip filter are essential for achieving robust, alias-free rendering across a wide range of sampling rates, effectively handling both zoom-in and zoom-out scenarios.
The following are the results from [Table 8] of the original paper:
| PSNR | SSIM | LPIPS | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chair | drums | ficus | hotdog | lego | materials | mic | ship | Average | chair | drums | ficus | hotdog | lego | materials | mic | ship | Average | chair | drums | ficus | hotdog | lego | materials | mic | ship | Average | ||
| NeRF w/o Larea [1, 28] | 29.92 | 23.27 | 27.15 | 32.00 | 35.64 | 27.75 | 26.30 | 28.40 | 26.46 | 27.66 | 0.944 | 0.891 | 0.942 | 0.959 | 0.926 | 0.934 | 0.958 | 0.861 | 0.927 | 0.035 | 0.069 | 0.032 | 0.028 | 0.041 | 0.045 | 0.031 | 0.095 | 0.052 |
| NeRF [28] | 33.39 | 25.87 | 30.37 | 31.65 | 30.18 | 32.60 | 30.09 | 31.23 | 0.971 | 0.932 | 0.971 | 0.979 | 0.965 | 0.967 | 0.980 | 0.900 | 0.958 | 0.028 | 0.059 | 0.032 | 0.028 | 0.041 | 0.045 | 0.031 | 0.085 | 0.044 |
| MipNeRF [1] | 37.14 | 27.02 | 33.19 | 39.31 | 35.74 | 32.56 | 38.04 | 33.08 | 34.51 | 0.988 | 0.945 | 0.984 | 0.988 | 0.984 | 0.977 | 0.993 | 0.922 | 0.973 | 0.011 | 0.044 | 0.014 | 0.012 | 0.013 | 0.019 | 0.007 | 0.062 | 0.026 |
| Plenoxels [11] | 32.79 | 25.25 | 30.28 | 34.65 | 31.26 | 28.33 | 31.53 | 28.59 | 30.34 | 0.968 | 0.929 | 0.972 | 0.976 | 0.964 | 0.959 | 0.979 | 0.892 | 0.955 | 0.040 | 0.070 | 0.032 | 0.037 | 0.038 | 0.055 | 0.036 | 0.104 | 0.051 |
| TensoRF [4] | 32.47 | 25.37 | 31.16 | 34.96 | 31.73 | 28.53 | 31.48 | 29.08 | 30.60 | 0.967 | 0.930 | 0.974 | 0.977 | 0.967 | 0.957 | 0.978 | 0.895 | 0.956 | 0.042 | 0.075 | 0.032 | 0.035 | 0.036 | 0.063 | 0.040 | 0.112 | 0.054 |
| Instant-ngp [32] | 32.95 | 26.43 | 30.41 | 35.87 | 31.83 | 29.31 | 32.58 | 30.23 | 31.20 | 0.971 | 0.940 | 0.973 | 0.979 | 0.966 | 0.959 | 0.981 | 0.904 | 0.959 | 0.035 | 0.066 | 0.029 | 0.028 | 0.040 | 0.051 | 0.032 | 0.095 | 0.047 |
| Tri-MipRF [17]* | 37.67 | 27.35 | 33.57 | 38.78 | 35.72 | 31.42 | 37.63 | 32.74 | 34.36 | 0.990 | 0.951 | 0.985 | 0.988 | 0.986 | 0.969 | 0.992 | 0.929 | 0.974 | 0.011 | 0.046 | 0.016 | 0.014 | 0.013 | 0.033 | 0.008 | 0.069 | 0.026 |
| 3DGS [18] | 32.73 | 25.30 | 29.00 | 35.03 | 29.44 | 27.13 | 31.17 | 28.33 | 29.77 | 0.976 | 0.941 | 0.968 | 0.982 | 0.964 | 0.956 | 0.979 | 0.910 | 0.960 | 0.025 | 0.056 | 0.030 | 0.022 | 0.038 | 0.040 | 0.023 | 0.086 | 0.040 |
| 3DGS [18] + EWA [59] | 35.77 | 27.14 | 33.65 | 37.74 | 32.75 | 30.21 | 35.21 | 31.63 | 33.01 | 0.986 | 0.958 | 0.988 | 0.988 | 0.979 | 0.972 | 0.990 | 0.929 | 0.974 | 0.017 | 0.039 | 0.013 | 0.016 | 0.024 | 0.026 | 0.011 | 0.070 | 0.027 |
| Mip-Splatting (ours) | 37.48 | 27.74 | 34.71 | 39.15 | 35.07 | 31.88 | 37.68 | 32.80 | 34.56 | 0.991 | 0.963 | 0.990 | 0.990 | 0.987 | 0.978 | 0.994 | 0.936 | 0.979 | 0.010 | 0.031 | 0.009 | 0.011 | 0.012 | 0.018 | 0.005 | 0.059 | 0.019 |
The following are the results from [Table 9] of the original paper:
| Metric | Method | chair | drums | ficus | hotdog | lego | materials | mic | ship | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| PSNR | NeRF [28] | 31.99 | 25.31 | 30.74 | 34.45 | 30.69 | 28.86 | 31.41 | 28.36 | 30.23 |
| | MipNeRF [1] | 32.89 | 25.58 | 31.80 | 35.40 | 32.24 | 29.46 | 33.26 | 29.88 | 31.31 |
| | TensoRF [4] | 32.17 | 25.51 | 31.19 | 34.69 | 31.46 | 28.60 | 31.50 | 28.71 | 30.48 |
| | Instant-ngp [32] | 32.18 | 25.05 | 31.32 | 34.85 | 31.53 | 28.59 | 32.15 | 28.84 | 30.57 |
| | Tri-MipRF [17] | 32.48 | 24.01 | 28.41 | 34.45 | 30.41 | 27.82 | 31.19 | 27.02 | 29.47 |
| | 3DGS [18] | 26.81 | 21.17 | 26.02 | 28.80 | 25.36 | 23.10 | 24.39 | 23.05 | 24.84 |
| | 3DGS [18] + EWA [59] | 32.85 | 24.91 | 31.94 | 33.33 | 29.76 | 27.36 | 27.68 | 27.41 | 29.40 |
| | Mip-Splatting (ours) | 35.69 | 26.50 | 32.99 | 36.18 | 32.76 | 30.01 | 31.66 | 29.98 | 31.97 |
| SSIM | NeRF [28] | 0.968 | 0.936 | 0.976 | 0.977 | 0.963 | 0.964 | 0.980 | 0.887 | 0.956 |
| | MipNeRF [1] | 0.974 | 0.939 | 0.981 | 0.982 | 0.973 | 0.969 | 0.987 | 0.915 | 0.965 |
| | TensoRF [4] | 0.970 | 0.938 | 0.978 | 0.979 | 0.970 | 0.963 | 0.981 | 0.906 | 0.961 |
| | Instant-ngp [32] | 0.970 | 0.935 | 0.977 | 0.980 | 0.969 | 0.962 | 0.982 | 0.909 | 0.961 |
| | Tri-MipRF [17] | 0.971 | 0.908 | 0.957 | 0.975 | 0.957 | 0.953 | 0.975 | 0.883 | 0.947 |
| | 3DGS [18] | 0.915 | 0.851 | 0.921 | 0.930 | 0.882 | 0.882 | 0.909 | 0.827 | 0.890 |
| | 3DGS [18] + EWA [59] | 0.978 | 0.942 | 0.983 | 0.977 | 0.964 | 0.958 | 0.963 | 0.912 | 0.960 |
| | Mip-Splatting (ours) | 0.988 | 0.958 | 0.988 | 0.987 | 0.982 | 0.974 | 0.986 | 0.930 | 0.974 |
| LPIPS | NeRF [28] | 0.040 | 0.067 | 0.027 | 0.034 | 0.043 | 0.049 | 0.035 | 0.132 | 0.053 |
| | MipNeRF [1] | 0.033 | 0.062 | 0.022 | 0.025 | 0.030 | 0.041 | 0.023 | 0.092 | 0.041 |
| | TensoRF [4] | 0.036 | 0.066 | 0.027 | 0.030 | 0.035 | 0.052 | 0.034 | 0.102 | 0.048 |
| | Instant-ngp [32] | 0.036 | 0.074 | 0.035 | 0.030 | 0.035 | 0.054 | 0.034 | 0.096 | 0.049 |
| | Tri-MipRF [17] | 0.026 | 0.086 | 0.041 | 0.023 | 0.036 | 0.048 | 0.023 | 0.117 | 0.050 |
| | 3DGS [18] | 0.047 | 0.087 | 0.055 | 0.034 | 0.064 | 0.055 | 0.046 | 0.113 | 0.063 |
| | 3DGS [18] + EWA [59] | 0.023 | 0.051 | 0.017 | 0.018 | 0.033 | 0.027 | 0.024 | 0.077 | 0.034 |
| | Mip-Splatting (ours) | 0.014 | 0.035 | 0.012 | 0.014 | 0.016 | 0.019 | 0.015 | 0.066 | 0.024 |
The following are the results from [Table 10] of the original paper:
| Metric | Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
|---|---|---|---|---|---|---|---|---|---|---|
| PSNR | NeRF [9, 28] | 21.76 | 19.40 | 23.11 | 21.73 | 21.28 | 28.56 | 25.67 | 26.31 | 26.81 |
| | mip-NeRF [1] | 21.69 | 19.31 | 23.16 | 23.10 | 21.21 | 28.73 | 25.59 | 26.47 | 27.13 |
| | NeRF++ [56] | 22.64 | 20.31 | 24.32 | 24.34 | 22.20 | 28.87 | 26.38 | 27.80 | 29.15 |
| | Plenoxels [11] | 21.91 | 20.10 | 23.49 | 20.66 | 22.25 | 27.59 | 23.62 | 23.42 | 24.67 |
| | Instant NGP [32, 52] | 22.79 | 19.19 | 25.26 | 24.80 | 22.46 | 30.31 | 26.21 | 29.00 | 31.08 |
| | mip-NeRF 360 [2, 30] | 24.40 | 21.64 | 26.94 | 26.36 | 22.81 | 31.40 | 29.44 | 32.02 | 33.11 |
| | Zip-NeRF [3] | 25.80 | 22.40 | 28.20 | 27.55 | 23.89 | 32.65 | 29.38 | 32.50 | 34.46 |
| | 3DGS [18] | 25.25 | 21.52 | 27.41 | 26.55 | 22.49 | 30.63 | 28.70 | 30.32 | 31.98 |
| | 3DGS [18]* | 25.63 | 21.77 | 27.70 | 26.87 | 22.75 | 31.69 | 29.08 | 31.56 | 32.29 |
| | 3DGS [18] + EWA [59] | 25.64 | 21.86 | 27.65 | 26.87 | 22.91 | 31.68 | 29.21 | 31.59 | 32.51 |
| | Mip-Splatting (ours) | 25.72 | 21.93 | 27.76 | 26.94 | 22.98 | 31.74 | 29.16 | 31.55 | 32.31 |
| SSIM | NeRF [9, 28] | 0.455 | 0.376 | 0.546 | 0.453 | 0.459 | 0.843 | 0.775 | 0.749 | 0.792 |
| | mip-NeRF [1] | 0.454 | 0.373 | 0.543 | 0.517 | 0.466 | 0.851 | 0.779 | 0.745 | 0.818 |
| | NeRF++ [56] | 0.526 | 0.453 | 0.635 | 0.594 | 0.530 | 0.852 | 0.802 | 0.816 | 0.876 |
| | Plenoxels [11] | 0.496 | 0.431 | 0.606 | 0.523 | 0.509 | 0.842 | 0.759 | 0.648 | 0.814 |
| | Instant NGP [32, 52] | 0.540 | 0.378 | 0.709 | 0.654 | 0.547 | 0.893 | 0.845 | 0.857 | 0.924 |
| | mip-NeRF 360 [2, 30] | 0.693 | 0.583 | 0.816 | 0.746 | 0.632 | 0.913 | 0.895 | 0.920 | 0.939 |
| | Zip-NeRF [3] | 0.769 | 0.642 | 0.860 | 0.800 | 0.681 | 0.925 | 0.902 | 0.928 | 0.949 |
| | 3DGS [18] | 0.771 | 0.605 | 0.868 | 0.775 | 0.638 | 0.914 | 0.905 | 0.922 | 0.938 |
| | 3DGS [18]* | 0.777 | 0.622 | 0.873 | 0.783 | 0.652 | 0.928 | 0.916 | 0.933 | 0.948 |
| | 3DGS [18] + EWA [59] | 0.777 | 0.620 | 0.871 | 0.784 | 0.655 | 0.927 | 0.916 | 0.933 | 0.948 |
| | Mip-Splatting (ours) | 0.780 | 0.623 | 0.875 | 0.786 | 0.655 | 0.928 | 0.916 | 0.933 | 0.948 |
| LPIPS | NeRF [9, 28] | 0.536 | 0.529 | 0.415 | 0.551 | 0.546 | 0.353 | 0.394 | 0.335 | 0.398 |
| | mip-NeRF [1] | 0.541 | 0.535 | 0.422 | 0.490 | 0.538 | 0.346 | 0.390 | 0.336 | 0.370 |
| | NeRF++ [56] | 0.455 | 0.466 | 0.331 | 0.416 | 0.466 | 0.335 | 0.351 | 0.260 | 0.291 |
| | Plenoxels [11] | 0.506 | 0.521 | 0.386 | 0.503 | 0.540 | 0.419 | 0.441 | 0.447 | 0.398 |
| | Instant NGP [32, 52] | 0.398 | 0.441 | 0.255 | 0.339 | 0.420 | 0.242 | 0.255 | 0.170 | 0.198 |
| | mip-NeRF 360 [2, 30] | 0.289 | 0.345 | 0.164 | 0.254 | 0.338 | 0.211 | 0.203 | 0.126 | 0.177 |
| | Zip-NeRF [3] | 0.208 | 0.273 | 0.118 | 0.193 | 0.242 | 0.196 | 0.185 | 0.116 | 0.173 |
| | 3DGS [18] | 0.205 | 0.336 | 0.103 | 0.210 | 0.317 | 0.220 | 0.204 | 0.129 | 0.205 |
| | 3DGS [18]* | 0.205 | 0.329 | 0.103 | 0.208 | 0.318 | 0.192 | 0.178 | 0.113 | 0.174 |
| | 3DGS [18] + EWA [59] | 0.213 | 0.335 | 0.111 | 0.210 | 0.325 | 0.192 | 0.179 | 0.113 | 0.173 |
| | Mip-Splatting (ours) | 0.206 | 0.331 | 0.103 | 0.209 | 0.320 | 0.192 | 0.179 | 0.113 | 0.173 |
The following are the results from [Table 11] of the original paper:
| Metric | Method | bicycle | flowers | garden | stump | treehill | room | counter | kitchen | bonsai |
|---|---|---|---|---|---|---|---|---|---|---|
| PSNR | Instant-NGP [32] | 22.51 | 20.25 | 24.65 | 23.15 | 22.24 | 29.48 | 26.18 | 27.10 | 29.66 |
| | mip-NeRF 360 [2] | 24.21 | 21.60 | 25.82 | 25.59 | 22.78 | 29.58 | 27.72 | 28.78 | 31.63 |
| | zip-NeRF [3] | 23.05 | 20.05 | 18.07 | 23.94 | 22.53 | 20.51 | 26.08 | 27.37 | 30.05 |
| | 3DGS [18] | 21.34 | 19.43 | 21.94 | 22.63 | 20.91 | 28.10 | 25.33 | 23.68 | 25.89 |
| | 3DGS [18] + EWA [59] | 23.74 | 20.94 | 24.69 | 24.81 | 21.93 | 29.80 | 27.23 | 27.07 | 28.63 |
| | Mip-Splatting (ours) | 25.26 | 22.02 | 26.78 | 26.65 | 22.92 | 31.56 | 28.87 | 30.73 | 31.49 |
| SSIM | Instant-NGP [32] | 0.538 | 0.473 | 0.647 | 0.590 | 0.544 | 0.868 | 0.795 | 0.764 | 0.877 |
| | mip-NeRF 360 [2] | 0.662 | 0.567 | 0.716 | 0.715 | 0.628 | 0.795 | 0.845 | 0.828 | 0.910 |
| | zip-NeRF [3] | 0.640 | 0.521 | 0.548 | 0.661 | 0.590 | 0.655 | 0.784 | 0.800 | 0.865 |
| | 3DGS [18] | 0.638 | 0.536 | 0.675 | 0.662 | 0.591 | 0.878 | 0.826 | 0.789 | 0.838 |
| | 3DGS [18] + EWA [59] | 0.671 | 0.563 | 0.613 | 0.718 | 0.693 | 0.608 | 0.889 | 0.843 | 0.813 |
| | Mip-Splatting (ours) | 0.738 | 0.786 | 0.776 | 0.659 | 0.921 | 0.897 | 0.903 | 0.933 | |
| LPIPS | Instant-NGP [32] | 0.500 | 0.486 | 0.372 | 0.469 | 0.511 | 0.270 | 0.310 | 0.286 | 0.229 |
| | mip-NeRF 360 [2] | 0.358 | 0.400 | 0.296 | 0.333 | 0.391 | 0.256 | 0.228 | 0.210 | 0.182 |
| | zip-NeRF [3] | 0.353 | 0.397 | 0.346 | 0.349 | 0.353 | 0.366 | 0.302 | 0.277 | 0.232 |
| | 3DGS [18] | 0.336 | 0.406 | 0.295 | 0.406 | 0.405 | 0.223 | 0.239 | 0.245 | 0.242 |
| | 3DGS [18] + EWA [59] | 0.322 | 0.395 | 0.281 | 0.334 | 0.217 | 0.231 | 0.216 | 0.227 | |
| | Mip-Splatting (ours) | 0.281 | 0.373 | 0.233 | 0.281 | 0.369 | 0.193 | 0.199 | 0.165 | 0.176 |
6.4. Visual Results
The visual results presented in the paper (Figures 2, 4, 5, 6, 7, 8, and 9) consistently support the quantitative findings. For example, Figure 2 illustrates how Mip-Splatting renders faithful images across different scales (8x resolution, full resolution, 1/4 resolution), whereas 3DGS and 3DGS + EWA show strong artifacts (erosion or blurriness). Similarly, Figure 4 (Blender) and Figure 5 (Mip-NeRF 360) demonstrate the qualitative superiority of Mip-Splatting, with cleaner textures, sharper edges, and fewer artifacts (such as holes or over-smoothing) across zoom levels. The bicycle wheel spokes in Figure 1, in particular, show how Mip-Splatting preserves fine details without dilation. The ablation figures (Figures 6 and 7) visually confirm the distinct roles of the 3D smoothing filter (preventing high-frequency artifacts during zoom-in) and the 2D Mip filter (preventing aliasing during zoom-out).
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Mip-Splatting, an enhanced version of 3D Gaussian Splatting, to achieve alias-free novel view synthesis across arbitrary scales. This is accomplished through two primary innovations: a 3D smoothing filter and a 2D Mip filter. The 3D smoothing filter regularizes the maximum frequency of 3D Gaussian primitives based on the actual sampling rates of the training views, thereby preventing high-frequency artifacts (e.g., erosion) during magnification. The 2D Mip filter replaces the problematic 2D dilation, approximating a physical 2D box filter to effectively mitigate aliasing and dilation artifacts, especially during minification. Experimental results on both synthetic (Blender) and real-world (Mip-NeRF 360) datasets demonstrate that Mip-Splatting is competitive with state-of-the-art methods in in-distribution settings and significantly outperforms them in out-of-distribution scenarios, particularly when changing camera focal length or distance (zoom-in/out). The method's ability to achieve robust multi-scale rendering from single-scale training data is a key strength, offering better generalization and practical applicability.
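To make the fusion of the 3D smoothing filter concrete, below is a minimal PyTorch-style sketch of how a filter of the form described above (cf. Equation 9) can be folded into each Gaussian's parameters. The function and variable names (`apply_3d_smoothing`, `cov3d`, `nu_max`) and the default value of `s` are illustrative assumptions, not the authors' implementation.

```python
import torch

def apply_3d_smoothing(cov3d: torch.Tensor, opacity: torch.Tensor,
                       nu_max: torch.Tensor, s: float = 0.2):
    """Fuse a low-pass (smoothing) filter into each 3D Gaussian (cf. Eq. 9).

    cov3d:   (N, 3, 3) covariance matrices of the 3D Gaussians
    opacity: (N,) opacities
    nu_max:  (N,) maximal sampling frequency per Gaussian over training views
    s:       scalar hyperparameter scaling the filter variance (assumed value)
    """
    # Convolving two Gaussians adds their covariances: the filter
    # contributes an isotropic variance of (s / nu_k)^2 along each axis.
    var = (s / nu_max) ** 2                                   # (N,)
    eye = torch.eye(3, device=cov3d.device, dtype=cov3d.dtype)
    cov_reg = cov3d + var[:, None, None] * eye                # (N, 3, 3)
    # Rescale opacity so the filtered Gaussian preserves its peak energy,
    # mirroring a factor of sqrt(|Sigma| / |Sigma + (s/nu)^2 I|).
    scale = torch.sqrt(torch.linalg.det(cov3d) / torch.linalg.det(cov_reg))
    return cov_reg, opacity * scale
```

Because the filter only inflates the covariance and rescales the opacity once, the renderer afterwards runs unchanged, which is exactly why no rendering overhead is incurred.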
7.2. Limitations & Future Work
The authors acknowledge a few limitations and suggest avenues for future work:
- Gaussian Approximation Error: The 2D Mip filter approximates a theoretical 2D box filter (for perfect pixel integration) with a Gaussian filter. This approximation introduces errors, especially when the Gaussian itself is small in screen space. The paper notes that these errors tend to increase with greater zoom-out factors, as evidenced in Table 2.
- Training Overhead: Computing the maximal sampling rate for each 3D Gaussian (required by the 3D smoothing filter) adds a slight training overhead. This computation is currently performed in PyTorch, and the authors suggest that a more efficient CUDA implementation could reduce the cost; a sketch of the computation follows after this list.
- Data Structure for Sampling Rate: The sampling rate computation depends solely on camera poses and intrinsics. The authors propose that designing a better data structure for precomputing and storing this information could further improve efficiency.
- Rendering Overhead: Notably, the 3D smoothing filter is fused into the Gaussian primitives via Equation 9, becoming part of the 3D representation itself; as a result, there is no additional overhead during rendering once training is complete.
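Since the sampling-rate computation depends only on camera poses and intrinsics, it is easy to sketch. The batched PyTorch version below is an illustration under simplifying assumptions: a depth-positivity check stands in for the real frustum visibility test, and all names (`max_sampling_rate`, `cam_R`, `cam_t`) are invented for this example.

```python
import torch

def max_sampling_rate(means: torch.Tensor, cam_R: torch.Tensor,
                      cam_t: torch.Tensor, focal: torch.Tensor) -> torch.Tensor:
    """Per-Gaussian maximal sampling frequency: nu_k = max_n f_n / d_n.

    means: (N, 3) Gaussian centers in world coordinates
    cam_R: (M, 3, 3) world-to-camera rotations
    cam_t: (M, 3) world-to-camera translations
    focal: (M,) focal lengths in pixels
    """
    # Transform all centers into every camera frame: (M, N, 3)
    pts_cam = torch.einsum('mij,nj->mni', cam_R, means) + cam_t[:, None, :]
    depth = pts_cam[..., 2]                                    # (M, N)
    # Sampling frequency f / d; points behind the camera contribute nothing.
    nu = focal[:, None] / depth.clamp(min=1e-6)
    nu = torch.where(depth > 0.0, nu, torch.zeros_like(nu))    # crude visibility
    return nu.max(dim=0).values                                # (N,)
```

Precomputing and caching this per-Gaussian maximum, as the authors propose, would amortize the cost across training iterations.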
7.3. Personal Insights & Critique
Mip-Splatting presents a highly valuable and practical improvement to 3D Gaussian Splatting. The shift from implicit (NeRF) to explicit (3DGS) representations brought immense speed benefits, but anti-aliasing remained a challenge. Mip-Splatting addresses this challenge with a principled approach, which is commendable.
Innovations and Strengths:
- Principled Anti-Aliasing: The use of the Nyquist-Shannon Sampling Theorem to guide 3D Gaussian regularization provides a strong theoretical foundation (summarized in the short math sketch after this list). This is a critical distinction from empirical filtering or learned, less interpretable anti-aliasing methods.
- Single-Scale Training, Multi-Scale Generalization: This is perhaps the most significant practical advantage. Requiring only single-scale training data to achieve robust performance across various zoom levels simplifies data collection and training pipelines for real-world deployment. Previous methods often required multi-scale data, which is more cumbersome.
- Closed-Form Analytical Solution: Modifying the Gaussians directly with analytical filters (instead of relying on neural networks to learn multi-scale properties) provides greater control and interpretability.
- Direct Enhancement of 3DGS: By building directly on 3DGS, Mip-Splatting inherits its real-time rendering capabilities while mitigating its key artifactual drawbacks.
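Up to notation, the sampling-rate argument behind the first point can be summarized compactly (a sketch; the paper's exact formulation may differ in details such as the visibility test). A camera with focal length $f$ (in pixels) observing a point at depth $d$ samples the scene at a world-space interval $\hat{T}$, giving a sampling frequency $\hat{\nu}$ and a per-Gaussian limit $\hat{\nu}_k$:

$$
\hat{T} = \frac{d}{f}, \qquad
\hat{\nu} = \frac{f}{d}, \qquad
\hat{\nu}_k = \max_{n \in \{1,\dots,N\}} \mathbb{1}_n(\mathbf{p}_k)\,\frac{f_n}{d_n},
$$

where $\mathbb{1}_n(\mathbf{p}_k)$ indicates whether Gaussian center $\mathbf{p}_k$ is visible in view $n$. By the sampling theorem, frequencies above this limit cannot be faithfully reconstructed from the training views, so the 3D smoothing filter band-limits each primitive accordingly.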
Potential Issues and Areas for Improvement:
- Gaussian Approximation for Box Filter: While pragmatic for efficiency, the Gaussian approximation to the ideal box filter is a known limitation (see the sketch following this list). Further research could explore more accurate yet efficient analytical approximations or hybrid approaches.
- Hyperparameter Sensitivity: The scalar hyperparameters for both filters are chosen empirically. While effective, their optimal values might vary across scenes or specific application requirements. An adaptive or learned method for these parameters could be explored.
- Computational Overhead for Sampling Rate: Although the authors acknowledge and suggest improvements, the per-iteration recomputation of sampling rates during training could still be a bottleneck for extremely large scenes with millions of Gaussians, especially if not fully optimized in CUDA.
- Dynamic Scenes: The current method, like 3DGS, is primarily designed for static scenes. Extending alias-free rendering to dynamic scenes, where Gaussian parameters change over time, would introduce new challenges.
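To illustrate the box-filter point above, here is a minimal sketch of a Mip-style 2D screen-space filter in PyTorch. The names (`apply_2d_mip_filter`, `cov2d`) are illustrative, and `s = 0.1` is an assumed constant standing in for roughly one pixel of variance in screen space.

```python
import torch

def apply_2d_mip_filter(cov2d: torch.Tensor, opacity: torch.Tensor,
                        s: float = 0.1):
    """Approximate a 2D box (pixel) filter with a Gaussian of variance s.

    cov2d:   (N, 2, 2) screen-space covariances of the projected Gaussians
    opacity: (N,) opacities
    """
    eye = torch.eye(2, device=cov2d.device, dtype=cov2d.dtype)
    cov_mip = cov2d + s * eye
    # Unlike the constant dilation in 3DGS, the energy is renormalized so
    # that Gaussians smaller than a pixel are not artificially brightened.
    scale = torch.sqrt(torch.linalg.det(cov2d) / torch.linalg.det(cov_mip))
    return cov_mip, opacity * scale
```

The approximation error discussed above arises exactly here: a Gaussian of fixed variance only loosely matches a unit box filter, and the mismatch grows as `cov2d` shrinks under heavy minification.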
Transferability and Future Value:
- The core idea of constraining 3D primitive properties based on multi-view sampling rates could potentially be generalized to other primitive-based scene representations beyond Gaussians.
- The principles of multi-scale anti-aliasing from single-scale training are highly valuable for fields like augmented reality (AR) and virtual reality (VR), where users frequently change their viewpoint and zoom level, and efficient, artifact-free rendering is paramount.
- This work significantly advances the practicality of 3DGS, making it a more viable candidate for real-world computer vision and graphics applications that demand both quality and efficiency.