Paper status: completed

GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels

Published:05/04/2025

real-time camera tracking (3)Dense RGB-D SLAM system (3)2D Gaussian surfels representation (1)Multi-view Geometric Consistency (1)Dynamic Local Map Isolation (1)

Original Link PDF

Price: 0.100000

6 readers

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

GauS-SLAM resolves geometry distortion and tracking inaccuracy in Gaussian-based RGB-D SLAM by employing 2D Gaussian surfels with an incremental reconstruction strategy and Surface-aware Depth Rendering. This, alongside a dynamic local map, significantly improves tracking precisi

Abstract

We propose GauS-SLAM, a dense RGB-D SLAM system that leverages 2D Gaussian surfels to achieve robust tracking and high-fidelity mapping. Our investigations reveal that Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which significantly degrades the accuracy of Gaussian-based tracking methods. These geometry inconsistencies arise primarily from the depth modeling of Gaussian primitives and the mutual interference between surfaces during the depth blending. To address these, we propose a 2D Gaussian-based incremental reconstruction strategy coupled with a Surface-aware Depth Rendering mechanism, which significantly enhances geometry accuracy and multi-view consistency. Additionally, the proposed local map design dynamically isolates visible surfaces during tracking, mitigating misalignment caused by occluded regions in global maps while maintaining computational efficiency with increasing Gaussian density. Extensive experiments across multiple datasets demonstrate that GauS-SLAM outperforms comparable methods, delivering superior tracking precision and rendering fidelity. The project page will be made available at https://gaus-slam.github.io.

Mind Map

In-depth Reading

English Analysis~16 min read · 21,119 chars

1. Bibliographic Information

Title: GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels
Authors: Yongxin Su, Lin Chen, Kaiting Zhang, Zhongliang Zhao, Chenfeng Hou, Ziping Yu
Affiliations: Beihang University, Northwestern Polytechnical University
Journal/Conference: This paper is a preprint submitted to arXiv. The venue of formal publication is not specified, but its content and structure suggest it is intended for a top-tier computer vision conference like CVPR, ICCV, or ECCV.
Publication Year: 2024 (v1 submitted in May 2024)
Abstract: The authors propose GauS-SLAM, a dense SLAM system for RGB-D cameras that uses 2D Gaussian surfels for scene representation. They identify that existing Gaussian-based SLAM methods suffer from geometry distortion under new viewpoints, which harms tracking accuracy. They attribute this to issues in depth modeling and interference between surfaces during rendering. To solve this, they introduce an incremental reconstruction strategy with a novel Surface-aware Depth Rendering mechanism. They also propose a local map design to isolate visible surfaces for robust tracking and maintain efficiency. Experiments show that GauS-SLAM achieves state-of-the-art tracking precision and high-quality rendering.
Original Source Link:
- arXiv Link: https://arxiv.org/abs/2505.01934v1
- PDF Link: http://arxiv.org/pdf/2505.01934v1
- Status: This is a preprint and has not yet undergone formal peer review.

2. Executive Summary

Background & Motivation (Why):
- Core Problem: The paper addresses critical challenges in dense visual Simultaneous Localization and Mapping (SLAM), a field that aims to build a detailed 3D map of an environment while simultaneously tracking a camera's position within it.
- Current Gaps: Recent advances have adopted 3D Gaussian Splatting (3DGS) for scene representation due to its flexibility and rendering quality. However, existing 3DGS-based SLAM systems often face two major problems:
  1. Geometry Distortion: When viewed from different angles, the 3D Gaussian representation can produce inconsistent geometry, leading to errors when aligning new camera frames with the map. This degrades tracking accuracy.
  2. Misalignment from Occlusions: During tracking, parts of the scene visible in the global map might be occluded from the camera's current view. These occluded regions can act as "outliers" and interfere with the alignment process, causing the estimated camera pose to be incorrect.
Main Contributions / Findings (What): The paper introduces GauS-SLAM, a tightly-coupled SLAM system that makes the following primary contributions to solve the identified problems:
1. 2D Gaussian Surfel Representation for SLAM: Instead of standard 3D Gaussians, the system uses 2D Gaussian surfels (flat, disk-like Gaussians) which are better at representing surfaces and provide more geometrically accurate depth estimates.
2. Surface-aware Depth Rendering: A novel rendering mechanism that mitigates depth errors caused by interference between foreground and background surfaces. It adjusts the contribution of each Gaussian to the final depth value based on its distance from the main surface, improving multi-view consistency.
3. Local Map Design for Robust Tracking: Camera tracking is performed on a small, dynamically managed local map containing only recently observed surfaces. This isolates the tracking process from distracting, occluded parts of the larger global map, preventing misalignment and ensuring computational efficiency as the scene grows.
4. A Complete High-Performance SLAM System: The paper integrates these components into a full front-end/back-end SLAM architecture that achieves state-of-the-art tracking accuracy and high-fidelity 3D reconstruction on multiple benchmark datasets.

Foundational Concepts:
- Simultaneous Localization and Mapping (SLAM): A core robotics and computer vision problem where a moving agent (e.g., a robot or a person with a camera) builds a map of an unknown environment while simultaneously keeping track of its own location within that map.
- Dense SLAM: A type of SLAM that aims to reconstruct a dense, detailed 3D model of the environment, as opposed to sparse SLAM which only maps a set of sparse feature points.
- RGB-D Camera: A sensor that captures both a standard color (RGB) image and a per-pixel depth (D) image, providing 3D information directly.
- 3D Gaussian Splatting (3DGS): A recent and popular technique for representing 3D scenes. A scene is modeled as a collection of millions of tiny, semi-transparent 3D ellipsoids (Gaussians). This representation allows for high-quality, real-time rendering from any viewpoint.
- Surfels (Surface Elements): Small, flat, disk-like primitives used to represent a 3D surface. A surfel typically has a position, a normal vector (its orientation), and a radius. Using surfels can be more efficient and geometrically faithful for representing surfaces than other primitives like points or voxels. GauS-SLAM uses 2D Gaussians as a form of surfel.
Previous Works:
- Implicit Neural Representations (e.g., NeRF): Methods like iMap and NICE-SLAM used neural networks (like MLPs) to implicitly represent the scene's geometry and color. While powerful, they are often slow to render because they require sampling many points along each camera ray.
- Explicit 3DGS-based SLAM: The advent of 3DGS led to a new wave of SLAM systems.
  - SplaTAM and Gaussian-SLAM were pioneering works that directly coupled tracking and mapping using a 3D Gaussian representation. However, the paper points out they suffer from the geometry distortion and misalignment issues mentioned earlier.
  - Other methods like GS-ICP and Photo-SLAM decoupled tracking from the Gaussian map, using traditional visual odometry for pose estimation to improve speed and robustness, but losing the benefits of a tightly coupled system where mapping and tracking mutually improve each other.
Differentiation: GauS-SLAM distinguishes itself from prior 3DGS-based SLAM systems, particularly its baseline SplaTAM, in several key ways:
1. Primitive Choice: It uses 2D Gaussian surfels instead of 3D Gaussian ellipsoids. This is a fundamental shift aimed at improving geometric accuracy. The paper argues that 2D surfels provide a more consistent "unbiased depth" estimate across different views.
2. Depth Rendering: It introduces the Surface-aware Depth Rendering mechanism. This is a specific solution to the problem of "ill-blended depth," where far-away surfaces corrupt the depth estimate of closer ones. No previous SLAM work cited had explicitly addressed this blending artifact in depth rendering.
3. Tracking Strategy: Instead of tracking against a global map, GauS-SLAM uses a local map that is periodically reset. This is a practical engineering choice to ensure tracking remains efficient and is not affected by occluded or irrelevant parts of the scene, a problem that becomes more severe as the map grows.

4. Methodology (Core Technology & Implementation)

The core of GauS-SLAM is its use of 2D Gaussian surfels and a set of novel techniques for rendering and mapping to enhance geometric accuracy and tracking robustness. The overall system follows a front-end/back-end architecture.

$该图像是GauS-SLAM系统的流程示意图，展示了前端RGB-D数据的跟踪与映射过程；局部地图中基于二维高斯surfels的表面感知深度渲染机制及其深度调节公式；后端局部及全局地图的融合与优化步骤。图中重点描述了深度α-混合调整公式 $d_i' = (1-\\beta_i)d_m + \\beta_i d_i$ 及局部地图的动态子图管理策略。$ Figure 1: The overall architecture of the GauS-SLAM system. The front-end handles RGB-D data for tracking and local mapping. The Surface-aware depth rendering module corrects depth estimation. The back-end manages the global map by merging local maps and performing global optimization.

Principles: The central idea is that by enforcing better geometric consistency in the scene representation, the camera tracking (which relies on this representation) will become more accurate. This is achieved by moving from 3D to 2D Gaussians and refining the depth rendering process.
Steps & Procedures:

4.1. Gaussian Surfel-based Representation Instead of 3D Gaussians, the scene is represented by a set of 2D Gaussian surfels. A surfel is a flat disk defined by a center point $\mu$ , two tangent vectors defining its plane, an opacity $o$ , and a color $c$ . A point $p$ in the surfel's plane is given by: $p = { \left[ \begin{array} { l l l l } { \mathbf { e } _ { u } } & { \mathbf { e } _ { v } } & { 0 } & { \mu } \\ { 0 } & { 0 } & { 0 } & { 1 } \end{array} \right] } { \left[ \begin{array} { l } { u } \\ { v } \\ { 1 } \\ { 1 } \end{array} \right] } = { \left[ \begin{array} { l l } { \Sigma } & { \mu } \\ { 0 } & { 1 } \end{array} \right] } p ^ { \prime }$
- $p' = (u,v,1,1)^T$ : Homogeneous coordinates of the point in the surfel's 2D plane.
- $\Sigma \in \mathbb{R}^{3 \times 3}$ : A transformation matrix representing the geometry (orientation and scale) of the 2D plane in 3D space. It is composed of a rotation $\mathbf{R}$ and a scaling $\mathbf{S}$ (with scale factor 0 in the third dimension to ensure flatness).
- $\mu$ : The 3D center of the surfel.
  
  Rendering is done via alpha blending. For a given camera ray, the intersections with all surfels are found and sorted by depth. The final color $I(\mathbf{r})$ for the ray is a weighted sum of the surfel colors: $\alpha _ { i } = o _ { i } \mathcal { G } _ { i } ( p _ { i } ^ { \prime } ) \qquad w _ { i } = \alpha _ { i } \prod _ { j = 1 } ^ { i - 1 } ( 1 - \alpha _ { j } )$ $I ( { \bf r } ) = \sum _ { i = 1 }^n c _ { i } w _ { i }$
- $\mathcal{G}_i(p_i')$ : The 2D Gaussian function evaluated at the intersection point.
- $o_i$ : The opacity of the $i$ -th surfel.
- $\alpha_i$ : The alpha value (opacity contribution) of the $i$ -th surfel for that ray.
- $w_i$ : The final blending weight of the $i$ -th surfel.
4.2. Surface-aware Depth Rendering This is a key contribution to fix geometry distortions, as illustrated in Figure 2.

Figure 2: This figure highlights two key challenges in Gaussian-based tracking. (a1) shows how the depth of a 3D Gaussian can appear different from varying viewpoints. (a2) illustrates how a distant surface (floor) can interfere with the depth rendering of a closer surface (chair), causing "ill-blended depth". (b) shows how occluded regions can have high opacity and cause misalignment during tracking.

The method has three components:
1. Unbiased Depth: It computes depth by finding the direct geometric intersection of a camera ray with the 2D surfel plane. This is more geometrically accurate than the depth approximation used in 3DGS.
2. Depth Adjustment: To prevent background surfels from distorting the depth of foreground surfaces, it adjusts the depth contribution of each surfel. It first finds the median depth $d_m$ $d_{m}$ , which is the depth of the surfel where the accumulated opacity along the ray first exceeds 0.5. For any surfel $i$ $i$ behind this median surface, its depth $d_i$ $d_{i}$ is adjusted towards $d_m$ $d_{m}$ : $d _ { i } ^ { \prime } = \beta _ { i } d _ { i } + ( 1 - \beta _ { i } ) d _ { m }$ The weight $\beta_i$ $β_{i}$ decreases as the distance between $d_i$ $d_{i}$ and $d_m$ $d_{m}$ increases, effectively down-weighting the influence of far-away surfels. $\beta _ { i } = Exp ( - \frac { ( d _ { i } - d _ { m } ) ^ { 2 } } { B \sigma _ { i } ^ { 2 } } ) , \quad i > m$
  - $d_i'$ : The adjusted depth for the $i$ -th surfel.
  - $\sigma_i^2$ : The variance of depths of surfels considered so far, which helps adapt the adjustment to the local geometry.
  - $B$ : A hyperparameter controlling the sensitivity.
3. Depth Normalization: The final rendered depth for a ray is normalized by the total accumulated opacity $A(\mathbf{r})$ to prevent underestimation in semi-transparent areas. $D ( \mathbf { r } ) = \frac { \sum _ { i = 1 } ^ { n } d _ { i } ^ { \prime } w _ { i } } { A ( \mathbf { r } ) }$
4.3. Camera Tracking The system uses a frame-to-model approach, optimizing the camera pose $\{\mathbf{R}, \mathbf{t}\}$ to align the input RGB-D frame with the rendered image from the current map. The loss function minimizes the difference between the rendered color/depth and the ground truth (GT) from the camera, but only in well-reconstructed regions (where accumulated opacity $A > 0.9$ ). $\mathcal { L } _ { t r a c k } = ( A > 0 . 9 ) \left( \mathcal { L } _ { 1 } ( D , \hat { D } ) + \lambda _ { 1 } \mathcal { L } _ { 1 } ( I , \hat { I } ) \right)$
- D, I: Rendered depth and color.
- $\hat{D}, \hat{I}$ : Ground-truth depth and color from the input frame.
- $\mathcal{L}_1$ : The L1 loss (mean absolute error).
- $\lambda_1$ : A weighting factor.
4.4. Incremental Mapping
- Initialization (Surfel Attachment): New Gaussian surfels are added in regions where the map is incomplete (accumulated opacity < 0.6). Their initial position, scale, and orientation are derived directly from the input depth map and its normals. This is more efficient than the clone/split strategy of 3DGS. In areas without GT depth but with partial reconstruction, a process called Edge Growth uses the rendered depth to initialize new Gaussians, helping to fill holes.
- Optimization: The parameters of the Gaussians (position, scale, color, opacity) are optimized using a loss function that includes color and depth reconstruction terms, plus a regularization term $\mathcal{L}_{reg}$ that encourages surfels along a ray to cluster around the median depth, reducing depth uncertainty. $\mathcal { L } _ { m a p } = \mathcal { L } _ { 1 } ( D , \hat { D } ) + \lambda _ { 1 } \mathcal { L } _ { 1 } ( I , \hat { I } ) + \lambda _ { 2 } \mathcal { L } _ { r e g }$ $\mathcal { L } _ { r e g } = \sum _ { r } \sum _ { i = 1 }^n w _ { i } ( d _ { i } ^ { \prime } - d _ { m } ) ^ { 2 }$
4.5. GauS-SLAM System
- Front-end: Operates on a local map. It performs tracking and decides when to create a new keyframe (KF). When the local map grows too large (number of Gaussians > $\tau_l$ ), it is sent to the back-end, and a new local map is initialized. This keeps the front-end fast.
- Back-end: Manages the global map. It receives local maps, merges them, and performs global optimization.
  - Merging & Pruning: Merges the local Gaussian map into the global one and prunes redundant or low-opacity Gaussians.
  - Bundle Adjustment (BA): Optimizes the poses of keyframes and the global map to reduce accumulated trajectory drift.
  - Random Optimization & Final Refinement: When idle, the back-end randomly selects frames to continue refining the global map, which helps combat catastrophic forgetting and improves overall rendering quality.

5. Experimental Setup

Datasets:
- Replica: A high-quality synthetic dataset of indoor environments with photorealistic renderings and perfect ground truth data. Ideal for evaluating reconstruction quality.
- TUM-RGBD: A real-world dataset for benchmarking RGB-D SLAM. It features challenges like motion blur and exposure changes.
- ScanNet & ScanNet++: Large-scale real-world datasets of indoor scenes captured with a commodity depth sensor. ScanNet++ provides higher-fidelity data.
Evaluation Metrics:
1. Absolute Trajectory Error (ATE-RMSE):
  - Conceptual Definition: Measures the global consistency of the estimated camera trajectory. It aligns the estimated trajectory with the ground-truth trajectory and calculates the Root Mean Square Error (RMSE) of the distances between corresponding camera positions. A lower value is better.
  - Mathematical Formula: $\text{ATE-RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \text{trans}(S^{-1}P_i) - \text{trans}(Q_i) \right\|^2}$
  - Symbol Explanation: $N$ is the number of camera poses, $P_i$ is the estimated pose, $Q_i$ is the ground-truth pose, $S$ is the alignment transformation (found via optimization), and $\text{trans}(\cdot)$ extracts the translation part of a pose.
2. Peak Signal-to-Noise Ratio (PSNR):
  - Conceptual Definition: Measures the quality of a reconstructed image by comparing it to a ground-truth image. It is based on the Mean Squared Error (MSE). A higher PSNR value (in decibels, dB) indicates better rendering quality.
  - Mathematical Formula: $\text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_I^2}{\text{MSE}} \right)$
  - Symbol Explanation: $\text{MAX}_I$ is the maximum possible pixel value (e.g., 255 for 8-bit images), and $\text{MSE}$ is the mean squared error between the rendered and ground-truth images.
3. Structural Similarity Index (SSIM):
  - Conceptual Definition: Measures image quality by comparing structural information, luminance, and contrast, which is often better aligned with human perception than PSNR. The value ranges from -1 to 1, with 1 indicating a perfect match.
  - Mathematical Formula: For two image windows $x$ and $y$ : $\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
  - Symbol Explanation: $\mu_x, \mu_y$ are the means; $\sigma_x^2, \sigma_y^2$ are the variances; $\sigma_{xy}$ is the covariance; $c_1, c_2$ are small constants for stability.
4. Learned Perceptual Image Patch Similarity (LPIPS):
  - Conceptual Definition: A metric that measures the perceptual similarity between two images using features from a deep neural network (e.g., VGG). It is considered to be very close to human judgment of image similarity. A lower value is better.
5. Depth L1:
  - Conceptual Definition: The mean absolute error (L1 norm) between the rendered depth map and the ground-truth depth map. It directly measures the geometric accuracy of the reconstruction. A lower value is better.
6. F1-Score:
  - Conceptual Definition: Used here to evaluate the geometric reconstruction quality by comparing the reconstructed mesh to the ground-truth mesh. It is the harmonic mean of precision (fraction of the reconstructed surface that is close to the ground truth) and recall (fraction of the ground-truth surface that is successfully reconstructed). A higher value is better.
Baselines: The paper compares GauS-SLAM against a wide range of state-of-the-art methods, including:
- NeRF-based: ESLAM, NICE-SLAM
- Point-based: Point-SLAM
- 3DGS-based (coupled): SplaTAM, Gaussian-SLAM, MonoGS
- 3DGS-based (decoupled): GS-ICP
- Traditional: ORB-SLAM2

6. Results & Analysis

GauS-SLAM demonstrates superior performance across tracking, rendering, and reconstruction, especially on high-quality datasets.

Core Results:

该图像是科研论文中的多子图示意图，展示了GauS-SLAM系统在不同指标上的性能表现。左侧子图显示了SLAM追踪轨迹精度对比，中间上方和中间下方分别展示了渲染质量和几何准确度的视觉对比，右侧子图为多种SLAM方法在Replica数据集上的性能散点图，横轴为ATE-RMSE（cm），纵轴为PSNR（dB），GauS-SLAM在图中表现出较低的误差和较高的图像质量。 Figure 3: A summary of GauS-SLAM's performance. The left panel shows the highly accurate camera trajectory (ATE-RMSE of 0.42 cm) compared to a state-of-the-art method (SplaTAM, 1.91 cm). The middle panels show superior rendering quality (PSNR) and geometric accuracy (Depth L1). The right panel is a scatter plot showing that GauS-SLAM (the star symbol) achieves a better trade-off between tracking accuracy (ATE, x-axis) and rendering quality (PSNR, y-axis) than many other methods on the Replica dataset.

Tracking Performance: As shown in Table 1 (transcribed below), GauS-SLAM achieves an ATE-RMSE of 0.06 cm on the Replica dataset, significantly outperforming all baselines, including SplaTAM (0.36 cm) and GS-ICP (0.16 cm).

(Manual transcription of Table 1 from the paper)

Method	PSNR [db]↑	SSIM ↑	LPIPS ↓	ATE [cm]↓	Depth L1 [cm]↓	F1-Score [%]↑
ESLAM[11]	27.80	0.921	0.245	0.63	2.08	78.2
Point-SLAM[20]	35.17	0.975	0.124	0.52	0.44	89.7
MonoGS[14]	37.50	0.960	0.070	0.58	0.95	78.6
SplaTAM[12]	34.11	0.978	0.104	0.36	0.72	86.1
Gaussian SLAM[35]	42.08	0.996	0.018	0.31	0.68	88.9
GS-ICP[6]	38.83	0.975	0.041	0.16	-	-
GauS-SLAM(Ours)	40.25	0.991	0.027	0.06	0.43	90.5

On more challenging real-world datasets like TUM-RGBD and ScanNet (Table 2), it remains competitive. On the high-quality ScanNet++ dataset, it again achieves state-of-the-art, millimeter-level accuracy.

(Manual transcription of Table 2 from the paper)

| | TUM-RGBD[24] | ScanNet[4] | ScanNet++[33] | :--- | :--- | :--- | :--- | :--- | :--- | :--- | Method | fr2 | fr3 | 0059 | 0169 | S1 | S2 | ORB-SLAM2[16] | 0.40 | 1.00 | 14.25 | 8.72 | X | X | Point-SLAM[20] | 1.31 | 3.48 | 7.81 | 22.16 | X | X | MonoGS*[14] | 1.77 | 1.49 | 32.10 | 10.70 | 7.00 | 3.66 | SplaTAM[12] | 1.24 | 5.16 | 10.10 | 12.10 | 1.91 | 0.61 | Gaussian-SLAM*[35]| 1.39 | 5.31 | 12.80 | 16.30 | 1.37 | 2.28 | LoopSplat*[40] | - | 1.30 | 3.53 | 7.10 | 10.60 | - | GauS-SLAM(Ours)| 1.34 | 1.46 | 7.14 | 7.45 | 0.42 | 0.47

Rendering and Reconstruction Performance: From Table 1, GauS-SLAM achieves a PSNR of 40.25 dB, surpassing SplaTAM by over 6 dB. Its geometric accuracy is also top-tier, with the lowest Depth L1 error (0.43 cm) and highest F1-Score (90.5%). Figure 5 (labeled 4.jpg in source) shows that using 2D Gaussian surfels results in much smoother and more realistic mesh surfaces compared to methods using isotropic 3D Gaussians.

$Figure 5. Comparison of mesh results on Replica\[22\]. Compared to isotropic 3D Gaussians, Gaussian surfels produce smoother mesh reconstructions.$ Figure 4: A visual comparison of mesh quality on the Replica dataset. GauS-SLAM (right) produces significantly smoother surfaces, especially on walls and furniture, compared to MonoGS (left) and SplaTAM (middle), which show bumpy, uneven artifacts.

Geometry Consistency: The experiments in Figure 6 (labeled 6.jpg in source) demonstrate the effectiveness of the Surface-aware Depth Rendering. While standard 2DGS improves over 3DGS, it still shows errors at object boundaries. GauS-SLAM's method successfully cleans up these errors, leading to better geometric consistency across views.

Figure 6. In the geometry consistency experiment, the error maps of the depth rendering results on the Frame 4 0 and Frame 55 of Room 0 in the Replica dataset. Figure 5: Depth rendering error maps. The proposed GauS-SLAM method (right) shows significantly lower error (darker blue) compared to standard 3DGS (left) and 2DGS (middle), especially around object edges, confirming its superior geometry consistency.

Runtime Efficiency: The local map design is crucial for efficiency. Table 3 and Figure 8 (labeled 8.jpg in source) show that GauS-SLAM is significantly faster than SplaTAM. Its tracking and mapping times remain stable as the scene grows, whereas SplaTAM's performance degrades.

$Table 11. Comparison of tracking performance on Replica (ATE RMSE↓\[cm\]). Results are tacken from respective papers.$ Figure 6: A comparison of runtime efficiency. GauS-SLAM's mapping and tracking times (green lines) remain low and constant over time. In contrast, SplaTAM's times (blue lines) increase as more Gaussians are added to the map, demonstrating the efficiency benefit of GauS-SLAM's local map approach.

Ablation Studies:

The ablation studies systematically validate the contribution of each new component.

(Manual transcription of Table 4 from the paper)

Methods	Geo. Con [mm]↓	ATE [mm]↓	PSNR [dB]↑
A. w/o Unbiased Depth	1.94	2.10	36.06
B. w/o Depth Adjustment	1.75	0.85	38.10
C. w/o Depth Norm.	2.51	1.92	35.98
D. w/o Regulation Loss	1.01	0.63	38.25
Full Model	1.01	0.60	38.04

Depth Rendering Ablation (Table 4): Removing Unbiased Depth (i.e., using 3D Gaussians) or Depth Normalization severely degrades both tracking accuracy (ATE) and rendering quality (PSNR). Removing Depth Adjustment also hurts tracking performance. This confirms that all parts of the proposed rendering pipeline are crucial.

(Manual transcription of Table 5 from the paper)

Methods	Sequence	ATE [mm]↓	PSNR [dB]↑	Time [s]↓
E. w/o Keyframe	Room 0	0.52	38.28	2.13
	fr3/office	14.53	25.03	1.72
F. w/o LocalMap	Room 0	0.49	38.25	6.77
	fr3/office	52.91	24.16	5.58
G. w/o Random Optimization	Room 0	0.70	37.78	1.63
	fr3/office	14.37	25.03	1.62
H. w/o Final Refinement	Room 0	0.54	37.48	1.73
	fr3/office	14.30	24.34	1.63
Full Model	Room 0	0.60	38.04	1.73
	fr3/office	14.29	25.06	1.62

SLAM Components Ablation (Table 5):
- Removing the local map (F. w/o LocalMap) drastically increases computation time and leads to a massive tracking error (52.91 mm) in the fr3/office sequence, where the camera moves in a circle. This proves the local map is essential for handling occlusions and maintaining efficiency.
- Removing the back-end's random optimization (G. w/o Random Optimization) reduces tracking accuracy.
- Removing final refinement (H. w/o Final Refinement) slightly reduces the final rendering quality (PSNR).

7. Conclusion & Reflections

Conclusion Summary: The paper successfully identifies and addresses two fundamental problems in coupled Gaussian-based SLAM: multi-view geometric distortion and misalignment caused by occlusions. By introducing a system, GauS-SLAM, built on 2D Gaussian surfels, a novel Surface-aware Depth Rendering mechanism, and a robust local map design, the authors achieve a new state-of-the-art in tracking accuracy on several key benchmarks while also producing high-fidelity, geometrically consistent maps. The work underscores the importance of geometric accuracy in the underlying scene representation for achieving robust camera tracking.
Limitations & Future Work: The authors acknowledge that GauS-SLAM, like many similar methods, is sensitive to real-world camera artifacts such as motion blur and significant exposure variations. This explains its relatively less dominant performance on datasets like TUM-RGBD and ScanNet, which are rich in these challenges. Future work will focus on improving the system's robustness to these factors that cause multi-view inconsistency.
Personal Insights & Critique:
- Novelty: The primary novelty lies in the careful adaptation of 2D Gaussian surfels for a coupled SLAM system and the specific formulation of Surface-aware Depth Rendering. While 2D surfels existed, their integration into a full SLAM pipeline to explicitly solve the geometry-for-tracking problem is a strong contribution. The local map strategy is a very practical and effective engineering solution to a well-known problem in large-scale SLAM.
- Impact: This paper provides a solid blueprint for future dense SLAM systems. It demonstrates that moving beyond the standard 3D Gaussian primitive can yield significant gains in geometric fidelity, which directly translates to better localization. The focus on improving the core representation's consistency is a valuable direction for the field.
- Open Questions: The system still relies on high-quality depth data from an RGB-D camera. Its performance in a monocular (RGB-only) setting is an open question. Furthermore, while the local map improves efficiency, the system does not seem to include explicit loop closure detection and correction, which is critical for correcting drift in very large-scale, long-term mapping. Although BA is performed, it is on co-visible submaps, not necessarily large loops. The comparison to methods like LoopSplat suggests this is an area for future improvement.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.