GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels
TL;DR Summary
GauS-SLAM resolves geometry distortion and tracking inaccuracy in Gaussian-based RGB-D SLAM by employing 2D Gaussian surfels with an incremental reconstruction strategy and Surface-aware Depth Rendering. This, alongside a dynamic local map, significantly improves tracking precision and rendering fidelity.
Abstract
We propose GauS-SLAM, a dense RGB-D SLAM system that leverages 2D Gaussian surfels to achieve robust tracking and high-fidelity mapping. Our investigations reveal that Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which significantly degrades the accuracy of Gaussian-based tracking methods. These geometry inconsistencies arise primarily from the depth modeling of Gaussian primitives and the mutual interference between surfaces during the depth blending. To address these, we propose a 2D Gaussian-based incremental reconstruction strategy coupled with a Surface-aware Depth Rendering mechanism, which significantly enhances geometry accuracy and multi-view consistency. Additionally, the proposed local map design dynamically isolates visible surfaces during tracking, mitigating misalignment caused by occluded regions in global maps while maintaining computational efficiency with increasing Gaussian density. Extensive experiments across multiple datasets demonstrate that GauS-SLAM outperforms comparable methods, delivering superior tracking precision and rendering fidelity. The project page will be made available at https://gaus-slam.github.io.
In-depth Reading
1. Bibliographic Information
- Title: GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels
- Authors: Yongxin Su, Lin Chen, Kaiting Zhang, Zhongliang Zhao, Chenfeng Hou, Ziping Yu
- Affiliations: Beihang University, Northwestern Polytechnical University
- Journal/Conference: This paper is a preprint submitted to arXiv. The venue of formal publication is not specified, but its content and structure suggest it is intended for a top-tier computer vision conference like CVPR, ICCV, or ECCV.
- Publication Year: 2025 (v1 submitted to arXiv in May 2025, per the identifier 2505.01934)
- Abstract: The authors propose GauS-SLAM, a dense SLAM system for RGB-D cameras that uses 2D Gaussian surfels for scene representation. They identify that existing Gaussian-based SLAM methods suffer from geometry distortion under new viewpoints, which harms tracking accuracy. They attribute this to issues in depth modeling and interference between surfaces during rendering. To solve this, they introduce an incremental reconstruction strategy with a novel Surface-aware Depth Rendering mechanism. They also propose a local map design to isolate visible surfaces for robust tracking and maintain efficiency. Experiments show that GauS-SLAM achieves state-of-the-art tracking precision and high-quality rendering.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2505.01934v1
- PDF Link: http://arxiv.org/pdf/2505.01934v1
- Status: This is a preprint and has not yet undergone formal peer review.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The paper addresses critical challenges in dense visual Simultaneous Localization and Mapping (SLAM), a field that aims to build a detailed 3D map of an environment while simultaneously tracking a camera's position within it.
- Current Gaps: Recent advances have adopted 3D Gaussian Splatting (3DGS) for scene representation due to its flexibility and rendering quality. However, existing 3DGS-based SLAM systems often face two major problems:
  - Geometry Distortion: When viewed from different angles, the 3D Gaussian representation can produce inconsistent geometry, leading to errors when aligning new camera frames with the map. This degrades tracking accuracy.
- Misalignment from Occlusions: During tracking, parts of the scene visible in the global map might be occluded from the camera's current view. These occluded regions can act as "outliers" and interfere with the alignment process, causing the estimated camera pose to be incorrect.
- Main Contributions / Findings (What):
The paper introduces GauS-SLAM, a tightly-coupled SLAM system that makes the following primary contributions to solve the identified problems:
  - 2D Gaussian Surfel Representation for SLAM: Instead of standard 3D Gaussians, the system uses 2D Gaussian surfels (flat, disk-like Gaussians), which are better at representing surfaces and provide more geometrically accurate depth estimates.
- Surface-aware Depth Rendering: A novel rendering mechanism that mitigates depth errors caused by interference between foreground and background surfaces. It adjusts the contribution of each Gaussian to the final depth value based on its distance from the main surface, improving multi-view consistency.
- Local Map Design for Robust Tracking: Camera tracking is performed on a small, dynamically managed local map containing only recently observed surfaces. This isolates the tracking process from distracting, occluded parts of the larger global map, preventing misalignment and ensuring computational efficiency as the scene grows.
- A Complete High-Performance SLAM System: The paper integrates these components into a full front-end/back-end SLAM architecture that achieves state-of-the-art tracking accuracy and high-fidelity 3D reconstruction on multiple benchmark datasets.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Simultaneous Localization and Mapping (SLAM): A core robotics and computer vision problem where a moving agent (e.g., a robot or a person with a camera) builds a map of an unknown environment while simultaneously keeping track of its own location within that map.
- Dense SLAM: A type of SLAM that aims to reconstruct a dense, detailed 3D model of the environment, as opposed to sparse SLAM which only maps a set of sparse feature points.
- RGB-D Camera: A sensor that captures both a standard color (RGB) image and a per-pixel depth (D) image, providing 3D information directly.
- 3D Gaussian Splatting (3DGS): A recent and popular technique for representing 3D scenes. A scene is modeled as a collection of millions of tiny, semi-transparent 3D ellipsoids (Gaussians). This representation allows for high-quality, real-time rendering from any viewpoint.
- Surfels (Surface Elements): Small, flat, disk-like primitives used to represent a 3D surface. A surfel typically has a position, a normal vector (its orientation), and a radius. Using surfels can be more efficient and geometrically faithful for representing surfaces than other primitives like points or voxels.
GauS-SLAM uses 2D Gaussians as a form of surfel.
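To make the surfel primitive concrete, here is a minimal sketch of the attributes such a primitive typically carries. This is not the authors' code; the class and field names (`Surfel`, `tangent_u`, etc.) are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    """A 2D Gaussian surfel: a flat, oriented, semi-transparent disk."""
    center: np.ndarray     # (3,) position in world coordinates
    tangent_u: np.ndarray  # (3,) unit tangent spanning the disk plane
    tangent_v: np.ndarray  # (3,) second unit tangent, orthogonal to tangent_u
    scale: np.ndarray      # (2,) extent along each tangent (third axis has scale 0)
    opacity: float         # o in [0, 1]
    color: np.ndarray      # (3,) RGB

    @property
    def normal(self) -> np.ndarray:
        # Orientation of the surface element: perpendicular to the disk plane.
        n = np.cross(self.tangent_u, self.tangent_v)
        return n / np.linalg.norm(n)
```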
- Previous Works:
  - Implicit Neural Representations (e.g., NeRF): Methods like iMap and NICE-SLAM used neural networks (like MLPs) to implicitly represent the scene's geometry and color. While powerful, they are often slow to render because they require sampling many points along each camera ray.
  - Explicit 3DGS-based SLAM: The advent of 3DGS led to a new wave of SLAM systems. SplaTAM and Gaussian-SLAM were pioneering works that directly coupled tracking and mapping using a 3D Gaussian representation. However, the paper points out they suffer from the geometry distortion and misalignment issues mentioned earlier.
  - Other methods like GS-ICP and Photo-SLAM decoupled tracking from the Gaussian map, using traditional visual odometry for pose estimation to improve speed and robustness, but losing the benefits of a tightly coupled system where mapping and tracking mutually improve each other.
- Differentiation: GauS-SLAM distinguishes itself from prior 3DGS-based SLAM systems, particularly its baseline SplaTAM, in several key ways:
  - Primitive Choice: It uses 2D Gaussian surfels instead of 3D Gaussian ellipsoids. This is a fundamental shift aimed at improving geometric accuracy. The paper argues that 2D surfels provide a more consistent "unbiased depth" estimate across different views.
  - Depth Rendering: It introduces the Surface-aware Depth Rendering mechanism. This is a specific solution to the problem of "ill-blended depth," where far-away surfaces corrupt the depth estimate of closer ones. No previous SLAM work cited had explicitly addressed this blending artifact in depth rendering.
  - Tracking Strategy: Instead of tracking against a global map, GauS-SLAM uses a local map that is periodically reset. This is a practical engineering choice to ensure tracking remains efficient and is not affected by occluded or irrelevant parts of the scene, a problem that becomes more severe as the map grows.
4. Methodology (Core Technology & Implementation)
The core of GauS-SLAM is its use of 2D Gaussian surfels and a set of novel techniques for rendering and mapping to enhance geometric accuracy and tracking robustness. The overall system follows a front-end/back-end architecture.
Figure 1: The overall architecture of the GauS-SLAM system. The front-end handles RGB-D data for tracking and local mapping. The Surface-aware depth rendering module corrects depth estimation. The back-end manages the global map by merging local maps and performing global optimization.
- Principles: The central idea is that by enforcing better geometric consistency in the scene representation, the camera tracking (which relies on this representation) will become more accurate. This is achieved by moving from 3D to 2D Gaussians and refining the depth rendering process.
- Steps & Procedures:
4.1. Gaussian Surfel-based Representation
Instead of 3D Gaussians, the scene is represented by a set of 2D Gaussian surfels. A surfel is a flat disk defined by a center point $\boldsymbol{\mu}$, two tangent vectors defining its plane, an opacity $o$, and a color $\mathbf{c}$. A point in the surfel's plane is given by:

$$\mathbf{x}(u, v) = \mathbf{H}\,(u, v, 1)^{\top} + \boldsymbol{\mu}, \qquad \mathbf{H} = \mathbf{R}\,\operatorname{diag}(s_u, s_v, 0)$$

- $(u, v, 1)^{\top}$: Homogeneous coordinates of the point in the surfel's 2D plane.
- $\mathbf{H}$: A transformation matrix representing the geometry (orientation and scale) of the 2D plane in 3D space. It is composed of a rotation $\mathbf{R}$ and a scaling $\operatorname{diag}(s_u, s_v, 0)$ (with scale factor 0 in the third dimension to ensure flatness).
- $\boldsymbol{\mu}$: The 3D center of the surfel.
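A quick numeric instance of this parameterization (a hypothetical example, not taken from the paper): because the third scale factor is zero, every local coordinate $(u, v)$ is mapped onto a flat disk.

```python
import numpy as np

R = np.eye(3)                              # surfel orientation (rotation)
H = R @ np.diag([0.05, 0.05, 0.0])         # scales s_u, s_v; 0 flattens the third axis
mu = np.array([1.0, 2.0, 3.0])             # surfel center

x = H @ np.array([0.5, -0.5, 1.0]) + mu    # a 3D point in the surfel's plane
# The zero third scale guarantees x always lies in the plane spanned by the tangents.
```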
Rendering is done via alpha blending. For a given camera ray, the intersections with all surfels are found and sorted by depth. The final color for the ray is a weighted sum of the surfel colors (see the sketch after this list):

$$\mathbf{C} = \sum_{i=1}^{N} w_i\,\mathbf{c}_i, \qquad w_i = \alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right), \qquad \alpha_i = o_i\,\mathcal{G}_i(\mathbf{u}_i)$$

- $\mathcal{G}_i(\mathbf{u}_i)$: The 2D Gaussian function evaluated at the intersection point.
- $o_i$: The opacity of the $i$-th surfel.
- $\alpha_i$: The alpha value (opacity contribution) of the $i$-th surfel for that ray.
- $w_i$: The final blending weight of the $i$-th surfel.
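The sketch below, reusing the hypothetical `Surfel` class from Section 3, illustrates this per-ray procedure: intersect the ray with each surfel plane (this exact intersection is the "unbiased depth" used in Section 4.2), evaluate the 2D Gaussian at the intersection, sort by depth, and alpha-blend front to back. It is a readable reference under assumed conventions, not the paper's CUDA rasterizer.

```python
import numpy as np

def surfel_alpha_and_depth(origin, direction, s):
    """Alpha contribution and intersection depth of surfel s for one ray."""
    n = s.normal
    denom = direction @ n
    if abs(denom) < 1e-8:                  # ray (nearly) parallel to the disk plane
        return 0.0, np.inf
    t = ((s.center - origin) @ n) / denom  # exact ray-plane intersection depth
    if t <= 0:
        return 0.0, np.inf
    p = origin + t * direction - s.center  # offset of the hit inside the disk plane
    u = (p @ s.tangent_u) / s.scale[0]     # local coordinates, scale-normalized
    v = (p @ s.tangent_v) / s.scale[1]
    g = np.exp(-0.5 * (u * u + v * v))     # 2D Gaussian value at the intersection
    return s.opacity * g, t

def render_ray(origin, direction, surfels):
    """Front-to-back alpha blending of surfel colors along one ray."""
    hits = sorted((surfel_alpha_and_depth(origin, direction, s) + (s,)
                   for s in surfels), key=lambda h: h[1])
    color, transmittance = np.zeros(3), 1.0
    for alpha, t, s in hits:
        if not np.isfinite(t) or alpha <= 0.0:
            continue
        w = transmittance * alpha          # blending weight w_i
        color += w * s.color
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:           # early termination once nearly opaque
            break
    return color
```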
4.2. Surface-aware Depth Rendering
This is a key contribution to fix geometry distortions, as illustrated in Figure 2.

Figure 2: This figure highlights two key challenges in Gaussian-based tracking. (a1) shows how the depth of a 3D Gaussian can appear different from varying viewpoints. (a2) illustrates how a distant surface (floor) can interfere with the depth rendering of a closer surface (chair), causing "ill-blended depth". (b) shows how occluded regions can have high opacity and cause misalignment during tracking.

The method has three components (a sketch of the per-ray procedure follows this list):
- Unbiased Depth: It computes depth by finding the direct geometric intersection of a camera ray with the 2D surfel plane. This is more geometrically accurate than the depth approximation used in 3DGS.
- Depth Adjustment: To prevent background surfels from distorting the depth of foreground surfaces, it adjusts the depth contribution of each surfel. It first finds the median depth $d_m$, which is the depth of the surfel where the accumulated opacity along the ray first exceeds 0.5. For any surfel behind this median surface, its depth $d_i$ is adjusted towards $d_m$:

$$\hat{d}_i = w\,d_i + (1 - w)\,d_m, \qquad w = \exp\!\left(-\frac{(d_i - d_m)^2}{\beta\,\sigma^2}\right)$$

The weight $w$ decreases as the distance between $d_i$ and $d_m$ increases, effectively down-weighting the influence of far-away surfels.
  - $\hat{d}_i$: The adjusted depth for the $i$-th surfel.
  - $\sigma^2$: The variance of depths of surfels considered so far, which helps adapt the adjustment to the local geometry.
  - $\beta$: A hyperparameter controlling the sensitivity.
- Depth Normalization: The final rendered depth for a ray, $D = \sum_i w_i \hat{d}_i / \sum_i w_i$, is normalized by the total accumulated opacity $\sum_i w_i$ to prevent underestimation in semi-transparent areas.
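Below is a sketch of how the three components could combine for one ray, assuming sorted per-surfel alphas and intersection depths as inputs (at least one intersection). The interpolation kernel follows the formula above; the paper's exact implementation and hyperparameter values may differ.

```python
import numpy as np

def surface_aware_depth(alphas, depths, beta=1.0, eps=1e-6):
    """Surface-aware depth for one ray; alphas/depths sorted front-to-back."""
    alphas, depths = np.asarray(alphas), np.asarray(depths, dtype=float)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = transmittance * alphas          # blending weights w_i
    acc = np.cumsum(weights)                  # accumulated opacity along the ray
    # Median depth d_m: first surfel where accumulated opacity exceeds 0.5.
    idx = min(int(np.searchsorted(acc, 0.5)), len(depths) - 1)
    d_m = depths[idx]
    var = np.var(depths[: idx + 1]) + eps     # local depth variance sigma^2
    # Depth adjustment: pull surfels behind the median surface toward d_m.
    adjusted = depths.copy()
    behind = np.arange(len(depths)) > idx
    w = np.exp(-((depths[behind] - d_m) ** 2) / (beta * var))
    adjusted[behind] = w * depths[behind] + (1.0 - w) * d_m
    # Depth normalization: divide by total accumulated opacity so that
    # semi-transparent regions do not underestimate the surface depth.
    return float((weights * adjusted).sum() / max(acc[-1], eps))
```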
4.3. Camera Tracking
The system uses a frame-to-model approach, optimizing the camera pose to align the input RGB-D frame with the rendered image from the current map. The loss function minimizes the difference between the rendered color/depth and the ground truth (GT) from the camera, but only in well-reconstructed regions (where the accumulated opacity exceeds a threshold):

$$\mathcal{L}_{\text{track}} = \lambda\,\mathcal{L}_1(D, \hat{D}) + (1 - \lambda)\,\mathcal{L}_1(I, \hat{I})$$

- $D, I$: Rendered depth and color.
- $\hat{D}, \hat{I}$: Ground-truth depth and color from the input frame.
- $\mathcal{L}_1$: The L1 loss (mean absolute error).
- $\lambda$: A weighting factor.
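A minimal PyTorch-style sketch of this masked objective; the opacity threshold and weighting value here are illustrative placeholders, not the paper's reported settings. During tracking, only the camera-pose parameters would receive gradients.

```python
import torch

def tracking_loss(render_depth, render_color, gt_depth, gt_color,
                  acc_opacity, lam=0.5, opacity_thresh=0.99):
    """L1 tracking loss over well-reconstructed pixels only (sketch)."""
    # Supervise only pixels the map explains well and that have valid sensor depth.
    mask = (acc_opacity > opacity_thresh) & (gt_depth > 0)            # (H, W)
    loss_depth = torch.abs(render_depth - gt_depth)[mask].mean()
    loss_color = torch.abs(render_color - gt_color)[:, mask].mean()   # color: (3, H, W)
    return lam * loss_depth + (1.0 - lam) * loss_color
```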
4.4. Incremental Mapping
- Initialization (Surfel Attachment): New Gaussian surfels are added in regions where the map is incomplete (accumulated opacity < 0.6). Their initial position, scale, and orientation are derived directly from the input depth map and its normals. This is more efficient than the clone/split strategy of 3DGS. In areas without GT depth but with partial reconstruction, a process called Edge Growth uses the rendered depth to initialize new Gaussians, helping to fill holes. A sketch of the attachment step follows this list.
- Optimization: The parameters of the Gaussians (position, scale, color, opacity) are optimized using a loss function that includes color and depth reconstruction terms, plus a regularization term that encourages surfels along a ray to cluster around the median depth, reducing depth uncertainty.
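The following sketch shows how surfel attachment could be seeded from the depth map and its normals, as described above. Everything here (the sampling stride, the pixel-footprint heuristic) is a simplified reading of the text rather than the released implementation, and the Edge Growth step is omitted.

```python
import numpy as np

def attach_surfels(depth, normals, K, acc_opacity, stride=4):
    """Seed new surfels where the rendered map is incomplete (sketch)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    new_surfels = []
    H, W = depth.shape
    for v in range(0, H, stride):
        for u in range(0, W, stride):
            # Skip well-covered regions (accumulated opacity >= 0.6) and holes.
            if acc_opacity[v, u] >= 0.6 or depth[v, u] <= 0:
                continue
            z = depth[v, u]
            center = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
            n = normals[v, u] / np.linalg.norm(normals[v, u])
            # Tangents spanning the plane perpendicular to the surface normal.
            helper = np.array([1.0, 0, 0]) if abs(n[0]) < 0.9 else np.array([0, 1.0, 0])
            t_u = np.cross(n, helper); t_u /= np.linalg.norm(t_u)
            t_v = np.cross(n, t_u)
            # Scale proportional to the pixel footprint so neighbors overlap.
            s = z / fx * stride
            new_surfels.append((center, t_u, t_v, np.array([s, s])))
    return new_surfels
```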
4.5. GauS-SLAM System
- Front-end: Operates on a local map. It performs tracking and decides when to create a new keyframe (KF). When the local map grows too large (the number of Gaussians exceeds a threshold), it is sent to the back-end, and a new local map is initialized. This keeps the front-end fast (a schematic sketch follows this subsection).
- Back-end: Manages the global map. It receives local maps, merges them, and performs global optimization.
- Merging & Pruning: Merges the local Gaussian map into the global one and prunes redundant or low-opacity Gaussians.
- Bundle Adjustment (BA): Optimizes the poses of keyframes and the global map to reduce accumulated trajectory drift.
- Random Optimization & Final Refinement: When idle, the back-end randomly selects frames to continue refining the global map, which helps combat catastrophic forgetting and improves overall rendering quality.
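Schematically, the front-end loop might look like the sketch below. The helper names (`track`, `is_keyframe`, `map_update`) and the Gaussian-count threshold are hypothetical stand-ins for the components described above.

```python
def frontend_step(frame, local_map, backend, max_gaussians=200_000):
    """One front-end iteration of the local-map design (illustrative)."""
    pose = track(frame, local_map)                     # frame-to-model tracking
    if is_keyframe(frame, pose, local_map):
        map_update(local_map, frame, pose)             # attach + optimize surfels
    if local_map.num_gaussians > max_gaussians:        # local map grew too large
        backend.submit(local_map)                      # merge, prune, BA globally
        local_map = local_map.reset_from(frame, pose)  # start a fresh local map
    return pose, local_map
```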
5. Experimental Setup
- Datasets:
- Replica: A high-quality synthetic dataset of indoor environments with photorealistic renderings and perfect ground truth data. Ideal for evaluating reconstruction quality.
- TUM-RGBD: A real-world dataset for benchmarking RGB-D SLAM. It features challenges like motion blur and exposure changes.
- ScanNet & ScanNet++: Large-scale real-world datasets of indoor scenes captured with a commodity depth sensor. ScanNet++ provides higher-fidelity data.
- Evaluation Metrics:
- Absolute Trajectory Error (ATE-RMSE):
  - Conceptual Definition: Measures the global consistency of the estimated camera trajectory. It aligns the estimated trajectory with the ground-truth trajectory and calculates the Root Mean Square Error (RMSE) of the distances between corresponding camera positions. A lower value is better.
  - Mathematical Formula:
$$\text{ATE-RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\operatorname{trans}\!\left(\mathbf{S}\,\hat{\mathbf{T}}_i\right) - \operatorname{trans}\!\left(\mathbf{T}_i\right)\right\|^2}$$
  - Symbol Explanation: $N$ is the number of camera poses, $\hat{\mathbf{T}}_i$ is the estimated pose, $\mathbf{T}_i$ is the ground-truth pose, $\mathbf{S}$ is the alignment transformation (found via optimization), and $\operatorname{trans}(\cdot)$ extracts the translation part of a pose.
- Peak Signal-to-Noise Ratio (PSNR):
  - Conceptual Definition: Measures the quality of a reconstructed image by comparing it to a ground-truth image. It is based on the Mean Squared Error (MSE). A higher PSNR value (in decibels, dB) indicates better rendering quality.
  - Mathematical Formula:
$$\text{PSNR} = 10\,\log_{10}\!\left(\frac{\text{MAX}^2}{\text{MSE}}\right)$$
  - Symbol Explanation: $\text{MAX}$ is the maximum possible pixel value (e.g., 255 for 8-bit images), and $\text{MSE}$ is the mean squared error between the rendered and ground-truth images.
- Structural Similarity Index (SSIM):
  - Conceptual Definition: Measures image quality by comparing structural information, luminance, and contrast, which is often better aligned with human perception than PSNR. The value ranges from -1 to 1, with 1 indicating a perfect match.
  - Mathematical Formula: For two image windows $x$ and $y$:
$$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
  - Symbol Explanation: $\mu_x, \mu_y$ are the means; $\sigma_x^2, \sigma_y^2$ are the variances; $\sigma_{xy}$ is the covariance; $c_1, c_2$ are small constants for stability.
- Learned Perceptual Image Patch Similarity (LPIPS):
- Conceptual Definition: A metric that measures the perceptual similarity between two images using features from a deep neural network (e.g., VGG). It is considered to be very close to human judgment of image similarity. A lower value is better.
- Depth L1:
- Conceptual Definition: The mean absolute error (L1 norm) between the rendered depth map and the ground-truth depth map. It directly measures the geometric accuracy of the reconstruction. A lower value is better.
- F1-Score:
  - Conceptual Definition: Used here to evaluate the geometric reconstruction quality by comparing the reconstructed mesh to the ground-truth mesh. It is the harmonic mean $F_1 = 2PR/(P + R)$ of precision $P$ (the fraction of the reconstructed surface that is close to the ground truth) and recall $R$ (the fraction of the ground-truth surface that is successfully reconstructed). A higher value is better.
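As a companion to the metric definitions above, here is a compact sketch of how ATE-RMSE (with a Kabsch-style rigid alignment) and PSNR are typically computed. Benchmark evaluation scripts may differ in details such as scale handling and pose association.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """ATE-RMSE between aligned camera positions; both inputs shaped (N, 3)."""
    est_c = est_xyz - est_xyz.mean(axis=0)
    gt_c = gt_xyz - gt_xyz.mean(axis=0)
    # Kabsch: best rotation mapping estimated onto ground-truth positions.
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = est_c @ R.T + gt_xyz.mean(axis=0)
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))

def psnr(img, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((img - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```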
- Baselines: The paper compares GauS-SLAM against a wide range of state-of-the-art methods, including:
  - NeRF-based: ESLAM, NICE-SLAM
  - Point-based: Point-SLAM
  - 3DGS-based (coupled): SplaTAM, Gaussian-SLAM, MonoGS
  - 3DGS-based (decoupled): GS-ICP
  - Traditional: ORB-SLAM2
6. Results & Analysis
GauS-SLAM demonstrates superior performance across tracking, rendering, and reconstruction, especially on high-quality datasets.
- Core Results:
Figure 3: A summary of GauS-SLAM's performance. The left panel shows the highly accurate camera trajectory (ATE-RMSE of 0.42 cm) compared to a state-of-the-art method (SplaTAM, 1.91 cm). The middle panels show superior rendering quality (PSNR) and geometric accuracy (Depth L1). The right panel is a scatter plot showing that GauS-SLAM (the star symbol) achieves a better trade-off between tracking accuracy (ATE, x-axis) and rendering quality (PSNR, y-axis) than many other methods on the Replica dataset.

Tracking Performance: As shown in Table 1 (transcribed below),
GauS-SLAM achieves an ATE-RMSE of 0.06 cm on the Replica dataset, significantly outperforming all baselines, including SplaTAM (0.36 cm) and GS-ICP (0.16 cm).

(Manual transcription of Table 1 from the paper)

| Method | PSNR [dB]↑ | SSIM↑ | LPIPS↓ | ATE [cm]↓ | Depth L1 [cm]↓ | F1-Score [%]↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ESLAM [11] | 27.80 | 0.921 | 0.245 | 0.63 | 2.08 | 78.2 |
| Point-SLAM [20] | 35.17 | 0.975 | 0.124 | 0.52 | 0.44 | 89.7 |
| MonoGS [14] | 37.50 | 0.960 | 0.070 | 0.58 | 0.95 | 78.6 |
| SplaTAM [12] | 34.11 | 0.978 | 0.104 | 0.36 | 0.72 | 86.1 |
| Gaussian-SLAM [35] | 42.08 | 0.996 | 0.018 | 0.31 | 0.68 | 88.9 |
| GS-ICP [6] | 38.83 | 0.975 | 0.041 | 0.16 | - | - |
| GauS-SLAM (Ours) | 40.25 | 0.991 | 0.027 | 0.06 | 0.43 | 90.5 |

On more challenging real-world datasets like TUM-RGBD and ScanNet (Table 2), it remains competitive. On the high-quality ScanNet++ dataset, it again achieves state-of-the-art, millimeter-level accuracy.
(Manual transcription of Table 2 from the paper; values are ATE-RMSE [cm]↓)

| Method | TUM-RGBD [24] fr2 | TUM-RGBD [24] fr3 | ScanNet [4] 0059 | ScanNet [4] 0169 | ScanNet++ [33] S1 | ScanNet++ [33] S2 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ORB-SLAM2 [16] | 0.40 | 1.00 | 14.25 | 8.72 | X | X |
| Point-SLAM [20] | 1.31 | 3.48 | 7.81 | 22.16 | X | X |
| MonoGS* [14] | 1.77 | 1.49 | 32.10 | 10.70 | 7.00 | 3.66 |
| SplaTAM [12] | 1.24 | 5.16 | 10.10 | 12.10 | 1.91 | 0.61 |
| Gaussian-SLAM* [35] | 1.39 | 5.31 | 12.80 | 16.30 | 1.37 | 2.28 |
| LoopSplat* [40] | - | 1.30 | 3.53 | 7.10 | 10.60 | - |
| GauS-SLAM (Ours) | 1.34 | 1.46 | 7.14 | 7.45 | 0.42 | 0.47 |
Rendering and Reconstruction Performance: From Table 1, GauS-SLAM achieves a PSNR of 40.25 dB, surpassing SplaTAM by over 6 dB. Its geometric accuracy is also top-tier, with the lowest Depth L1 error (0.43 cm) and highest F1-Score (90.5%). Figure 5 (labeled 4.jpg in source) shows that using 2D Gaussian surfels results in much smoother and more realistic mesh surfaces compared to methods using isotropic 3D Gaussians.
Figure 4: A visual comparison of mesh quality on the Replica dataset. GauS-SLAM (right) produces significantly smoother surfaces, especially on walls and furniture, compared to MonoGS (left) and SplaTAM (middle), which show bumpy, uneven artifacts.

Geometry Consistency: The experiments in Figure 6 (labeled 6.jpg in source) demonstrate the effectiveness of the Surface-aware Depth Rendering. While standard 2DGS improves over 3DGS, it still shows errors at object boundaries. GauS-SLAM's method successfully cleans up these errors, leading to better geometric consistency across views.
Figure 5: Depth rendering error maps. The proposed GauS-SLAM method (right) shows significantly lower error (darker blue) compared to standard 3DGS (left) and 2DGS (middle), especially around object edges, confirming its superior geometry consistency.

Runtime Efficiency: The local map design is crucial for efficiency. Table 3 and Figure 8 (labeled 8.jpg in source) show that GauS-SLAM is significantly faster than SplaTAM. Its tracking and mapping times remain stable as the scene grows, whereas SplaTAM's performance degrades.
Figure 6: A comparison of runtime efficiency. GauS-SLAM's mapping and tracking times (green lines) remain low and constant over time. In contrast, SplaTAM's times (blue lines) increase as more Gaussians are added to the map, demonstrating the efficiency benefit of GauS-SLAM's local map approach.
- Ablation Studies: The ablation studies systematically validate the contribution of each new component.
(Manual transcription of Table 4 from the paper)
| Methods | Geo. Con [mm]↓ | ATE [mm]↓ | PSNR [dB]↑ |
| :--- | :--- | :--- | :--- |
| A. w/o Unbiased Depth | 1.94 | 2.10 | 36.06 |
| B. w/o Depth Adjustment | 1.75 | 0.85 | 38.10 |
| C. w/o Depth Norm. | 2.51 | 1.92 | 35.98 |
| D. w/o Regulation Loss | 1.01 | 0.63 | 38.25 |
| Full Model | 1.01 | 0.60 | 38.04 |
Depth Rendering Ablation (Table 4): Removing
Unbiased Depth(i.e., using 3D Gaussians) orDepth Normalizationseverely degrades both tracking accuracy (ATE) and rendering quality (PSNR). RemovingDepth Adjustmentalso hurts tracking performance. This confirms that all parts of the proposed rendering pipeline are crucial.(Manual transcription of Table 5 from the paper)
Methods Sequence ATE [mm]↓ PSNR [dB]↑ Time [s]↓ E. w/o Keyframe Room 0 0.52 38.28 2.13 fr3/office 14.53 25.03 1.72 F. w/o LocalMap Room 0 0.49 38.25 6.77 fr3/office 52.91 24.16 5.58 G. w/o Random Optimization Room 0 0.70 37.78 1.63 fr3/office 14.37 25.03 1.62 H. w/o Final Refinement Room 0 0.54 37.48 1.73 fr3/office 14.30 24.34 1.63 Full Model Room 0 0.60 38.04 1.73 fr3/office 14.29 25.06 1.62 -
SLAM Components Ablation (Table 5):
- SLAM Components Ablation (Table 5):
G. w/o Random Optimization) reduces tracking accuracy. - Removing final refinement (
H. w/o Final Refinement) slightly reduces the final rendering quality (PSNR).
- Removing the local map (
-
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies and addresses two fundamental problems in coupled Gaussian-based SLAM: multi-view geometric distortion and misalignment caused by occlusions. By introducing a system, GauS-SLAM, built on 2D Gaussian surfels, a novel Surface-aware Depth Rendering mechanism, and a robust local map design, the authors achieve a new state-of-the-art in tracking accuracy on several key benchmarks while also producing high-fidelity, geometrically consistent maps. The work underscores the importance of geometric accuracy in the underlying scene representation for achieving robust camera tracking.
- Limitations & Future Work: The authors acknowledge that GauS-SLAM, like many similar methods, is sensitive to real-world camera artifacts such as motion blur and significant exposure variations. This explains its relatively less dominant performance on datasets like TUM-RGBD and ScanNet, which are rich in these challenges. Future work will focus on improving the system's robustness to these factors that cause multi-view inconsistency.
- Personal Insights & Critique:
  - Novelty: The primary novelty lies in the careful adaptation of 2D Gaussian surfels for a coupled SLAM system and the specific formulation of Surface-aware Depth Rendering. While 2D surfels existed, their integration into a full SLAM pipeline to explicitly solve the geometry-for-tracking problem is a strong contribution. The local map strategy is a very practical and effective engineering solution to a well-known problem in large-scale SLAM.
  - Impact: This paper provides a solid blueprint for future dense SLAM systems. It demonstrates that moving beyond the standard 3D Gaussian primitive can yield significant gains in geometric fidelity, which directly translates to better localization. The focus on improving the core representation's consistency is a valuable direction for the field.
  - Open Questions: The system still relies on high-quality depth data from an RGB-D camera. Its performance in a monocular (RGB-only) setting is an open question. Furthermore, while the local map improves efficiency, the system does not seem to include explicit loop closure detection and correction, which is critical for correcting drift in very large-scale, long-term mapping. Although BA is performed, it is on co-visible submaps, not necessarily large loops. The comparison to methods like LoopSplat suggests this is an area for future improvement.