GI-SLAM: Gaussian-Inertial SLAM
TL;DR Summary
GI-SLAM proposes a novel 3D Gaussian Splatting SLAM system that integrates IMU data through an IMU-enhanced tracking module and a specialized IMU loss. This method significantly boosts camera tracking accuracy, robustness, and efficiency by seamlessly embedding inertial information into the 3DGS optimization framework.
Abstract
3D Gaussian Splatting (3DGS) has recently emerged as a powerful representation of geometry and appearance for dense Simultaneous Localization and Mapping (SLAM). Through rapid, differentiable rasterization of 3D Gaussians, many 3DGS SLAM methods achieve near real-time rendering and accelerated training. However, these methods largely overlook inertial data, which is a critical piece of information collected from the inertial measurement unit (IMU). In this paper, we present GI-SLAM, a novel Gaussian-Inertial SLAM system which consists of an IMU-enhanced camera tracking module and a realistic 3D Gaussian-based scene representation for mapping. Our method introduces an IMU loss that seamlessly integrates into the deep learning framework underpinning 3D Gaussian Splatting SLAM, effectively enhancing the accuracy, robustness and efficiency of camera tracking. Moreover, our SLAM system supports a wide range of sensor configurations, including monocular, stereo, and RGBD cameras, both with and without IMU integration. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the EuRoC and TUM-RGBD datasets.
In-depth Reading
1. Bibliographic Information
- Title: GI-SLAM: Gaussian-Inertial SLAM
- Authors: Xulang Liu, Ning Tan
- Affiliations: Sun Yat-sen University, Guangzhou, China.
- Journal/Conference: The paper does not explicitly state its publication venue, but given its structure, content, and references to papers from CVPR 2024, it is formatted for a top-tier computer vision or robotics conference.
- Publication Year: 2024
- Abstract: The paper introduces GI-SLAM, a novel dense Simultaneous Localization and Mapping (SLAM) system that leverages 3D Gaussian Splatting (3DGS) for scene representation. Unlike previous 3DGS SLAM methods, GI-SLAM integrates inertial data from an Inertial Measurement Unit (IMU) to enhance performance. The core contributions are an IMU-enhanced camera tracking module, driven by a novel IMU loss function, and a high-fidelity 3D Gaussian mapping component. The system is versatile, supporting monocular, stereo, and RGBD cameras, with or without an IMU. The authors report competitive performance on the standard EuRoC and TUM-RGBD datasets, demonstrating improvements in tracking accuracy, robustness, and efficiency.
- Original Source Link: /files/papers/68e0abc29cc40dff7dd2bb24/paper.pdf. The paper appears to be a preprint or a final version submitted for publication.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Traditional dense visual SLAM methods struggle with photorealism, while modern neural approaches like NeRF-based SLAM are computationally expensive and slow. 3D Gaussian Splatting (3DGS) has emerged as a solution, offering both high-fidelity rendering and real-time performance.
- Existing Gap: Current state-of-the-art 3DGS SLAM systems are purely visual. They ignore the rich motion information provided by Inertial Measurement Units (IMUs), which are ubiquitous in modern robots and mobile devices. This omission makes them vulnerable to tracking failures during rapid movements, in texture-less environments, or when motion blur occurs.
- Innovation: GI-SLAM is the first system designed to bridge this gap by seamlessly fusing IMU data into a 3DGS SLAM framework. The core idea is to use inertial measurements as a powerful physical constraint to regularize and improve the camera pose optimization process.
- Main Contributions / Findings (What):
- A Novel Gaussian-Inertial SLAM System (GI-SLAM): The paper proposes a complete SLAM system that combines a 3DGS-based map representation with an IMU-enhanced tracking module. The system is flexible and supports a wide array of sensor configurations.
- An Integrated IMU Loss Function: A novel loss term is introduced that directly penalizes inconsistencies between the optimized camera motion and the motion measured by the IMU. This loss integrates cleanly into the gradient-based optimization pipeline of 3DGS, improving tracking accuracy and robustness.
- A Motion-Aware Keyframe Selection Strategy: The system employs a sophisticated keyframing strategy that considers not only visual overlap (Gaussian visibility) but also physical motion constraints from the IMU. This prevents the selection of motion-blurred frames, leading to higher-quality and more accurate map reconstructions.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- SLAM (Simultaneous Localization and Mapping): The fundamental problem in robotics where an agent, without prior knowledge of its surroundings, must construct a map of the environment while simultaneously determining its own position within that map.
- Dense SLAM: A variant of SLAM that aims to reconstruct a detailed, dense 3D model of the scene, as opposed to a sparse set of landmarks. This is crucial for applications requiring rich environmental understanding, like AR/VR and navigation.
- IMU (Inertial Measurement Unit): An electronic device that measures and reports a body's specific force, angular rate, and orientation using a combination of accelerometers and gyroscopes. In SLAM, IMUs provide high-frequency motion estimates that are independent of the visual environment, making them excellent for complementing cameras.
- NeRF (Neural Radiance Fields): A technique for representing a 3D scene with a neural network. This network learns to map a 3D location and a 2D viewing direction to the color and density at that point. NeRFs can generate stunningly photorealistic images but are notoriously slow to train and render.
- 3DGS (3D Gaussian Splatting): A more recent scene representation that models the world as a collection of 3D Gaussians (ellipsoids with color and opacity). 3DGS achieves rendering quality on par with or better than NeRFs but is significantly faster to train and can be rendered in real-time, making it highly suitable for SLAM.
- Technological Evolution:
- Classical Dense SLAM: Methods like KinectFusion used voxel grids (TSDFs) to store geometry, while others like ElasticFusion used surfels. These were geometrically accurate but lacked photorealism.
- NeRF-based SLAM: Systems like iMAP and NICE-SLAM replaced classical representations with NeRFs, achieving photorealistic mapping. However, their reliance on large neural networks (MLPs) led to slow performance and scalability issues.
- 3DGS-based SLAM: Recent works like MonoGS and SplaTAM capitalized on the speed of 3DGS to create real-time, photorealistic SLAM systems. This is the current state-of-the-art.
- Differentiation: GI-SLAM advances the state-of-the-art in 3DGS-based SLAM. While previous works were purely vision-based, GI-SLAM is the first to effectively integrate inertial data. This fusion makes the system more robust to the common failure modes of visual SLAM, such as fast motion and texture-less scenes, thereby pushing the boundaries of what is achievable in real-world scenarios.
4. Methodology (Core Technology & Implementation)
The GI-SLAM system is composed of three interconnected modules: localization, mapping, and keyframing, as illustrated in the system overview diagram.

Figure 1. SLAM system overview. This figure shows the data flow in GI-SLAM. Multi-sensor inputs (RGB, depth, IMU) are fed into two parallel processes. The Camera Pose Estimation (Localization) module uses the inputs and rendered data from the existing map to track the camera's current pose. The Keyframe Selection module decides if the current frame is valuable enough to be used for updating the Mapping component, which reconstructs the scene using 3D Gaussians.
4.1. Localization
The goal of the localization module is to estimate the camera's current pose. The system represents the camera pose as a rigid-body transformation matrix $T \in \mathrm{SE}(3)$. It estimates the current pose incrementally from the previous pose $T_{prev}$ by optimizing a small transformation update $\Delta T$.
The optimization minimizes a composite loss function that aligns the current sensor measurements with the rendered view from the existing 3D Gaussian map.
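To make the incremental update concrete, here is a minimal PyTorch-style sketch of composing the previous pose with a small optimizable increment; the 6-vector parameterization, the first-order exponential map, and the variable names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def se3_exp_first_order(xi):
    """Map a small 6-vector xi = (rho, theta) to a 4x4 transform using the
    first-order approximation exp(xi^) ~ I + xi^ (sketch only; adequate for
    the small per-frame update optimized during tracking)."""
    rho, theta = xi[:3], xi[3:]
    T = torch.eye(4, dtype=xi.dtype)
    # skew-symmetric matrix of theta fills the rotation block
    T[0, 1], T[0, 2] = -theta[2], theta[1]
    T[1, 0], T[1, 2] = theta[2], -theta[0]
    T[2, 0], T[2, 1] = -theta[1], theta[0]
    T[:3, 3] = rho
    return T

# Tracking step (sketch): keep the previous pose fixed and optimize only the
# small increment delta_xi by gradient descent on the composite loss below.
T_prev = torch.eye(4)
delta_xi = torch.zeros(6, requires_grad=True)
T_curr = T_prev @ se3_exp_first_order(delta_xi)   # pose fed to the renderer
```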
- RGB Loss: This loss ensures photometric consistency. It measures the difference between the rendered RGB image $R(\mathcal{G}, T)$ and the ground-truth input image $I$; an L1 norm is used for robustness: $\mathcal{L}_{rgb} = \lVert R(\mathcal{G}, T) - I \rVert_1$.
- $R(\cdot)$: The differentiable rendering function.
- $\mathcal{G}$: The set of 3D Gaussians representing the map.
- $T$: The current camera pose being optimized.
- $I$: The observed RGB image from the camera.
- Depth Loss: When depth data is available (from an RGBD or stereo camera), this loss enforces geometric consistency. It is the L1 difference between the rendered depth map $\hat{D}$ and the observed depth map $D$: $\mathcal{L}_{depth} = \lVert \hat{D} - D \rVert_1$.
- IMU Loss (Core Contribution): This novel loss regularizes the pose optimization using physical motion measurements from the IMU. It consists of two parts:
- Translational Loss: The relative position change $\Delta p_{imu}$ is predicted by integrating the IMU's linear acceleration $a$ over the time step $\Delta t$. The loss is the squared L2 norm between this IMU-predicted displacement and the optimized displacement $\Delta p$: $\mathcal{L}_{trans} = \lVert \Delta p_{imu} - \Delta p \rVert_2^2$.
- Rotational Loss: The relative rotation change $\Delta \theta_{imu}$ is predicted by integrating the angular velocity $\omega$. The loss penalizes the difference between the IMU-predicted rotation and the optimized rotation $\Delta \theta$ (in axis-angle form): $\mathcal{L}_{rot} = \lVert \Delta \theta_{imu} - \Delta \theta \rVert_2^2$.
- Combined IMU Loss: The final IMU loss is a weighted sum of the translational and rotational components: $\mathcal{L}_{imu} = \lambda_{t} \mathcal{L}_{trans} + \lambda_{r} \mathcal{L}_{rot}$.
- $\lambda_{t}, \lambda_{r}$: Hyperparameters to balance the two constraints.
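Below is a minimal PyTorch-style sketch of how such an IMU consistency loss could look. The naive integration (no gravity compensation or bias modeling, which the authors list as a limitation), the tensor shapes, and the helper names are assumptions for illustration, not the authors' exact formulation.

```python
import torch

def rotmat_to_axis_angle(R, eps=1e-8):
    """Convert a rotation matrix to an axis-angle vector (SO(3) log map)."""
    cos_theta = torch.clamp((torch.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = torch.acos(cos_theta)
    axis = torch.stack([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * torch.sin(theta) + eps)

def imu_loss(dp_opt, dR_opt, accel, gyro, dt, lambda_t=1.0, lambda_r=1.0):
    """Hypothetical IMU consistency loss between the optimized relative motion
    (dp_opt: (3,), dR_opt: (3,3)) and the motion integrated from IMU samples
    (accel, gyro: (N,3) at sample period dt seconds)."""
    # Naive double integration of acceleration -> predicted displacement.
    vel = torch.cumsum(accel * dt, dim=0)
    dp_imu = torch.sum(vel * dt, dim=0)

    # Integrate angular velocity -> predicted rotation in axis-angle form.
    dtheta_imu = torch.sum(gyro * dt, dim=0)
    dtheta_opt = rotmat_to_axis_angle(dR_opt)

    loss_trans = torch.sum((dp_imu - dp_opt) ** 2)        # squared L2, translation
    loss_rot = torch.sum((dtheta_imu - dtheta_opt) ** 2)  # squared L2, rotation
    return lambda_t * loss_trans + lambda_r * loss_rot
```

During tracking, this term would simply be added to the photometric and depth losses, so gradients from the inertial constraint flow into the same pose parameters.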
4.2. Mapping
The mapping module builds and refines a 3D representation of the environment using 3D Gaussians.
- 3D Gaussian Scene Representation: The scene is modeled as a set of anisotropic 3D Gaussians $\mathcal{G} = \{g_i\}$. Each Gaussian is defined by:
- Position $\mu \in \mathbb{R}^3$
- Covariance $\Sigma$ (shape and orientation)
- Color $c$
- Opacity $\alpha$
The covariance is parameterized by a scaling matrix $S$ and a rotation $R$ (stored as a quaternion) for efficient optimization: $\Sigma = R S S^{\top} R^{\top}$. A minimal sketch of this parameterization and the rendering step appears after this list.
- Differentiable Rendering: To connect the 3D map to 2D images, the 3D Gaussians are projected onto the image plane. The final color and depth for each pixel are computed by alpha-blending the projected Gaussians sorted by depth. This entire rendering pipeline is differentiable, allowing gradients to flow from the pixel losses back to the Gaussian parameters.
- Map Update: The map is updated only at keyframes. The update involves optimizing the Gaussian parameters (position, covariance, color, and opacity) to minimize a composite L1 loss between the rendered and ground-truth images and depth maps. The system also employs an adaptive density control mechanism, cloning Gaussians in areas with high error (under-reconstruction) and pruning them if their opacity is too low (redundant).
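As referenced above, here is a minimal sketch of the standard 3DGS building blocks this section describes: constructing a Gaussian's covariance from its quaternion and scale, and front-to-back alpha-compositing of depth-sorted splats for one pixel. Variable names and the simplified per-pixel compositing loop are illustrative assumptions.

```python
import torch

def covariance_from_quat_scale(q, s):
    """Sigma = R S S^T R^T, with R built from the unit quaternion q = (w, x, y, z)
    and S = diag(s) holding the per-axis scales."""
    w, x, y, z = q / torch.linalg.norm(q)
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    S = torch.diag(s)
    return R @ S @ S.T @ R.T

def composite_pixel(colors, alphas, depths):
    """Alpha-blend the Gaussians covering one pixel, sorted near-to-far.
    colors: (N, 3), alphas: (N,), depths: (N,) -> (pixel color, pixel depth)."""
    order = torch.argsort(depths)
    color, depth, transmittance = torch.zeros(3), torch.tensor(0.0), torch.tensor(1.0)
    for i in order:
        w = transmittance * alphas[i]            # contribution of this splat
        color = color + w * colors[i]
        depth = depth + w * depths[i]
        transmittance = transmittance * (1.0 - alphas[i])
    return color, depth
```

Adaptive density control (cloning Gaussians in high-error regions, pruning low-opacity ones) would operate on these same parameters between map updates.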
4.3. Keyframing
To maintain real-time performance, the system only uses a small window of high-quality, non-redundant frames (keyframes) for map optimization.
- Selection and Management: A frame is selected as a keyframe if its score exceeds a threshold. The score balances three criteria (a scoring sketch follows this list):
- Covisibility Term: Encourages selecting frames that see new parts of the scene. It measures the overlap of visible Gaussians between the current frame and the last keyframe.
- Baseline Term: Promotes a large baseline (distance) between keyframes to improve geometric stability for triangulation.
- Motion Constraint Term: A penalty term that disqualifies frames with excessive linear ($v$) or angular ($\omega$) velocity. This is a core contribution that effectively prevents motion-blurred frames from corrupting the map.
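The following is a small sketch, under assumed names and thresholds, of how such a motion-aware keyframe score could be combined: low covisibility and a large baseline raise the score, while IMU-measured motion beyond a threshold vetoes the frame outright.

```python
def keyframe_score(covis_overlap, baseline, lin_vel, ang_vel,
                   w_covis=1.0, w_base=1.0,
                   max_lin_vel=0.5, max_ang_vel=1.0):
    """Hypothetical keyframe score (higher = more useful as a keyframe).

    covis_overlap:  fraction of visible Gaussians shared with the last keyframe
    baseline:       translation distance to the last keyframe (meters)
    lin_vel/ang_vel: IMU-derived linear (m/s) and angular (rad/s) speeds
    """
    if lin_vel > max_lin_vel or ang_vel > max_ang_vel:
        return 0.0  # likely motion-blurred: never promote to keyframe
    return w_covis * (1.0 - covis_overlap) + w_base * baseline

# A frame becomes a keyframe if keyframe_score(...) exceeds a chosen threshold.
```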
5. Experimental Setup
- Datasets:
- TUM-RGBD: A standard indoor dataset for evaluating SLAM systems. Used for monocular and RGBD configurations. While it only provides accelerometer data, the system is adapted to handle it.
- EuRoC: A dataset collected from a Micro Aerial Vehicle (MAV), containing synchronized stereo camera images and IMU data. Used for evaluating the stereo+IMU setup.
- Evaluation Metrics:
- Tracking Accuracy: Absolute Trajectory Error (ATE) measured via Root Mean Square Error (RMSE) in centimeters; a lower value is better (a small computation sketch follows this list).
- Mapping Quality:
- PSNR (Peak Signal-to-Noise Ratio): Measures image reconstruction quality. Higher is better.
- SSIM (Structural Similarity Index Measure): Measures similarity between two images. Higher is better (max 1.0).
- LPIPS (Learned Perceptual Image Patch Similarity): Measures perceptual similarity using a deep network. Lower is better.
- Baselines: The paper compares against a comprehensive set of state-of-the-art methods, including visual odometry (DROID-VO), NeRF-based SLAM (NICE-SLAM), and other 3DGS-based SLAM systems (MonoGS, SplaTAM).
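For reference, a minimal NumPy sketch of the ATE RMSE computation on already-associated trajectories; the rigid (Umeyama/Horn) alignment that standard evaluation tools perform beforehand is assumed to have been done.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """ATE RMSE over associated, aligned positions (same units as the input;
    multiply by 100 to report centimeters if the poses are in meters).

    est_xyz, gt_xyz: (N, 3) arrays of estimated / ground-truth camera positions.
    """
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)   # per-frame position error
    return float(np.sqrt(np.mean(err ** 2)))
```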
6. Results & Analysis
GI-SLAM demonstrates superior performance across both tracking and mapping tasks.
- Core Results:
- Camera Tracking: As shown in Table 1 (TUM) and Table 5 (EuRoC), GI-SLAM consistently achieves a lower ATE RMSE than all baselines across monocular, RGBD, and stereo configurations. For instance, in the monocular TUM experiments, it reduces the average error from 3.88 cm (MonoGS) to 2.80 cm, an improvement of over 27%.

- Figure 2. Trajectory tracking result. This figure visually confirms the quantitative results. The left and middle panels compare the trajectory of GI-SLAM (Ours, red solid line) against MonoGS and the ground truth (blue dashed line). The close-up views clearly show that GI-SLAM's trajectory adheres more closely to the ground truth, especially in curved sections, where purely visual SLAM often struggles. The right panel showcases the high-quality, dense 3D reconstruction of the scene.
- Rendering Quality: Table 2 shows that GI-SLAM also excels in mapping quality. It achieves the best scores across all photometric metrics (highest PSNR/SSIM, lowest LPIPS) compared to both NeRF-based and 3DGS-based methods. For example, it improves the average PSNR from 23.93 (MonoGS) to 24.55. This is attributed to the motion-aware keyframing, which ensures only sharp, high-quality images are used for building the map.
- Ablations / Parameter Sensitivity: The ablation studies provide crucial insights into the contributions of the novel components.
- Impact of IMU Loss (Table 3): Removing the IMU fusion module significantly degrades tracking accuracy. In the monocular case, the ATE RMSE jumps from 2.63 cm to 3.90 cm. This confirms that the IMU loss is highly effective at stabilizing pose estimation and reducing drift.
- Impact of Motion-Constrained Keyframing (Table 4): Replacing the motion-aware keyframe selection with a conventional one leads to a noticeable drop in rendering quality. The average PSNR decreases from 24.55 to 23.78. This validates the hypothesis that filtering out motion-blurred frames is critical for achieving high-fidelity map reconstruction.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents GI-SLAM, a robust and accurate SLAM system that pioneers the integration of IMU data into a 3DGS framework. By introducing a novel IMU loss and a motion-constrained keyframing strategy, the system achieves state-of-the-art performance in both camera tracking and photorealistic mapping.
- Limitations & Future Work: The authors acknowledge two primary limitations:
- IMU Noise: The current model does not explicitly account for the noise and biases inherent in IMU measurements, which could be addressed with more sophisticated sensor fusion techniques (e.g., pre-integration or factor graph optimization).
- Monocular Scale Ambiguity: The system does not resolve the inherent scale ambiguity present in monocular SLAM, a classic problem in the field.
- Personal Insights & Critique:
- Strengths: The paper's main strength is its elegant and effective fusion of classical robotics principles (inertial navigation) with modern computer graphics (differentiable rendering). The IMU loss is a simple yet powerful concept that integrates seamlessly into the end-to-end optimization framework. The motion-aware keyframing is a practical and impactful contribution that addresses a common pain point in visual SLAM.
- Potential for Improvement: While the current method is effective, future work could explore tighter coupling of the IMU data. Instead of just a loss term, an IMU pre-integration factor, as used in systems like VINS-Mono, could be incorporated into the optimization to better handle biases and provide more robust state estimation over longer periods.
- Overall Impact: GI-SLAM represents a significant step forward for 3DGS-based SLAM. By incorporating inertial data, it makes these systems more practical and reliable for real-world applications on robots and handheld devices, where robust performance under challenging motion is a necessity. This work will likely inspire further research into multi-sensor fusion within the 3DGS SLAM paradigm.