
GRAND-SLAM: Local Optimization for Globally Consistent Large-Scale Multi-Agent Gaussian SLAM

Published: 06/24/2025

TL;DR Summary

GRAND-SLAM extends 3DGS SLAM to large-scale, multi-agent outdoor environments via local submap optimization for tracking and pose-graph optimization over inter- and intra-robot loop closures. It achieves state-of-the-art tracking, 28% higher PSNR, and 91% lower multi-agent tracking error, enabling globally consistent reconstruction of large-scale outdoor environments.

Abstract

3D Gaussian splatting has emerged as an expressive scene representation for RGB-D visual SLAM, but its application to large-scale, multi-agent outdoor environments remains unexplored. Multi-agent Gaussian SLAM is a promising approach to rapid exploration and reconstruction of environments, offering scalable environment representations, but existing approaches are limited to small-scale, indoor environments. To that end, we propose Gaussian Reconstruction via Multi-Agent Dense SLAM, or GRAND-SLAM, a collaborative Gaussian splatting SLAM method that integrates i) an implicit tracking module based on local optimization over submaps and ii) an approach to inter- and intra-robot loop closure integrated into a pose-graph optimization framework. Experiments show that GRAND-SLAM provides state-of-the-art tracking performance and 28% higher PSNR than existing methods on the Replica indoor dataset, as well as 91% lower multi-agent tracking error and improved rendering over existing multi-agent methods on the large-scale, outdoor Kimera-Multi dataset.


In-depth Reading


1. Bibliographic Information

  • Title: GRAND-SLAM: Local Optimization for Globally Consistent Large-Scale Multi-Agent Gaussian SLAM
  • Authors: Annika Thomas, Aneesa Sonawalla, Alex Rose, Jonathan P. How. The authors are likely affiliated with the Massachusetts Institute of Technology (MIT), given Jonathan P. How's position as a professor in the Aerospace Controls Laboratory (ACL) at MIT.
  • Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for a journal or conference, but it represents a complete research work shared with the community.
  • Publication Year: 2025 (as indicated by the source link identifier).
  • Abstract: The paper introduces GRAND-SLAM, a collaborative SLAM system that uses 3D Gaussian splatting (3DGS) for scene representation, specifically designed for large-scale, multi-agent, outdoor environments. The core innovations are a local optimization module for tracking within submaps and a pose-graph optimization framework that incorporates both intra-robot (within a single robot's path) and inter-robot (between different robots) loop closures. The authors report state-of-the-art performance, with a 28% higher Peak Signal-to-Noise Ratio (PSNR) on the indoor Replica dataset and a 91% lower tracking error on the large-scale outdoor Kimera-Multi dataset compared to existing methods.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Existing Simultaneous Localization and Mapping (SLAM) systems that use 3D Gaussian Splatting (3DGS) for high-quality scene reconstruction are largely confined to small-scale, indoor environments. When applied to large areas or over long durations, they suffer from significant drift—an accumulation of small errors in pose estimation that leads to an inaccurate and inconsistent map.
    • Importance: For applications like autonomous driving, collaborative robotics, and large-scale augmented reality, having a globally consistent and photorealistic map is crucial. Multi-agent systems can map large areas much faster, but this introduces the challenge of merging maps from different agents accurately.
    • Gap in Prior Work: Previous multi-agent 3DGS SLAM systems either lacked robust loop closure mechanisms or were not designed to handle the challenges of large-scale outdoor environments (e.g., sensor noise, vast spaces, changing lighting).
    • Innovation: GRAND-SLAM tackles these challenges by introducing a scalable, submap-based architecture. Instead of optimizing the entire map at once, it performs local optimizations on smaller map sections. This is combined with a robust system for detecting and integrating loop closures, both for a single agent re-visiting a location and for different agents observing the same place. This allows the system to correct drift and build a globally consistent map.
  • Main Contributions / Findings (What):

    • Local Submap-Based Optimization: A novel tracking module that optimizes an agent's pose against local submaps. This approach is more computationally efficient and stable for large-scale scenes compared to optimizing against a single, massive global map.
    • Coarse-to-Fine Loop Closure: A robust mechanism for detecting both intra- and inter-robot loop closures. It first identifies potential matches using visual descriptors and then refines the alignment using a combination of photometric/geometric optimization and a precise Iterative Closest Point (ICP) algorithm.
    • Pose Graph Optimization for Global Consistency: The detected loop closures are integrated as constraints into a pose graph. Optimizing this graph corrects the trajectories of all agents and aligns their respective submaps into a single, globally consistent, high-fidelity 3D reconstruction.

3. Prerequisite Knowledge & Related Work

Foundational Concepts

  • Simultaneous Localization and Mapping (SLAM): A fundamental problem in robotics where a robot or agent, placed in an unknown environment, must build a map of its surroundings while simultaneously determining its own location within that map.
  • RGB-D Camera: A type of sensor that captures both a standard color (RGB) image and a depth (D) image. The depth image provides the distance from the camera to each point in the scene, which is crucial for 3D reconstruction.
  • 3D Gaussian Splatting (3DGS): A modern technique for representing 3D scenes. Instead of using traditional meshes or voxels, it uses a collection of 3D Gaussians. Each Gaussian has properties like position, shape (covariance), color, and opacity. This representation allows for extremely fast and photorealistic rendering of the scene from new viewpoints.
  • Pose Graph Optimization: A back-end optimization technique in SLAM. The robot's trajectory is represented as a graph where nodes are camera poses (position and orientation) at different points in time, and edges are constraints between these poses. These constraints come from odometry (relative motion between consecutive frames) and loop closures (recognizing a previously seen place). The goal is to adjust the nodes (poses) to best satisfy all the constraints, minimizing global error and correcting drift. A toy example follows this list.
  • Loop Closure: The process of recognizing that an agent has returned to a previously visited location. This is critical for SLAM because it provides a powerful constraint that can significantly reduce the accumulated error (drift) in the trajectory and map.
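To make pose graph optimization concrete, here is a toy SE(2) example, not the paper's implementation: the real system works on SE(3) poses with the log map, but the idea is the same. Odometry edges and one loop closure edge are combined into a least-squares problem and solved with SciPy; all poses and measurements below are made up for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def relative_pose(a, b):
    """Pose of b expressed in the frame of a, for SE(2) poses (x, y, theta)."""
    dx, dy = b[:2] - a[:2]
    c, s = np.cos(a[2]), np.sin(a[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, b[2] - a[2]])

# Edges: (i, j, measured relative pose). Odometry says "1 m forward" each step;
# a loop closure edge from node 4 back to node 0 contradicts the drifted chain.
edges = [(0, 1, np.array([1.0, 0.0, 0.0])),
         (1, 2, np.array([1.0, 0.0, 0.0])),
         (2, 3, np.array([1.0, 0.0, 0.0])),
         (3, 4, np.array([1.0, 0.0, 0.0])),
         (4, 0, np.array([-4.2, 0.3, 0.0]))]  # loop closure constraint

def residuals(x):
    poses = x.reshape(-1, 3)
    res = [relative_pose(poses[i], poses[j]) - meas for i, j, meas in edges]
    res.append(poses[0])  # gauge prior: pin the first pose at the origin
    return np.concatenate(res)

x0 = np.zeros(5 * 3)
x0[::3] = [0, 1, 2, 3, 4.5]   # drifted initial guess along x
sol = least_squares(residuals, x0)
print(sol.x.reshape(-1, 3))    # drift-corrected, globally consistent poses
```

The optimizer distributes the odometry/loop-closure disagreement over the whole trajectory, which is exactly the drift-correction effect loop closure provides in SLAM.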

Previous Works & Technological Evolution

The paper situates GRAND-SLAM within the evolution of SLAM technology:

  • Classical SLAM: Early systems like ORB-SLAM3 used sparse features (keypoints) to track motion. They are fast and robust but produce sparse maps that are not suitable for photorealistic rendering.
  • Dense Neural SLAM: More recent methods like NICE-SLAM and Co-SLAM use neural implicit representations (like Neural Radiance Fields or NeRFs) to create dense, photorealistic maps. However, they are often slow to train and optimize, making real-time performance and map correction (like applying a rigid transformation for loop closure) difficult.
  • Gaussian Splatting SLAM: The introduction of 3DGS offered a breakthrough, combining the photorealism of neural methods with the speed and explicit nature of classical methods. Systems like Gaussian-SLAM and SplaTAM demonstrated real-time, high-quality mapping. However, they still suffered from drift in large environments.
  • GS SLAM with Loop Closure: To address drift, methods like GLC-SLAM and MAGiC-SLAM integrated loop closure. However, their implementations were tested primarily in small-scale, indoor settings and did not scale well to the challenges of large outdoor environments.
  • Multi-Agent SLAM: Systems like CCM-SLAM (centralized) and Swarm-SLAM (decentralized) enabled collaborative mapping but typically used sparse representations. Neural multi-agent systems like CP-SLAM and MAGiC-SLAM brought photorealism to this domain but remained limited to indoor scenes.

Differentiation

GRAND-SLAM distinguishes itself by being the first system to successfully apply multi-agent 3DGS SLAM with robust loop closure to large-scale, outdoor environments. Its key innovations—local submap optimization for scalability and a multi-stage loop closure process integrated into a global pose graph—directly address the limitations of prior works.

4. Methodology (Core Technology & Implementation)

The core of GRAND-SLAM is a pipeline that processes RGB-D data from multiple agents to build a globally consistent 3D Gaussian map.

Fig. 1: GRAND-SLAM introduces local optimization by submap for each agent, integrates inter-agent loop closure by submap, and reconstructs large-scale environments via pose graph optimization. (Schematic: the top left shows per-agent local optimization over submaps, the top right shows inter-agent loop closure via submaps, and the bottom depicts multi-agent large-scale reconstruction with fused trajectories.)

As shown in Image 1, the system is built around three key ideas: local optimization within each agent's submaps, loop closure detection between agents, and a final large-scale reconstruction that fuses all information.

Fig. 2: System architecture of GRAND-SLAM. The agent side (left) performs submap-based tracking and loop closure detection from RGB-D input, optimizing poses and local maps; the server side (right) handles inter-agent loop closure and place recognition plus global map optimization, ultimately producing a globally consistent, high-fidelity fused multi-agent map.

Image 2 provides a more detailed system architecture, separating the process into an "Agent Side" and a "Server Side." Each agent independently tracks its position and builds local submaps. These submaps and potential loop closures are then sent to a central server which performs global optimization and fuses the maps.

A. Preliminary: Gaussian Splatting

The scene is represented by a set of 3D Gaussians. Each Gaussian is defined by:

  • Position (mean): $\mu \in \mathbb{R}^3$
  • Shape and orientation (covariance matrix): $\Sigma \in \mathbb{S}_+^3$
  • Opacity: $o \in \mathbb{R}$
  • Color: $c \in \mathbb{R}^3$

These Gaussians are "splatted" (projected) onto a 2D image plane using a differentiable renderer, allowing for optimization via gradient descent by comparing the rendered image with the real input image.
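To make the projection step concrete, here is a minimal sketch (not the paper's code) of the standard EWA-style covariance projection used by 3DGS renderers, $\Sigma' = J W \Sigma W^\top J^\top$. The function name, intrinsics, and toy values are illustrative assumptions.

```python
import numpy as np

def project_gaussian_cov(cov3d_cam, mean_cam, fx, fy):
    """Approximate 2D image-plane covariance of a 3D Gaussian (EWA splatting).

    cov3d_cam: 3x3 covariance already rotated into the camera frame (W @ Sigma @ W.T)
    mean_cam:  Gaussian mean in camera coordinates [x, y, z]
    fx, fy:    focal lengths in pixels
    """
    x, y, z = mean_cam
    # Jacobian of the perspective projection u = fx*x/z, v = fy*y/z
    J = np.array([
        [fx / z, 0.0,    -fx * x / z**2],
        [0.0,    fy / z, -fy * y / z**2],
    ])
    return J @ cov3d_cam @ J.T  # 2x2 covariance of the splatted Gaussian

# Toy usage: an isotropic Gaussian 5 m in front of the camera
cov2d = project_gaussian_cov(np.eye(3) * 0.01, np.array([0.5, 0.2, 5.0]), fx=600.0, fy=600.0)
print(cov2d)
```

The resulting 2D Gaussian is what gets alpha-composited per pixel, which is why the whole pipeline stays differentiable with respect to $\mu$, $\Sigma$, $o$, and $c$.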

B. Mapping

Instead of a single global map, the environment is divided into submaps.

  • Submap Representation: Each submap for an agent $a$ is a collection of Gaussians defined in a local coordinate frame $l$: $$P_{a,l}^{s} = \left\{ G_i^s(\mu, \Sigma, o, c) \mid i = 1, \ldots, N \right\}$$
  • Local vs. Global Frames: Each submap is initially built in its own local frame. A transformation $T_{a,l}^g$ maps the submap from its local frame $l$ to the global frame $g$. This transformation is initially based on odometry and later refined by pose graph optimization.
  • Submap Management:
    1. Initialization: A new submap is created when the agent's pose exceeds a certain translation or rotation threshold relative to the start of the current submap. This keeps each submap spatially constrained.
    2. Building: For new keyframes, new Gaussians are added to the active submap in areas that are not yet well-reconstructed (identified by low rendered opacity values).
    3. Optimization: The parameters of all Gaussians in the active submap are optimized to minimize a rendering loss, a weighted sum of the differences between the rendered and ground-truth color and depth images (a minimal sketch of this loss follows the list below): $$\mathcal{L}_{\mathrm{map}} = \sum M_{\mathrm{in}} \, M_{\alpha} \cdot \left( \lambda_c \| \hat{I}_j^s - I_j^s \|_{L1} + (1 - \lambda_c) \| \hat{D}_j^s - D_j^s \|_{L1} \right)$$
      • $\hat{I}_j^s, \hat{D}_j^s$: Rendered color and depth images for keyframe $j$.
      • $I_j^s, D_j^s$: Ground-truth input color and depth images.
      • $\lambda_c$: A weight that balances the color and depth losses.
      • $M_{\alpha}$: A soft mask that gives less weight to poorly observed regions.
      • $M_{\mathrm{in}}$: A mask that ignores pixels with very high reconstruction error (outliers).
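A compact PyTorch-style sketch of this masked L1 objective over one keyframe. The tensor shapes, the default $\lambda_c$, and the quantile-based outlier rule are illustrative assumptions; the paper does not specify them here.

```python
import torch

def mapping_loss(I_hat, D_hat, I, D, alpha, lambda_c=0.9, err_quantile=0.95):
    """Masked L1 color + depth loss for one keyframe.

    I_hat, I: rendered / ground-truth color, shape (3, H, W)
    D_hat, D: rendered / ground-truth depth, shape (H, W)
    alpha:    rendered opacity (accumulated transmittance), shape (H, W)
    """
    color_err = (I_hat - I).abs().mean(dim=0)   # per-pixel L1 over channels
    depth_err = (D_hat - D).abs()
    per_pixel = lambda_c * color_err + (1 - lambda_c) * depth_err

    M_alpha = alpha                              # soft mask: trust well-observed pixels
    # Inlier mask: drop the highest-error pixels as outliers (illustrative rule)
    M_in = (per_pixel < torch.quantile(per_pixel, err_quantile)).float()

    return (M_in * M_alpha * per_pixel).sum() / (M_in.sum() + 1e-8)
```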

C. Tracking

Tracking determines the camera's current pose. It's a two-stage process:

  1. Coarse Initialization: The current pose $T_i$ is initialized from the previous pose $T_{i-1}$ and a relative transformation $T_{i-1,i}$ estimated using fast visual odometry on the color and depth frames: $$T_i = T_{i-1} \cdot T_{i-1,i}$$

  2. Fine Refinement (Local Optimization): The pose is then refined by optimizing it against the current active submap. Crucially, the optimization is performed in a local frame relative to the current camera pose, not the global world origin. This prevents numerical instability and poor convergence when the agent is far from the origin.

    Fig. 3: Without local optimization, optimizing rotation results in large movements with respect to the origin, which may be far from the agent's current position. The example compares renders from the Kimera-Multi Outdoor dataset: the left shows the scene rendered during (local) optimization, while the right shows the large displacement produced by rotating about the global origin, underscoring the importance of local optimization for positional accuracy.

Image 3 illustrates this point: optimizing rotation with respect to a distant global origin (right) can cause dramatic, incorrect shifts in the scene, while local optimization maintains stability (left); the sketch below makes this quantitative. The tracking loss minimizes the photometric and geometric error: $$\arg\min_{R,\mathbf{t}} \mathcal{L}_{\mathrm{track}} \left( \hat{I}(R,\mathbf{t}), \hat{D}(R,\mathbf{t}), I, D, \alpha \right)$$ Here, only the rotation $R$ and translation $\mathbf{t}$ of the current pose are optimized; the Gaussian parameters of the map are frozen.
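A small NumPy sketch of why the refinement is done in a local frame: the same 0.005 rad rotation update barely moves the camera when applied about its own center, but displaces it by well over half a meter when applied about a distant world origin. The poses and distances are made-up illustrative values.

```python
import numpy as np

def make_pose(yaw, t):
    """4x4 homogeneous pose from a yaw angle (rad) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

# Coarse initialization: chain the previous pose with the VO-estimated motion
T_prev = make_pose(0.1, [120.0, -45.0, 1.5])   # agent ~128 m from the world origin
T_rel = make_pose(0.01, [0.3, 0.0, 0.0])       # T_{i-1,i} from visual odometry
T_init = T_prev @ T_rel                         # T_i = T_{i-1} . T_{i-1,i}

# The same small rotation-only update, applied in two different frames:
delta = make_pose(0.005, [0.0, 0.0, 0.0])
T_local = T_init @ delta    # rotates about the camera itself
T_global = delta @ T_init   # rotates about the distant world origin
print(np.linalg.norm(T_local[:3, 3] - T_init[:3, 3]))   # 0.0: camera stays put
print(np.linalg.norm(T_global[:3, 3] - T_init[:3, 3]))  # ~0.64 m sideways sweep
```

The global-frame update moves the camera by roughly (distance to origin) × (angle), which is exactly the instability Fig. 3 depicts.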

D. Loop Closure Detection

This module finds connections between different parts of the map to correct drift.

  1. Candidate Detection: Each keyframe is associated with a NetVLAD descriptor, a compact representation of the visual scene. To find a potential loop closure, the descriptor of the current keyframe is compared against a database of past descriptors (from the same agent or other agents). High cosine similarity suggests a match.
  2. Initial Alignment: An initial transformation between the matching submaps is estimated.
    • For short-range, intra-agent loops, this is calculated from the existing (but drifted) camera poses.
    • For long-range or inter-agent loops, a coarse geometric registration on the RGB-D frames is performed.
  3. Fine Refinement: This initial alignment is refined using Iterative Closest Point (ICP) on dense point clouds extracted from the submaps, minimizing the point-to-plane distance error: $$\mathcal{L}_{\mathrm{ICP}} = \sum_i \left( \mathbf{n}_i^\top \left( T\mathbf{p}_i - \mathbf{q}_i \right) \right)^2$$
    • $\mathbf{p}_i$: A point in the source point cloud; $T\mathbf{p}_i$ is that point under the current alignment estimate $T$.
    • $\mathbf{q}_i$: The corresponding point in the target point cloud.
    • $\mathbf{n}_i$: The surface normal at the target point $\mathbf{q}_i$.
  4. Validation: The refined alignment is accepted as a valid loop closure only if the alignment quality is high (high fitness score and low inlier error). Accepted closures are added as constraints to the pose graph.
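As a concrete sketch of this candidate-then-refine-then-validate flow, the snippet below pairs a cosine-similarity search over NetVLAD descriptors with Open3D's point-to-plane ICP. Open3D is a common off-the-shelf choice but is not specified by the paper, and all thresholds here are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

def find_candidates(query_desc, db_descs, sim_thresh=0.8):
    """NetVLAD descriptors are L2-normalized, so cosine similarity is a dot product."""
    sims = db_descs @ query_desc            # (M,) similarities against the database
    return np.nonzero(sims > sim_thresh)[0]

def refine_loop_closure(src_pcd, tgt_pcd, T_init, max_dist=0.5,
                        min_fitness=0.6, max_inlier_rmse=0.2):
    tgt_pcd.estimate_normals()  # point-to-plane ICP needs target normals
    result = o3d.pipelines.registration.registration_icp(
        src_pcd, tgt_pcd, max_dist, T_init,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    # Validation: accept only well-aligned closures as pose-graph constraints
    if result.fitness > min_fitness and result.inlier_rmse < max_inlier_rmse:
        return result.transformation  # 4x4 relative transform
    return None
```

A returned `None` simply means the candidate is discarded rather than added to the pose graph.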

E. Pose Graph Optimization

This is the final step to achieve global consistency.

  • Graph Construction: A global graph is built where nodes are the poses of all submap keyframes from all agents. Edges represent two types of constraints:
    • Odometry constraints: Relative motion between consecutive keyframes.
    • Loop closure constraints: Relative transformations between keyframes identified as viewing the same scene.
  • Optimization: The system solves for the set of globally consistent poses $\{T_i^*\}$ that minimizes the total error across all constraints. The objective function is: $$\mathcal{L}_{\mathrm{graph}} = \sum_{(i,j) \in \mathcal{E}} \left\| \log \left( \hat{T}_{i,j}^{-1} T_i^{-1} T_j \right) \right\|_{\Sigma_{i,j}}^{2}$$
    • This formula measures the error between the measured relative pose $\hat{T}_{i,j}$ and the one computed from the optimized absolute poses ($T_i^{-1} T_j$). The $\log(\cdot)$ map converts the SE(3) transformation error into a vector for least-squares optimization.
  • Map Update: After optimization, the corrected global poses $T_{a,l}^g$ are used to transform all local submaps and their Gaussians into a single, unified, globally consistent coordinate frame (a minimal sketch follows below). The Gaussian means $\mu$ and covariances $\Sigma$ are updated as follows: $$\mu^{(g)} = R \mu^{(l)} + t, \quad \Sigma^{(g)} = R \Sigma^{(l)} R^{\top}$$ where $[R \mid t] = T_{a,l}^g$ is the optimized transformation.
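A minimal NumPy sketch of this map update, assuming (as an illustration) that a submap's Gaussians are stored as arrays of means and covariances:

```python
import numpy as np

def submap_to_global(means_l, covs_l, T_global):
    """Rigidly transform a submap's Gaussians into the global frame.

    means_l:  (N, 3) Gaussian means in the submap's local frame
    covs_l:   (N, 3, 3) covariances in the local frame
    T_global: 4x4 optimized submap-to-global transform T_{a,l}^g
    """
    R, t = T_global[:3, :3], T_global[:3, 3]
    means_g = means_l @ R.T + t                          # mu_g = R mu_l + t
    covs_g = np.einsum('ij,njk,lk->nil', R, covs_l, R)   # Sigma_g = R Sigma_l R^T
    return means_g, covs_g
```

Because the update is a single rigid transform per submap, correcting a loop closure does not require re-optimizing any Gaussian parameters, which is a key advantage of the explicit 3DGS representation over implicit neural maps.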

5. Experimental Setup

  • Datasets:

    • Multiagent Replica: A synthetic indoor dataset with two agents. It features clean, noise-free RGB-D data, making it ideal for evaluating the core accuracy and fidelity of SLAM algorithms under perfect sensing conditions.
    • Kimera-Multi Outdoor: A real-world, large-scale outdoor dataset captured by six agents. It includes significant challenges like sensor noise, motion blur, reflections, and large open spaces, which tests the robustness and scalability of the system.
  • Evaluation Metrics:

    • Absolute Trajectory Error (ATE) RMSE:
      1. Conceptual Definition: ATE measures the global consistency of a trajectory. It first aligns the estimated trajectory with the ground-truth trajectory and then computes the Root Mean Square Error (RMSE) of the distances between corresponding camera positions. A lower ATE RMSE value indicates higher tracking accuracy.
      2. Mathematical Formula: $$\text{ATE RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \operatorname{trans}\left( Q_i^{-1} S P_i \right) \right\|^2}$$
      3. Symbol Explanation:
        • $N$: Number of camera poses.
        • $P_i$: Estimated pose at time $i$.
        • $Q_i$: Ground-truth pose at time $i$.
        • $S$: The rigid transformation that aligns the estimated trajectory to the ground-truth trajectory.
        • $\operatorname{trans}(\cdot)$: The translational component of the resulting transformation error.
    • Peak Signal-to-Noise Ratio (PSNR):
      1. Conceptual Definition: PSNR measures the quality of a reconstructed image by comparing it to a ground-truth image. It is based on the Mean Squared Error (MSE). Higher PSNR values indicate better reconstruction quality. It is measured in decibels (dB).
      2. Mathematical Formula: $$\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)$$
      3. Symbol Explanation:
        • $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image).
        • $\text{MSE}$: The Mean Squared Error between the ground-truth and reconstructed images.
    • Structural Similarity Index Measure (SSIM):
      1. Conceptual Definition: SSIM is a perceptual metric that quantifies image quality degradation as a change in structural information. It is considered to be more consistent with human visual perception than PSNR. Values range from -1 to 1, with 1 indicating a perfect match.
      2. Mathematical Formula: $$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
      3. Symbol Explanation:
        • $\mu_x, \mu_y$: The mean intensities of images $x$ and $y$.
        • $\sigma_x^2, \sigma_y^2$: The variances of images $x$ and $y$.
        • $\sigma_{xy}$: The covariance of $x$ and $y$.
        • $c_1, c_2$: Small constants that stabilize the division.
    • Learned Perceptual Image Patch Similarity (LPIPS):
      1. Conceptual Definition: LPIPS evaluates the perceptual similarity between two images using features extracted from a pre-trained deep neural network (e.g., VGG). It is designed to correlate well with human judgments of image similarity. Lower values indicate that two images are more perceptually similar.
    • Depth L1:
      1. Conceptual Definition: This is the Mean Absolute Error (L1 norm) between the rendered depth map and the ground-truth depth map. It directly measures the accuracy of the reconstructed 3D geometry. Lower is better. A small numerical sketch of ATE RMSE and PSNR follows the baselines list below.
  • Baselines:

    • Tracking: ORB-SLAM3 (classical sparse), Gaussian-SLAM (single-agent 3DGS), MonoGS (monocular 3DGS), MAGiC-SLAM (multi-agent 3DGS), CCM-SLAM, Swarm-SLAM, CP-SLAM (multi-agent methods).
    • Rendering: CP-SLAM (neural point-based), Gaussian-SLAM, MAGiC-SLAM (both 3DGS-based).
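For concreteness, here is a small sketch computing the two headline metrics from arrays, assuming the trajectory alignment transform $S$ has already been applied (e.g., via Umeyama alignment) and images are normalized to $[0, 1]$:

```python
import numpy as np

def ate_rmse(est_positions, gt_positions):
    """ATE RMSE over translational components, for already-aligned trajectories."""
    diff = est_positions - gt_positions               # (N, 3) residuals
    return np.sqrt(np.mean(np.sum(diff**2, axis=1)))

def psnr(img, ref, max_val=1.0):
    """PSNR in dB for images with pixel values in [0, max_val]."""
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)

# Toy check: a uniform 1 cm offset along x gives an ATE RMSE of exactly 0.01 m
est = np.zeros((100, 3)); est[:, 0] = 0.01
gt = np.zeros((100, 3))
print(ate_rmse(est, gt))  # 0.01
```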

6. Results & Analysis

Core Results

Camera Tracking:

  • Replica Dataset (Indoor, Single-Agent): As per the manual transcription of Table I, GRAND-SLAM with loop closure achieves the lowest average ATE RMSE (0.27 cm), outperforming all other methods, including the next best, MAGiC-SLAM (0.50 cm). Even without loop closures, GRAND-SLAM (0.36 cm) is highly competitive, demonstrating the strength of its local optimization-based tracking.

    Manual Transcription of Table I: Single-agent tracking performance on Multiagent Replica (ATE RMSE ↓ [cm]).

    | Method | Off0 | Apt0 | Apt1 | Apt2 | Avg |
    |---|---|---|---|---|---|
    | ORB-SLAM3 [24] | 0.60 | 1.07 | 4.94 | 1.36 | 1.99 |
    | Gaussian-SLAM [17] | 0.33 | 0.41 | 30.13 | 121.96 | 38.21 |
    | MonoGS [16] | 0.38 | 0.21 | 3.33 | 0.54 | 1.15 |
    | MAGiC-SLAM [38] | 0.42 | 0.38 | 0.54 | 0.66 | 0.50 |
    | GRAND-SLAM (w/o LC) | 0.44 | 0.28 | 0.46 | 0.27 | 0.36 |
    | GRAND-SLAM | 0.32 | 0.23 | 0.35 | 0.17 | 0.27 |
  • Replica Dataset (Indoor, Multi-Agent): Table II shows that in the multi-agent setting, GRAND-SLAM (Avg ATE 0.25 cm) performs on par with MAGiC-SLAM (0.26 cm) and significantly outperforms other multi-agent baselines like CP-SLAM (1.23 cm) and Swarm-SLAM (3.60 cm). This confirms its state-of-the-art performance in controlled indoor environments.

    Manual Transcription of Table II: Multi-agent tracking performance on Multiagent Replica (ATE RMSE ↓ [cm]).

    | Method | Agent | O-0 | A-0 | A-1 | A-2 | Avg |
    |---|---|---|---|---|---|---|
    | CCM-SLAM [39] | Agt 1 | 9.84 | X | 2.12 | 0.51 | - |
    | Swarm-SLAM [41] | Agt 1 | 1.07 | 1.61 | 4.62 | 2.69 | 2.50 |
    | CP-SLAM [42] | Agt 1 | 0.50 | 0.62 | 1.11 | 1.41 | 0.91 |
    | MAGiC-SLAM (w/o LC) | Agt 1 | 0.44 | 0.30 | 0.48 | 0.91 | 0.53 |
    | MAGiC-SLAM [38] | Agt 1 | 0.31 | 0.13 | 0.21 | 0.42 | 0.27 |
    | GRAND-SLAM (w/o LC) | Agt 1 | 0.27 | 0.27 | 0.47 | 0.33 | 0.34 |
    | GRAND-SLAM | Agt 1 | 0.28 | 0.27 | 0.28 | 0.18 | 0.25 |
    | CCM-SLAM [39] | Agt 2 | 0.76 | X | 9.31 | 0.48 | - |
    | Swarm-SLAM [41] | Agt 2 | 1.76 | 1.98 | 6.50 | 8.53 | 4.69 |
    | CP-SLAM [42] | Agt 2 | 0.79 | 1.28 | 1.72 | 2.41 | 1.55 |
    | MAGiC-SLAM (w/o LC) | Agt 2 | 0.41 | 0.46 | 0.61 | 0.41 | 0.47 |
    | MAGiC-SLAM [38] | Agt 2 | 0.24 | 0.21 | 0.30 | 0.22 | 0.24 |
    | GRAND-SLAM (w/o LC) | Agt 2 | 0.43 | 0.22 | 0.44 | 0.20 | 0.32 |
    | GRAND-SLAM | Agt 2 | 0.25 | 0.19 | 0.36 | 0.18 | 0.25 |
    | CCM-SLAM [39] | Avg | 5.30 | X | 5.71 | 0.49 | - |
    | Swarm-SLAM [41] | Avg | 1.42 | 1.80 | 5.56 | 5.61 | 3.60 |
    | CP-SLAM [42] | Avg | 0.65 | 0.95 | 1.42 | 1.91 | 1.23 |
    | MAGiC-SLAM (w/o LC) | Avg | 0.42 | 0.38 | 0.54 | 0.66 | 0.50 |
    | MAGiC-SLAM [38] | Avg | 0.28 | 0.17 | 0.26 | 0.32 | 0.26 |
    | GRAND-SLAM (w/o LC) | Avg | 0.44 | 0.28 | 0.46 | 0.27 | 0.36 |
    | GRAND-SLAM | Avg | 0.27 | 0.23 | 0.32 | 0.18 | 0.25 |
  • Kimera-Multi Dataset (Outdoor, Large-Scale): This is the key result demonstrating the paper's main claim. As shown in Table III, GRAND-SLAM dramatically outperforms all baselines. Its average ATE is 4.99 m, compared to 60.79 m for MAGiC-SLAM and 316.92 m for Gaussian SLAM. This represents a 91.8% reduction in tracking error compared to MAGiC-SLAM, validating the claim in the abstract. The other methods struggle significantly, with frequent failures (marked with *), highlighting the difficulty of this dataset and the robustness of GRAND-SLAM.

    Manual Transcription of Table III: Tracking performance on Kimera-Multi Outdoor (ATE RMSE ↓ [m]). Entries marked * indicate runs with failures.

    | Method | Agent | Outside 1 | Outside 2 | Avg |
    |---|---|---|---|---|
    | ORB-SLAM3 [24] | Agt 1 | 14.11 | 2.72 | 8.42 |
    | Gaussian SLAM [12] | Agt 1 | 356.61 | 71.16 | 213.89 |
    | MAGiC-SLAM [38] | Agt 1 | 24.13 | 11.13 | 17.63 |
    | GRAND-SLAM (w/o LC) | Agt 1 | 6.43 | 10.19 | 8.31 |
    | GRAND-SLAM | Agt 1 | 3.95 | 10.93 | 7.44 |
    | ORB-SLAM3 [24] | Agt 2 | 7.07 | 13.12 | 10.10 |
    | Gaussian SLAM [12] | Agt 2 | 119.68 | 7.66* | 63.67 |
    | MAGiC-SLAM [38] | Agt 2 | 98.33 | 10.50* | 54.42 |
    | GRAND-SLAM (w/o LC) | Agt 2 | 9.74 | 4.63 | 7.19 |
    | GRAND-SLAM | Agt 2 | 8.93 | 4.54 | 6.74 |
    | ORB-SLAM3 [24] | Agt 3 | 7.48 | 18.99 | 13.24 |
    | Gaussian SLAM [12] | Agt 3 | 1150.42 | 195.95 | 673.19 |
    | MAGiC-SLAM [38] | Agt 3 | 172.81 | 47.86 | 110.34 |
    | GRAND-SLAM (w/o LC) | Agt 3 | 1.05 | 7.93 | 4.49 |
    | GRAND-SLAM | Agt 3 | 1.30 | 6.25 | 3.78 |
    | ORB-SLAM3 [24] | Avg | 9.55 | 11.61 | 10.58 |
    | Gaussian SLAM [12] | Avg | 542.24 | 91.59 | 316.92 |
    | MAGiC-SLAM [38] | Avg | 98.42 | 23.16 | 60.79 |
    | GRAND-SLAM (w/o LC) | Avg | 5.74 | 7.58 | 6.66 |
    | GRAND-SLAM | Avg | 4.73 | 7.24 | 4.99 |

Rendering Quality:

  • Replica Dataset: Table V shows that GRAND-SLAM achieves an average PSNR of 41.35, a 20.7% improvement over MAGiC-SLAM's 34.26 (the abstract claims 28%, which may refer to a specific scene or a different calculation, but the improvement is substantial regardless). It also leads in all other rendering metrics (SSIM, LPIPS, Depth L1).

    Manual Transcription of Table V: Training view synthesis performance on the Multiagent Replica dataset.

    | Method | Metric | O-0 | A-0 | A-1 | A-2 | Avg |
    |---|---|---|---|---|---|---|
    | CP-SLAM | PSNR ↑ | 28.56 | 26.12 | 12.16 | 23.98 | 22.71 |
    | | SSIM ↑ | 0.87 | 0.79 | 0.31 | 0.81 | 0.69 |
    | | LPIPS ↓ | 0.29 | 0.41 | 0.97 | 0.39 | 0.51 |
    | | Depth L1 ↓ | 2.74 | 19.93 | 66.77 | 2.47 | 22.98 |
    | MAGiC-SLAM | PSNR ↑ | 39.32 | 36.96 | 30.01 | 30.73 | 34.26 |
    | | SSIM ↑ | 0.99 | 0.98 | 0.95 | 0.96 | 0.97 |
    | | LPIPS ↓ | 0.05 | 0.09 | 0.18 | 0.17 | 0.12 |
    | | Depth L1 ↓ | 0.41 | 0.64 | 3.16 | 0.99 | 1.30 |
    | GRAND-SLAM | PSNR ↑ | 43.12 | 44.15 | 38.65 | 39.46 | 41.35 |
    | | SSIM ↑ | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
    | | LPIPS ↓ | 0.03 | 0.03 | 0.05 | 0.05 | 0.04 |
    | | Depth L1 ↓ | 0.25 | 0.31 | 0.77 | 0.29 | 0.41 |
  • Kimera-Multi Dataset: Table IV and Figure 4 show GRAND-SLAM's superior rendering in the challenging outdoor setting. It achieves a PSNR of 27.44, far ahead of Gaussian SLAM (24.45) and MAGiC SLAM (15.88).

    Manual Transcription of Table IV: Training view synthesis performance on the Kimera-Multi dataset. Entries marked * indicate runs with failures.

    | Method | Metric | Outside 1 | Outside 2 | Avg |
    |---|---|---|---|---|
    | Gaussian SLAM | PSNR ↑ | 24.59 | 24.31* | 24.45 |
    | | SSIM ↑ | 0.90 | 0.89* | 0.90 |
    | | LPIPS ↓ | 0.18 | 0.17* | 0.18 |
    | | Depth L1 ↓ | 1.19 | 1.45* | 1.32 |
    | MAGiC-SLAM | PSNR ↑ | 16.12 | 15.63* | 15.88 |
    | | SSIM ↑ | 0.49 | 0.50* | 0.50 |
    | | LPIPS ↓ | 0.54 | 0.53* | 0.54 |
    | | Depth L1 ↓ | 3.86 | 4.71* | 4.29 |
    | GRAND-SLAM | PSNR ↑ | 28.48 | 26.62 | 27.44 |
    | | SSIM ↑ | 0.97 | 0.96 | 0.97 |
    | | LPIPS ↓ | 0.10 | 0.12 | 0.11 |
    | | Depth L1 ↓ | 1.17 | 1.61 | 1.39 |

    Fig. 4: Comparison of visual reconstructions by different SLAM methods from three viewpoints (Outside 1 Agent 1, Outside 1 Agent 2, Outside 2 Agent 3): Gaussian SLAM, MAGiC-SLAM, GRAND-SLAM (this paper), and ground truth. GRAND-SLAM's renders are closest to the real images, with sharper and more accurate detail, particularly in textures and edges, supporting its improved rendering quality and localization accuracy in large-scale multi-agent environments.

Image 4 provides compelling visual evidence. The renderings from Gaussian SLAM and MAGiC SLAM are blurry, distorted, or incomplete. In contrast, GRAND-SLAM produces sharp, detailed, and geometrically correct reconstructions that closely match the ground truth, even for fine details like pavement texture and text on the ground.

Ablations / Parameter Sensitivity

The GRAND-SLAM (w/o LC) rows in Tables I-III serve as an ablation study of the loop closure module.

  • On the Replica dataset, the tracking performance without loop closure is already very strong, indicating the effectiveness of the local optimization tracker.
  • On the Kimera-Multi dataset, the impact is more pronounced. The full GRAND-SLAM system (4.99 m ATE) is significantly better than the version without loop closure (6.66 m ATE), a 25% error reduction. This confirms that for large-scale, complex environments, robust loop closure is essential for maintaining global consistency and correcting long-term drift.
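For reference, the quoted reduction follows directly from the Table III averages:

$$\frac{6.66 - 4.99}{6.66} \approx 0.251 \approx 25\%$$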

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents GRAND-SLAM, a pioneering multi-agent SLAM system that scales 3D Gaussian splatting to large, outdoor environments. By combining local submap optimization for efficient tracking with a robust inter- and intra-agent loop closure mechanism integrated into a pose graph, it achieves state-of-the-art results in both tracking accuracy and rendering quality. The experiments on the challenging Kimera-Multi dataset demonstrate a significant leap forward in creating globally consistent, photorealistic maps for real-world robotics applications.

  • Limitations & Future Work: The authors acknowledge that in very large-scale deployments, communication bandwidth and memory can become constraints. As future work, they plan to integrate compression techniques for the Gaussian submaps to make the system even more scalable and efficient for large teams of robots or extremely vast environments.

  • Personal Insights & Critique:

    • Strengths: The core idea of local submap optimization is a powerful and practical solution to the scalability problem in dense SLAM. It elegantly sidesteps the computational burden and numerical instability of optimizing against a massive, monolithic map. The system's performance on the Kimera-Multi dataset is genuinely impressive and sets a new benchmark for photorealistic SLAM in the wild.
    • Potential Weaknesses: The system appears to follow a centralized architecture (with a "Server Side" for global optimization), which could be a single point of failure and a communication bottleneck in a truly distributed robotic swarm. A decentralized approach, while more complex, could offer greater resilience. The reliance on NetVLAD for place recognition is standard, but it can struggle in environments with repetitive structures or significant appearance changes (e.g., day vs. night), which could be a failure point for loop closure.
    • Future Impact: GRAND-SLAM represents a significant step towards making high-fidelity, collaborative 3D mapping a practical reality for real-world applications. This work could pave the way for next-generation autonomous systems that can rapidly build and share detailed, photorealistic models of their environments for tasks like navigation, inspection, and virtual/augmented reality.
