Paper status: completed

4D Gaussian Splatting SLAM

Published: 03/21/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper proposes 4D Gaussian Splatting SLAM for dynamic scenes, simultaneously localizing cameras and reconstructing 4D scenes by explicitly modeling static and dynamic Gaussian primitives. Using RGB-D data, MLPs, and novel 2D optical flow supervision, it achieves robust tracking and high-quality view synthesis in real-world environments.

Abstract

Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establish a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency on learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while the sparse control points along with an MLP is utilized to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighbor images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: 4D Gaussian Splatting SLAM
  • Authors: Yanyan Li, Youxu Fang, Zunjie Zhu, Kunyi Li, Yong Ding, Federico Tombari
  • Affiliations: Technical University of Munich, Hangzhou Dianzi University, Zhejiang University, Google
  • Journal/Conference: This paper is a preprint submitted to arXiv. As of the analysis date, it has not yet been published in a peer-reviewed conference or journal. arXiv is a common platform for researchers to share their work pre-publication.
  • Publication Year: 2025 (arXiv preprint; submitted March 2025 according to the source link).
  • Abstract: The paper presents a novel SLAM system that can simultaneously perform camera localization and build a 4D representation of dynamic scenes using a stream of RGB-D images. Unlike previous methods that often treat dynamic objects as outliers to be removed, this work explicitly models them. The core idea is to segment the scene into static and dynamic components, represented by two sets of Gaussian primitives. The motion of dynamic Gaussians is modeled using sparse control points and a Multi-Layer Perceptron (MLP). A key innovation is a new method for rendering 2D optical flow maps from the dynamic Gaussians, which serves as an additional supervisory signal alongside standard photometric and geometric losses. Experimental results demonstrate that the system achieves robust camera tracking and high-quality novel view synthesis in dynamic, real-world environments.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Traditional Simultaneous Localization and Mapping (SLAM) systems perform well in static environments but often fail or produce significant errors in dynamic scenes where objects are moving. This is because moving objects violate the static world assumption, which is fundamental to most geometric SLAM algorithms.
    • Gap in Prior Work: Many existing SLAM systems designed for dynamic environments simply detect and remove moving objects, treating them as noise or "distractors." This results in an incomplete map of the environment, containing only the static background. While this improves localization, it fails to capture the full 4D reality of the scene (3D space + time). Recent works on dynamic scene reconstruction using methods like 3D Gaussian Splatting often assume that the camera poses are already known, which is not the case in a true SLAM setting.
    • Innovation: This paper bridges the gap by creating a unified SLAM framework that both localizes the camera and reconstructs the complete 4D scene, including dynamic objects. Instead of discarding dynamic content, it explicitly models its motion. The primary innovation is the introduction of a novel optical flow rendering technique that provides powerful supervision for learning the motion of dynamic objects, leading to more accurate tracking and reconstruction.
  • Main Contributions / Findings (What):

    1. A Novel 4D Gaussian Splatting SLAM Pipeline: The paper proposes the first complete system that simultaneously localizes a camera and reconstructs a dynamic scene into a 4D Gaussian radiance field using a live stream of RGB-D images.
    2. Explicit Static-Dynamic Modeling: The system separates Gaussian primitives into static and dynamic sets. The motion of dynamic Gaussians is efficiently modeled using a set of sparse control points whose time-varying transformations are learned by an MLP. This isolates the complex motion modeling from the static background, improving efficiency and robustness.
    3. Optical Flow Supervision: A new algorithm is introduced to render 2D optical flow maps directly from the 3D dynamic Gaussians. By comparing these rendered maps to optical flow estimated from the input images, the system gains a strong supervisory signal to learn the motion field accurately. This is a significant departure from using only photometric and geometric losses.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • SLAM (Simultaneous Localization and Mapping): A core problem in robotics and computer vision where a device (e.g., a robot or camera) builds a map of an unknown environment while simultaneously keeping track of its own location within that map.
    • RGB-D Camera: A type of sensor that captures both a standard color (RGB) image and a per-pixel depth (D) image. The depth information provides direct 3D geometry, which is highly beneficial for SLAM.
    • 3D Gaussian Splatting (3DGS): A novel scene representation technique that models a 3D scene as a collection of 3D Gaussians. Each Gaussian has properties like position, shape (covariance), color, and opacity. To render an image, these 3D Gaussians are "splatted" (projected and rasterized) onto the 2D image plane. 3DGS is known for its high-quality rendering and fast training/rendering speeds compared to alternatives like NeRF.
    • Neural Radiance Fields (NeRF): A technique that represents a scene using a neural network (typically an MLP) that maps a 3D coordinate and viewing direction to a color and density. While producing stunning results, NeRF is often slow to train and render.
    • Optical Flow: A 2D vector field that describes the apparent motion of pixels between two consecutive frames in a video sequence. It is a fundamental concept for motion estimation in computer vision.
  • Previous Works:

    • Camera Pose Estimation: The paper builds on classical geometry-based pose estimation techniques (e.g., matching 2D-3D or 3D-3D points) and modern systems like ORB-SLAM. However, it extends them to handle the challenges of dynamic scenes.
    • 3DGS and Non-static GS SLAM: Previous GS-SLAM systems like Gaussian-SLAM, SplaTAM, and MonoGS achieved impressive results but were primarily designed for static scenes. When faced with dynamic objects, their performance degrades. Some "non-static" variants like DGS-SLAM were proposed, but they typically work by masking out and ignoring dynamic objects, thus only mapping the static parts of the scene.
    • Dynamic Gaussian Splatting: Methods like 4DGS and SC-GS focus on reconstructing dynamic scenes using 3D Gaussians but operate offline and assume that precise camera poses are provided beforehand. They do not solve the "localization" part of SLAM. This paper integrates their dynamic modeling ideas into an online SLAM framework.
  • Technological Evolution: The field has progressed from sparse, feature-based SLAM (e.g., ORB-SLAM) to dense SLAM that reconstructs detailed geometry (e.g., KinectFusion). The rise of neural representations led to NeRF-SLAM, which offered photorealistic mapping but was computationally heavy. 3D Gaussian Splatting then emerged as a more efficient alternative, leading to GS-SLAM systems. This paper represents the next logical step: extending the efficiency and quality of GS-SLAM to fully dynamic environments.

  • Differentiation:

    • vs. Static GS-SLAM: This work explicitly models dynamic objects rather than assuming a static world.
    • vs. Distractor-Removal SLAM: It reconstructs dynamic objects to create a complete 4D map instead of just discarding them to improve localization.
    • vs. Offline Dynamic 3DGS: It performs simultaneous localization and mapping in an online fashion, without requiring pre-computed camera poses.
    • vs. Prior Dynamic SLAM: The key technical innovation is the use of rendered optical flow from the Gaussians themselves as a supervisory signal, which provides a more direct and powerful constraint on the learned motion field compared to just relying on photometric consistency.

4. Methodology (Core Technology & Implementation)

The proposed system architecture is visualized in Image 1. It consists of three main components: Initialization, Tracking, and 4D Mapping.


Image 1: This diagram illustrates the complete pipeline of the 4D Gaussian Splatting SLAM system. An input RGB-D stream is separated into static and dynamic components. The static Gaussians are used for robust camera tracking. The dynamic Gaussians are modeled using control points and an MLP. In the mapping stage, a novel optical flow loss is computed to supervise the learning of the dynamic deformation field, which is then jointly optimized with Gaussian parameters and camera poses.

4.1. Initialization

  • Gaussian Representation: The scene is represented by a set of 3D Gaussians $\mathcal{G}$. Each Gaussian is defined by its mean (position) $\boldsymbol{\mu}$, covariance $\boldsymbol{\Sigma}$, opacity $\alpha$, and color $\mathbf{c}$. The authors add a binary attribute $dy$ to distinguish static from dynamic Gaussians, giving the full representation $\mathcal{G} = [\boldsymbol{\Sigma}, \boldsymbol{\mu}, \alpha, \mathbf{c}, dy]$.

  • Rendering: The color $C(\boldsymbol{p})$, depth $D(\boldsymbol{p})$, and opacity $O(\boldsymbol{p})$ for a pixel $\boldsymbol{p}$ are rendered using alpha-blending, where Gaussians sorted by depth are composited front-to-back (a minimal compositing sketch follows after this list).

    • Color Rendering: $C(\boldsymbol{p}) = \sum_{i=1}^{n} c_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$
    • Depth and Opacity Rendering: $D(\boldsymbol{p}) = \sum_{i=1}^{n} d_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$ and $O(\boldsymbol{p}) = \sum_{i=1}^{n} \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$. Here, $c_i$ and $\alpha_i$ are the color and opacity of the $i$-th Gaussian, and $d_i$ is its depth along the camera ray.
  • Static/Dynamic Separation:

    • To initialize the Gaussians, the system first needs to know which parts of the scene are moving. A pre-trained object detector (YOLOv9) is used to generate a motion mask for known dynamic object categories. For unknown objects, this is combined with optical flow estimation.
    • During initialization, new Gaussians created from pixels within the motion mask are labeled as dynamic ($\mathcal{G}_{dy}$), while all others are labeled as static ($\mathcal{G}_{st}$).
  • Dynamic Motion Modeling:

    • Inspired by SC-GS, the motion of the dynamic Gaussians is controlled by a set of sparse control points. These points are initialized within the dynamic regions of the first frame.
    • A small MLP, denoted $\Psi$, learns a time-varying 6-DoF (Degrees of Freedom) transformation (rotation $\mathbf{R}^t$ and translation $\mathbf{T}^t$) for each control point $P_k$ at time $t$: $\Psi(P_k, t) = [\mathbf{R}^t, \mathbf{T}^t]$.
    • The transformation of any individual dynamic Gaussian is then computed by interpolating the transformations of its $K$ nearest control points using Linear Blend Skinning (LBS). This allows a compact set of control points to define a dense motion field for all dynamic objects (see the sketches after this list).
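
To make the front-to-back alpha-blending equations above concrete, here is a minimal NumPy sketch (not the paper's CUDA rasterizer) that composites color, depth, and opacity for a single pixel from depth-sorted Gaussian fragments; the fragment colors, depths, and per-pixel opacities are assumed to come from an earlier projection and sorting step.

```python
import numpy as np

def composite_pixel(colors, depths, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (n, 3) colors of the depth-sorted Gaussians covering the pixel
    depths: (n,)   per-Gaussian depth along the camera ray
    alphas: (n,)   per-Gaussian opacity after the 2D Gaussian falloff
    Returns rendered color C(p), depth D(p), and opacity O(p).
    """
    C = np.zeros(3)
    D = 0.0
    O = 0.0
    transmittance = 1.0  # running product of (1 - alpha_j) over already-blended Gaussians
    for c, d, a in zip(colors, depths, alphas):
        w = a * transmittance          # alpha_i * prod_{j<i} (1 - alpha_j)
        C += w * c
        D += w * d
        O += w
        transmittance *= (1.0 - a)
    return C, D, O

# toy usage: two semi-transparent Gaussians in front of an opaque one
C, D, O = composite_pixel(
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    depths=np.array([0.5, 1.0, 2.0]),
    alphas=np.array([0.3, 0.5, 1.0]),
)
```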
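
Below is a minimal sketch of the control-point deformation idea: dynamic Gaussian centers are moved by blending the 6-DoF transforms of their $K$ nearest control points with linear blend skinning. The MLP $\Psi$ is abstracted away (its predicted rotations and translations are passed in as arrays), and the inverse-distance blend weights are an assumption rather than the paper's exact weighting scheme.

```python
import numpy as np

def deform_gaussians(mu, control_pts, control_R, control_T, K=4, eps=1e-8):
    """Move dynamic Gaussian centers by blending the transforms of their
    K nearest control points (linear blend skinning).

    mu:          (M, 3) dynamic Gaussian centers in the canonical frame
    control_pts: (P, 3) control point positions
    control_R:   (P, 3, 3) rotations predicted (e.g. by the MLP) for time t
    control_T:   (P, 3)    translations predicted for time t
    """
    deformed = np.empty_like(mu)
    for m, x in enumerate(mu):
        d = np.linalg.norm(control_pts - x, axis=1)   # distances to all control points
        nn = np.argsort(d)[:K]                        # K nearest control points
        w = 1.0 / (d[nn] + eps)
        w /= w.sum()                                  # normalized blend weights (assumption)
        blended = np.zeros(3)
        for wk, k in zip(w, nn):
            pk = control_pts[k]
            # rigidly transform x by control point k's transform, then blend
            blended += wk * (control_R[k] @ (x - pk) + pk + control_T[k])
        deformed[m] = blended
    return deformed

# toy usage with identity rotations and a small upward translation
P, M = 5, 20
rng = np.random.default_rng(0)
ctrl = rng.uniform(-1, 1, (P, 3))
R = np.repeat(np.eye(3)[None], P, axis=0)
T = np.tile([0.0, 0.1, 0.0], (P, 1))
mu_t = deform_gaussians(rng.uniform(-1, 1, (M, 3)), ctrl, R, T)
```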

4.2. Tracking

  • The goal of the tracking module is to estimate the camera's current pose. To prevent moving objects from corrupting the pose estimate, only the static Gaussians ($\mathcal{G}_{st}$) are used for rendering during this stage.

  • The camera pose is optimized by minimizing an $L_1$ loss between the rendered color/depth maps (from static Gaussians) and the observed input images. The pre-computed motion mask $\mathcal{M}$ is used to ignore dynamic regions in the input images (a minimal sketch of this masked loss follows after this list).

  • Tracking Loss: $L_t = \sum_{\boldsymbol{p}} \mathcal{M} \left( \lambda\, O(\boldsymbol{p})\, L_1(C(\boldsymbol{p})) + (1 - \lambda)\, L_1(D(\boldsymbol{p})) \right)$

    • $L_t$: The total tracking loss.
    • $\mathcal{M}$: The motion mask, which is 0 for dynamic pixels and 1 for static pixels.
    • $C(\boldsymbol{p}), D(\boldsymbol{p}), O(\boldsymbol{p})$: Rendered color, depth, and opacity for pixel $\boldsymbol{p}$.
    • $\lambda$: A weighting factor between the color and depth losses.
    • The depth loss is only applied to reliable pixels (where rendered opacity is high and ground-truth depth is valid). The color loss is applied only in regions with high image gradients to focus on textured areas.
  • Keyframe Selection: A new frame is selected as a keyframe if it has low visual overlap with existing keyframes or if the camera has moved significantly. A novel criterion is added for dynamic scenes: a new keyframe is also added if the motion mask changes substantially, or at a fixed interval (every 5 frames), ensuring that the system captures the scene's temporal evolution even if the camera is stationary.
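
A minimal sketch of the masked tracking loss defined above, assuming the rendered color/depth/opacity maps, the observed RGB-D frame, and the motion mask are already available as arrays; the gradient-based pixel selection and the actual pose optimizer are omitted, and the weight `lam` and opacity threshold are placeholder values, not the paper's settings.

```python
import numpy as np

def tracking_loss(render_c, render_d, render_o, gt_c, gt_d, motion_mask,
                  lam=0.9, opacity_thresh=0.95):
    """Masked L1 tracking loss evaluated on static pixels only.

    render_c, gt_c: (H, W, 3) rendered / observed color
    render_d, gt_d: (H, W)    rendered / observed depth
    render_o:       (H, W)    rendered opacity
    motion_mask:    (H, W)    1 for static pixels, 0 for dynamic pixels
    """
    color_l1 = np.abs(render_c - gt_c).mean(axis=-1)          # per-pixel color error
    depth_valid = (gt_d > 0) & (render_o > opacity_thresh)    # keep only reliable depth pixels
    depth_l1 = np.where(depth_valid, np.abs(render_d - gt_d), 0.0)
    per_pixel = lam * render_o * color_l1 + (1.0 - lam) * depth_l1
    return float((motion_mask * per_pixel).sum())
```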

4.3. 4D Mapping

This module optimizes the 4D Gaussian representation, including the static and dynamic Gaussians and the dynamic motion network (MLP and control points).

  • Optical Flow Map Rendering: This is a core contribution. Instead of just rendering color and depth, the system renders a 2D optical flow map from the dynamic Gaussians.

    1. The positions of the dynamic Gaussians $\mathcal{G}_{dy}$ are computed at two different times, $t$ and $t-1$, using the motion network. This gives two sets of 3D points.
    2. Both sets of 3D points are projected onto the 2D image plane of the current keyframe (at time $t$). Let the projected 2D coordinates be $\mathbf{p}_t$ and $\mathbf{p}_{t-1}$.
    3. The displacement vector for each Gaussian is $d_x = \mathbf{p}_t - \mathbf{p}_{t-1}$.
    4. This displacement $d_x$ is treated like a color attribute and rendered with the same alpha-blending formula to produce a dense 2D optical flow map $F(\boldsymbol{p})$: $F(\boldsymbol{p}) = \sum_{i=1}^{n} d_x\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$ (see the flow-rendering sketch after this list).
  • Optical Flow Loss: The rendered optical flow map is supervised by a "ground truth" flow map estimated from the actual input images using a pre-trained model (RAFT). The loss is the $L_1$ difference between the rendered and estimated flow maps (both forward and backward) within the dynamic mask region: $\mathcal{L}_{flow} = \sum_{\boldsymbol{p}} \mathcal{M} \left( L_1\big(F(\boldsymbol{p})_{t \to t-1}, \mathrm{RAFT}(\boldsymbol{p})_{t \to t-1}\big) + L_1\big(F(\boldsymbol{p})_{t-1 \to t}, \mathrm{RAFT}(\boldsymbol{p})_{t-1 \to t}\big) \right)$

  • Joint Optimization: The mapping process optimizes all parameters using a combined loss function: $L_{mapping} = \lambda L_1(C(\boldsymbol{p})) + (1 - \lambda) L_1(D(\boldsymbol{p})) + \lambda_{flow} \mathcal{L}_{flow} + W_1 \mathcal{L}_{arap} + W_2 E_{iso}$. This loss includes:

    • Photometric loss: $L_1(C(\boldsymbol{p}))$
    • Geometric loss: $L_1(D(\boldsymbol{p}))$
    • Optical flow loss: $\mathcal{L}_{flow}$
    • ARAP loss ($\mathcal{L}_{arap}$): As-Rigid-As-Possible loss; it encourages the transformations of nearby control points to be similar, preventing unrealistic stretching and shearing of dynamic objects.
    • Isotropic loss ($E_{iso}$): A regularization term that penalizes overly stretched Gaussians: $E_{iso} = \sum_{i=1}^{|\mathcal{G}|} \| s_i - \tilde{s}_i \|_1$, where $s_i$ are the scales of the ellipsoid and $\tilde{s}_i$ is their mean.
  • Two-Stage Mapping Strategy: Optimization is performed in two stages for stability.

    1. Stage 1: Optimize only camera poses, exposure, and the dynamic deformation network. The Gaussian parameters are frozen. The loss weight for dynamic regions is doubled to focus on learning the motion correctly first.
    2. Stage 2: Optimize all parameters jointly, including the Gaussian attributes (color, shape, etc.).
  • Color Refinement: A final global optimization step is run for 1500 iterations over a random selection of keyframes to fine-tune the entire model, using a loss that includes a D-SSIM term for perceptual quality.
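
The sketch below illustrates the flow-map rendering and supervision described in the list above: dynamic Gaussian centers deformed at times $t-1$ and $t$ are projected with a pinhole model, their 2D displacements are alpha-composited per pixel exactly like a color channel, and the rendered flow is compared against a precomputed (e.g. RAFT) flow inside the dynamic mask. Projection, sorting, and per-pixel Gaussian coverage are heavily simplified assumptions rather than the paper's rasterizer, and the intrinsics below are illustrative values only.

```python
import numpy as np

def project(points, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    z = points[:, 2:3]
    return np.concatenate([fx * points[:, 0:1] / z + cx,
                           fy * points[:, 1:2] / z + cy], axis=1)

def render_flow_pixel(displacements, alphas):
    """Alpha-composite 2D displacement vectors for one pixel, just like a color."""
    flow = np.zeros(2)
    transmittance = 1.0
    for dx, a in zip(displacements, alphas):
        flow += a * transmittance * dx
        transmittance *= (1.0 - a)
    return flow

def flow_loss(rendered_fwd, raft_fwd, rendered_bwd, raft_bwd, dyn_mask):
    """L1 flow supervision (forward and backward) restricted to the dynamic mask."""
    m = dyn_mask[..., None]
    return float((m * np.abs(rendered_fwd - raft_fwd)).sum()
                 + (m * np.abs(rendered_bwd - raft_bwd)).sum())

# toy example: displacement of two Gaussians covering one pixel between t-1 and t
pts_t   = np.array([[0.1, 0.0, 2.0], [0.0, 0.1, 3.0]])   # deformed centers at time t
pts_tm1 = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 3.0]])   # deformed centers at time t-1
fx = fy = 525.0; cx, cy = 319.5, 239.5                    # illustrative intrinsics, not the paper's
disp = project(pts_t, fx, fy, cx, cy) - project(pts_tm1, fx, fy, cx, cy)
pixel_flow = render_flow_pixel(disp, alphas=np.array([0.6, 0.8]))
```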

5. Experimental Setup

  • Datasets:

    • TUM RGB-D: A widely used benchmark for evaluating RGB-D SLAM systems. It features various indoor scenes with different textures, lighting conditions, and camera motions. Some sequences contain dynamic elements, like people moving.
    • BONN RGB-D Dynamic: A dataset specifically designed for evaluating SLAM in dynamic environments. It includes scenes with more pronounced and challenging dynamic object motions.
  • Evaluation Metrics:

    1. Absolute Trajectory Error (ATE):
      • Conceptual Definition: Measures the global consistency of the estimated camera trajectory. It computes the direct difference between the ground truth and estimated camera poses after they have been aligned. The Root Mean Square Error (RMSE) of these differences is reported, providing a single number for overall trajectory accuracy. A lower ATE is better.
      • Mathematical Formula: $\text{ATE}_{\text{RMSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \operatorname{trans}\!\left( (\mathbf{T}_{gt,i})^{-1} \mathbf{S}\, \mathbf{T}_{est,i} \right) \right\|^2}$ (a small computation sketch follows after the baselines list).
      • Symbol Explanation:
        • $N$: Total number of camera poses (frames).
        • $\mathbf{T}_{gt,i} \in SE(3)$: Ground-truth camera pose at time $i$.
        • $\mathbf{T}_{est,i} \in SE(3)$: Estimated camera pose at time $i$.
        • $\mathbf{S} \in SE(3)$: The alignment transformation that maps the estimated trajectory onto the ground-truth trajectory (computed via least squares).
        • $\operatorname{trans}(\cdot)$: A function that extracts the translational part of a transformation matrix.
    2. Peak Signal-to-Noise Ratio (PSNR):
      • Conceptual Definition: Measures the quality of a reconstructed image by comparing it to a ground truth image. It is based on the Mean Squared Error (MSE). Higher PSNR indicates better reconstruction quality. It is measured in decibels (dB).
      • Mathematical Formula: $\text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_I^2}{\text{MSE}} \right)$, where $\text{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$.
      • Symbol Explanation:
        • $\text{MAX}_I$: Maximum possible pixel value of the image (e.g., 255 for 8-bit images).
        • $m, n$: Dimensions of the image.
        • $I(i,j)$: Pixel value at $(i,j)$ in the ground-truth image.
        • $K(i,j)$: Pixel value at $(i,j)$ in the rendered image.
    3. Structural Similarity Index Measure (SSIM):
      • Conceptual Definition: Measures image similarity by considering changes in structural information, luminance, and contrast. It is designed to be more consistent with human perception than PSNR. The value ranges from -1 to 1, where 1 indicates a perfect match. A higher SSIM is better.
      • Mathematical Formula: $\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
      • Symbol Explanation:
        • $x, y$: The two image windows being compared.
        • $\mu_x, \mu_y$: Mean intensities of $x$ and $y$.
        • $\sigma_x^2, \sigma_y^2$: Variances of $x$ and $y$.
        • $\sigma_{xy}$: Covariance of $x$ and $y$.
        • $c_1, c_2$: Small constants to stabilize the division.
    4. Learned Perceptual Image Patch Similarity (LPIPS):
      • Conceptual Definition: A metric that measures the perceptual distance between two images. It uses features extracted from deep neural networks (e.g., VGG) and computes the distance in this feature space. It is considered to align very well with human judgments of image similarity. A lower LPIPS is better.
      • Mathematical Formula (Conceptual): $d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\hat{y}_{hw}^l - \hat{y}_{0hw}^l) \|_2^2$
      • Symbol Explanation:
        • $d(x, x_0)$: The LPIPS distance between images $x$ and $x_0$.
        • $l$: Index of the layer in the deep network.
        • $\hat{y}^l, \hat{y}_0^l$: Feature activations from layer $l$ for each image.
        • $w_l$: A learned weight scaling the contribution of each layer.
        • $\odot$: Element-wise product.
  • Baselines:

    • GS-SLAM Methods: SplaTAM, Gaussian-SLAM, MonoGS (These are state-of-the-art but designed for static scenes).
    • Dynamic GS Method: SC-GS (This method requires known camera poses, so for the experiment, it was provided with the ground truth trajectory).
    • Dynamic NeRF-SLAM: RoDyn-SLAM (A state-of-the-art dynamic SLAM system based on NeRF, serving as a direct competitor).
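
For concreteness, here is a small sketch of the two most commonly reported metrics above: ATE RMSE over rigidly aligned trajectories (using a Kabsch/least-squares alignment of the translational components) and PSNR between a rendered and a ground-truth image. This is an illustrative re-implementation of the stated formulas, not the evaluation scripts used by the paper.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares SE(3) alignment (Kabsch) mapping estimated positions onto the GT."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est_positions, gt_positions):
    """ATE RMSE (in the input units) between aligned trajectories of (N, 3) positions."""
    R, t = align_rigid(est_positions, gt_positions)
    aligned = est_positions @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt_positions) ** 2, axis=1))))

def psnr(img, ref, max_val=255.0):
    """PSNR in dB between a rendered image and its ground truth."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# toy usage with a synthetic noisy trajectory and a nearly identical image pair
gt_traj = np.cumsum(np.random.default_rng(1).normal(size=(100, 3)) * 0.01, axis=0)
est_traj = gt_traj + np.random.default_rng(2).normal(size=(100, 3)) * 0.005
print(f"ATE RMSE: {ate_rmse(est_traj, gt_traj):.4f}")
print(f"PSNR: {psnr(np.full((4, 4), 200.0), np.full((4, 4), 198.0)):.2f} dB")
```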

6. Results & Analysis

Core Results

  • Pose Estimation (Tracking Performance):

    • The following are transcriptions of Table 1 and Table 4 from the paper.

    Table 1: Trajectory errors in ATE [cm] ↓ in the BONN sequences.

    | Method | ballon | ballon2 | ps_track | ps_track2 | sync | sync2 | p_no_box | p_no_box2 | p_no_box3 | Avg. |
    |---|---|---|---|---|---|---|---|---|---|---|
    | RoDyn-SLAM[20] | 7.9 | 11.5 | 14.5 | 13.8 | 1.3 | 1.4 | 4.9 | 6.2 | 10.2 | 7.9 |
    | MonoGS[30] | 29.6 | 22.1 | 54.5 | 36.9 | 68.5 | 0.56 | 71.5 | 10.7 | 3.6 | 33.1 |
    | Gaussian-SLAM[50] | 66.9 | 32.8 | 107.2 | 114.4 | 111.8 | 164.8 | 69.9 | 53.8 | 37.9 | 84.3 |
    | SplaTAM[21] | 32.9 | 30.4 | 77.8 | 116.7 | 59.5 | 66.7 | 91.9 | 18.5 | 17.1 | 56.8 |
    | Ours | 2.4 | 3.7 | 8.9 | 9.4 | 2.8 | 0.56 | 1.8 | 1.5 | 2.2 | 3.6 |

    Table 4: Trajectory errors in ATE [cm] ↓ in the TUM RGB-D sequences.

    | Method | fr3/sit_st | fr3/sit_xyz | fr3/sit_rpy | fr3/walk_st | fr3/walk_xyz | fr3/walk_rpy | Avg. |
    |---|---|---|---|---|---|---|---|
    | RoDyn-SLAM[20] | 1.5 | 5.6 | 5.7 | 1.7 | 8.3 | 8.1 | 5.1 |
    | MonoGS[30] | 0.48 | 1.7 | 6.1 | 21.9 | 30.7 | 34.2 | 15.8 |
    | Gaussian-SLAM[50] | 0.72 | 1.4 | 21.02 | 91.50 | 168.1 | 152.0 | 72.4 |
    | SplaTAM[21] | 0.52 | 1.5 | 11.8 | 83.2 | 134.2 | 142.3 | 62.2 |
    | Ours | 0.58 | 2.9 | 2.6 | 0.52 | 2.1 | 2.6 | 1.8 |
    • Analysis: The results are compelling. On the highly dynamic BONN dataset (Table 1), the proposed method achieves an average ATE of 3.6 cm, which is more than twice as accurate as the next best method (RoDyn-SLAM at 7.9 cm) and an order of magnitude better than other GS-SLAM methods (MonoGS, Gaussian-SLAM, SplaTAM), which clearly fail in these scenarios. Similarly, on the TUM RGB-D dataset (Table 4), while static-focused methods perform well on static sequences (fr3/sit_st), they degrade significantly on dynamic ones (walk_xyz, walk_rpy). The proposed method maintains very low error across all sequences, achieving the best average ATE of 1.8 cm. This demonstrates the effectiveness of separating static and dynamic Gaussians for robust tracking.
  • Quality of Reconstructed Map (Rendering Performance):

    • The following are transcriptions of Table 2 and Table 3 from the paper.

    Table 2: Rendering quality on TUM RGB-D sequences.

    | Method | Metric | fr3/sit_st | fr3/sit_xyz | fr3/sit_rpy | fr3/walk_st | fr3/walk_xyz | fr3/walk_rpy | Avg. |
    |---|---|---|---|---|---|---|---|---|
    | MonoGS[30] | PSNR [dB] ↑ | 19.95 | 23.92 | 16.99 | 16.47 | 14.02 | 15.12 | 17.74 |
    | | SSIM ↑ | 0.739 | 0.803 | 0.572 | 0.604 | 0.436 | 0.497 | 0.608 |
    | | LPIPS ↓ | 0.213 | 0.182 | 0.405 | 0.355 | 0.581 | 0.56 | 0.382 |
    | Gaussian-SLAM[50] | PSNR [dB] ↑ | 18.57 | 19.22 | 16.75 | 14.91 | 14.67 | 14.5 | 16.43 |
    | | SSIM ↑ | 0.848 | 0.796 | 0.652 | 0.607 | 0.483 | 0.467 | 0.642 |
    | | LPIPS ↓ | 0.291 | 0.326 | 0.521 | 0.489 | 0.626 | 0.630 | 0.480 |
    | SplaTAM[21] | PSNR [dB] ↑ | 24.12 | 22.07 | 19.97 | 16.70 | 17.03 | 16.54 | 19.40 |
    | | SSIM ↑ | 0.915 | 0.879 | 0.799 | 0.688 | 0.650 | 0.635 | 0.757 |
    | | LPIPS ↓ | 0.101 | 0.163 | 0.205 | 0.287 | 0.339 | 0.353 | 0.241 |
    | SC-GS[18] | PSNR [dB] ↑ | 27.01 | 21.45 | 18.93 | 20.99 | 19.89 | 16.44 | 20.78 |
    | | SSIM ↑ | 0.900 | 0.686 | 0.529 | 0.762 | 0.590 | 0.475 | 0.657 |
    | | LPIPS ↓ | 0.182 | 0.369 | 0.512 | 0.291 | 0.470 | 0.554 | 0.396 |
    | Ours | PSNR [dB] ↑ | 27.68 | 24.37 | 20.71 | 22.99 | 19.83 | 19.22 | 22.46 |
    | | SSIM ↑ | 0.892 | 0.822 | 0.746 | 0.820 | 0.730 | 0.708 | 0.786 |
    | | LPIPS ↓ | 0.116 | 0.179 | 0.265 | 0.195 | 0.281 | 0.337 | 0.228 |

    Table 3: Rendering quality on BONN sequences.

    | Method | Metric | ballon | ballon2 | ps_track | ps_track2 | sync | sync2 | p_no_box | p_no_box2 | p_no_box3 | Avg. |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | MonoGS[30] | PSNR [dB] ↑ | 21.35 | 20.22 | 20.53 | 20.09 | 22.03 | 20.55 | 20.764 | 19.38 | 24.81 | 21.06 |
    | | SSIM ↑ | 0.803 | 0.758 | 0.779 | 0.718 | 0.766 | 0.841 | 0.748 | 0.753 | 0.857 | 0.780 |
    | | LPIPS ↓ | 0.316 | 0.354 | 0.408 | 0.426 | 0.328 | 0.5210 | 0.428 | 0.372 | 0.243 | 0.342 |
    | Gaussian-SLAM[50] | PSNR [dB] ↑ | 20.45 | 18.55 | 19.60 | 19.09 | 21.04 | 21.35 | 19.99 | 20.35 | 21.22 | 20.18 |
    | | SSIM ↑ | 0.792 | 0.718 | 0.744 | 0.719 | 0.784 | 0.837 | 0.750 | 0.768 | 0.814 | 0.769 |
    | | LPIPS ↓ | 0.457 | 0.480 | 0.484 | 0.496 | 0.402 | 0.364 | 0.509 | 0.493 | 0.441 | 0.458 |
    | SplaTAM[21] | PSNR [dB] ↑ | 19.65 | 17.67 | 18.30 | 15.57 | 19.33 | 19.67 | 20.81 | 21.69 | 21.41 | 19.34 |
    | | SSIM ↑ | 0.781 | 0.702 | 0.670 | 0.606 | 0.776 | 0.730 | 0.824 | 0.852 | 0.873 | 0.757 |
    | | LPIPS ↓ | 0.211 | 0.280 | 0.283 | 0.331 | 0.227 | 0.258 | 0.191 | 0.165 | 0.152 | 0.233 |
    | SC-GS[18] | PSNR [dB] ↑ | 22.3 | 21.38 | - | - | 23.62 | 22.74 | 20.60 | 21.55 | 19.24 | 21.63 |
    | | SSIM ↑ | 0.737 | 0.708 | - | - | 0.788 | 0.801 | 0.688 | 0.722 | 0.628 | 0.724 |
    | | LPIPS ↓ | 0.448 | 0.450 | - | - | 0.427 | 0.359 | 0.515 | 0.491 | 0.539 | 0.461 |
    | Ours | PSNR [dB] ↑ | 25.90 | 22.71 | 21.78 | 20.65 | 23.25 | 25.42 | 23.14 | 24.28 | 25.88 | 23.66 |
    | | SSIM ↑ | 0.874 | 0.838 | 0.832 | 0.820 | 0.812 | 0.892 | 0.845 | 0.873 | 0.886 | 0.852 |
    | | LPIPS ↓ | 0.234 | 0.264 | 0.289 | 0.294 | 0.250 | 0.169 | 0.239 | 0.224 | 0.207 | 0.241 |
    • Analysis: On both datasets, the proposed method achieves the best average PSNR, SSIM, and LPIPS scores. This indicates that it not only tracks accurately but also produces the highest-fidelity renderings. The visual results in Images 3, 5, 7, and 8 corroborate these numbers. Other methods either produce blurry artifacts for moving people (MonoGS, SplaTAM) or fail to reconstruct parts of the scene (SplaTAM), while the proposed method renders both the static background and the dynamic person sharply and correctly.

      Image 3 (Figure 3 in the paper, visual comparison of rendered images on the TUM RGB-D dataset): The proposed method ("Ours") produces renderings that are visually closest to the Ground Truth, with sharper details on the moving people and fewer artifacts than MonoGS, SplaTAM, and SC-GS.

      Image 5 (BONN qualitative comparison; additional qualitative results are provided in the paper's supplementary material): In the top row (placing_no_box3), other methods show significant motion blur on the person, while the proposed method captures the pose crisply. In the bottom row (synchronous), the ghosting and blurring artifacts on the moving people are severe for the competitors, whereas the proposed method's reconstruction is much cleaner.

Ablations / Parameter Sensitivity

  • Mapping Strategy:

    Image 4 (Figure 4 in the paper, comparison of rendering results under different mapping strategies): This figure ablates the keyframe selection strategy for the mapping module. Comparing images (b) through (f) to the ground truth (a), the choice of which keyframes to include in the optimization window significantly impacts quality. The final method (f), which optimizes 3 recent keyframes, 5 overlapping keyframes, and 2 random global keyframes, provides the best balance, resulting in a sharp reconstruction of both the dynamic person and the static background. Other strategies lead to blurriness or artifacts.

  • Optical-flow Loss and Separate Gaussians:

    • The following is a transcription of Table 5 from the paper.

    Table 5: Ablation study on BONN sequences (PSNR [dB] ↑).

    | Optical Flow | Separate Gaussians | sync | sync2 |
    |---|---|---|---|
    | × | × | 18.37 | 22.11 |
    | ✓ | × | 22.87 | 24.84 |
    | × | ✓ | 17.40 | 21.03 |
    | ✓ | ✓ | 23.25 | 25.42 |
    • Analysis: This ablation study is crucial.
      • Row 1 (Baseline): Without separating Gaussians and without optical flow loss, the performance is the lowest.
      • Row 2 (Adding Optical Flow): Adding the optical flow loss significantly boosts performance. This confirms that rendering optical flow provides a very effective supervisory signal for learning the motion.
      • Row 3 (Separating Gaussians only): Simply separating static and dynamic Gaussians without the optical flow supervision leads to worse results than the baseline. This suggests that without a strong motion constraint, the dynamic model struggles to learn, and the separation itself is not enough.
      • Row 4 (Full Method): The combination of both components—separating Gaussians and using optical flow loss—yields the best results. This demonstrates that the two contributions work synergistically.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces a novel and robust SLAM system for dynamic environments based on 4D Gaussian Splatting. By explicitly separating and modeling static and dynamic scene components and introducing a novel optical flow rendering loss for supervision, the method achieves state-of-the-art performance in both camera tracking accuracy and 4D map reconstruction quality. It effectively solves the long-standing problem of handling dynamic objects in SLAM without simply discarding them.

  • Limitations & Future Work:

    • Reliance on Pre-trained Models: The system relies on YOLOv9 for initial motion segmentation and RAFT for optical flow supervision. Its performance may be limited by the capabilities of these pre-trained models, especially for unseen object categories or challenging lighting conditions.
    • Motion Complexity: The motion model, based on sparse control points and an MLP, might struggle with highly complex, non-rigid deformations or topological changes (e.g., a person putting on a coat).
    • Initialization of Dynamic Objects: The paper mentions that for some sequences, the initial frame for dynamic object initialization is pre-specified. A fully automatic system would need a robust mechanism to detect and initialize new dynamic objects that appear mid-sequence.
    • The authors do not explicitly list future work, but natural extensions would be to improve the automation of dynamic object detection, handle more complex deformations, and explore monocular (RGB-only) inputs.
  • Personal Insights & Critique:

    • The core idea of rendering optical flow from the 3D representation itself to create a supervisory signal is highly innovative and powerful. It forms a self-consistent loop where the geometry and motion model must align with the 2D pixel motion observed in the images. This is a more direct constraint than relying solely on photometric error, which can be ambiguous.

    • The separation of Gaussians into static and dynamic sets is a clean and effective architectural choice. It allows the system to leverage the stability of the static world for tracking while dedicating specialized modeling capacity to the challenging dynamic parts.

    • The system demonstrates a significant step forward for practical applications like robotics and augmented reality, where environments are almost always dynamic. A robot equipped with such a system could not only navigate but also understand and interact with moving agents in its environment.

    • The quality of the dynamic object reconstruction, as shown in the figures, is remarkable for an online SLAM system. It suggests that high-fidelity 4D capture is becoming feasible in real-time applications.

      Image 6 (Figure 7 in the paper, novel view rendering and Gaussian visualizations on the fr3 sitting static sequence of TUM RGB-D): This figure showcases the quality of the underlying Gaussian representation. The rendered novel view (left) is photorealistic, and the visualization of the Gaussians (right) shows how the scene is composed of these primitives, capturing both detailed geometry and appearance.
