
MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM

Published: 09/25/2025

TL;DR Summary

MASt3R-Fusion integrates feed-forward visual pointmap regression with IMU/GNSS to overcome traditional SLAM limitations. It converts Sim(3) visual constraints into a metric SE(3) factor graph for robust multi-sensor fusion and hierarchical optimization. The system demonstrates substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems on both public and self-collected datasets.

Abstract

Visual SLAM is a cornerstone technique in robotics, autonomous driving and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion, a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visual alignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (https://github.com/GREAT-WHU/MASt3R-Fusion).


In-depth Reading


1. Bibliographic Information

  • Title: MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM
  • Authors: Yuxuan Zhou, Xingxing Li, Shengyu Li, Zhuohao Yan, Chunxi Xia, Shaoquan Feng
  • Affiliations: The authors are with the School of Geodesy and Geomatics, Wuhan University, China. This affiliation suggests strong expertise in sensor fusion, geodesy, and spatial information systems.
  • Journal/Conference: The paper is presented as a preprint on arXiv. The formatting and content suggest it is intended for a top-tier robotics or computer vision conference like ICRA (IEEE International Conference on Robotics and Automation) or CVPR (Conference on Computer Vision and Pattern Recognition).
  • Publication Year: 2025 (the preprint was posted to arXiv in September 2025; several cited works also carry 2025 dates).
  • Abstract: The paper introduces MASt3R-Fusion, a novel SLAM system that combines a modern feed-forward visual model (for regressing 3D pointmaps from images) with traditional multi-sensor fusion techniques. It tightly integrates visual information with Inertial Measurement Unit (IMU) and Global Navigation Satellite System (GNSS) data. The core technical innovation is a method to convert Sim(3) visual alignment constraints (which have scale ambiguity) into a metric-scale SE(3) factor graph. The system features a hierarchical design for both real-time local optimization and global optimization with loop closures. The authors claim significant improvements in accuracy and robustness over existing systems, validated on public and private datasets.
  • Original Source Link: https://arxiv.org/abs/2509.20757

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Traditional Visual SLAM systems fail in visually challenging scenarios (e.g., low-texture environments) and suffer from scale ambiguity (cannot determine the true size of the scene from a single camera). While recent deep learning-based "feed-forward" models like DUSt3R and MASt3R can generate high-quality 3D geometry directly from images using learned priors, they often discard the well-established benefits of fusing information from other sensors like IMU and GNSS.
    • Importance & Gaps: There is a critical need to merge the power of these new visual foundation models with the metric accuracy and robustness provided by inertial and global positioning sensors. Existing methods either rely solely on vision (leading to scale drift) or use older visual front-ends that are less robust. The key gap is the lack of a framework to tightly integrate the Sim(3) (scale-ambiguous) outputs of these new visual models into a standard, metric-scale SE(3) SLAM backend.
    • Innovation: MASt3R-Fusion proposes a principled way to bridge this gap. It introduces a novel framework that converts the powerful, dense visual constraints from a feed-forward model into a format compatible with a multi-sensor factor graph, enabling robust, real-time, and globally consistent SLAM.
  • Main Contributions / Findings (What):

    1. A Novel Fusion Framework: The paper presents a method to tightly fuse feed-forward pointmap regression with multi-sensor data by bridging Sim(3)-based visual constraints and a metric-scale SE(3) factor graph.
    2. Real-Time Visual-Inertial SLAM: A complete, real-time system is developed that provides accurate metric-scale pose estimation and dense 3D perception by combining the visual model with IMU data in a sliding-window optimizer.
    3. Globally Consistent SLAM: The system is extended with global optimization capabilities, incorporating aggressive loop closures (validated by a novel geometry-based filtering method) and GNSS data to achieve globally consistent, drift-free maps.
    4. Comprehensive Evaluation: The system's performance is validated on challenging public datasets (KITTI-360, SubT-MRS) and a self-collected urban dataset, demonstrating superior accuracy and robustness.
    5. Open Source Code: The authors commit to releasing the code to foster further research.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • SLAM (Simultaneous Localization and Mapping): The fundamental robotics problem of constructing a map of an unknown environment while simultaneously keeping track of the agent's location within it.
    • Visual SLAM (vSLAM): A type of SLAM that uses cameras as its primary sensor.
    • IMU (Inertial Measurement Unit): A sensor that measures acceleration and angular velocity. It provides high-frequency motion information and helps resolve scale ambiguity and motion blur in vSLAM.
    • GNSS (Global Navigation Satellite System): Provides absolute global position (e.g., GPS). It is used to eliminate long-term drift in SLAM systems.
    • Factor Graph Optimization: A state-of-the-art technique for solving the SLAM problem. It represents the robot's poses and environmental landmarks as nodes in a graph, and measurements (from sensors) as factors (edges) connecting them. The goal is to find the configuration of nodes that best satisfies all the measurement constraints.
    • Feed-Forward Pointmap Regression: A recent paradigm in 3D vision where a neural network takes one or more images as input and directly outputs a 3D point cloud (a "pointmap") for each image, all within a common, but arbitrarily scaled, reference frame. This leverages learned priors about 3D geometry.
    • SE(3) vs. Sim(3):
      • SE(3) (Special Euclidean Group of dimension 3): Represents rigid body transformations (rotation and translation) in 3D space. This is the standard for metric-scale SLAM.
      • Sim(3) (Similarity Group of dimension 3): Represents similarity transformations, which include rotation, translation, and a uniform scale factor. Monocular visual systems can only determine geometry up to scale, so their natural representation is Sim(3) (a short numeric illustration follows at the end of this section).
  • Previous Works & Technological Evolution:

    • Classical SLAM: Systems like ORB-SLAM3 represent the pinnacle of traditional, feature-based vSLAM. They are highly optimized but can struggle in texture-poor areas.
    • Deep Learning in SLAM: Early deep learning methods focused on replacing modules like feature detection (SuperPoint) or depth estimation. More advanced systems like DROID-SLAM use deep learning for end-to-end pose and geometry estimation through a differentiable bundle adjustment layer.
    • Feed-Forward Models: The introduction of DUSt3R and its successor MASt3R marked a shift. These models can perform robust dense matching and 3D reconstruction from a pair of images without needing camera poses beforehand, showing remarkable robustness to large viewpoint changes.
    • SLAM with Feed-Forward Models: Recent works like MASt3R-SLAM and VGGT-SLAM have started building SLAM systems on top of these models. However, they remain primarily vision-only, inheriting issues like scale drift.
    • Multi-Sensor Fusion: Systems like VINS-Fusion are the gold standard for tightly-coupled visual-inertial odometry, but they typically use traditional visual front-ends.
  • Differentiation: MASt3R-Fusion distinguishes itself by being the first system to create a tightly-coupled fusion between the new feed-forward visual paradigm and a full suite of complementary sensors (IMU, GNSS). Unlike pose-graph methods that loosely combine sensor data, it integrates all information at the factor level, preserving probabilistic information for optimal state estimation. The core novelty is the mathematical bridge it builds between the Sim(3) visual world and the SE(3) metric world.
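
To make the SE(3) vs. Sim(3) distinction concrete, the minimal Python sketch below (illustrative only, not from the paper) applies both kinds of transform to a toy pointmap. The two agree only when the scale factor is 1, which is exactly the degree of freedom a monocular pipeline cannot observe.

```python
import numpy as np

def se3_apply(R, t, X):
    """Rigid-body (SE(3)) transform: rotation R and translation t, applied to Nx3 points."""
    return (R @ X.T).T + t

def sim3_apply(s, R, t, X):
    """Similarity (Sim(3)) transform: uniform scale s on top of the rigid motion."""
    return s * (R @ X.T).T + t

# Toy pointmap and a 90-degree yaw rotation.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 4.0]])
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 0.0])

# Identical only when s == 1; the unknown s is the scale ambiguity of monocular vision.
print(se3_apply(R, t, X))
print(sim3_apply(1.7, R, t, X))
```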

4. Methodology (Core Technology & Implementation)

The system is architecturally split into a real-time component and a global optimization component, as shown in the flowchart below.

System flowchart of MASt3R-Fusion, divided into a real-time SLAM part and a global optimization part. The pipeline fuses image, IMU and GNSS data through frame initialization, tracking, frame matching, and local and global optimization to build and update map frames; green arrows mark the steps that rely on the feed-forward model.

A. Visual Measurement Based on Feed-Forward Model

The system uses MASt3R as its visual front-end.

  • Two-View Pointmap Regression: For any pair of images $\mathbf{I}_i$ and $\mathbf{I}_j$, the model encodes them and then jointly decodes them to produce two pointmaps, $\mathbf{X}_i^{ij}$ and $\mathbf{X}_j^{ij}$, and two descriptor maps, $\mathbf{D}_i^{ij}$ and $\mathbf{D}_j^{ij}$. Crucially, both pointmaps are expressed in a common reference frame (e.g., that of camera $i$).

    Diagram showing the two-view feed-forward model with an encoder-decoder structure: the encoder (Enc.) extracts features from the i-th and j-th frames, and the joint decoder (Dec.) outputs pointmaps (with depths) together with per-pixel descriptors and confidences.

  • Dense Matching: Dense correspondences $\hat{\mathbf{u}}_j^i$ (the pixel in image $i$ corresponding to a pixel in image $j$) are found in three steps:

    1. Geometric Matching: An initial match is found by minimizing the angular distance between rays defined by the pointmaps.
    2. Descriptor Refinement: The match is refined by searching for the highest descriptor similarity in the neighborhood of the geometric match.
    3. Sub-pixel Accuracy: To achieve higher precision, the descriptor map of one image is upsampled, and a final search is performed to get sub-pixel accurate matches.
  • Visual Constraint via Pointmap Alignment: Unlike traditional Bundle Adjustment (BA), which optimizes 3D point locations and camera poses simultaneously, this method leverages the high-quality structure from the feed-forward model. It assumes the shape of the pointmap $\mathbf{X}_j$ is correct and only its pose $\mathbf{S}_j^i \in \mathrm{Sim}(3)$ relative to frame $i$ is unknown.

    • The visual residual is a reprojection error, formulated as:
      $$\mathbf{r}_{ij}\left(\mathbf{S}_j^i\right) = \mathbf{u}_j^i - \pi\left(\mathbf{S}_j^i \circ \mathbf{X}_j\right),$$
      where $\pi(\cdot)$ is the camera projection function and $\mathbf{u}_j^i$ are the matched coordinates in image $i$. This process is visualized in the diagram below.

      Flowchart showing the matching and projection process used to construct the visual residual: the j-th and i-th views are matched ("Match"), and the $j \rightarrow i$ correspondences are fed to a projection module ("Proj.") together with the maintained pointmap, yielding the residual $\mathbf{r}_{ij}(\mathbf{S}_j^i)$.

    • The relative Sim(3) pose is found by minimizing this residual bi-directionally. For efficiency in the factor graph, the optimization problem is linearized, and the constraint is stored in Hessian form, $(\mathbf{H}_{ij}, \mathbf{v}_{ij})$. This compact representation contains all the information from the dense alignment (a minimal alignment sketch is given at the end of this subsection).

    • A special down-weighting mechanism is introduced to handle cases where depth uncertainty in one frame could cause large projection errors in another, which is common in large-scale forward-moving scenes.
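
To make the pointmap-alignment constraint more concrete, the following minimal Python sketch (an illustration under simplifying assumptions, not the authors' implementation) estimates a relative Sim(3) pose by Gauss-Newton on the reprojection residual above and then summarizes the result in Hessian form $(\mathbf{H}, \mathbf{v})$. It assumes a pinhole camera, parameterizes Sim(3) as an axis-angle rotation, a translation and a log-scale, and uses numerical Jacobians; names such as `gauss_newton_hessian` are hypothetical.

```python
import numpy as np

def rodrigues(w):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def project(P, fx, fy, cx, cy):
    """Pinhole projection of Nx3 camera-frame points to pixel coordinates."""
    return np.stack([fx * P[:, 0] / P[:, 2] + cx,
                     fy * P[:, 1] / P[:, 2] + cy], axis=1)

def residual(xi, X_j, u_i, intr):
    """Reprojection residual u_i - pi(S(xi) o X_j), with xi = (axis-angle, translation, log-scale)."""
    w, t, log_s = xi[:3], xi[3:6], xi[6]
    P = np.exp(log_s) * (rodrigues(w) @ X_j.T).T + t
    return (u_i - project(P, *intr)).ravel()

def numeric_jacobian(xi, X_j, u_i, intr, eps=1e-6):
    """Finite-difference Jacobian of the residual (a real system would use analytic Jacobians)."""
    r = residual(xi, X_j, u_i, intr)
    J = np.empty((r.size, 7))
    for k in range(7):
        d = np.zeros(7); d[k] = eps
        J[:, k] = (residual(xi + d, X_j, u_i, intr) - r) / eps
    return r, J

def gauss_newton_hessian(X_j, u_i, intr, iters=10):
    """Estimate the relative Sim(3) and return it with the Hessian-form factor (H, v)."""
    xi = np.zeros(7)
    for _ in range(iters):
        r, J = numeric_jacobian(xi, X_j, u_i, intr)
        H, g = J.T @ J, J.T @ r
        xi -= np.linalg.solve(H + 1e-9 * np.eye(7), g)   # Gauss-Newton step
    r, J = numeric_jacobian(xi, X_j, u_i, intr)          # re-linearize at the solution
    return xi, J.T @ J, J.T @ r                          # sign conventions may differ from the paper

# Synthetic check: recover a known Sim(3) from a toy pointmap and its projections.
rng = np.random.default_rng(0)
intr = (500.0, 500.0, 320.0, 240.0)                          # fx, fy, cx, cy
X_j = rng.uniform([-1, -1, 4], [1, 1, 8], size=(200, 3))     # pointmap of frame j
R_gt, t_gt, s_gt = rodrigues(np.array([0.02, -0.05, 0.10])), np.array([0.3, -0.1, 0.2]), 1.25
u_i = project(s_gt * (R_gt @ X_j.T).T + t_gt, *intr)         # "matched" pixels in frame i
xi, H, v = gauss_newton_hessian(X_j, u_i, intr)
print("estimated log-scale:", xi[6], " true:", np.log(s_gt))
```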

B. Real-Time SLAM with Multi-Sensor Fusion

This stage runs a sliding-window optimization to provide real-time state estimates.

  • Isomorphic Group Transformation: This is the key theoretical contribution for fusion. The Sim(3) transformation is represented isomorphically as an SE(3) transformation plus a separate scale factor: $\mathrm{Sim}(3) \cong \mathrm{SE}(3) \times \mathbb{R}$. The paper derives the linear relationship between the Lie algebras of the two representations:
    $$\begin{bmatrix} \boldsymbol{\omega} \\ \boldsymbol{\nu} \\ \sigma \end{bmatrix} = \underbrace{\begin{bmatrix} \mathbf{I} & & \\ & s\mathbf{I} & \\ & & 1 \end{bmatrix}}_{\boldsymbol{\Lambda}} \begin{bmatrix} \boldsymbol{\theta} \\ \boldsymbol{\tau} \\ \delta s \end{bmatrix}$$

    • Here, $(\boldsymbol{\omega}, \boldsymbol{\nu}, \sigma)$ are perturbations in the Sim(3) Lie algebra, and $(\boldsymbol{\theta}, \boldsymbol{\tau}, \delta s)$ are the corresponding perturbations of the SE(3) pose and scale. $\boldsymbol{\Lambda}$ is the transformation matrix, which allows the Sim(3) Hessian visual factor to be correctly applied to SE(3) state variables (a small numeric sketch of this conversion follows after the cost function below).
  • Multi-Sensor Factor Graph:

    • State Vector: The system maintains a state vector $\mathcal{X}_i = (\mathbf{T}_i, s_i, \mathbf{v}_i, \mathbf{b}_i)$ for each keyframe in a sliding window, where $\mathbf{T}_i \in \mathrm{SE}(3)$ is the pose, $s_i$ is the scale, $\mathbf{v}_i$ is the velocity, and $\mathbf{b}_i$ are the IMU biases.
    • Factors: The optimization minimizes the sum of squared errors from three types of factors:
      1. Visual Factors ($\mathbf{E}_v$): The dense Sim(3) visual constraints, transformed into the SE(3) state space using the isomorphic mapping.
      2. IMU Factors ($\mathbf{r}_b$): Standard IMU pre-integration constraints that link consecutive states, providing metric scale and gravity direction.
      3. Marginalization Factor ($\mathbf{E}_m$): A probabilistic summary of the information from keyframes that have been removed from the sliding window, ensuring no information is lost.
    • The final cost function for the real-time system is:
      $$\sum_{i \in \mathcal{W}} \left\| \mathbf{r}_b(\mathcal{X}_i, \mathcal{X}_{i+1}) \right\|^2 + \sum_{(i,j) \in \mathcal{E}} \mathbf{E}_v(\mathcal{X}_i, \mathcal{X}_j) + \mathbf{E}_m(\mathcal{X})$$
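
As a concrete (illustrative, not official) example of how the isomorphism above is used, the sketch below re-expresses a Sim(3) Hessian-form visual factor $(\mathbf{H}, \mathbf{v})$, ordered as (rotation, translation, scale), in the metric SE(3)-pose-plus-scale variables by applying $\boldsymbol{\Lambda}$ through the chain rule: $\mathbf{H}' = \boldsymbol{\Lambda}^{\top}\mathbf{H}\boldsymbol{\Lambda}$, $\mathbf{v}' = \boldsymbol{\Lambda}^{\top}\mathbf{v}$. The helper names and the 7-dimensional block ordering are assumptions for this sketch.

```python
import numpy as np

def lambda_matrix(s):
    """Block-diagonal Lambda with (omega, nu, sigma) = Lambda @ (theta, tau, delta_s):
    identity on the rotation block, s*I on the translation block, 1 on the scale block."""
    L = np.eye(7)
    L[3:6, 3:6] *= s
    return L

def convert_hessian_factor(H_sim3, v_sim3, s):
    """Re-express a Sim(3) Hessian-form factor in the SE(3)-pose-plus-scale
    perturbation used by the metric factor graph (chain rule through Lambda)."""
    L = lambda_matrix(s)
    return L.T @ H_sim3 @ L, L.T @ v_sim3

# Toy usage: build a random positive semi-definite "visual" Hessian and convert it.
rng = np.random.default_rng(1)
J = rng.normal(size=(40, 7))               # stand-in for the stacked alignment Jacobian
H_sim3, v_sim3 = J.T @ J, J.T @ rng.normal(size=40)
H_se3, v_se3 = convert_hessian_factor(H_sim3, v_sim3, s=1.8)   # s: current scale estimate
```

Factors converted this way can then be summed with the IMU pre-integration and marginalization terms of the sliding-window cost function above.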

C. Global SLAM

This offline or background process refines the entire trajectory using all available information.

  • Loop Closure:

    • Candidate Detection: Potential loop closures are first identified by finding images with similar global descriptors (derived from the model's encoder).
    • Geometric Filtering: A novel and efficient filtering step is proposed to prune false positives. It estimates the positional uncertainty accumulated by the visual-inertial odometry over time and rejects candidate pairs that remain geometrically implausible even after accounting for this uncertainty. This allows the system to be more aggressive in finding true loop closures without being overwhelmed by expensive dense matching checks on false positives (a simplified gating sketch is given at the end of this subsection).
  • Global Factor Graph Optimization:

    • A global factor graph is constructed containing all keyframes and measurements.
    • GNSS Factors ($\mathbf{r}_g$): GNSS position measurements are added as unary factors, anchoring the map to a global coordinate frame.
    • Loop Closure Factors: Verified loop closures are added as binary factors between non-adjacent keyframes.
    • Two-Stage Optimization:
      1. First, a robust optimization is performed where loop closure and GNSS factors are added as relative pose constraints with a robust kernel to reject outliers.
      2. After identifying inliers, a second, more accurate optimization is performed where the loop closure constraints are converted to the full-information Hessian form, leading to a near-optimal global estimate. This preserves more information than standard pose-graph optimization.
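
The paper's exact uncertainty propagation is not reproduced here, but the idea behind the geometric filtering can be sketched with a simple drift-based gate: a candidate pair is kept only if the apparent gap between the two keyframes could plausibly be explained by the drift accumulated along the path between them. The drift model, thresholds, and function names below are assumptions chosen for illustration.

```python
import numpy as np

def plausible_loop(positions, i, j, drift_rate=0.02, base_radius=5.0, k=3.0):
    """Simplified geometric gate for a loop-closure candidate (i, j).

    positions  : (N, 3) odometry positions of the keyframes.
    drift_rate : assumed positional drift accumulated per metre travelled.
    Returns True if the apparent separation of the two keyframes is small
    enough to be explained by odometry drift (plus a base search radius).
    """
    seg = np.diff(positions[min(i, j):max(i, j) + 1], axis=0)
    path_len = np.linalg.norm(seg, axis=1).sum()        # distance travelled between i and j
    sigma = drift_rate * path_len                       # crude bound on accumulated drift
    gap = np.linalg.norm(positions[i] - positions[j])   # apparent separation of the keyframes
    return gap <= base_radius + k * sigma

# Usage: gate descriptor-based candidates before running expensive dense matching.
traj = np.cumsum(np.random.default_rng(2).normal(scale=0.5, size=(500, 3)), axis=0)
candidates = [(10, 480), (20, 30)]
kept = [(i, j) for (i, j) in candidates if plausible_loop(traj, i, j)]
```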

5. Experimental Setup

  • Datasets:

    • KITTI-360: A large-scale autonomous driving dataset with kilometer-long trajectories in urban and suburban environments. Used to test large-scale VIO and global SLAM performance.
    • SubT-MRS: A dataset collected in challenging, unconventional environments like karst caves and indoor-outdoor transitions. Used to test the model's generalization ability.
    • Wuhan Urban Dataset: A self-collected dataset with high-quality GNSS ground truth, used to rigorously evaluate the visual-inertial-GNSS fusion performance in dense urban scenarios.
  • Evaluation Metrics:

    • Relative Pose Error (RPE): Measures local odometry drift. Reported as translational error t_rel (%) and rotational error r_rel (°/100m).
    • Absolute Trajectory Error (ATE): Measures the global consistency of the trajectory against ground truth. Reported as the root-mean-square error (RMSE) in meters (a minimal computation sketch follows after the baseline list).
  • Baselines: The method is compared against a comprehensive set of state-of-the-art systems:

    • VINS-Fusion, ORB-SLAM3: Classical feature-based visual-inertial systems.
    • DM-VIO: A modern direct-method-based visual-inertial system.
    • DBA-Fusion: A learning-based visual-inertial system that uses a differentiable BA layer.
    • MASt3R-SLAM: The vision-only SLAM system based on the same visual front-end.
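
For reference, ATE RMSE is conventionally computed by first aligning the estimated trajectory to the ground truth (Umeyama least-squares alignment, rigid for metric-scale systems, with an extra scale factor for monocular ones) and then taking the RMSE of the remaining position differences. The sketch below shows this standard computation; it is generic evaluation code, not taken from the paper.

```python
import numpy as np

def align_umeyama(est, gt, with_scale=False):
    """Least-squares SE(3) (optionally Sim(3)) alignment of est onto gt, both Nx3."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:         # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = float(np.trace(np.diag(D) @ S) / E.var(axis=0).sum()) if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt, with_scale=False):
    """Absolute Trajectory Error: RMSE of positions after trajectory alignment."""
    s, R, t = align_umeyama(est, gt, with_scale)
    aligned = s * (R @ est.T).T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Example: a noisy, slightly rotated and shifted copy of a ground-truth trajectory.
gt = np.cumsum(np.random.default_rng(3).normal(scale=0.3, size=(1000, 3)), axis=0)
yaw = np.deg2rad(5.0)
Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0], [np.sin(yaw), np.cos(yaw), 0], [0, 0, 1]])
est = (Rz @ gt.T).T + np.array([2.0, -1.0, 0.5]) + np.random.default_rng(4).normal(scale=0.05, size=gt.shape)
print("ATE RMSE [m]:", ate_rmse(est, gt))
```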

6. Results & Analysis

A. KITTI-360 Dataset

  • VIO Performance (Table I): MASt3R-Fusion consistently achieves the lowest or second-lowest relative translation error across all sequences, outperforming both classic (VINS, ORB-SLAM3) and learning-based (DBA-Fusion) VIO systems. This highlights the benefit of the dense, high-quality visual constraints. In contrast, the vision-only MASt3R-SLAM fails dramatically due to severe scale drift, proving the necessity of IMU fusion.

  • Global SLAM Performance (Table II & Image 7): With loop closure activated, MASt3R-Fusion reduces the ATE by an order of magnitude compared to ORB-SLAM3 (e.g., 2.13m vs. 26.03m on Seq. 0000). The trajectories show a much better global alignment. This is attributed to more robust and numerous loop closures, including those with large viewpoint differences.

    Trajectories on KITTI-360, showing MASt3R-Fusion with and without loop closure compared to baselines and ground truth. Detected loop-closure segments are highlighted in red, and an error heatmap (greener = smaller error) together with the smooth optimized trajectory illustrates the accuracy and robustness of the loop-closure detection and global optimization.

  • 3D Perception (Image 5): Qualitatively, MASt3R-Fusion produces dense, metric-scale point clouds that are more complete and less noisy than DBA-Fusion and handle challenging objects better than the single-image depth model Metric3D v2.

    Qualitative comparison of 3D reconstruction results on KITTI-360.

B. SubT-MRS Dataset

  • Generalization (Tables III & IV, Image 6): In these unseen, challenging environments, MASt3R-Fusion again demonstrates superior performance. It achieves the lowest ATE in both real-time VIO and global SLAM modes, validating the generalization capability of the feed-forward visual model when properly regularized by an IMU. The system successfully finds associations even in texture-poor caves and across large viewpoint changes.

    Examples of successful pixel-level association on the challenging SubT-MRS dataset.

C. Wuhan Urban Dataset

  • V-I SLAM Performance (Image 9): In this large-scale urban environment, MASt3R-Fusion shows significantly less scale drift than VINS-Fusion. Adding loop closures further corrects attitude drift and improves global consistency.

    Trajectories on the Wuhan urban dataset, comparing VINS-Fusion with MASt3R-Fusion (with and without loop closure): (a) without loop closure, the estimated trajectories deviate noticeably from the ground truth (dashed); (b) with loop closure, the trajectories align much more closely with the ground truth, with zoomed insets highlighting the behavior at key turns.

  • V-I-GNSS Fusion Performance (Image 10): This is a key result. When GNSS is available but noisy (with gross errors and outages), the proposed method achieves decimeter-level accuracy. It significantly outperforms VINS-Fusion, which uses a looser pose-graph fusion. The tight coupling and full-information factor graph in MASt3R-Fusion allow the robust V-I odometry to effectively bridge GNSS gaps and reject outliers, resulting in a much smoother and more accurate trajectory.

    Plots of horizontal positioning error over time, showing MASt3R-Fusion's superior ability to handle noisy GNSS data compared to VINS-Fusion.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents MASt3R-Fusion, a high-functionality SLAM system that effectively integrates a state-of-the-art feed-forward visual model with IMU and GNSS data. The core contribution—a principled method to fuse Sim(3) visual constraints into a metric SE(3) factor graph—is shown to be highly effective. The system achieves state-of-the-art performance in terms of accuracy, robustness, and global consistency across a variety of challenging environments.

  • Limitations & Future Work: The authors plan to extend the work by incorporating semantic information and exploring more advanced scene representations (like NeRFs or 3D Gaussians) to support higher-level robotics tasks such as embodied navigation.

  • Personal Insights & Critique:

    • Strength: This is an excellent systems paper that elegantly solves a critical problem: how to make the new generation of powerful visual foundation models "play nice" with the established world of probabilistic sensor fusion. The isomorphic group transformation is a clean and effective solution. The geometric loop closure filtering is also a clever and practical contribution.
    • Weakness/Dependency: The system's performance is fundamentally tied to the performance of the underlying MASt3R model. If the visual model fails on a completely out-of-distribution scene, the entire system's performance could degrade. However, the tight IMU coupling is designed to mitigate exactly this risk.
    • Impact: This work provides a blueprint for future multi-sensor SLAM systems. As visual foundation models become even more powerful, frameworks like this that can ground their outputs in the metric world using other sensors will become increasingly vital for real-world applications in robotics and autonomous driving.
