Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras
TL;DR Summary
Photo-SLAM proposes a real-time SLAM framework for portable devices, addressing resource-intensive neural methods. It uses a hyper primitives map for joint explicit localization and implicit photorealistic mapping, enhanced by geometry-based densification and Gaussian-Pyramid-based learning, achieving state-of-the-art rendering quality at real-time speed, even on embedded platforms.
Abstract
The integration of neural rendering and the SLAM system recently showed promising results in joint localization and photorealistic view reconstruction. However, existing methods, fully relying on implicit representations, are so resource-hungry that they cannot run on portable devices, which deviates from the original intention of SLAM. In this paper, we present Photo-SLAM, a novel SLAM framework with a hyper primitives map. Specifically, we simultaneously exploit explicit geometric features for localization and learn implicit photometric features to represent the texture information of the observed environment. In addition to actively densifying hyper primitives based on geometric features, we further introduce a Gaussian-Pyramid-based training method to progressively learn multi-level features, enhancing photorealistic mapping performance. The extensive experiments with monocular, stereo, and RGB-D datasets prove that our proposed system Photo-SLAM significantly outperforms current state-of-the-art SLAM systems for online photorealistic mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time speed using an embedded platform such as Jetson AGX Orin, showing the potential of robotics applications.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras
- Authors: Huajian Huang (The Hong Kong University of Science and Technology), Longwei Li (Sun Yat-sen University), Hui Cheng (Sun Yat-sen University), Sai-Kit Yeung (The Hong Kong University of Science and Technology)
- Journal/Conference: Not explicitly stated in the text; the paper's structure, rigor, and references to top-tier conferences like CVPR, ECCV, and ICRA point to a high-impact computer vision venue (Photo-SLAM was in fact published at CVPR 2024).
- Publication Year: The content and references suggest a publication date around 2023.
- Abstract: The paper introduces Photo-SLAM, a novel framework that integrates simultaneous localization and mapping (SLAM) with photorealistic rendering. It addresses a key limitation of existing neural rendering SLAM systems: their high computational cost, which prevents real-time operation on portable devices. Photo-SLAM uses a unique hyper primitives map that combines explicit geometric features for efficient localization with implicit photometric features for high-quality rendering. The system introduces a geometry-based densification strategy and a Gaussian-Pyramid-based learning method to enhance mapping quality. Experiments show that Photo-SLAM significantly outperforms state-of-the-art methods in rendering quality (e.g., 30% higher PSNR) and speed (hundreds of times faster) on the Replica dataset. Crucially, it can run in real-time on embedded platforms like the Jetson AGX Orin, demonstrating its practical applicability for robotics.
- Original Source Link: The paper is available at /files/papers/68e0ab569cc40dff7dd2bb22/paper.pdf.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Traditional SLAM systems create geometrically accurate but visually simple maps (like point clouds or wireframes). Recently, researchers have integrated neural rendering techniques (like NeRF) into SLAM to create photorealistic 3D maps. However, these "Neural SLAM" systems are extremely computationally expensive, require powerful GPUs, and are too slow for real-time applications. They are often limited to RGB-D cameras and struggle with scalability. This deviates from the core purpose of SLAM, which is to enable real-time navigation for robots and devices in unknown environments.
- Importance & Gaps: There is a significant gap between the stunning visual quality of neural rendering and the real-time, resource-constrained requirements of practical robotics. Existing methods rely fully on implicit representations optimized by neural networks, which is a slow and ill-posed process, especially without dense depth information.
- Innovation: Photo-SLAM proposes a hybrid approach. It smartly separates the tasks: it uses a fast, robust, and proven classical method (graph-based optimization with explicit features) for localization and a highly efficient modern rendering technique (3D Gaussian Splatting) for mapping. This combination aims to achieve the best of both worlds: the real-time efficiency of classical SLAM and the visual fidelity of neural rendering.
- Main Contributions / Findings (What):
  - A Novel Hybrid Framework: The paper introduces the first SLAM system built on a hyper primitives map. This map structure cleverly stores both explicit geometric features (for localization) and learnable implicit photometric attributes (for rendering), enabling a system that supports monocular, stereo, and RGB-D inputs for both indoor and outdoor scenes.
  - Gaussian-Pyramid-Based Learning: A new progressive training strategy trains the map on increasingly detailed versions of the input images (from blurry to sharp). This coarse-to-fine approach significantly improves the quality of the photorealistic map, especially in challenging monocular scenarios where depth information is sparse.
  - Real-Time Performance on Embedded Systems: The entire system is implemented in C++ and CUDA, making it highly efficient. It achieves state-of-the-art performance while running in real-time not just on high-end desktops but also on portable platforms like the NVIDIA Jetson AGX Orin, a crucial step for practical robotics.
3. Prerequisite Knowledge & Related Work
To understand this paper, it's helpful to be familiar with the following concepts:
- SLAM (Simultaneous Localization and Mapping): The fundamental challenge for an autonomous agent (like a robot or drone) to build a map of an unknown environment while simultaneously determining its own position within that map.
- Explicit vs. Implicit Scene Representation:
- Explicit: Directly storing geometric data. Examples include point clouds (collections of 3D points), meshes (surfaces made of triangles), and voxels (3D grid). These are easy to query but can be memory-intensive for high detail.
- Implicit: Representing a scene as a continuous function. A neural network, for instance, can learn a function that outputs the color and density at any 3D coordinate. This is memory-efficient but requires a computationally expensive query (running the network) to get information. NeRF (Neural Radiance Fields) is a famous example.
- Graph Solver vs. Neural Solver:
- Graph Solver: The classical approach to SLAM optimization. It creates a "factor graph" where nodes represent camera poses and map points, and edges represent observations (e.g., a point seen from a certain pose). The solver's job is to adjust the nodes to minimize the errors (e.g., reprojection error) across the entire graph. ORB-SLAM is a prime example.
- Neural Solver: A modern approach where parts of the SLAM pipeline (or the entire pipeline) are replaced with deep learning models. For example, a neural network might directly predict camera poses from images or jointly optimize poses and a neural scene representation. DROID-SLAM is a well-known example.
- 3D Gaussian Splatting: A recent and revolutionary rendering technique. Instead of shooting rays (like in NeRF), it represents a scene as a collection of 3D ellipsoids (Gaussians). Each Gaussian has properties like position, rotation, scale (shape), color, and opacity. Rendering a view involves "splatting" (projecting) these 3D Gaussians onto the 2D image plane, which is extremely fast and produces high-quality results.
- ORB Features: ORB (Oriented FAST and Rotated BRIEF) is a very fast and efficient algorithm for detecting key interest points in an image and generating a unique descriptor for each point. It's a cornerstone of many classical real-time SLAM systems like ORB-SLAM.
- Spherical Harmonics (SH): A mathematical tool used to efficiently represent functions on a sphere. In computer graphics, they are used to model complex, view-dependent lighting effects (like reflections) without storing a huge amount of data.
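To make this concrete, below is a minimal C++ sketch (not from the paper) of how a view-dependent color channel can be evaluated from low-order SH coefficients. It shows degree 1 only (4 coefficients), whereas 3D Gaussian Splatting uses up to degree 3 (16 coefficients per channel); the 0.5 offset and clamp follow the reference 3DGS implementation.

```cpp
#include <array>
#include <cmath>

// Evaluate one color channel from degree-0 and degree-1 SH coefficients.
// `coeff` holds the 4 lowest-order coefficients; `dir` is the unit viewing
// direction toward the Gaussian. Constants are the standard real SH basis values.
float sh_to_color(const std::array<float, 4>& coeff,
                  const std::array<float, 3>& dir) {
    constexpr float kC0 = 0.28209479f;   // Y_{0,0}
    constexpr float kC1 = 0.48860251f;   // |Y_{1,m}| basis magnitude
    const float x = dir[0], y = dir[1], z = dir[2];
    float c = kC0 * coeff[0]
            - kC1 * y * coeff[1]         // Y_{1,-1} ~ -y
            + kC1 * z * coeff[2]         // Y_{1,0}  ~  z
            - kC1 * x * coeff[3];        // Y_{1,1}  ~ -x
    return std::fmax(c + 0.5f, 0.0f);    // offset and clamp, as in 3DGS
}
```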
Image 10(a) provides a clear taxonomy of SLAM systems. Photo-SLAM carves a unique niche by combining a Graph Solver and Explicit features for localization with an Implicit representation for photorealistic appearance.

- Differentiation:
  - vs. Classical SLAM (e.g., ORB-SLAM3): Classical systems are fast and accurate for localization but only produce sparse geometric maps, lacking visual realism. Photo-SLAM builds on their robust localization but adds a dense, photorealistic mapping layer.
  - vs. Implicit Neural SLAM (e.g., Nice-SLAM, ESLAM): These systems produce beautiful results but are slow because they jointly optimize camera poses and a complex neural representation (like an MLP or feature grids) through ray tracing. This process is computationally heavy and often requires depth data to converge. Photo-SLAM decouples localization from photorealistic mapping and uses the much faster 3D Gaussian Splatting for rendering, eliminating the need for slow ray tracing.
4. Methodology (Core Technology & Implementation)
Photo-SLAM's architecture, shown in Image 10(b), consists of four parallel threads that operate on a central data structure: the Hyper Primitives Map.

4.1. Hyper Primitives Map
This is the core data structure. A hyper primitive is essentially an enhanced 3D point. Each primitive stores:
- Geometric Information (Explicit):
  - P ∈ ℝ³: The 3D position of the point.
  - O ∈ ℝ²⁵⁶: The ORB feature descriptor extracted from the image, used for matching and localization.
- Photometric Information (Implicit/Learnable):
  - r ∈ ℝ⁴: A rotation (quaternion).
  - s ∈ ℝ³: A 3D scaling vector. Together, P, r, and s define a 3D Gaussian ellipsoid.
  - σ ∈ ℝ: A density (opacity) value.
  - SH ∈ ℝ¹⁶: Spherical Harmonic coefficients to model view-dependent color.

This hybrid structure allows the system to use the stable, explicit ORB features for classical localization while learning the photometric properties needed for high-quality rendering via 3D Gaussian Splatting.
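As a rough illustration, a single hyper primitive could be laid out as the following C++ struct. The type and field names are hypothetical, not taken from the authors' code; a real implementation would keep these attributes in GPU-friendly arrays rather than an array of structs.

```cpp
#include <array>
#include <cstdint>

// Hypothetical layout of one hyper primitive (illustrative only).
struct HyperPrimitive {
    // Explicit geometric attributes, consumed by the graph-based solver.
    std::array<float, 3>    position;   // P in R^3: 3D location
    std::array<uint8_t, 32> orb;        // O: 256-bit ORB descriptor

    // Implicit photometric attributes, optimized against the rendering loss.
    std::array<float, 4>  rotation;     // r: unit quaternion of the ellipsoid
    std::array<float, 3>  scale;        // s in R^3: per-axis extent
    float                 opacity;      // sigma: density used in alpha blending
    std::array<float, 16> sh;           // SH coefficients for view-dependent color
};
```

Splitting the attributes this way is what lets the tracking thread match ORB descriptors without touching the photometric fields, while the photorealistic mapping thread updates only the learnable half.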
4.2. Localization and Geometry Mapping
This part of the system is largely based on the robust principles of ORB-SLAM. It runs in two threads:
- Localization Thread (Tracking): For each new incoming frame, this thread estimates the camera's current pose by matching 2D keypoints in the frame to the 3D hyper primitives in the map. It minimizes the reprojection error using a motion-only bundle adjustment (a code sketch of one residual term follows the thread descriptions below). The optimization goal is:

  $$\{R, t\} = \mathop{\arg\min}_{R,\,t} \sum_{i} \rho\left( \left\| x_i - \pi\left( R\, X_i + t \right) \right\|^{2}_{\Sigma} \right)$$

  where:
  - {R, t}: The camera rotation and translation (pose) being optimized.
  - xᵢ: A 2D keypoint in the current image.
  - Xᵢ: The corresponding 3D hyper primitive from the map.
  - π(·): The projection function that maps a 3D point to a 2D pixel coordinate.
  - ρ(·): The Huber cost function, used to make the optimization robust to outliers.
  - Σ: A covariance matrix associated with the keypoint's scale.
- Geometry Mapping Thread: This thread runs in the background. It takes keyframes (select, high-quality frames) and performs a local bundle adjustment. This refines the 3D positions of hyper primitives and the poses of recent keyframes simultaneously, ensuring the geometric map stays consistent. It also creates new hyper primitives from unmatched keypoints via triangulation.
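Here is the residual sketch referenced above: a minimal, self-contained C++ version of one Huber-robust reprojection term, assuming Eigen and a pinhole camera model. The surrounding solver loop (e.g., Levenberg-Marquardt over all terms, as in g2o) and the choice of Huber threshold are omitted assumptions, not the paper's specification.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Pinhole projection pi: camera-frame 3D point -> pixel coordinates.
Eigen::Vector2d project(const Eigen::Vector3d& Xc,
                        double fx, double fy, double cx, double cy) {
    return {fx * Xc.x() / Xc.z() + cx, fy * Xc.y() / Xc.z() + cy};
}

// Huber cost rho applied to a squared error: quadratic near zero, linear in the tails.
double huber(double squared_err, double delta) {
    const double e = std::sqrt(squared_err);
    return (e <= delta) ? squared_err : 2.0 * delta * e - delta * delta;
}

// One term of the motion-only BA objective: rho(||x - pi(R*X + t)||^2_Sigma).
double reprojection_cost(const Eigen::Matrix3d& R, const Eigen::Vector3d& t,
                         const Eigen::Vector3d& X,          // hyper primitive position
                         const Eigen::Vector2d& x,          // observed 2D keypoint
                         const Eigen::Matrix2d& sigma_inv,  // inverse keypoint covariance
                         double fx, double fy, double cx, double cy,
                         double huber_delta) {
    const Eigen::Vector2d e = x - project(R * X + t, fx, fy, cx, cy);
    return huber(e.dot(sigma_inv * e), huber_delta);
}
```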
4.3. Photorealistic Mapping
This thread is responsible for making the map look beautiful. It optimizes the photometric properties of the hyper primitives (r, s, σ, SH).
- Rendering: The system renders an image from a given keyframe pose using a differentiable 3D Gaussian Splatting renderer. The color C of a pixel is calculated by blending the projected Gaussians from front to back (a code sketch follows this list):

  $$C = \sum_{i \in N} c_i\, \alpha_i \prod_{j=1}^{i-1} \left( 1 - \alpha_j \right)$$

  where:
  - cᵢ: The color of the i-th Gaussian (derived from its SH coefficients based on the viewing direction).
  - αᵢ: The opacity of the i-th Gaussian in the image, calculated as its density multiplied by the 2D Gaussian's value at that pixel.
- Optimization: The learnable parameters are updated by minimizing a photometric loss between the rendered image Î and the ground truth keyframe image I_gt:

  $$\mathcal{L} = (1 - \lambda)\, \left\| \hat{I} - I_{gt} \right\|_1 + \lambda\, \left( 1 - \operatorname{SSIM}\left( \hat{I}, I_{gt} \right) \right)$$

  This combines an L1 loss (pixel-wise color difference) with SSIM (Structural Similarity Index), which captures perceptual quality better than L1 alone. λ is a weighting factor, set to 0.2.
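The blending rule above is easy to state in code. Below is an illustrative CPU version of front-to-back compositing for a single pixel; the real renderer does this per tile in CUDA, and the early-termination threshold is an assumption borrowed from the 3DGS reference implementation.

```cpp
#include <array>
#include <vector>

struct Splat {                    // one Gaussian already projected onto this pixel
    std::array<float, 3> color;   // c_i, evaluated from SH for the current view
    float alpha;                  // alpha_i = density * 2D Gaussian value at the pixel
};

// C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
// `splats` must be depth-sorted, nearest first.
std::array<float, 3> composite(const std::vector<Splat>& splats) {
    std::array<float, 3> C{0.f, 0.f, 0.f};
    float T = 1.0f;                          // transmittance prod_{j<i} (1 - alpha_j)
    for (const Splat& s : splats) {
        for (int k = 0; k < 3; ++k) C[k] += s.color[k] * s.alpha * T;
        T *= 1.0f - s.alpha;
        if (T < 1e-4f) break;                // early exit once nearly opaque
    }
    return C;
}
```

And a sketch of how the photometric loss combines its two terms, using a single global SSIM window for brevity (real implementations use a sliding Gaussian window, but the per-window formula is identical):

```cpp
#include <cmath>
#include <cstddef>

// Single-window SSIM for pixel values in [0, 1].
double ssim_global(const float* a, const float* b, std::size_t n) {
    double mu_a = 0, mu_b = 0;
    for (std::size_t i = 0; i < n; ++i) { mu_a += a[i]; mu_b += b[i]; }
    mu_a /= n; mu_b /= n;
    double va = 0, vb = 0, cov = 0;
    for (std::size_t i = 0; i < n; ++i) {
        va  += (a[i] - mu_a) * (a[i] - mu_a);
        vb  += (b[i] - mu_b) * (b[i] - mu_b);
        cov += (a[i] - mu_a) * (b[i] - mu_b);
    }
    va /= n; vb /= n; cov /= n;
    const double C1 = 0.01 * 0.01, C2 = 0.03 * 0.03;  // standard SSIM constants
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) /
           ((mu_a * mu_a + mu_b * mu_b + C1) * (va + vb + C2));
}

// L = (1 - lambda) * L1 + lambda * (1 - SSIM), with lambda = 0.2 as in the paper.
double photometric_loss(const float* rendered, const float* gt,
                        std::size_t n, double lambda = 0.2) {
    double l1 = 0;
    for (std::size_t i = 0; i < n; ++i) l1 += std::fabs(rendered[i] - gt[i]);
    l1 /= static_cast<double>(n);
    return (1.0 - lambda) * l1 + lambda * (1.0 - ssim_global(rendered, gt, n));
}
```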
4.3.1. Geometry-based Densification
A key challenge in SLAM is that only a small fraction of detected 2D features are triangulated into stable 3D map points. However, the remaining "inactive" features often lie in textured areas crucial for visual quality. This method cleverly utilizes them:
- For each new keyframe, the system identifies these inactive 2D feature points.
- It estimates their depth (see the sketch after this list):
  - RGB-D: Directly use the depth value from the sensor.
  - Stereo: Use a stereo matching algorithm.
  - Monocular: Infer depth from the nearest neighboring active 3D point.
- New temporary hyper primitives are created at these estimated 3D locations. This provides the photorealistic mapping thread with more "raw material" to work with in complex regions, leading to a more detailed reconstruction.
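A minimal sketch of the per-sensor depth assignment described above. `stereo_match_depth` is a hypothetical placeholder for a real stereo matcher, and the nearest-neighbor search is shown as a brute-force loop for clarity.

```cpp
#include <limits>
#include <vector>

struct Keypoint2D  { float u, v; };
struct ActivePoint { float u, v, depth; };  // projected, triangulated map point

enum class Sensor { RGBD, Stereo, Monocular };

float stereo_match_depth(const Keypoint2D& kp);  // hypothetical helper

// Depth for an inactive feature, so it can seed a temporary hyper primitive.
float estimate_depth(const Keypoint2D& kp, Sensor sensor,
                     const std::vector<float>& depth_map, int width,
                     const std::vector<ActivePoint>& active) {
    switch (sensor) {
    case Sensor::RGBD:    // read the sensor depth map at the pixel
        return depth_map[static_cast<int>(kp.v) * width + static_cast<int>(kp.u)];
    case Sensor::Stereo:  // depth from stereo disparity
        return stereo_match_depth(kp);
    case Sensor::Monocular: {
        // borrow the depth of the nearest active 3D point in image space
        float best_d2 = std::numeric_limits<float>::max(), depth = 0.f;
        for (const ActivePoint& p : active) {
            const float du = p.u - kp.u, dv = p.v - kp.v;
            const float d2 = du * du + dv * dv;
            if (d2 < best_d2) { best_d2 = d2; depth = p.depth; }
        }
        return depth;
    }
    }
    return 0.f;
}
```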
4.4. Gaussian-Pyramid-Based Learning (GP Learning)
This is the paper's most novel contribution to the optimization process. Instead of always training against the original, high-resolution image, it uses a progressive, coarse-to-fine approach.

As shown in Image 1(e), the process works as follows:
- An image pyramid is created from the ground truth image by repeatedly blurring and downsampling it.
- Initial Training Stage (t₀): The hyper primitives are optimized to reconstruct the lowest-resolution (most blurry) image in the pyramid. This forces the model to learn the coarse, overall structure and color of the scene first.
- Intermediate Stages (t₁ to tₙ₋₁): As training progresses, the optimization target shifts to increasingly higher-resolution images from the pyramid. At each stage, more hyper primitives can be densified.
- Final Stage (tₙ): The model is fine-tuned on the original, full-resolution ground truth image.
This progressive learning acts as a curriculum, guiding the optimization to avoid getting stuck in bad local minima and leading to significantly better final rendering quality, particularly for the difficult monocular case where initial geometry is uncertain.
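A compact sketch of the mechanism, using OpenCV's `cv::pyrDown` to build the pyramid; the level count and the per-level step budget are illustrative hyperparameters, not values from the paper.

```cpp
#include <opencv2/imgproc.hpp>
#include <cstddef>
#include <vector>

// Level 0 is the full-resolution ground truth; each higher level is blurred
// and downsampled by 2, so the last level is the blurriest target.
std::vector<cv::Mat> build_pyramid(const cv::Mat& image, int levels) {
    std::vector<cv::Mat> pyr{image};
    for (int l = 1; l < levels; ++l) {
        cv::Mat down;
        cv::pyrDown(pyr.back(), down);   // Gaussian blur + 2x downsample
        pyr.push_back(down);
    }
    return pyr;
}

// Coarse-to-fine schedule: start supervising against the blurriest level and
// move one level sharper every `steps_per_level` iterations, ending at level 0.
const cv::Mat& training_target(const std::vector<cv::Mat>& pyr,
                               int step, int steps_per_level) {
    int level = static_cast<int>(pyr.size()) - 1 - step / steps_per_level;
    if (level < 0) level = 0;            // fine-tune on full resolution
    return pyr[static_cast<std::size_t>(level)];
}
```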
4.5. Loop Closure
When the system recognizes a place it has seen before, the loop closure module calculates a similarity transformation to correct the accumulated drift in the trajectory. This correction is applied to the camera poses and the hyper primitives, improving global consistency for both the geometric map and the photorealistic rendering (reducing ghosting artifacts).
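Conceptually, the correction is a similarity transform applied to everything attached to the corrected keyframes. Below is a minimal sketch, assuming Eigen, of what applying Sim(3) = (s, R, t) to the hyper primitives could look like; updating each ellipsoid's rotation quaternion (and SH orientation) is omitted for brevity.

```cpp
#include <Eigen/Dense>
#include <vector>

// Apply a loop-closure similarity correction to map primitives: positions are
// transformed by P' = s * R * P + t, and each ellipsoid's scale grows by s so
// the splats stay consistent with the rescaled geometry.
void apply_sim3(std::vector<Eigen::Vector3d>& positions,
                std::vector<Eigen::Vector3d>& scales,
                double s, const Eigen::Matrix3d& R, const Eigen::Vector3d& t) {
    for (Eigen::Vector3d& P : positions) P = s * (R * P) + t;
    for (Eigen::Vector3d& sc : scales)   sc *= s;
}
```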
5. Experimental Setup
- Datasets:
- Replica: A high-quality synthetic dataset of indoor environments, ideal for evaluating rendering quality. Used for monocular and RGB-D tests.
- TUM RGB-D: A real-world dataset of indoor scenes, commonly used for benchmarking RGB-D SLAM. Used for monocular and RGB-D tests.
- EuRoC MAV: A real-world dataset captured from a micro aerial vehicle, used for benchmarking stereo and visual-inertial SLAM. Used for stereo tests.
- Custom Outdoor Dataset: Captured with a ZED 2 stereo camera to demonstrate performance in unbounded outdoor environments.
- Evaluation Metrics:
  - Localization Accuracy: ATE (Absolute Trajectory Error), which measures the global consistency of the estimated camera trajectory against the ground truth. Lower is better.
  - Mapping Quality:
    - PSNR (Peak Signal-to-Noise Ratio): Measures pixel-wise reconstruction quality. Higher is better.
    - SSIM (Structural Similarity Index): Measures perceptual similarity between images. Higher is better.
    - LPIPS (Learned Perceptual Image Patch Similarity): Uses a deep neural network to measure perceptual similarity, often correlating better with human judgment. Lower is better.
  - Resource Usage: Tracking FPS (frames per second), Rendering FPS, and GPU Memory Usage.
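For reference, PSNR is simple to compute; a minimal C++ version for images normalized to [0, 1]:

```cpp
#include <cmath>
#include <cstddef>

// PSNR in decibels between two images with pixel values in [0, 1].
// Assumes the images differ somewhere (MSE > 0).
double psnr(const float* a, const float* b, std::size_t n) {
    double mse = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = static_cast<double>(a[i]) - b[i];
        mse += d * d;
    }
    mse /= static_cast<double>(n);
    return 10.0 * std::log10(1.0 / mse);  // MAX_I = 1 for normalized pixels
}
```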
- Baselines: A comprehensive set of state-of-the-art systems was used for comparison, including:
  - Classical SLAM: ORB-SLAM3
  - Classical Dense Reconstruction: BundleFusion
  - Deep Learning SLAM: DROID-SLAM
  - Neural Rendering SLAM: Nice-SLAM, Orbeez-SLAM, ESLAM, Co-SLAM, Point-SLAM, Go-SLAM
6. Results & Analysis
6.1. Core Results (Replica Dataset)
The results on the Replica dataset (Table 1 in the paper) are striking:
- Localization (ATE): Photo-SLAM achieves competitive localization accuracy, on par with top methods like DROID-SLAM and better than most neural rendering SLAM systems.
- Mapping Quality: Photo-SLAM dominates.
- In the monocular setting, it achieves a PSNR of 33.3, while competitors like Orbeez-SLAM and Go-SLAM are at 23.2 and 21.2, respectively. Nice-SLAM fails completely without depth supervision.
- In the RGB-D setting, it achieves a PSNR of 34.9, outperforming all other real-time methods, including Point-SLAM (34.6), ESLAM (32.5), and Co-SLAM (30.2).
- Performance: This is where Photo-SLAM truly shines. It achieves a rendering speed of ~900-1000 FPS on a desktop GPU, hundreds of times faster than competitors, whose rendering speeds are typically in the range of 1-4 FPS. Furthermore, it runs at over 40 FPS for tracking and can operate in real-time (~18 FPS tracking, ~95 FPS rendering) on a Jetson AGX Orin embedded device, a feat unattainable by other methods.

The qualitative results in Figure 1 and the supplementary material visually confirm these numbers, showing crisp, detailed reconstructions where other methods produce blurry or artifact-ridden images.
6.2. Tracking Stability Analysis

Image 5 shows the per-frame tracking time. While methods like Go-SLAM have a reasonable average tracking time, they suffer from huge latency spikes (up to 1 second) due to expensive global optimization. In contrast, Photo-SLAM's tracking time is consistently low and stable (average 20ms), which is critical for real-time robotic applications.
6.3. Ablation Study
The ablation study in Table 4 validates the effectiveness of the two key proposed components: Geo (geometry-based densification) and GP (Gaussian-Pyramid-based learning).
- Without Geo: Removing the densification based on inactive features leads to a significant drop in PSNR (e.g., from 33.3 to 31.3 in the monocular case). The rendered image shows artifacts, especially in texture-rich areas like the ceiling (Image 3(a)). This proves that adding more primitives in the right places is crucial for detail.
- Without GP: Removing the progressive learning strategy is catastrophic, especially for the monocular case. The PSNR plummets from 33.3 to 20.0. The resulting image (Image 3(b)) is extremely blurry. This happens because, without the coarse-to-fine guidance, the optimization process incorrectly positions the newly densified primitives and fails to converge properly.
- Conclusion: Both Geo and GP are essential and work synergistically. Geo provides the necessary density of primitives, while GP ensures they are optimized correctly to produce a high-quality map.
6.4. Online vs. Offline Mapping
The supplementary material includes a fascinating comparison (Table 5) against the offline 3D Gaussian Splatting (3DGS) method. When given the same amount of time and the same camera poses, Photo-SLAM (online) produces a map that is much smaller (31-35 MB vs. 144-219 MB) and renders much faster (~900-1000 FPS vs. ~450-480 FPS) while achieving comparable or even superior rendering quality to some 3DGS variants. This demonstrates that Photo-SLAM's online densification and optimization strategies create a more efficient and compact scene representation than offline methods, which tend to naively add points.
7. Conclusion & Reflections
- Conclusion Summary: Photo-SLAM successfully bridges the gap between traditional, efficient SLAM and modern, high-fidelity neural rendering. By introducing the hyper primitives map and a hybrid architecture, it achieves real-time performance, superior photorealistic mapping quality, and remarkable efficiency. The novel Gaussian-Pyramid-based learning method is a key enabler for high-quality reconstruction, especially in challenging monocular settings. The ability to run on embedded platforms like the Jetson AGX Orin makes it one of the first systems of its kind with clear potential for real-world robotics applications.
- Limitations & Future Work:
- Dynamic Scenes: The current system assumes a static environment. It would fail or produce significant artifacts in scenes with moving objects or people. Extending it to handle dynamic elements is a natural next step.
- Textureless Regions: The localization front-end relies on ORB features, which can struggle in environments with little to no texture. This could lead to tracking failure.
- Relighting: The current model captures the scene under fixed lighting conditions. Future work could involve decomposing materials and lighting to allow for scene relighting.
- Personal Insights & Critique: This paper presents a very intelligent and pragmatic engineering solution to a difficult problem. Instead of pursuing a pure end-to-end neural approach, the authors made the wise decision to combine the strengths of different paradigms: the proven robustness of classical graph-based optimization for geometry and the speed and quality of 3D Gaussian Splatting for appearance.

The hyper primitives map is a clever data structure that elegantly unifies these two worlds. The Gaussian-Pyramid-based learning is a simple yet powerful idea that addresses the instability of monocular 3D reconstruction by imposing a coarse-to-fine learning curriculum.

The most significant achievement of this work is its practicality. By demonstrating real-time performance on an embedded device (as shown in Image 4), Photo-SLAM moves photorealistic mapping from a purely academic pursuit into the realm of deployable technology for robotics, augmented reality, and digital twins. It sets a new and very high bar for future research in this area.