
Towards Physically Executable 3D Gaussian for Embodied Navigation

Published: 10/24/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces SAGE-3D, addressing the limitations of 3D Gaussian Splatting in embodied navigation with object-centric semantic grounding and physics-aware execution. The released InteriorGS dataset contains 1K annotated indoor scenes, and SAGE-Bench is the first 3DGS-based VLN benchmark, containing 2M VLN samples.

Abstract

3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is more difficult to converge, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task. Our data and code are available at: https://sage-3d.github.io.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "Towards Physically Executable 3D Gaussian for Embodied Navigation", which is also its central topic.

1.2. Authors

The authors are:

  • Bingchen Miao (Zhejiang University, Manycore Tech Inc)
  • Rong Wei (Manycore Tech Inc)
  • Zhiqi Ge (Zhejiang University)
  • Xiaoquan Sun (Manycore Tech Inc, Huazhong University of Science and Technology)
  • Shiqi Gao (Zhejiang University)
  • Jingzhe Zhu (Zhejiang University)
  • Renhan Wang (Manycore Tech Inc)
  • Siliang Tang (Zhejiang University)
  • Jun Xiao (Zhejiang University)
  • Rui Tang (Manycore Tech Inc)
  • Juncheng Li (Zhejiang University)

1.3. Journal/Conference

The paper does not state a publication venue; the only provided date is 2025-10-24. Given the arXiv links, it is currently available as a preprint and may be under submission to a conference or journal in 2025.

1.4. Publication Year

The publication year is 2025.

1.5. Abstract

This paper addresses the limitations of 3D Gaussian Splatting (3DGS)—a 3D representation method known for photorealistic real-time rendering—in Visual-Language Navigation (VLN) tasks, specifically its lack of fine-grained semantics and physical executability. To overcome these issues, the authors propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation). This new paradigm upgrades 3DGS into an executable, semantically and physically aligned environment through two main components: (1) Object-Centric Semantic Grounding, which adds object-level annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects and creates rich physical interfaces within 3DGS. The authors release InteriorGS, a dataset of 1K object-annotated 3DGS indoor scenes, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments demonstrate that 3DGS scene data, while harder to converge, exhibits strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inability of 3D Gaussian Splatting (3DGS)—despite its advantages in photorealistic real-time rendering—to be directly applied to Visual-Language Navigation (VLN) tasks due to two significant limitations:

  1. Lack of fine-grained object-level semantics: Current 3DGS scenes only contain color and density information, making it impossible to uniquely identify or ground objects mentioned in natural language instructions (e.g., "go to the red chair"). This makes tasks requiring semantic understanding difficult without complex and error-prone post-processing.

  2. Absence of a physically executable structure: 3DGS is primarily a volumetric rendering technique, making it challenging to derive reliable collision geometries. Without accurate physical properties, embodied agents cannot interact with the environment realistically, leading to issues like mesh penetration that hinder effective learning.

    This problem is important because VLN is a core capability for Vision-Language Action (VLA) models, enabling robots to follow natural language instructions in complex indoor spaces. Training these models directly in the real world is costly and risky, leading to the widespread adoption of the sim-to-real paradigm. Reducing the sim-to-real gap is crucial, and 3DGS has been recognized as an effective tool for this due to its superior visual fidelity and view consistency compared to older scanned mesh representations (like Matterport3D and HM3D). However, 3DGS's inherent lack of semantics and physics prevents it from being a truly executable environment for VLN.

The paper's entry point is to bridge this gap by transforming 3DGS from a purely perceptual representation into a semantically and physically aligned, executable environment.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • SAGE-3D Paradigm: It proposes SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a novel paradigm that augments 3DGS with semantic granularity and physical validity, transforming it into an executable environment foundation for embodied navigation.
  • InteriorGS Dataset: It constructs and releases InteriorGS, the first large-scale dataset of 1,000 fully furnished indoor 3DGS reconstructions. This dataset includes dense object-level annotations (instance IDs, object categories, bounding box information) across 554k object instances and 755 categories, providing a semantically rich foundation.
  • SAGE-Bench Benchmark: It introduces SAGE-Bench, the first 3DGS-based VLN benchmark. SAGE-Bench features 2M new trajectory-instruction pairs, accurate per-object physical simulation, and rich interfaces for robot embodiments, along with a hierarchical instruction scheme and novel navigation natural continuity metrics.
  • Experimental Insights:
    • 3DGS scene data renders faster but is harder to converge during training compared to scanned mesh data, suggesting higher demands due to its richness and photorealism.

    • The scene-rich, photorealistic 3DGS VLN data exhibits strong generalizability, leading to a significant performance improvement (31% SR increase) for models trained on it when tested in unseen VLN-CE environments.

    • The newly proposed Continuous Success Ratio (CSR), Integrated Collision Penalty (ICP), and Path Smoothness (PS) metrics enable a more effective study of navigation's natural continuity, addressing gaps in conventional metrics that fail to capture issues like continuous collisions and unsmooth motion.

      These findings collectively solve the problem of making 3DGS a viable and effective environment for training and evaluating VLN agents, significantly narrowing the sim-to-real gap by providing high-fidelity, semantically rich, and physically interactive simulation environments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following core concepts:

  • 3D Gaussian Splatting (3DGS):

    • Conceptual Definition: 3DGS is a novel 3D scene representation method that uses a collection of thousands to millions of anisotropic (direction-dependent, i.e., stretched differently along different axes) 3D Gaussian primitives to model a scene. Each Gaussian is essentially a small, soft, transparent blob defined by its position, covariance (describing its shape and orientation), color, and opacity.
    • How it works: Instead of voxels or meshes, 3DGS optimizes these individual Gaussian primitives directly from a set of input images captured from various viewpoints. During rendering, these Gaussians are projected onto the 2D image plane and blended using standard alpha compositing techniques.
    • Advantages: It enables photorealistic real-time rendering from novel viewpoints, which is crucial for dynamic camera movements in navigation tasks. It also offers view-consistent appearance, meaning the scene looks consistent from any angle, unlike scanned mesh textures that can stretch or break.
    • Limitations (addressed by this paper): By default, 3DGS only encodes appearance (color, density) and lacks semantic information (what objects are, their categories, instance IDs) and explicit geometric surfaces for physical interaction.
  • Visual-Language Navigation (VLN):

    • Conceptual Definition: VLN is an embodied artificial intelligence (AI) task where an agent (e.g., a robot) must navigate through a 3D environment to a target location or perform an action, guided solely by natural language instructions.
    • Example: An instruction like "Go to the kitchen and find the red mug on the counter." requires the agent to understand the language, perceive the visual environment, and execute a sequence of actions (move, turn) to reach the goal.
    • Importance: It's a key step towards more capable and general-purpose embodied AI agents that can follow human commands in real-world settings.
  • Sim-to-Real Gap:

    • Conceptual Definition: This refers to the challenge of transferring policies or models trained in a simulated environment to the real world. Models trained in simulation often struggle in reality due to discrepancies in physics, visual appearance, sensor noise, and environmental complexity between the simulation and the real world.
    • Relevance to VLN: High-fidelity simulations that closely resemble reality are crucial for training VLN agents effectively, as direct real-world training is often expensive, time-consuming, and risky. 3DGS is seen as a way to narrow this gap due to its photorealism.
  • Embodied Navigation / Embodied AI:

    • Conceptual Definition: Embodied AI focuses on creating AI agents that perceive, act, and learn within physical or simulated environments, similar to living beings. Embodied Navigation is a subfield specifically dealing with an agent's ability to move and find its way in these environments.
    • Key Aspect: These agents need to interact with the environment physically (e.g., avoid collisions, open doors), not just visually.
  • Scanned Mesh Reconstructions (e.g., Matterport3D, HM3D):

    • Conceptual Definition: These are 3D models of real-world environments created by scanning them with RGB-D cameras (which capture both color and depth information). The scans are then stitched together to form a mesh (a collection of interconnected triangles forming surfaces).
    • Limitations:
      • Noisy Geometry: Depth scans can be noisy, leading to imperfect, bumpy, or estimated mesh surfaces that might merge objects with their surroundings.
      • Poor Semantics: Separating individual objects from a continuous scanned mesh for semantic annotation is difficult and costly.
      • View-Inconsistent Textures: Textures are often stitched from sparse RGB viewpoints, leading to seams, stretching, or blur when viewed from novel angles. This can be distracting for agents trying to learn visual cues.
  • Partially Observable Markov Decision Process (POMDP):

    • Conceptual Definition: A POMDP is a mathematical framework for modeling decision-making in environments where the agent cannot directly observe the full state of the world. Instead, the agent receives observations that provide partial information about the state.
    • Components:
      • $\mathcal{U}$: Instruction space (the natural language commands).
      • $\mathcal{S}$: Continuous state space (the full underlying state of the environment and agent, which may not be fully observable).
      • $\mathcal{A}$: Action space (the set of actions the agent can take, e.g., move forward, turn).
      • $\mathcal{O}$: Multimodal observation space (what the agent perceives, e.g., RGB images, depth maps, semantic segmentation).
      • $T$: State transition function (describes how the state changes after an action, often influenced by physics).
      • $Z$: Observation function (describes the probability of making an observation given a state).
    • Relevance: The paper defines its SAGE-3D environment as a semantics- and physics-augmented POMDP, highlighting that the agent's decision-making is based on partial observations within a dynamic, physically realistic world.
  • Convex Hull Decomposition:

    • Conceptual Definition: A convex hull of a set of points is the smallest convex set that contains all the points. Convex decomposition is the process of breaking down a complex, non-convex 3D mesh (like a detailed 3D model of a chair) into a set of simpler, convex sub-meshes.
    • Purpose in Physics Simulation: Convex shapes are computationally much easier to handle for collision detection and physics simulation than complex non-convex shapes. By decomposing objects into convex parts, real-time physical interactions (like collisions between a robot and an object) can be simulated efficiently and accurately.
  • A*-based Shortest-Path Search:

    • Conceptual Definition: A* (pronounced "A-star") is a popular pathfinding algorithm used in many areas of computer science, including robotics and game development. It is an informed search algorithm, meaning it uses heuristic functions to guide its search.
    • How it works: It finds the shortest path between a starting point and a goal point in a graph. It does this by evaluating each node (possible position) based on the cost from the start to that node (g-score) and an estimated cost from that node to the goal (h-score, the heuristic). It prioritizes nodes that appear to be on the shortest path. A minimal sketch follows this list.
    • Relevance: The paper uses A*-based search to generate optimal trajectories for VLN agents within the constructed navigation maps.
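
To make the search concrete, here is a minimal, self-contained sketch of A* on a 2D occupancy grid. It is illustrative only: the paper plans over a fused occupancy/semantic navigation map, and the grid world, unit step costs, and 4-connectivity below are simplifying assumptions.

```python
import heapq

def astar(grid, start, goal):
    """Minimal A* on a 2D occupancy grid (0 = free, 1 = blocked).

    g-score: cost from start; h-score: Manhattan-distance heuristic.
    Returns the shortest 4-connected path as a list of (row, col) cells.
    """
    def h(p):  # admissible heuristic: Manhattan distance to the goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start, None)]
    came_from, g_score = {}, {start: 0}
    while open_heap:
        _, g, node, parent = heapq.heappop(open_heap)
        if node in came_from:       # already expanded with a better cost
            continue
        came_from[node] = parent
        if node == goal:            # reconstruct path by walking parents back
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_score.get((nr, nc), float("inf")):
                    g_score[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc), node))
    return None  # goal unreachable

# Example: plan around a small wall segment.
grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 3)))
```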

3.2. Previous Works

The paper contextualizes its work by discussing previous approaches in VLN and scene representations:

  • Early VLN Work (e.g., Anderson et al., 2018; Ku et al., 2020): These studies often used Matterport3D-based discrete panoramic graphs. Agents navigated by selecting discrete viewpoints rather than continuous motion.

  • Continuous Navigation (e.g., Krantz et al., 2020 - VLN-CE): This shifted VLN research towards continuous control in environments like Habitat (Savva et al., 2019), allowing agents more realistic movement. However, mainstream benchmarks for continuous navigation still largely relied on scanned mesh reconstructions.

  • Scene Representations:

    • Scanned Mesh Reconstructions: Matterport3D (Chang et al., 2017) and HM3D (Ramakrishnan et al., 2021) are prominent datasets based on RGB-D scans. While widely used, they suffer from texture/semantic limitations as discussed above.
    • 3D Gaussian Splatting (3DGS): Introduced by Kerbl et al. (2023), 3DGS emerged as an efficient 3D representation for photorealistic real-time rendering. The paper notes that 3DGS has been integrated into embodied learning contexts (e.g., Jia et al., 2025; Zhu et al., 2025b for MuJoCo/Isaac Sim), sometimes adopting dual-representation (Gaussians for rendering, meshes for collision) (Lou et al., 2025; Wu et al., 2025b), and enhanced with lighting estimation (Phongthawee et al., 2024).
  • VLN Benchmarks: The paper lists several existing benchmarks for continuous navigation tasks, highlighting their scene sources, instruction types, and 3D representations:

    • VLN-CE (Krantz et al., 2020): 4.5k tasks, 90 scenes, MP3D, Estimated Scene Geometry, Scanned Mesh 3D Representation.

    • OVON (Yokoyama et al., 2024): 53k tasks, 181 scenes, HM3D, Estimated Scene Geometry, Scanned Mesh.

    • GOAT-Bench (Khanna et al., 2024): 725k tasks, 181 scenes, HM3D, Estimated Scene Geometry, Scanned Mesh.

    • IR2R-CE (Krantz et al., 2022): 414 tasks, 71 scenes, MP3D, Estimated Scene Geometry, Scanned Mesh.

    • LHPR-VLN (Song et al., 2025): 3.3k tasks, 216 scenes, HM3D, Estimated Scene Geometry, Scanned Mesh.

    • OctoNav-Bench (Gao et al., 2025): 45k tasks, 438 scenes, MP3D, HM3D, Estimated Scene Geometry, Scanned Mesh.

      The key takeaway is that most existing VLN benchmarks rely on scanned mesh representations, which have limitations in terms of texture/semantic quality and physical validity. They also primarily focus on A-to-B navigation without complex causal dependencies.

3.3. Technological Evolution

The evolution of scene representations for VLN has been driven by the need to narrow the sim-to-real gap and provide more realistic training environments:

  1. Early Scanned Mesh Reconstructions (e.g., Matterport3D, HM3D): These provided the first large-scale 3D indoor environments for Embodied AI. They enabled early VLN research, but their limitations in semantic annotation, texture quality, and physical accuracy became apparent as research advanced.
  2. Transition to Continuous Environments: The shift from discrete panoramic graphs to continuous navigation in environments like Habitat highlighted the need for more robust scene representations that support free movement and physical interaction.
  3. Emergence of 3D Gaussian Splatting (3DGS): 3DGS represented a significant leap in photorealistic real-time rendering, offering superior visual fidelity and view consistency compared to scanned mesh. This made it a highly attractive candidate for VLN environments.
  4. This Paper's Contribution: The current paper, SAGE-3D, addresses the critical next step in this evolution: making 3DGS not just visually stunning but also semantically meaningful and physically executable. It pushes the boundary of 3DGS beyond pure rendering to serve as a foundational, interactive environment for Embodied AI.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Upgrading 3DGS from Perceptual to Executable: Prior 3DGS applications in Embodied AI often treated Gaussians primarily for rendering, sometimes using dual-representations where meshes handle physics. SAGE-3D provides a unified paradigm to integrate semantics and physics directly into the 3DGS environment structure, making it natively executable and semantically aligned.
  • Addressing Semantics in 3DGS: Existing 3DGS scenes lack object-level semantics. This paper introduces Object-Level Semantic Grounding through InteriorGS (a manually annotated 3DGS dataset) and 2D semantic top-down map generation, which is a significant innovation for VLN tasks requiring fine-grained understanding of instructions like "go to the red chair".
  • Enabling Physics in 3DGS Environments: Native 3DGS struggles with deriving reliable collision geometries. SAGE-3D introduces Physics-Aware Execution Jointing by creating a 3DGS-Mesh Hybrid Representation where collision bodies are derived from artist-created meshes via convex decomposition and integrated with 3DGS for appearance. This provides accurate physical simulation without sacrificing visual fidelity.
  • Novel VLN Benchmark: SAGE-Bench is the first 3DGS-based VLN benchmark. Unlike prior benchmarks that largely use scanned mesh and focus on A-to-B navigation, SAGE-Bench offers:
    • Ground Truth Scene Geometry (derived from artist meshes) instead of estimated from scans.
    • Hierarchical Instruction Generation: It includes Instruction with Causality (high-level semantic goals like "I'm thirsty, get water from the table") in addition to low-level actions, moving beyond simple A-to-B tasks.
    • Navigation Natural Continuity Metrics: It introduces CSR, ICP, and PS to evaluate continuous motion, addressing the limitations of traditional metrics like SR and CR that don't capture nuanced agent behavior (e.g., persistent collisions, jerky movements).
  • Enhanced Data Generalizability: The photorealistic and scene-rich nature of the 3DGS data in SAGE-Bench is shown to significantly improve the generalizability of VLN models to unseen environments, which is a crucial step for real-world deployment.

4. Methodology

4.1. Principles

The core idea of SAGE-3D is to transform 3DGS from a purely visual scene representation into a fully functional, executable environment for embodied navigation. The theoretical basis is that for an agent to effectively navigate and interact in a VLN task, the environment must provide three key components:

  1. Photorealistic Appearance: To accurately mirror real-world visual complexity and reduce the sim-to-real gap. 3DGS excels at this.

  2. Fine-grained Semantics: To enable agents to understand and ground natural language instructions that refer to specific objects, their attributes, and relationships.

  3. Physical Executability: To allow agents to move and interact with objects in a physically plausible manner, avoiding collisions and supporting realistic dynamics.

    The intuition behind SAGE-3D is to decouple the strengths of different representations: 3DGS for rendering high-fidelity visuals, and traditional mesh-based approaches for providing precise collision geometry and semantic annotations. By carefully aligning these layers, the system aims to offer the best of both worlds within a unified framework.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. SAGE-3D Paradigm Definition

The paper formally defines SAGE-3D as a process that upgrades 3DGS into an executable, semantically and physically aligned environment foundation for embodied navigation. This transformation is represented as:

$G + M + \Phi \longrightarrow \mathcal{E}_{\mathrm{exec}}$

  • $G$: Represents the set of Gaussian primitives $\{ g_i \}_{i=1}^{N}$ that constitute the 3DGS scene, providing the photorealistic visual appearance.

  • $M$: Denotes the semantic layer, which includes fine-grained information such as instance/category maps and object attributes. This layer allows for the understanding and grounding of natural language instructions.

  • $\Phi$: Represents the physics layer, comprising collision bodies and dynamics properties. This layer enables physically accurate interactions and movement within the environment.

  • $\mathcal{E}_{\mathrm{exec}}$: The resulting executable environment, which supports continuous Visual-Language Navigation and related embodied AI tasks.

    The resulting environment is formalized as a semantics- and physics-augmented Partially Observable Markov Decision Process (POMDP):

$\mathcal{E} = (\mathcal{U}, \mathcal{S}, \mathcal{A}, \mathcal{O}, T, Z; M, \Phi)$

  • $\mathcal{U}$: The instruction space, which encompasses the natural language commands provided to the agent.

  • $\mathcal{S}$: The continuous state space, representing the complete, underlying state of the environment and the agent's pose.

  • $\mathcal{A}$: The action space, defining the set of continuous or discrete actions the agent can perform (e.g., velocity commands, turning, moving).

  • $\mathcal{O}$: The multimodal observation space, which includes all sensor data available to the agent (e.g., RGB, depth, semantic segmentation, poses, contact events).

  • $T$: The physics-driven state transition function, describing how the environment and agent state evolve over time based on actions and physical laws.

  • $Z$: The rendering function, which generates observations from the current state, primarily driven by the 3DGS appearance.

  • $M$: The semantic layer, explicitly included in the POMDP definition, indicating that semantic information is an integral part of the environment's state and observations.

  • $\Phi$: The physics layer, also explicitly included, signifying that physical properties and interactions are central to the environment's dynamics and agent behavior.

    This paradigm ensures that the high-fidelity rendering of 3DGS is preserved while object-level semantics and physical executability are introduced, making 3DGS a practical foundation for training and evaluating embodied agents.
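
As an illustration of this formalization, the sketch below organizes the augmented POMDP tuple as plain Python data structures. All class and field names here are hypothetical (the released environment exposes its own APIs); the point is only how the semantic layer $M$ and physics layer $\Phi$ sit alongside the standard POMDP components.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class SemanticLayer:
    """M: instance/category maps and object attributes."""
    instance_categories: Dict[int, str] = field(default_factory=dict)  # id -> category
    attributes: Dict[int, Dict[str, Any]] = field(default_factory=dict)

@dataclass
class PhysicsLayer:
    """Phi: collision bodies and dynamics properties."""
    collision_bodies: Dict[int, Any] = field(default_factory=dict)
    rigid_params: Dict[int, Dict[str, float]] = field(default_factory=dict)

@dataclass
class AugmentedPOMDP:
    """E = (U, S, A, O, T, Z; M, Phi), per the definition above."""
    instruction_space: Any      # U: natural-language commands
    state_space: Any            # S: continuous environment + agent state
    action_space: Any           # A: discrete or continuous controls
    observation_space: Any      # O: RGB, depth, semantics, poses, contacts
    transition: Callable        # T(s, a) -> s', driven by the physics engine
    render: Callable            # Z(s) -> o, driven by 3DGS rendering
    semantics: SemanticLayer    # M
    physics: PhysicsLayer       # Phi
```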

4.2.2. Object-Level Semantic Grounding

To address the lack of fine-grained semantics in conventional 3DGS, the paper introduces InteriorGS and a 2D semantic top-down map generator.

4.2.2.1. InteriorGS Dataset Construction

The InteriorGS dataset is constructed with the goal of providing object-level semantic annotations for 3DGS scenes.

  1. Source Data: The 3DGS data in InteriorGS is sampled from artist-created mesh scenes. These scenes provide clean, accurate geometry and object boundaries, which are critical for reliable semantic annotation and later for extracting collision bodies.
  2. 3DGS Reconstruction: To achieve high-fidelity 3DGS reconstructions, especially in occlusion-rich indoor environments, an average of 3,000 camera views are rendered per scene using a ray tracing renderer. The poses from this renderer are then used with the open-source GSplat pipeline (Ye et al., 2025) to estimate the 3DGS parameters (positions, colors, opacities, covariances of Gaussians).
      • Camera Placement Policies (detailed in Appendix B; a budget-allocation sketch follows this list):
      • Perimeter-aware floorplan sweeps ("surround"): For each room polygon, a global camera budget is distributed proportionally to its perimeter. Cameras are uniformly spaced along the polygon's perimeter, with optical axes aligned to inward edge normals. Three tangential baselines (left/center/right) and three vertical tiers are instantiated:
        • Outer tiers: lower (150 mm above floor, $+30^\circ$ pitch), middle (mid-height, $0^\circ$ pitch), upper (500 mm below ceiling, $-30^\circ$ pitch).
        • Interior tiers (if applicable): heights interpolated, upper ($-15^\circ$ pitch), lower ($+15^\circ$ pitch), middle matching outer middle.
      • Volume-uniform sampling: The global camera budget is distributed across rooms proportionally to their volume. Poisson-disk sampling is used to draw 3D positions, ensuring space-filling uniformity. At each sampled position, six cameras with canonical yaw-pitch templates are instantiated, with a small random perturbation applied to their orientations.
      • Viewpoint Selection: Viewpoints are selected to be at appropriate distances from mesh surfaces. Viewpoints too close to the nearest mesh surface (below a safety threshold) are discarded to mitigate undersampling-induced 3DGS underfitting.
  3. Manual Annotation: After 3DGS reconstruction, the scenes undergo careful manual labeling and double verification to add object-level semantics. This includes:
    • Object Categories: e.g., "chair," "table," "sofa."
    • Instance IDs: Unique identifiers for each individual object instance (e.g., chair_1, chair_2).
    • Bounding Box Information: 3D axis-aligned bounding boxes for each object.
  4. Dataset Content: InteriorGS comprises 1,000 high-fidelity indoor 3DGS scenes (752 residential interiors, 248 public spaces like concert halls). It contains over 554k object instances across 755 categories, providing a dense, semantically consistent, and diverse foundation.
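
The perimeter-proportional budget allocation in the "surround" policy can be sketched as follows. This is a toy reading of the description above using shapely; the budget split, the uniform arc-length spacing, and the two-room example are assumptions, and tier heights, pitches, and inward-normal orientations are omitted.

```python
import numpy as np
from shapely.geometry import Polygon

def perimeter_sweep_positions(rooms, total_budget):
    """Distribute a global camera budget across room polygons proportionally
    to perimeter, then space cameras uniformly along each boundary."""
    perims = np.array([room.exterior.length for room in rooms])
    budgets = np.maximum(1, np.round(total_budget * perims / perims.sum())).astype(int)
    placements = []
    for room, n in zip(rooms, budgets):
        # Arc-length parameterization: n equally spaced boundary points.
        ts = np.linspace(0.0, room.exterior.length, n, endpoint=False)
        placements.append([room.exterior.interpolate(t).coords[0] for t in ts])
    return placements

# Toy floorplan: a large living room and a small bedroom.
rooms = [Polygon([(0, 0), (6, 0), (6, 4), (0, 4)]),
         Polygon([(8, 0), (11, 0), (11, 3), (8, 3)])]
for pts in perimeter_sweep_positions(rooms, total_budget=12):
    print(len(pts), pts[:2])
```

Optical axes would then be aligned to the inward edge normals at each position, with the tangential baselines and vertical tiers instantiated per point as described above.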

4.2.2.2. 2D Semantic Top-Down Map Generation

Since 3DGS lacks inherent semantics and discrete entities for traditional NavMesh generation, a 2D semantic top-down map is designed.

  1. Projection: Annotated 3D objects from InteriorGS (initially stored as axis-aligned 3D bounding boxes) are projected onto the ground plane.
  2. Refinement: Each object's footprint is refined into an irregular mask. This is done by sampling points from the object's surface, projecting them onto the ground plane, and then computing the 2D convex hull of these projected points.
  3. Formula for Mask Generation (a sketch follows this list): $\mathcal{M}_k = \operatorname{Fuse}\left( \operatorname{Hull}\left\{ \Pi_{\mathrm{top}}(p) \mid p \in \operatorname{Surf}(o_k) \right\} \right)$
    • $\mathcal{M}_k$: The 2D mask generated for a specific object $o_k$. This mask delineates the object's footprint on the ground plane.
    • $\operatorname{Surf}(o_k)$: The set of sampled surface points belonging to the 3D object $o_k$. These points are representative of the object's exterior.
    • $\Pi_{\mathrm{top}}(\cdot)$: The projection operator that maps a 3D point $p$ onto the ground plane.
    • $\operatorname{Hull}(\cdot)$: The 2D convex-hull operator, which takes a set of 2D points (the projected surface points) and returns the smallest convex polygon that encloses all of them. This effectively creates a tight boundary around the object's base.
    • $\operatorname{Fuse}(\cdot)$: A merging function that consolidates multi-view masks into a single, consistent footprint. This step is important if an object's footprint is perceived or processed from multiple angles, ensuring a unified representation.
  4. Additional Annotations: Doors are tagged with their state (open/closed/half-open), and walls are marked as non-traversable, providing critical information for path planning and instruction generation.
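
A straightforward reading of the mask formula can be sketched with scipy and shapely, with `unary_union` playing the role of $\operatorname{Fuse}$. This is our interpretation of the equation, not the authors' released code.

```python
import numpy as np
from scipy.spatial import ConvexHull
from shapely.geometry import Polygon
from shapely.ops import unary_union

def footprint_mask(surface_point_sets):
    """M_k = Fuse(Hull{ Pi_top(p) | p in Surf(o_k) }).

    Each entry of `surface_point_sets` is an (N, 3) array of points sampled
    from one view of object o_k; the top-down projection drops the height axis."""
    hulls = []
    for pts in surface_point_sets:
        top = pts[:, :2]                      # Pi_top: project to the ground plane
        hull = ConvexHull(top)                # Hull: tight 2D convex boundary
        hulls.append(Polygon(top[hull.vertices]))
    return unary_union(hulls)                 # Fuse: merge multi-view masks

# Toy example: two overlapping point samples of the same object.
a = np.random.rand(200, 3) * [1.0, 1.0, 0.8]
b = np.random.rand(200, 3) * [1.0, 1.0, 0.8] + [0.3, 0.2, 0.0]
print(footprint_mask([a, b]).area)
```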

4.2.3. Physics-Aware Execution Jointing

To enable physical interaction and address issues like mesh penetration in 3DGS, the paper introduces object-level collision geometry and a physics simulation interface.

4.2.3.1. Physics Simulation with 3DGS-Mesh Hybrid Representation

This component is crucial for making the environment physically executable.

  1. Collision Geometry Extraction: Starting from the artist-created triangle meshes of each object (which are clean and accurate, unlike scanned meshes), the CoACD (Wei et al., 2022) algorithm is applied for convex decomposition. This process breaks down complex object meshes into simpler convex shapes, which are ideal for efficient collision detection and physics simulation. These resulting convex shapes form the per-object collision bodies (a sketch follows this list).
  2. 3DGS-Mesh Hybrid Representation Assembly: A USDA scene (Universal Scene Description Archive, a file format for 3D content interchange) is assembled. In this USDA scene:
    • The collision bodies (derived from meshes) are authored as invisible rigid shapes. These rigid shapes are responsible for driving contact and dynamics during simulation.
    • The 3DGS file (exported from 3DGUT (Wu et al., 2025a), supporting 3DGS assets from USDZ files in Isaac Sim 5.0+) provides the photorealistic appearance and remains visible.
  3. Decoupled Design: This decoupled design means that 3DGS handles the high-fidelity rendering, while the mesh-based collision bodies provide accurate geometry for physics simulation. Each object is instantiated as a USD prim (a fundamental building block in USD) and augmented with $\Phi_k$ (rigid-body and contact parameters). Static scene objects are set as static bodies, while a curated subset can be configured as movable or articulated to support advanced interactions.
  4. Benefits: This hybrid representation avoids the need to ray trace detailed artist meshes at runtime, preserves the visual quality of 3DGS, and supplies accurate collision geometry for robust physics simulation with connectivity to robotics APIs.
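
A compressed sketch of this pipeline is shown below, using the open-source coacd package for decomposition and the pxr (OpenUSD) API for authoring. The file names, the CoACD threshold, and the physics-schema choices are assumptions; the paper's production pipeline (via 3DGUT and Isaac Sim) is more involved.

```python
import coacd
import trimesh
from pxr import Usd, UsdGeom, UsdPhysics

# 1) Convex decomposition of an artist-created mesh into collision parts.
src = trimesh.load("chair.obj", force="mesh")
parts = coacd.run_coacd(coacd.Mesh(src.vertices, src.faces), threshold=0.05)

# 2) Author the parts as invisible collision shapes in a USDA scene; the
#    visible appearance comes from the separate 3DGS asset.
stage = Usd.Stage.CreateNew("chair_physics.usda")
root = UsdGeom.Xform.Define(stage, "/chair")
# Phi_k: rigid-body/contact parameters (static scene objects would instead
# be authored as static bodies).
UsdPhysics.RigidBodyAPI.Apply(root.GetPrim())

for i, (verts, faces) in enumerate(parts):
    mesh = UsdGeom.Mesh.Define(stage, f"/chair/collision_{i}")
    mesh.CreatePointsAttr([tuple(map(float, v)) for v in verts])
    mesh.CreateFaceVertexCountsAttr([3] * len(faces))
    mesh.CreateFaceVertexIndicesAttr([int(idx) for f in faces for idx in f])
    UsdPhysics.CollisionAPI.Apply(mesh.GetPrim())  # drives contacts only
    UsdGeom.Imageable(mesh).MakeInvisible()        # never rendered
stage.Save()
```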

4.2.3.2. Agents, Control, and Observations in a Continuous Environment

The SAGE-3D environment provides comprehensive interfaces for embodied agents:

  1. Robot APIs: The simulator exposes robot APIs compatible with various ground platforms (e.g., Unitree G1/Go2/H1 for legged and wheeled robots) and aerial robots (e.g., quadrotor UAVs).
  2. Action Interfaces:
    • Discrete commands: e.g., turn, forward, stop.
    • Continuous control: velocity commands $(v, \omega)$ for ground robots (linear velocity $v$, angular velocity $\omega$) and 6-DoF velocity/attitude commands for UAVs. These actions are executed in a continuous environment (metric 3D space, no teleportation between nodes). A hypothetical interface sketch follows this list.
  3. Observation Space: The environment provides synchronized multimodal observations, including:
    • RGB images: Color images from the agent's viewpoint.
    • Depth maps: Distance information to surfaces.
    • Semantic segmentation: Pixel-level labels for objects.
    • Poses: Agent's 3D position and orientation.
    • Contact events: Information about collisions.
  4. Safety Features: Built-in collision detection, stuck/interpenetration monitoring, and recovery mechanisms ensure safe and robust simulation.
  5. Performance: Offline-generated collision bodies are cached to accelerate loading and guarantee stable, repeatable evaluation.
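
To illustrate what such an interface can look like from the agent's side, here is a purely hypothetical wrapper; none of these names come from the released code, and the real simulator backs the step with its physics and rendering layers.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray        # (H, W, 3) image rendered by 3DGS
    depth: np.ndarray      # (H, W) metric depth map
    semantics: np.ndarray  # (H, W) per-pixel instance/category ids
    pose: np.ndarray       # (x, y, z, yaw) of the agent
    contacts: list         # collision events from the physics layer

class GroundRobotEnv:
    """Hypothetical continuous-control wrapper over a SAGE-3D scene."""

    def step(self, v: float, omega: float, dt: float = 0.1) -> Observation:
        # 1) The physics layer integrates (v, omega) over dt and resolves
        #    contacts (the transition function T).
        # 2) 3DGS renders the new viewpoint (the rendering function Z).
        raise NotImplementedError("backed by the simulator in practice")
```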

5. Experimental Setup

5.1. Datasets

The experiments primarily leverage two novel datasets introduced in the paper, InteriorGS and SAGE-Bench, and also involve the VLN-CE benchmark for generalizability testing.

5.1.1. InteriorGS

  • Source: InteriorGS is a dataset of 1,000 high-fidelity indoor 3DGS scenes. These scenes are sampled from artist-created mesh scenes.
  • Scale and Characteristics:
    • 752 residential interior scenes and 248 public spaces (e.g., concert halls, amusement parks, gyms).
    • Over 554k object instances across 755 categories.
    • Features double-verified object-level annotations, including object categories, instance IDs, and bounding box information.
    • The 3DGS data is reconstructed from an average of 3,000 camera views per scene using ray tracing and the GSplat pipeline.
    • Example: Figure 9 in the paper (More Visualization of InteriorGS) shows examples of highly detailed and photorealistic indoor scenes, demonstrating the quality and diversity of the data. Figure 10 (Distribution of non-home scenes of InteriorGS) illustrates the distribution of public space categories, and Figure 11 (Distribution of assets of InteriorGS) shows the distribution of asset types (e.g., Furniture, Lighting, Books). For instance, books are the most numerous assets, totaling 80,096 instances.
  • Purpose: Provides a dense, semantically consistent, and broadly diverse foundation for training and evaluating embodied agents, particularly for VLN tasks requiring fine-grained object understanding.

5.1.2. SAGE-Bench

  • Source: SAGE-Bench is the first 3DGS-based VLN benchmark, built upon the InteriorGS scenes.
  • Scale and Characteristics:
    • Contains 2M new instruction-trajectory pairs.
    • Includes 554k detailed collision bodies (derived from the original mesh scenes for physics simulation).
    • Hierarchical Instruction Scheme:
      • High-level instructions: Emphasize task semantics and human-oriented intent, addressing causal dependencies. Categories include Add Object (e.g., "Please move the book from the coffee table to the bookshelf"), Scenario Driven (e.g., "I'm thirsty, please bring me a drink from the fridge"), Relative Relationship (e.g., "Move to the chair next to that table"), Attribute-based (e.g., "Find an empty table"), and Area-based (e.g., "Walk to the kitchen area").
      • Low-level instructions: Focus on control and kinematic evaluation, including primitive actions (Base-Action: "Move forward two steps") and goal-directed point-to-point navigation (Single-Goal: "Walk to the bedroom").
    • Trajectory Generation: Uses the collision bodies from Section 2.1 to construct a navigation map (combining a 1.2m-height occupancy map with the 2D semantic map). A*-based shortest-path search is then used to generate trajectories, with a cost function integrating free-space distance, narrow-passage penalties, and area preferences (a cost sketch follows this list). Start-end pairs are sampled across different rooms/objects, with a minimum safety distance.
  • Splits: A test split of 1,148 samples (944 high-level, 204 low-level across 35 distinct scenes) is selected, with the remainder used for training and validation.
  • Purpose: To benchmark VLN models in high-fidelity, physically executable environments with complex, semantically rich instructions, pushing beyond traditional A-to-B navigation.
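
The cost function mentioned above can be illustrated as a simple per-cell traversal cost for the A* search; the weights, clearance threshold, and functional form here are our assumptions, not the paper's exact parameters.

```python
def step_cost(dist_to_obstacle: float, area_weight: float,
              w_narrow: float = 2.0, clearance: float = 0.6) -> float:
    """Per-cell cost combining unit free-space distance, a narrow-passage
    penalty that grows as clearance to the nearest obstacle shrinks, and a
    per-area preference weight (e.g. favoring corridors over bedrooms)."""
    narrow = w_narrow * max(0.0, clearance - dist_to_obstacle) / clearance
    return 1.0 + narrow + area_weight

# A cell 0.2 m from the nearest obstacle, inside a low-preference area:
print(step_cost(dist_to_obstacle=0.2, area_weight=0.5))  # 1.0 + 1.33 + 0.5
```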

5.1.3. VLN-CE

  • Purpose: Used to evaluate the generalizability of models trained on SAGE-Bench to unseen environments from a different source (scanned mesh).
  • Source: Based on Matterport3D (MP3D) scenes.

5.2. Evaluation Metrics

5.2.1. VLN Task Metrics

For the VLN task, both conventional metrics and new natural continuity metrics are used.

  • Success Rate (SR):

    1. Conceptual Definition: Measures the proportion of episodes where the agent successfully reaches the target location within a predefined tolerance at the end of the navigation episode. It's a binary (0 or 1) judgment: either the goal is reached, or it's not.
    2. Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} $
    3. Symbol Explanation:
      • Number of successful episodes: The count of episodes where the agent's final position is within the goal region.
      • Total number of episodes: The total count of navigation attempts.
  • Oracle Success Rate (OSR):

    1. Conceptual Definition: Similar to SR, but it considers success if any point along the agent's trajectory (not just the final one) is within the goal region. This evaluates whether the agent ever reached the goal, even if it later moved away or failed to stop at the correct point.
    2. Mathematical Formula: $ \mathrm{OSR} = \frac{\text{Number of episodes where goal was visited}}{\text{Total number of episodes}} $
    3. Symbol Explanation:
      • Number of episodes where goal was visited: Count of episodes where at least one point in the agent's path was within the goal region.
      • Total number of episodes: The total count of navigation attempts.
  • Success Weighted by Path Length (SPL):

    1. Conceptual Definition: A metric that balances success rate with path efficiency. It penalizes agents that take long, inefficient paths to reach the goal. A higher SPL means both successful navigation and taking a relatively short path.
    2. Mathematical Formula: $ \mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{S_i} \frac{L_{GT,i}}{\max(L_{Act,i}, L_{GT,i})} $
    3. Symbol Explanation:
      • $N$: Total number of episodes.
      • $\mathbf{1}_{S_i}$: An indicator function that is 1 if episode $i$ is successful, and 0 otherwise.
      • $L_{GT,i}$: Length of the shortest path (ground truth) for episode $i$.
      • $L_{Act,i}$: Length of the agent's actual path for episode $i$.
      • $\max(L_{Act,i}, L_{GT,i})$: Ensures the ratio is always $\le 1$ and handles cases where the actual path is shorter than the ground truth (though usually $L_{Act,i} \ge L_{GT,i}$).
  • Collision Rate (CR):

    1. Conceptual Definition: Measures the proportion of episodes in which the agent experienced one or more collisions with the environment. It's a binary indicator per episode.
    2. Mathematical Formula: $ \mathrm{CR} = \frac{\text{Number of episodes with collisions}}{\text{Total number of episodes}} $
    3. Symbol Explanation:
      • Number of episodes with collisions: Count of episodes where the agent registered at least one collision event.
      • Total number of episodes: The total count of navigation attempts.
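
For reference, the four conventional metrics can be computed from logged trajectories as in the sketch below; the episode field names and the goal radius are our choices, not the benchmark's configuration.

```python
import numpy as np

def vln_metrics(episodes, goal_radius=1.0):
    """SR, OSR, SPL, CR over a batch of episodes.

    Each episode dict holds: `path` ((T, 2) array of positions), `goal`
    ((2,) array), `gt_length` (shortest-path length), `collided` (bool)."""
    sr = osr = spl = cr = 0.0
    for ep in episodes:
        path, goal = np.asarray(ep["path"]), np.asarray(ep["goal"])
        success = np.linalg.norm(path[-1] - goal) <= goal_radius
        visited = np.linalg.norm(path - goal, axis=1).min() <= goal_radius
        actual_len = np.linalg.norm(np.diff(path, axis=0), axis=1).sum()
        sr += success
        osr += visited                                    # goal reached at any step
        spl += success * ep["gt_length"] / max(actual_len, ep["gt_length"])
        cr += ep["collided"]                              # binary per episode
    n = len(episodes)
    return {"SR": sr / n, "OSR": osr / n, "SPL": spl / n, "CR": cr / n}
```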

5.2.2. Navigation Natural Continuity Metrics

These three novel metrics are introduced to assess VLN model performance from the perspective of continuous motion, addressing limitations of conventional metrics.

  • Continuous Success Ratio (CSR):

    1. Conceptual Definition: Measures the fraction of time the agent stays within a permissible corridor around the reference path while also satisfying task conditions. Unlike SR which is a 0/1 judgment at the endpoint, CSR reflects the "goal-consistent" behavior throughout the entire episode, quantifying how well the agent follows the intended path.
    2. Mathematical Formula: $\mathrm{CSR} = \frac{1}{T} \sum_{t=1}^{T} s(t)$
    3. Symbol Explanation:
      • $T$: The total length (duration or number of time steps) of the trajectory.
      • $s(t)$: A binary indicator function at time step $t$.
        • $s(t) = 1$ if the agent's position $\operatorname{pos}(t)$ is within the permissible corridor $\mathcal{C}$ and task conditions are satisfied at time $t$.
        • $s(t) = 0$ otherwise.
      • $\mathcal{C}$: The permissible corridor, defined by buffering the reference path with a radius $r_{\mathrm{tol}}$ (tolerance radius).
  • Integrated Collision Penalty (ICP):

    1. Conceptual Definition: Quantifies the time-averaged collision intensity along the trajectory, capturing both the frequency and duration of contacts. This metric goes beyond Collision Rate (CR) by differentiating between brief, occasional contacts and prolonged, persistent scraping against obstacles, providing a more nuanced measure of physical interaction.
    2. Mathematical Formula: $\mathrm{ICP} = \frac{1}{T} \sum_{t=1}^{T} c(t)$
    3. Symbol Explanation:
      • $T$: The total length (duration or number of time steps) of the trajectory.
      • $c(t)$: The collision intensity at time step $t$, with $c(t) \in [0, 1]$. A value of 0 indicates no collision, and 1 indicates a continuous or severe collision. The paper does not explicitly define how $c(t)$ is computed, but it is implied to be a normalized measure of collision severity/duration at each step.
  • Path Smoothness (PS):

    1. Conceptual Definition: Evaluates the smoothness of the agent's path by analyzing consecutive changes in heading angle. Higher PS values indicate smoother paths with fewer abrupt turns and acceleration changes, which is desirable for real robots due to feasibility and stability concerns.
    2. Mathematical Formula: $\mathrm{PS} = 1 - \frac{1}{T-1} \sum_{t=2}^{T} \min\left( \frac{|\Delta \theta_t|}{\pi}, 1 \right), \quad \Delta \theta_t = \theta_t - \theta_{t-1}$
    3. Symbol Explanation:
      • $T$: The total length (duration or number of time steps) of the trajectory.
      • $\theta_t$: The agent's heading angle at trajectory time step $t$.
      • $\Delta \theta_t$: The change in heading between two consecutive time steps, calculated as $\theta_t - \theta_{t-1}$.
      • $\frac{|\Delta \theta_t|}{\pi}$: Normalizes the absolute change in heading angle by $\pi$ (180 degrees), ensuring the value is between 0 and 1 for typical turns. The $\min$ function caps it at 1 in case of very large turns or measurement noise.
      • The sum averages the normalized heading changes, and subtracting it from 1 yields a smoothness score, where 1 is perfectly smooth and lower values indicate more jerky motion.
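
The three continuity metrics map directly onto short array computations. The sketch below follows the formulas above, with one simplification we introduce: the corridor test uses distance to the nearest reference waypoint and omits the task-condition term of $s(t)$.

```python
import numpy as np

def continuity_metrics(positions, headings, collisions, ref_path, r_tol=0.5):
    """CSR, ICP, PS for one trajectory.

    positions: (T, 2); headings: (T,) radians; collisions: (T,) values in
    [0, 1] (the c(t) intensity sequence); ref_path: (K, 2) waypoints."""
    positions, ref_path = np.asarray(positions), np.asarray(ref_path)
    # s(t): inside the corridor C obtained by buffering the reference path.
    dists = np.linalg.norm(positions[:, None, :] - ref_path[None, :, :], axis=-1)
    csr = float((dists.min(axis=1) <= r_tol).mean())
    # ICP: time-averaged collision intensity.
    icp = float(np.mean(collisions))
    # PS: 1 - mean normalized heading change (unwrap avoids 2*pi jumps).
    dtheta = np.diff(np.unwrap(np.asarray(headings)))
    ps = float(1.0 - np.minimum(np.abs(dtheta) / np.pi, 1.0).mean())
    return {"CSR": csr, "ICP": icp, "PS": ps}
```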

5.2.3. No-goal-Nav Task Metrics

For the No-goal-Nav task, where the goal is to explore the environment safely, two metrics are used:

  • Episode Time: Measures the total duration (in seconds) an agent navigates in the environment until the episode terminates (e.g., due to collision or maximum time limit).
  • Explored Areas: Measures the total unique area (e.g., in square meters) covered by the agent during an episode.

5.3. Baselines

The paper evaluates a wide range of models, categorized as:

  • Closed-source MLLMs as Agent:
    • Qwen-VL-MAX (Bai et al., 2023)
    • GPT-4.1
    • GPT-5
  • Open-source MLLMs as Agent:
    • Qwen2.5-VL-7B (Bai et al., 2023)
    • InternVL-2.5-8B (Zhu et al., 2025a)
    • InternVL-3-8B (Chen et al., 2024)
    • Llama-3.2-11B
  • Vision-Language Models (VLN-specific models):
    • NaviLLM (Zheng et al., 2024)

    • NavGPT-2 (Zhou et al., 2024)

    • CMA (Krantz et al., 2020)

    • NaVid (Zhang et al., 2024)

    • NaVILA (Cheng et al., 2025)

    • NaVid-base: NaVid's pre-trained model navid-7b-full-224.

    • NaVid-SAGE (Ours): NaVid-base fine-tuned on SAGE-Bench data.

    • NaVILA-base: NaVILA's pre-trained model navila-siglip-lama3-8b-v1.5-pretrain.

    • NaVILA-SAGE (Ours): NaVILA-base fine-tuned on SAGE-Bench data.

      These baselines are representative as they cover a spectrum from general Multimodal Large Language Models (MLLMs) to specialized VLN models, including recent State-of-the-Art (SOTA) approaches. The inclusion of base models and their SAGE-trained counterparts (NaVid-SAGE, NaVILA-SAGE) allows for direct evaluation of the impact of training on the new 3DGS-based data.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Overall Comparison on SAGE-Bench

The following are the results from Table 2 of the original paper:

Methods VLN (High-level Instruction) No-goal-Nav
SR ↑ OSR ↑ SPL ↑ CR ↓ CSR ↑ ICP ↓ PS ↑ Episode Time ↑ Explored Areas ↑
Closed-source MLLMs as Agent
Qwen-VL-MAX 0.14 0.25 0.12 0.85 0.21 0.41 0.79 64.74 6.40
GPT-4.1 0.13 0.21 0.12 0.72 0.19 0.35 0.81 67.70 3.00
GPT-5 0.12 0.18 0.11 0.63 0.18 0.24 0.86 64.60 2.16
Open-source MLLMs as Agent
Qwen2.5-VL-7B 0.13 0.14 0.13 0.71 0.21 0.27 0.87 42.19 6.88
InternVL-2.5-8B 0.10 0.13 0.10 0.52 0.14 0.33 0.88 28.82 4.28
InternVL-3-8B 0.12 0.20 0.11 0.64 0.17 0.32 0.82 34.70 6.34
Llama-3.2-11B 0.13 0.18 0.14 0.74 0.16 0.29 0.83 38.45 6.68
Vision-Language Model
NaviLLM 0.05 0.06 0.05 0.21 0.09 0.24 0.90 18.73 5.74
NavGPT-2 0.10 0.12 0.11 0.33 0.14 0.29 0.83 24.51 3.36
CMA 0.13 0.15 0.14 0.54 0.26 0.28 0.86 44.26 3.22
NaVid 0.15 0.17 0.15 1.24 0.29 0.33 0.89 56.13 4.28
NaVILA 0.39 0.47 0.34 3.28 0.48 0.61 0.68 77.82 8.40
NaVid-base 0.10 0.13 0.10 0.33 0.15 0.28 0.84 20.37 3.42
NaVid-SAGE (Ours) 0.36 0.46 0.32 2.12 0.48 0.66 0.54 60.35 5.66
NaVILA-base 0.21 0.26 0.22 3.53 0.33 0.72 0.41 58.26 6.52
NaVILA-SAGE (Ours) 0.46 0.55 0.48 2.67 0.57 0.54 0.74 82.48 8.74
  • SAGE-Bench is Challenging: The results show that SAGE-Bench presents a significant challenge for current VLN models and MLLMs. Most models achieve SR values no higher than 0.15. For instance, NaVid, which performs well on VLN-CE R2R Val-Unseen, only achieves 0.15 SR on SAGE-Bench, and NaVILA's SR drops from 0.54 (VLN-CE) to 0.39 on SAGE-Bench. This indicates the higher complexity and demands of the 3DGS-based, semantically and physically rich environment.
  • MLLMs' Inherent VLN Capability: MLLMs (both open-source and closed-source) show a baseline VLN capability, achieving SRs between 0.10 and 0.14. This is comparable to, or even surpasses in some OSR cases, dedicated VLN models like CMA (0.13 SR) and NaVid (0.15 SR). For example, InternVL-3 achieves an OSR of 0.20, higher than NaVid's 0.17. This suggests that the multimodal understanding of MLLMs inherently equips them with some capacity for navigation, even without explicit VLN training on this benchmark.
  • Impact of SAGE Training: Fine-tuning on SAGE-Bench data significantly improves performance. NaVILA-SAGE achieves the best performance with an SR of 0.46, OSR of 0.55, and SPL of 0.48, outperforming all other baselines, including the original NaVILA. This validates the effectiveness of the proposed dataset and paradigm.
  • Continuity Metrics Insights: For models with weak VLN performance ($\mathrm{SR} < 0.20$), their CR, ICP, and PS metrics are often non-comparable, as these models fail to navigate meaningfully (e.g., performing random or single-action predictions). However, for NaVILA and NaVILA-SAGE, while NaVILA attains a relatively high SR (0.39), its ICP of 0.61 indicates sustained collisions, and its PS of 0.68 reflects unsmooth motion. NaVILA-SAGE improves CSR to 0.57 but still shows a high ICP (0.54) and a PS of 0.74, suggesting room for improvement in natural continuity despite high task success.

6.1.2. Insight 1: 3DGS Scene Data Renders Faster but is Harder to Converge

The following are the results from Table 3 of the original paper:

Environment Type Avg. Render Time / Frame (ms) ↓ Avg. Memory (MB) ↓ Iters to SR=40% (k) ↓ Time-to-SR=40% (hrs) ↓
Scanned Mesh (MP3D/HM3D) 16.7 850 120 4.8
3DGS-Mesh Hybrid Representation (Ours) 6.2 220 160 6.2

This comparison between scanned mesh data and 3DGS data (using NaVILA-base on NVIDIA H20 GPU) reveals a trade-off:

  • Rendering Speed: 3DGS data significantly outperforms scanned mesh data in rendering speed, with a per-frame rendering time of 6.2 ms and average memory usage of 220 MB, compared to 16.7 ms and 850 MB for scanned mesh. This highlights 3DGS's efficiency for real-time visual feedback.
  • Training Convergence: Despite faster rendering, 3DGS data exhibits slower convergence during training. To reach a 40% SR, the 3DGS-based model required 160k iterations and 6.2 hours, whereas the scanned mesh-based model achieved it in 120k iterations and 4.8 hours. This suggests that the richness and photorealism of 3DGS data, which better mirror real-world complexity, also present a greater challenge for models to learn and converge.

6.1.3. Insight 2: 3DGS Scene Data Exhibits Strong Generalizability

The following are the results from Table 4 of the original paper:

Methods R2R Val-Unseen
SR ↑ OSR ↑ SPL ↑
Seq2Seq 0.25 0.37 0.22
NaVid-base 0.22 0.32 0.17
NaVid-SAGE (Ours) 0.31 0.42 0.29
CMA 0.32 0.40 0.30
NaVid 0.37 0.49 0.36
NaVILA-base 0.29 0.38 0.27
NaVILA-SAGE (Ours) 0.38 0.51 0.36
NaVILA 0.50 0.58 0.45

To evaluate the generalizability of the 3DGS-based scene data, NaVILA-SAGE and NaVid-SAGE (trained exclusively on SAGE-Bench) were tested on the VLN-CE benchmark (R2R Val-Unseen split), which uses scanned mesh environments.

  • Significant Improvement: Both NaVILA-SAGE and NaVid-SAGE achieved substantial performance improvements over their respective base models, which are pre-trained checkpoints not fine-tuned on SAGE-Bench. (The separate NaVid and NaVILA rows in the table correspond to models trained on existing VLN data and are included for reference.)
  • NaVILA-SAGE improved SR by 31% relatively (from 0.29 to 0.38) and OSR by 34% (from 0.38 to 0.51) compared to NaVILA-base.
  • Similar gains were observed for NaVid-SAGE, which saw its SR increase from 0.22 to 0.31, and OSR from 0.32 to 0.42 compared to NaVid-base.
  • Reasoning: This strong generalizability suggests that the scene-rich, photorealistic 3DGS data from SAGE-Bench provides a more robust and real-world-aligned learning experience, enabling models to transfer learned navigation skills effectively to different, unseen environments, even those with different underlying representations (scanned mesh).

6.1.4. Insight 3: Natural Continuity Metrics Enable Effective Study of Navigation's Natural Continuity

The analysis of the CSR, ICP, and PS metrics from Table 2 provides critical insights into navigation quality beyond traditional SR and CR.

  • CSR vs. SR: CSR is generally higher than SR (e.g., NaVILA has SR 0.39, CSR 0.48; NaVILA-SAGE has SR 0.46, CSR 0.57). This indicates that CSR is a more inclusive metric, acknowledging "goal-consistent" behavior even if the agent doesn't perfectly stop at the target or deviates slightly from the exact path.

  • Revealing Flaws in "Successful" Navigation: While NaVILA achieves a relatively high SR (0.39) and OSR (0.47), its ICP of 0.61 and PS of 0.68 expose significant issues with natural motion continuity.

    • An ICP of 0.61 (where 0 is no collision, 1 is constant collision) suggests frequent and sustained collisions, implying the agent is scraping walls or obstacles for prolonged periods, which is undesirable for real-world robots.
    • A PS of 0.68 (where 1 is perfectly smooth) indicates large, mechanical turning angles rather than smooth, fluid motion, affecting real robot feasibility and stable planning.
  • Visualization Case Study: Figure 4 corroborates these findings, showing that NaVILA's blue trajectory often exhibits unsmooth movements and persistent collisions (e.g., hugging a wall for a long period in Case 1: although episode-level CR merely registers that a collision occurred, ICP reaches 0.87), which conventional metrics like CR fail to capture. The following figure (Figure 4 from the original paper) shows a visualization case study of navigation natural continuity:

    Figure 4: Visualization case study of navigation natural continuity. The red trajectory is the ground truth, and the blue trajectory is the trajectory of NaVILA. The three cases show different navigation situations, with the corresponding metrics annotated.

This demonstrates that CSR, ICP, and PS effectively fill gaps in conventional VLN evaluation, providing a more comprehensive understanding of agent behavior in continuous environments.

6.1.5. High-level Instructions vs. Low-level Instructions

The following are the results from Table 5 of the original paper:

Methods Instruction Level SAGE-Bench VLN
SR ↑ OSR ↑ SPL ↑ CSR ↑ ICP ↓ PS ↑
GPT-4.1 Low-level 0.22 0.37 0.19 0.27 0.60 0.70
High-level 0.13 0.21 0.12 0.19 0.35 0.81
InternVL-3-8B Low-level 0.20 0.35 0.18 0.26 0.61 0.69
High-level 0.12 0.20 0.11 0.17 0.32 0.82
NaVid Low-level 0.24 0.42 0.21 0.34 0.63 0.64
High-level 0.15 0.17 0.15 0.29 0.33 0.89
NaVILA Low-level 0.56 0.66 0.50 0.58 0.48 0.75
High-level 0.39 0.47 0.34 0.48 0.61 0.68

The comparison between high-level and low-level instructions on SAGE-Bench reveals that VLN models perform significantly worse on high-level instructions.

  • For NaVILA, the SR drops from 0.56 on low-level instructions to 0.39 on high-level instructions. Similar performance drops are observed across other models like GPT-4.1, InternVL-3-8B, and NaVid.
  • High-level instructions are semantically richer and closer to real-life scenarios, demanding deeper understanding of context, object attributes, and spatial relationships (e.g., task causality). Low-level instructions are basic step-by-step actions. The performance gap indicates that current models struggle with the more complex reasoning and grounding required by high-level semantics, highlighting a key challenge for future VLN development.

6.1.6. Impact of the Number of Scenes and Samples on Model Performance

The following are the results from Table 6 of the original paper:

Train Data SAGE-Bench VLN
#Scenes #Samples SR ↑ OSR ↑ SPL ↑ CSR ↑ ICP ↓ PS ↑
800 240k 0.42 0.47 0.42 0.50 0.61 0.63
800 120k 0.40 0.43 0.40 0.48 0.62 0.62
800 60k 0.36 0.42 0.38 0.46 0.64 0.58
400 120k 0.34 0.39 0.35 0.44 0.67 0.54
400 60k 0.31 0.37 0.33 0.43 0.67 0.52
400 30k 0.28 0.35 0.31 0.43 0.69 0.49
400 15k 0.25 0.31 0.27 0.39 0.70 0.46
200 60k 0.27 0.33 0.29 0.41 0.70 0.47
100 60k 0.23 0.29 0.26 0.38 0.71 0.44
NaVILA-base 0.21 0.26 0.22 0.36 0.72 0.41

The following figure (Figure 5 from the original paper) shows model performance change curve (number of scenes vs. sample size):

Figure 5: Model performance change curve (number of scenes vs. sample size). The x-axis is the number of training samples and the y-axis is the Success Rate (SR); the red line varies the sample count with scenes fixed at 400, while the blue line varies the scene count with samples fixed at 60k. The overall trend shows SR rising as data grows.

The analysis of varying number of scenes and sample sizes (using NaVILA-base on the val-unseen split) provides insights into data efficiency:

  • Scene Diversity is More Critical: Increasing the number of scenes (environment diversity) yields greater performance gains than merely increasing the number of samples (data density) while keeping scenes constant. For example, comparing 400 scenes/120k samples (SR 0.34) with 800 scenes/60k samples (SR 0.36) shows that a higher number of scenes with fewer samples can still outperform more samples from fewer scenes.
  • Optimal Strategy: Generating the same number of samples from a larger number of environments produces better results. This suggests that diversity of environments is more critical for robust VLN learning than simply increasing the number of trajectories within a limited set of environments.

6.1.7. Results under Different Evaluation Slice

The following figure (Figure 6 from the original paper) shows results under Different Evaluation Slice:

Figure 6: Results under Different Evaluation Slice. The chart compares success rates across instruction types, trajectory lengths, and scene complexities.

Further evaluation based on the three-axis evaluation framework (high-level instruction types, trajectory lengths, and scene complexities) reveals specific challenging areas:

  • Instruction Types: VLN models perform worse on Relative Relationship and Attribute-based instruction types: for both NaVILA and NaVid, SR is more than 2% lower on these types than on the others. This indicates that models struggle with fine-grained spatial reasoning and with identifying objects by detailed attributes.
  • Trajectory Length and Scene Complexity: As trajectory length increases and scene complexity (asset density) grows, model performance drops significantly. This is expected, since longer paths accumulate more errors and more cluttered scenes offer more opportunities for navigation mistakes. A minimal sketch of how such sliced evaluation can be computed follows this list.
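
The slicing itself is mechanically simple once per-episode results are logged. Below is a minimal sketch, assuming episode outcomes have been collected into a pandas DataFrame; the column names, bin labels, and toy values are our assumptions, not the paper's evaluation code.

```python
import pandas as pd

# Minimal sketch of three-axis result slicing. Column names, bin labels,
# and the toy values below are illustrative assumptions only.
episodes = pd.DataFrame({
    "model":            ["NaVILA", "NaVILA", "NaVid", "NaVid"],
    "instruction_type": ["Relative Relationship", "Attribute-based",
                         "Relative Relationship", "Attribute-based"],
    "traj_length_bin":  ["short", "long", "short", "long"],
    "scene_complexity": ["low", "high", "low", "high"],
    "success":          [1, 0, 0, 1],  # toy outcomes, not paper results
})

# Mean success rate along each evaluation axis independently.
for axis in ["instruction_type", "traj_length_bin", "scene_complexity"]:
    print(episodes.groupby(["model", axis])["success"].mean(), "\n")
```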

6.2. Data Presentation (Tables)

All relevant tables from the paper have been transcribed in the preceding sections.

6.3. Ablation Studies / Parameter Analysis

The paper primarily performs parameter analysis by varying the number of scenes and number of samples in the training data, as presented in Table 6 and Figure 5. This can be considered an ablation study on the data characteristics.

  • Findings from Parameter Analysis: The results consistently show that increasing the diversity of environments (#Scenes) has a more pronounced positive impact on VLN model performance than simply increasing the density of training samples (#Samples) within a fixed number of environments. For instance, increasing from 400 scenes to 800 scenes with 60k samples improves SR from 0.31 to 0.36, whereas doubling samples from 60k to 120k within 400 scenes only improves SR from 0.31 to 0.34. This suggests that broader environmental exposure is key to generalizability and learning robust navigation policies.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces SAGE-3D, a transformative paradigm that elevates 3D Gaussian Splatting (3DGS) from a mere perceptual scene representation to an executable, semantically, and physically aligned environment foundation for embodied navigation. The authors' key contributions include:

  • InteriorGS: The creation of the first large-scale dataset of 1,000 fully furnished indoor 3DGS reconstructions, enriched with dense object-level annotations. This dataset enables robust semantic grounding in photorealistic environments.

  • SAGE-3D Framework: A novel framework that integrates Object-Centric Semantic Grounding and Physics-Aware Execution Jointing to endow 3DGS with fine-grained semantics and accurate physical simulation capabilities.

  • SAGE-Bench: The development of the first 3DGS-based VLN benchmark, featuring 2M instruction-trajectory pairs, precise per-object physical simulation, a hierarchical instruction scheme, and advanced navigation natural continuity metrics.

Through extensive experiments, the paper provides crucial insights: 3DGS environments, while demanding higher training effort, offer faster rendering and superior generalizability to unseen VLN environments. The newly proposed continuity metrics effectively diagnose subtle flaws in navigation behavior that traditional metrics overlook. SAGE-3D thus establishes a coherent pipeline from high-fidelity data generation to physically valid evaluation, paving the way for more sophisticated embodied AI research.

7.2. Limitations & Future Work

The authors suggest several avenues for future research:

  • Richer Multi-step and Semantic-aware Navigation Tasks: The current benchmark, while introducing causal instructions, can be further expanded to encompass even more complex, multi-stage tasks requiring deeper semantic reasoning.

  • Interactive Manipulation: Leveraging the physics-aware execution layer, future work could explore tasks involving physical interaction with objects beyond simple navigation, such as picking up, moving, or operating items.

  • Broader Sim-to-Real Studies: The 3DGS environment's high fidelity and generalizability make it an ideal platform for more extensive sim-to-real transfer learning research, exploring how policies trained in SAGE-3D perform on real robots.

While not explicitly stated as limitations in a dedicated section, the paper's results imply current model limitations, particularly:

  • Convergence Difficulty: 3DGS data, despite its advantages, makes models harder to converge, suggesting a need for more efficient training strategies or model architectures capable of handling its richness.

  • Challenges with Natural Continuity: Even SOTA models trained on SAGE-Bench still exhibit suboptimal ICP and PS scores, indicating that achieving genuinely smooth and collision-free motion remains a significant challenge.

  • High-level Instruction Understanding: Models struggle significantly with high-level instructions compared to low-level ones, pointing to a gap in semantic and causal reasoning.

7.3. Personal Insights & Critique

This paper represents a significant step forward for embodied AI and VLN by leveraging the power of 3DGS.

Inspirations and Applications:

  • Bridging the Realism Gap: The integration of 3DGS for visual fidelity with mesh-based collision bodies for physics is an elegant solution to the sim-to-real gap. This hybrid approach could be highly transferable to other robotics simulation domains (e.g., manipulation, human-robot interaction) where both photorealism and accurate physics are crucial.
  • Value of Richer Semantics: The emphasis on object-level annotations and hierarchical instructions highlights the importance of moving beyond purely geometric navigation to truly intelligent, human-like VLN. This focus on causal instructions is particularly inspiring, as it reflects how humans naturally communicate goals.
  • Novel Evaluation Metrics: The introduction of CSR, ICP, and PS is a critical contribution. Traditional SR and SPL often mask undesirable behaviors like jerky movements or constant scraping. These new metrics provide a much-needed, nuanced assessment of natural continuity, which is paramount for real-world robot deployment. This approach could inspire similar refined metrics in other embodied AI tasks.
  • Data-centric AI: The findings on the importance of scene diversity over sample density for generalizability reinforce a key principle of data-centric AI. It suggests that creating more varied environments, even with fewer trajectories per environment, is more beneficial for learning robust policies.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost of Data Generation: While 3DGS offers faster rendering, the initial process of generating InteriorGS (artist-created meshes, 3,000 camera views, manual annotation, convex decomposition) seems computationally intensive. The paper doesn't detail the cost (time, human effort, compute) of creating InteriorGS and SAGE-Bench beyond the training convergence time. This could be a barrier for other researchers attempting to extend the dataset.

  • Dynamic Objects and Interactions: The current physics-aware execution jointing configures static-scene objects as static bodies and a "curated subset" as movable or articulated. For more complex interactive tasks, handling dynamic objects (e.g., objects moved by the agent or other entities) and complex articulated objects (e.g., doors, drawers with varying degrees of freedom) natively within the 3DGS-Mesh Hybrid Representation could become more challenging, especially maintaining semantic consistency during state changes.

  • Generalizability to Outdoor/Open-World Environments: While 3DGS is excellent for capturing indoor scenes, its applicability to large-scale outdoor or open-world environments (where scene structure is less constrained and lighting conditions are more varied) would need further investigation.

  • Agent Embodiment Specificity: The robot APIs support various platforms (legged, wheeled, aerial). It would be interesting to see if the continuity metrics and VLN performance vary significantly across different embodiments due to their distinct kinematics and interaction capabilities.

  • Explanation of c(t) in ICP: The paper defines c(t) as the collision intensity sequence but does not state how it is computed (e.g., whether it is binary, proportional to penetration depth, or a duration within a timestep). A clearer definition would improve the reproducibility and interpretability of ICP; two plausible readings are sketched below.
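
To make the ambiguity concrete, here are two candidate definitions of c(t) and the ICP each induces, written as a minimal sketch; both readings, the function names, and the normalization constant are our assumptions, not the paper's definition.

```python
import numpy as np

# Two plausible readings of the collision intensity sequence c(t).
# Both definitions are our assumptions, not the paper's.

def icp_binary(collided: np.ndarray) -> float:
    """ICP as the fraction of timesteps with any collision, c(t) in {0, 1}."""
    return float(collided.mean())

def icp_depth_weighted(depths: np.ndarray, max_depth: float) -> float:
    """ICP as mean penetration depth per timestep, normalized to [0, 1]."""
    return float(np.clip(depths / max_depth, 0.0, 1.0).mean())

# Toy rollout: 10 timesteps with shallow collisions at t = 3 and t = 4.
collided = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
depths   = np.array([0, 0, 0, 0.02, 0.05, 0, 0, 0, 0, 0])  # meters

print(icp_binary(collided))                       # 0.2
print(icp_depth_weighted(depths, max_depth=0.1))  # 0.07
```

The two readings can rank the same pair of trajectories differently (many glancing contacts vs. one deep penetration), which is exactly why the paper's choice matters for interpretation.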

Overall, SAGE-3D provides a robust and forward-looking foundation for embodied navigation, pushing the boundaries of what 3DGS can achieve in complex AI tasks. Its contributions in data, methodology, and evaluation metrics are likely to inspire substantial future research in the field.
