Towards Physically Executable 3D Gaussian for Embodied Navigation
TL;DR Summary
This paper introduces SAGE-3D, which addresses the limitations of 3D Gaussian Splatting in embodied navigation through object-centric semantic grounding and physics-aware execution. The released InteriorGS dataset contains 1K annotated indoor scenes, and SAGE-Bench is the first 3DGS-based VLN benchmark.
Abstract
3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is more difficult to converge, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task. Our data and code are available at: https://sage-3d.github.io.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Towards Physically Executable 3D Gaussian for Embodied Navigation".
1.2. Authors
The authors are:
- Bingchen Miao (Zhejiang University, Manycore Tech Inc)
- Rong Wei (Manycore Tech Inc)
- Zhiqi Ge (Zhejiang University)
- Xiaoquan Sun (Manycore Tech Inc, Huazhong University of Science and Technology)
- Shiqi Gao (Zhejiang University)
- Jingzhe Zhu (Zhejiang University)
- Renhan Wang (Manycore Tech Inc)
- Siliang Tang (Zhejiang University)
- Jun Xiao (Zhejiang University)
- Rui Tang (Manycore Tech Inc)
- Juncheng Li (Zhejiang University)
1.3. Journal/Conference
The publication venue is not stated in the provided text; the listed date is 2025-10-24. Given the arXiv links, the paper is currently available as a preprint and may be under submission to a 2025 conference or journal.
1.4. Publication Year
The publication year is 2025.
1.5. Abstract
This paper addresses the limitations of 3D Gaussian Splatting (3DGS)—a 3D representation method known for photorealistic real-time rendering—in Visual-Language Navigation (VLN) tasks, specifically its lack of fine-grained semantics and physical executability. To overcome these issues, the authors propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation). This new paradigm upgrades 3DGS into an executable, semantically and physically aligned environment through two main components: (1) Object-Centric Semantic Grounding, which adds object-level annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects and creates rich physical interfaces within 3DGS. The authors release InteriorGS, a dataset of 1K object-annotated 3DGS indoor scenes, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments demonstrate that 3DGS scene data, while harder to converge, exhibits strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.21307
- PDF Link: https://arxiv.org/pdf/2510.21307v2.pdf

The paper is currently available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inability of 3D Gaussian Splatting (3DGS)—despite its advantages in photorealistic real-time rendering—to be directly applied to Visual-Language Navigation (VLN) tasks due to two significant limitations:
-
Lack of fine-grained object-level semantics: Current 3DGS scenes only contain color and density information, making it impossible to uniquely identify or ground objects mentioned in natural language instructions (e.g., "go to the red chair"). This makes tasks requiring semantic understanding difficult without complex and error-prone post-processing.
-
Absence of a physically executable structure: 3DGS is primarily a volumetric rendering technique, making it challenging to derive reliable collision geometries. Without accurate physical properties, embodied agents cannot interact with the environment realistically, leading to issues like mesh penetration that hinder effective learning.

This problem is important because VLN is a core capability for Vision-Language Action (VLA) models, enabling robots to follow natural language instructions in complex indoor spaces. Training these models directly in the real world is costly and risky, leading to the widespread adoption of the sim-to-real paradigm. Reducing the sim-to-real gap is crucial, and 3DGS has been recognized as an effective tool for this due to its superior visual fidelity and view consistency compared to older scanned mesh representations (like Matterport3D and HM3D). However, 3DGS's inherent lack of semantics and physics prevents it from being a truly executable environment for VLN.
The paper's entry point is to bridge this gap by transforming 3DGS from a purely perceptual representation into a semantically and physically aligned, executable environment.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- SAGE-3D Paradigm: It proposes SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a novel paradigm that augments 3DGS with semantic granularity and physical validity, transforming it into an executable environment foundation for embodied navigation.
- InteriorGS Dataset: It constructs and releases InteriorGS, the first large-scale dataset of 1,000 fully furnished indoor 3DGS reconstructions. This dataset includes dense object-level annotations (instance IDs, object categories, bounding box information) across 554k object instances and 755 categories, providing a semantically rich foundation.
- SAGE-Bench Benchmark: It introduces SAGE-Bench, the first 3DGS-based VLN benchmark. SAGE-Bench features 2M new trajectory-instruction pairs, accurate per-object physical simulation, and rich interfaces for robot embodiments, along with a hierarchical instruction scheme and novel navigation natural continuity metrics.
- Experimental Insights:
  - 3DGS scene data renders faster but is harder to converge during training compared to scanned mesh data, suggesting higher demands due to its richness and photorealism.
  - The scene-rich, photorealistic 3DGS VLN data exhibits strong generalizability, leading to a significant performance improvement (31% SR increase) for models trained on it when tested in unseen VLN-CE environments.
  - The newly proposed Continuous Success Ratio (CSR), Integrated Collision Penalty (ICP), and Path Smoothness (PS) metrics enable a more effective study of navigation's natural continuity, addressing gaps in conventional metrics that fail to capture issues like continuous collisions and unsmooth motion.

These findings collectively make 3DGS a viable and effective environment for training and evaluating VLN agents, significantly narrowing the sim-to-real gap by providing high-fidelity, semantically rich, and physically interactive simulation environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following core concepts:
-
3D Gaussian Splatting (3DGS):
- Conceptual Definition: 3DGS is a novel 3D scene representation method that uses a collection of thousands to millions of anisotropic (non-uniform in all directions) 3D Gaussian primitives to model a scene. Each Gaussian is essentially a small, soft, transparent blob defined by its position, covariance (describing its shape and orientation), color, and opacity.
- How it works: Instead of voxels or meshes, 3DGS optimizes these individual Gaussian primitives directly from a set of input images captured from various viewpoints. During rendering, these Gaussians are projected onto the 2D image plane and blended using standard alpha compositing techniques.
- Advantages: It enables photorealistic real-time rendering from novel viewpoints, which is crucial for dynamic camera movements in navigation tasks. It also offers view-consistent appearance, meaning the scene looks consistent from any angle, unlike scanned mesh textures that can stretch or break.
- Limitations (addressed by this paper): By default, 3DGS only encodes appearance (color, density) and lacks semantic information (object categories, instance IDs) and explicit geometric surfaces for physical interaction.
-
Visual-Language Navigation (VLN):
- Conceptual Definition: VLN is an embodied artificial intelligence (AI) task where an agent (e.g., a robot) must navigate through a 3D environment to a target location or perform an action, guided solely by natural language instructions.
- Example: An instruction like "Go to the kitchen and find the red mug on the counter" requires the agent to understand the language, perceive the visual environment, and execute a sequence of actions (move, turn) to reach the goal.
- Importance: It is a key step towards more capable and general-purpose embodied AI agents that can follow human commands in real-world settings.
-
Sim-to-Real Gap:
- Conceptual Definition: This refers to the challenge of transferring policies or models trained in a simulated environment to the real world. Models trained in simulation often struggle in reality due to discrepancies in physics, visual appearance, sensor noise, and environmental complexity between the simulation and the real world.
- Relevance to VLN: High-fidelity simulations that closely resemble reality are crucial for training VLN agents effectively, as direct real-world training is often expensive, time-consuming, and risky. 3DGS is seen as a way to narrow this gap due to its photorealism.
-
Embodied Navigation / Embodied AI:
- Conceptual Definition: Embodied AI focuses on creating AI agents that perceive, act, and learn within physical or simulated environments, similar to living beings. Embodied Navigation is a subfield specifically dealing with an agent's ability to move and find its way in these environments.
- Key Aspect: These agents need to interact with the environment physically (e.g., avoid collisions, open doors), not just visually.
-
Scanned Mesh Reconstructions (e.g., Matterport3D, HM3D):
- Conceptual Definition: These are 3D models of real-world environments created by scanning them with RGB-D cameras (which capture both color and depth information). The scans are then stitched together to form a mesh (a collection of interconnected triangles forming surfaces).
- Limitations:
  - Noisy Geometry: Depth scans can be noisy, leading to imperfect, bumpy, or estimated mesh surfaces that might merge objects with their surroundings.
  - Poor Semantics: Separating individual objects from a continuous scanned mesh for semantic annotation is difficult and costly.
  - View-Inconsistent Textures: Textures are often stitched from sparse RGB viewpoints, leading to seams, stretching, or blur when viewed from novel angles. This can be distracting for agents trying to learn visual cues.
-
Partially Observable Markov Decision Process (POMDP):
- Conceptual Definition: A POMDP is a mathematical framework for modeling decision-making in environments where the agent cannot directly observe the full state of the world. Instead, the agent receives observations that provide partial information about the state.
- Components:
  - $\mathcal{I}$: Instruction space (the natural language commands).
  - $\mathcal{S}$: Continuous state space (the full underlying state of the environment and agent, which may not be fully observable).
  - $\mathcal{A}$: Action space (the set of actions the agent can take, e.g., move forward, turn).
  - $\mathcal{O}$: Multimodal observation space (what the agent perceives, e.g., RGB images, depth maps, semantic segmentation).
  - $T$: State transition function (describes how the state changes after an action, often influenced by physics).
  - $\Omega$: Observation function (describes the probability of making an observation given a state).
- Relevance: The paper defines its SAGE-3D environment as a semantics- and physics-augmented POMDP, highlighting that the agent's decision-making is based on partial observations within a dynamic, physically realistic world.
-
Convex Hull Decomposition:
- Conceptual Definition: A convex hull of a set of points is the smallest convex set that contains all the points. Convex decomposition is the process of breaking down a complex, non-convex 3D mesh (like a detailed 3D model of a chair) into a set of simpler, convex sub-meshes.
- Purpose in Physics Simulation: Convex shapes are computationally much easier to handle for collision detection and physics simulation than complex non-convex shapes. By decomposing objects into convex parts, real-time physical interactions (like collisions between a robot and an object) can be simulated efficiently and accurately.
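To make the hull operation concrete, here is a minimal 2D convex hull (Andrew's monotone chain) in Python. This is a sketch of the 2D analogue of the geometric primitive; real physics pipelines use optimized 3D decomposition libraries.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: smallest convex polygon containing `points`.

    Returns the hull vertices in counter-clockwise order.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:  # build the lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):  # build the upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # drop the last point of each half (it repeats the other half's start)
    return lower[:-1] + upper[:-1]
```

Interior points are discarded: the hull of a square plus its center is just the four corners.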
-
A*-based Shortest-Path Search:
- Conceptual Definition: A* (pronounced "A-star") is a popular pathfinding algorithm used in many areas of computer science, including robotics and game development. It is an informed search algorithm, meaning it uses heuristic functions to guide its search.
- How it works: It finds the shortest path between a starting point and a goal point in a graph. It evaluates each node (possible position) by the cost from the start to that node (the g-score) plus an estimated cost from that node to the goal (the h-score, or heuristic), prioritizing nodes that appear to lie on the shortest path.
- Relevance: The paper uses A*-based search to generate optimal trajectories for VLN agents within the constructed navigation maps.
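The search described above can be sketched on a 2D occupancy grid. This is an illustrative implementation with a Manhattan-distance heuristic, not the paper's actual planner:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle.

    Each node is scored f(n) = g(n) + h(n), where g is the cost from the
    start and h is the Manhattan distance (admissible on a 4-connected grid).
    """
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, [start])]   # (f, g, node, path so far)
    best_g = {start: 0}
    while open_set:
        _f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):  # found a cheaper route
                    best_g[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt, path + [nxt]))
    return None  # goal unreachable
```

On a 3x3 grid with a two-cell wall, the planner routes around the obstacle and returns the 7-node detour.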
3.2. Previous Works
The paper contextualizes its work by discussing previous approaches in VLN and scene representations:
-
Early VLN Work (e.g., Anderson et al., 2018; Ku et al., 2020): These studies often used Matterport3D-based discrete panoramic graphs. Agents navigated by selecting discrete viewpoints rather than via continuous motion.
-
Continuous Navigation (e.g., Krantz et al., 2020 - VLN-CE): This shifted VLN research towards continuous control in environments like Habitat (Savva et al., 2019), allowing agents more realistic movement. However, mainstream benchmarks for continuous navigation still largely relied on scanned mesh reconstructions.
-
Scene Representations:
- Scanned Mesh Reconstructions: Matterport3D (Chang et al., 2017) and HM3D (Ramakrishnan et al., 2021) are prominent datasets based on RGB-D scans. While widely used, they suffer from the texture and semantic limitations discussed above.
- 3D Gaussian Splatting (3DGS): Introduced by Kerbl et al. (2023), 3DGS emerged as an efficient 3D representation for photorealistic real-time rendering. The paper notes that 3DGS has been integrated into embodied learning contexts (e.g., Jia et al., 2025; Zhu et al., 2025b for MuJoCo/Isaac Sim), sometimes adopting a dual representation (Gaussians for rendering, meshes for collision) (Lou et al., 2025; Wu et al., 2025b), and enhanced with lighting estimation (Phongthawee et al., 2024).
-
VLN Benchmarks: The paper lists several existing benchmarks for continuous navigation tasks, highlighting their scene sources, instruction types, and 3D representations:

| Benchmark | Tasks | Scenes | Scene Source | Scene Geometry | 3D Representation |
| --- | --- | --- | --- | --- | --- |
| VLN-CE (Krantz et al., 2020) | 4.5k | 90 | MP3D | Estimated | Scanned Mesh |
| OVON (Yokoyama et al., 2024) | 53k | 181 | HM3D | Estimated | Scanned Mesh |
| GOAT-Bench (Khanna et al., 2024) | 725k | 181 | HM3D | Estimated | Scanned Mesh |
| IR2R-CE (Krantz et al., 2022) | 414 | 71 | MP3D | Estimated | Scanned Mesh |
| LHPR-VLN (Song et al., 2025) | 3.3k | 216 | HM3D | Estimated | Scanned Mesh |
| OctoNav-Bench (Gao et al., 2025) | 45k | 438 | MP3D, HM3D | Estimated | Scanned Mesh |

The key takeaway is that most existing VLN benchmarks rely on scanned mesh representations, which are limited in texture/semantic quality and physical validity. They also primarily focus on A-to-B navigation without complex causal dependencies.
3.3. Technological Evolution
The evolution of scene representations for VLN has been driven by the need to narrow the sim-to-real gap and provide more realistic training environments:
- Early Scanned Mesh Reconstructions (e.g., Matterport3D, HM3D): These provided the first large-scale 3D indoor environments for Embodied AI. They enabled early VLN research, but their limitations in semantic annotation, texture quality, and physical accuracy became apparent as research advanced.
- Transition to Continuous Environments: The shift from discrete panoramic graphs to continuous navigation in environments like Habitat highlighted the need for more robust scene representations that support free movement and physical interaction.
- Emergence of 3D Gaussian Splatting (3DGS): 3DGS represented a significant leap in photorealistic real-time rendering, offering superior visual fidelity and view consistency compared to scanned meshes. This made it a highly attractive candidate for VLN environments.
- This Paper's Contribution: SAGE-3D addresses the critical next step in this evolution: making 3DGS not just visually compelling but also semantically meaningful and physically executable. It pushes 3DGS beyond pure rendering to serve as a foundational, interactive environment for Embodied AI.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Upgrading 3DGS from Perceptual to Executable: Prior 3DGS applications in Embodied AI often treated Gaussians purely as a rendering layer, sometimes using dual representations where meshes handle physics. SAGE-3D provides a unified paradigm that integrates semantics and physics directly into the 3DGS environment structure, making it natively executable and semantically aligned.
- Addressing Semantics in 3DGS: Existing 3DGS scenes lack object-level semantics. This paper introduces Object-Level Semantic Grounding through InteriorGS (a manually annotated 3DGS dataset) and 2D semantic top-down map generation, a significant innovation for VLN tasks requiring fine-grained understanding of instructions like "go to the red chair."
- Enabling Physics in 3DGS Environments: Native 3DGS struggles to yield reliable collision geometries. SAGE-3D introduces Physics-Aware Execution Jointing by creating a 3DGS-Mesh Hybrid Representation, in which collision bodies are derived from artist-created meshes via convex decomposition and integrated with 3DGS for appearance. This provides accurate physical simulation without sacrificing visual fidelity.
- Novel VLN Benchmark: SAGE-Bench is the first 3DGS-based VLN benchmark. Unlike prior benchmarks that largely use scanned meshes and focus on A-to-B navigation, SAGE-Bench offers:
  - Ground Truth Scene Geometry (derived from artist meshes) instead of geometry estimated from scans.
  - Hierarchical Instruction Generation: It includes Instructions with Causality (high-level semantic goals like "I'm thirsty, get water from the table") in addition to low-level actions, moving beyond simple A-to-B tasks.
  - Navigation Natural Continuity Metrics: It introduces CSR, ICP, and PS to evaluate continuous motion, addressing the limitations of traditional metrics like SR and CR that do not capture nuanced agent behavior (e.g., persistent collisions, jerky movements).
- Enhanced Data Generalizability: The photorealistic and scene-rich nature of the 3DGS data in SAGE-Bench is shown to significantly improve the generalizability of VLN models to unseen environments, a crucial step for real-world deployment.
4. Methodology
4.1. Principles
The core idea of SAGE-3D is to transform 3DGS from a purely visual scene representation into a fully functional, executable environment for embodied navigation. The theoretical basis is that for an agent to effectively navigate and interact in a VLN task, the environment must provide three key components:
-
Photorealistic Appearance: To accurately mirror real-world visual complexity and reduce the sim-to-real gap. 3DGS excels at this.
-
Fine-grained Semantics: To enable agents to understand and ground natural language instructions that refer to specific objects, their attributes, and relationships.
-
Physical Executability: To allow agents to move and interact with objects in a physically plausible manner, avoiding collisions and supporting realistic dynamics.

The intuition behind SAGE-3D is to decouple the strengths of different representations: 3DGS for rendering high-fidelity visuals, and traditional mesh-based approaches for providing precise collision geometry and semantic annotations. By carefully aligning these layers, the system aims to offer the best of both worlds within a unified framework.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. SAGE-3D Paradigm Definition
The paper formally defines SAGE-3D as a process that upgrades 3DGS into an executable, semantically and physically aligned environment foundation for embodied navigation. This transformation can be written as:

$$\mathcal{E} = \Phi\big(\mathcal{G}, \mathcal{L}_{\text{sem}}, \mathcal{L}_{\text{phy}}\big)$$

- $\mathcal{G}$: The set of Gaussian primitives that constitute the 3DGS scene, providing the photorealistic visual appearance.
- $\mathcal{L}_{\text{sem}}$: The semantic layer, which includes fine-grained information such as instance/category maps and object attributes. This layer allows natural language instructions to be understood and grounded.
- $\mathcal{L}_{\text{phy}}$: The physics layer, comprising collision bodies and dynamics properties. This layer enables physically accurate interactions and movement within the environment.
- $\mathcal{E}$: The resulting executable environment, which supports continuous Visual-Language Navigation and related embodied AI tasks.

The resulting environment is formalized as a semantics- and physics-augmented Partially Observable Markov Decision Process (POMDP):

$$\mathcal{E} = \langle \mathcal{I}, \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \mathcal{L}_{\text{sem}}, \mathcal{L}_{\text{phy}} \rangle$$

- $\mathcal{I}$: The instruction space, which encompasses the natural language commands provided to the agent.
- $\mathcal{S}$: The continuous state space, representing the complete underlying state of the environment and the agent's pose.
- $\mathcal{A}$: The action space, defining the set of continuous or discrete actions the agent can perform (e.g., velocity commands, turning, moving).
- $\mathcal{O}$: The multimodal observation space, which includes all sensor data available to the agent (e.g., RGB, depth, semantic segmentation, poses, contact events).
- $T$: The physics-driven state transition function, describing how the environment and agent state evolve over time based on actions and physical laws.
- $R$: The rendering function, which generates observations from the current state, primarily driven by the 3DGS appearance.
- $\mathcal{L}_{\text{sem}}$: The semantic layer, explicitly included in the POMDP definition, indicating that semantic information is an integral part of the environment's state and observations.
- $\mathcal{L}_{\text{phy}}$: The physics layer, also explicitly included, signifying that physical properties and interactions are central to the environment's dynamics and agent behavior.

This paradigm ensures that the high-fidelity rendering of 3DGS is preserved while object-level semantics and physical executability are introduced, making 3DGS a practical foundation for training and evaluating embodied agents.
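The augmented POMDP above can be sketched as a minimal Python interface. All class and field names here are illustrative assumptions, not the paper's API; the point is the separation of the physics-driven transition $T$ from the rendering function $R$:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class AugmentedPOMDP:
    """Minimal sketch of the tuple <I, S, A, O, T, R, L_sem, L_phy>.

    `transition` plays the role of T (physics-driven state update) and
    `render` plays the role of R (state -> multimodal observation dict).
    """
    transition: Callable[[Any, Any], Any]
    render: Callable[[Any], Dict[str, Any]]
    semantic_layer: Dict[str, Any] = field(default_factory=dict)   # L_sem
    physics_layer: Dict[str, Any] = field(default_factory=dict)    # L_phy

    def step(self, state, action):
        """One interaction step: apply physics, then render observations."""
        next_state = self.transition(state, action)
        return next_state, self.render(next_state)
```

A toy instantiation: a 1-D agent whose state is a position and whose action is a displacement, with the observation being its pose.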
4.2.2. Object-Level Semantic Grounding
To address the lack of fine-grained semantics in conventional 3DGS, the paper introduces InteriorGS and a 2D semantic top-down map generator.
4.2.2.1. InteriorGS Dataset Construction
The InteriorGS dataset is constructed with the goal of providing object-level semantic annotations for 3DGS scenes.
- Source Data: The 3DGS data in InteriorGS is sampled from artist-created mesh scenes. These scenes provide clean, accurate geometry and object boundaries, which are critical for reliable semantic annotation and, later, for extracting collision bodies.
- 3DGS Reconstruction: To achieve high-fidelity 3DGS reconstructions, especially in occlusion-rich indoor environments, an average of 3,000 camera views are rendered per scene using a ray tracing renderer. The poses from this renderer are then used with the open-source GSplat pipeline (Ye et al., 2025) to estimate the 3DGS parameters (positions, colors, opacities, covariances of Gaussians).
  - Camera Placement Policies (detailed in Appendix B):
    - Perimeter-aware floorplan sweeps ("surround"): For each room polygon, a global camera budget is distributed proportionally to its perimeter. Cameras are uniformly spaced along the polygon's perimeter, with optical axes aligned to inward edge normals. Three tangential baselines (left/center/right) and three vertical tiers are instantiated:
      - Outer tiers: lower (150mm above the floor), middle (mid-height), and upper (500mm below the ceiling), each with its own pitch.
      - Interior tiers (if applicable): heights interpolated; the upper and lower tiers use their own pitch values, and the middle tier matches the outer middle.
    - Volume-uniform sampling: The global camera budget is distributed across rooms proportionally to their volume. Poisson-disk sampling is used to draw 3D positions, ensuring space-filling uniformity. At each sampled position, six cameras with canonical yaw-pitch templates are instantiated, with a small random perturbation applied to their orientations.
    - Viewpoint Selection: Viewpoints are selected to be at appropriate distances from mesh surfaces. Viewpoints too close to the nearest mesh surface (below a safety threshold) are discarded to mitigate undersampling-induced 3DGS underfitting.
- Manual Annotation: After 3DGS reconstruction, the scenes undergo careful manual labeling and double verification to add object-level semantics. This includes:
  - Object Categories: e.g., "chair," "table," "sofa."
  - Instance IDs: Unique identifiers for each individual object instance.
  - Bounding Box Information: 3D axis-aligned bounding boxes for each object.
- Dataset Content: InteriorGS comprises 1,000 high-fidelity indoor 3DGS scenes (752 residential interiors and 248 public spaces such as concert halls). It contains over 554k object instances across 755 categories, providing a dense, semantically consistent, and diverse foundation.
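The perimeter-proportional camera budgeting used by the "surround" placement policy above can be sketched as follows. The largest-remainder rounding is an assumption about how fractional allocations are resolved; the paper does not specify this detail:

```python
import math

def perimeter(poly):
    """Total boundary length of a closed 2D polygon given as a vertex list."""
    return sum(math.dist(poly[i], poly[(i + 1) % len(poly)])
               for i in range(len(poly)))

def allocate_budget(rooms, total_cameras):
    """Split a global camera budget across rooms proportionally to perimeter.

    Largest-remainder rounding keeps the allocated total exactly equal to
    the budget even when the proportional shares are fractional.
    """
    perims = [perimeter(r) for r in rooms]
    total = sum(perims)
    raw = [total_cameras * p / total for p in perims]
    counts = [int(x) for x in raw]
    leftover = total_cameras - sum(counts)
    # hand leftover cameras to rooms with the largest fractional parts
    for i in sorted(range(len(raw)), key=lambda i: raw[i] - counts[i],
                    reverse=True)[:leftover]:
        counts[i] += 1
    return counts
```

For a unit square (perimeter 4) next to a 2x2 room (perimeter 8), a budget of 12 cameras splits 4/8.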
4.2.2.2. 2D Semantic Top-Down Map Generation
Since 3DGS lacks inherent semantics and discrete entities for traditional NavMesh generation, a 2D semantic top-down map is designed.
- Projection: Annotated 3D objects from InteriorGS (initially stored as axis-aligned 3D bounding boxes) are projected onto the ground plane.
- Refinement: Each object's footprint is refined into an irregular mask. This is done by sampling points from the object's surface, projecting them onto the ground plane, and then computing the 2D convex hull of these projected points.
- Formula for Mask Generation:

$$M_o = \mathrm{Merge}\big(\mathrm{ConvexHull2D}(\Pi(P_o))\big)$$

  - $M_o$: The 2D mask generated for a specific object $o$. This mask delineates the object's footprint on the ground plane.
  - $P_o$: The set of sampled surface points belonging to the 3D object $o$. These points are representative of the object's exterior.
  - $\Pi$: The projection operator that maps a 3D point onto the ground plane.
  - $\mathrm{ConvexHull2D}$: The 2D convex-hull operator, which takes a set of 2D points (the projected surface points) and returns the smallest convex polygon that encloses all of them, creating a tight boundary around the object's base.
  - $\mathrm{Merge}$: A merging function that consolidates multi-view masks into a single, consistent footprint. This step matters when an object's footprint is produced from multiple views, ensuring a unified representation.
- Additional Annotations: Doors are tagged with their state (open/closed/half-open), and walls are marked as non-traversable, providing critical information for path planning and instruction generation.
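A simplified sketch of the projection and Merge steps, using a rasterized occupancy grid as a stand-in for the convex-hull polygon (the grid resolution and size are illustrative choices, not from the paper):

```python
def footprint_mask(points_3d, cell=0.1, size=32):
    """Project sampled 3D surface points onto the ground plane (drop z)
    and rasterize them into a boolean occupancy grid."""
    mask = [[False] * size for _ in range(size)]
    for x, y, _z in points_3d:          # Pi: (x, y, z) -> (x, y)
        i, j = int(y / cell), int(x / cell)
        if 0 <= i < size and 0 <= j < size:
            mask[i][j] = True
    return mask

def merge_masks(masks):
    """Merge(): consolidate multi-view masks with a cell-wise OR."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[any(m[i][j] for m in masks) for j in range(cols)]
            for i in range(rows)]
```

Two single-view masks covering adjacent cells merge into one footprint occupying both.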
4.2.3. Physics-Aware Execution Jointing
To enable physical interaction and address issues like mesh penetration in 3DGS, the paper introduces object-level collision geometry and a physics simulation interface.
4.2.3.1. Physics Simulation with 3DGS-Mesh Hybrid Representation
This component is crucial for making the environment physically executable.
- Collision Geometry Extraction: Starting from the artist-created triangle meshes of each object (which are clean and accurate, unlike scanned meshes), the CoACD (Wei et al., 2022) algorithm is applied for convex decomposition. This process breaks down complex object meshes into simpler convex shapes, which are ideal for efficient collision detection and physics simulation. These resulting convex shapes form the per-object collision bodies.
- 3DGS-Mesh Hybrid Representation Assembly: A USDA scene (Universal Scene Description Archive, a file format for 3D content interchange) is assembled. In this USDA scene:
  - The collision bodies (derived from meshes) are authored as invisible rigid shapes. These rigid shapes are responsible for driving contact and dynamics during simulation.
  - The 3DGS file (exported from 3DGUT (Wu et al., 2025a), which supports 3DGS assets from USDZ files in Isaac Sim 5.0+) provides the photorealistic appearance and remains visible.
- Decoupled Design: This decoupled design means that 3DGS handles the high-fidelity rendering, while the mesh-based collision bodies provide accurate geometry for physics simulation. Each object is instantiated as a USD prim (a fundamental building block in USD) and augmented with rigid-body and contact parameters. Static scene objects are set as static bodies, while a curated subset can be configured as movable or articulated to support advanced interactions.
- Benefits: This hybrid representation avoids the need to ray trace detailed artist meshes at runtime, preserves the visual quality of 3DGS, and supplies accurate collision geometry for robust physics simulation with connectivity to robotics APIs.
4.2.3.2. Agents, Control, and Observations in a Continuous Environment
The SAGE-3D environment provides comprehensive interfaces for embodied agents:
- Robot APIs: The simulator exposes robot APIs compatible with various ground platforms (e.g., Unitree G1/Go2/H1 legged and wheeled robots) and aerial robots (e.g., quadrotor UAVs).
- Action Interfaces:
  - Discrete commands: e.g., turn, forward, stop.
  - Continuous control: velocity commands $(v, \omega)$ for ground robots (linear velocity $v$, angular velocity $\omega$) and 6-DoF velocity/attitude commands for UAVs. These actions are executed in a continuous environment (metric 3D space, with no teleportation between nodes).
- Observation Space: The environment provides synchronized multimodal observations, including:
  - RGB images: Color images from the agent's viewpoint.
  - Depth maps: Distance information to surfaces.
  - Semantic segmentation: Pixel-level labels for objects.
  - Poses: The agent's 3D position and orientation.
  - Contact events: Information about collisions.
- Safety Features: Built-in collision detection, stuck/interpenetration monitoring, and recovery mechanisms ensure safe and robust simulation.
- Performance: Offline-generated collision bodies are cached to accelerate loading and guarantee stable, repeatable evaluation.
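One way the discrete command interface can map onto the continuous (v, ω) velocity interface is sketched below. The speeds and integration step are illustrative values, not parameters from the paper:

```python
import math

SPEED, TURN_RATE, DT = 0.5, math.pi / 6, 1.0   # m/s, rad/s, s (illustrative)

def command_to_velocity(cmd):
    """Map a discrete command (turn/forward/stop) to (linear v, angular omega)."""
    return {"forward": (SPEED, 0.0),
            "turn_left": (0.0, TURN_RATE),
            "turn_right": (0.0, -TURN_RATE),
            "stop": (0.0, 0.0)}[cmd]

def integrate(pose, cmd):
    """Advance a planar pose (x, y, heading) by one control step."""
    x, y, th = pose
    v, w = command_to_velocity(cmd)
    return (x + v * math.cos(th) * DT,
            y + v * math.sin(th) * DT,
            th + w * DT)
```

In a real simulator the physics engine would integrate the velocity command against contact constraints; here the pose update is purely kinematic.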
5. Experimental Setup
5.1. Datasets
The experiments primarily leverage two novel datasets introduced in the paper, InteriorGS and SAGE-Bench, and also involve the VLN-CE benchmark for generalizability testing.
5.1.1. InteriorGS
- Source: InteriorGS is a dataset of 1,000 high-fidelity indoor 3DGS scenes, sampled from artist-created mesh scenes.
- Scale and Characteristics:
  - 752 residential interior scenes and 248 public spaces (e.g., concert halls, amusement parks, gyms).
  - Over 554k object instances across 755 categories.
  - Double-verified object-level annotations, including object categories, instance IDs, and bounding box information.
  - The 3DGS data is reconstructed from an average of 3,000 camera views per scene using ray tracing and the GSplat pipeline.
  - Example: Figure 9 in the paper (More Visualization of InteriorGS) shows examples of highly detailed, photorealistic indoor scenes, demonstrating the quality and diversity of the data. Figure 10 (Distribution of non-home scenes of InteriorGS) illustrates the distribution of public-space categories, and Figure 11 (Distribution of assets of InteriorGS) shows the distribution of asset types (e.g., Furniture, Lighting, Books). For instance, books are the most numerous assets, totaling 80,096 instances.
- Purpose: Provides a dense, semantically consistent, and broadly diverse foundation for training and evaluating embodied agents, particularly for VLN tasks requiring fine-grained object understanding.
5.1.2. SAGE-Bench
- Source: SAGE-Bench is the first 3DGS-based VLN benchmark, built upon the InteriorGS scenes.
- Scale and Characteristics:
  - Contains 2M new instruction-trajectory pairs.
  - Includes 554k detailed collision bodies (derived from the original mesh scenes for physics simulation).
- Hierarchical Instruction Scheme:
  - High-level instructions: emphasize task semantics and human-oriented intent, addressing causal dependencies. Categories include Add Object (e.g., "Please move the book from the coffee table to the bookshelf"), Scenario Driven (e.g., "I'm thirsty, please bring me a drink from the fridge"), Relative Relationship (e.g., "Move to the chair next to that table"), Attribute-based (e.g., "Find an empty table"), and Area-based (e.g., "Walk to the kitchen area").
  - Low-level instructions: focus on control and kinematic evaluation, including primitive actions (Base-Action: "Move forward two steps") and goal-directed point-to-point navigation (Single-Goal: "Walk to the bedroom").
- Trajectory Generation: Uses the collision bodies from Section 2.1 to construct a navigation map (combining a 1.2 m-height occupancy map with the 2D semantic map). A*-based shortest-path search then generates trajectories, with a cost function integrating free-space distance, narrow-passage penalties, and area preferences. Start-end pairs are sampled across different rooms/objects, with a minimum safety distance.
- Splits: A test split of 1,148 samples (944 high-level, 204 low-level, across 35 distinct scenes) is held out, with the remainder used for training and validation.
- Purpose: To benchmark VLN models in high-fidelity, physically executable environments with complex, semantically rich instructions, pushing beyond traditional A-to-B navigation.
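The trajectory-generation step described above can be sketched as a standard A* search over an occupancy grid whose step cost combines free-space distance with a narrow-passage penalty. This is a minimal illustration under assumed weights (the `narrow_penalty` factor and the neighbour-count clearance proxy are not the paper's exact cost function, and area preferences are omitted):

```python
import heapq
from itertools import count

def astar(grid, start, goal, narrow_penalty=2.0):
    """A* over a 2D occupancy grid (0 = free, 1 = occupied).
    Step cost = 1 plus a penalty that grows in narrow passages."""
    rows, cols = len(grid), len(grid[0])
    tie = count()  # tie-breaker so the heap never compares cells/parents

    def clearance(cell):
        # Free 4-neighbours; fewer free neighbours = narrower passage.
        r, c = cell
        return sum(1 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                   if 0 <= r + dr < rows and 0 <= c + dc < cols
                   and grid[r + dr][c + dc] == 0)

    def h(cell):  # Manhattan heuristic, admissible since step cost >= 1
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(h(start), 0.0, next(tie), start, None)]
    parents, best = {}, {start: 0.0}
    while open_set:
        _, g, _, cur, parent = heapq.heappop(open_set)
        if cur in parents:          # already expanded
            continue
        parents[cur] = parent
        if cur == goal:             # reconstruct path back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        r, c = cur
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0):
                step = 1.0 + narrow_penalty * (4 - clearance(nxt)) / 4.0
                ng = g + step
                if ng < best.get(nxt, float("inf")):
                    best[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, next(tie), nxt, cur))
    return None  # no path
```

With the penalty term, two paths of equal length are broken in favour of the one passing through cells with more free neighbours, which mimics the "narrow-passage penalty" behaviour described in the text.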
5.1.3. VLN-CE
- Purpose: Used to evaluate the generalizability of models trained on SAGE-Bench to unseen environments from a different source (scanned mesh).
- Source: Based on Matterport3D (MP3D) scenes.
5.2. Evaluation Metrics
5.2.1. VLN Task Metrics
For the VLN task, both conventional metrics and new natural continuity metrics are used.
- Success Rate (SR):
  - Conceptual Definition: Measures the proportion of episodes in which the agent reaches the target location within a predefined tolerance at the end of the navigation episode. It is a binary (0 or 1) judgment per episode: either the goal is reached, or it is not.
  - Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} $
  - Symbol Explanation:
    - Number of successful episodes: the count of episodes where the agent's final position is within the goal region.
    - Total number of episodes: the total count of navigation attempts.
- Oracle Success Rate (OSR):
  - Conceptual Definition: Similar to SR, but an episode counts as successful if any point along the agent's trajectory (not just the final one) is within the goal region. This evaluates whether the agent ever reached the goal, even if it later moved away or failed to stop at the correct point.
  - Mathematical Formula: $ \mathrm{OSR} = \frac{\text{Number of episodes where goal was visited}}{\text{Total number of episodes}} $
  - Symbol Explanation:
    - Number of episodes where goal was visited: the count of episodes in which at least one point on the agent's path was within the goal region.
    - Total number of episodes: the total count of navigation attempts.
- Success Weighted by Path Length (SPL):
  - Conceptual Definition: A metric that balances success rate with path efficiency. It penalizes agents that take long, inefficient paths to the goal; a higher SPL means both successful navigation and a relatively short path.
  - Mathematical Formula: $ \mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[S_i] \, \frac{L_{GT,i}}{\max(L_{Act,i},\, L_{GT,i})} $
  - Symbol Explanation:
    - $N$: total number of episodes.
    - $\mathbf{1}[S_i]$: an indicator that is 1 if episode $i$ is successful and 0 otherwise.
    - $L_{GT,i}$: length of the shortest (ground-truth) path for episode $i$.
    - $L_{Act,i}$: length of the agent's actual path for episode $i$.
    - $\max(L_{Act,i}, L_{GT,i})$: ensures the ratio is always $\leq 1$ and handles cases where the actual path is shorter than the ground truth (though usually $L_{Act,i} \geq L_{GT,i}$).
- Collision Rate (CR):
  - Conceptual Definition: Measures the proportion of episodes in which the agent experienced one or more collisions with the environment; it is a binary indicator per episode.
  - Mathematical Formula: $ \mathrm{CR} = \frac{\text{Number of episodes with collisions}}{\text{Total number of episodes}} $
  - Symbol Explanation:
    - Number of episodes with collisions: the count of episodes in which the agent registered at least one collision event.
    - Total number of episodes: the total count of navigation attempts.
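The four conventional metrics above reduce to simple aggregations over per-episode records. A minimal sketch, assuming illustrative field names (`success`, `goal_visited`, `collided`, `path_len`, `gt_len`) rather than the SAGE-Bench evaluation API:

```python
# Compute SR, OSR, SPL, and CR from per-episode records, following the
# formulas defined above. The dict keys are illustrative assumptions.

def vln_metrics(episodes):
    """episodes: list of dicts with keys
       'success'      (bool, goal reached at episode end),
       'goal_visited' (bool, goal reached at any point on the path),
       'collided'     (bool, at least one collision event),
       'path_len'     (float, agent's actual path length L_Act),
       'gt_len'       (float, shortest-path length L_GT)."""
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    osr = sum(e["goal_visited"] for e in episodes) / n
    cr = sum(e["collided"] for e in episodes) / n
    # SPL: successful episodes contribute L_GT / max(L_Act, L_GT), else 0.
    spl = sum(
        e["gt_len"] / max(e["path_len"], e["gt_len"]) if e["success"] else 0.0
        for e in episodes
    ) / n
    return {"SR": sr, "OSR": osr, "SPL": spl, "CR": cr}
```

Note how SPL discounts a success by its detour: an agent that succeeds but walks twice the shortest path contributes only 0.5 to the sum.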
5.2.2. Navigation Natural Continuity Metrics
These three novel metrics are introduced to assess VLN model performance from the perspective of continuous motion, addressing limitations of conventional metrics.
- Continuous Success Ratio (CSR):
  - Conceptual Definition: Measures the fraction of time the agent stays within a permissible corridor around the reference path while also satisfying task conditions. Unlike SR, which is a 0/1 judgment at the endpoint, CSR reflects "goal-consistent" behavior throughout the entire episode, quantifying how well the agent follows the intended path.
  - Mathematical Formula: $ \mathrm{CSR} = \frac{1}{T} \sum_{t=1}^{T} s(t) $
  - Symbol Explanation:
    - $T$: the total length (duration or number of time steps) of the trajectory.
    - $s(t)$: a binary indicator at time step $t$; $s(t) = 1$ if the agent's position is within the permissible corridor and the task conditions are satisfied at time $t$, and $s(t) = 0$ otherwise.
    - Permissible corridor: defined by buffering the reference path with a tolerance radius.
- Integrated Collision Penalty (ICP):
  - Conceptual Definition: Quantifies the time-averaged collision intensity along the trajectory, capturing both the frequency and the duration of contacts. It goes beyond Collision Rate (CR) by differentiating brief, occasional contacts from prolonged, persistent scraping against obstacles, providing a more nuanced measure of physical interaction.
  - Mathematical Formula: $ \mathrm{ICP} = \frac{1}{T} \sum_{t=1}^{T} c(t) $
  - Symbol Explanation:
    - $T$: the total length (duration or number of time steps) of the trajectory.
    - $c(t)$: the collision intensity at time step $t$, with $c(t) \in [0, 1]$; 0 indicates no collision and 1 a continuous or severe collision. The paper does not explicitly define how $c(t)$ is calculated, but implies it is a normalized measure of collision severity/duration at each step.
- Path Smoothness (PS):
  - Conceptual Definition: Evaluates the smoothness of the agent's path via consecutive changes in heading angle. Higher PS values indicate smoother paths with fewer abrupt turns and acceleration changes, which is desirable for real robots for feasibility and stability reasons.
  - Mathematical Formula: $ \mathrm{PS} = 1 - \frac{1}{T-1} \sum_{t=2}^{T} \min\!\left( \frac{|\Delta\theta_t|}{\pi},\, 1 \right) $
  - Symbol Explanation:
    - $T$: the total length (number of time steps) of the trajectory.
    - $\theta_t$: the agent's heading angle at trajectory time step $t$.
    - $\Delta\theta_t = \theta_t - \theta_{t-1}$: the change in heading between two consecutive time steps.
    - $\min(|\Delta\theta_t| / \pi, 1)$: normalizes the absolute heading change by $\pi$ (180 degrees), keeping the value in $[0, 1]$ for typical turns; the $\min$ caps it at 1 for very large turns or measurement noise.
    - Averaging the normalized heading changes and subtracting from 1 yields a smoothness score: 1 is perfectly smooth, and lower values indicate jerkier motion.
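The three continuity metrics above are all time averages over a trajectory, so they are short to implement. A minimal sketch following the definitions in the text (the exact corridor test and collision-intensity computation in SAGE-Bench may differ):

```python
import math

# Sketch of CSR, ICP, and PS as defined above: time-averaged indicators
# and a normalized heading-change penalty.

def csr(in_corridor):
    """in_corridor: list of 0/1 flags s(t). Fraction of steps on-corridor."""
    return sum(in_corridor) / len(in_corridor)

def icp(intensity):
    """intensity: list of c(t) values in [0, 1].
    Time-averaged collision intensity."""
    return sum(intensity) / len(intensity)

def ps(headings):
    """headings: list of heading angles theta_t in radians.
    PS = 1 - mean(min(|delta theta| / pi, 1))."""
    deltas = [
        # atan2 wraps the difference into (-pi, pi] before normalizing.
        min(abs(math.atan2(math.sin(b - a), math.cos(b - a))) / math.pi, 1.0)
        for a, b in zip(headings, headings[1:])
    ]
    return 1.0 - sum(deltas) / len(deltas)
```

For example, a trajectory that turns 90 degrees at every step scores PS = 0.5, while a straight-line trajectory scores PS = 1.0, matching the intuition that PS penalizes large, mechanical turning angles.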
5.2.3. No-goal-Nav Task Metrics
For the No-goal-Nav task, where the goal is to explore the environment safely, two metrics are used:
- Episode Time: Measures the total duration (in seconds) an agent navigates in the environment until the episode terminates (e.g., due to collision or maximum time limit).
- Explored Areas: Measures the total unique area (e.g., in square meters) covered by the agent during an episode.
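One common way to compute an explored-area figure like the one above is to discretize visited positions into grid cells and count unique cells. This is a sketch under an assumed 0.5 m cell size; the paper does not specify its area-accounting method.

```python
# Hypothetical explored-area estimate: bucket visited (x, y) positions
# into grid cells and sum the cell areas. Cell size is an assumption.

def explored_area(positions, cell_size=0.5):
    """positions: list of (x, y) coordinates in metres.
    Returns the total unique covered area in square metres."""
    cells = {(int(x // cell_size), int(y // cell_size)) for x, y in positions}
    return len(cells) * cell_size * cell_size
```

A coarser cell size inflates the estimate and a finer one understates it, so the cell size must be fixed across all evaluated agents for the metric to be comparable.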
5.3. Baselines
The paper evaluates a wide range of models, categorized as:
- Closed-source MLLMs as Agent:
  - Qwen-VL-MAX (Bai et al., 2023)
  - GPT-4.1
  - GPT-5
- Open-source MLLMs as Agent:
  - Qwen2.5-VL-7B (Bai et al., 2023)
  - InternVL-2.5-8B (Zhu et al., 2025a)
  - InternVL-3-8B (Chen et al., 2024)
  - Llama-3.2-11B
- Vision-Language Models (VLN-specific models):
  - NaviLLM (Zheng et al., 2024)
  - NavGPT-2 (Zhou et al., 2024)
  - CMA (Krantz et al., 2020)
  - NaVid (Zhang et al., 2024)
  - NaVILA (Cheng et al., 2025)
  - NaVid-base: NaVid's pre-trained model navid-7b-full-224.
  - NaVid-SAGE (Ours): NaVid-base fine-tuned on SAGE-Bench data.
  - NaVILA-base: NaVILA's pre-trained model navila-siglip-lama3-8b-v1.5-pretrain.
  - NaVILA-SAGE (Ours): NaVILA-base fine-tuned on SAGE-Bench data.

These baselines are representative, covering a spectrum from general Multimodal Large Language Models (MLLMs) to specialized VLN models, including recent state-of-the-art (SOTA) approaches. The inclusion of the base models and their SAGE-trained counterparts (NaVid-SAGE, NaVILA-SAGE) allows direct evaluation of the impact of training on the new 3DGS-based data.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Overall Comparison on SAGE-Bench
The following are the results from Table 2 of the original paper:
(Columns SR through PS are VLN high-level-instruction metrics; Episode Time and Explored Areas are Nogoal-Nav metrics.)

| Methods | SR ↑ | OSR ↑ | SPL ↑ | CR ↓ | CSR ↑ | ICP ↓ | PS ↑ | Episode Time ↑ | Explored Areas ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source MLLMs as Agent | |||||||||
| Qwen-VL-MAX | 0.14 | 0.25 | 0.12 | 0.85 | 0.21 | 0.41 | 0.79 | 64.74 | 6.40 |
| GPT-4.1 | 0.13 | 0.21 | 0.12 | 0.72 | 0.19 | 0.35 | 0.81 | 67.70 | 3.00 |
| GPT-5 | 0.12 | 0.18 | 0.11 | 0.63 | 0.18 | 0.24 | 0.86 | 64.60 | 2.16 |
| Open-source MLLMs as Agent | |||||||||
| Qwen2.5-VL-7B | 0.13 | 0.14 | 0.13 | 0.71 | 0.21 | 0.27 | 0.87 | 42.19 | 6.88 |
| InternVL-2.5-8B | 0.10 | 0.13 | 0.10 | 0.52 | 0.14 | 0.33 | 0.88 | 28.82 | 4.28 |
| InternVL-3-8B | 0.12 | 0.20 | 0.11 | 0.64 | 0.17 | 0.32 | 0.82 | 34.70 | 6.34 |
| Llama-3.2-11B | 0.13 | 0.18 | 0.14 | 0.74 | 0.16 | 0.29 | 0.83 | 38.45 | 6.68 |
| Vision-Language Model | |||||||||
| NaviLLM | 0.05 | 0.06 | 0.05 | 0.21 | 0.09 | 0.24 | 0.90 | 18.73 | 5.74 |
| NavGPT-2 | 0.10 | 0.12 | 0.11 | 0.33 | 0.14 | 0.29 | 0.83 | 24.51 | 3.36 |
| CMA | 0.13 | 0.15 | 0.14 | 0.54 | 0.26 | 0.28 | 0.86 | 44.26 | 3.22 |
| NaVid | 0.15 | 0.17 | 0.15 | 1.24 | 0.29 | 0.33 | 0.89 | 56.13 | 4.28 |
| NaVILA | 0.39 | 0.47 | 0.34 | 3.28 | 0.48 | 0.61 | 0.68 | 77.82 | 8.40 |
| NaVid-base | 0.10 | 0.13 | 0.10 | 0.33 | 0.15 | 0.28 | 0.84 | 20.37 | 3.42 |
| NaVid-SAGE (Ours) | 0.36 | 0.46 | 0.32 | 2.12 | 0.48 | 0.66 | 0.54 | 60.35 | 5.66 |
| NaVILA-base | 0.21 | 0.26 | 0.22 | 3.53 | 0.33 | 0.72 | 0.41 | 58.26 | 6.52 |
| NaVILA-SAGE (Ours) | 0.46 | 0.55 | 0.48 | 2.67 | 0.57 | 0.54 | 0.74 | 82.48 | 8.74 |
- SAGE-Bench is Challenging: The results show that SAGE-Bench presents a significant challenge for current VLN models and MLLMs: most models achieve SR values no higher than 0.15. For instance, NaVid, which performs well on VLN-CE R2R Val-Unseen, achieves only 0.15 SR on SAGE-Bench, and NaVILA's SR drops from 0.54 (VLN-CE) to 0.39. This reflects the higher complexity and demands of the 3DGS-based, semantically and physically rich environment.
- MLLMs' Inherent VLN Capability: MLLMs (both open- and closed-source) show a baseline VLN capability, achieving SRs between 0.10 and 0.14. This is comparable to, and in some OSR cases surpasses, dedicated VLN models such as CMA (0.13 SR) and NaVid (0.15 SR); for example, InternVL-3 achieves an OSR of 0.20, higher than NaVid's 0.17. This suggests that the multimodal understanding of MLLMs inherently equips them with some capacity for navigation, even without explicit VLN training on this benchmark.
- Impact of SAGE Training: Fine-tuning on SAGE-Bench data significantly improves performance. NaVILA-SAGE achieves the best results, with an SR of 0.46, OSR of 0.55, and SPL of 0.48, outperforming all other baselines, including the original NaVILA. This validates the effectiveness of the proposed dataset and paradigm.
- Continuity Metrics Insights: For models with weak VLN performance, the CR, ICP, and PS metrics are often not comparable, since these models fail to navigate meaningfully (e.g., making random or single-action predictions). For NaVILA and NaVILA-SAGE, however, the metrics are informative: while NaVILA attains a relatively high SR (0.39), its ICP of 0.61 indicates sustained collisions, and its PS of 0.68 reflects unsmooth motion. NaVILA-SAGE improves CSR to 0.57 but still shows a relatively high ICP (0.54) and a PS of only 0.74, suggesting room for improvement in natural continuity despite higher task success.
6.1.2. Insight 1: 3DGS Scene Data Renders Faster but is Harder to Converge
The following are the results from Table 3 of the original paper:
| Environment Type | Avg. Render Time / Frame (ms) ↓ | Avg. Memory (MB) ↓ | Iters to SR=40% (k) ↓ | Time-to-SR=40% (hrs) ↓ |
|---|---|---|---|---|
| Scanned Mesh (MP3D/HM3D) | 16.7 | 850 | 120 | 4.8 |
| 3DGS-Mesh Hybrid Representation (Ours) | 6.2 | 220 | 160 | 6.2 |
This comparison between scanned mesh data and 3DGS data (using NaVILA-base on NVIDIA H20 GPU) reveals a trade-off:
- Rendering Speed: 3DGS data significantly outperforms scanned mesh data in rendering speed, with a per-frame rendering time of 6.2 ms and average memory usage of 220 MB, compared to 16.7 ms and 850 MB for scanned mesh. This highlights 3DGS's efficiency for real-time visual feedback.
- Training Convergence: Despite faster rendering, 3DGS data exhibits slower convergence during training. To reach a 40% SR, the 3DGS-based model required 160k iterations and 6.2 hours, whereas the scanned-mesh-based model achieved it in 120k iterations and 4.8 hours. This suggests that the richness and photorealism of 3DGS data, which better mirror real-world complexity, also present a greater challenge for models to learn from.
6.1.3. Insight 2: 3DGS Scene Data Exhibits Strong Generalizability
The following are the results from Table 4 of the original paper:
| Methods | SR ↑ | OSR ↑ | SPL ↑ |
|---|---|---|---|
| Seq2Seq | 0.25 | 0.37 | 0.22 |
| Navid-base | 0.22 | 0.32 | 0.17 |
| Navid-SAGE (Ours) | 0.31 | 0.42 | 0.29 |
| CMA | 0.32 | 0.40 | 0.30 |
| NaVid | 0.37 | 0.49 | 0.36 |
| NaVILA-base | 0.29 | 0.38 | 0.27 |
| NaVILA-SAGE (Ours) | 0.38 | 0.51 | 0.36 |
| NaVILA | 0.50 | 0.58 | 0.45 |
To evaluate the generalizability of the 3DGS-based scene data, NaVILA-SAGE and NaVid-SAGE (trained exclusively on SAGE-Bench) were tested on the VLN-CE benchmark (R2R Val-Unseen split), which uses scanned mesh environments.
- Significant Improvement: Both NaVILA-SAGE and NaVid-SAGE achieved substantial performance improvements over their respective base models (pre-trained but not fine-tuned on SAGE-Bench data; the table also lists the original NaVILA and NaVid results, which reflect training on existing VLN data).
  - NaVILA-SAGE improved SR by 31% relatively (from 0.29 to 0.38) and OSR by 34% (from 0.38 to 0.51) compared to NaVILA-base.
  - Similar gains were observed for NaVid-SAGE, whose SR increased from 0.22 to 0.31 and OSR from 0.32 to 0.42 compared to NaVid-base.
- Reasoning: This strong generalizability suggests that the scene-rich, photorealistic 3DGS data from SAGE-Bench provides a more robust, real-world-aligned learning experience, enabling models to transfer learned navigation skills to different, unseen environments, even those with a different underlying representation (scanned mesh).
6.1.4. Insight 3: Natural Continuity Metrics Enable Effective Study of Navigation's Natural Continuity
The analysis of the CSR, ICP, and PS metrics from Table 2 provides critical insights into navigation quality beyond traditional SR and CR.
- CSR vs. SR: CSR is generally higher than SR (e.g., NaVILA has SR 0.39 and CSR 0.48; NaVILA-SAGE has SR 0.46 and CSR 0.57). This indicates that CSR is a more inclusive metric, crediting "goal-consistent" behavior even if the agent does not stop exactly at the target or deviates slightly from the exact path.
- Revealing Flaws in "Successful" Navigation: While NaVILA achieves a relatively high SR (0.39) and OSR (0.47), its ICP of 0.61 and PS of 0.68 expose significant issues with natural motion continuity.
  - An ICP of 0.61 (where 0 is no collision and 1 is constant collision) suggests frequent and sustained collisions, implying the agent scrapes walls or obstacles for prolonged periods, which is undesirable for real-world robots.
  - A PS of 0.68 (where 1 is perfectly smooth) indicates large, mechanical turning angles rather than smooth, fluid motion, affecting real-robot feasibility and stable planning.
- Visualization Case Study: Figure 4 corroborates these findings, showing that NaVILA's blue trajectory often exhibits unsmooth movements and persistent collisions (e.g., hugging a wall for a long period in Case 1; although CR registers only a single collision for the episode, ICP reaches 0.87), which conventional metrics like CR fail to capture. The following figure (Figure 4 from the original paper) shows the visualization case study of navigation natural continuity:
This figure is a schematic showing cases of natural continuity during navigation. The red trajectory is the ground-truth path, and the blue trajectory is NaVILA's navigation path. The three cases show different navigation situations, annotated with the corresponding metrics.
This demonstrates that CSR, ICP, and PS effectively fill gaps in conventional VLN evaluation, providing a more comprehensive understanding of agent behavior in continuous environments.
6.1.5. High-level Instructions vs. Low-level Instructions
The following are the results from Table 5 of the original paper:
| Methods | Instruction Level | SR ↑ | OSR ↑ | SPL ↑ | CSR ↑ | ICP ↓ | PS ↑ |
|---|---|---|---|---|---|---|---|
| GPT-4.1 | Low-level | 0.22 | 0.37 | 0.19 | 0.27 | 0.60 | 0.70 |
| High-level | 0.13 | 0.21 | 0.12 | 0.19 | 0.35 | 0.81 | |
| InternVL-3-8B | Low-level | 0.20 | 0.35 | 0.18 | 0.26 | 0.61 | 0.69 |
| High-level | 0.12 | 0.20 | 0.11 | 0.17 | 0.32 | 0.82 | |
| NaVid | Low-level | 0.24 | 0.42 | 0.21 | 0.34 | 0.63 | 0.64 |
| High-level | 0.15 | 0.17 | 0.15 | 0.29 | 0.33 | 0.89 | |
| NaVILA | Low-level | 0.56 | 0.66 | 0.50 | 0.58 | 0.48 | 0.75 |
| High-level | 0.39 | 0.47 | 0.34 | 0.48 | 0.61 | 0.68 | |
The comparison between high-level and low-level instructions on SAGE-Bench reveals that VLN models perform significantly worse on high-level instructions.
- For NaVILA, SR drops from 0.56 on low-level instructions to 0.39 on high-level instructions. Similar drops are observed for GPT-4.1, InternVL-3-8B, and NaVid.
- High-level instructions are semantically richer and closer to real-life scenarios, demanding deeper understanding of context, object attributes, and spatial relationships (e.g., task causality), whereas low-level instructions are basic step-by-step actions. The performance gap indicates that current models struggle with the more complex reasoning and grounding required by high-level semantics, highlighting a key challenge for future VLN development.
6.1.6. Impact of the Number of Scenes and Samples on Model Performance
The following are the results from Table 6 of the original paper:
| #Scenes | #Samples | SR ↑ | OSR ↑ | SPL ↑ | CSR ↑ | ICP ↓ | PS ↑ |
|---|---|---|---|---|---|---|---|
| 800 | 240k | 0.42 | 0.47 | 0.42 | 0.50 | 0.61 | 0.63 | |
| 800 | 120k | 0.40 | 0.43 | 0.40 | 0.48 | 0.62 | 0.62 | |
| 800 | 60k | 0.36 | 0.42 | 0.38 | 0.46 | 0.64 | 0.58 | |
| 400 | 120k | 0.34 | 0.39 | 0.35 | 0.44 | 0.67 | 0.54 | |
| 400 | 60k | 0.31 | 0.37 | 0.33 | 0.43 | 0.67 | 0.52 | |
| 400 | 30k | 0.28 | 0.35 | 0.31 | 0.43 | 0.69 | 0.49 | |
| 400 | 15k | 0.25 | 0.31 | 0.27 | 0.39 | 0.70 | 0.46 | |
| 200 | 60k | 0.27 | 0.33 | 0.29 | 0.41 | 0.70 | 0.47 | |
| 100 | 60k | 0.23 | 0.29 | 0.26 | 0.38 | 0.71 | 0.44 | |
| NaVILA-base | 0.21 | 0.26 | 0.22 | 0.36 | 0.72 | 0.41 | ||
The following figure (Figure 5 from the original paper) shows model performance change curve (number of scenes vs. sample size):
This figure is a chart showing model performance curves, with sample count on the x-axis and success rate (SR) on the y-axis. The red line varies the sample count with 400 scenes fixed; the blue line varies the scene count with 60,000 samples fixed. The overall trend shows SR rising as samples increase.
The analysis of varying number of scenes and sample sizes (using NaVILA-base on the val-unseen split) provides insights into data efficiency:
- Scene Diversity is More Critical: Increasing the number of scenes (environment diversity) yields greater performance gains than merely increasing the number of samples (data density) while keeping scenes constant. For example, comparing 400 scenes/120k samples (SR 0.34) with 800 scenes/60k samples (SR 0.36) shows that more scenes with fewer samples can outperform more samples drawn from fewer scenes.
- Optimal Strategy: Generating the same number of samples from a larger number of environments produces better results. This suggests that diversity of environments is more critical for robust VLN learning than simply increasing the number of trajectories within a limited set of environments.
6.1.7. Results under Different Evaluation Slice
The following figure (Figure 6 from the original paper) shows results under Different Evaluation Slice:
This figure is a chart comparing success rates across instruction types, trajectory lengths, and scene complexities. The data indicate that NaVid outperforms NaVILA on these slices; notably, for instruction type AC, its success rate reaches 0.41 versus 0.16 for NaVILA.
Further evaluation based on the three-axis evaluation framework (high-level instruction types, trajectory lengths, and scene complexities) reveals specific challenging areas:
- Instruction Types: VLN models perform worse on Relative Relationship and Attribute-based instruction types. For both NaVILA and NaVid, SR scores are more than 2% lower for these types than for the others, indicating that models struggle with fine-grained spatial reasoning and with identifying objects by detailed attributes.
- Trajectory Length and Scene Complexity: As trajectory length increases and scene complexity (asset density) grows, model performance drops significantly. This is expected: longer paths accumulate more errors, and more complex scenes increase visual clutter and the potential for navigation mistakes.
6.2. Data Presentation (Tables)
All relevant tables from the paper have been transcribed in the preceding sections.
6.3. Ablation Studies / Parameter Analysis
The paper primarily performs parameter analysis by varying the number of scenes and number of samples in the training data, as presented in Table 6 and Figure 5. This can be considered an ablation study on the data characteristics.
- Findings from Parameter Analysis: The results consistently show that increasing the diversity of environments (#Scenes) has a more pronounced positive impact on VLN performance than increasing the density of training samples (#Samples) within a fixed number of environments. For instance, going from 400 to 800 scenes at 60k samples improves SR from 0.31 to 0.36, whereas doubling samples from 60k to 120k within 400 scenes improves SR only from 0.31 to 0.34. This suggests that broader environmental exposure is key to generalizability and to learning robust navigation policies.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces SAGE-3D, a transformative paradigm that elevates 3D Gaussian Splatting (3DGS) from a mere perceptual scene representation to an executable, semantically and physically aligned environment foundation for embodied navigation. The authors' key contributions include:
- InteriorGS: the first large-scale dataset of 1,000 fully furnished indoor 3DGS reconstructions, enriched with dense object-level annotations. This dataset enables robust semantic grounding in photorealistic environments.
- SAGE-3D Framework: a novel framework that integrates Object-Centric Semantic Grounding and Physics-Aware Execution Jointing to endow 3DGS with fine-grained semantics and accurate physical simulation capabilities.
- SAGE-Bench: the first 3DGS-based VLN benchmark, featuring 2M instruction-trajectory pairs, precise per-object physical simulation, a hierarchical instruction scheme, and navigation natural continuity metrics.

Through extensive experiments, the paper provides crucial insights: 3DGS environments, while demanding more training effort, offer faster rendering and superior generalizability to unseen VLN environments, and the newly proposed continuity metrics effectively diagnose subtle flaws in navigation behavior that traditional metrics overlook. SAGE-3D thus establishes a coherent pipeline from high-fidelity data generation to physically valid evaluation, paving the way for more sophisticated embodied-AI research.
7.2. Limitations & Future Work
The authors suggest several avenues for future research:
- Richer Multi-step and Semantic-aware Navigation Tasks: the current benchmark, while introducing causal instructions, can be expanded to even more complex, multi-stage tasks requiring deeper semantic reasoning.
- Interactive Manipulation: leveraging the physics-aware execution layer, future work could explore tasks involving physical interaction with objects beyond simple navigation, such as picking up, moving, or operating items.
- Broader Sim-to-Real Studies: the 3DGS environment's high fidelity and generalizability make it an ideal platform for more extensive sim-to-real transfer research, exploring how policies trained in SAGE-3D perform on real robots.

While not explicitly stated as limitations in a dedicated section, the paper's results imply current model limitations:

- Convergence Difficulty: 3DGS data, despite its advantages, makes models harder to converge, suggesting a need for more efficient training strategies or model architectures capable of handling its richness.
- Challenges with Natural Continuity: even SOTA models trained on SAGE-Bench still exhibit suboptimal ICP and PS scores, indicating that genuinely smooth, collision-free motion remains a significant challenge.
- High-level Instruction Understanding: models struggle significantly with high-level instructions compared to low-level ones, pointing to a gap in semantic and causal reasoning.
7.3. Personal Insights & Critique
This paper represents a significant step forward for embodied AI and VLN by leveraging the power of 3DGS.
Inspirations and Applications:
- Bridging the Realism Gap: The integration of 3DGS for visual fidelity with mesh-based collision bodies for physics is an elegant solution to the sim-to-real gap. This hybrid approach could transfer to other robotics-simulation domains (e.g., manipulation, human-robot interaction) where both photorealism and accurate physics are crucial.
- Value of Richer Semantics: The emphasis on object-level annotations and hierarchical instructions highlights the importance of moving beyond purely geometric navigation to truly intelligent, human-like VLN. The focus on causal instructions is particularly inspiring, as it reflects how humans naturally communicate goals.
- Novel Evaluation Metrics: The introduction of CSR, ICP, and PS is a critical contribution. Traditional SR and SPL often mask undesirable behaviors such as jerky movements or constant scraping. These new metrics provide a much-needed, nuanced assessment of natural continuity, which is paramount for real-world robot deployment, and could inspire similarly refined metrics in other embodied-AI tasks.
- Data-centric AI: The finding that scene diversity matters more than sample density for generalizability reinforces a key principle of data-centric AI: creating more varied environments, even with fewer trajectories per environment, is more beneficial for learning robust policies.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Computational Cost of Data Generation: While 3DGS offers faster rendering, the initial process of generating InteriorGS (artist-created meshes, 3,000 camera views per scene, manual annotation, convex decomposition) appears computationally intensive. The paper does not detail the cost (time, human effort, compute) of creating InteriorGS and SAGE-Bench beyond training convergence time, which could be a barrier for other researchers attempting to extend the dataset.
- Dynamic Objects and Interactions: The current physics-aware execution jointing configures static-scene objects as static bodies and only a "curated subset" as movable or articulated. For more complex interactive tasks, natively handling dynamic objects (e.g., objects moved by the agent or other entities) and complex articulated objects (e.g., doors, drawers with varying degrees of freedom) within the 3DGS-Mesh Hybrid Representation could become challenging, especially maintaining semantic consistency during state changes.
- Generalizability to Outdoor/Open-World Environments: While 3DGS excels at capturing indoor scenes, its applicability to large-scale outdoor or open-world environments (where scene structure is less constrained and lighting conditions are more varied) needs further investigation.
- Agent Embodiment Specificity: The robot APIs support various platforms (legged, wheeled, aerial). It would be interesting to see whether the continuity metrics and VLN performance vary significantly across embodiments, given their distinct kinematics and interaction capabilities.
- Explanation of c(t) in ICP: The paper defines c(t) as a collision intensity sequence but does not explicitly state its calculation method (e.g., binary, proportional to penetration depth, or duration within a timestep). A clearer definition would improve the reproducibility and interpretability of ICP.

Overall, SAGE-3D provides a robust and forward-looking foundation for embodied navigation, pushing the boundaries of what 3DGS can achieve in complex AI tasks. Its contributions in data, methodology, and evaluation metrics are likely to inspire substantial future research in the field.