Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains
TL;DR Summary
Gallant is a voxel-grid-based framework for humanoid locomotion and navigation that utilizes voxelized LiDAR data for accurate 3D perception, achieving near 100% success in challenging terrains through end-to-end optimization.
Abstract
Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant is also the first to achieve near-100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms, thanks to improved end-to-end optimization.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains". The central topic is the development of a novel framework, named Gallant, that enables humanoid robots to move and navigate effectively in complex three-dimensional environments by using a specialized perception system based on voxel grids derived from LiDAR data.
1.2. Authors
The authors are Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, and Jiangmiao Pang.
- Qingwei Ben, Botian Xu, and Kailin Li are marked with an asterisk (*), indicating equal contribution.
- Jiangmiao Pang is the Corresponding Author.
- Affiliations include: Shanghai Artificial Intelligence Laboratory (1), The Chinese University of Hong Kong (2), University of Science and Technology of China (3), University of Tokyo (4), and Shanghai Jiaotong University (5).
1.3. Journal/Conference
The paper is published as a preprint on arXiv, with the identifier arXiv:2511.14625. As an arXiv preprint, it has not yet been peer-reviewed or formally published in a journal or conference. arXiv is a widely recognized open-access repository for scientific preprints in various disciplines, including computer science and robotics, allowing for rapid dissemination of research findings.
1.4. Publication Year
The paper was submitted to arXiv at 2025-11-18T16:16:31 UTC, i.e., the publication year is 2025.
1.5. Abstract
The paper addresses the challenge of robust humanoid locomotion in complex 3D environments, which necessitates accurate and globally consistent perception. Traditional perception methods, such as depth images or elevation maps, are limited by partial and flattened views, failing to capture the full 3D scene structure. To overcome this, the paper introduces Gallant, a voxel-grid-based framework designed for humanoid locomotion and local navigation in 3D constrained terrains.
Gallant utilizes voxelized LiDAR data as a lightweight and structured perceptual representation. This representation is then processed by a z-grouped 2D Convolutional Neural Network (CNN) to map the visual information to the control policy, enabling fully end-to-end optimization. A key component is a high-fidelity LiDAR simulation that dynamically generates realistic observations, supporting scalable, LiDAR-based training and ensuring sim-to-real consistency.
Experimental results demonstrate that Gallant's broader perceptual coverage allows a single policy to handle diverse 3D constraints beyond just ground-level obstacles, including lateral clutter, overhead constraints, multi-level structures, and narrow passages. The framework also achieves near 100% success rates in challenging scenarios like stair climbing and stepping onto elevated platforms, attributed to its improved end-to-end optimization.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2511.14625, and the PDF is available at https://arxiv.org/pdf/2511.14625v1.pdf. The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling robust humanoid locomotion and local navigation in complex, 3D constrained environments. Humanoid robots, unlike wheeled or tracked vehicles, possess highly dexterous limbs, allowing them to traverse diverse and irregular terrains. However, this capability is heavily reliant on an accurate and comprehensive understanding of the surrounding environment.
This problem is crucial in the current field because, despite significant advancements in humanoid robotics, ensuring operational safety and adaptability in unstructured real-world settings remains a major challenge. Robots need to move beyond traversing simple flat surfaces; they must handle terrain irregularities, ground-level obstacles, lateral clutter (obstacles to the side), and overhead constraints (obstacles above, like low ceilings). This requires anticipatory collision checking, clearance-aware motion generation, and intelligent planning of contact-rich maneuvers.
The specific challenges or gaps in prior research primarily stem from limitations in existing perception modules:
- Depth Images: While offering lower latency, depth cameras typically have a narrow Field of View (FoV) and limited range, which restricts a robot's ability to reason about complex, spatially extended environments.
- Elevation Maps: These approaches compress 3D LiDAR point clouds into 2.5D height fields. This projection effectively flattens the environment, discarding crucial vertical and multilayer structure (e.g., overhangs, low ceilings, mezzanines, stair undersides). Moreover, the reconstruction stage can introduce algorithm-specific distortions and latency, further decoupling perception from control.
- Raw Point Clouds: While 3D LiDAR provides detailed scene geometry with a wide FoV, its raw point clouds are often sparse and noisy, making them difficult for sample-efficient policy learning and real-time inference.

The paper's entry point or innovative idea is to use a voxel-grid-based representation of LiDAR data. This approach aims to preserve the full 3D structure of the environment, overcome the FoV and flattening limitations of previous methods, and provide a lightweight and structured input suitable for end-to-end policy learning with a z-grouped 2D CNN.
2.2. Main Contributions / Findings
The paper presents Gallant as a significant step forward in humanoid locomotion and local navigation. Its primary contributions and key findings are:
- Voxel Grid as a Geometry-Preserving Representation: Gallant proposes and verifies the use of a voxel grid derived from LiDAR data as a lightweight yet geometry-preserving perceptual representation. This representation captures the full 3D structure (including multi-layer information and vertical patterns) over a large Field of View (FoV), unlike depth images or elevation maps. This directly addresses the limitation of previous methods that provided only partial or locally flattened views.
- Efficient Voxel Grid Processing with z-grouped 2D CNN: The paper introduces and validates a z-grouped 2D Convolutional Neural Network (CNN) for processing the sparse voxel grids. This architectural choice treats height slices (the z-dimension) as channels and applies 2D convolutions over the x-y plane. This design offers a favorable trade-off between representation capacity and computational efficiency compared to heavier 3D CNNs or less suitable sparse CNNs, making it practical for real-time onboard deployment.
- Full-Stack Pipeline for Sim-to-Real Transfer: The research develops a comprehensive full-stack pipeline, spanning from high-fidelity LiDAR sensor simulation to policy training. This pipeline includes realistic LiDAR simulation that dynamically models sensor noise, latency, and even scans dynamic objects (such as the robot's own body links). This rigorous simulation environment, combined with curriculum training across diverse 3D-constrained terrains, enables training a single policy that demonstrates zero-shot sim-to-real transfer and generalizes robustly across various real-world 3D-constrained environments.
- Enhanced Locomotion Capabilities: Gallant significantly expands the range of navigable terrains for humanoids. Its broader perceptual coverage allows the single trained policy to handle not only conventional ground-level obstacles but also complex scenarios involving lateral clutter (e.g., Forest, Door), overhead constraints (e.g., Ceiling), multi-level structures, and narrow passages. This goes beyond the limitations of previous methods often confined to simpler terrains.
- Achieving Near 100% Success Rates in Challenging Scenarios: The framework achieves near 100% success rates in tasks previously considered very challenging and unstable for humanoids, such as stair climbing (Upstair, Downstair) and stepping onto elevated platforms (Platform). This improvement is attributed to the enhanced end-to-end optimization facilitated by the robust 3D perception.

These findings collectively solve the problem of limited 3D perception and poor generalization in humanoid locomotion by providing a practical, robust, and generalizable solution for navigating complex 3D environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Gallant paper, a beginner should be familiar with several foundational concepts in robotics, machine learning, and computer vision:
- Humanoid Locomotion: This refers to the ability of humanoid robots (robots designed to resemble the human body) to move and balance. Unlike wheeled robots, humanoids must contend with complex dynamics, balance control, and intermittent foot contacts with the ground. Locomotion involves generating appropriate joint trajectories and forces to achieve desired movements like walking, running, or climbing.
- Local Navigation: This is the process by which a robot plans and executes collision-free paths within its immediate environment to reach a local target or traverse an obstacle. It typically works in conjunction with a higher-level global planner. Local navigation for legged robots often involves adjusting foot placement, body posture, and step timing based on perceived terrain.
- 3D Constrained Terrains: These are environments where movement is restricted not just by ground-level obstacles but also by vertical structures. Examples include low ceilings, narrow passages, stairs, platforms, and uneven surfaces that require specific body clearances or multi-level interaction.
- Perception Modules: These are the sensing and data processing components that allow a robot to "see" and interpret its surroundings.
  - Depth Images: These are images where each pixel value represents the distance from the camera to the corresponding point in the scene. Depth cameras (e.g., Intel RealSense, Microsoft Kinect) are common sensors for generating these.
  - LiDAR (Light Detection and Ranging): A remote sensing method that uses pulsed laser light to measure distances. A LiDAR scanner emits laser pulses and measures the time it takes for the light to return, generating a point cloud – a collection of 3D data points representing the surfaces of objects in the environment. 3D LiDAR systems can provide a wide Field of View (FoV) and high-fidelity 3D geometry.
  - Elevation Maps (or Height Maps): A 2.5D representation of terrain. It's a grid (typically 2D) where each cell stores the height of the terrain at that (x, y) location. The "2.5D" implies that it captures height information but cannot represent overhangs or multi-layer structures (where multiple values exist for a single (x, y) coordinate).
  - Voxel Grid: A 3D grid made up of voxels (volumetric pixels). Similar to how a pixel is a 2D square, a voxel is a 3D cube. In the context of perception, a voxel grid can represent the occupancy of space, where each voxel can be marked as occupied (e.g., by an obstacle) or free. This representation naturally captures full 3D structure.
- Convolutional Neural Networks (CNNs): A class of deep learning models particularly effective for processing grid-like data, such as images.
- 2D CNN: Applies convolutional filters across two spatial dimensions (e.g., width and height in an image).
- 3D CNN: Extends 2D CNNs by applying convolutional filters across three spatial dimensions (e.g., width, height, and depth/time), suitable for volumetric data or video.
  - Sparse CNN: Optimized for processing data where most values are zero (i.e., sparse data). Voxel grids can be very sparse if only a small fraction of voxels are occupied.
- End-to-End Optimization/Learning: A machine learning paradigm where a single model or system learns to map raw input data directly to the desired output (e.g., sensor readings directly to motor commands), bypassing intermediate, hand-engineered processing stages. This often involves training the entire system jointly using techniques like Reinforcement Learning.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
  - Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling decision-making where the agent does not have full access to the state of the environment but instead relies on observations that are probabilistically related to the state. This is highly relevant for robots whose sensors provide incomplete information.
  - Actor-Critic: A class of RL algorithms that combines two components: an Actor (which learns a policy for taking actions) and a Critic (which learns to estimate the value function, or expected future reward, of the states or state-action pairs). The Critic guides the Actor's learning.
  - Proximal Policy Optimization (PPO): A popular Actor-Critic Reinforcement Learning algorithm known for its stability and good performance. It aims to take the largest possible improvement step on the policy without causing the policy to collapse.
- Sim-to-Real Transfer: The process of training a robot control policy or perception model entirely in a simulated environment and then deploying it successfully on physical hardware (the "real world") without significant retraining or adaptation. This is highly desirable for safety, scalability, and cost-effectiveness.
- Domain Randomization: A technique used in sim-to-real transfer to improve the robustness and generalization of models trained in simulation. It involves randomizing various properties of the simulation environment (e.g., textures, lighting, sensor noise, physics parameters) to force the learned policy to ignore specifics of the simulation and focus on general features, making it more adaptable to the variations found in the real world.
3.2. Previous Works
The paper contextualizes its contributions by discussing prior research in humanoid perceptive locomotion and local navigation.
Humanoid Perceptive Locomotion
Previous approaches for humanoid perceptive locomotion have primarily relied on:
- Elevation Maps: Methods like those by Long et al. [21], Ren et al. [30], and Wang et al. [38] utilize elevation maps (also known as 2.5D height fields).
  - Concept: An elevation map discretizes the ground plane into a grid and stores the height of the highest point within each cell. This is often reconstructed from LiDAR data [10, 11].
  - Limitations (as highlighted by Gallant): While effective for reasoning about ground-level obstacles, elevation maps inherently flatten the scene, discarding information about vertical and multilayer structures such as overhangs, low ceilings, or mezzanines. They also introduce reconstruction latency and can suffer from algorithm-specific distortions.
- Depth Cameras: Approaches leveraging depth cameras (e.g., Zhuang et al. [49], Sun et al. [34]) have shown effectiveness, particularly on quadruped robots [1, 7, 18, 22, 34, 48, 49].
  - Concept: Depth cameras provide direct depth measurements for a scene, often at high frame rates.
  - Limitations (as highlighted by Gallant): Their narrow field of view (FoV) and limited spatial continuity (especially in range) restrict 3D understanding, hindering policy generalization in diverse environments.
- Point Clouds: Some recent works, benefiting from advances in LiDAR simulation, have explored point-cloud-based inputs [15, 39].
  - Concept: Directly using the raw 3D point cloud generated by LiDAR sensors.
  - Limitations (as highlighted by Gallant): While addressing some FoV limitations, raw point clouds are typically sparse and noisy, and their high processing cost makes real-time onboard use infeasible for control tasks.
Local Navigation
Local navigation strategies for legged robots often adopt a hierarchical design:
- Hierarchical Design: A high-level planner provides velocity commands or target waypoints, which a low-level policy then tracks [4, 6, 12, 15, 20, 29, 40, 45, 46].
  - Limitations (as highlighted by Gallant): This decoupling limits the low-level policy's ability to exploit terrain features for more agile movements. Tracking errors and slow high-level updates can further degrade performance.
- End-to-End Training with Obstacle Avoidance: Recent work has explored end-to-end training by incorporating obstacle-avoidance rewards into velocity tracking objectives [30].
  - Limitations (as highlighted by Gallant): This approach can create conflicting objectives, potentially hindering optimal performance.
- Position-Based Formulation: Using target positions instead of velocity commands allows the policy to reason more directly about terrain and choose appropriate actions [14, 31, 44]. However, this approach has primarily been tested on quadrupeds [31] and remains largely untested on humanoids. Gallant adopts this position-based formulation.
3.3. Technological Evolution
The evolution of robotic locomotion perception has moved from simpler, often 2D or 2.5D representations, towards more comprehensive 3D scene understanding. Initially, tasks like navigating flat ground or simple obstacles could rely on depth images or elevation maps. However, as robots were tasked with more complex environments, the limitations of these representations became apparent. The ability to perceive overhangs, multi-level structures, and lateral clearances became critical.
LiDAR technology offered the promise of rich 3D data but presented challenges with raw point cloud processing (sparsity, noise, computational cost). This led to intermediate representations like elevation maps (a compromise for efficiency) and eventually to more structured 3D representations like voxel grids, which aggregate points to reduce noise and dimensionality while preserving 3D information.
Concurrently, advances in deep learning, particularly Convolutional Neural Networks (CNNs), provided powerful tools for processing these visual and spatial data. The challenge then became how to efficiently apply these to sparse 3D data in real-time for robot control. The development of high-fidelity simulators and domain randomization techniques has been crucial in bridging the sim-to-real gap, allowing complex Reinforcement Learning policies to be trained at scale.
This paper's work (Gallant) fits into this timeline by pushing the boundaries of 3D perception for humanoids using voxel grids and efficient z-grouped 2D CNNs, coupled with advanced LiDAR simulation for robust sim-to-real transfer.
3.4. Differentiation Analysis
Compared to the main methods in related work, Gallant offers several core differences and innovations:
- Perceptual Representation:
  - Previous: Predominantly elevation maps (2.5D, which flatten the scene) or depth images (narrow FoV, limited range). Some explore raw point clouds (high processing cost).
  - Gallant: Uses a voxel grid derived from LiDAR point clouds. This is a lightweight, structured, and geometry-preserving 3D representation that explicitly captures multi-layer structure, vertical patterns, and a wide FoV, directly addressing the limitations of 2.5D and narrow-FoV approaches.
- Perceptual Processing:
  - Previous: Often relies on standard 2D CNNs for depth images or processing of elevation maps. 3D CNNs exist but are computationally heavy for sparse 3D data.
  - Gallant: Employs a novel z-grouped 2D CNN. This treats the z-axis (height) as channels for 2D convolutions over the x-y plane. The design leverages the sparsity typical of voxel grids (where most occupancy is concentrated in a few z-slices) to achieve high computational efficiency and real-time inference while still capturing vertical structure through channel mixing. It is more efficient than 3D CNNs and better suited than generic sparse CNNs for the specific egocentric voxel grid structure.
- Scope of Locomotion:
  - Previous: Policies typically confined to ground-level obstacles due to limitations of elevation maps (e.g., [21, 38]). Depth-based methods may handle more, but still suffer from FoV/range issues.
  - Gallant: A single policy is capable of handling a much broader range of 3D constraints, including lateral clutter, overhead constraints, multi-level structures, and narrow passages, in addition to ground-level obstacles. This represents a significant increase in the generalization capability of the locomotion policy.
- Sim-to-Real Robustness:
  - Previous: While LiDAR simulation has advanced, fully accounting for real-world complexities such as dynamic objects (e.g., the robot's own body), sensor noise, and latency within Reinforcement Learning pipelines has been challenging.
  - Gallant: Develops a high-fidelity LiDAR simulation pipeline that explicitly models dynamic objects (self-scan), sensor noise, and latency. This, combined with domain randomization, curriculum training, and privileged information for the critic, ensures strong sim-to-real consistency and zero-shot transfer to diverse real-world terrains, outperforming baselines that omit these details.
- End-to-End Optimization:
  - Previous: Often hierarchical, with perception decoupled from control, or end-to-end but with conflicting objectives (e.g., velocity tracking combined with obstacle avoidance).
  - Gallant: Integrates local navigation and locomotion into a single end-to-end policy using a position-based goal-reaching formulation. This allows the policy to reason directly over terrain and choose appropriate actions, leading to higher success rates in challenging contact-rich maneuvers.

The following are the results from Table 1 of the original paper:
| Method | Perceptual Representation | FoV | Ground | Lateral | Overheading |
| Long et al. [21] | Elevation Map | ~ 1.97π | ✓ | × | × |
| Wang et al. [38] | Elevation Map | ~ 1.97π | ✓ | × | × |
| Ren et al. [30] | Elevation Map | ~ 1.97π | ✓ | × | × |
| Zhuang et al. [49] | Depth Image | ~ 0.43π | ✓ | ✓ | × |
| Wang et al. [39] | Point Cloud | ~ 1.97π | ✓ | ✓ | ✓ |
| Gallant (ours) | Voxel Grid | ~ 4.00π | ✓ | ✓ | ✓ |
This table clearly highlights Gallant's superiority in Field of View (FoV) and its comprehensive ability to handle Ground, Lateral, and Overheading obstacles, which distinguishes it from prior methods primarily limited by their perceptual representation or FoV.
4. Methodology
The Gallant framework is a voxel-grid-based perceptive learning framework specifically designed for humanoid locomotion and local navigation in 3D constrained environments. It integrates several key components: a specialized LiDAR simulation pipeline, an efficient 2D CNN perception module for sparse voxel grids, and a structured curriculum training approach with diverse terrain families. These components together form a full-stack pipeline that enables training a single policy capable of robustly traversing all-space obstacles and deploying with zero-shot transfer on real hardware. The overall system architecture is depicted in Figure 2 of the original paper.
The following figure (Figure 2 from the original paper) illustrates the Gallant framework, showing the process of obtaining a Voxel Grid, comparisons between simulation and the real world, data processing, and its projection to a 2D CNN, along with subsequent policy optimization components:
This figure is a schematic of the Gallant framework, showing how the Voxel Grid is obtained, including the comparison between simulation and the real world, the data processing, and its projection into the 2D CNN, as well as the subsequent policy optimization components.
4.1. Problem Formulation
The problem of humanoid perceptive locomotion is formulated as a Partially Observable Markov Decision Process (POMDP). A POMDP is a mathematical framework for sequential decision-making where the agent's actions influence the state of the environment, and the agent receives observations that are related to the state but do not fully reveal it. The POMDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma)$:

- $\mathcal{S}$: Set of possible states of the environment.
- $\mathcal{A}$: Set of possible actions the robot can take.
- $\mathcal{O}$: Set of possible observations the robot receives.
- $T$: Transition function $T(s' \mid s, a)$, which specifies the probability of transitioning to state $s'$ given the current state $s$ and action $a$.
- $R$: Reward function R(s, a), which specifies the immediate reward received after taking action $a$ in state $s$.
- $\Omega$: Observation function $\Omega(o \mid s', a)$, which specifies the probability of observing $o$ after taking action $a$ and landing in state $s'$.
- $\gamma$: Discount factor, a value between 0 and 1 that discounts future rewards.

An actor-critic policy is trained using Proximal Policy Optimization (PPO) [32], a popular Reinforcement Learning (RL) algorithm known for its stability and effectiveness. The training environment consists of terrain blocks. In each episode, the humanoid starts at the center of a block, and a goal is sampled along the perimeter. The robot has a fixed horizon of 10 seconds to reach this goal.
The observation at time $t$, denoted as $o_t$, is composed of several elements:
$
o_t = \big( \underbrace{\mathbf{P}_t,\ T_{\mathrm{elapse},t},\ T_{\mathrm{left},t}}_{\mathrm{Command}},\ \underbrace{a_{t-4:t-1}}_{\mathrm{Action\ history}},\ \underbrace{\omega_{t-5:t},\ g_{t-5:t},\ q_{t-5:t},\ \dot{q}_{t-5:t}}_{\mathrm{Proprioception}},\ \underbrace{\mathrm{Voxel\_Grid}_t}_{\mathrm{Perception}},\ \underbrace{v_t,\ \mathrm{Height\_Map}_t}_{\mathrm{Privileged}} \big),
$
where:

- $\mathbf{P}_t$: The goal position relative to the robot's base, a vector indicating the direction and distance to the target.
- $T_{\mathrm{elapse},t}$: The elapsed time in the current episode.
- $T_{\mathrm{left},t}$: The remaining time until the episode timeout $T$, calculated as $T - T_{\mathrm{elapse},t}$.
- $a_{t-4:t-1}$: A history of actions taken by the policy in the previous 4 time steps, providing temporal context for the policy.
- $\omega_{t-5:t}$: The root angular velocity of the robot, sampled over the past 5 time steps.
- $g_{t-5:t}$: The gravitational vector projected into the robot's base frame, sampled over the past 5 time steps; it encodes the robot's orientation relative to gravity.
- $q_{t-5:t}$: Joint positions (angles) of the robot, sampled over the past 5 time steps.
- $\dot{q}_{t-5:t}$: Joint velocities (angular speeds) of the robot, sampled over the past 5 time steps.
- $\mathrm{Voxel\_Grid}_t$: The voxelized perception input at time $t$, representing the 3D environment geometry.
- $v_t$: The root linear velocity of the robot. This is considered privileged information.
- $\mathrm{Height\_Map}_t$: Relative heights of scanned points to the robot. This is also considered privileged information.

The subscript range $t-5:t$ indicates that temporal history from time step $t-5$ to $t$ (inclusive) is included in the observation. In the actor-critic framework, the actor (which determines actions) and the critic (which evaluates states) share all features except for privileged inputs. Privileged inputs are additional pieces of information that are available to the critic during training (to aid in learning a good value function) but are not available to the actor during deployment, as they might be difficult to obtain in the real world.
The reward function follows Ben et al. [3], but with velocity tracking rewards replaced by a goal-reaching reward [31]:
$
r_{\mathrm{reach}} = \frac{1}{1 + \Vert \mathbf{P}_t \Vert^2} \cdot \frac{\mathbb{1}(t > T - T_r)}{T_r}, \qquad T_r = 2\,\mathrm{s}
$
where:

- $\Vert \mathbf{P}_t \Vert^2$: The squared Euclidean distance between the robot's current position and the goal position; the term $\frac{1}{1 + \Vert \mathbf{P}_t \Vert^2}$ yields a higher reward when the robot is closer to the goal.
- $\mathbb{1}(t > T - T_r)$: An indicator function that is 1 if the current time $t$ is greater than the episode timeout $T$ minus the reward window $T_r$, and 0 otherwise. This means the goal-reaching reward is only given near the end of the episode, incentivizing the robot to reach the goal within the time limit.
- $T$: The total episode timeout (10 seconds).
- $T_r$: The reward window, set to 2 seconds, so the goal-reaching reward is active only during the last 2 seconds of the episode.

The objective of the Reinforcement Learning agent is to maximize the expected cumulative discounted reward:
$
J(\pi) = \mathbb{E}\left[ \sum_{t=0}^{H-1} \gamma^t r_t \right]
$
where:

- $J(\pi)$: The expected return (total discounted reward) for a given policy $\pi$.
- $\mathbb{E}$: Expected value.
- $H$: The episode horizon (maximum number of time steps).
- $\gamma$: The discount factor (0.99, as detailed in the appendix), which balances immediate and future rewards.
- $r_t$: The reward received at time step $t$.
Episodes terminate if the robot falls, experiences a harsh collision, or reaches the timeout of 10 seconds.
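To make the goal-reaching term concrete, here is a minimal Python sketch; the function and argument names (`reach_reward`, `goal_rel`) are illustrative and not taken from the paper's code.

```python
import numpy as np

def reach_reward(goal_rel: np.ndarray, t: float, T: float = 10.0, T_r: float = 2.0) -> float:
    """Goal-reaching reward: active only in the last T_r seconds of the episode.

    goal_rel : vector from the robot base to the goal (P_t).
    t        : elapsed episode time in seconds.
    """
    proximity = 1.0 / (1.0 + float(np.dot(goal_rel, goal_rel)))  # 1 / (1 + ||P_t||^2)
    in_window = 1.0 if t > (T - T_r) else 0.0                    # indicator 1(t > T - T_r)
    return proximity * in_window / T_r

# Example: 0.5 m from the goal at t = 9 s -> the reward is paid out during the final window.
print(reach_reward(np.array([0.5, 0.0]), t=9.0))
```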
4.2. Efficient LiDAR Simulation
Most GPU-based simulators (e.g., IsaacGym, IsaacSim) often lack native support for efficient LiDAR simulation or are limited to scanning only static meshes. However, realistic simulation for dynamic environments requires accounting for all relevant geometry, including both static components (e.g., terrain, walls) and dynamic components (e.g., the robot's own moving links). To address this, Gallant implements a lightweight, efficient raycast-voxelization pipeline using NVIDIA Warp [24].
The core idea is to handle raycasting efficiently in dynamic scenes without rebuilding complex data structures every step.

- Precomputation: A Bounding Volume Hierarchy (BVH) is precomputed for each mesh (e.g., each link of the robot, or each static obstacle) in its local (body) frame. A BVH is a tree structure over a set of geometric objects, used to efficiently test for collisions or ray intersections. Precomputing it in the local frame means it only needs to be built once per mesh, not for every pose the mesh takes.
- Dynamic Raycasting: During simulation, when a ray is to be cast:
  - The ray's origin and direction are transformed into the local frame of the target mesh. This involves applying the inverse of the mesh's transformation matrix to the origin and the inverse of its rotational component to the direction.
  - The raycasting function is then performed in the local frame of the mesh, which remains static relative to its BVH.
  - The result (intersection point) is transformed back to the world frame.

This process is formalized by the raycasting function:
$
\mathrm{raycast}(TM, \mathbf{p}, \mathbf{d}) = T\,\mathrm{raycast}(M, T^{-1}\mathbf{p}, R^{-1}\mathbf{d})
$
where:

- $\mathrm{raycast}(TM, \mathbf{p}, \mathbf{d})$: The raycasting operation against a mesh that has been transformed by $T$, with ray origin $\mathbf{p}$ and direction $\mathbf{d}$ in the world frame.
- $T$: The full transformation matrix (translation and rotation) of the mesh from its local frame to the world frame.
- $M$: The mesh in its local frame.
- $\mathbf{p}$: The ray origin in the world frame.
- $\mathbf{d}$: The ray direction in the world frame.
- $R$: The rotational component of the transformation matrix $T$.
- $T^{-1}$: The inverse of the full transformation matrix, used to transform a point from the world frame to the local frame.
- $R^{-1}$: The inverse of the rotational component, used to transform a direction from the world frame to the local frame.

At each simulation step, ray-mesh intersections are computed for every mesh using its current transform $T$. This entire computation is parallelized using a Warp kernel (NVIDIA's high-performance Python framework for GPU simulation) launched over (environments × meshes × rays), so that many environments, meshes, and rays are processed concurrently.
Rays are emitted from the LiDAR origin in directions defined as:
$
O_{\mathrm{ray}_i} = O_{\mathrm{LiDAR}} + O_{\mathrm{ray}_i,\mathrm{offset}}
$
where:

- $O_{\mathrm{ray}_i}$: The direction of the $i$-th ray in the world frame.
- $O_{\mathrm{LiDAR}}$: The orientation of the LiDAR sensor in the world frame.
- $O_{\mathrm{ray}_i,\mathrm{offset}}$: The direction offset of the $i$-th ray relative to the LiDAR's orientation, which defines the LiDAR's specific scanning pattern.

If $P_i$ is the hit position of the $i$-th ray, the resulting point cloud at time $t$ is formed by the union of all such hit points:
$
\mathcal{P}_t = \bigcup_{i=1}^{N_{\mathrm{rays}}} \{ P_i \}
$
where $N_{\mathrm{rays}}$ is the total number of rays emitted. This point cloud is then converted into a voxel grid.
To align the simulation with real-world sensing and improve sim-to-real transferability, domain randomization is applied:

- (a) LiDAR Pose: The LiDAR's pose (position and orientation) is perturbed at the beginning of each episode.
  - Position: the LiDAR position is offset by a random value $\epsilon \sim \mathcal{N}(0, 1)$ cm, i.e., a perturbation with roughly 1 cm standard deviation.
  - Orientation: each ray direction is perturbed by a random value $\epsilon \sim \mathcal{N}(0, \pi/180)$ rad (a standard deviation of 1 degree), which perturbs the LiDAR's overall orientation.
- (b) Hit Position: The calculated hit position of each ray is randomized as $P'_i = P_i + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ cm, where $P_i$ is the original hit position and $P'_i$ the randomized one. This simulates sensor noise.
- (c) Latency: A sensor delay is simulated, mimicking the time lag between when a physical sensor captures data and when it becomes available to the robot's control system.
- (d) Missing Grid: 2% of voxels are randomly masked (set to 0) to model real-world dropout or occlusion effects, where certain parts of the environment might not be scanned.
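A hedged sketch of how the hit-position noise and voxel dropout might be applied to one simulated frame; the function name and the latency comment are illustrative readings of the description above, not the authors' implementation.

```python
import numpy as np

def randomize_lidar(hits_m: np.ndarray, voxel_grid: np.ndarray,
                    rng: np.random.Generator):
    """Apply hit-position noise and voxel dropout to one simulated LiDAR frame.

    hits_m     : (N, 3) ray hit positions in meters.
    voxel_grid : binary occupancy tensor built from (possibly delayed) hits.
    """
    # (b) Gaussian hit-position noise with ~1 cm standard deviation.
    hits_noisy = hits_m + rng.normal(0.0, 0.01, size=hits_m.shape)
    # (d) randomly drop ~2% of voxels to mimic dropout / occlusion.
    dropout = rng.random(voxel_grid.shape) < 0.02
    voxel_grid = np.where(dropout, 0, voxel_grid)
    return hits_noisy, voxel_grid

# (c) latency can be emulated by feeding the policy the voxel grid from a previous
# control step, e.g. by keeping a small FIFO buffer of past grids.
```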
4.3. Voxel Representation and 2D CNN Perception
The LiDAR point clouds are converted into a fixed-size, robot-centric voxel grid.

- Sensor Setup: Two torso-mounted LiDARs (Hesai JT128) are used, one on the front chest and one on the back. Their returns are transformed into a unified torso frame (a coordinate system fixed to the robot's torso).
- Perception Volume: The perception volume is defined as a cuboid region around the robot, which determines the spatial extent that the robot "sees."
- Discretization: This volume is discretized (divided into small cubes) at a fixed voxel resolution.
- Voxel Grid Dimensions: This results in a voxel grid of 32 × 32 × 40 cells along the x, y, and z axes respectively (cf. the observation dimensions in Table 5).
- Occupancy: Each voxel in this grid is set to 1 if at least one LiDAR point falls within its volume (indicating occupancy), and 0 otherwise (indicating free space). This produces a binary occupancy tensor with Z = 40 height slices and a 32 × 32 spatial resolution in x-y.

The voxel grid is typically highly sparse and locally concentrated due to the nature of LiDAR and terrains. Most (x, y) columns might only contain one or two occupied z-slices, and large contiguous regions can be empty. This sparsity makes computationally expensive 3D convolutions over the full volume inefficient.
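A minimal sketch of the point-cloud-to-voxel conversion, assuming the 32 × 32 × 40 grid listed in Table 5; the cuboid bounds used here are illustrative placeholders, since the exact perception-volume extents and resolution are not reproduced in this summary.

```python
import numpy as np

def voxelize(points_torso: np.ndarray,
             lo=(-1.6, -1.6, -1.0), hi=(1.6, 1.6, 3.0),
             shape=(32, 32, 40)) -> np.ndarray:
    """Convert LiDAR points (in the torso frame) into a binary occupancy grid.

    points_torso : (N, 3) array of x, y, z coordinates in meters.
    lo, hi       : illustrative bounds of the perception cuboid.
    shape        : number of voxels along x, y, z (32 x 32 x 40).
    """
    lo, hi, shape = np.asarray(lo), np.asarray(hi), np.asarray(shape)
    cell = (hi - lo) / shape
    idx = np.floor((points_torso - lo) / cell).astype(int)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)   # drop points outside the volume
    grid = np.zeros(shape, dtype=np.uint8)
    ix, iy, iz = idx[inside].T
    grid[ix, iy, iz] = 1                                  # binary occupancy
    return grid
```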
To leverage this structure, Gallant introduces a z-grouped 2D Convolutional Neural Network (CNN):

- Z-as-Channel: Instead of treating the z-dimension as a spatial dimension for 3D convolution, it is treated as the channel dimension. This means the 40 z-slices become 40 input channels for a 2D CNN.
- 2D Convolution: 2D convolutions are then applied over the x-y plane. This design exploits spatial context within each x-y slice and uses channel mixing (inherent in 2D CNN operations with multiple input channels) to capture vertical structure across the z-slices.

The 2D convolution operation is formally expressed as:
$
Y_{o,v,u} = \sigma \left( \sum_{c=0}^{C-1} \sum_{\Delta v, \Delta u} \mathbf{W}_{o,c,\Delta v,\Delta u} \cdot X_{c, v+\Delta v, u+\Delta u} + b_o \right)
$
where:

- $Y_{o,v,u}$: The output value at output channel $o$ and spatial location (v, u).
- $\sigma$: A non-linearity (e.g., ReLU, or Mish, which is used in Gallant's policy network as described in the appendix).
- $C$: The number of input channels (40, corresponding to the z-slices).
- $o$: Index over output channels.
- v, u: Spatial coordinates in the output feature map.
- $\Delta v, \Delta u$: Offsets of the convolution kernel (weights) across the spatial dimensions.
- $\mathbf{W}_{o,c,\Delta v,\Delta u}$: The weight (filter coefficient) for output channel $o$, input channel $c$, and kernel offset $(\Delta v, \Delta u)$.
- $X_{c, v+\Delta v, u+\Delta u}$: The input value from input channel $c$ at the corresponding spatial location.
- $b_o$: A bias term for output channel $o$.

Compared to a 3D convolution with a kernel extending over $k_z$ height slices (e.g., a 3 × 3 × 3 kernel), this z-grouped 2D CNN design reduces computational and memory costs by roughly a factor of $k_z$ (e.g., a factor of 3 for a 3 × 3 × 3 kernel vs. a 3 × 3 2D kernel across 40 channels). It still captures vertical patterns critical for locomotion, makes efficient use of sparse, localized occupancy, and supports efficient parallel training and real-time inference.
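To illustrate the z-as-channel idea, a minimal PyTorch sketch follows; the layer count, kernel sizes, and widths are illustrative and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class ZGrouped2DCNN(nn.Module):
    """Treat the 40 z-slices of a 32x32x40 voxel grid as input channels of a 2D CNN."""

    def __init__(self, z_slices: int = 40, out_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(z_slices, 64, kernel_size=3, stride=2, padding=1),  # mixes z via channels
            nn.Mish(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.Mish(),
            nn.Conv2d(64, 32, kernel_size=3, stride=2, padding=1),
            nn.Mish(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(out_dim))

    def forward(self, voxel_grid: torch.Tensor) -> torch.Tensor:
        # voxel_grid: (batch, 32, 32, 40) binary occupancy -> (batch, 40, 32, 32)
        x = voxel_grid.permute(0, 3, 1, 2).float()
        return self.head(self.conv(x))

features = ZGrouped2DCNN()(torch.zeros(2, 32, 32, 40))   # -> shape (2, 64)
```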
4.4. Terrain Design
To train robust policies, Gallant uses 8 representative terrain types in simulation. Each terrain type is designed to challenge specific aspects of humanoid locomotion and perception:

- Plane: A flat, easy terrain for learning basic walking and initial stabilization.
- Ceiling: Features randomized height and density of overhead structures, requiring the robot to reason about overhead constraints and crouching behaviors.
- Forest: Composed of randomly spaced cylindrical pillars (trees), representing sparse lateral clutter that demands weaving and precise lateral navigation.
- Door: Presents narrow gaps (doorways) that require precise lateral clearance and fine motor control.
- Platform: Consists of high, ring-shaped structures with variable spacing and height, necessitating the recognition of stepable surfaces and inter-platform traversal.
- Pile: Introduces fine-grained support reasoning for safe foot placement on uneven, gapped surfaces.
- Upstair: Requires continuous adaptation to vertical elevation for climbing stairs.
- Downstair: Requires similar adaptation for descending stairs.
The following figure (Figure 3 from the original paper) shows the terrain types used to train robots in simulation:
This figure illustrates the terrain types used to train the robot in simulation, including Ceiling, Door, Pile, Downstairs, Plane, Forest, and Platform, helping the reader understand how the robot navigates these environments.
The paper adopts a curriculum-based training strategy where the difficulty of the terrain progressively increases. Each terrain type is parameterized by a scalar difficulty $s$. The terrain generation parameters are interpolated using a linear function:
$
\mathbf{p}_\tau(s) = (1 - s)\,\mathbf{p}_\tau^{\mathrm{min}} + s\,\mathbf{p}_\tau^{\mathrm{max}}
$
where:

- $\mathbf{p}_\tau(s)$: The vector of parameters defining terrain type $\tau$ at difficulty level $s$.
- $s$: The difficulty scalar, ranging from 0 (easiest) to 1 (hardest).
- $\mathbf{p}_\tau^{\mathrm{min}}$: The vector of parameters for the easiest setting of terrain type $\tau$.
- $\mathbf{p}_\tau^{\mathrm{max}}$: The vector of parameters for the hardest setting of terrain type $\tau$.

This formula allows for a smooth progression of difficulty. For example, for the Ceiling terrain, as $s$ increases, the ceiling height decreases and the number of ceilings increases, making it harder. For the Platform terrain, as $s$ increases, the height of the platforms increases and the gap width between them increases. The following are the parameters for generating curriculum training terrains from Table 2 of the original paper:
| Terrain Type τ | Term | Min (s=0) | Max (s=1) |
| Ceiling | Ceiling height (m) ↓ | 1.30 | 1.00 |
| | Number of ceilings (-) ↑ | 10 | 40 |
| Forest | Minimum distance between trees (m) ↓ | 2.0 | 1.0 |
| | Number of trees (-) ↑ | 3 | 32 |
| Door | Distance between two walls (m) ↓ | 2.00 | 1.00 |
| | Width of the doors (m) ↓ | 1.60 | 0.80 |
| Platform | Height of the platforms (m) ↑ | 0.05 | 0.35 |
| | Gap width between two platforms (m) ↑ | 0.20 | 0.50 |
| Pile | Distance between two cylinders (m) ↑ | 0.35 | 0.45 |
| Upstair | Height of each step (m) ↑ | 0.00 | 0.20 |
| | Width of each step (m) ↓ | 0.50 | 0.30 |
| Downstair | Height of each step (m) ↑ | 0.00 | 0.20 |
| | Width of each step (m) ↓ | 0.50 | 0.30 |
In each episode, a 10-second goal-reaching task is assigned. If the robot succeeds, it is promoted to harder settings (higher s); if it fails, it is demoted to easier settings (lower s). For the Pile terrain, a flat surface is overlaid during early training (low s) to allow the robot to learn basic foothold placement. As s increases, this plane is removed, and the robot trains on the fully gapped terrain.
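A small Python sketch of the linear difficulty interpolation and a simplified promote/demote rule; the Ceiling parameters are taken from Table 2, while the step size of the curriculum update is an assumption.

```python
def terrain_params(s: float, p_min: dict, p_max: dict) -> dict:
    """Linearly interpolate terrain parameters between the easiest (s=0) and hardest (s=1) settings."""
    return {k: (1.0 - s) * p_min[k] + s * p_max[k] for k in p_min}

# Ceiling terrain from Table 2: the ceiling height shrinks and the count grows as s increases.
ceiling_min = {"ceiling_height_m": 1.30, "num_ceilings": 10}
ceiling_max = {"ceiling_height_m": 1.00, "num_ceilings": 40}
print(terrain_params(0.5, ceiling_min, ceiling_max))  # {'ceiling_height_m': 1.15, 'num_ceilings': 25.0}

def update_difficulty(s: float, success: bool, step: float = 0.1) -> float:
    """Promote to harder settings on success, demote on failure (clamped to [0, 1])."""
    return min(1.0, s + step) if success else max(0.0, s - step)
```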
4.5. Training Details (from Supplementary Material)
4.5.1. Hyperparameters
The training framework is based on [41]. The key PPO hyperparameters used are:
The following are the hyperparameters and their values from Table 4 of the original paper:
| Hyperparameter | Value |
| Environment number | 1024 × 8 |
| Steps per iteration | 4 |
| PPO epochs | 4 |
| Minibatches | 8 |
| Clip range | 0.2 |
| Entropy coefficient | 0.003 |
| GAE factor λ | 0.95 |
| Discount factor γ | 0.99 |
| Learning rate | 5e-4 |
- Environment number: The total number of parallel simulation environments running simultaneously to collect experience (1024 × 8).
- Steps per iteration: The number of environment steps collected before a policy update.
- PPO epochs: The number of times the collected data is iterated over during a policy update.
- Minibatches: The number of minibatches into which the collected data is divided for training.
- Clip range: A parameter in PPO that limits the ratio of the new policy's probability to the old policy's probability, preventing large policy updates. A value of 0.2 means the ratio is clipped to [0.8, 1.2].
- Entropy coefficient: A weight for the entropy bonus in the PPO loss function, which encourages exploration.
- GAE factor λ: A parameter of Generalized Advantage Estimation used to compute advantage estimates, balancing bias and variance in Reinforcement Learning.
- Discount factor γ: Weights the importance of future rewards relative to immediate rewards. A value of 0.99 means future rewards are slightly discounted.
- Learning rate: The step size for updating the neural network weights during optimization.
4.5.2. Policy Network Structure
The Actor and Critic networks share the same architecture but have independent parameters. The network processes two types of input: non-voxel information (e.g., proprioceptive input) and voxel grid input.

- Non-Voxel Information Processing: A two-layer Multi-Layer Perceptron (MLP) with hidden dimensions of 256 is used to encode non-voxel information.
  - First layer: $ h_{\mathrm{mlp}}^{(1)} = \mathrm{Mish}\big(\mathrm{LN}(W_{\mathrm{mlp},1} x_{\mathrm{mlp}} + b_{\mathrm{mlp},1})\big) $
  - Second layer: $ h_{\mathrm{mlp}} = W_{\mathrm{mlp},2} h_{\mathrm{mlp}}^{(1)} + b_{\mathrm{mlp},2}, \quad \dim(h_{\mathrm{mlp}}) = 256 $

  where:
  - $x_{\mathrm{mlp}}$: The non-voxel input (e.g., commands, action history, proprioception).
  - $W_{\mathrm{mlp},1}$, $b_{\mathrm{mlp},1}$: Weights and bias of the first MLP layer.
  - LN: Layer Normalization, a technique to normalize the activations of network layers.
  - Mish: The Mish activation function, defined as $\mathrm{Mish}(x) = x \tanh(\ln(1 + e^x))$.
  - $h_{\mathrm{mlp}}^{(1)}$: Output of the first MLP layer after activation and normalization.
  - $W_{\mathrm{mlp},2}$, $b_{\mathrm{mlp},2}$: Weights and bias of the second MLP layer.
  - $h_{\mathrm{mlp}}$: The encoded non-voxel feature vector, with a dimension of 256.

- Voxel Grid Processing: In parallel, a three-layer 2D CNN processes the voxel grid input. As described previously, the z-dimension of the voxel grid is treated as channels.
  - First layer (after convolution and pooling, flattened): $ h_{\mathrm{cnn}}^{(1)} = \mathrm{Mish}\big(\mathrm{LN}(W_{\mathrm{cnn},1} h_{\mathrm{cnn}}^{\mathrm{flat}} + b_{\mathrm{cnn},1})\big) $
  - Second layer: $ h_{\mathrm{cnn}} = W_{\mathrm{cnn},2} h_{\mathrm{cnn}}^{(1)} + b_{\mathrm{cnn},2}, \quad \dim(h_{\mathrm{cnn}}) = 64 $

  where:
  - $h_{\mathrm{cnn}}^{\mathrm{flat}}$: The flattened output of the 2D CNN layers after processing the voxel grid.
  - $W_{\mathrm{cnn},1}$, $b_{\mathrm{cnn},1}$: Weights and bias of the first MLP-like layer applied to the flattened CNN features.
  - $h_{\mathrm{cnn}}^{(1)}$: Output of the first CNN feature processing layer.
  - $W_{\mathrm{cnn},2}$, $b_{\mathrm{cnn},2}$: Weights and bias of the second MLP-like layer.
  - $h_{\mathrm{cnn}}$: The encoded voxel feature vector, with a dimension of 64.

- Feature Concatenation and Final MLP: The two feature vectors, $h_{\mathrm{mlp}}$ and $h_{\mathrm{cnn}}$, are concatenated to form a combined feature vector $f$.
  - Concatenation: $ f = [h_{\mathrm{mlp}},\ h_{\mathrm{cnn}}] $ (concatenation along the feature dimension).
  - The combined feature is then passed through another MLP to produce a 256-dimensional latent representation.
    - First layer: $ h_{\mathrm{out}}^{(1)} = \mathrm{Mish}(f) $ (this implies a linear layer followed by Mish, potentially with Layer Normalization as well, simplified for presentation here).
    - Second layer: $ h_{\mathrm{out}} = \mathrm{Mish}\big(W_{\mathrm{out}} h_{\mathrm{out}}^{(1)} + b_{\mathrm{out}}\big), \quad \dim(h_{\mathrm{out}}) = 256 $

    where:
    - $h_{\mathrm{out}}^{(1)}$: Intermediate output after the first Mish activation.
    - $W_{\mathrm{out}}$, $b_{\mathrm{out}}$: Weights and bias of the final MLP layer.
    - $h_{\mathrm{out}}$: The 256-dimensional latent representation.

Finally, this latent vector $h_{\mathrm{out}}$ is fed into a final MLP head to produce the outputs:

- The Actor outputs an action vector of dimension 29 (for the 29-DoF Unitree G1 humanoid).
- The Critic outputs a scalar value estimate (the predicted value of the current state).

All layers utilize the Mish activation function.
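Putting the pieces together, here is a hedged PyTorch sketch of the actor's forward pass; the layer sizes follow the description above, but the exact architecture (pooling, normalization placement, CNN widths) may differ from the authors' code.

```python
import torch
import torch.nn as nn

class GallantActor(nn.Module):
    """Fuse an MLP over proprioception/commands with a z-as-channel 2D CNN over the voxel grid."""

    def __init__(self, non_voxel_dim: int, action_dim: int = 29, z_slices: int = 40):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(non_voxel_dim, 256), nn.LayerNorm(256), nn.Mish(),
            nn.Linear(256, 256),
        )
        self.cnn = nn.Sequential(                            # voxel encoder: z-slices as channels
            nn.Conv2d(z_slices, 32, 3, stride=2, padding=1), nn.Mish(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.Mish(),
            nn.Flatten(), nn.LazyLinear(64),
        )
        self.fuse = nn.Sequential(
            nn.Linear(256 + 64, 256), nn.Mish(),
            nn.Linear(256, 256), nn.Mish(),
        )
        self.action_head = nn.Linear(256, action_dim)        # 29 joint commands for the Unitree G1

    def forward(self, non_voxel_obs: torch.Tensor, voxel_grid: torch.Tensor) -> torch.Tensor:
        h_mlp = self.mlp(non_voxel_obs)                           # (B, 256)
        h_cnn = self.cnn(voxel_grid.permute(0, 3, 1, 2).float())  # (B, 32, 32, 40) -> (B, 64)
        latent = self.fuse(torch.cat([h_mlp, h_cnn], dim=-1))
        return self.action_head(latent)
```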
4.5.3. Observation Terms and Dimensions
The composition of the observation was detailed in Section 4.1. The dimensionality of each component for a single time step is summarized below. Before being fed into the policy, observations are processed using a trainable vecnorm module, which is applied in both training and deployment to normalize inputs.
The following are the observation terms and their dimensions from Table 5 of the original paper:
| Observation Term | Dimension |
| Pt | 4 |
| Telapse,t | 1 |
| Tleft,t | 1 |
| at | 29 |
| ωt | 3 |
| gt | 3 |
| qt | 29 |
| q̇t | 29 |
| Voxel_Gridt | [32 × 32 × 40] |
| vt | 3 |
| Height_Mapt | 1089 |

- Pt: Goal position relative to the robot (4 dimensions: x, y, z, and an auxiliary term such as yaw).
- Telapse,t: Elapsed time (1 dimension).
- Tleft,t: Time left (1 dimension).
- at: Actions (29 dimensions, corresponding to joint commands).
- ωt: Root angular velocity (3 dimensions: roll, pitch, yaw rates).
- gt: Gravity vector in the robot frame (3 dimensions).
- qt: Joint positions (29 dimensions).
- q̇t: Joint velocities (29 dimensions).
- Voxel_Gridt: The 3D voxel occupancy grid, with dimensions 32 × 32 × 40.
- vt: Root linear velocity (3 dimensions: x, y, z velocities).
- Height_Mapt: The flattened height map, with 1089 dimensions. Prior to flattening, it is a 33 × 33 tensor capturing local terrain height around the robot, centered at its base.
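The vecnorm module mentioned above is a running input normalizer; the sketch below shows the standard running mean/variance approach rather than the authors' exact implementation.

```python
import numpy as np

class RunningVecNorm:
    """Normalize observations with running mean/variance (Welford-style batch updates)."""

    def __init__(self, dim: int, eps: float = 1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps

    def update(self, batch: np.ndarray) -> None:
        batch_mean, batch_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        m_a = self.var * self.count
        m_b = batch_var * n
        self.var = (m_a + m_b + delta**2 * self.count * n / total) / total
        self.count = total

    def __call__(self, obs: np.ndarray) -> np.ndarray:
        # at deployment the statistics are frozen and only this normalization step is applied
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```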
4.5.4. Reward Function Details
Beyond the primary goal-reaching reward ($r_{\mathrm{reach}}$), Gallant incorporates auxiliary shaping terms to improve sample efficiency during early training, inspired by Rudin et al. [31]. These geometry-aware and general-purpose rewards are computed consistently across all terrains without task-specific tuning.

- Directional velocity reward ($r_{\mathrm{velocity\_direction}}$): This reward encourages the robot to move in a direction that aligns with the goal while simultaneously considering obstacle avoidance:
  $
  r_{\mathrm{velocity\_direction}} = \frac{\mathbf{a}(\mathbf{p}, \mathbf{g}) \cdot \mathbf{v}_t}{\Vert \mathbf{a}(\mathbf{p}, \mathbf{g}) \cdot \mathbf{v}_t \Vert_2}
  $
  where:
  - $\mathbf{v}_t$: The robot's instantaneous linear velocity.
  - $\mathbf{a}(\mathbf{p}, \mathbf{g})$: A direction vector that combines goal alignment and obstacle avoidance; this vector is critical for guiding the robot's movement.

  The direction vector is calculated as:
  $
  \mathbf{a}(\mathbf{p}, \mathbf{g}) = \sum_{j \in \mathcal{N}(\mathbf{p}, r)} w_j \mathbf{u}_{r,j} + \kappa \sum_{j \in \mathcal{N}(\mathbf{p}, r)} w_j \gamma_j \mathbf{t}_j
  $
  where:
  - $\mathbf{p}$: The robot's current position.
  - $\mathbf{g}$: The goal direction vector.
  - $\mathcal{N}(\mathbf{p}, r)$: The set of obstacle points located within a radius $r$ of the robot's position $\mathbf{p}$.
  - $w_j$: A distance-based weighting factor for obstacle $j$. It gives higher weight to closer obstacles, calculated as:
    $
    w_j = \frac{\left[ \max\left(1 - \frac{\max(d_j - 0.2,\, 0.02)}{0.8},\, 0\right) \right]^2}{\max(d_j - 0.2,\, 0.02)}
    $
    where $d_j$ is the distance to obstacle $j$. This function gives a strong repulsion weight to obstacles within 1 m, with an inner threshold of 0.2 m (and a minimum effective distance of 0.02 m to avoid division-by-zero issues).
  - $\mathbf{u}_{r,j}$: The repulsion unit vector from obstacle $j$ pointing towards the robot. This term encourages the robot to move away from obstacles.
  - $\mathbf{t}_j$: A tangential unit vector (either left or right) around obstacle $j$. This term encourages the robot to circumnavigate obstacles.
  - $\kappa$: A weighting coefficient for the tangential term, set to 0.8.
  - $\gamma_j$: A factor that filters out obstacles that are behind the goal direction, ensuring the robot does not unnecessarily avoid obstacles it has already passed or that are not in its path towards the goal.

  This direction computation is applied only to relevant structures, such as cylinders in Forest and walls in Door, and is efficiently parallelized via Warp.

- Head height reward ($r_{\mathrm{head\_height}}$): This reward encourages the robot to adjust its body height proactively, which is particularly useful for overhead constraints:
  $
  r_{\mathrm{head\_height}} = \exp\big(-4\,(H_{\mathrm{head,est}} - H_{\mathrm{head}})^2\big)
  $
  where:
  - $H_{\mathrm{head,est}}$: The estimated target head height. This is computed by conceptually shifting the robot forward along the goal direction, averaging the terrain height within a square region at that location, and then subtracting a clearance offset.
  - $H_{\mathrm{head}}$: The robot's actual head height.

  This Gaussian-shaped reward is maximized when the robot's actual head height matches the estimated target head height, encouraging it to lower its head (crouch) to pass under obstacles like ceilings.

- Foot clearance reward ($r_{\mathrm{feet\_clearance}}$): This reward encourages the robot to proactively lift its feet to clear obstacles, distinguishing it from prior work that might only consider terrain directly under the foot:
  $
  r_{\mathrm{feet\_clearance}} = \exp\big(-4\,(H_{\mathrm{feet,est}} - H_{\mathrm{feet}})^2\big)
  $
  where:
  - $H_{\mathrm{feet,est}}$: The estimated target foot height, calculated by querying terrain ahead of each foot and averaging the height in a square region.
  - $H_{\mathrm{feet}}$: The robot's actual foot height.

  Similar to the head height reward, this Gaussian-shaped reward promotes proactive leg lifting over steps or platforms, ensuring that the robot prepares its foot trajectory in advance of obstacles.
4.5.5. Domain Randomization (beyond LiDAR-specific)
In addition to the LiDAR-specific domain randomization (pose, hit position, latency, missing grid), several general randomization strategies are applied during training to enhance policy robustness and sim-to-real transferability:

- Mass randomization: The masses of the robot's pelvis and torso links are randomized as $m' = \alpha \cdot m$, where $m$ is the original mass, $m'$ the randomized mass, and $\alpha$ a random value sampled from a uniform distribution between 0.8 and 1.2. This simulates variations in robot construction or payload.
- Foot-ground contact randomization: Parameters related to contact physics are randomized. The ground friction coefficient is fixed at 1.0, while the foot joint friction and the restitution coefficient (bounciness) are sampled from uniform ranges. These randomizations help the policy become robust to variations in surface properties and joint mechanics.
- Control parameter randomization: The joint stiffness ($K_p$) and damping ($K_d$) parameters for the robot's joints are randomized around their nominal values (following the settings in Liao et al. [19]). This makes the policy less sensitive to exact control gains in deployment.
- Torso center-of-mass offset: The center of mass (CoM) position of the torso is perturbed by a random offset sampled along each axis (x, y, z). This simulates slight manufacturing variations or changes in robot configuration.
- Init Joint Position offset: A random offset is added to the robot's default joint positions and default joint velocities during environment reset. This randomizes the robot's initial state, encouraging a more robust starting posture.
4.5.6. Termination Conditions
Several termination conditions are used during training to encourage effective and safe behavior and prevent undesirable strategies:

- Force contact: An episode terminates if the external force acting on the torso, hip, or knee joints exceeds a threshold at any time step. This prevents the robot from relying on harsh collisions.
- Pillar fall: For pillar-based terrains (Forest, Pile), if a foot penetrates more than 10 cm below the ground level, the episode terminates. This prevents the robot from "cheating" by falling through or bypassing obstacles.
- No movement: To prevent the agent from exploiting reward shaping by simply staying in place, an episode terminates if the robot fails to move a minimum distance away from its initial position within 4 seconds.
- Fall over: The episode terminates if the robot loses balance and falls.
- Feet too close: Since self-collision is disabled during training (to speed up simulation), this condition prevents the robot's feet from crossing or overlapping unnaturally, ensuring physically plausible motions.
4.5.7. Symmetry
Symmetry-based data augmentation is applied to accelerate training. This involves flipping certain observations along the y-axis.

- Proprioceptive observations are flipped, similar to Ben et al. [3].
- The perception representation (the voxel grid) is also mirrored along the y-dimension. For example, a grid map is flipped across its y-axis to align with the flipped proprioceptive input, forming a consistent flipped observation.
- The reward remains unchanged under this transformation. Both original and flipped samples are stored in the rollout buffer and used jointly during training.
4.6. Real-world deployment Details (from Supplementary Material)
4.6.1. Target Position Command
For real-world deployment, a Livox Mid-360 LiDAR mounted on the robot's head (facing downwards) is used with FastLIO2 [42, 43] to provide the robot's position in the world coordinate frame at a fixed frequency.

- The Mid-360 provides a FoV of 360° horizontally and −7° to 52° vertically.
- During training, the observation frequency of $\mathbf{P}_t$ (goal position) is set to the same frequency.
- The robot starts at (0, 0), and the goal position for each run is fixed at (4, 0) m.
- At each time step, FastLIO2 outputs the current robot position (x, y). The observation relative to the goal is then calculated as:
$
P_t = (4, 0) - (x, y) = (4 - x, -y)
$
4.6.2. Voxel Grid Processing
Two Hesai JT128 LiDARs (front and rear torso-mounted) collect raw point cloud data.
- Each JT128 offers wide vertical and horizontal FoV coverage, with 128 channels.
- The dual-sensor setup provides near-complete coverage around the robot.
- The JT128 supports multiple output rates; the 10 Hz mode was chosen for better point cloud quality, and simulation is aligned accordingly.
- To improve voxel grid quality, raw point clouds are merged and then processed onboard using OctoMap [16] before being passed to the policy (a minimal voxelization sketch follows this list). OctoMap generates a binary occupancy grid. It is emphasized that OctoMap serves as a lightweight preprocessing step, not a full reconstruction pipeline, thus incurring minimal latency.
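The sketch below illustrates the kind of lightweight point-cloud-to-occupancy-grid conversion this preprocessing performs. It is not OctoMap itself; the 5 cm resolution follows the paper, while the grid dimensions and origin are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of turning merged LiDAR points into a binary occupancy grid.
def voxelize(points: np.ndarray, origin: np.ndarray,
             resolution: float = 0.05,
             dims: tuple = (80, 80, 40)) -> np.ndarray:
    """points: (N, 3) array in the robot/body frame; returns a binary grid."""
    grid = np.zeros(dims, dtype=np.uint8)
    idx = np.floor((points - origin) / resolution).astype(int)
    # Keep only points that fall inside the perception volume.
    valid = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)
    idx = idx[valid]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Usage with random points in a 4 m x 4 m x 2 m box around the robot (illustrative).
pts = np.random.uniform([-2, -2, -1], [2, 2, 1], size=(10000, 3))
occ = voxelize(pts, origin=np.array([-2.0, -2.0, -1.0]))
```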
4.6.3. Information Communication
The system runs on an NVIDIA Orin NX with limited communication performance. To reduce latency:
- LiDAR output is clipped to only include points within the perception voxel grid, reducing data size.
- The voxel grid from OctoMap and the one used for observation share memory to avoid redundant transmission.
- Robot state reading and action command delivery are also implemented via shared memory, bypassing LCM (Lightweight Communications and Marshalling, a common robotics communication system).

These optimizations minimize communication-induced latency, leaving primarily inherent sensor delays. The overall communication process is shown in Figure 8; a minimal shared-memory sketch appears after the figure.
The following figure (Figure 8 from the original paper) illustrates the flow of information communication:
The figure is a schematic of the information communication flow, showing the interactions among the JT128 sensors, onboard PC processing, OctoMap, robot control, and the policy. The components exchange data streams at different frequencies to achieve efficient action control and environment perception.
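The shared-memory design described above can be sketched as follows; the buffer name and grid dimensions are illustrative assumptions, not the authors' implementation.

```python
from multiprocessing import shared_memory
import numpy as np

# Minimal sketch of sharing the voxel grid between the OctoMap process and the
# policy process via shared memory, avoiding a serialization/transmission step.
GRID_SHAPE = (80, 80, 40)  # illustrative dimensions

# Producer (OctoMap side): create the block once and write occupancy in place.
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(GRID_SHAPE)), name="voxel_grid")
grid = np.ndarray(GRID_SHAPE, dtype=np.uint8, buffer=shm.buf)
grid[:] = 0  # updated in place at the LiDAR rate

# Consumer (policy side): attach to the same block without copying.
shm_view = shared_memory.SharedMemory(name="voxel_grid")
obs = np.ndarray(GRID_SHAPE, dtype=np.uint8, buffer=shm_view.buf)

# Cleanup (the producer owns and unlinks the block).
shm_view.close()
shm.close()
shm.unlink()
```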
5. Experimental Setup
5.1. Datasets
The experiments primarily use simulated environments generated within NVIDIA IsaacSim [28]. These environments are configured with 8 representative terrain types designed to progressively increase in difficulty and cover a wide range of 3D constraints. The curriculum-based training strategy ensures that the robot is exposed to increasing complexity.
- Plane: Flat ground.
- Ceiling: Overhead obstacles of varying heights and densities.
- Forest: Randomly spaced cylindrical pillars (lateral clutter).
- Door: Narrow passages requiring precise lateral movement.
- Platform: Elevated structures with gaps, demanding step recognition and traversal.
- Pile: Uneven, gapped surfaces requiring fine foothold selection.
- Upstair: Stairs for climbing.
- Downstair: Stairs for descending.
These terrains are generated using parameterized settings, interpolated between a
min (easiest) and max (hardest) configuration, as detailed in Table 2 of the original paper (and transcribed in Section 4.4). The following figure (Figure 3 from the original paper) shows examples of these terrain types:
The figure illustrates the terrain types used to train the robot in simulation, including Ceiling, Door, Pile, Downstairs, Plane, Forest, and Platform, helping readers understand how the robot navigates these environments.
The simulated environment itself acts as the dataset, providing LiDAR point clouds and proprioceptive feedback to train the Reinforcement Learning policy. For real-world deployment, the Unitree G1 humanoid interacts with physical versions of these challenging terrains.
5.2. Evaluation Metrics
The paper uses two distinct metrics to evaluate policy performance in simulation, particularly on the most challenging terrain settings (the max configuration):
- Success rate ($E_{\mathrm{succ}}$):
  - Conceptual Definition: This metric quantifies the proportion of episodes in which the robot successfully reaches its target goal within a specified time limit (10 seconds) without catastrophic failures, such as falling or incurring any severe collisions with obstacles. It measures the policy's overall capability to complete the task.
  - Mathematical Formula: The paper defines it as a "fraction," so it can be expressed as: $ E_{\mathrm{succ}} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} $
  - Symbol Explanation:
    - $E_{\mathrm{succ}}$: The success rate.
    - Number of successful episodes: Episodes where the robot reaches the target within 10 seconds, without falling or severe collisions.
    - Total number of episodes: The total number of trials conducted.
- Collision momentum ($E_{\mathrm{collision}}$):
  - Conceptual Definition: This metric measures the cumulative momentum transferred through unintended or unnecessary physical contacts between the robot and its environment. It explicitly excludes nominal foot contacts (expected and desired contacts for locomotion). A lower collision momentum indicates a more adept and collision-free navigation policy.
  - Mathematical Formula: The paper describes it as "cumulative momentum transferred through unnecessary contacts." While a precise formula is not provided, it implies summing the momentum changes (force integrated over time) from non-foot contacts. Conceptually, if $\Delta \mathbf{p}_k$ is the momentum transferred during an unnecessary contact event $k$, then: $ E_{\mathrm{collision}} = \sum_{k=1}^{N_{\text{unnecessary contacts}}} \|\Delta \mathbf{p}_k\| $
  - Symbol Explanation:
    - $E_{\mathrm{collision}}$: The cumulative collision momentum.
    - $N_{\text{unnecessary contacts}}$: The total number of contact events that are not nominal foot contacts.
    - $\Delta \mathbf{p}_k$: The change in momentum caused by the $k$-th unnecessary contact; $\|\Delta \mathbf{p}_k\|$ denotes the magnitude of this momentum change.

For simulation experiments, each policy is trained for 4,000 iterations, followed by 5 independent evaluations, each consisting of 1,000 complete episodes. The mean and standard deviation for these metrics are reported. Policies with higher $E_{\mathrm{succ}}$ and lower $E_{\mathrm{collision}}$ are considered superior (a minimal computation sketch is given below).
For real-world experiments, success rates are also measured. Policies are tested over 15 trials per terrain.
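Here is a minimal sketch of how both metrics could be computed from logged episode data; the per-episode record format is an assumption of this sketch, not the paper's evaluation code.

```python
import numpy as np

# Minimal sketch of the two evaluation metrics defined above.
def evaluate(episodes):
    """episodes: list of dicts with assumed keys:
         'reached_goal' (bool), 'time_s' (float), 'fell' (bool),
         'contact_impulses' (list of 3-vectors from non-foot contacts)."""
    # Success: reached the goal within 10 s without falling (severe collisions
    # would additionally be excluded in the paper's definition).
    successes = sum(
        1 for e in episodes
        if e["reached_goal"] and e["time_s"] <= 10.0 and not e["fell"]
    )
    e_succ = successes / len(episodes)

    # Cumulative momentum transferred through unnecessary (non-foot) contacts,
    # averaged over episodes.
    e_collision = np.mean([
        sum(np.linalg.norm(dp) for dp in e["contact_impulses"])
        for e in episodes
    ])
    return e_succ, e_collision
```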
5.3. Baselines
To thoroughly assess the effectiveness of Gallant's core components, comparisons are made against several ablated methods and alternative approaches:
Simulation Baselines (Ablation Studies)
These baselines are used in IsaacSim to isolate and evaluate the contribution of specific design choices within Gallant:
- Self-scan Ablation:
  - w/o-Self-Scan: This variant disables simulated LiDAR returns from dynamic geometry, specifically the robot's own moving links (e.g., legs, arms). It only scans static terrain. This is compared against Gallant, which explicitly models scans over both static terrain and dynamic robot links. This ablation tests the importance of the high-fidelity LiDAR simulation for dynamic objects.
- Perceptual Network Ablation: This compares Gallant's z-grouped 2D CNN with alternative CNN architectures for processing the voxel grid:
  - Standard 3D CNN: A convolutional neural network that applies filters across all three spatial dimensions (x, y, z) of the voxel grid. This represents a more direct but potentially more computationally expensive way to process 3D volumetric data.
  - Sparse 2D CNN: A 2D CNN that incorporates sparsity, meaning it only performs computations on occupied voxels in the x-y plane.
  - Sparse 3D CNN: A 3D CNN optimized for sparse 3D data (commonly used in LiDAR perception [5, 12]).

  These variants (Sparse-2D-CNN, Sparse-3D-CNN) are based on [8]. These ablations evaluate the accuracy-compute tradeoff of the chosen z-grouped 2D CNN.
- Perceptual Representation Ablation: This examines the chosen perceptual input to the actor and critic:
  - Only-Height-Map: The actor and critic both receive only a height map as their perceptual input, completely replacing the voxel grid. This highlights the limitations of 2.5D representations for 3D constrained terrains.
  - Only-Voxel-Grid: The actor and critic both receive only the voxel grid (and no height map for the critic). This tests the benefit of including the height map as privileged information for the critic during training.
- Voxel Resolution Ablation: This explores the impact of voxel size on performance:
  - 10CM: A voxel resolution of 10 cm. This increases the Field of View (FoV) but reduces geometric fidelity.
  - 2.5CM: A voxel resolution of 2.5 cm. This increases geometric precision but reduces FoV under a fixed memory budget.

  These are compared against Gallant's default 5CM resolution to find the optimal balance between coverage and detail.
Real-World Baselines
These policies are deployed on the Unitree G1 humanoid to evaluate sim-to-real performance:
- HeightMap: This policy replaces Gallant's voxel grid with an elevation map (estimated from Livox Mid-360 LiDAR data) for its perception module. This serves as a direct comparison against a common 2.5D perception method in a real-world setting.
- NoDR (No Domain Randomization): This policy is trained identically to Gallant but without the LiDAR domain randomization (pose, hit position, latency, missing grid, as described in Section 4.2). This highlights the critical role of domain randomization in bridging the sim-to-real gap.
6. Results & Analysis
6.1. Core Results Analysis
Gallant's experimental results demonstrate its superior performance and robustness across various challenging 3D constrained terrains in both simulation and the real world. The analyses highlight the importance of its key components, including the LiDAR simulation with dynamic object scanning, the z-grouped 2D CNN, the voxel grid representation, and the 5cm voxel resolution.
Simulation Experiments
The performance of Gallant and its ablations is evaluated in IsaacSim on the hardest terrain settings (the max configuration), using success rate ($E_{\mathrm{succ}}$) and collision momentum ($E_{\mathrm{collision}}$).
The following are the simulation results from Table 3 of the original paper:
| Method | Plane | Ceiling | Forest | Door | Platform | Pile | Upstair | Downstair | ||||||||
| | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ | Esucc ↑ | Ecollision ↓ |
| (a) Ablation on Self-scan | ||||||||||||||||
| w/o-Self-Scan | 99.7 (±0.1) | 27.2 (±1.0) | 579.0 (±55.1) | 33.0 (±0.9) | 305.5 (±16.6) | 33.1 (±1.4) | 264.9 (±21.7) | 31.0 (±0.8) | 200.5 (±18.9) | 32.0 (±1.0) | 190.1 (±17.5) | 30.5 (±0.9) | 220.3 (±20.1) | 32.8 (±0.9) | 210.6 (±19.8) | 31.5 (±1.1) |
| Gallant | 100.0 (±0.0) | 0.0 (±0.0) | 97.1 (±0.6) | 24.6 (±6.3) | 84.3 (±0.7) | 311.1 (±25.9) | 98.7 (±0.3) | 27.7 (±6.4) | 96.1 (±0.5) | 30.1 (±5.3) | 82.1 (±0.6) | 113.1 (±14.6) | 96.2 (±0.6) | 27.0 (±4.9) | 97.9 (±0.4) | 15.6 (±6.2) |
| (b) Ablation on Perceptual Network | ||||||||||||||||
| Sparse-3D-CNN | 100.0 (±0.0) | 0.0 (±0.0) | 86.7 (±2.1) | 143.5 (±46.1) | 84.1 (±1.5) | 277.8 (±22.1) | 98.0 (±0.6) | 74.8 (±7.9) | 88.8 (±1.5) | 96.8 (±11.6) | 52.4 (±1.5) | 365.9 (±12.3) | 80.1 (±2.2) | 107.7 (±15.8) | 97.5 (±0.4) | 18.9 (±14.1) |
| 3D-CNN | 99.9 (±0.1) | 0.0 (±0.0) | 97.5 (±0.5) | 20.0 (±6.6) | 73.9 (±2.1) | 379.0 (±70.2) | 96.1 (±0.7) | 69.58 (±5.8) | 92.7 (±1.0) | 65.6 (±9.5) | 65.3 (±0.9) | 275.4 (±31.5) | 86.0 (±1.4) | 78.1 (±19.2) | 99.0 (±0.3) | 12.1 (±11.6) |
| Sparse-2D-CNN | 99.6 (±0.2) | 0.7 (±1.4) | 96.0 (±1.0) | 26.17 (±5.1) | 80.2 (±1.1) | 363.1 (±14.4) | 92.7 (±1.0) | 199.6 (±120.2) | 87.9 (±1.1) | 100.5 (±20.3) | 57.6 (±0.9) | 360.3 (±16.3) | 89.1 (±0.7) | 52.9 (±4.8) | 98.7 (±0.6) | 4.55 (±2.92) |
| Gallant | 100.0 (±0.0) | 0.0 (±0.0) | 97.1 (±0.6) | 24.6 (±6.3) | 84.3 (±0.7) | 311.1 (±25.9) | 98.7 (±0.3) | 27.7 (±6.4) | 96.1 (±0.5) | 30.1 (±5.3) | 82.1 (±0.6) | 113.1 (±14.6) | 96.2 (±0.6) | 27.0 (±4.9) | 97.9 (±0.4) | 15.6 (±6.2) |
| (c) Ablation on Perceptual Interface | ||||||||||||||||
| Only-Height-Map | 100.0 (±0.0) | 0.0 (±0.0) | 5.3 (±2.0) | 1995.3 (±68.3) | 10.5 (±1.5) | 577.4 (±18.1) | 10.2 (±1.3) | 717.5 (±33.8) | 96.0 (±0.7) | 34.3 (±2.8) | 86.2 (±0.6) | 101.6 (±13.8) | 98.3 (±0.2) | 11.6 (±6.2) | 98.5 (±0.3) | 11.2 (±6.4) |
| Only-Voxel-Grid | 100.0 (±0.0) | 0.0 (±0.0) | 96.9 (±0.4) | 22.4 (±4.2) | 75.9 (±1.5) | 506.0 (±20.6) | 96.0 (±0.3) | 281.4 (±29.0) | 94.2 (±0.8) | 51.0 (±10.2) | 72.3 (±0.6) | 201.8 (±14.9) | 96.2 (±0.6) | 46.9 (±10.5) | 98.8 (±0.2) | 7.0 (±3.9) |
| Gallant | 100.0 (±0.0) | 0.0 (±0.0) | 97.1 (±0.6) | 24.6 (±6.3) | 84.3 (±0.7) | 311.1 (±25.9) | 98.7 (±0.3) | 27.7 (±6.4) | 96.1 (±0.5) | 30.1 (±5.3) | 82.1 (±0.6) | 113.1 (±14.6) | 96.2 (±0.6) | 27.0 (±4.9) | 97.9 (±0.4) | 15.6 (±6.2) |
| (d) Ablation on Voxel Resolution | ||||||||||||||||
| 10CM | 98.8 (±0.2) | 2.1 (±1.6) | 13.3 (±2.4) | 1442.4 (±119.6) | 59.0 (±1.7) | 642.7 (±12.4) | 64.8 (±1.1) | 591.0 (±22.5) | 67.2 (±2.7) | 268.9 (±39.3) | 54.1 (±1.7) | 400.2 (±19.5) | 86.3 (±1.2) | 74.8 (±12.8) | 96.6 (±0.4) | 15.2 (±6.1) |
| 2.5CM | 99.9 (±0.1) | 2.1 (±1.6) | 97.3 (±0.9) | 24.2 (±11.0) | 77.5 (±3.4) | 368.0 (±36.3) | 97.5 (±0.4) | 260.4 (±38.8) | 75.5 (±0.5) | 63.0 (±4.9) | 65.2 (±5.5) | 256.3 (±50.0) | 94.1 (±1.1) | 38.6 (±6.7) | 97.5 (±0.4) | 13.5 (±2.0) |
| Gallant (5CM) | 100.0 (±0.0) | 0.0 (±0.0) | 97.1 (±0.6) | 24.6 (±6.3) | 84.3 (±0.7) | 311.1 (±25.9) | 98.7 (±0.3) | 27.7 (±6.4) | 96.1 (±0.5) | 30.1 (±5.3) | 82.1 (±0.6) | 113.1 (±14.6) | 96.2 (±0.6) | 27.0 (±4.9) | 97.9 (±0.4) | 15.6 (±6.2) |
1. LiDAR Returns from Dynamic Objects is Necessary (Self-scan Ablation):
Gallant (which includes self-scan) achieves significantly higher success rates and lower collision momentum across all tasks compared to the w/o-Self-Scan variant. For instance, on the Ceiling task, Gallant has a 97.1% success rate and 24.6 collision momentum, whereas w/o-Self-Scan shows only 33.0% success and 305.5 collision momentum. This is attributed to the fact that when the robot adopts postures like crouching (e.g., under a ceiling), its own links (e.g., legs) can occlude the ground. Without self-scan, the voxel grid would artificially show a flat floor, leading to an out-of-distribution (OOD) observation for the policy. This demonstrates that simulating dynamic objects in the LiDAR pipeline is crucial for realistic perception and robust performance, particularly in scenarios requiring significant body posture changes.
The following figure (Figure 5 from the original paper) illustrates the effect of self-scan:

2. z-grouped 2D CNN is the Most Suitable Choice (Perceptual Network Ablation):
While the 3D-CNN variant marginally outperforms Gallant on Ceiling (97.5% vs 97.1% success, 20.0 vs 24.6 collision momentum) and Downstair (99.0% vs 97.9% success, 12.1 vs 15.6 collision momentum), Gallant's z-grouped 2D CNN generally performs better or comparably well on most other tasks and maintains competitive performance. For Forest and Door, Gallant (84.3% and 98.7% success) clearly outperforms 3D-CNN (73.9% and 96.1% success). The paper argues that its voxel input has relatively dense occupancy in the x-y plane, making sparse convolutions less efficient due to rulebook overhead. Full 3D CNNs introduce more parameters and memory traffic, making optimization harder. The z-grouped 2D CNN effectively preserves vertical structure through channel mixing, leverages optimized dense 2D operators, and provides the right inductive bias for an egocentric raster, delivering superior performance with markedly lower compute.
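Since the analysis above describes the design only at a high level, the following PyTorch sketch shows one plausible reading of a z-grouped 2D CNN: the vertical (z) slices of the voxel grid are folded into the channel dimension, a grouped 2D convolution processes bands of adjacent slices, and 1x1 convolutions mix channels to preserve vertical structure. All layer sizes, the grouping factor, and the grid dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch of a "z-grouped 2D CNN": z-slices become input channels,
# grouped convolution keeps per-band processing cheap, 1x1 convs mix across z.
class ZGrouped2DCNN(nn.Module):
    def __init__(self, z_bins: int = 24, z_groups: int = 4, feat_dim: int = 128):
        super().__init__()
        assert z_bins % z_groups == 0
        self.net = nn.Sequential(
            # Grouped conv: each group sees only its own band of z-slices.
            nn.Conv2d(z_bins, 64, kernel_size=3, stride=2, padding=1, groups=z_groups),
            nn.ReLU(),
            # 1x1 conv mixes information across z-groups (channel mixing).
            nn.Conv2d(64, 64, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, voxel_grid: torch.Tensor) -> torch.Tensor:
        # voxel_grid: (batch, z_bins, x, y) binary occupancy with z folded into channels.
        return self.net(voxel_grid)

# Usage with a dummy 80x80x24 egocentric grid (dimensions are illustrative).
feats = ZGrouped2DCNN()(torch.zeros(2, 24, 80, 80))
print(feats.shape)  # torch.Size([2, 128])
```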
3. Combination of Voxel Grid and Height Map is Better (Perceptual Interface Ablation):
Only-Height-Map performs poorly on 3D constrained terrains like Ceiling (5.3% success), Forest (10.5% success), and Door (10.2% success), confirming its inability to represent multilayer structures. However, it performs well on Platform, Pile, Upstair, and Downstair due to their primary ground-level nature. Gallant (which uses a voxel grid for the actor and both voxel grid and height map as privileged information for the critic) achieves higher success rates than Only-Voxel-Grid (which only uses voxel grid for both actor and critic) across all tasks. This validates Gallant's asymmetric design, where the height map aids the critic in credit assignment during training, improving overall policy learning without introducing latency-sensitive channels to the deployed actor.
4. 5cm is a Suitable Resolution for Gallant (Voxel Resolution Ablation):
Gallant's default 5cm resolution (Gallant (5CM)) generally achieves the best balance. The 10CM variant (coarser) significantly underperforms on Ceiling (13.3% success) and Forest (59.0% success) due to impaired geometric fidelity, leading to poor contact- and clearance-sensitive interactions. Conversely, the 2.5CM variant (finer) also has lower success rates than Gallant (5CM) on most terrains, especially those requiring sensing far above or below the robot (e.g., Ceiling, Downstair), because its reduced FoV (under a fixed VRAM budget) hampers perception of long vertical extents. The 5cm resolution strikes an effective balance between coverage and detail under resource constraints.
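The coverage/detail tradeoff behind this ablation follows directly from holding the voxel count fixed while changing the resolution. In the worked example below, the 80 x 80 x 40 voxel budget is an illustrative assumption; only the three resolutions come from the paper.

```python
# Worked example of the coverage/detail tradeoff under a fixed voxel budget.
voxels = (80, 80, 40)  # assumed fixed voxel count per axis
for res in (0.025, 0.05, 0.10):
    extent = tuple(round(n * res, 2) for n in voxels)  # meters covered per axis
    print(f"{res*100:.1f} cm -> FoV {extent[0]} m x {extent[1]} m x {extent[2]} m")
# 2.5 cm -> 2.0 m x 2.0 m x 1.0 m   (finer detail, smaller coverage)
# 5.0 cm -> 4.0 m x 4.0 m x 2.0 m
# 10.0 cm -> 8.0 m x 8.0 m x 4.0 m  (larger coverage, coarser detail)
```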
Real-world Experiments
The Gallant-trained policy is deployed zero-shot on the Unitree G1 humanoid at its onboard control frequency. OctoMap [16] processes raw LiDAR point clouds onboard for voxel grid construction.
The following figure (Figure 4 from the original paper) shows qualitative real-world results:
The figure shows qualitative results of the Gallant framework: the humanoid robot walking and performing local navigation across various 3D constrained terrains, including crossing obstacles, climbing stairs, and passing through narrow passages.
The robot consistently traverses diverse real-world terrains, including flat ground, random-height ceilings, lateral clutter (e.g., doors), high platforms with gaps, stepping stones, and staircases. It demonstrates versatile capabilities: crouching under overhead constraints, planning lateral motions, robustly stepping onto platforms, crossing gaps, placing feet carefully on stepping stones, and performing stable multi-step climbing and descent. These behaviors arise from a single policy without terrain-specific tuning, highlighting Gallant's generality and real-world transferability.
Real-world Ablation:
To evaluate sim-to-real performance, three policies are tested on the Unitree G1 over 15 trials per terrain: HeightMap, NoDR, and Gallant.
The following figure (Figure 6 from the original paper) shows the real-world success rates of the ablation studies:
The figure is a bar chart showing the number of successes out of 15 trials for each scenario (Plane, Ceiling, Door, Platform, Pile, Upstair, and Downstair) under three policies (HeightMap, NoDR, and Gallant). Gallant performs well across most scenarios, notably on Downstair, demonstrating its effectiveness and reliability in complex environments.
- Gallant vs. HeightMap:
Gallant consistently outperforms HeightMap across all real-world terrains. HeightMap fails significantly on overhead (Ceiling) and lateral (Door) obstacles due to its limited 2.5D representation. Even on ground-level terrains, HeightMap's real-world performance is hindered by noisy elevation reconstruction, which can be exacerbated by LiDAR jitter from the robot's torso pitch/roll movements. This reinforces the benefit of voxel grids, which are less susceptible to such issues.
- Gallant vs. NoDR: NoDR (trained without LiDAR domain randomization) performs reasonably on Ceiling and Door, suggesting lower sensitivity to sensing latency in these cases. However, its performance drops significantly on ground-level terrains (e.g., Platform, Pile, Stairs). Without modeling LiDAR delay and noise during training, the robot often misjudges its position relative to obstacles or reacts too late. This emphasizes the critical role of domain randomization in bridging the sim-to-real gap. Figure 9 in the supplementary material illustrates NoDR's failure modes.
Further Analyses
A clear correlation exists between Gallant's success rates in simulation and the real world (Figure 7). Terrains with higher success in simulation generally perform well on hardware, validating the use of large-scale simulated evaluation as a predictor of real-world performance.
The following figure (Figure 7 from the original paper) shows the success rates of Gallant in simulation and real-world environments:
The figure shows Gallant's success rates in simulation and in the real world across scenarios (Ceiling, Door, Platform, Pile, Upstair, and Downstair); simulated and real-world results are close to 100% in most cases.
With voxel grids, scenarios like overheading (Ceiling) and lateral (Door) constraints, which were previously difficult for height map-based methods, become among the easiest for Gallant (achieving near 100% success), demonstrating the effectiveness of the voxel grid representation for full-space perception.
Gallant's main limitation appears on the Pile terrain, where accurate foothold selection is critical. Success rates plateau around 80%. Simulation with zero LiDAR latency improves this to over 90%, indicating that real-world sensor delay is a key bottleneck for this specific task. On other terrains, particularly Platforms and Stairs (which are typically unstable due to collision risk), Gallant achieves high success by proactively adjusting foot trajectories.
6.2. Data Presentation (Tables)
All tables from the original paper (Table 1, Table 2, Table 3) and the supplementary material (Table 4, Table 5) have been transcribed and presented in their respective sections above, following the strict formatting guidelines for Markdown and HTML tables.
6.3. Ablation Studies / Parameter Analysis
The paper conducts extensive ablation studies to verify the effectiveness of Gallant's various components, as detailed in Table 3 of the original paper and discussed in Section 6.1.
- Self-scan: Crucial for dynamic environments, significantly impacting tasks like Ceiling and Forest.
- Perceptual Network: The z-grouped 2D CNN provides a superior accuracy-compute tradeoff compared to 3D CNNs or sparse variants, making it practical for real-time inference while maintaining high performance.
- Perceptual Representation: The combination of voxel grid (for the actor) and height map (as privileged information for the critic) yields the best results, demonstrating the value of comprehensive 3D data and strategic use of auxiliary information during training.
- Voxel Resolution: The 5 cm resolution strikes an optimal balance between FoV coverage and geometric fidelity, outperforming coarser (10 cm) or finer (2.5 cm) resolutions.
- Domain Randomization (NoDR baseline): Essential for sim-to-real transfer, particularly for mitigating real-world sensor latency and noise.

The ablation results consistently show that each of Gallant's design choices (realistic LiDAR simulation including self-scan, z-grouped 2D CNN, voxel grid combined with privileged height map, and optimal voxel resolution) contributes significantly to its robust performance and sim-to-real transferability.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Gallant, a comprehensive full-stack pipeline for humanoid locomotion and local navigation in complex 3D-constrained environments. Gallant makes significant advancements by leveraging voxel grids as a lightweight yet geometry-preserving perceptual representation. This representation, coupled with a high-fidelity LiDAR simulation (which includes dynamic object scanning, noise, and latency modeling) and an efficient z-grouped 2D CNN for processing, enables fully end-to-end optimization.
The simulation ablations rigorously demonstrate that Gallant's core components are indispensable for training policies with high success rates. In real-world tests, a single LiDAR policy trained with Gallant not only covers ground obstacles (a domain traditionally handled by elevation-map controllers) but also effectively tackles lateral and overhead structures, achieving near 100% success with fewer collisions on ground-only terrains. These findings collectively establish Gallant as a robust and practical solution for humanoid robots to navigate diverse and challenging 3D terrains.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation of Gallant: it does not yet achieve a 100% success rate across all scenarios.
- LiDAR Latency: The main bottleneck identified is LiDAR latency. Operating at the 10 Hz LiDAR rate, each scan incurs substantial delay due to light reflection and communication overhead. This inherent delay limits the robot's ability to act preemptively and is particularly evident in tasks requiring fine-grained foothold selection (like the Pile terrain).

Future work will focus on addressing this latency:
- Geometry-aware Teacher: Exploring the use of Gallant itself as a geometry-aware teacher to guide policies trained with lower-latency sensors.
- Lower-Latency Sensors: Investigating alternative or complementary sensors with reduced latency to enable a fully reactive policy.

The ultimate goal is to achieve near-perfect performance across all terrains by overcoming these temporal limitations.
7.3. Personal Insights & Critique
Personal Insights
This paper presents a highly practical and well-engineered solution to a fundamental problem in humanoid robotics. The integration of voxel grids for 3D perception, the clever z-grouped 2D CNN architecture, and the meticulous LiDAR simulation with domain randomization are particularly insightful.
- Efficiency of z-grouped 2D CNN: The z-grouped 2D CNN is a clever compromise. It acknowledges the sparsity of voxel data and the computational burden of 3D CNNs, offering an efficient way to extract 3D structural information using widely optimized 2D CNN operations. This balance between representation capacity and computational efficiency is critical for real-time onboard deployment on resource-constrained platforms like the NVIDIA Orin NX.
- High-Fidelity LiDAR Simulation: The explicit modeling of dynamic objects (self-scan), sensor noise, and latency in the LiDAR simulation is a standout feature. The ablation study clearly demonstrates its necessity for robust sim-to-real transfer. This level of detail in simulation design is often overlooked but proves to be paramount for real-world performance.
- Generalization of a Single Policy: The ability of a single policy to generalize across such a diverse range of 3D constrained terrains (ground, lateral, overhead) without terrain-specific tuning is a significant step towards truly autonomous and versatile humanoid robots. This highlights the power of robust 3D perception combined with comprehensive curriculum learning.
Critique
While Gallant is an excellent piece of work, some areas could be explored further or present inherent challenges:
- Addressing LiDAR Latency: The identified limitation of LiDAR latency is a critical real-world problem. While the authors suggest using Gallant as a teacher, exploring predictive models (e.g., using Recurrent Neural Networks or transformers on historical voxel grids) to anticipate future terrain changes could also be a viable avenue. Combining multiple sensor modalities (e.g., high-frequency IMUs with LiDAR) to create a more responsive internal state could also help.
Computational Cost of OctoMap: While
OctoMapis stated as a "lightweight preprocessing step," its real-time performance on a continuous stream of dualJT128 LiDARdata on anOrin NX(especially for complex environments) could still be a bottleneck. Further optimization or alternative directvoxelizationmethods might be needed, especially if theperception volumeorvoxel resolutionwere to increase. -
Generalization to Novel Terrains: While
Gallantgeneralizes well across its8 trained terrain families, it would be interesting to see its performance onentirely novel 3D environmentsthat differ significantly from the training distribution (e.g., highly deformable terrain, dense vegetation, complex moving obstacles not part of the robot itself). Thedomain randomizationhelps, but the inherent structure of thevoxel gridmight still be specific to relatively rigid environments. -
Foot Placement on Pile: The slightly lower success rate on
Pile(80%) highlights the inherent difficulty of precisefoothold planning. Future work could investigate incorporating more explicitfootstep planningorcontact-rich manipulationmodules that utilize thevoxel gridmore directly foraffordance estimation(identifying stable contact points). -
Multi-robot Coordination or Interaction: The current framework focuses on single-robot locomotion. Extending this robust 3D perception to
multi-robot scenarios or human-robot interaction in shared 3D constrained spaces would open new research avenues. Overall,
Gallant provides a strong foundation and a clear pathway for developing highly capable humanoid robots that can truly navigate and operate in the complex, unstructured 3D environments of the real world. Its rigorous methodology and clear empirical validation make it an impactful contribution to the field.