
Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains

Published: 11/19/2025

TL;DR Summary

Gallant is a voxel-grid-based framework for humanoid locomotion and navigation that utilizes voxelized LiDAR data for accurate 3D perception, achieving near 100% success in challenging terrains through end-to-end optimization.

Abstract

Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant also firstly achieves near 100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms through improved end-to-end optimization.


In-depth Reading


1. Bibliographic Information

1.1. Title

The title of the paper is "Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains". The central topic is the development of a novel framework, named Gallant, that enables humanoid robots to move and navigate effectively in complex three-dimensional environments by using a specialized perception system based on voxel grids derived from LiDAR data.

1.2. Authors

The authors are Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, and Jiangmiao Pang.

  • Qingwei Ben, Botian Xu, and Kailin Li are marked with an asterisk (*), indicating equal contribution.
  • Jiangmiao Pang is the Corresponding Author.
  • Affiliations include: Shanghai Artificial Intelligence Laboratory (1), The Chinese University of Hong Kong (2), University of Science and Technology of China (3), University of Tokyo (4), and Shanghai Jiaotong University (5).

1.3. Journal/Conference

The paper is published as a preprint on arXiv, with the identifier arXiv:2511.14625. As an arXiv preprint, it has not yet undergone formal peer review or been published in a specific journal or conference. arXiv is a widely recognized open-access repository for scientific preprints across disciplines, including computer science and robotics, enabling rapid dissemination of research findings.

1.4. Publication Year

The paper was published at UTC: 2025-11-18T16:16:31.000Z. This indicates a publication year of 2025.

1.5. Abstract

The paper addresses the challenge of robust humanoid locomotion in complex 3D environments, which necessitates accurate and globally consistent perception. Traditional perception methods, such as depth images or elevation maps, are limited by partial and flattened views, failing to capture the full 3D scene structure. To overcome this, the paper introduces Gallant, a voxel-grid-based framework designed for humanoid locomotion and local navigation in 3D constrained terrains.

Gallant utilizes voxelized LiDAR data as a lightweight and structured perceptual representation. This representation is then processed by a z-grouped 2D Convolutional Neural Network (CNN) to map the visual information to the control policy, enabling fully end-to-end optimization. A key component is a high-fidelity LiDAR simulation that dynamically generates realistic observations, supporting scalable, LiDAR-based training and ensuring sim-to-real consistency.

Experimental results demonstrate that Gallant's broader perceptual coverage allows a single policy to handle diverse 3D constraints beyond just ground-level obstacles, including lateral clutter, overhead constraints, multi-level structures, and narrow passages. The framework also achieves near 100% success rates in challenging scenarios like stair climbing and stepping onto elevated platforms, attributed to its improved end-to-end optimization.

The original source link is: https://arxiv.org/abs/2511.14625 The PDF link is: https://arxiv.org/pdf/2511.14625v1.pdf The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enabling robust humanoid locomotion and local navigation in complex, 3D constrained environments. Humanoid robots, unlike wheeled or tracked vehicles, possess highly dexterous limbs, allowing them to traverse diverse and irregular terrains. However, this capability is heavily reliant on an accurate and comprehensive understanding of the surrounding environment.

This problem is crucial in the current field because, despite significant advancements in humanoid robotics, ensuring operational safety and adaptability in unstructured real-world settings remains a major challenge. Robots need to move beyond traversing simple flat surfaces; they must handle terrain irregularities, ground-level obstacles, lateral clutter (obstacles to the side), and overhead constraints (obstacles above, like low ceilings). This requires anticipatory collision checking, clearance-aware motion generation, and intelligent planning of contact-rich maneuvers.

The specific challenges or gaps in prior research primarily stem from limitations in existing perception modules:

  • Depth Images: While offering lower latency, depth cameras typically have a narrow Field of View (FoV) and limited range, which restricts a robot's ability to reason about complex, spatially extended environments.

  • Elevation Maps: These approaches compress 3D LiDAR point clouds into 2.5D height fields. This projection effectively flattens the environment, discarding crucial vertical and multilayer structure (e.g., overhangs, low ceilings, mezzanines, stair undersides). Moreover, the reconstruction stage can introduce algorithm-specific distortions and latency, further decoupling perception from control.

  • Raw Point Clouds: While 3D LiDAR provides detailed scene geometry with a wide FoV, its raw point clouds are often sparse and noisy, making them difficult for sample-efficient policy learning and real-time inference.

    The paper's entry point or innovative idea is to use a voxel-grid-based representation of LiDAR data. This approach aims to preserve the full 3D structure of the environment, overcome the FoV and flattening limitations of previous methods, and provide a lightweight and structured input suitable for end-to-end policy learning with a z-grouped 2D CNN.

2.2. Main Contributions / Findings

The paper presents Gallant as a significant step forward in humanoid locomotion and local navigation. Its primary contributions and key findings are:

  1. Voxel Grid as a Geometry-Preserving Representation: Gallant proposes and verifies the use of a voxel grid derived from LiDAR data as a lightweight yet geometry-preserving perceptual representation. This representation captures the full 3D structure (including multi-layer information and vertical patterns) over a large Field of View (FoV), unlike depth images or elevation maps. This directly addresses the limitation of previous methods that provided only partial or locally flattened views.

  2. Efficient Voxel Grid Processing with z-grouped 2D CNN: The paper introduces and validates a z-grouped 2D Convolutional Neural Network (CNN) for processing the sparse voxel grids. This architectural choice treats height slices (the z-dimension) as channels and applies 2D convolutions over the x-y plane. This design offers a favorable trade-off between representation capacity and computational efficiency compared to heavier 3D CNNs or less suitable sparse CNNs, making it practical for real-time onboard deployment.

  3. Full-Stack Pipeline for Sim-to-Real Transfer: The research develops a comprehensive full-stack pipeline, spanning from high-fidelity LiDAR sensor simulation to policy training. This pipeline includes realistic LiDAR simulation that dynamically models sensor noise, latency, and even scans dynamic objects (like the robot's own body links). This rigorous simulation environment, combined with curriculum training across diverse 3D-constrained terrains, enables the training of a single policy that demonstrates zero-shot sim-to-real transfer and generalizes robustly across various real-world 3D-constrained environments.

  4. Enhanced Locomotion Capabilities: Gallant significantly expands the range of navigable terrains for humanoids. Its broader perceptual coverage allows the single trained policy to handle not only conventional ground-level obstacles but also complex scenarios involving lateral clutter (e.g., Forest, Door), overhead constraints (e.g., Ceiling), multi-level structures, and narrow passages. This goes beyond the limitations of previous methods often confined to simpler terrains.

  5. Achieving Near 100% Success Rates in Challenging Scenarios: The framework achieves near 100% success rates in tasks previously considered very challenging and unstable for humanoids, such as stair climbing (Upstair, Downstair) and stepping onto elevated platforms (Platform). This improvement is attributed to the enhanced end-to-end optimization facilitated by the robust 3D perception.

    These findings collectively solve the problem of limited 3D perception and poor generalization in humanoid locomotion by providing a practical, robust, and generalizable solution for navigating complex 3D environments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the Gallant paper, a beginner should be familiar with several foundational concepts in robotics, machine learning, and computer vision:

  • Humanoid Locomotion: This refers to the ability of humanoid robots (robots designed to resemble the human body) to move and balance. Unlike wheeled robots, humanoids must contend with complex dynamics, balance control, and intermittent foot contacts with the ground. Locomotion involves generating appropriate joint trajectories and forces to achieve desired movements like walking, running, or climbing.
  • Local Navigation: This is the process by which a robot plans and executes collision-free paths within its immediate environment to reach a local target or traverse an obstacle. It typically works in conjunction with a higher-level global planner. Local navigation for legged robots often involves adjusting foot placement, body posture, and step timing based on perceived terrain.
  • 3D Constrained Terrains: These are environments where movement is restricted not just by ground-level obstacles but also by vertical structures. Examples include low ceilings, narrow passages, stairs, platforms, and uneven surfaces that require specific body clearances or multi-level interaction.
  • Perception Modules: These are the sensing and data processing components that allow a robot to "see" and interpret its surroundings.
    • Depth Images: These are images where each pixel value represents the distance from the camera to the corresponding point in the scene. Depth cameras (e.g., Intel RealSense, Microsoft Kinect) are common sensors for generating these.
    • LiDAR (Light Detection and Ranging): A remote sensing method that uses pulsed laser light to measure distances. A LiDAR scanner emits laser pulses and measures the time it takes for the light to return, generating a point cloud – a collection of 3D data points representing the surface of objects in the environment. 3D LiDAR systems can provide wide Field of View (FoV) and high-fidelity 3D geometry.
    • Elevation Maps (or Height Maps): A 2.5D representation of terrain. It's a grid (typically 2D) where each cell stores the height of the terrain at that (x, y) location. The "2.5D" implies that it captures height information but cannot represent overhangs or multi-layer structures (where multiple z values exist for a single (x, y) coordinate).
    • Voxel Grid: A 3D grid made up of voxels (volumetric pixels). Similar to how a pixel is a 2D square, a voxel is a 3D cube. In the context of perception, a voxel grid can represent the occupancy of space, where each voxel can be marked as occupied (e.g., by an obstacle) or free. This representation naturally captures full 3D structure.
  • Convolutional Neural Networks (CNNs): A class of deep learning models particularly effective for processing grid-like data, such as images.
    • 2D CNN: Applies convolutional filters across two spatial dimensions (e.g., width and height in an image).
    • 3D CNN: Extends 2D CNNs by applying convolutional filters across three spatial dimensions (e.g., width, height, and depth/time), suitable for volumetric data or video.
    • Sparse CNN: Optimized for processing data where most values are zero (i.e., sparse data). Voxel grids can be very sparse if only a small fraction of voxels are occupied.
  • End-to-End Optimization/Learning: A machine learning paradigm where a single model or system learns to map raw input data directly to the desired output (e.g., sensor readings directly to motor commands), bypassing intermediate, hand-engineered processing stages. This often involves training the entire system jointly using techniques like Reinforcement Learning.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
    • Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling decision-making where the agent does not have full access to the state of the environment but instead relies on observations that are probabilistically related to the state. This is highly relevant for robots whose sensors provide incomplete information.
    • Actor-Critic: A class of RL algorithms that combines two components: an Actor (which learns a policy for taking actions) and a Critic (which learns to estimate the value function, or expected future reward, of the states or state-action pairs). The Critic guides the Actor's learning.
    • Proximal Policy Optimization (PPO): A popular Actor-Critic Reinforcement Learning algorithm known for its stability and good performance. It aims to take the largest possible improvement step on the policy without causing the policy to collapse.
  • Sim-to-Real Transfer: The process of training a robot control policy or perception model entirely in a simulated environment and then deploying it successfully on physical hardware (the "real world") without significant retraining or adaptation. This is highly desirable for safety, scalability, and cost-effectiveness.
  • Domain Randomization: A technique used in sim-to-real transfer to improve the robustness and generalization of models trained in simulation. It involves randomizing various properties of the simulation environment (e.g., textures, lighting, sensor noise, physics parameters) to force the learned policy to ignore specifics of the simulation and focus on general features, making it more adaptable to the variations found in the real world.

3.2. Previous Works

The paper contextualizes its contributions by discussing prior research in humanoid perceptive locomotion and local navigation.

Humanoid Perceptive Locomotion

Previous approaches for humanoid perceptive locomotion have primarily relied on:

  • Elevation Maps: Methods like those by Long et al. [21], Ren et al. [30], and Wang et al. [38] utilize elevation maps (also known as 2.5D height fields).
    • Concept: An elevation map discretizes the ground plane into a grid and stores the height of the highest point within each cell. This is often reconstructed from LiDAR data [10, 11].
    • Limitations (as highlighted by Gallant): While effective for reasoning about ground-level obstacles, elevation maps inherently flatten the scene, discarding information about vertical and multilayer structures such as overhangs, low ceilings, or mezzanines. They also introduce reconstruction latency and can suffer from algorithm-specific distortions.
  • Depth Cameras: Approaches leveraging depth cameras (e.g., Zhuang et al. [49], Sun et al. [34]) have shown effectiveness, particularly on quadruped robots [1, 7, 18, 22, 34, 48, 49].
    • Concept: Depth cameras provide direct depth measurements for a scene, often at high frame rates.
    • Limitations (as highlighted by Gallant): Their narrow field of view (FoV) and limited spatial continuity (especially in range) restrict 3D understanding, hindering policy generalization in diverse environments.
  • Point Clouds: Some recent works, benefiting from advances in LiDAR simulation, have explored point-cloud-based inputs [15, 39].
    • Concept: Directly using the raw 3D point cloud generated by LiDAR sensors.
    • Limitations (as highlighted by Gallant): While addressing some FoV limitations, raw point clouds are typically sparse and noisy, and their high processing cost makes real-time onboard use infeasible for control tasks.

Local Navigation

Local navigation strategies for legged robots often adopt a hierarchical design:

  • Hierarchical Design: A high-level planner provides velocity commands or target waypoints, which a low-level policy then tracks [4, 6, 12, 15, 20, 29, 40, 45, 46].
    • Limitations (as highlighted by Gallant): This decoupling limits the low-level policy's ability to exploit terrain features for more agile movements. Tracking errors and slow high-level updates can further degrade performance.
  • End-to-End Training with Obstacle Avoidance: Recent work has explored end-to-end training by incorporating obstacle-avoidance rewards into velocity tracking objectives [30].
    • Limitations (as highlighted by Gallant): This approach can create conflicting objectives, potentially hindering optimal performance.
  • Position-Based Formulation: Using target positions instead of velocity commands allows the policy to reason more directly about terrain and choose appropriate actions [14, 31, 44]. However, this approach has primarily been tested on quadrupeds [31] and remains largely untested on humanoids. Gallant adopts this position-based formulation.

3.3. Technological Evolution

The evolution of robotic locomotion perception has moved from simpler, often 2D or 2.5D representations, towards more comprehensive 3D scene understanding. Initially, tasks like navigating flat ground or simple obstacles could rely on depth images or elevation maps. However, as robots were tasked with more complex environments, the limitations of these representations became apparent. The ability to perceive overhangs, multi-level structures, and lateral clearances became critical.

LiDAR technology offered the promise of rich 3D data but presented challenges with raw point cloud processing (sparsity, noise, computational cost). This led to intermediate representations like elevation maps (a compromise for efficiency) and eventually to more structured 3D representations like voxel grids, which aggregate points to reduce noise and dimensionality while preserving 3D information.

Concurrently, advances in deep learning, particularly Convolutional Neural Networks (CNNs), provided powerful tools for processing these visual and spatial data. The challenge then became how to efficiently apply these to sparse 3D data in real-time for robot control. The development of high-fidelity simulators and domain randomization techniques has been crucial in bridging the sim-to-real gap, allowing complex Reinforcement Learning policies to be trained at scale.

This paper's work (Gallant) fits into this timeline by pushing the boundaries of 3D perception for humanoids using voxel grids and efficient z-grouped 2D CNNs, coupled with advanced LiDAR simulation for robust sim-to-real transfer.

3.4. Differentiation Analysis

Compared to the main methods in related work, Gallant offers several core differences and innovations:

  • Perceptual Representation:
    • Previous: Predominantly elevation maps (2.5D, flatten scene) or depth images (narrow FoV, limited range). Some explore raw point clouds (high processing cost).
    • Gallant: Uses a voxel grid derived from LiDAR point clouds. This is a lightweight, structured, and geometry-preserving 3D representation that explicitly captures multi-layer structure, vertical patterns, and a wide FoV, directly addressing the limitations of 2.5D and narrow FoV approaches.
  • Perceptual Processing:
    • Previous: Often relies on standard 2D CNNs for depth images or processing of elevation maps. 3D CNNs exist but are computationally heavy for sparse 3D data.
    • Gallant: Employs a novel z-grouped 2D CNN. This treats the z-axis (height) as channels for 2D convolutions over the x-y plane. This design leverages the sparsity typical in voxel grids (where most occupancy is concentrated along a few z-slices) to achieve high computational efficiency and real-time inference while still capturing vertical structure through channel mixing. This is more efficient than 3D CNNs and better suited than generic sparse CNNs for the specific egocentric voxel grid structure.
  • Scope of Locomotion:
    • Previous: Policies typically confined to ground-level obstacles due to limitations of elevation maps (e.g., [21, 38]). Depth-based methods may handle more, but still suffer from FoV/range issues.
    • Gallant: A single policy is capable of handling a much broader range of 3D constraints, including lateral clutter, overhead constraints, multi-level structures, and narrow passages, in addition to ground-level obstacles. This represents a significant increase in the generalization capability of the locomotion policy.
  • Sim-to-Real Robustness:
    • Previous: While LiDAR simulation has advanced, fully accounting for real-world complexities like dynamic objects (e.g., the robot's own body) and sensor noise and latency within Reinforcement Learning pipelines has been challenging.
    • Gallant: Develops a high-fidelity LiDAR simulation pipeline that explicitly models dynamic objects (self-scan), sensor noise, and latency. This, combined with domain randomization, curriculum training, and privileged information for the critic, ensures strong sim-to-real consistency and zero-shot transfer to diverse real-world terrains, outperforming baselines that omit these details.
  • End-to-End Optimization:
    • Previous: Often hierarchical, with perception decoupled from control, or end-to-end but with conflicting objectives (e.g., velocity tracking with obstacle avoidance).

    • Gallant: Integrates local navigation and locomotion into a single end-to-end policy using a position-based goal-reaching formulation. This allows the policy to reason directly over terrain and choose appropriate actions, leading to higher success rates in challenging contact-rich maneuvers.

      The following are the results from Table 1 of the original paper:

      Method               Perceptual Representation   FoV       Ground   Lateral   Overheading
      Long et al. [21]     Elevation Map               ~1.97π    ✓        ✗         ✗
      Wang et al. [38]     Elevation Map               ~1.97π    ✓        ✗         ✗
      Ren et al. [30]      Elevation Map               ~1.97π    ✓        ✗         ✗
      Zhuang et al. [49]   Depth Image                 ~0.43π    ✓        ✓         ✗
      Wang et al. [39]     Point Cloud                 ~1.97π    ✓        ✓         ✓
      Gallant (ours)       Voxel Grid                  ~4.00π    ✓        ✓         ✓

This table clearly highlights Gallant's superiority in Field of View (FoV) and its comprehensive ability to handle Ground, Lateral, and Overheading obstacles, which distinguishes it from prior methods primarily limited by their perceptual representation or FoV.

4. Methodology

The Gallant framework is a voxel-grid-based perceptive learning framework specifically designed for humanoid locomotion and local navigation in 3D constrained environments. It integrates several key components: a specialized LiDAR simulation pipeline, an efficient 2D CNN perception module for sparse voxel grids, and a structured curriculum training approach with diverse terrain families. These components together form a full-stack pipeline that enables training a single policy capable of robustly traversing all-space obstacles and deploying with zero-shot transfer on real hardware. The overall system architecture is depicted in Figure 2 of the original paper.

The following figure (Figure 2 from the original paper) illustrates the Gallant framework, showing the process of obtaining a Voxel Grid, comparisons between simulation and the real world, data processing, and its projection to a 2D CNN, along with subsequent policy optimization components:

Figure 2 (from the original paper): Schematic of the Gallant framework, showing how the voxel grid is obtained (with a comparison between simulation and the real world), the data processing and its projection into the 2D CNN (voxel grid of Dim = [32, 32, 40]), and the subsequent policy optimization components.

4.1. Problem Formulation

The problem of humanoid perceptive locomotion is formulated as a Partially Observable Markov Decision Process (POMDP). A POMDP is a mathematical framework for sequential decision-making where the agent's actions influence the state of the environment, and the agent receives observations that are related to the state but do not fully reveal it. The POMDP is defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, P, \mathcal{R}, \Omega, \gamma)$.

  • $\mathcal{S}$: Set of possible states of the environment.

  • $\mathcal{A}$: Set of possible actions the robot can take.

  • $\mathcal{O}$: Set of possible observations the robot receives.

  • $P$: Transition function $P(s' \mid s, a)$, the probability of transitioning to state $s'$ given the current state $s$ and action $a$.

  • $\mathcal{R}$: Reward function $\mathcal{R}(s, a)$, the immediate reward received after taking action $a$ in state $s$.

  • $\Omega$: Observation function $\Omega(o \mid s, a)$, the probability of observing $o$ after taking action $a$ and landing in state $s$.

  • $\gamma$: Discount factor, a value between 0 and 1 that discounts future rewards.

    An actor-critic policy is trained using Proximal Policy Optimization (PPO) [32], a popular Reinforcement Learning (RL) algorithm known for its stability and effectiveness. The training environment consists of 8 m × 8 m blocks. In each episode, the humanoid starts at the center of a block, and a goal $\mathbf{G}$ is sampled along the perimeter. The robot has a fixed horizon of 10 seconds to reach this goal.

The observation at time $t$, denoted $o_t$, is composed of several elements: $ o_t = \big( \underbrace{\mathbf{P}_t,\ \mathbf{T}_{\mathrm{elapse},t},\ \mathbf{T}_{\mathrm{left},t}}_{\text{Command}},\ \underbrace{a_{t-4:t-1}}_{\text{Action history}},\ \underbrace{\omega_{t-5:t},\ g_{t-5:t},\ q_{t-5:t},\ \dot{q}_{t-5:t}}_{\text{Proprioception}},\ \underbrace{\mathtt{Voxel\_Grid}_t}_{\text{Perception}},\ \underbrace{v_t,\ \mathrm{Height\_Map}_t}_{\text{Privileged}} \big) $ where:

  • $\mathbf{P}_t$: The goal position relative to the robot's base, a vector indicating the direction and distance to the target.

  • $\mathbf{T}_{\mathrm{elapse},t}$: The elapsed time in the current episode.

  • $\mathbf{T}_{\mathrm{left},t}$: The remaining time until the episode timeout $T = 10\,\mathrm{s}$, computed as $T - \mathbf{T}_{\mathrm{elapse},t}$.

  • $a_{t-4:t-1}$: A history of the actions taken by the policy in the previous 4 time steps, providing temporal context for the policy.

  • $\omega_{t-5:t}$: The root angular velocity of the robot, sampled over the past 5 time steps.

  • $g_{t-5:t}$: The gravity vector $[0, 0, -1]$ projected into the robot's base frame, sampled over the past 5 time steps. This provides information about the robot's orientation relative to gravity.

  • $q_{t-5:t}$: Joint positions (angles) of the robot, sampled over the past 5 time steps.

  • $\dot{q}_{t-5:t}$: Joint velocities (angular speeds) of the robot, sampled over the past 5 time steps.

  • $\mathtt{Voxel\_Grid}_t$: The voxelized perception input at time $t$, representing the 3D environment geometry.

  • $v_t$: The root linear velocity of the robot. This is considered privileged information.

  • $\mathrm{Height\_Map}_t$: Relative heights of scanned points with respect to the robot. This is also considered privileged information.

    The subscript range $t-a:t-b$ indicates that temporal history from time step $t-a$ to $t-b$ (inclusive) is included in the observation. In the actor-critic framework, the actor (which determines actions) and the critic (which evaluates states) share all features except the privileged inputs. Privileged inputs are additional pieces of information available to the critic during training (to aid in learning a good value function) but not to the actor during deployment, as they are difficult to obtain in the real world. A short code sketch of assembling these terms into a single observation vector follows.
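As a concrete illustration, the sketch below concatenates the observation terms listed above into a flat vector, using the per-step dimensions from Table 5 of the paper. It is a minimal NumPy sketch; the function name, argument names, and the choice to flatten the voxel grid here (rather than keep it as a separate CNN input) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_observation(goal_rel, t_elapsed, t_left,
                      action_hist, ang_vel_hist, gravity_hist,
                      q_hist, dq_hist, voxel_grid,
                      lin_vel=None, height_map=None):
    """Concatenate the observation terms described above.

    Assumed shapes (per Table 5): goal_rel (4,), action_hist (4, 29),
    ang_vel_hist (6, 3), gravity_hist (6, 3), q_hist (6, 29), dq_hist (6, 29),
    voxel_grid (40, 32, 32). `lin_vel` and `height_map` are privileged terms
    appended only for the critic.
    """
    parts = [
        goal_rel,
        np.array([t_elapsed, t_left]),
        action_hist.ravel(),
        ang_vel_hist.ravel(),
        gravity_hist.ravel(),
        q_hist.ravel(),
        dq_hist.ravel(),
        voxel_grid.ravel(),          # flattened here; the CNN branch reshapes it back
    ]
    if lin_vel is not None and height_map is not None:
        parts += [lin_vel, height_map.ravel()]   # critic-only privileged inputs
    return np.concatenate(parts).astype(np.float32)
```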

The reward function follows Ben et al. [3], but with the velocity tracking rewards replaced by a goal-reaching reward [31]: $ r_{\mathrm{reach}} = \frac{1}{1 + \Vert \mathbf{P}_t \Vert^2} \cdot \frac{\mathbb{1}(t > T - T_r)}{T_r}, \quad T_r = 2\,\mathrm{s} $ where:

  • $\Vert \mathbf{P}_t \Vert^2$: The squared Euclidean distance from the robot to the goal (recall that $\mathbf{P}_t$ is the goal position relative to the robot's base). The term $\frac{1}{1 + \Vert \mathbf{P}_t \Vert^2}$ yields a higher reward the closer the robot is to the goal.

  • $\mathbb{1}(t > T - T_r)$: An indicator function equal to 1 if the current time $t$ exceeds the episode timeout $T$ minus the reward window $T_r$, and 0 otherwise. The goal-reaching reward is therefore given only near the end of the episode, incentivizing the robot to reach the goal within the time limit.

  • $T$: The episode timeout (10 seconds).

  • $T_r$: The reward window, set to 2 seconds, so the goal-reaching reward is active only during the last 2 seconds of the episode.

    The objective of the Reinforcement Learning agent is to maximize the expected cumulative discounted reward: $ J(\pi) = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t r_t\right] $ where:

  • $J(\pi)$: The expected return (total discounted reward) of policy $\pi$.

  • $\mathbb{E}[\cdot]$: The expectation operator.

  • $H$: The episode horizon (maximum number of time steps).

  • $\gamma$: The discount factor (0.99, as detailed in the appendix), which balances immediate and future rewards.

  • $r_t$: The reward received at time step $t$.

    Episodes terminate if the robot falls, experiences a harsh collision, or reaches the 10-second timeout. The goal-reaching reward computation is sketched in code below.
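The goal-reaching reward is straightforward to compute from the relative goal position and the episode clock. Below is a minimal sketch using the paper's stated values T = 10 s and T_r = 2 s; the function name and use of NumPy are illustrative assumptions.

```python
import numpy as np

def reach_reward(p_goal_rel, t, T=10.0, T_r=2.0):
    """Goal-reaching reward: active only during the last T_r seconds.

    p_goal_rel : goal position relative to the robot base (3D vector, meters).
    t          : elapsed time in the episode (seconds).
    """
    proximity = 1.0 / (1.0 + float(np.dot(p_goal_rel, p_goal_rel)))  # 1 / (1 + ||P_t||^2)
    gate = 1.0 if t > T - T_r else 0.0                               # indicator 1(t > T - T_r)
    return proximity * gate / T_r
```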

4.2. Efficient LiDAR Simulation

Most GPU-based simulators (e.g., IsaacGym, IsaacSim) lack native support for efficient LiDAR simulation or are limited to scanning only static meshes. However, realistic simulation of dynamic environments requires accounting for all relevant geometry, including both static components (e.g., terrain, walls) and dynamic components (e.g., the robot's own moving links). To address this, Gallant implements a lightweight, efficient raycast-voxelization pipeline using NVIDIA Warp [24].

The core idea is to handle raycasting efficiently in dynamic scenes without rebuilding complex data structures every step.

  1. Precomputation: A Bounding Volume Hierarchy (BVH) is precomputed for each mesh (e.g., each link of the robot, or static obstacles) in its local (body) frame. A BVH is a tree structure on a set of geometric objects, used to efficiently test for collisions or ray intersections. Precomputing it in the local frame means it only needs to be done once per mesh, not for every position/orientation it takes.
  2. Dynamic Raycasting: During simulation, when a ray is to be cast:
    • The ray's origin p\mathbf{p} and direction d\mathbf{d} are transformed into the local frame of the target mesh. This involves applying the inverse of the mesh's transformation matrix TT to the origin and the inverse of its rotational component RR to the direction.

    • The raycasting function is then performed in the local frame of the mesh, which remains static relative to its BVH.

    • The result (intersection point) is transformed back to the world frame.

      This process is formalized by the raycasting function: $ \mathrm{raycast}(TM, \mathbf{p}, \mathbf{d}) = T\, \mathrm{raycast}(M, T^{-1}\mathbf{p}, R^{-1}\mathbf{d}) $ where:

  • $\mathrm{raycast}(TM, \mathbf{p}, \mathbf{d})$: The raycasting operation for a mesh $M$ transformed by $T$, with ray origin $\mathbf{p}$ and direction $\mathbf{d}$ in the world frame.

  • $T$: The full transformation (translation and rotation) of the mesh from its local frame to the world frame; applying it to the local-frame intersection maps the result back to the world frame.

  • $M$: The mesh in its local frame.

  • $\mathbf{p}$: The ray origin in the world frame.

  • $\mathbf{d}$: The ray direction in the world frame.

  • $R$: The rotational component of the transformation $T$.

  • $T^{-1}$: The inverse of the full transformation, used to map the ray origin from the world frame into the mesh's local frame.

  • $R^{-1}$: The inverse of the rotational component, used to map the ray direction from the world frame into the mesh's local frame.

    At each simulation step, ray-mesh intersections are computed for every mesh $M$ using its current transform $T_t$. The entire computation is parallelized using a Warp kernel (NVIDIA's high-performance Python framework for GPU simulation) of shape $(N_{\mathrm{envs}}, N_{\mathrm{meshes}}, N_{\mathrm{rays}})$, so many environments, meshes, and rays are processed concurrently.

Rays are emitted from the LiDAR origin $P_{\mathrm{LiDAR}}$ in directions defined as: $ O_{\mathrm{ray}_i} = O_{\mathrm{LiDAR}} + O_{\mathrm{ray}_i,\mathrm{offset}} $ where:

  • $O_{\mathrm{ray}_i}$: The direction of the $i$-th ray in the world frame.

  • $O_{\mathrm{LiDAR}}$: The orientation of the LiDAR sensor in the world frame.

  • $O_{\mathrm{ray}_i,\mathrm{offset}}$: The direction offset of the $i$-th ray relative to the LiDAR's orientation, which defines the LiDAR's scanning pattern.

    If $P_i$ is the hit position of the $i$-th ray, the resulting point cloud $\mathcal{P}_t$ at time $t$ is the union of all such hit points: $ \mathcal{P}_t = \bigcup_{i=1}^{N_{\mathrm{rays}}} \{ P_i \} $ where $N_{\mathrm{rays}}$ is the total number of rays emitted. This point cloud is then converted into a voxel grid. The local-frame raycast transform described above is sketched in code below.
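The frame-transform trick behind this raycast (query the precomputed local-frame BVH, then map the hit back to the world frame) can be illustrated with plain NumPy. The sketch below is not the paper's Warp kernel; `raycast_local` stands in for the per-mesh BVH query, and all names are hypothetical.

```python
import numpy as np

def raycast_world(T, R, origin_w, dir_w, raycast_local):
    """Cast a ray against a mesh whose BVH is stored in its local frame.

    T : 4x4 world-from-local transform of the mesh at the current step.
    R : 3x3 rotational part of T (assumed orthonormal).
    raycast_local(p, d) -> local-frame hit point or None (precomputed BVH query).

    The ray is moved into the mesh frame, intersected there, and the hit is
    mapped back to the world frame, so the BVH never has to be rebuilt.
    """
    T_inv = np.linalg.inv(T)
    p_local = (T_inv @ np.append(origin_w, 1.0))[:3]   # T^{-1} p
    d_local = R.T @ dir_w                              # R^{-1} d (R orthonormal)
    hit_local = raycast_local(p_local, d_local)
    if hit_local is None:
        return None                                    # ray missed this mesh
    return (T @ np.append(hit_local, 1.0))[:3]         # back to the world frame
```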

To align the simulation with real-world sensing and improve sim-to-real transferability, domain randomization is applied:

  • (a) LiDAR pose: The LiDAR's pose (position and orientation) is perturbed at the beginning of each episode.
    • Position: $P_{\mathrm{LiDAR}}^{\mathrm{rand}} = P_{\mathrm{LiDAR}} + \mathcal{N}(0, 1)$ (cm), i.e., Gaussian noise with a standard deviation of 1 cm is added to the original LiDAR position.
    • Orientation: $O_{\mathrm{ray}_i}^{\mathrm{rand}} = O_{\mathrm{LiDAR}} + \mathcal{N}(0, (\frac{\pi}{180})^2) + O_{\mathrm{ray}_i,\mathrm{offset}}$ (rad), i.e., the LiDAR's overall orientation is perturbed by Gaussian noise with a standard deviation of $\frac{\pi}{180}$ rad (1 degree).
  • (b) Hit position: The computed hit position of each ray is perturbed: $P_i^{\mathrm{rand}} = P_i + \mathcal{N}(0, 1)$ (cm), i.e., Gaussian noise with a standard deviation of 1 cm is added to each hit point to simulate sensor noise.
  • (c) Latency: Sensor latency is simulated at 10 Hz with a delay of 100-200 ms, mimicking the lag between when a physical sensor captures data and when it becomes available to the robot's control system.
  • (d) Missing grid: 2% of voxels are randomly masked (set to 0) to model real-world dropout or occlusion effects, where parts of the environment are not scanned.
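A minimal sketch of randomizations (a), (b), and (d) is given below, assuming metric units and NumPy; the 1 cm noise scales and 2% dropout follow the values above, while latency (c) would be handled by delaying when the voxel grid is delivered to the policy and is only noted in a comment.

```python
import numpy as np

rng = np.random.default_rng()

def randomize_lidar(p_lidar, hit_points, voxel_grid, drop_prob=0.02):
    """Apply the LiDAR randomizations described above (lengths in meters).

    - perturb the sensor position with ~1 cm Gaussian noise (per episode),
    - perturb every hit point with ~1 cm Gaussian noise,
    - randomly zero out a small fraction of voxels to mimic dropout.
    Latency (100-200 ms at 10 Hz) would be modeled by delaying when the
    resulting grid is handed to the policy; it is not shown here.
    """
    p_rand = p_lidar + rng.normal(0.0, 0.01, size=3)
    hits_rand = hit_points + rng.normal(0.0, 0.01, size=hit_points.shape)
    keep_mask = rng.random(voxel_grid.shape) > drop_prob
    grid_rand = voxel_grid * keep_mask
    return p_rand, hits_rand, grid_rand
```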

4.3. Voxel Representation and 2D CNN Perception

The LiDAR point clouds are converted into a fixed-size, robot-centric voxel grid.

  • Sensor Setup: Two torso-mounted LiDARs (Hesai JT128) are used, one on the front chest and one on the back. Their returns are transformed into a unified torso frame (a coordinate system fixed to the robot's torso).

  • Perception Volume: The perception volume is a cuboid region around the robot: $\Omega = [-0.8, 0.8]\,\mathrm{m} \times [-0.8, 0.8]\,\mathrm{m} \times [-1.0, 1.0]\,\mathrm{m}$. This defines the spatial extent that the robot "sees."

  • Discretization: This volume is discretized (divided into small cubes) at a resolution of $\Delta = 0.05\,\mathrm{m}$.

  • Voxel Grid Dimensions: This yields a voxel grid of 32 × 32 × 40 along the x, y, and z axes, respectively.

  • Occupancy: Each voxel is set to 1 if at least one LiDAR point falls within its volume (occupied) and 0 otherwise (free space). This produces a binary occupancy tensor $X \in \{0, 1\}^{C \times H \times W}$, where $C = 40$ (number of height slices) and $H = W = 32$ (spatial resolution in the x-y plane).

    The voxel grid is typically highly sparse and locally concentrated due to the nature of LiDAR and terrains. Most (x, y) columns might only contain one or two occupied z-slices, and large contiguous regions can be empty. This sparsity makes computationally expensive 3D convolutions over the full volume inefficient.
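For illustration, the following sketch converts a torso-frame point cloud into the 32 × 32 × 40 binary occupancy grid described above. It assumes NumPy and an (N, 3) array of points; the function name and bounds-handling details are illustrative.

```python
import numpy as np

def voxelize(points, lo=(-0.8, -0.8, -1.0), hi=(0.8, 0.8, 1.0), res=0.05):
    """Convert a torso-frame point cloud into a binary occupancy grid.

    Points outside the perception volume are dropped; every voxel containing
    at least one point is marked occupied. With the bounds above this yields
    a 32 x 32 x 40 grid along x, y, z.
    """
    lo, hi = np.asarray(lo), np.asarray(hi)
    dims = np.round((hi - lo) / res).astype(int)            # [32, 32, 40]
    grid = np.zeros(dims, dtype=np.uint8)
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = ((points[inside] - lo) / res).astype(int)         # per-point voxel indices
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid
```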

To leverage this structure, Gallant introduces a z-grouped 2D Convolutional Neural Network (CNN):

  • Z-as-Channel: Instead of treating the z-dimension as a spatial dimension for 3D convolution, it is treated as the channel dimension. The 40 z-slices thus become 40 input channels for a 2D CNN.

  • 2D Convolution: 2D convolutions are then applied over the x-y plane. This design exploits spatial context within each x-y slice and uses channel mixing (inherent in 2D CNN operations with multiple input channels) to capture vertical structure across the z-slices.

    The 2D convolution operation is formally expressed as: $ Y_{o,v,u} = \sigma\left( \sum_{c=0}^{C-1} \sum_{\Delta v, \Delta u} \mathbf{W}_{o,c,\Delta v,\Delta u} \cdot X_{c,\, v+\Delta v,\, u+\Delta u} + b_o \right) $ where:

  • $Y_{o,v,u}$: The output value at output channel $o$ and spatial location (v, u).

  • $\sigma$: A non-linearity (e.g., ReLU, or Mish, which is used in Gallant's policy network as described in the appendix).

  • $C$: The number of input channels (40, corresponding to the z-slices).

  • $o$: Index over output channels.

  • v, u: Spatial coordinates in the output feature map.

  • $\Delta v, \Delta u$: Offsets of the convolution kernel (weights) across the spatial dimensions.

  • $\mathbf{W}_{o,c,\Delta v,\Delta u}$: The weight (filter coefficient) for output channel $o$, input channel $c$, and kernel offset $(\Delta v, \Delta u)$.

  • $X_{c,\, v+\Delta v,\, u+\Delta u}$: The input value from input channel $c$ at spatial location $(v+\Delta v, u+\Delta u)$.

  • $b_o$: The bias term for output channel $o$.

    Compared to a 3D convolution kernel of size $k^3$ (e.g., $3 \times 3 \times 3$), this z-grouped 2D CNN design reduces computational and memory cost by roughly a factor of $k$ (e.g., a factor of 3 for a $3 \times 3 \times 3$ kernel versus a $3 \times 3$ 2D kernel across 40 channels). It still captures the vertical patterns critical for locomotion, makes efficient use of sparse, localized occupancy, and supports efficient parallel training and real-time inference. A compact PyTorch sketch of this z-grouped design follows.
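In the sketch below, the 40 z-slices enter a standard nn.Conv2d as input channels, so vertical structure is mixed through the channel dimension while convolutions run only over the x-y plane. The specific layer widths and strides and the 64-dimensional output are assumptions for illustration (the paper specifies a three-layer 2D CNN producing a 64-dimensional voxel feature, not these exact hyperparameters).

```python
import torch
import torch.nn as nn

class ZGrouped2DCNN(nn.Module):
    """Treat the 40 z-slices of the voxel grid as input channels and convolve
    only over x-y; channel mixing captures vertical structure."""

    def __init__(self, z_slices=40, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(z_slices, 32, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.Mish(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),        # 16x16 -> 8x8
            nn.Mish(),
            nn.Conv2d(32, 16, kernel_size=3, stride=2, padding=1),        # 8x8 -> 4x4
            nn.Mish(),
        )
        self.head = nn.Linear(16 * 4 * 4, feat_dim)

    def forward(self, voxel_grid):          # voxel_grid: (B, 40, 32, 32), z as channels
        h = self.conv(voxel_grid.float())
        return self.head(h.flatten(1))      # (B, feat_dim)

# Example: encode a batch of 8 binary voxel grids into 64-dimensional features.
features = ZGrouped2DCNN()(torch.randint(0, 2, (8, 40, 32, 32)).float())
```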

4.4. Terrain Design

To train robust policies, Gallant uses 8 representative terrain types in simulation. Each terrain type is designed to challenge specific aspects of humanoid locomotion and perception:

  1. Plane: A flat, easy terrain for learning basic walking and initial stabilization.

  2. Ceiling: Features randomized height and density of overhead structures, requiring the robot to reason about overhead constraints and crouching behaviors.

  3. Forest: Composed of randomly spaced cylindrical pillars (trees), representing sparse lateral clutter that demands weaving and precise lateral navigation.

  4. Door: Presents narrow gaps (doorways) that require precise lateral clearance and fine motor control.

  5. Platform: Consists of high, ring-shaped structures with variable spacing and height, necessitating the recognition of stepable surfaces and inter-platform traversal.

  6. Pile: Introduces fine-grained support reasoning for safe foot placement on uneven, gapped surfaces.

  7. Upstair: Requires continuous adaptation to vertical elevation for climbing stairs.

  8. Downstair: Requires similar adaptation for descending stairs.

    The following figure (Figure 3 from the original paper) shows the terrain types used to train robots in simulation:

    Figure 3 (from the original paper): Terrain types used to train robots in simulation, including Ceiling, Door, Pile, Downstairs, Plane, Forest, and Platform, illustrating the environments in which the robot learns to navigate.

The paper adopts a curriculum-based training strategy in which terrain difficulty progressively increases. Each terrain type $\tau$ is parameterized by a scalar difficulty $s \in [0, 1]$. The terrain generation parameters are linearly interpolated: $ \mathbf{p}_\tau(s) = (1 - s)\, \mathbf{p}_\tau^{\mathrm{min}} + s\, \mathbf{p}_\tau^{\mathrm{max}} $ where:

  • $\mathbf{p}_\tau(s)$: The vector of parameters defining terrain type $\tau$ at difficulty level $s$.

  • $s$: The difficulty scalar, ranging from 0 (easiest) to 1 (hardest).

  • $\mathbf{p}_\tau^{\mathrm{min}}$: The parameter vector for the easiest setting of terrain type $\tau$.

  • $\mathbf{p}_\tau^{\mathrm{max}}$: The parameter vector for the hardest setting of terrain type $\tau$.

    This formula allows a smooth progression of difficulty. For example, for the Ceiling terrain, as $s$ increases the ceiling height decreases and the number of ceilings increases, making it harder. For the Platform terrain, as $s$ increases the platform height and the gap width between platforms both increase. The following are the parameters for generating curriculum training terrains from Table 2 of the original paper:

Terrain Type τ   Term                                     Min (s=0)   Max (s=1)
Ceiling          Ceiling height (m) ↓                     1.30        1.00
Ceiling          Number of ceilings (-) ↑                 10          40
Forest           Minimum distance between trees (m) ↓     2.0         1.0
Forest           Number of trees (-) ↑                    3           32
Door             Distance between two walls (m) ↓         2.00        1.00
Door             Width of the doors (m) ↓                 1.60        0.80
Platform         Height of the platforms (m) ↑            0.05        0.35
Platform         Gap width between two platforms (m) ↑    0.20        0.50
Pile             Distance between two cylinders (m) ↑     0.35        0.45
Upstair          Height of each step (m) ↑                0.00        0.20
Upstair          Width of each step (m) ↓                 0.50        0.30
Downstair        Height of each step (m) ↑                0.00        0.20
Downstair        Width of each step (m) ↓                 0.50        0.30

In each episode, a 10-second goal-reaching task is assigned. If the robot succeeds, it is promoted to harder settings (higher ss); if it fails, it is demoted to easier settings (lower ss). For the Pile terrain, a flat surface is overlaid during early training (low ss) to allow the robot to learn basic foothold placement. As ss increases, this plane is removed, and the robot trains on the fully gapped terrain.
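The curriculum mechanics reduce to a linear interpolation of terrain parameters plus a promote/demote rule, as in the sketch below (NumPy assumed). The difficulty step size is an illustrative assumption; the paper only states that success promotes and failure demotes.

```python
import numpy as np

def terrain_params(p_min, p_max, s):
    """Linear curriculum interpolation: p_tau(s) = (1 - s) * p_min + s * p_max."""
    s = float(np.clip(s, 0.0, 1.0))
    return (1.0 - s) * np.asarray(p_min) + s * np.asarray(p_max)

def update_difficulty(s, succeeded, step=0.05):
    """Promote on success, demote on failure (the step size is illustrative)."""
    return float(np.clip(s + step if succeeded else s - step, 0.0, 1.0))

# Ceiling terrain from Table 2: height 1.30 -> 1.00 m, number of ceilings 10 -> 40.
print(terrain_params([1.30, 10], [1.00, 40], s=0.5))   # -> [ 1.15 25. ]
```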

4.5. Training Details (from Supplementary Material)

4.5.1. Hyperparameters

The training framework is based on [41]. The following are the key PPO hyperparameters and their values, from Table 4 of the original paper:

Hyperparameter Value
Environment number 1024 × 8
Steps per iteration 4
PPO epochs 4
Minibatches 8
Clip range 0.2
Entropy coefficient 0.003
GAE factor λ 0.95
Discount factor γ 0.99
Learning rate 5e-4
  • Environment number: The total number of parallel simulation environments collecting experience simultaneously, 1024 × 8 = 8192.
  • Steps per iteration: The number of environment steps collected before each policy update.
  • PPO epochs: The number of passes over the collected data during a policy update.
  • Minibatches: The number of minibatches the collected data is divided into for training.
  • Clip range: A PPO parameter that limits the ratio of the new policy's probability to the old policy's probability, preventing overly large policy updates. A value of 0.2 clips the ratio to [1 - 0.2, 1 + 0.2].
  • Entropy coefficient: The weight of the entropy bonus in the PPO loss, which encourages exploration.
  • GAE factor λ (Generalized Advantage Estimation lambda): A parameter for computing advantage estimates, balancing bias and variance in Reinforcement Learning.
  • Discount factor γ: Weights future rewards relative to immediate rewards; 0.99 means future rewards are slightly discounted.
  • Learning rate: The step size for updating the neural network weights during optimization.
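These settings can be collected into a plain configuration dictionary, as sketched below; the key names are illustrative rather than taken from the paper's codebase.

```python
# Illustrative PPO configuration mirroring Table 4 (key names are assumptions).
ppo_config = {
    "num_envs": 1024 * 8,          # 8192 parallel environments
    "steps_per_iteration": 4,
    "ppo_epochs": 4,
    "minibatches": 8,
    "clip_range": 0.2,
    "entropy_coef": 0.003,
    "gae_lambda": 0.95,
    "gamma": 0.99,
    "learning_rate": 5e-4,
}
```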

4.5.2. Policy Network Structure

The Actor and Critic networks share the same architecture but have independent parameters. The network processes two types of input: non-voxel information (e.g., proprioceptive input) and voxel grid input.

  1. Non-Voxel Information Processing: A two-layer Multi-Layer Perceptron (MLP) with hidden dimension 256 encodes the non-voxel information.

    • First layer: $ h_{\mathrm{mlp}}^{(1)} = \mathrm{Mish}\big(\mathrm{LN}(W_{\mathrm{mlp},1} x_{\mathrm{mlp}} + b_{\mathrm{mlp},1})\big) $
    • Second layer: $ h_{\mathrm{mlp}} = W_{\mathrm{mlp},2} h_{\mathrm{mlp}}^{(1)} + b_{\mathrm{mlp},2}, \quad \dim(h_{\mathrm{mlp}}) = 256 $ where:
    • $x_{\mathrm{mlp}}$: The non-voxel input (e.g., commands, action history, proprioception).
    • $W_{\mathrm{mlp},1}$, $b_{\mathrm{mlp},1}$: Weights and bias of the first MLP layer.
    • $\mathrm{LN}(\cdot)$: Layer Normalization, which normalizes the activations of a layer.
    • $\mathrm{Mish}(\cdot)$: The Mish activation function, defined as $x \cdot \tanh(\mathrm{softplus}(x))$.
    • $h_{\mathrm{mlp}}^{(1)}$: Output of the first MLP layer after normalization and activation.
    • $W_{\mathrm{mlp},2}$, $b_{\mathrm{mlp},2}$: Weights and bias of the second MLP layer.
    • $h_{\mathrm{mlp}}$: The encoded non-voxel feature vector, of dimension 256.
  2. Voxel Grid Processing: In parallel, a three-layer 2D CNN processes the voxel grid input. As described previously, the z-dimension of the voxel grid is treated as channels.

    • First layer (after convolution and pooling, flattened): $ h_{\mathrm{cnn}}^{(1)} = \mathrm{Mish}\big(\mathrm{LN}(W_{\mathrm{cnn},1} h_{\mathrm{cnn}}^{\mathrm{flat}} + b_{\mathrm{cnn},1})\big) $
    • Second layer: $ h_{\mathrm{cnn}} = W_{\mathrm{cnn},2} h_{\mathrm{cnn}}^{(1)} + b_{\mathrm{cnn},2}, \quad \dim(h_{\mathrm{cnn}}) = 64 $ where:
    • $h_{\mathrm{cnn}}^{\mathrm{flat}}$: The flattened output of the 2D CNN layers after processing the voxel grid.
    • $W_{\mathrm{cnn},1}$, $b_{\mathrm{cnn},1}$: Weights and bias of the first MLP-like layer applied to the flattened CNN features.
    • $h_{\mathrm{cnn}}^{(1)}$: Output of the first CNN feature-processing layer.
    • $W_{\mathrm{cnn},2}$, $b_{\mathrm{cnn},2}$: Weights and bias of the second MLP-like layer.
    • $h_{\mathrm{cnn}}$: The encoded voxel feature vector, of dimension 64.
  3. Feature Concatenation and Final MLP: The two feature vectors $h_{\mathrm{mlp}}$ and $h_{\mathrm{cnn}}$ are concatenated into a combined feature vector $f$.

    • Concatenation: $ f = [h_{\mathrm{mlp}},\ h_{\mathrm{cnn}}] $ (concatenation along the feature dimension).
    • The combined feature $f$ is then passed through another MLP to produce a 256-dimensional latent representation.
      • First layer: $ h_{\mathrm{out}}^{(1)} = \mathrm{Mish}(f) $ (a linear layer followed by Mish, potentially with Layer Normalization as well; simplified here for presentation).

      • Second layer: $ h_{\mathrm{out}} = \mathrm{Mish}\big(W_{\mathrm{out}} h_{\mathrm{out}}^{(1)} + b_{\mathrm{out}}\big), \quad \dim(h_{\mathrm{out}}) = 256 $ where:

      • $h_{\mathrm{out}}^{(1)}$: Intermediate output after the first Mish activation.

      • $W_{\mathrm{out}}$, $b_{\mathrm{out}}$: Weights and bias of the final MLP layer.

      • $h_{\mathrm{out}}$: The 256-dimensional latent representation.

        Finally, this latent vector $h_{\mathrm{out}}$ is fed into a final MLP to produce the outputs:

  • The Actor outputs an action vector of dimension 29 (for the 29-DoF Unitree G1 humanoid).

  • The Critic outputs a scalar value estimate (the predicted value of the current state).

    All layers use the Mish activation function. A simplified sketch of this two-branch architecture follows.
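The sketch below mirrors this two-branch structure in PyTorch: an MLP branch for non-voxel inputs, a voxel encoder (such as the z-grouped CNN sketched earlier) producing a 64-dimensional feature, concatenation, and a final MLP head outputting 29 joint actions. Layer sizes beyond those stated above, and all class and argument names, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GallantActorSketch(nn.Module):
    """Two-branch policy sketch: MLP for non-voxel inputs, CNN branch for the
    voxel grid, concatenation, and a final MLP producing 29 joint actions.
    `voxel_encoder` is assumed to output a 64-d feature per sample."""

    def __init__(self, non_voxel_dim, voxel_encoder, voxel_feat_dim=64, action_dim=29):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(non_voxel_dim, 256), nn.LayerNorm(256), nn.Mish(),
            nn.Linear(256, 256),
        )
        self.voxel_encoder = voxel_encoder                    # e.g. ZGrouped2DCNN()
        self.head = nn.Sequential(
            nn.Linear(256 + voxel_feat_dim, 256), nn.Mish(),
            nn.Linear(256, 256), nn.Mish(),
            nn.Linear(256, action_dim),                       # 29-DoF action vector
        )

    def forward(self, x_non_voxel, voxel_grid):
        f = torch.cat([self.mlp(x_non_voxel), self.voxel_encoder(voxel_grid)], dim=-1)
        return self.head(f)
```

A critic would reuse the same architecture with an output dimension of 1 and additionally receive the privileged inputs in its non-voxel branch.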

4.5.3. Observation Terms and Dimensions

The composition of the observation has been detailed in Section 4.1. The dimensionality of each component for a single time step $t$ is summarized below. Before being fed into the policy, observations are normalized by a trainable vecnorm module, which is applied in both training and deployment; a minimal sketch of such a normalizer is given after the table below.

The following are the observation terms and their dimensions from Table 5 of the original paper:

Observation Term   Dimension
P_t                4
T_elapse,t         1
T_left,t           1
a_t                29
ω_t                3
g_t                3
q_t                29
q̇_t                29
Voxel_Grid_t       [32 × 32 × 40]
v_t                3
Height_Map_t       1089
  • P_t: Goal position relative to the robot (4 dimensions: x, y, z, and an auxiliary term such as yaw).
  • T_elapse,t: Elapsed time (1 dimension).
  • T_left,t: Remaining time (1 dimension).
  • a_t: Actions (29 dimensions, corresponding to joint commands).
  • ω_t: Root angular velocity (3 dimensions: roll, pitch, and yaw rates).
  • g_t: Gravity vector in the robot frame (3 dimensions).
  • q_t: Joint positions (29 dimensions).
  • q̇_t: Joint velocities (29 dimensions).
  • Voxel_Grid_t: The 3D voxel occupancy grid, with dimensions 32 × 32 × 40.
  • v_t: Root linear velocity (3 dimensions: x, y, and z velocities).
  • Height_Map_t: The flattened height map, with 1089 dimensions. Before flattening it is a 33 × 33 tensor covering a [-0.8, 0.8] m × [-0.8, 0.8] m area around the robot at 0.05 m resolution, capturing the local terrain height centered at the robot's base.
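The vecnorm module mentioned above is a running-statistics input normalizer. The following is a minimal sketch of such a normalizer (running mean and variance combined batch-wise); it is not the paper's implementation, and the class name and update details are assumptions.

```python
import numpy as np

class VecNorm:
    """Running-mean/std input normalizer: statistics are updated during training
    and the frozen statistics are reused at deployment."""

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps

    def update(self, batch):                               # batch: (N, dim)
        b_mean, b_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total          # combined mean
        self.var = (self.var * self.count + b_var * n      # combined variance
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)  # normalized observation
```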

4.5.4. Reward Function Details

Beyond the primary goal-reaching reward (rreachr_{\mathrm{reach}}), Gallant incorporates auxiliary shaping terms to improve sample efficiency during early training, inspired by Rudin et al. [31]. These geometry-aware and general-purpose rewards are computed consistently across all terrains without task-specific tuning.

  1. Directional velocity reward ($r_{\mathrm{velocity\_direction}}$): This reward encourages the robot to move in a direction that aligns with the goal while simultaneously considering obstacle avoidance. $ r_{\mathrm{velocity\_direction}} = \frac{\mathbf{a}(\mathbf{p}, \mathbf{g}) \cdot \mathbf{v}_t}{\left| \mathbf{a}(\mathbf{p}, \mathbf{g}) \cdot \mathbf{v}_t \right|_2} $ where:

    • $\mathbf{v}_t$: The robot's instantaneous linear velocity.

    • $\mathbf{a}(\mathbf{p}, \mathbf{g})$: A direction vector that combines goal alignment and obstacle avoidance; it is critical for guiding the robot's movement.

      The direction vector $\mathbf{a}(\mathbf{p}, \mathbf{g})$ is computed as: $ \mathbf{a}(\mathbf{p}, \mathbf{g}) = \sum_{j \in \mathcal{N}(\mathbf{p}, r)} w_j \mathbf{u}_{r,j} + \kappa \sum_{j \in \mathcal{N}(\mathbf{p}, r)} w_j \gamma_j \mathbf{t}_j $ where:

    • $\mathbf{p}$: The robot's current position.

    • $\mathbf{g}$: The goal direction vector.

    • $\mathcal{N}(\mathbf{p}, r)$: The set of obstacle points within a radius $r = 1\,\mathrm{m}$ of the robot's position $\mathbf{p}$.

    • $w_j$: A distance-based weight for obstacle $j$ that assigns higher weight to closer obstacles: $ w_j = \frac{\left[ \max\left(1 - \frac{\max(d_j - 0.2,\ 0.02)}{0.8},\ 0\right) \right]^2}{\max(d_j - 0.2,\ 0.02)} $ where $d_j$ is the distance to obstacle $j$. This gives a strong repulsion weight to obstacles within 1 m, with an inner threshold of 0.2 m (and a minimum effective distance of 0.02 m to avoid division by zero).

    • $\mathbf{u}_{r,j}$: The repulsion unit vector from obstacle $j$ toward the robot, encouraging the robot to move away from obstacles.

    • $\mathbf{t}_j$: A tangential unit vector (left or right) around obstacle $j$, encouraging the robot to circumnavigate obstacles.

    • $\kappa$: A weighting coefficient for the tangential term, set to 0.8.

    • $\gamma_j = \max(\mathbf{g}^\top \mathbf{d}_j, 0)$: A factor that filters out obstacles behind the goal direction, where $\mathbf{d}_j$ is the vector from the robot to obstacle $j$. This prevents the robot from unnecessarily avoiding obstacles it has already passed or that are not on its path to the goal. The direction computation is applied only to relevant structures (cylinders in Forest, walls in Door) and is efficiently parallelized via Warp.

  2. Head height reward ($r_{\mathrm{head.height}}$): This reward encourages the robot to adjust its body height proactively, which is particularly useful under overhead constraints. $ r_{\mathrm{head.height}} = \exp\big(-4 (H_{\mathrm{head.est}} - H_{\mathrm{head}})^2\big) $ where:

    • $H_{\mathrm{head.est}}$: The estimated target head height, computed by conceptually shifting the robot $0.45\,\mathrm{m}$ forward along the goal direction, averaging the terrain height within a $0.5 \times 0.5\,\mathrm{m}$ square at that location, and then subtracting a $0.1\,\mathrm{m}$ clearance offset.
    • $H_{\mathrm{head}}$: The robot's actual head height. This Gaussian-shaped reward is maximized when the actual head height matches the estimated target, encouraging the robot to lower its head (crouch) to pass under obstacles such as ceilings.
  3. Foot clearance reward ($r_{\mathrm{feet.clearance}}$): This reward encourages the robot to proactively lift its feet to clear obstacles, unlike prior work that considers only the terrain directly under the foot. $ r_{\mathrm{feet.clearance}} = \exp\big(-4 (H_{\mathrm{feet.est}} - H_{\mathrm{feet}})^2\big) $ where:

    • $H_{\mathrm{feet.est}}$: The estimated target foot height, obtained by querying the terrain $0.5\,\mathrm{m}$ ahead of each foot and averaging the height over a square region.
    • $H_{\mathrm{feet}}$: The robot's actual foot height. Like the head height reward, this Gaussian-shaped term promotes proactive leg lifting over steps or platforms, ensuring that the robot prepares its foot trajectory in advance of obstacles.
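
The following is a minimal NumPy sketch of these shaping terms, using only the quantities defined above. The obstacle list, 2D geometry, and helper structure are illustrative assumptions; the authors' implementation computes the direction term only for cylinders (Forest) and walls (Door) and parallelizes it via Warp.

```python
import numpy as np

def obstacle_weight(d_j, inner=0.2, outer=0.8, eps=0.02):
    """Distance-based weight w_j: strong repulsion close to an obstacle, zero beyond ~1 m."""
    d_eff = max(d_j - inner, eps)
    return max(1.0 - d_eff / outer, 0.0) ** 2 / d_eff

def direction_vector(p, g, obstacles, radius=1.0, kappa=0.8):
    """a(p, g): weighted repulsion away from nearby obstacles plus a tangential term
    that encourages circumnavigating obstacles lying toward the goal direction g."""
    a = np.zeros(2)
    for obs in obstacles:                        # obstacle positions in the x-y plane
        d_vec = obs - p                          # vector from robot to obstacle j
        d_j = float(np.linalg.norm(d_vec))
        if d_j > radius:                         # only obstacles within N(p, r)
            continue
        w_j = obstacle_weight(d_j)
        u_rj = -d_vec / (d_j + 1e-8)             # repulsion unit vector: obstacle -> robot
        t_j = np.array([-u_rj[1], u_rj[0]])      # tangential unit vector (left; right is its negation)
        gamma_j = max(float(g @ d_vec), 0.0)     # discount obstacles behind the goal direction
        a += w_j * u_rj + kappa * w_j * gamma_j * t_j
    return a

def velocity_direction_reward(p, g, v_t, obstacles):
    """r_velocity_direction, mirroring the normalized dot-product expression above."""
    dot = float(direction_vector(p, g, obstacles) @ v_t)
    return dot / (abs(dot) + 1e-8)

def gaussian_height_reward(h_est, h_actual):
    """Shared Gaussian form of the head-height and foot-clearance rewards."""
    return float(np.exp(-4.0 * (h_est - h_actual) ** 2))
```

For the head-height term, $H_{\mathrm{head.est}}$ would be obtained by averaging the terrain height in the $0.5 \times 0.5\,\mathrm{m}$ window $0.45\,\mathrm{m}$ ahead and subtracting the $0.1\,\mathrm{m}$ clearance, then passed to `gaussian_height_reward` together with the measured head height.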

4.5.5. Domain Randomization (beyond LiDAR-specific)

In addition to the LiDAR-specific domain randomization (pose, hit position, latency, missing grid), several general randomization strategies are applied during training to enhance policy robustness and sim-to-real transferability (a minimal sampling sketch follows the list):

  • Mass randomization: The masses of the robot's pelvis and torso links are randomized.
    • $m_{\mathrm{rand}} = m \times \mathbf{U}(0.8, 1.2)$
    • $m_{\mathrm{rand}}$: Randomized mass.
    • $m$: Original mass.
    • $\mathbf{U}(0.8, 1.2)$: A random value sampled from a uniform distribution between 0.8 and 1.2. This simulates variations in robot construction or payload.
  • Foot-ground contact randomization: Parameters related to contact physics are randomized.
    • Ground friction coefficient is fixed at 1.0.
    • Foot joint friction is sampled from $\mathbf{U}(0.5, 2.0)$.
    • Restitution coefficient (bounciness) is sampled from $\mathbf{U}(0.05, 0.4)$. These randomizations help the policy become robust to variations in surface properties and joint mechanics.
  • Control parameter randomization: The joint stiffness ($K_p$) and damping ($K_d$) parameters for the robot's joints are randomized.
    • $K_{p,\mathrm{rand}} = K_p \times \mathbf{U}(0.8, 1.2)$
    • $K_{d,\mathrm{rand}} = K_d \times \mathbf{U}(0.8, 1.2)$
    • $K_{p,\mathrm{rand}}$, $K_{d,\mathrm{rand}}$: Randomized stiffness and damping.
    • $K_p$, $K_d$: Original stiffness and damping (following the settings in Liao et al. [19]). This makes the policy less sensitive to the exact control gains used in deployment.
  • Torso center-of-mass offset: The center of mass (CoM) position of the torso is perturbed.
    • An offset sampled from $\mathbf{U}(-0.05, 0.05)$ is applied along each axis (x, y, z). This simulates slight manufacturing variations or changes in robot configuration.
  • Init joint position offset: A random offset is added to the robot's default joint positions and default joint velocities ($0\,\mathrm{rad/s}$) during environment reset.
    • Offset sampled from $\mathbf{U}(-0.1, 0.1)$. This randomizes the robot's initial state, encouraging a more robust starting posture.
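
As a rough illustration of how these ranges could be sampled at environment reset, the sketch below draws one randomized configuration with NumPy; the dictionary layout and the hooks that would apply these values in the simulator are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
U = lambda lo, hi, size=None: rng.uniform(lo, hi, size)

def sample_randomization(nominal):
    """Draw one randomized configuration from the ranges listed above.
    `nominal` holds nominal masses, PD gains, and default joint positions (assumed layout)."""
    return {
        # Pelvis/torso link masses scaled by U(0.8, 1.2)
        "link_mass": {name: m * U(0.8, 1.2) for name, m in nominal["link_mass"].items()},
        # Contact physics (ground friction itself stays fixed at 1.0)
        "foot_joint_friction": U(0.5, 2.0),
        "restitution": U(0.05, 0.4),
        # PD gains scaled by U(0.8, 1.2), per joint
        "kp": nominal["kp"] * U(0.8, 1.2, size=nominal["kp"].shape),
        "kd": nominal["kd"] * U(0.8, 1.2, size=nominal["kd"].shape),
        # Torso CoM offset along each axis
        "torso_com_offset": U(-0.05, 0.05, size=3),
        # Initial joint positions perturbed around the defaults (velocities reset to 0 rad/s)
        "init_joint_pos": nominal["default_joint_pos"] + U(-0.1, 0.1, size=nominal["default_joint_pos"].shape),
    }
```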

4.5.6. Termination Conditions

Several termination conditions are used during training to encourage effective and safe behavior and prevent undesirable strategies:

  • Force contact: An episode terminates if any external force acting on the torso, hip, or knee joints exceeds $100\,\mathrm{N}$ at any time step. This prevents the robot from relying on harsh collisions.
  • Pillar fall: For pillar-based terrains (Forest, Pile), if a foot penetrates more than 10 cm below the ground level, the episode terminates. This prevents the robot from "cheating" by falling through or bypassing obstacles.
  • No movement: To prevent the agent from exploiting reward shaping by simply staying in place, an episode terminates if the robot fails to move at least $1\,\mathrm{m}$ away from its initial position within 4 seconds.
  • Fall over: The episode terminates if the robot loses balance and falls.
  • Feet too close: Since self-collision is disabled during training (to speed up simulation), this condition prevents the robot's feet from crossing or overlapping unnaturally, ensuring physically plausible motions.

4.5.7. Symmetry

Symmetry-based data augmentation is applied to accelerate training. This involves flipping certain observations along the $y$-axis (a minimal mirroring sketch follows the list below).

  • Proprioceptive observations are flipped, similar to Ben et al. [3].
  • The perception representation (the voxel grid) is also mirrored along the y-dimension. For example, a $(32, 32, 40)$ grid map is flipped across its $y$-axis to align with the flipped proprioceptive input, forming a consistent flipped observation.
  • The reward remains unchanged under this transformation. Both original and flipped samples are stored in the rollout buffer and used jointly during training.
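
A minimal sketch of this mirroring, assuming the voxel grid is stored with axes ordered (x, y, z); which proprioceptive channels flip sign or swap left/right is robot-specific and only indicated schematically here.

```python
import numpy as np

def mirror_sample(proprio, voxel_grid, sign_flip_idx, swap_pairs):
    """Mirror one training sample across the y-axis.

    proprio:       (D,) proprioceptive vector
    voxel_grid:    (32, 32, 40) occupancy grid, axes assumed to be (x, y, z)
    sign_flip_idx: indices of channels whose sign flips under mirroring (e.g. lateral velocity)
    swap_pairs:    list of (left_idx, right_idx) joint-channel pairs to exchange
    """
    p = proprio.copy()
    p[sign_flip_idx] *= -1.0
    for l, r in swap_pairs:
        p[l], p[r] = proprio[r], proprio[l]
    v = voxel_grid[:, ::-1, :].copy()   # reverse the y dimension of the grid
    return p, v
```

Both the original and mirrored samples are then pushed into the rollout buffer; because the reward is invariant under the flip, no relabeling is needed.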

4.6. Real-world deployment Details (from Supplementary Material)

4.6.1. Target Position Command

For real-world deployment, a Livox Mid-360 LiDAR mounted on the robot's head (facing downwards) is used with FastLIO2 [42, 43] to provide the robot's position in the world coordinate frame at $25\,\mathrm{Hz}$.

  • The Mid-360 provides a FoV of $360^\circ$ horizontally and $-7^\circ$ to $52^\circ$ vertically.
  • During training, the observation frequency of $\mathbf{P}_t$ (goal position) is also set to $25\,\mathrm{Hz}$.
  • The robot starts at $(0, 0)$, and the goal position for each run is fixed at $(4, 0)$.
  • At each time step, FastLIO2 outputs the current robot position (x, y). The observation $\mathbf{P}_t$ relative to the goal is then calculated as: $ \mathbf{P}_t = (4, 0) - (x, y) = (4 - x, -y) $ (a minimal example follows the list).
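
A tiny example of this goal-relative observation, with the goal fixed at (4, 0) as above:

```python
def goal_relative_position(x, y, goal=(4.0, 0.0)):
    """P_t = goal - current position in the odometry frame reported by FastLIO2."""
    return (goal[0] - x, goal[1] - y)

# e.g. with the robot at (1.2, -0.3): P_t = (2.8, 0.3)
print(goal_relative_position(1.2, -0.3))
```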

4.6.2. Voxel Grid Processing

Two Hesai JT128 LiDARs (front and rear torso-mounted) collect raw point cloud data.

  • Each JT128 has a FoV of $95^\circ$ vertically and $360^\circ$ horizontally, with 128 channels.
  • The dual-sensor setup provides near-complete coverage around the robot.
  • The JT128 supports $10\,\mathrm{Hz}$ and $20\,\mathrm{Hz}$ output modes. The $10\,\mathrm{Hz}$ mode was chosen for better point cloud quality, and the simulation is aligned accordingly.
  • To improve voxel grid quality, raw point clouds are merged and then processed onboard using OctoMap [16] before being passed to the policy. OctoMap generates a binary occupancy grid at $10\,\mathrm{Hz}$. The authors emphasize that OctoMap serves as a lightweight preprocessing step rather than a full reconstruction pipeline, and thus incurs minimal latency (a simplified voxelization sketch follows the list).
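
Gallant relies on OctoMap for this step; as a rough stand-in, the sketch below shows a plain dense voxelization of a merged point cloud into a binary occupancy grid. The grid extent, origin, and axis conventions are illustrative assumptions, not the OctoMap pipeline.

```python
import numpy as np

def points_to_occupancy(points, origin, voxel_size=0.05, dims=(32, 32, 40)):
    """Convert a merged LiDAR point cloud (N, 3), expressed in the robot frame,
    into a binary occupancy grid. `origin` is the corner of the grid volume."""
    grid = np.zeros(dims, dtype=np.uint8)
    idx = np.floor((points - origin) / voxel_size).astype(int)      # voxel indices per point
    valid = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)     # drop points outside the volume
    ix, iy, iz = idx[valid].T
    grid[ix, iy, iz] = 1
    return grid

# Example: 10k random points around the robot, grid centered on the torso (assumed bounds)
pts = np.random.uniform(-0.8, 0.8, size=(10000, 3))
occ = points_to_occupancy(pts, origin=np.array([-0.8, -0.8, -1.0]))
```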

4.6.3. Information Communication

The system runs on an NVIDIA Orin NX with limited communication performance. To reduce latency:

  • LiDAR output is clipped to only include points within the perception voxel grid to reduce data size.
  • The voxel grid from OctoMap and the one used for observation share memory to avoid redundant transmission.
  • Robot state reading and action command delivery are also implemented via shared memory, bypassing LCM (Lightweight Communications and Marshalling, a common robotics communication system). These optimizations minimize communication-induced latency, leaving primarily inherent sensor delays. The overall communication process is shown in Figure 8, and a generic shared-memory sketch is given after it.

The following figure (Figure 8 from the original paper) illustrates the flow of information communication:

Figure 8. Diagram of information communication. The schematic shows the communication flow among the JT128 LiDARs, onboard PC processing, OctoMap, the policy, and robot control; the components exchange data streams at different frequencies to support efficient action control and environment perception.
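
The paper does not detail the shared-memory mechanism; purely as a generic illustration, the snippet below shows how a producer and a consumer process could share a voxel grid buffer without copying, using Python's standard multiprocessing.shared_memory. The buffer name and grid layout are assumptions.

```python
import numpy as np
from multiprocessing import shared_memory

GRID_SHAPE, GRID_DTYPE = (32, 32, 40), np.uint8   # illustrative grid layout

# Producer side (e.g. the voxel-grid preprocessing process): create the buffer once.
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(GRID_SHAPE)), name="voxel_grid")
grid = np.ndarray(GRID_SHAPE, dtype=GRID_DTYPE, buffer=shm.buf)
grid[:] = 0                       # write each new occupancy grid in place, no serialization

# Consumer side (e.g. the policy process): attach to the same buffer by name.
shm_view = shared_memory.SharedMemory(name="voxel_grid")
grid_view = np.ndarray(GRID_SHAPE, dtype=GRID_DTYPE, buffer=shm_view.buf)
obs = grid_view.copy()            # snapshot the latest grid for the current policy step

# Cleanup when shutting down.
shm_view.close()
shm.close()
shm.unlink()
```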

5. Experimental Setup

5.1. Datasets

The experiments primarily use simulated environments generated within NVIDIA IsaacSim [28]. These environments are configured with 8 representative terrain types designed to progressively increase in difficulty and cover a wide range of 3D constraints. The curriculum-based training strategy ensures that the robot is exposed to increasing complexity.

  • Plane: Flat ground.

  • Ceiling: Overhead obstacles of varying heights and densities.

  • Forest: Randomly spaced cylindrical pillars (lateral clutter).

  • Door: Narrow passages requiring precise lateral movement.

  • Platform: Elevated structures with gaps, demanding step recognition and traversal.

  • Pile: Uneven, gapped surfaces requiring fine foothold selection.

  • Upstair: Stairs for climbing.

  • Downstair: Stairs for descending.

    These terrains are generated using parameterized settings, interpolated between a min (easiest) and max (hardest) configuration, as detailed in Table 2 of the original paper (and transcribed in Section 4.4). The following figure (Figure 3 from the original paper) shows examples of these terrain types:

Figure 3. Terrain types used to train robots in simulation (shown at the $\mathbf{p}_{\tau}^{\mathrm{max}}$ configuration). The illustrated terrain types include Ceiling, Door, Pile, Downstair, Plane, Forest, and Platform.

The simulated environment itself acts as the dataset, providing LiDAR point clouds and proprioceptive feedback to train the Reinforcement Learning policy. For real-world deployment, the Unitree G1 humanoid interacts with physical versions of these challenging terrains.

5.2. Evaluation Metrics

The paper uses two distinct metrics to evaluate policy performance in simulation, particularly on the most challenging terrain settings ($\mathbf{p}_{\tau}^{\mathrm{max}}$):

  1. Success rate ($E_{\mathrm{succ}}$):

    • Conceptual Definition: This metric quantifies the proportion of episodes in which the robot successfully reaches its target goal within a specified time limit (10 seconds) without catastrophic failures, such as falling or incurring any severe collisions with obstacles. It measures the policy's overall capability to complete the task.
    • Mathematical Formula: The paper defines it as a "fraction," so it can be expressed as: $ E_{\mathrm{succ}} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} $
    • Symbol Explanation:
      • $E_{\mathrm{succ}}$: The success rate.
      • Number of successful episodes: Episodes where the robot reaches the target within 10 seconds, without falling or severe collisions.
      • Total number of episodes: The total number of trials conducted.
  2. Collision momentum ($E_{\mathrm{collision}}$):

    • Conceptual Definition: This metric measures the cumulative momentum transferred through unintended or unnecessary physical contacts between the robot and its environment. It explicitly excludes nominal foot contacts (expected and desired contacts for locomotion). A lower collision momentum indicates a more adept and collision-free navigation policy.
    • Mathematical Formula: The paper describes it as "cumulative momentum transferred through unnecessary contacts." While a precise formula is not provided, it implies a summation of momentum changes (force integrated over time) from non-foot contacts. Conceptually, if $\Delta \mathbf{p}_k$ is the momentum transferred during an unnecessary contact event $k$, then: $ E_{\mathrm{collision}} = \sum_{k=1}^{N_{\text{unnecessary\_contacts}}} \|\Delta \mathbf{p}_k\| $
    • Symbol Explanation:
      • $E_{\mathrm{collision}}$: The cumulative collision momentum.

      • $N_{\text{unnecessary\_contacts}}$: The total number of contact events that are not nominal foot contacts.

      • $\Delta \mathbf{p}_k$: The change in momentum caused by the $k$-th unnecessary contact; $\|\cdot\|$ denotes the magnitude of this momentum change.

        For simulation experiments, each policy is trained for 4,000 iterations, followed by 5 independent evaluations, each consisting of 1,000 complete episodes. The mean and standard deviation for these metrics are reported. Policies with higher EsuccE_{\mathrm{succ}} and lower EcollisionE_{\mathrm{collision}} are considered superior.

For real-world experiments, success rates are also measured. Policies are tested over 15 trials per terrain.
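
An illustrative way to aggregate these two metrics from logged episodes is sketched below; the episode and contact-log fields (and the foot link names) are assumptions, since the simulator's exact contact bookkeeping is not described in the paper.

```python
import numpy as np

FOOT_BODIES = {"left_foot", "right_foot"}   # hypothetical foot link names

def success_rate(episodes):
    """E_succ: fraction of episodes reaching the goal within the time limit
    without falling or severe collisions (fields assumed to be logged per episode)."""
    ok = [ep["reached_goal"] and not ep["fell"] and not ep["severe_collision"] for ep in episodes]
    return float(np.mean(ok))

def collision_momentum(episodes):
    """E_collision: cumulative momentum from non-foot contacts, approximated here
    as the sum of |F| * dt over logged contact forces, averaged across episodes."""
    totals = []
    for ep in episodes:
        # ep["contacts"]: iterable of (body_name, force_vector, dt) tuples -- assumed log format
        total = sum(np.linalg.norm(f) * dt
                    for body, f, dt in ep["contacts"] if body not in FOOT_BODIES)
        totals.append(total)
    return float(np.mean(totals))
```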

5.3. Baselines

To thoroughly assess the effectiveness of Gallant's core components, comparisons are made against several ablated methods and alternative approaches:

Simulation Baselines (Ablation Studies)

These baselines are used in IsaacSim to isolate and evaluate the contribution of specific design choices within Gallant:

  1. Self-scan Ablation:

    • w/o-Self-Scan: This variant disables simulated LiDAR returns from dynamic geometry, specifically the robot's own moving links (e.g., legs, arms). It only scans static terrain. This is compared against Gallant, which explicitly models scans over both static terrain and dynamic robot links. This ablation tests the importance of the high-fidelity LiDAR simulation for dynamic objects.
  2. Perceptual Network Ablation: This compares Gallant's z-grouped 2D CNN with alternative CNN architectures for processing the voxel grid:

    • Standard 3D CNN: A convolutional neural network that applies filters across all three spatial dimensions (x, y, z) of the voxel grid. This represents a more direct but potentially more computationally expensive way to process 3D volumetric data.
    • Sparse 2D CNN: A 2D CNN that incorporates sparsity, meaning it only performs computations on occupied voxels in the x-y plane.
    • Sparse 3D CNN: A 3D CNN optimized for sparse 3D data (commonly used in LiDAR perception [5, 12]). These variants (Sparse-2D-CNN, Sparse-3D-CNN) are based on [8]. These ablations evaluate the accuracy-compute tradeoff of the chosen z-grouped 2D CNN.
  3. Perceptual Representation Ablation: This examines the chosen perceptual input to the actor and critic:

    • Only-Height-Map: The actor and critic both receive only a height map as their perceptual input, completely replacing the voxel grid. This highlights the limitations of 2.5D representations for 3D constrained terrains.
    • Only-Voxel-Grid: The actor and critic both receive only the voxel grid (and no height map for the critic). This tests the benefit of including the height map as privileged information for the critic during training.
  4. Voxel Resolution Ablation: This explores the impact of voxel size on performance:

    • 10CM: A voxel resolution of 10 cm. This increases the Field of View (FoV) but reduces geometric fidelity.
    • 2.5CM: A voxel resolution of 2.5 cm. This increases geometric precision but reduces FoV under a fixed memory budget. These are compared against Gallant's default 5CM resolution to find the optimal balance between coverage and detail.

Real-World Baselines

These policies are deployed on the Unitree G1 humanoid to evaluate sim-to-real performance:

  1. HeightMap: This policy replaces Gallant's voxel grid with an elevation map (estimated from Livox Mid-360 LiDAR data) for its perception module. This serves as a direct comparison against a common 2.5D perception method in a real-world setting.
  2. NoDR (No Domain Randomization): This policy is trained identically to Gallant but without the LiDAR domain randomization (pose, hit position, latency, missing grid, as described in Section 4.2). This highlights the critical role of domain randomization in bridging the sim-to-real gap.

6. Results & Analysis

6.1. Core Results Analysis

Gallant's experimental results demonstrate its superior performance and robustness across various challenging 3D constrained terrains in both simulation and the real world. The analyses highlight the importance of its key components, including the LiDAR simulation with dynamic object scanning, the z-grouped 2D CNN, the voxel grid representation, and the 5cm voxel resolution.

Simulation Experiments

The performance of Gallant and its ablations is evaluated in IsaacSim on the hardest terrain settings ($\mathbf{p}_{\tau}^{\mathrm{max}}$) using success rate ($E_{\mathrm{succ}}$) and collision momentum ($E_{\mathrm{collision}}$).

The following are the simulation results from Table 3 of the original paper:

Each cell reports $E_{\mathrm{succ}}$ ↑ (%) / $E_{\mathrm{collision}}$ ↓, as mean (± std) over 5 evaluations.

(a) Ablation on Self-scan

| Method | Plane | Ceiling | Forest | Door | Platform | Pile | Upstair | Downstair |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o-Self-Scan | 99.7 (±0.1) / 27.2 (±1.0) | 33.0 (±0.9) / 579.0 (±55.1) | 33.1 (±1.4) / 305.5 (±16.6) | 31.0 (±0.8) / 264.9 (±21.7) | 32.0 (±1.0) / 200.5 (±18.9) | 30.5 (±0.9) / 190.1 (±17.5) | 32.8 (±0.9) / 220.3 (±20.1) | 31.5 (±1.1) / 210.6 (±19.8) |
| Gallant | 100.0 (±0.0) / 0.0 (±0.0) | 97.1 (±0.6) / 24.6 (±6.3) | 84.3 (±0.7) / 311.1 (±25.9) | 98.7 (±0.3) / 27.7 (±6.4) | 96.1 (±0.5) / 30.1 (±5.3) | 82.1 (±0.6) / 113.1 (±14.6) | 96.2 (±0.6) / 27.0 (±4.9) | 97.9 (±0.4) / 15.6 (±6.2) |

(b) Ablation on Perceptual Network

| Method | Plane | Ceiling | Forest | Door | Platform | Pile | Upstair | Downstair |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sparse-3D-CNN | 100.0 (±0.0) / 0.0 (±0.0) | 86.7 (±2.1) / 143.5 (±46.1) | 84.1 (±1.5) / 277.8 (±22.1) | 98.0 (±0.6) / 74.8 (±7.9) | 88.8 (±1.5) / 96.8 (±11.6) | 52.4 (±1.5) / 365.9 (±12.3) | 80.1 (±2.2) / 107.7 (±15.8) | 97.5 (±0.4) / 18.9 (±14.1) |
| 3D-CNN | 99.9 (±0.1) / 0.0 (±0.0) | 97.5 (±0.5) / 20.0 (±6.6) | 73.9 (±2.1) / 379.0 (±70.2) | 96.1 (±0.7) / 69.58 (±5.8) | 92.7 (±1.0) / 65.6 (±9.5) | 65.3 (±0.9) / 275.4 (±31.5) | 86.0 (±1.4) / 78.1 (±19.2) | 99.0 (±0.3) / 12.1 (±11.6) |
| Sparse-2D-CNN | 99.6 (±0.2) / 0.7 (±1.4) | 96.0 (±1.0) / 26.17 (±5.1) | 80.2 (±1.1) / 363.1 (±14.4) | 92.7 (±1.0) / 199.6 (±120.2) | 87.9 (±1.1) / 100.5 (±20.3) | 57.6 (±0.9) / 360.3 (±16.3) | 89.1 (±0.7) / 52.9 (±4.8) | 98.7 (±0.6) / 4.55 (±2.92) |
| Gallant | 100.0 (±0.0) / 0.0 (±0.0) | 97.1 (±0.6) / 24.6 (±6.3) | 84.3 (±0.7) / 311.1 (±25.9) | 98.7 (±0.3) / 27.7 (±6.4) | 96.1 (±0.5) / 30.1 (±5.3) | 82.1 (±0.6) / 113.1 (±14.6) | 96.2 (±0.6) / 27.0 (±4.9) | 97.9 (±0.4) / 15.6 (±6.2) |

(c) Ablation on Perceptual Interface

| Method | Plane | Ceiling | Forest | Door | Platform | Pile | Upstair | Downstair |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Only-Height-Map | 100.0 (±0.0) / 0.0 (±0.0) | 5.3 (±2.0) / 1995.3 (±68.3) | 10.5 (±1.5) / 577.4 (±18.1) | 10.2 (±1.3) / 717.5 (±33.8) | 96.0 (±0.7) / 34.3 (±2.8) | 86.2 (±0.6) / 101.6 (±13.8) | 98.3 (±0.2) / 11.6 (±6.2) | 98.5 (±0.3) / 11.2 (±6.4) |
| Only-Voxel-Grid | 100.0 (±0.0) / 0.0 (±0.0) | 96.9 (±0.4) / 22.4 (±4.2) | 75.9 (±1.5) / 506.0 (±20.6) | 96.0 (±0.3) / 281.4 (±29.0) | 94.2 (±0.8) / 51.0 (±10.2) | 72.3 (±0.6) / 201.8 (±14.9) | 96.2 (±0.6) / 46.9 (±10.5) | 98.8 (±0.2) / 7.0 (±3.9) |
| Gallant | 100.0 (±0.0) / 0.0 (±0.0) | 97.1 (±0.6) / 24.6 (±6.3) | 84.3 (±0.7) / 311.1 (±25.9) | 98.7 (±0.3) / 27.7 (±6.4) | 96.1 (±0.5) / 30.1 (±5.3) | 82.1 (±0.6) / 113.1 (±14.6) | 96.2 (±0.6) / 27.0 (±4.9) | 97.9 (±0.4) / 15.6 (±6.2) |

(d) Ablation on Voxel Resolution

| Method | Plane | Ceiling | Forest | Door | Platform | Pile | Upstair | Downstair |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10CM | 98.8 (±0.2) / 2.1 (±1.6) | 13.3 (±2.4) / 1442.4 (±119.6) | 59.0 (±1.7) / 642.7 (±12.4) | 64.8 (±1.1) / 591.0 (±22.5) | 67.2 (±2.7) / 268.9 (±39.3) | 54.1 (±1.7) / 400.2 (±19.5) | 86.3 (±1.2) / 74.8 (±12.8) | 96.6 (±0.4) / 15.2 (±6.1) |
| 2.5CM | 99.9 (±0.1) / 2.1 (±1.6) | 97.3 (±0.9) / 24.2 (±11.0) | 77.5 (±3.4) / 368.0 (±36.3) | 97.5 (±0.4) / 260.4 (±38.8) | 75.5 (±0.5) / 63.0 (±4.9) | 65.2 (±5.5) / 256.3 (±50.0) | 94.1 (±1.1) / 38.6 (±6.7) | 97.5 (±0.4) / 13.5 (±2.0) |
| Gallant (5CM) | 100.0 (±0.0) / 0.0 (±0.0) | 97.1 (±0.6) / 24.6 (±6.3) | 84.3 (±0.7) / 311.1 (±25.9) | 98.7 (±0.3) / 27.7 (±6.4) | 96.1 (±0.5) / 30.1 (±5.3) | 82.1 (±0.6) / 113.1 (±14.6) | 96.2 (±0.6) / 27.0 (±4.9) | 97.9 (±0.4) / 15.6 (±6.2) |

1. LiDAR Returns from Dynamic Objects are Necessary (Self-scan Ablation): Gallant (which includes self-scan) achieves significantly higher success rates and lower collision momentum across all tasks than the w/o-Self-Scan variant. For instance, on the Ceiling task, Gallant reaches a 97.1% success rate with a collision momentum of 24.6, whereas w/o-Self-Scan achieves only 33.0% success with collision momentum more than an order of magnitude higher. This is attributed to the fact that when the robot adopts postures such as crouching (e.g., under a ceiling), its own links (e.g., legs) can occlude the ground. Without self-scan, the voxel grid would artificially show a flat floor, producing an out-of-distribution (OOD) observation for the policy. This demonstrates that simulating dynamic objects in the LiDAR pipeline is crucial for realistic perception and robust performance, particularly in scenarios requiring significant changes in body posture.

The following figure (Figure 5 from the original paper) illustrates the effect of self-scan:

Figure 5. Effect of self-scan: (left) the humanoid traversing a ceiling-constrained scene; (right) the voxel grid with and without self-scan, together with per-iteration training times of the different models.

2. z-grouped 2D CNN is the Most Suitable Choice (Perceptual Network Ablation): While the 3D-CNN variant marginally outperforms Gallant on Ceiling (97.5% vs 97.1% success, 20.0 vs 24.6 collision momentum) and Downstair (99.0% vs 97.9% success, 12.1 vs 15.6 collision momentum), Gallant's z-grouped 2D CNN performs better or comparably on most other tasks. On Forest and Door, Gallant (84.3% and 98.7% success) clearly outperforms the 3D-CNN (73.9% and 96.1% success). The paper argues that its voxel input ($32 \times 32 \times 40$) has relatively dense occupancy in the x-y plane, making sparse convolutions less efficient due to rulebook overhead, while full 3D CNNs introduce more parameters and memory traffic, making optimization harder. The z-grouped 2D CNN preserves vertical structure through channel mixing, leverages optimized dense 2D operators, and provides the right inductive bias for an egocentric raster, delivering superior performance at markedly lower compute.
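
To make the architectural contrast concrete, the sketch below shows the core idea of a z-grouped 2D CNN in PyTorch: the 40 vertical bins of the (32, 32, 40) grid are treated as channels of a 2D image, grouped convolutions extract per-z-group spatial features, and 1×1 convolutions mix information across groups. The layer sizes and group count are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ZGrouped2DCNN(nn.Module):
    """Illustrative z-grouped 2D CNN: the (32, 32, 40) voxel grid becomes a 2D image
    with 40 z-channels; grouped 3x3 convolutions keep nearby z-bins together, and
    1x1 convolutions mix information across the vertical groups."""
    def __init__(self, z_bins=40, z_groups=8, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(z_bins, 64, kernel_size=3, padding=1, groups=z_groups),  # per-group spatial features
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=1),                                  # mix across z-groups
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, voxels):            # voxels: (B, 32, 32, 40) occupancy grid
        x = voxels.permute(0, 3, 1, 2)    # -> (B, 40, 32, 32): z becomes the channel axis
        return self.net(x)

# A dense 3D-CNN baseline would instead treat the grid as (B, 1, 32, 32, 40) and use
# nn.Conv3d, at notably higher parameter count and memory traffic.
```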

3. Combination of Voxel Grid and Height Map is Better (Perceptual Interface Ablation): Only-Height-Map performs poorly on 3D constrained terrains like Ceiling (5.3% success), Forest (10.5% success), and Door (10.2% success), confirming its inability to represent multilayer structures. However, it performs well on Platform, Pile, Upstair, and Downstair due to their primary ground-level nature. Gallant (which uses a voxel grid for the actor and both voxel grid and height map as privileged information for the critic) achieves higher success rates than Only-Voxel-Grid (which only uses voxel grid for both actor and critic) across all tasks. This validates Gallant's asymmetric design, where the height map aids the critic in credit assignment during training, improving overall policy learning without introducing latency-sensitive channels to the deployed actor.

4. 5cm is a Suitable Resolution for Gallant (Voxel Resolution Ablation): Gallant's default 5cm resolution (Gallant (5CM)) generally achieves the best balance. The 10CM variant (coarser) significantly underperforms on Ceiling (13.3% success) and Forest (59.0% success) due to impaired geometric fidelity, leading to poor contact- and clearance-sensitive interactions. Conversely, the 2.5CM variant (finer) also has lower success rates than Gallant (5CM) on most terrains, especially those requiring sensing far above or below the robot (e.g., Ceiling, Downstair), because its reduced FoV (under a fixed VRAM budget) hampers perception of long vertical extents. The 5cm resolution strikes an effective balance between coverage and detail under resource constraints.
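
The coverage-versus-fidelity tradeoff follows directly from keeping the grid size fixed while changing the voxel edge length. Assuming the $32 \times 32 \times 40$ grid dimensions mentioned above apply at every resolution, the covered volume scales as follows.

```python
# Coverage of a fixed 32 x 32 x 40 (= 40,960-cell) voxel grid at different resolutions
for res_cm in (10.0, 5.0, 2.5):
    res = res_cm / 100.0                     # metres per voxel
    print(f"{res_cm:>4.1f} cm voxels -> {32*res:.1f} m x {32*res:.1f} m x {40*res:.1f} m covered")
# 10.0 cm -> 3.2 m x 3.2 m x 4.0 m (wide but coarse)
#  5.0 cm -> 1.6 m x 1.6 m x 2.0 m (Gallant's default)
#  2.5 cm -> 0.8 m x 0.8 m x 1.0 m (fine but narrow)
```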

Real-world Experiments

The Gallant-trained policy is deployed zero-shot on the Unitree G1 humanoid at a $50\,\mathrm{Hz}$ control frequency. OctoMap [16] processes raw LiDAR point clouds onboard at $10\,\mathrm{Hz}$ for voxel grid construction.

The following figure (Figure 4 from the original paper) shows qualitative real-world results:

Figure 4. Qualitative real-world results of the Gallant framework: the humanoid walks and locally navigates across various 3D constrained terrains, including crossing obstacles, climbing stairs, and passing through narrow passages.

The robot consistently traverses diverse real-world terrains, including flat ground, random-height ceilings, lateral clutter (e.g., doors), high platforms with gaps, stepping stones, and staircases. It demonstrates versatile capabilities: crouching, planning lateral motions, robustly stepping onto platforms, crossing gaps, and careful foot placement on stepping stones, and stable multi-step climbing/descent. These behaviors arise from a single policy without terrain-specific tuning, highlighting Gallant's generality and real-world transferability.

Real-world Ablation: To evaluate sim-to-real performance, three policies are tested on the Unitree G1 over 15 trials per terrain: HeightMap, NoDR, and Gallant.

The following figure (Figure 6 from the original paper) shows the real-world success rates of the ablation studies:

Figure 6. Real-world success counts over 15 trials per scenario (Plane, Ceiling, Door, Platform, Pile, Upstair, Downstair) for the three policies (HeightMap, NoDR, Gallant). Gallant performs best in most scenarios, particularly on Downstair, demonstrating its effectiveness and reliability in complex environments.

  • Gallant vs. HeightMap: Gallant consistently outperforms HeightMap across all real-world terrains. HeightMap fails significantly on overhead (Ceiling) and lateral (Door) obstacles because of its limited 2.5D representation. Even on ground-level terrains, HeightMap's real-world performance is hindered by noisy elevation reconstruction, which is exacerbated by LiDAR jitter induced by the robot's torso pitch/roll movements. This reinforces the benefit of voxel grids, which are less susceptible to such issues.
  • Gallant vs. NoDR: NoDR (trained without LiDAR domain randomization) performs reasonably on Ceiling and Door, suggesting lower sensitivity to sensing latency in these cases. However, its performance drops significantly on ground-level terrains (e.g., Platform, Pile, Stairs). Without modeling LiDAR delay and noise during training, the robot often misjudges its position relative to obstacles or reacts too late. This emphasizes the critical role of domain randomization in bridging the sim-to-real gap. Figure 9 in the supplementary material illustrates NoDR's failure modes.

Further Analyses

A clear correlation exists between Gallant's success rates in simulation and the real world (Figure 7). Terrains with higher success in simulation generally perform well on hardware, validating the use of large-scale simulated evaluation as a predictor of real-world performance.

The following figure (Figure 7 from the original paper) shows the success rates of Gallant in simulation and real-world environments:

Figure 7. Gallant success rate in simulation and the real world. Across scenarios such as Ceiling, Door, Platform, Pile, Upstair, and Downstair, simulated and real results are close to 100% in most cases.

With voxel grids, scenarios with overhead (Ceiling) and lateral (Door) constraints, which were previously difficult for height-map-based methods, become among the easiest for Gallant (achieving near 100% success), demonstrating the effectiveness of the voxel grid representation for full-space perception.

Gallant's main limitation appears on the Pile terrain, where accurate foothold selection is critical: success rates plateau around 80%. Simulation with zero LiDAR latency improves this to over 90%, indicating that real-world sensor delay is a key bottleneck for this specific task. On other terrains, particularly Platform and Stairs (which are typically failure-prone due to collision risk), Gallant achieves high success by proactively adjusting foot trajectories.

6.2. Data Presentation (Tables)

All tables from the original paper (Table 1, Table 2, Table 3) and the supplementary material (Table 4, Table 5) have been transcribed and presented in their respective sections above, following the strict formatting guidelines for Markdown and HTML tables.

6.3. Ablation Studies / Parameter Analysis

The paper conducts extensive ablation studies to verify the effectiveness of Gallant's various components, as detailed in Table 3 of the original paper and discussed in Section 6.1.

  • Self-scan: Crucial for dynamic environments, significantly impacting tasks like Ceiling and Forest.

  • Perceptual Network: z-grouped 2D CNN provides a superior accuracy-compute tradeoff compared to 3D CNNs or sparse variants, making it practical for real-time inference while maintaining high performance.

  • Perceptual Representation: The combination of voxel grid (for actor) and height map (as privileged information for critic) yields the best results, demonstrating the value of comprehensive 3D data and strategic use of auxiliary information during training.

  • Voxel Resolution: 5cm resolution strikes an optimal balance between FoV coverage and geometric fidelity, outperforming coarser (10cm) or finer (2.5cm) resolutions.

  • Domain Randomization (NoDR baseline): Essential for sim-to-real transfer, particularly for mitigating real-world sensor latency and noise.

    The ablation results consistently show that each of Gallant's design choices (realistic LiDAR simulation including self-scan, z-grouped 2D CNN, voxel grid combined with privileged height map, and optimal voxel resolution) contributes significantly to its robust performance and sim-to-real transferability.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Gallant, a comprehensive full-stack pipeline for humanoid locomotion and local navigation in complex 3D-constrained environments. Gallant makes significant advancements by leveraging voxel grids as a lightweight yet geometry-preserving perceptual representation. This representation, coupled with a high-fidelity LiDAR simulation (which includes dynamic object scanning, noise, and latency modeling) and an efficient z-grouped 2D CNN for processing, enables fully end-to-end optimization.

The simulation ablations rigorously demonstrate that Gallant's core components are indispensable for training policies with high success rates. In real-world tests, a single LiDAR policy trained with Gallant not only covers ground obstacles (a domain traditionally handled by elevation-map controllers) but also effectively tackles lateral and overhead structures, achieving near 100% success with fewer collisions on ground-only terrains. These findings collectively establish Gallant as a robust and practical solution for humanoid robots to navigate diverse and challenging 3D terrains.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation of Gallant: it does not yet achieve a 100% success rate across all scenarios.

  • LiDAR Latency: The main bottleneck identified is LiDAR latency. Operating at $10\,\mathrm{Hz}$, each scan incurs a delay of over $100{-}200\,\mathrm{ms}$ due to light reflection and communication overhead. This inherent delay limits the robot's ability to act preemptively and is particularly evident in tasks requiring fine-grained foothold selection (such as the Pile terrain).

    Future work will focus on addressing this latency:

  • Geometry-aware Teacher: Exploring the use of Gallant itself as a geometry-aware teacher to guide policies trained with lower-latency sensors.

  • Lower-Latency Sensors: Investigating alternative or complementary sensors with reduced latency to enable a fully reactive policy. The ultimate goal is to achieve near-perfect performance across all terrains by overcoming these temporal limitations.

7.3. Personal Insights & Critique

Personal Insights

This paper presents a highly practical and well-engineered solution to a fundamental problem in humanoid robotics. The integration of voxel grids for 3D perception, the clever z-grouped 2D CNN architecture, and the meticulous LiDAR simulation with domain randomization are particularly insightful.

  1. Efficiency of z-grouped 2D CNN: The z-grouped 2D CNN is a brilliant compromise. It acknowledges the sparsity of voxel data and the computational burden of 3D CNNs, offering an efficient way to extract 3D structural information using widely optimized 2D CNN operations. This balance between representation capacity and computational efficiency is critical for real-time onboard deployment on resource-constrained platforms like the NVIDIA Orin NX.
  2. High-Fidelity LiDAR Simulation: The explicit modeling of dynamic objects (self-scan), sensor noise, and latency in the LiDAR simulation is a standout feature. The ablation study clearly demonstrates its necessity for robust sim-to-real transfer. This level of detail in simulation design is often overlooked but proves to be paramount for real-world performance.
  3. Generalization of a Single Policy: The ability of a single policy to generalize across such a diverse range of 3D constrained terrains (ground, lateral, overhead) without terrain-specific tuning is a significant step towards truly autonomous and versatile humanoid robots. This highlights the power of robust 3D perception combined with comprehensive curriculum learning.

Critique

While Gallant is an excellent piece of work, some areas could be explored further or present inherent challenges:

  1. Addressing LiDAR Latency: The identified limitation of LiDAR latency is a critical real-world problem. While the authors suggest using Gallant as a teacher, exploring predictive models (e.g., using Recurrent Neural Networks or transformers on historical voxel grids) to anticipate future terrain changes could also be a viable avenue. Combining multiple sensor modalities (e.g., high-frequency IMUs with LiDAR) to create a more responsive internal state could also help.

  2. Computational Cost of OctoMap: While OctoMap is stated as a "lightweight preprocessing step," its real-time performance on a continuous stream of dual JT128 LiDAR data on an Orin NX (especially for complex environments) could still be a bottleneck. Further optimization or alternative direct voxelization methods might be needed, especially if the perception volume or voxel resolution were to increase.

  3. Generalization to Novel Terrains: While Gallant generalizes well across its 8 trained terrain families, it would be interesting to see its performance on entirely novel 3D environments that differ significantly from the training distribution (e.g., highly deformable terrain, dense vegetation, complex moving obstacles not part of the robot itself). The domain randomization helps, but the inherent structure of the voxel grid might still be specific to relatively rigid environments.

  4. Foot Placement on Pile: The slightly lower success rate on Pile (80%) highlights the inherent difficulty of precise foothold planning. Future work could investigate incorporating more explicit footstep planning or contact-rich manipulation modules that utilize the voxel grid more directly for affordance estimation (identifying stable contact points).

  5. Multi-robot Coordination or Interaction: The current framework focuses on single-robot locomotion. Extending this robust 3D perception to multi-robot scenarios or human-robot interaction in shared 3D constrained spaces would open new research avenues.

    Overall, Gallant provides a strong foundation and a clear pathway for developing highly capable humanoid robots that can truly navigate and operate in the complex, unstructured 3D environments of the real world. Its rigorous methodology and clear empirical validation make it an impactful contribution to the field.
