
BeamDojo: Learning Agile Humanoid Locomotion on Sparse Footholds

Published: 02/15/2025

TL;DR Summary

BeamDojo is a reinforcement learning framework for agile humanoid locomotion on sparse footholds, utilizing sampling-based rewards and a double critic architecture. A two-stage learning approach balances rewards and achieves efficient learning and high success rates in dynamic real-world environments.

Abstract

Traversing risky terrains with sparse footholds poses a significant challenge for humanoid robots, requiring precise foot placements and stable locomotion. Existing learning-based approaches often struggle on such complex terrains due to sparse foothold rewards and inefficient learning processes. To address these challenges, we introduce BeamDojo, a reinforcement learning (RL) framework designed for enabling agile humanoid locomotion on sparse footholds. BeamDojo begins by introducing a sampling-based foothold reward tailored for polygonal feet, along with a double critic to balance the learning process between dense locomotion rewards and sparse foothold rewards. To encourage sufficient trial-and-error exploration, BeamDojo incorporates a two-stage RL approach: the first stage relaxes the terrain dynamics by training the humanoid on flat terrain while providing it with task-terrain perceptive observations, and the second stage fine-tunes the policy on the actual task terrain. Moreover, we implement an onboard LiDAR-based elevation map to enable real-world deployment. Extensive simulation and real-world experiments demonstrate that BeamDojo achieves efficient learning in simulation and enables agile locomotion with precise foot placement on sparse footholds in the real world, maintaining a high success rate even under significant external disturbances.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is BeamDojo: Learning Agile Humanoid Locomotion on Sparse Footholds. It focuses on developing a robust reinforcement learning framework that enables humanoid robots to traverse challenging terrains with limited, sparse areas for foot placement.

1.2. Authors

The authors of this paper are:

  • Huayi Wang

  • Zirui Wang

  • Junli Ren

  • Qingwei Ben

  • Tao Huang

  • Weinan Zhang

  • Jiangmiao Pang

    Their affiliations include Shanghai AI Laboratory, Shanghai Jiao Tong University, Zhejiang University, The University of Hong Kong, and The Chinese University of Hong Kong. Weinan Zhang and Jiangmiao Pang are noted as corresponding authors.

1.3. Journal/Conference

This paper is published as an arXiv preprint and has a publication date of 2025-02-14. While not yet in a specific journal or conference, arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, and related disciplines. The related works section frequently cites papers from major robotics and machine learning conferences such as Conference on Robot Learning (CoRL), IEEE International Conference on Robotics and Automation (ICRA), and IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), indicating that the work is situated within the leading research in these fields.

1.4. Publication Year

2025 (the arXiv preprint is dated February 14, 2025).

1.5. Abstract

Traversing risky terrains with sparse footholds poses a significant challenge for humanoid robots, requiring precise foot placements and stable locomotion. Existing learning-based approaches often struggle on such complex terrains due to sparse foothold rewards and inefficient learning processes. To address these challenges, we introduce BeamDojo, a reinforcement learning (RL) framework designed for enabling agile humanoid locomotion on sparse footholds. BeamDojo begins by introducing a sampling-based foothold reward tailored for polygonal feet, along with a double critic to balance the learning process between dense locomotion rewards and sparse foothold rewards. To encourage sufficient trial-and-error exploration, BeamDojo incorporates a two-stage RL approach: the first stage relaxes the terrain dynamics by training the humanoid on flat terrain while providing it with task-terrain perceptive observations, and the second stage fine-tunes the policy on the actual task terrain. Moreover, we implement an onboard LiDAR-based elevation map to enable real-world deployment. Extensive simulation and real-world experiments demonstrate that BeamDojo achieves efficient learning in simulation and enables agile locomotion with precise foot placement on sparse footholds in the real world, maintaining a high success rate even under significant external disturbances.

The official source link for the paper is:

  • https://arxiv.org/abs/2502.10363 (Abstract and main page)

  • https://arxiv.org/pdf/2502.10363v3.pdf (PDF link)

    This is a preprint, meaning it has not yet undergone formal peer review and publication in a conference proceeding or journal.


2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is enabling humanoid robots to traverse risky terrains with sparse footholds. This is a critical challenge in robotics because such environments, like stepping stones or balancing beams, demand extremely precise foot placements and stable locomotion to prevent the robot from falling.

The problem is important because existing learning-based approaches, which have shown impressive results in various locomotion tasks (e.g., walking, stair climbing), often struggle with these specific types of complex terrains. This struggle arises from several key challenges and gaps in prior research:

  • Sparse Foothold Rewards: The reward signal (feedback) for correct foot placement is often sparse, meaning it's only given after a complete sub-process (like lifting and landing a foot). This makes it difficult for a robot to learn what specific actions contributed to success or failure.

  • Inefficient Learning Processes: A single misstep on challenging terrain can lead to early termination of a training episode. This limits the trial-and-error exploration crucial for reinforcement learning and makes the learning process highly inefficient.

  • Foot Geometry Differences: Many existing methods, particularly for quadrupedal robots, model feet as simple points. However, humanoid robots typically have polygonal feet (flat, larger foot surfaces), which introduces complexities for foothold evaluation and online planning that point-based models cannot address.

  • High Degrees of Freedom & Instability: Humanoid robots inherently have more degrees of freedom (DoF) and an unstable morphology compared to quadrupeds, making agile and stable locomotion even more difficult on risky terrains.

    The paper's innovative idea, or entry point, is to combine novel reward design, a specialized learning architecture, and a two-stage training process specifically tailored to address these humanoid locomotion challenges on sparse footholds. The name BeamDojo itself reflects this—beam for sparse footholds and dojo for a place of training.

2.2. Main Contributions / Findings

The paper introduces BeamDojo, a novel reinforcement learning (RL) framework, and makes several primary contributions to enabling agile humanoid locomotion on sparse footholds:

  • Novel Foothold Reward Design: They propose a sampling-based foothold reward specifically designed for polygonal feet. This reward is continuous (proportional to the overlap between the foot and the safe foothold), encouraging precise foot placements more effectively than binary or coarse rewards.
  • Double Critic Architecture: To balance the sparse foothold reward with dense locomotion rewards (essential for gait regularization), BeamDojo incorporates a double critic in its Proximal Policy Optimization (PPO) framework. This allows independent estimation and normalization of value functions for different reward groups, enhancing learning stability and gait quality.
  • Two-Stage RL Training Approach: A unique two-stage RL strategy is introduced to tackle the early termination problem and promote sufficient trial-and-error exploration.
    • Stage 1 (Soft Terrain Dynamics Constraints): The robot trains on flat terrain but receives perceptive observations of the challenging task terrain. Missteps incur penalties but do not terminate the episode, allowing foundational skills and broad exploration.
    • Stage 2 (Hard Terrain Dynamics Constraints): The policy is fine-tuned on the actual task terrain, where missteps lead to termination, forcing precise adherence to environmental constraints.
  • Real-World Deployment with LiDAR-based Elevation Map: The framework integrates an onboard LiDAR-based elevation map to enable real-world deployment. This perceptual input, combined with carefully designed domain randomization in simulation, facilitates robust sim-to-real transfer.
  • Demonstrated Agile & Robust Locomotion: Extensive simulations and real-world experiments on a Unitree G1 Humanoid robot demonstrate that BeamDojo achieves:
    • Efficient learning in simulation.

    • Agile locomotion with precise foot placement on sparse footholds in the real world.

    • High success rates even under significant external disturbances (e.g., heavy payloads, external pushes).

    • Impressive zero-shot generalization to terrains not explicitly seen during training (e.g., Stepping Beams and Gaps).

    • The ability to perform backward walking on risky terrains, leveraging LiDAR effectively.

      In essence, the paper provides a comprehensive, learning-based solution for a challenging problem in humanoid robotics, integrating perception, reward shaping, and training strategies to achieve high-performance locomotion in complex environments.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the BeamDojo framework, a foundational understanding of several key concepts in robotics and machine learning is beneficial.

  • Humanoid Robots: These are robots designed to mimic the human body, typically having two legs, two arms, and a torso. Their high degrees of freedom (DoF) (multiple joints allowing movement) and inherent unstable morphology (being tall and bipedal) make stable locomotion challenging, especially on uneven terrain. The paper specifically highlights their polygonal feet (flat foot surfaces) as a key differentiator from simpler point-foot models.

  • Legged Locomotion: The study of how robots move using legs. This involves complex control challenges to maintain balance, coordinate joint movements, and interact with the environment, particularly on non-flat or irregular surfaces. Sparse footholds refer to situations where there are only small, separated areas where the robot's feet can safely land.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.

    • Agent: The entity that learns and acts (in this case, the control policy for the humanoid robot).
    • Environment: The world in which the agent operates (the terrain and physics simulation).
    • State ($s$): A complete description of the environment at a given time (e.g., robot's joint angles, velocities, base orientation, terrain information).
    • Action ($a$): The decision made by the agent (e.g., target joint positions for the robot).
    • Reward ($r$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was. The goal of the agent is to maximize the sum of future rewards.
    • Policy ($\pi$): A function that maps states to actions, determining the agent's behavior.
    • Value Function: Estimates the expected cumulative reward from a given state (or state-action pair).
    • Exploration vs. Exploitation: The fundamental trade-off in RL. Exploration involves trying new actions to discover better strategies, while exploitation involves using currently known best strategies to maximize reward.
  • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision maker. An MDP is defined by a tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$:

    • $\mathcal{S}$: A set of possible states.
    • $\mathcal{A}$: A set of possible actions.
    • $T$: A transition function $P(s' \mid s, a)$ representing the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
    • $R$: A reward function $R(s, a)$ representing the immediate reward received after taking action $a$ in state $s$.
    • $\gamma$: A discount factor (between 0 and 1) that determines the present value of future rewards.
  • Partially Observable Markov Decision Process (POMDP): An extension of MDPs where the agent does not have direct access to the true state of the environment but instead receives observations that are probabilistically related to the state. This is relevant for robots with sensors that provide incomplete or noisy information.

  • Proximal Policy Optimization (PPO): A popular reinforcement learning algorithm that belongs to the family of policy gradient methods. PPO aims to find an optimal policy by iteratively updating it. It is known for its stability and data efficiency. Key components include:

    • Actor-Critic Architecture: PPO uses two neural networks: an actor network that learns the policy (what actions to take) and a critic network that learns the value function (how good a state is).
    • Clipping: A mechanism in PPO that limits the size of policy updates, preventing large, destructive changes that could destabilize training.
    • Generalized Advantage Estimation (GAE): A technique used in PPO to provide more stable and effective estimates of the advantage function. The advantage function measures how much better an action is compared to the average action in a given state.
  • LiDAR (Light Detection and Ranging): A remote sensing method that uses pulsed laser light to measure distances to the Earth's surface. In robotics, LiDAR is used to create 3D point clouds of the environment, which can then be processed to generate elevation maps.

  • Elevation Map: A representation of the terrain height, often as a grid, where each cell stores the height information. A robot-centric elevation map is centered around the robot, providing local terrain information relevant to its immediate surroundings.

  • Domain Randomization: A technique used to bridge the sim-to-real gap (the discrepancy between simulation and real-world performance). By randomizing various parameters in the simulation (e.g., physics properties, sensor noise, texture), the RL policy learns to be robust to variations, making it more likely to perform well in the real world where conditions are never perfectly known or constant.

3.2. Previous Works

The paper discusses existing approaches to legged locomotion, highlighting their strengths and limitations, especially concerning humanoid robots on sparse footholds.

  • Quadrupedal Robot Locomotion: Existing learning-based methods have effectively addressed sparse foothold traversal for quadrupedal robots (four-legged robots). These methods often model the robot's foot as a point, which simplifies foothold evaluation and planning. However, the paper emphasizes that these methods encounter great challenges when applied to humanoid robots due to the fundamental difference in foot geometry (humanoids have polygonal feet).

  • Model-Based Hierarchical Controllers: Traditionally, locomotion on sparse footholds for legged robots has been tackled using model-based hierarchical controllers. These controllers decompose the complex task into stages:

    1. Perception: Gathering information about the environment.
    2. Planning: Generating a sequence of footsteps or trajectories.
    3. Control: Executing the planned movements. While effective, these methods are sensitive to violations of model assumptions and can be computationally burdensome for online planning, especially for polygonal feet which require additional half-space constraints (linear inequalities).
  • Hybrid Methods (RL + Model-Based): Some works combine the strengths of RL and model-based controllers. For instance, RL can be used to generate trajectories that are then tracked by model-based controllers, or RL policies can track trajectories from model-based planners. The paper notes that these decoupled architectures can constrain adaptability and coordination.

  • End-to-End Learning Frameworks: More recent approaches use end-to-end learning where RL directly learns perceptive locomotion controllers for sparse footholds. Many of these focus on quadrupeds and rely on depth cameras for exteroceptive observations (observations from external sensors). However, depth cameras have narrow fields of view and often require an image processing module to bridge the sim-to-real gap between depth images and terrain heightmaps. This limits robot movement (e.g., only moving backward) and adds complexity.

  • Two-Stage Training Frameworks in RL: Previous two-stage training frameworks have been used in RL to bridge the sim-to-real gap primarily in the observation space. This typically means training in simulation with noisy observations to mimic reality, then fine-tuning. BeamDojo differentiates itself by introducing a novel two-stage training approach specifically aimed at improving sample efficiency, particularly by addressing the early termination problem when learning to walk on sparse terrains.

3.3. Technological Evolution

The field of legged locomotion, especially for humanoid robots on challenging terrains, has evolved significantly.

  • Early Stages (Model-Based Control): Initially, model-based control dominated. This involved precise mathematical models of the robot and environment to calculate optimal movements. While providing high precision for well-defined tasks, these methods struggled with real-world variability and unstructured environments due to their reliance on perfect models.
  • Emergence of Learning-Based Methods (RL): With advancements in computational power and deep learning, Reinforcement Learning emerged as a powerful tool. RL allowed robots to learn complex behaviors directly from trial-and-error, leading to more robust and adaptive locomotion across various tasks (walking, stair climbing, parkour). However, initial RL applications often faced challenges like sample inefficiency (requiring huge amounts of data) and sim-to-real gaps.
  • Addressing Challenges in RL (PPO, GAE, Domain Randomization): Algorithms like PPO and techniques like GAE improved the stability and efficiency of RL. Domain randomization became crucial for transferring policies learned in simulation to the real world, making sim-to-real transfer more feasible.
  • Specialization for Complex Terrains (Perception-Driven RL): The focus shifted towards enabling locomotion on increasingly complex and risky terrains. This necessitated better perceptive capabilities (e.g., using depth cameras, LiDAR) and perception-aware RL policies. Initial efforts concentrated on quadrupedal robots due to their relative stability.
  • Current Frontier (Humanoid Locomotion on Sparse Footholds): The current work, BeamDojo, represents a step forward by tackling the more difficult problem of humanoid locomotion on sparse footholds. It builds upon prior RL successes but introduces specific innovations to overcome the unique challenges posed by humanoid kinematics, polygonal feet, and sparse reward signals.

3.4. Differentiation Analysis

Compared to the main methods in related work, BeamDojo offers several core differences and innovations:

  • Humanoid-Specific Foothold Handling: Unlike most quadrupedal methods that simplify feet to points, BeamDojo explicitly addresses the challenge of polygonal feet in humanoid robots. Its sampling-based foothold reward is tailored to evaluate the overlap of a flat foot surface with safe regions, a crucial distinction for humanoid stability.

  • Addressing Sparse Rewards with Double Critic: While other RL methods encounter sparse reward problems, BeamDojo introduces a double critic architecture. This design decouples the learning of dense locomotion rewards (for general gait) from the sparse foothold reward (for precise placement), allowing more effective optimization for both and preventing the sparse reward from destabilizing dense reward learning. This is a modular, plug-and-play solution.

  • Novel Two-Stage RL for Sample Efficiency: BeamDojo's two-stage RL approach is specifically designed to improve sample efficiency and overcome the early termination problem prevalent in complex terrains. By initially training in a soft terrain dynamics setting (flat terrain with perceptive feedback of sparse terrain), it encourages broad exploration without frequent episode resets, a significant departure from standard one-stage or sim-to-real-focused two-stage methods.

  • Robust LiDAR-Based Perception: Instead of depth cameras (which often have narrow fields of view and limitations for backward movement), BeamDojo utilizes a LiDAR-based elevation map. This enables robot-centric perception that supports both forward and backward movement and is more robust to environmental noise.

  • Strong Sim-to-Real Transfer through Comprehensive Domain Randomization: The paper highlights the implementation of an onboard LiDAR-based elevation map and carefully designed domain randomization (including specific elevation map measurement noise) to achieve a high zero-shot sim-to-real transfer success rate, which is crucial for real-world application.

  • First Learning-Based Method for Fine-Grained Foothold Control: The authors claim that BeamDojo is the first learning-based method to achieve fine-grained foothold control on risky terrains with sparse footholds for humanoid robots, combining these specific innovations.


4. Methodology

4.1. Principles

The core principle behind BeamDojo is to develop a terrain-aware humanoid locomotion policy using reinforcement learning (RL). The primary objective is to enable humanoid robots to traverse challenging environments with sparse footholds by learning precise and stable movements. This is achieved by formulating the problem as a Markov Decision Process (MDP) and optimizing a policy to maximize discounted cumulative rewards.

The method specifically targets the unique challenges of humanoid locomotion on sparse footholds:

  1. Polygonal Foot Model: Unlike simpler point-foot models, humanoids have larger, polygonal feet that require a more sophisticated evaluation of foot placement.

  2. Sparse Reward Signals: The feedback for correct foothold placement is often sparse, making it difficult for the RL agent to learn.

  3. Inefficient Learning due to Early Termination: Missteps on risky terrain often lead to episode termination during training, which hinders exploration and makes learning inefficient.

  4. Real-world Deployment: Bridging the sim-to-real gap and ensuring robust performance in unpredictable real-world scenarios.

    BeamDojo addresses these by introducing a novel sampling-based foothold reward, a double critic architecture to handle mixed dense and sparse rewards, a two-stage RL training process for efficient exploration, and a LiDAR-based elevation map with domain randomization for robust sim-to-real transfer.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

The RL problem is formulated as a Markov Decision Process (MDP), denoted as $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, \mathcal{O}, r, \gamma \rangle$.

  • $\mathcal{S}$: The state space, which encompasses all possible configurations and conditions of the robot and its environment.

  • $\mathcal{A}$: The action space, representing all possible movements or controls the robot can execute.

  • $T$: The transition dynamics, given by $T(s' \mid s, a)$, which describes the probability of the environment transitioning to a new state $s'$ given the current state $s$ and action $a$.

  • $\mathcal{O}$: The observation space, containing the partial, possibly noisy measurements that the robot actually receives.

  • $r$: The reward function, $r(s, a)$, which provides immediate feedback for taking action $a$ in state $s$.

  • $\gamma$: The discount factor, a value between 0 and 1 ($\gamma \in [0, 1)$), which determines the present value of future rewards. A higher $\gamma$ means future rewards are considered more important.

    The primary objective is to optimize the policy $\pi(\boldsymbol{a}_t \mid \boldsymbol{s}_t)$ to maximize the discounted cumulative rewards:

    $$\max_{\pi} J(\mathcal{M}, \pi) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right].$$

    Here, $J(\mathcal{M}, \pi)$ is the expected total discounted reward and $\mathbb{E}$ denotes the expectation over all possible trajectories. Due to sensory limitations and environmental noise, the robot only has access to partial observations $\mathbf{o} \in \mathcal{O}$, which provide incomplete information about the true state. The agent therefore operates within a Partially Observable Markov Decision Process (POMDP) framework.

4.2.2. Foothold Reward

To accommodate the polygonal foot model of the humanoid robot, BeamDojo introduces a sampling-based foothold reward that evaluates how well the foot is placed on sparse footholds. This evaluation is based on the overlap between the foot's placement and designated safe areas (e.g., stones, beams).

The approach samples $n$ points on the soles of the robot's feet, as depicted in Fig. 2. The following figure (Fig. 2 from the original paper) illustrates the sampling of points on the foot for the foothold reward:

Fig. 2: Foothold Reward. We sample $n$ points under the foot. Green points indicate contact with the surface within the safe region, while red points represent those not in contact with the surface.

For each $j$-th sample on foot $i$, $d_{ij}$ denotes the global terrain height at its corresponding position. The penalty foothold reward $r_{\mathrm{foothold}}$ is defined as:

$$r_{\mathrm{foothold}} = - \sum_{i=1}^{2} \mathbb{C}_i \sum_{j=1}^{n} \mathbb{1} \{ d_{ij} < \epsilon \},$$

where:

  • $\mathbb{C}_i$: An indicator that is 1 if foot $i$ is in contact with the terrain surface, and 0 otherwise. This ensures that the reward is only calculated for feet that are currently on the ground.

  • $n$: The total number of points sampled on each foot.

  • $\mathbb{1}\{\cdot\}$: The indicator function, which returns 1 if the condition inside the braces is true, and 0 otherwise.

  • $d_{ij}$: The height of the terrain at the $j$-th sampled point on foot $i$.

  • $\epsilon$: A predefined depth tolerance threshold. When $d_{ij} < \epsilon$, the terrain height at that sample point is significantly low, implying improper foot placement outside a safe area (i.e., the foot is partially over a gap or off a foothold).

    This reward function encourages the humanoid robot to maximize the overlap between its foot placement and the safe footholds, thereby improving its terrain-awareness capabilities. Being a negative sum, the robot aims to minimize the penalty, which means maximizing the number of sampled points landing on safe areas.
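To make the reward concrete, here is a minimal vectorized sketch of how it could be computed across parallel simulation environments, assuming the foot-contact flags and the terrain heights under the sampled sole points are already available as tensors. The tensor names, shapes, and framework choice (PyTorch) are illustrative assumptions, not the paper's code.

```python
import torch

def foothold_reward(contact, sample_heights, eps=-0.1):
    """Sampling-based foothold penalty (sketch).

    contact:        (num_envs, 2) bool tensor, C_i, whether each foot touches the ground.
    sample_heights: (num_envs, 2, n) tensor, terrain height d_ij under each sampled sole point.
    eps:            depth tolerance threshold (the paper uses -0.1 m).
    """
    off_foothold = (sample_heights < eps).float()        # 1{d_ij < eps}
    per_foot = off_foothold.sum(dim=-1)                  # sum over the n samples of each foot
    penalty = (contact.float() * per_foot).sum(dim=-1)   # gate by the contact indicator C_i
    return -penalty                                      # negative sum => penalty to minimize
```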

4.2.3. Double Critic for Sparse Reward Learning

The task-specific foothold reward ($r_{\mathrm{foothold}}$) described above is a sparse reward, meaning it is only received when certain conditions are met (i.e., correct foot placement). To effectively optimize the policy when dealing with such sparse rewards, it is crucial to balance them with dense locomotion rewards, which are continuous and provide frequent feedback essential for gait regularization (learning smooth and natural walking patterns).

Inspired by previous works, BeamDojo adopts a double critic framework based on PPO (Proximal Policy Optimization). This framework decouples (separates) the learning process for different types of rewards.

Two separate critic networks, denoted as $\{ V_{\phi_1}, V_{\phi_2} \}$, are trained. Each critic independently estimates the value function for one of two distinct reward groups:

  • (i) Regular Locomotion Reward Group ($R_1$): Consists of dense rewards (e.g., forward velocity, base height, orientation, joint effort) that have been widely used in quadruped and humanoid locomotion tasks. These provide continuous feedback on the general quality of movement.

  • (ii) Task-Specific Foothold Reward Group ($R_2$): Contains the sparse reward $R_2 = \{ r_{\mathrm{foothold}} \}$, focusing on the precision of foot placement on sparse footholds.

    The double critic process is illustrated in Fig. 3. The following figure (Fig. 3 from the original paper) illustrates the double critic and two-stage RL framework:

Fig. 3: Overview of the BeamDojo framework. (a) Two-stage training: Stage 1 trains under soft terrain dynamics, combining sparse foothold rewards with perceptive observations of the task terrain; Stage 2 fine-tunes under hard terrain dynamics, optimizing both dense and sparse rewards. (b) Real-world deployment, in which the robot walks stably using a LiDAR-generated elevation map and a feedback controller.

Specifically, each value network $V_{\phi_i}$ is updated independently for its corresponding reward group $R_i$ using the temporal difference (TD) loss:

$$\mathcal{L}(\phi_i) = \mathbb{E} \left[ \left( R_{i,t} + \gamma V_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t) \right)^2 \right],$$

where:

  • $\mathcal{L}(\phi_i)$: The loss function for critic network $i$, which is minimized during training.

  • $\phi_i$: The parameters (weights) of the $i$-th critic network.

  • $R_{i,t}$: The reward received at time $t$ for reward group $i$.

  • $\gamma$: The discount factor.

  • $V_{\phi_i}(s_t)$: The value estimate (predicted cumulative future reward) of state $s_t$ by critic network $i$.

    After computing the TD loss, the respective advantages $\{ \hat{A}_{i,t} \}$ are calculated using Generalized Advantage Estimation (GAE):

    $$\delta_{i,t} = R_{i,t} + \gamma V_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t), \qquad \hat{A}_{i,t} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{i,t+l},$$

    where:

  • $\delta_{i,t}$: The temporal difference error for reward group $i$ at time $t$. It measures the discrepancy between the actual reward received plus the discounted estimated value of the next state, and the estimated value of the current state.

  • $\hat{A}_{i,t}$: The advantage estimate for reward group $i$ at time $t$. It quantifies how much better action $a_t$ was than the expected value in state $s_t$.

  • $\lambda$: A balancing parameter (between 0 and 1) for GAE, controlling the bias-variance trade-off in the advantage estimation.

    These advantages are then individually normalized and combined into an overall advantage $\hat{A}_t$:

    $$\hat{A}_t = w_1 \cdot \frac{\hat{A}_{1,t} - \mu_{\hat{A}_{1,t}}}{\sigma_{\hat{A}_{1,t}}} + w_2 \cdot \frac{\hat{A}_{2,t} - \mu_{\hat{A}_{2,t}}}{\sigma_{\hat{A}_{2,t}}},$$

    where:

  • $w_i$: The weight assigned to each advantage component. This allows the framework to prioritize or scale the influence of each reward group.

  • $\mu_{\hat{A}_{i,t}}$: The batch mean of $\hat{A}_{i,t}$ for reward group $i$.

  • $\sigma_{\hat{A}_{i,t}}$: The batch standard deviation of $\hat{A}_{i,t}$ for reward group $i$. Normalization (subtracting the mean and dividing by the standard deviation) helps stabilize the training process.

This overall advantage $\hat{A}_t$ is then used to update the actor network's policy (the part of the network that determines actions) via the PPO clipped surrogate objective:

$$\mathcal{L}(\theta) = \mathbb{E} \left[ \min \left( \alpha_t(\theta) \hat{A}_t , \ \mathrm{clip}\left(\alpha_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right],$$

where:

  • $\mathcal{L}(\theta)$: The clipped surrogate objective for the actor network, which is maximized (equivalently, its negative is minimized) to update the policy.

  • $\theta$: The parameters (weights) of the actor network.

  • $\alpha_t(\theta)$: The probability ratio between the new policy and the old policy for action $a_t$ in state $s_t$.

  • $\mathrm{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: A clipping function that limits the ratio $\alpha_t(\theta)$ to stay within the range $[1-\epsilon, 1+\epsilon]$.

  • $\epsilon$: The clipping hyperparameter in PPO, which controls how much the policy can change in a single update step, preventing overly aggressive updates.

    This double critic design makes it a modular and plug-and-play solution for handling specialized tasks with sparse rewards, effectively managing the difference in reward feedback frequencies within a mixed dense-sparse environment.
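A minimal sketch of how the two normalized advantages might be mixed and plugged into the standard PPO clipped surrogate is shown below, assuming each critic has already produced per-sample GAE advantages. The weights, clip coefficient, and function signatures are illustrative placeholders rather than the paper's implementation.

```python
import torch

def combined_advantage(adv_loco, adv_foothold, w1=1.0, w2=1.0, eps=1e-8):
    """Normalize each reward group's GAE advantage separately, then mix them."""
    a1 = (adv_loco - adv_loco.mean()) / (adv_loco.std() + eps)
    a2 = (adv_foothold - adv_foothold.mean()) / (adv_foothold.std() + eps)
    return w1 * a1 + w2 * a2

def ppo_actor_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate on the mixed advantage (negated for minimization)."""
    ratio = torch.exp(log_prob_new - log_prob_old)                          # alpha_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```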

4.2.4. Learning Terrain-Aware Locomotion via Two-Stage RL

To address the early termination problem and promote thorough trial-and-error exploration in complex terrains, BeamDojo employs a novel two-stage reinforcement learning (RL) approach for terrain-aware locomotion in simulation. This process is also depicted in Fig. 3, which was shown previously.

Stage 1: Soft Terrain Dynamics Constraints Learning. In this initial stage, the robot is trained under soft terrain dynamics constraints.

  • Setup: Each target task terrain (denoted as $\tau$, e.g., stepping stones) is mapped to a flat terrain ($\mathcal{F}$) of the same size. Both terrains share the same terrain noise, and points between them correspond one-to-one. The key difference is that the flat terrain $\mathcal{F}$ fills the gaps present in the real terrain $\tau$, making it traversable without falling into holes.
  • Training: The humanoid robot traverses the flat terrain $\mathcal{F}$, receiving proprioceptive observations (internal sensor data like joint angles and velocities). Crucially, it is simultaneously provided with perceptual feedback in the form of the elevation map of the true task terrain $\tau$ at the robot's current base position.
  • Outcome: This setup allows the robot to "imagine" walking on the challenging terrain $\tau$ while physically moving on the safer flat terrain $\mathcal{F}$. Missteps are penalized (using the foothold reward from $\tau$) but do not lead to episode termination. This relaxation of the terrain dynamics constraints allows the robot to broadly explore and develop foundational skills for terrain-aware locomotion without constantly restarting, significantly improving sampling efficiency.
  • Reward Decoupling: During this stage, locomotion rewards are primarily derived from the robot's movement on the flat terrain $\mathcal{F}$, while the foothold reward is calculated based on the true terrain $\tau$ (with $d_{ij}$ representing heights on $\tau$). These two reward components are learned separately using the double critic framework (Section 4.2.3).

Stage 2: Hard Terrain Dynamics Constraints Learning. In the second stage, the policy learned in Stage 1 is fine-tuned directly on the actual task terrain $\tau$.

  • Setup: Unlike Stage 1, missteps on $\tau$ now result in immediate termination of the episode. This introduces hard terrain dynamics constraints.

  • Training: The robot continues to optimize both the locomotion rewards and the foothold reward $r_{\mathrm{foothold}}$ using the double critic framework. The $d_{ij}$ values for the foothold reward are now taken directly from the actual terrain $\tau$.

  • Outcome: This stage forces the robot to develop precise and safe locomotion strategies to avoid termination, refining its ability to navigate challenging terrains accurately.

    This two-stage design is particularly effective because Stage 1 allows for extensive, penalty-based exploration, making it easier to accumulate successful foothold placement samples, which are difficult to obtain in conventional RL with early termination. Stage 2 then hones these skills into robust, precise real-world behavior.
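The difference between the two stages can be summarized by how a misstep is handled during an environment step. The sketch below is a schematic under assumed environment attributes (physics_step, misstep, and fell are hypothetical names), not the actual training loop.

```python
def step_env(env, action, stage):
    """Sketch of stage-dependent misstep handling.

    Stage 1 (soft dynamics): the robot physically walks on the gap-filled flat
    terrain F, and a misstep on the virtual task terrain tau only incurs the
    foothold penalty. Stage 2 (hard dynamics): the robot walks on tau itself,
    and a misstep terminates the episode.
    """
    obs, dense_rew, foothold_rew, misstep, fell = env.physics_step(action)
    done = fell if stage == 1 else (fell or misstep)
    return obs, (dense_rew, foothold_rew), done
```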

4.2.5. Training in Simulation

To facilitate training, a comprehensive simulation environment and curriculum are designed.

4.2.5.1. Observation Space and Action Space

The policy's observations, denoted as $\mathbf{o}_t$, are composed of four main components at time $t$:

$$\mathbf{o}_t = \left[ \mathbf{c}_t , \mathbf{o}_t^{\mathrm{proprio}} , \mathbf{o}_t^{\mathrm{percept}} , \mathbf{a}_{t-1} \right].$$

  • $\mathbf{c}_t \in \mathbb{R}^3$: Commands specifying the desired velocity, $\left[ v_x^c , v_y^c , \omega_{\mathrm{yaw}}^c \right]$, i.e., the desired longitudinal velocity ($v_x^c$), lateral velocity ($v_y^c$), and angular velocity in the horizontal plane ($\omega_{\mathrm{yaw}}^c$), respectively.

  • $\mathbf{o}_t^{\mathrm{proprio}} \in \mathbb{R}^{64}$: Proprioceptive observations, which are internal sensor readings of the robot itself. These include:

    • base linear velocity $\mathbf{v}_t \in \mathbb{R}^3$
    • base angular velocity $\boldsymbol{\omega}_t \in \mathbb{R}^3$
    • gravity direction in the robot's frame $\mathbf{g}_t \in \mathbb{R}^3$
    • joint positions $\boldsymbol{\theta}_t \in \mathbb{R}^{29}$ (for 29 joints)
    • joint velocities $\dot{\boldsymbol{\theta}}_t \in \mathbb{R}^{29}$ (for 29 joints)
    • joint torques $\boldsymbol{\tau}_t \in \mathbb{R}^{29}$
    • contact information (which feet are in contact with the ground).
  • $\mathbf{o}_t^{\mathrm{percept}} \in \mathbb{R}^{15 \times 15}$: Perceptive observations, an egocentric elevation map (a height map centered around the robot). This map samples $15 \times 15$ points on a 0.1 m grid in both the longitudinal (forward/backward) and lateral (sideways) directions, providing local terrain height information.

  • $\mathbf{a}_{t-1} \in \mathbb{R}^{12}$: The action taken at the last timestep. This is included to provide temporal context, helping the policy understand its previous commands and their effects.

    The action $\mathbf{a}_t \in \mathbb{R}^{12}$ represents the target joint positions for the 12 lower-body joints of the humanoid robot. These are directly output by the actor network. For the upper-body joints, a default position is used for simplicity. A proportional-derivative (PD) controller then converts these target joint positions into torques, which are the actual forces applied by the robot's motors to track the desired positions.
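For illustration, the observation assembly and the PD conversion from target joint positions to torques could look like the sketch below; the gain values, array shapes, and function names are assumptions for exposition, not the deployed controller.

```python
import numpy as np

def build_observation(cmd_vel, proprio, elevation_map, last_action):
    """Concatenate o_t = [c_t, o_proprio, o_percept, a_{t-1}].

    cmd_vel:       (3,)    desired [v_x, v_y, yaw rate]
    proprio:       (64,)   base velocities, gravity direction, joint states, contacts
    elevation_map: (15,15) egocentric height samples on a 0.1 m grid
    last_action:   (12,)   previous target positions of the lower-body joints
    """
    return np.concatenate([cmd_vel, proprio, elevation_map.ravel(), last_action])

def pd_torques(q_target, q, dq, kp=40.0, kd=1.0):
    """Minimal PD law turning target joint positions into torques (gains are placeholders)."""
    return kp * (q_target - q) - kd * dq
```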

4.2.5.2. Terrain and Curriculum Design

Inspired by previous research, five types of sparse foothold terrains are designed for the two-stage training and evaluation:

  • Stones Everywhere: A general sparse foothold terrain where stones are scattered across an 8 m × 8 m area. The center is a platform surrounded by stones (Fig. 4a). The curriculum progresses by decreasing stone size and increasing sparsity.

  • Stepping Stones: Two lines of stepping stones along the longitudinal direction, connected by platforms at each end (Fig. 4b). Stone size decreases, and sparsity increases with curriculum progression.

  • Balancing Beams: Starts with two lines of separate stones. As the curriculum progresses, stone size decreases, lateral distance reduces, eventually forming a single line of balancing beams (Fig. 4c). This requires a distinct gait.

  • Stepping Beams: A sequence of beams randomly distributed along the longitudinal direction with platforms at ends (Fig. 4d). Requires high precision.

  • Gaps: Several gaps with random distances (Fig. 4e). Requires the robot to make large steps.

    The following figure (Fig. 4 from the original paper) illustrates the terrain settings in simulation:

    Fig. 4: Terrain Setting in Simulation. (a) is used for stage 1 training, while (b) and (c) are used for stage 2 training. The training terrain progression is listed from simple to difficult. (b)-(e) are used for evaluation.

The training strategy is as follows:

  • Stage 1: The robot is initially trained on Stones Everywhere terrain with soft terrain constraints to learn a generalizable policy.

  • Stage 2: The policy is fine-tuned on Stepping Stones and Balancing Beams terrains with hard terrain constraints.

    The commands used in these two stages are detailed in Table I. The following are the results from Table I of the original paper:

| Term | Value (stage 1) | Value (stage 2) |
| --- | --- | --- |
| $v_x^c$ | U(−1.0, 1.0) m/s | U(−1.0, 1.0) m/s |
| $v_y^c$ | U(−1.0, 1.0) m/s | U(0.0, 0.0) m/s |
| $\omega_{yaw}^c$ | U(−1.0, 1.0) rad/s | U(0.0, 0.0) rad/s |

In Stage 2, only a single x-direction command ($v_x^c$) is provided, with no yaw command ($\omega_{yaw}^c$). This forces the robot to learn to consistently face forward using perceptual observations, rather than relying on continuous yaw corrections.

For evaluation, Stepping Stones, Balancing Beams, Stepping Beams, and Gaps terrains are used. The method shows zero-shot transfer capabilities on Stepping Beams and Gaps even without explicit training on them.

The curriculum is designed such that the robot progresses to the next terrain level after successfully traversing the current level three times consecutively. This prevents the robot from being sent back to easier levels until all levels are passed. The detailed settings for the terrain curriculum are provided in Appendix VI-B of the paper.
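The promotion rule described above can be captured by a few lines of bookkeeping; the sketch below assumes a per-robot success counter and is purely illustrative of the "three consecutive successes" criterion.

```python
def update_terrain_level(level, consecutive_successes, max_level):
    """Advance one terrain level after three consecutive successful traversals;
    robots are not demoted to easier levels before all levels are passed."""
    if consecutive_successes >= 3 and level < max_level:
        return level + 1, 0          # promote and reset the success counter
    return level, consecutive_successes
```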

4.2.5.3. Sim-to-Real Transfer

To enhance the robustness of the learned policy and facilitate sim-to-real transfer (applying a policy trained in simulation directly to a real robot), extensive domain randomization is employed. This involves injecting noise into various parameters:

  • Observations: Angular velocity, joint positions, joint velocities, and projected gravity.

  • Humanoid Physical Properties: Actuator offset, motor strength, payload mass, center of mass displacement, and Kp, Kd noise factors (for the PD controller).

  • Terrain Dynamics: Friction factor, restitution factor, and terrain height.

    Additionally, to address the large sim-to-real gap between the ground-truth elevation map in simulation and the LiDAR-generated map in reality (due to odometry inaccuracies, noise, jitter), four types of elevation map measurement noise are introduced during height sampling in the simulator:

  • Vertical Measurement: Random vertical offsets are applied per episode, and uniformly sampled vertical noise is added to each height sample at every timestep.

  • Map Rotation: The map is rotated in roll, pitch, and yaw to simulate odometry inaccuracies. For yaw, a random yaw noise is sampled, and the elevation map is resampled with this added noise. For roll and pitch, biases $[h_x, h_y]$ are sampled, and linear interpolation generates vertical height-map noise, which is then added to the original elevation map.

  • Foothold Extension: Random foothold points adjacent to valid footholds are extended to become valid. This simulates the smoothing effect that can occur during the processing of LiDAR elevation data.

  • Map Repeat: To simulate delays in elevation map updates, the map from the previous timestep is randomly repeated.

    The detailed domain randomization settings are provided in Appendix VI-C of the paper.
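Two of the noise types above (vertical measurement noise and map repeat) can be sketched as follows; the noise magnitudes and the 10% repeat probability are made-up examples, not the paper's randomization ranges.

```python
import numpy as np

def randomize_elevation_map(height_map, prev_map, episode_offset, rng):
    """Apply vertical measurement noise and occasionally repeat the previous map."""
    noisy = height_map + episode_offset                               # per-episode vertical offset
    noisy = noisy + rng.uniform(-0.01, 0.01, size=height_map.shape)   # per-sample vertical noise
    if prev_map is not None and rng.random() < 0.1:                   # simulate delayed map updates
        noisy = prev_map
    return noisy

# Example usage (illustrative):
# rng = np.random.default_rng(0)
# offset = rng.uniform(-0.02, 0.02)          # sampled once per episode
# noisy_map = randomize_elevation_map(gt_map, last_map, offset, rng)
```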

4.2.6. Real-world Deployment

4.2.6.1. Hardware Setup

The experiments utilize a Unitree G1 humanoid robot.

  • Weight: 35 kg.
  • Height: 1.32 m.
  • Degrees of Freedom (DoF): 23 actuated DoF (6 in each leg, 5 in each arm, 1 in the waist).
  • Onboard Computation: Jetson Orin NX.
  • Perception: Livox Mid-360 LiDAR, which provides both IMU data (Inertial Measurement Unit) and feature points.

4.2.6.2. Elevation Map and System Design

The raw point cloud data from the LiDAR is noisy and subject to occlusions. To generate a robust elevation map:

  • Odometry: Fast LiDAR-Inertial Odometry (FAST-LIO) is used. This fuses LiDAR feature points with IMU data (from the LiDAR) to produce precise odometry outputs (robot's position and orientation).
  • Mapping: The odometry outputs are then processed using robot-centric elevation mapping methods to create a grid-based representation of ground heights.
  • Control Loop:
    • The elevation map publishes information at a frequency of 10 Hz.

    • The learned policy (from BeamDojo) operates at 50 Hz.

    • The policy's action outputs are sent to a PD controller, which runs at 500 Hz, ensuring smooth and precise actuation of the robot's joints.
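The multi-rate structure of this deployment loop (10 Hz mapping, 50 Hz policy, 500 Hz PD control) could be organized roughly as below; all object interfaces (robot, policy, elevation_mapper, pd_controller) are hypothetical stand-ins for the real deployment stack.

```python
import time

def run_one_second(policy, pd_controller, elevation_mapper, robot):
    """One second of the multi-rate control loop: 50 policy steps, 10 PD ticks each."""
    latest_map = elevation_mapper.latest()           # map updated asynchronously at ~10 Hz
    for step in range(50):                           # 50 Hz policy
        obs = robot.observe(latest_map)
        q_target = policy(obs)                       # target positions for the 12 lower-body joints
        for _ in range(10):                          # 500 Hz PD control
            tau = pd_controller(q_target, robot.q, robot.dq)
            robot.apply_torques(tau)
            time.sleep(1.0 / 500.0)
        if step % 5 == 4:                            # refresh the map every 5 policy steps (~10 Hz)
            latest_map = elevation_mapper.latest()
```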


5. Experimental Setup

5.1. Datasets

The experiments primarily use simulated terrains to train and evaluate the BeamDojo framework, with real-world validation. The terrains are specifically designed to present sparse foothold challenges for humanoid robots.

The five types of sparse foothold terrains designed for two-stage training and evaluation are:

  • Stones Everywhere: A general sparse foothold terrain with stones scattered across an 8 m × 8 m area. The terrain's center is a platform surrounded by stones. As the curriculum progresses, the stone size decreases, and sparsity increases.

  • Stepping Stones: This terrain features two lines of stepping stones in the longitudinal direction, connected by two platforms at each end. Each stone is uniformly distributed in two sub-square grids, with a similar curriculum progression to Stones Everywhere (decreasing stone size, increasing sparsity).

  • Balancing Beams: Initially, this terrain has two lines of separate stones. Through curriculum progression, the stones become smaller, their lateral distance reduces, eventually forming a single line of balancing beams. This terrain is particularly challenging as it requires the robot to keep its feet close together while maintaining its center of mass.

  • Stepping Beams: Consists of a sequence of beams randomly distributed along the longitudinal direction, with platforms at either end. Along with Stones Everywhere and Stepping Stones, this terrain requires high precision in foothold placement.

  • Gaps: This terrain features several gaps with random distances between them, requiring the robot to execute large steps to cross.

    Terrain Characteristics:

  • The Stones Everywhere terrain covers an area of 8 m × 8 m.

  • Stepping Stones and Balancing Beams are 2 m in width and 8 m in length, designed for single-direction commands.

  • The depth of gaps relative to the ground is set to 1.0 m.

  • All stones and beams exhibit height variations within ±0.05 m to add realism.

  • The depth tolerance threshold $\epsilon$ for the foothold reward is set to −0.1 m, meaning a penalty is incurred if a sampled point falls 0.1 m below the expected surface.

5.2. Evaluation Metrics

The performance of BeamDojo and its baselines is evaluated using three primary metrics, along with detailed analysis of reward components.

  1. Success Rate ($R_{\mathrm{succ}}$):

    • Conceptual Definition: The percentage of trials where the robot successfully traverses the entire terrain length without falling or terminating prematurely. It quantifies the overall reliability of the locomotion policy.
    • Mathematical Formula: $R_{\mathrm{succ}} = \frac{\text{Number of successful traversals}}{\text{Total number of attempts}} \times 100\%$
    • Symbol Explanation: Not applicable beyond the intuitive definition.
  2. Traverse Rate ($R_{\mathrm{trav}}$):

    • Conceptual Definition: The ratio of the distance the robot traveled before falling or failing to the total length of the terrain. This metric measures how far the robot can get even if it does not complete the entire traversal, indicating partial success or robustness.
    • Mathematical Formula: $R_{\mathrm{trav}} = \frac{\text{Distance traveled before failure}}{\text{Total terrain length}} \times 100\%$
    • Symbol Explanation: Not applicable beyond the intuitive definition. (The total terrain length for most terrains is 8 m.)
  3. Foothold Error ($E_{\mathrm{foot}}$):

    • Conceptual Definition: The average proportion of foot samples that land outside the intended safe foothold areas. This directly quantifies the precision of the robot's foot placement. A lower value indicates more accurate stepping.
    • Mathematical Formula: The paper defines $E_{\mathrm{foot}}$ as the average proportion of foot samples landing outside the intended foothold areas, which corresponds to the $\mathbb{1}\{ d_{ij} < \epsilon \}$ term of the foothold reward. Writing the total number of foot samples per episode as $T \times 2 \times n$ (where $T$ is the episode length, 2 is the number of feet, and $n$ is the number of samples per foot) and the number of penalized samples as $\sum_{t=0}^{T} \sum_{i=1}^{2} \mathbb{C}_i \sum_{j=1}^{n} \mathbb{1} \{ d_{ij} < \epsilon \}$, the metric can be expressed as $E_{\mathrm{foot}} = \frac{\text{Number of penalized foot samples}}{\text{Total number of foot samples}} \times 100\%$. A minimal computation sketch for all three metrics is given after the reward tables below.
    • Symbol Explanation: Not applicable beyond the intuitive definition.
  4. Reward Functions (Detailed in Appendix VI-A): While not a direct evaluation metric, the composition of the reward functions is crucial for understanding what the policy optimizes. The rewards are split into two groups for the double critic.

    The following are the results from Table VII of the original paper:

| Term | Equation | Weight |
| --- | --- | --- |
| Group 1: Locomotion Reward Group | | |
| xy velocity tracking | $\exp\{-\lVert \mathbf{v}_{xy} - \mathbf{v}_{xy}^c \rVert^2 / \sigma\}$ | 1.0 |
| yaw velocity tracking | $\exp\{-(\omega_{yaw} - \omega_{yaw}^c)^2 / \sigma\}$ | 1.0 |
| base height | $(h - h_{target})^2$ | -10.0 |
| orientation | $\lVert g_x \rVert^2 + \lVert g_y \rVert^2$ | -2.0 |
| z velocity | $v_z^2$ | -2.0 |
| roll-pitch velocity | $\lVert \boldsymbol{\omega}_{xy} \rVert^2$ | -0.05 |
| action rate | $\lVert \mathbf{a}_t - \mathbf{a}_{t-1} \rVert^2$ | -0.01 |
| smoothness | $\lVert \mathbf{a}_t - 2\mathbf{a}_{t-1} + \mathbf{a}_{t-2} \rVert^2$ | -1e-3 |
| stand still | $\lVert \boldsymbol{\omega}_{\theta} \rVert^2 \cdot \mathbb{1}\{\lVert \boldsymbol{v}_y^c \rVert^2 < \epsilon\}$ | -0.05 |
| joint velocities | $\lVert \dot{\boldsymbol{\theta}} \rVert^2$ | -1e-4 |
| joint accelerations | $\lVert \ddot{\boldsymbol{\theta}} \rVert^2$ | -2.5e-8 |
| joint position limits | $\mathrm{ReLU}(\boldsymbol{\theta} - \boldsymbol{\theta}_{max}) + \mathrm{ReLU}(\boldsymbol{\theta}_{min} - \boldsymbol{\theta})$ | -5.0 |
| joint velocity limits | $\mathrm{ReLU}(\lVert \dot{\boldsymbol{\theta}} \rVert - \lVert \dot{\boldsymbol{\theta}}_{max} \rVert)$ | -1e-3 |
| joint power | $\lVert \boldsymbol{\tau} \rVert^T / (\lVert \dot{\boldsymbol{\theta}} \rVert^2 + 0.2 \lVert \boldsymbol{\omega} \rVert)$ | -2e-5 |
| feet ground parallel | $\sum_{i=1}^{2} V_a(p_{z,i})$ | -0.02 |
| feet distance | $\mathrm{ReLU}(\lVert p_{y,1} - p_{y,2} \rVert - d_{min}) \cdot F_i$ | 0.5 |
| feet air time | $\sum_{i=1}^{2} t_{air,i} / t_{air,target}$ | 1.0 |
| feet clearance | $p_{z,i} \cdot P_{xy,i}$ | -1.0 |
| Group 2: Foothold Reward Group | | |
| foothold | $-\sum_{i=1}^{2} \mathbb{C}_i \sum_{j=1}^{n} \mathbb{1}\{d_{ij} < \epsilon\}$ | 1.0 |

    The following are the results from Table VIII of the original paper:

| Symbol | Description |
| --- | --- |
| $\sigma$ | Tracking shape scale, set to 0.25. |
| $\epsilon$ | Threshold for determining zero-command in the stand-still reward, set to 0.1. |
| $\boldsymbol{\tau}$ | Computed joint torques. |
| $h_{target}$ | Desired base height relative to the ground, set to 0.725. |
| $\mathrm{ReLU}(\cdot)$ | Function that clips negative values to zero. |
| $p_i, P$ | Spatial position and velocity of all sampled points on the $i$-th foot, respectively. |
| $p_{z,target}$ | Target foot-lift height, set to 0.1. |
| $t_{air,i}$ | Air time of the $i$-th foot. |
| $t_{air,target}$ | Desired feet air time, set to 0.5. |
| $F_i$ | Indicator specifying whether foot $i$ makes first ground contact. |
| $d_{min}$ | Minimum allowable distance between the two feet, set to 0.18. |
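As referenced in Section 5.2, the three evaluation metrics can be computed from logged rollouts as in the sketch below; the episode dictionary fields are illustrative names, not the paper's logging format.

```python
import numpy as np

def evaluate_metrics(episodes, terrain_length=8.0):
    """Compute R_succ, R_trav and E_foot (all in %) from a list of episode logs."""
    r_succ = 100.0 * np.mean([ep["success"] for ep in episodes])
    r_trav = 100.0 * np.mean([min(ep["distance"], terrain_length) / terrain_length
                              for ep in episodes])
    e_foot = 100.0 * (sum(ep["penalized_samples"] for ep in episodes)
                      / max(sum(ep["total_samples"] for ep in episodes), 1))
    return r_succ, r_trav, e_foot
```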

5.3. Baselines

The BeamDojo framework, which integrates two-stage RL training and a double critic, is compared against four baselines to evaluate the effectiveness of its components. All methods are adapted to a two-stage structure for fairness. For Stage 1, training occurs on Stones Everywhere with curriculum learning. BeamDojo and BL 4 use soft terrain dynamics constraints in Stage 1, while other baselines use hard terrain dynamics constraints. Stage 2 involves fine-tuning on Stepping Stones and Balancing Beams with curriculum learning.

The baselines are:

  • BL 1) PIM: This baseline uses PIM (Perceptive Internal Model), a one-stage method originally designed for humanoid locomotion tasks on uneven terrains (such as walking up stairs). To make it comparable, the foothold reward ($r_{\mathrm{foothold}}$) is added to encourage accurate stepping on foothold areas. This baseline represents an existing state-of-the-art humanoid controller adapted to the foothold task.

  • BL 2) Naive: This is a direct, simplified implementation without the key innovations of BeamDojo. It neither includes the two-stage RL approach nor the double critic framework. Its only addition is the foothold reward. This serves as a basic implementation to highlight the necessity of the proposed architectural and training advancements.

  • BL 3) Ours w/o Soft Dyn: This is an ablation study that removes the first stage of training with soft terrain dynamics constraints. Essentially, it means the robot is trained directly with hard terrain dynamics constraints from the beginning, where missteps lead to early termination. This baseline assesses the contribution of the relaxed exploration phase.

  • BL 4) Ours w/o Double Critic: This ablation study replaces the double critic with a single critic to handle both locomotion rewards (dense) and the foothold reward (sparse). This reflects a more traditional RL design where all rewards are processed by a single value network, evaluating the importance of separating reward learning.

    The training and simulation environments are implemented in IsaacGym, a high-performance GPU-based physics simulator designed for robot learning.


6. Results & Analysis

6.1. Core Results Analysis

The quantitative results demonstrate that BeamDojo consistently outperforms the baselines across various challenging terrains and difficulty levels. The evaluation metrics used are Success Rate ($R_{\mathrm{succ}}$) and Traverse Rate ($R_{\mathrm{trav}}$).

The following are the results from Table II of the original paper:

Medium Terrain Difficulty

| Method | Stepping Stones $R_{succ}$ (%, ↑) | Stepping Stones $R_{trav}$ (%, ↑) | Balancing Beams $R_{succ}$ (%, ↑) | Balancing Beams $R_{trav}$ (%, ↑) | Stepping Beams $R_{succ}$ (%, ↑) | Stepping Beams $R_{trav}$ (%, ↑) | Gaps $R_{succ}$ (%, ↑) | Gaps $R_{trav}$ (%, ↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PIM | 71.00 (±1.53) | 78.29 (±2.49) | 74.67 (±2.08) | 82.19 (±4.96) | 88.33 (±3.61) | 93.16 (±4.78) | 98.00 (±0.57) | 99.16 (±0.75) |
| Naive | 48.33 (±6.11) | 47.79 (±5.76) | 57.00 (±7.81) | 71.59 (±8.14) | 92.00 (±2.52) | 92.67 (±3.62) | 95.33 (±1.53) | 98.41 (±0.67) |
| Ours w/o Soft Dyn | 65.33 (±2.08) | 74.62 (±1.37) | 79.00 (±2.64) | 82.67 (±2.92) | 98.67 (±2.31) | 99.64 (±0.62) | 96.33 (±1.53) | 98.60 (±1.15) |
| Ours w/o Double Critic | 83.00 (±2.00) | 86.64 (±1.96) | 88.67 (±2.65) | 90.21 (±1.95) | 96.33 (±1.15) | 98.88 (±1.21) | 98.00 (±1.00) | 99.33 (±0.38) |
| BeamDojo | 95.67 (±1.53) | 96.11 (±1.22) | 98.00 (±2.00) | 99.91 (±0.07) | 98.33 (±1.15) | 99.28 (±0.65) | 98.00 (±2.65) | 99.21 (±1.24) |

Hard Terrain Difficulty

| Method | Stepping Stones $R_{succ}$ (%, ↑) | Stepping Stones $R_{trav}$ (%, ↑) | Balancing Beams $R_{succ}$ (%, ↑) | Balancing Beams $R_{trav}$ (%, ↑) | Stepping Beams $R_{succ}$ (%, ↑) | Stepping Beams $R_{trav}$ (%, ↑) | Gaps $R_{succ}$ (%, ↑) | Gaps $R_{trav}$ (%, ↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PIM | 46.67 (±2.31) | 52.88 (±2.86) | 33.00 (±2.31) | 45.28 (±3.64) | 82.67 (±2.31) | 90.68 (±1.79) | 96.00 (±1.00) | 98.27 (±3.96) |
| Naive | 00.33 (±0.57) | 21.17 (±1.71) | 00.67 (±1.15) | 36.25 (±7.85) | 82.00 (±3.61) | 88.91 (±3.75) | 31.00 (±3.61) | 62.70 (±4.08) |
| Ours w/o Soft Dyn | 42.00 (±6.56) | 47.09 (±6.97) | 51.00 (±4.58) | 72.93 (±4.38) | 87.33 (±2.08) | 89.41 (±1.75) | 93.00 (±1.00) | 95.62 (±2.50) |
| Ours w/o Double Critic | 55.67 (±3.61) | 60.95 (±2.67) | 70.33 (±3.06) | 85.64 (±3.24) | 94.67 (±1.53) | 96.57 (±1.42) | 94.33 (±3.06) | 95.62 (±2.50) |
| BeamDojo | 91.67 (±1.33) | 94.26 (±2.08) | 94.33 (±1.53) | 95.15 (±1.82) | 97.67 (±2.08) | 98.54 (±1.43) | 94.33 (±1.15) | 97.00 (±1.30) |

Key Observations from Table II:

  • BeamDojo's Superiority: BeamDojo consistently achieves the highest Success Rate ($R_{\mathrm{succ}}$) and Traverse Rate ($R_{\mathrm{trav}}$) across all terrains (Stepping Stones, Balancing Beams, Stepping Beams, Gaps) and difficulty levels (Medium and Hard). For instance, on Hard Stepping Stones, BeamDojo achieves 91.67% $R_{\mathrm{succ}}$, significantly outperforming all baselines. On Medium Balancing Beams, it reaches 98.00% $R_{\mathrm{succ}}$ and 99.91% $R_{\mathrm{trav}}$, indicating extremely robust performance.
  • Struggles of Baselines: The Naive implementation performs very poorly, especially on Hard Stepping Stones and Balancing Beams, where its success rates drop close to 0%. This highlights the necessity of BeamDojo's specialized components. PIM, an existing humanoid controller, also degrades substantially on hard terrains, particularly Balancing Beams (33.00% $R_{\mathrm{succ}}$), demonstrating its limitations for fine-grained foothold placement.
  • Importance of Soft Dynamics and Double Critic: The ablation studies (Ours w/o Soft Dyn and Ours w/o Double Critic) show that removing either the soft terrain dynamics constraints or the double critic leads to a substantial drop in performance compared to the full BeamDojo framework, especially on the harder Stepping Stones and Balancing Beams. This confirms the critical role of both components.
  • Zero-Shot Generalization: A remarkable finding is BeamDojo's impressive zero-shot generalization capabilities. Even though the robot was only explicitly trained on Stones Everywhere (Stage 1) and Stepping Stones/Balancing Beams (Stage 2), it maintains high success rates and traverse rates on Stepping Beams and Gaps, performing comparably to the best baselines on these unseen terrains.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Foot Placement Accuracy

The Foothold Error ($E_{\mathrm{foot}}$) comparison, as shown in Fig. 5, reveals that BeamDojo achieves highly accurate foot placement. The following figure (Fig. 5 from the original paper) shows the foothold error comparison:

Fig. 5: Foothold Error Comparison. The foothold error benchmarks of all methods are evaluated on (a) stepping stones and (b) balancing beams, both tested under medium terrain difficulty.

BeamDojo consistently exhibits lower foothold error values compared to other methods, largely attributed to the contribution of the double critic. In contrast, the Naive implementation shows higher error rates, indicating a significant portion of foot placements outside safe foothold areas. This underscores BeamDojo's precision in challenging terrains.

6.2.2. Learning Efficiency

BeamDojo converges significantly faster during training, as illustrated in Fig. 6; all designs were nonetheless trained for 10,000 iterations to ensure convergence. The following figure (Fig. 6 from the original paper) shows the learning efficiency:

Fig. 6: Learning Efficiency. The learning curves show the maximum terrain levels achieved in the two training stages for all methods. Faster attainment of terrain level 8 indicates more efficient learning.

Both the two-stage training setup and the double critic contribute to this improved learning efficiency, with the two-stage setup being the more dominant factor. The Naive implementation struggles to reach higher terrain levels in both stages, highlighting its inefficiency. The two-stage design allows continuous attempts at foot placement, helping accumulate successful samples, while the double critic ensures that sparse foothold reward updates are not destabilized by noisy locomotion signals early in training.
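As a rough illustration of how the two stages differ mechanically, the sketch below contrasts their termination logic; the environment call `env.physics_step` and the returned flags are hypothetical placeholders, not the paper's actual API.

```python
def step_environment(env, action, stage):
    """Sketch of the assumed two-stage terrain-dynamics handling.

    Stage 1 (soft dynamics): the robot walks on flat ground while observing the
    task terrain; a misstep only incurs the sparse foothold penalty, so the
    episode continues and exploration is not cut short.
    Stage 2 (hard dynamics): the robot walks on the actual task terrain; a
    misstep means losing support, so the episode terminates early.
    """
    obs, dense_reward, foothold_reward, missed_foothold, fell = env.physics_step(action)

    if stage == 1:
        done = fell                      # only falling ends the episode
    else:
        done = fell or missed_foothold   # missteps also trigger early termination

    return obs, dense_reward, foothold_reward, done
```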

6.2.3. Gait Regularization

The double critic also plays a vital role in gait regularization, ensuring smoother and more natural movements. The following are the results from Table III of the original paper:

| Designs | Smoothness (↓) | Feet Air Time (↑) |
| --- | --- | --- |
| Naive | 1.7591 (±0.1316) | −0.0319 (±0.0028) |
| Ours w/o Soft Dyn | 0.9633 (±0.0526) | −0.0169 (±0.0014) |
| Ours w/o Double Critic | 1.2705 (±0.1168) | −0.0229 (±0.0033) |
| BeamDojo | 0.7603 (±0.0315) | −0.0182 (±0.0027) |

As seen in Table III, the Naive design and Ours w/o Double Critic perform worse on both the smoothness and feet air time metrics. In contrast, BeamDojo and Ours w/o Soft Dyn (which retains the double critic) achieve superior motion smoothness (lower values) and better feet air time (a smaller penalty), leading to better foot clearance. This improvement stems from the double-critic framework normalizing advantage estimates for dense and sparse rewards independently, preventing the sparse foothold reward from injecting noise into the learning of the regularization rewards.
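A minimal sketch of this separate normalization is shown below, assuming each critic's advantages are computed independently (e.g., via GAE) and then mixed; the weights `w_dense` and `w_sparse` are assumptions, not values from the paper.

```python
import torch

def combined_advantage(adv_dense, adv_sparse, w_dense=1.0, w_sparse=1.0, eps=1e-8):
    """Normalize the advantages from the dense-reward critic and the sparse
    foothold-reward critic separately, then mix them. This keeps a burst of
    sparse foothold reward from swamping the gradient signal of the dense
    regularization terms (sketch of the assumed double-critic mechanics)."""
    adv_dense = (adv_dense - adv_dense.mean()) / (adv_dense.std() + eps)
    adv_sparse = (adv_sparse - adv_sparse.mean()) / (adv_sparse.std() + eps)
    return w_dense * adv_dense + w_sparse * adv_sparse
```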

6.2.4. Foot Placement Planning

The double critic also benefits foot placement planning throughout the entire sub-process of lifting and landing a foot. The following figure (Fig. 7 from the original paper) illustrates the foot placement planning visualization:

Fig. 7: Foot Placement Planning Visualization. Two trajectories are shown for the foot placement process: the yellow line represents BeamDojo, while the red line corresponds to Ours w/o Double Critic. Points along the trajectories are marked at equal time intervals. From A to C, the method without the double critic exhibits significant adjustments only when approaching the target stone (at point B).

As visualized in Fig. 7, BeamDojo enables smoother planning, allowing the foot to precisely reach the next foothold (yellow line). Conversely, the baseline without the double critic (Ours w/o Double Critic, red line) shows more reactive stepping, with significant adjustments only made when the foot is very close to the target stone (point B). This indicates that by learning the sparse foothold reward separately, the double critic helps the policy plan its motion over a longer time horizon.

6.3. Real-world Experiments

BeamDojo successfully achieves zero-shot transfer to real-world dynamics, enabling the humanoid robot to traverse challenging terrains effectively. The success rate and traversal rate are reported in Fig. 8 for various terrains. The following figure (Fig. 8 from the original paper) illustrates the experimental results of BeamDojo for humanoid robot walking across different terrains:

Fig. 8: Real-World Experiment Results. The figure shows the humanoid robot walking on footholds of varying widths across four terrains (Stepping Stones, Balancing Beams, Stepping Beams, and Gaps), together with the corresponding success rate ($R_{\mathrm{succ}}$) and traverse rate ($R_{\mathrm{trav}}$) for each scenario and training method.

  • BeamDojo achieves a high success rate in real-world deployments, demonstrating excellent precise foot placement capabilities. It also shows impressive generalization performance on Stepping Beams and Gaps, even though these were not part of the explicit training set.
  • An ablation without height map domain randomization (ours w/o HR) results in a significantly lower success rate, underscoring the critical importance of domain randomization for robust sim-to-real transfer.
  • Notably, BeamDojo enables backward movement on risky terrains, which is a key advantage of using LiDAR for perception over single depth cameras.

6.3.1. Agility Test

To assess agility, the humanoid robot was given five commanded longitudinal velocities ($\mathbf{v}_x^c$) on stepping stones (2.8 m total length). The following are the results from Table IV of the original paper:

| $v_x^c$ (m/s) | Time Cost (s) | Average Speed (m/s) | Error Rate (%, ↓) |
| --- | --- | --- | --- |
| 0.5 | 6.33 (±0.15) | 0.45 (±0.05) | 10.67 (±4.54) |
| 0.75 | 4.33 (±0.29) | 0.65 (±0.05) | 13.53 (±6.52) |
| 1.0 | 3.17 (±0.58) | 0.88 (±0.04) | 11.83 (±8.08) |
| 1.25 | 2.91 (±0.63) | 0.96 (±0.03) | 22.74 (±5.32) |
| 1.5 | 2.69 (±0.42) | 1.04 (±0.05) | 30.68 (±6.17) |

Table IV shows that BeamDojo maintains a small tracking error up to the highest training command velocity of 1.0 m/s, achieving an average speed of 0.88 m/s, which demonstrates the policy's agility. Performance degrades noticeably at commanded velocities of 1.25 m/s and above, as maintaining such high speeds on highly challenging terrains becomes increasingly difficult, leading to higher error rates.

6.3.2. Robustness Test

The robustness of the precise foothold controller was evaluated through several real-world experiments, illustrated in Fig. 9. The following figure (Fig. 9 from the original paper) shows various robustness tests:

Fig. 9: Robustness Tests. The figure shows the robot on sparse footholds under pushing, single-leg support, standing still, stepping, missteps, and recovery, illustrating the robust gait learned by the BeamDojo framework.

  • Heavy Payload (Fig. 9a): The robot successfully carried a 10 kg payload (approximately 1.5 times its torso weight), demonstrating robust agile locomotion and precise foot placements despite a significant shift in its center of mass.
  • External Force (Fig. 9b): The robot was subjected to external pushes from various directions. It demonstrated the ability to transition from a stationary pose, endure forces while on single-leg support, and recover to a stable two-leg standing position.
  • Misstep Recovery (Fig. 9c): When traversing terrains without prior scanning (due to occlusions causing initial missteps), the robot exhibited robust recovery capabilities, indicating its adaptability to unexpected disturbances.

6.4. Extensive Studies and Analysis

6.4.1. Design of Foothold Reward

The paper compares its sampling-based foothold reward (a continuous reward proportional to the number of safe points) with binary and coarse reward designs. The following are the results from Table V of the original paper:

| Designs | $R_{\mathrm{succ}}$ (%, ↑) | $E_{\mathrm{foot}}$ (%, ↓) |
| --- | --- | --- |
| foothold-30% | 93.67 (±1.96) | 11.43 (±0.81) |
| foothold-50% | 92.71 (±1.06) | 10.78 (±1.94) |
| foothold-70% | 91.94 (±2.08) | 14.35 (±2.61) |
| BeamDojo | 95.67 (±1.53) | 7.79 (±1.33) |

Table V shows that BeamDojo's fine-grained continuous reward design yields the highest success rate and the lowest foothold error (7.79%) on stepping stones compared to the coarse-grained designs (foothold-30%, foothold-50%, foothold-70%). This confirms that gradually encouraging the foot to maximize its overlap with the safe region leads to more accurate foot placements. Among the coarse-grained designs, foothold-50% achieved the lowest foothold error, suggesting that thresholds that are too strict (30%) or too loose (70%) are less effective.
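For reference, a minimal sketch of such a continuous sampling-based reward is given below; the pre-sampled sole points and the `is_safe` lookup against the terrain map are assumptions used for illustration, not the paper's exact implementation.

```python
import numpy as np

def foothold_reward(foot_points_xy, is_safe, in_contact):
    """Continuous foothold reward for a polygonal foot: at touchdown, score the
    fraction of pre-sampled sole points that land on a safe foothold.

    foot_points_xy: (N, 2) array of world-frame xy positions of sampled sole points.
    is_safe:        callable mapping an xy point to True/False via the terrain map.
    in_contact:     True only at the step when the foot makes contact.
    """
    if not in_contact:
        return 0.0
    safe = np.array([is_safe(p) for p in foot_points_xy])
    return float(safe.mean())  # in [0, 1]; a foothold-50% style design would
                               # instead return 1.0 only if this fraction exceeds 0.5
```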

6.4.2. Design of Curriculum

The effectiveness of the terrain curriculum (Section 4.2.5.2) is validated by an ablation study without curriculum learning. The following are the results from Table VI of the original paper:

| Designs | Medium $R_{\mathrm{succ}}$ (%) | Medium $R_{\mathrm{trav}}$ (%) | Hard $R_{\mathrm{succ}}$ (%) | Hard $R_{\mathrm{trav}}$ (%) |
| --- | --- | --- | --- | --- |
| w/o curriculum-medium | 88.33 | 90.76 | 2.00 | 18.36 |
| w/o curriculum-hard | 40.00 | 52.49 | 23.67 | 39.94 |
| BeamDojo | 95.67 | 96.11 | 82.33 | 86.87 |

Table VI clearly indicates that curriculum learning significantly improves both performance and generalization across terrains of varying difficulty. Without curriculum learning, training solely on hard terrain (w/o curriculum-hard) yields very poor success (23.67%) and traverse (39.94%) rates on hard difficulty. Training only on medium difficulty (w/o curriculum-medium) also collapses on hard terrain (2.00% $R_{\mathrm{succ}}$). BeamDojo's $R_{\mathrm{succ}}$ of 82.33% on hard difficulty highlights the crucial role of progressive learning.
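A terrain curriculum of this kind typically uses a promotion/demotion rule like the sketch below; the 0.8/0.4 thresholds are assumptions, and the cap of 8 only mirrors the terrain levels plotted in Fig. 6.

```python
def update_terrain_level(level, distance_covered, course_length, max_level=8):
    """Game-inspired terrain curriculum sketch: promote robots that traverse
    most of their course, demote those that barely progress (thresholds assumed)."""
    if distance_covered > 0.8 * course_length:
        return min(level + 1, max_level)
    if distance_covered < 0.4 * course_length:
        return max(level - 1, 0)
    return level
```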

6.4.3. Design of Commands

In Stage 2, BeamDojo trains without an explicit heading command ($\omega_{\mathrm{yaw}}^c = 0$), requiring the robot to learn to face forward using its perceptive observations. This is compared against a design that includes a heading command (ours w/ heading command), which applies corrective yaw commands based on the directional error. In real-world trials on stepping stones, BeamDojo succeeded in 4/5 runs, while the heading-command design succeeded in only 1/5. The heading-command approach performed poorly because the model overfits to angular velocities in simulation (making it sensitive to noisy real-world odometry) and because it requires precise manual calibration of the initial position in the real world. BeamDojo's approach without continuous yaw correction proves more robust.
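For clarity, the "w/ heading command" baseline can be sketched as a proportional yaw-rate correction on the heading error; the gain and clipping values below are assumptions, while BeamDojo itself simply fixes $\omega_{\mathrm{yaw}}^c = 0$.

```python
import numpy as np

def corrective_yaw_command(current_yaw, target_heading, gain=0.5, max_rate=1.0):
    """Yaw-rate command proportional to the wrapped heading error, as assumed
    for the 'w/ heading command' baseline. In deployment this depends on a yaw
    estimate from odometry, which is exactly the noisy quantity BeamDojo avoids
    relying on."""
    error = np.arctan2(np.sin(target_heading - current_yaw),
                       np.cos(target_heading - current_yaw))  # wrap to [-pi, pi]
    return float(np.clip(gain * error, -max_rate, max_rate))
```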

6.4.4. Generalization to Non-Flat Terrains

BeamDojo demonstrates good generalization to other non-flat terrains such as stairs and slopes. The following figure (Fig. 10 from the original paper) shows generalization tests on non-flat terrains:

Fig. 10: Generalization Test on Non-Flat Terrains. We conduct real-world experiments on (a) stairs with a width of 25 cm and a height of 15 cm, and (b) slopes with a 15-degree incline.

Real-world experiments (Fig. 10) on stairs (25 cm width, 15 cm height) and slopes (15-degree incline) show success rates of 8/10 and 10/10, respectively. The main adaptation required for these terrains is calculating the base height reward relative to the foot height rather than the ground height. For these terrains, the Stage 1 pre-training (with soft dynamics) becomes unnecessary because footholds are not sparse.
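The adaptation described above might look like the sketch below, where the base-height term is measured against the stance-foot height instead of a fixed ground plane; the target height and Gaussian scale are hypothetical constants, not values from the paper.

```python
import numpy as np

def base_height_reward(base_height, stance_foot_height, target=0.72, sigma=0.05):
    """Reward the base staying near a target height above the stance foot,
    which generalizes naturally to stairs and slopes (sketch; constants assumed)."""
    error = (base_height - stance_foot_height) - target
    return float(np.exp(-(error / sigma) ** 2))
```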

6.4.5. Failure Cases

The framework's performance limitations were investigated by varying stone sizes and step distances. The following figure (Fig. 11 from the original paper) shows failure case analysis:

Fig. 11: Failure Case Analysis. We evaluate the success rate on varying (a) stone sizes, and (b) step distances. 该图像是一个图表,展示了在不同石头大小和步距下的成功率分析。左侧图表显示了最小石头大小(从10cm到20cm)对成功率的影响,右侧图表显示了最大步距(从45cm到55cm)的影响。整体趋势显示,随着石头大小和步距的增加,成功率有所下降。

As shown in Fig. 11, while tougher training settings improve adaptability, performance drops sharply on 10 cm stones (approximately half the foot length) and 55 cm steps (roughly equal to the leg length). In these extreme scenarios, the difficulty shifts towards maintaining balance on very small footholds and executing larger strides, which the current reward function does not adequately address.

6.5. Domain Randomization Settings

The following are the domain randomization settings from Table IX of the original paper:

| Term | Value |
| --- | --- |
| Observations | |
| angular velocity noise | U(−0.5, 0.5) rad/s |
| joint position noise | U(−0.05, 0.05) rad |
| joint velocity noise | U(−2.0, 2.0) rad/s |
| projected gravity noise | U(−0.05, 0.05) |
| Humanoid Physical Properties | |
| actuator offset | U(−0.05, 0.05) rad |
| motor strength noise | U(0.9, 1.1) |
| payload mass | U(−2.0, 2.0) kg |
| center of mass displacement | U(−0.05, 0.05) m |
| Kp, Kd noise factor | U(0.85, 1.15) |
| Terrain Dynamics | |
| friction factor | U(0.4, 1.0) |
| restitution factor | U(0.0, 1.0) |
| terrain height noise | U(−0.02, 0.02) m |
| Elevation Map | |
| vertical offset | U(−0.03, 0.03) m |
| vertical noise | U(−0.03, 0.03) m |
| map roll, pitch rotation noise | U(−0.03, 0.03) rad |
| map yaw rotation noise | U(−0.2, 0.2) rad |
| foothold extension probability | 0.6 |
| map repeat probability | 0.2 |
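As an illustration of how such ranges are typically applied, the sketch below draws one set of randomization parameters at environment reset; the dictionary keys and the small subset of terms are chosen for brevity, and only the numeric bounds come from the table above.

```python
import numpy as np

# Subset of the uniform ranges from Table IX, as (low, high) bounds.
RANDOMIZATION_RANGES = {
    "friction_factor":       (0.4, 1.0),
    "payload_mass_kg":       (-2.0, 2.0),
    "kp_kd_noise_factor":    (0.85, 1.15),
    "map_vertical_offset_m": (-0.03, 0.03),
}

def sample_randomization(rng=None):
    """Draw one set of domain-randomization parameters for an episode (sketch;
    the actual framework applies these inside the simulator at reset)."""
    rng = rng or np.random.default_rng()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

print(sample_randomization())
```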

6.6. Hyperparameters

The following are the training hyperparameters from Table X of the original paper:

| Hyperparameter | Value |
| --- | --- |
| General | |
| num of robots | 4096 |
| num of steps per iteration | 100 |
| num of epochs | 5 |
| learning rate | 1e−8 |
| PPO | |
| clip range | 0.2 |
| entropy coefficient | 0.01 |
| discount factor $\gamma$ | 0.99 |
| GAE balancing factor $\lambda$ | 0.95 |
| desired KL-divergence | 0.01 |
| BeamDojo | |
| actor and double critic NN | MLP, hidden units |
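For convenience, the PPO-related entries above can be grouped into a single config sketch; the field names are illustrative, and only values explicitly listed in the table are included (truncated entries are omitted).

```python
# PPO settings from Table X gathered into one place (field names are illustrative).
PPO_CONFIG = {
    "num_envs": 4096,            # "num of robots" simulated in parallel
    "steps_per_iteration": 100,
    "num_epochs": 5,
    "clip_range": 0.2,
    "entropy_coef": 0.01,
    "gamma": 0.99,               # discount factor
    "gae_lambda": 0.95,          # GAE balancing factor
    "desired_kl": 0.01,          # target KL, typically used for adaptive learning rates
}
```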

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces BeamDojo, a novel reinforcement learning (RL) framework designed to enable humanoid robots to achieve agile and robust locomotion on sparse foothold terrains. The key contributions include:

  • Accurate Foot Placement: A unique sampling-based foothold reward tailored for polygonal feet ensures precise foot placements by continuously encouraging maximization of contact area with safe regions.
  • Efficient and Effective Training: A two-stage RL training process (soft constraints for exploration, hard constraints for fine-tuning) addresses the early termination problem and improves sample efficiency. Coupled with a double critic architecture, the framework effectively handles the balance between sparse foothold rewards and dense locomotion rewards, leading to regular gait patterns and improved long-distance foot placement planning.
  • Real-world Agility and Robustness: Extensive simulation and real-world experiments on a Unitree G1 Humanoid robot demonstrate BeamDojo's ability to achieve high success rates and agile locomotion in physical environments, even under significant external disturbances (payloads, pushes) and real-world variability. The use of a LiDAR-based elevation map also enables stable backward walking, an advantage over depth camera-based systems. Zero-shot generalization to unseen terrains like Stepping Beams and Gaps further highlights the framework's robustness.

7.2. Limitations & Future Work

The authors acknowledge several limitations of the current BeamDojo framework and suggest directions for future work:

  • Perception Module Limitations: The performance is significantly constrained by inaccuracies in the LiDAR odometry, jitter, and map drift. These issues pose considerable challenges for real-world deployment, making the system less adaptable to sudden disturbances or dynamic changes in the environment (e.g., jitter of stones, which is hard to simulate accurately).
  • Underutilization of Elevation Map Information: The current method does not fully leverage all the information provided by the elevation map.
  • Challenges with Significant Foothold Height Variations: The framework has not adequately addressed terrains with significant variations in foothold height.
  • Future Work: The authors aim to develop a more generalized controller that can enable agile locomotion across a broader range of complex terrains, including those requiring footstep planning (e.g., stairs) and those with large elevation changes.

7.3. Personal Insights & Critique

BeamDojo presents a comprehensive and well-engineered solution to a very challenging problem in humanoid robotics. The integration of multiple carefully designed components—specifically the polygonal foot reward, double critic, and two-stage training—is what makes this work stand out.

One of the most inspiring aspects is the methodical approach to tackling the sparse reward and early termination problems. The two-stage training with soft dynamics is a clever way to enable sufficient exploration in complex environments without constant resets, which is a major bottleneck in many RL applications. The double critic further refines this by ensuring that foundational locomotion skills are learned without being disrupted by the sparse, high-penalty foothold rewards. This detailed reward engineering and architectural design are critical to the observed success rates.

The robust sim-to-real transfer is another strong point. The extensive domain randomization, particularly the inclusion of elevation map measurement noise, shows a deep understanding of the practical challenges in real-world robotics. This level of detail in simulation design is often underestimated but is crucial for successful deployment.

However, the identified limitations also offer interesting avenues for future research. The reliance on the perception module highlights a common bottleneck in embodied AI; even the best control policy cannot compensate for poor or noisy sensory input. Future work could explore more advanced sensor fusion techniques or predictive perception to mitigate LiDAR inaccuracies and delays.

The current reward function's struggle with very small stones or large steps suggests that incorporating more explicit balance rewards or kinematic feasibility constraints directly into the reward structure, or perhaps through a more advanced model-predictive control layer, could further enhance performance in extreme scenarios. The idea of "not fully leveraging elevation map information" is also intriguing; it implies that there might be richer features or higher-level planning opportunities from the elevation map that could be exploited for even more intelligent and adaptive behaviors.

Overall, BeamDojo provides a significant step forward for humanoid locomotion on sparse terrains, demonstrating how thoughtful RL algorithm design and meticulous system integration can lead to highly capable and robust robotic systems.
