BeamDojo: Learning Agile Humanoid Locomotion on Sparse Footholds
TL;DR Summary
BeamDojo is a reinforcement learning framework for agile humanoid locomotion on sparse footholds, built on a sampling-based foothold reward and a double-critic architecture. A two-stage learning approach balances dense and sparse rewards, achieving efficient learning and high success rates in dynamic real-world settings.
Abstract
Traversing risky terrains with sparse footholds poses a significant challenge for humanoid robots, requiring precise foot placements and stable locomotion. Existing learning-based approaches often struggle on such complex terrains due to sparse foothold rewards and inefficient learning processes. To address these challenges, we introduce BeamDojo, a reinforcement learning (RL) framework designed for enabling agile humanoid locomotion on sparse footholds. BeamDojo begins by introducing a sampling-based foothold reward tailored for polygonal feet, along with a double critic to balance the learning process between dense locomotion rewards and sparse foothold rewards. To encourage sufficient trial-and-error exploration, BeamDojo incorporates a two-stage RL approach: the first stage relaxes the terrain dynamics by training the humanoid on flat terrain while providing it with task-terrain perceptive observations, and the second stage fine-tunes the policy on the actual task terrain. Moreover, we implement an onboard LiDAR-based elevation map to enable real-world deployment. Extensive simulation and real-world experiments demonstrate that BeamDojo achieves efficient learning in simulation and enables agile locomotion with precise foot placement on sparse footholds in the real world, maintaining a high success rate even under significant external disturbances.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is BeamDojo: Learning Agile Humanoid Locomotion on Sparse Footholds. It focuses on developing a robust reinforcement learning framework that enables humanoid robots to traverse challenging terrains with limited, sparse areas for foot placement.
1.2. Authors
The authors of this paper are:
- Huayi Wang
- Zirui Wang
- Junli Ren
- Qingwei Ben
- Tao Huang
- Weinan Zhang
- Jiangmiao Pang
Their affiliations include Shanghai AI Laboratory, Shanghai Jiao Tong University, Zhejiang University, The University of Hong Kong, and The Chinese University of Hong Kong. Weinan Zhang and Jiangmiao Pang are noted as corresponding authors.
1.3. Journal/Conference
This paper is published as an arXiv preprint and has a publication date of 2025-02-14. While not yet in a specific journal or conference, arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, and related disciplines. The related works section frequently cites papers from major robotics and machine learning conferences such as Conference on Robot Learning (CoRL), IEEE International Conference on Robotics and Automation (ICRA), and IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), indicating that the work is situated within the leading research in these fields.
1.4. Publication Year
The paper was released in 2025, with the preprint dated 2025-02-14.
1.5. Abstract
Traversing risky terrains with sparse footholds poses a significant challenge for humanoid robots, requiring precise foot placements and stable locomotion. Existing learning-based approaches often struggle on such complex terrains due to sparse foothold rewards and inefficient learning processes. To address these challenges, we introduce BeamDojo, a reinforcement learning (RL) framework designed for enabling agile humanoid locomotion on sparse footholds. BeamDojo begins by introducing a sampling-based foothold reward tailored for polygonal feet, along with a double critic to balance the learning process between dense locomotion rewards and sparse foothold rewards. To encourage sufficient trial-and-error exploration, BeamDojo incorporates a two-stage RL approach: the first stage relaxes the terrain dynamics by training the humanoid on flat terrain while providing it with task-terrain perceptive observations, and the second stage fine-tunes the policy on the actual task terrain. Moreover, we implement an onboard LiDAR-based elevation map to enable real-world deployment. Extensive simulation and real-world experiments demonstrate that BeamDojo achieves efficient learning in simulation and enables agile locomotion with precise foot placement on sparse footholds in the real world, maintaining a high success rate even under significant external disturbances.
1.6. Original Source Link
The official source link for the paper is:
- https://arxiv.org/abs/2502.10363 (abstract and main page)
- https://arxiv.org/pdf/2502.10363v3.pdf (PDF)

This is a preprint, meaning it has not yet undergone formal peer review and publication in a conference proceeding or journal.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is enabling humanoid robots to traverse risky terrains with sparse footholds. This is a critical challenge in robotics because such environments, like stepping stones or balancing beams, demand extremely precise foot placements and stable locomotion to prevent the robot from falling.
The problem is important because existing learning-based approaches, which have shown impressive results in various locomotion tasks (e.g., walking, stair climbing), often struggle with these specific types of complex terrains. This struggle arises from several key challenges and gaps in prior research:
- Sparse Foothold Rewards: The reward signal (feedback) for correct foot placement is often sparse, meaning it is only given after a complete sub-process (such as lifting and landing a foot). This makes it difficult for the robot to learn which specific actions contributed to success or failure.
- Inefficient Learning Processes: A single misstep on challenging terrain can lead to early termination of a training episode. This limits the trial-and-error exploration crucial for reinforcement learning and makes the learning process highly inefficient.
- Foot Geometry Differences: Many existing methods, particularly for quadrupedal robots, model feet as simple points. Humanoid robots, however, typically have polygonal feet (flat, larger foot surfaces), which introduces complexities for foothold evaluation and online planning that point-based models cannot address.
- High Degrees of Freedom & Instability: Humanoid robots inherently have more degrees of freedom (DoF) and a more unstable morphology than quadrupeds, making agile and stable locomotion on risky terrains even more difficult.

The paper's innovative idea, or entry point, is to combine a novel reward design, a specialized learning architecture, and a two-stage training process specifically tailored to these humanoid locomotion challenges on sparse footholds. The name BeamDojo itself reflects this: beam for sparse footholds and dojo for a place of training.
2.2. Main Contributions / Findings
The paper introduces BeamDojo, a novel reinforcement learning (RL) framework, and makes several primary contributions to enabling agile humanoid locomotion on sparse footholds:
- Novel Foothold Reward Design: A sampling-based foothold reward specifically designed for polygonal feet. The reward is continuous (proportional to the overlap between the foot and the safe foothold), encouraging precise foot placements more effectively than binary or coarse rewards.
- Double Critic Architecture: To balance the sparse foothold reward with dense locomotion rewards (essential for gait regularization), BeamDojo incorporates a double critic into its Proximal Policy Optimization (PPO) framework. This allows independent estimation and normalization of the value functions for the two reward groups, improving learning stability and gait quality.
- Two-Stage RL Training Approach: A two-stage RL strategy tackles the early-termination problem and promotes sufficient trial-and-error exploration.
  - Stage 1 (Soft Terrain Dynamics Constraints): The robot trains on flat terrain but receives perceptive observations of the challenging task terrain. Missteps incur penalties but do not terminate the episode, allowing foundational skills and broad exploration.
  - Stage 2 (Hard Terrain Dynamics Constraints): The policy is fine-tuned on the actual task terrain, where missteps lead to termination, forcing precise adherence to the environmental constraints.
- Real-World Deployment with LiDAR-based Elevation Map: The framework integrates an onboard LiDAR-based elevation map to enable real-world deployment. This perceptual input, combined with carefully designed domain randomization in simulation, facilitates robust sim-to-real transfer.
- Demonstrated Agile & Robust Locomotion: Extensive simulations and real-world experiments on a Unitree G1 humanoid robot demonstrate that BeamDojo achieves:
  - Efficient learning in simulation.
  - Agile locomotion with precise foot placement on sparse footholds in the real world.
  - High success rates even under significant external disturbances (e.g., heavy payloads, external pushes).
  - Impressive zero-shot generalization to terrains not explicitly seen during training (e.g., Stepping Beams and Gaps).
  - The ability to walk backward on risky terrains, leveraging LiDAR effectively.

In essence, the paper provides a comprehensive, learning-based solution to a challenging problem in humanoid robotics, integrating perception, reward shaping, and training strategies to achieve high-performance locomotion in complex environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the BeamDojo framework, a foundational understanding of several key concepts in robotics and machine learning is beneficial.
- Humanoid Robots: Robots designed to mimic the human body, typically having two legs, two arms, and a torso. Their high degrees of freedom (DoF) (many joints allowing movement) and inherently unstable morphology (being tall and bipedal) make stable locomotion challenging, especially on uneven terrain. The paper specifically highlights their polygonal feet (flat foot surfaces) as a key differentiator from simpler point-foot models.
- Legged Locomotion: The study of how robots move using legs. This involves complex control challenges to maintain balance, coordinate joint movements, and interact with the environment, particularly on non-flat or irregular surfaces. Sparse footholds refer to situations where there are only small, separated areas on which the robot's feet can safely land.
- Reinforcement Learning (RL): A machine learning paradigm in which an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.
  - Agent: The entity that learns and acts (here, the control policy for the humanoid robot).
  - Environment: The world in which the agent operates (the terrain and physics simulation).
  - State ($s$): A complete description of the environment at a given time (e.g., the robot's joint angles, velocities, base orientation, and terrain information).
  - Action ($a$): The decision made by the agent (e.g., target joint positions for the robot).
  - Reward ($r$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was. The agent's goal is to maximize the sum of future rewards.
  - Policy ($\pi$): A function that maps states to actions, determining the agent's behavior.
  - Value Function: Estimates the expected cumulative reward from a given state (or state-action pair).
  - Exploration vs. Exploitation: The fundamental trade-off in RL. Exploration involves trying new actions to discover better strategies, while exploitation uses the currently known best strategy to maximize reward.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision maker. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$:
  - $\mathcal{S}$: A set of possible states.
  - $\mathcal{A}$: A set of possible actions.
  - $\mathcal{P}$: A transition function giving the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
  - $r$: A reward function $r(s, a)$ giving the immediate reward received after taking action $a$ in state $s$.
  - $\gamma$: A discount factor (between 0 and 1) that determines the present value of future rewards.
- Partially Observable Markov Decision Process (POMDP): An extension of MDPs in which the agent does not have direct access to the true state of the environment but instead receives observations that are probabilistically related to the state. This is relevant for robots whose sensors provide incomplete or noisy information.
- Proximal Policy Optimization (PPO): A popular reinforcement learning algorithm from the family of policy gradient methods. PPO iteratively updates the policy toward an optimum and is known for its stability and data efficiency. Key components include:
  - Actor-Critic Architecture: PPO uses two neural networks: an actor network that learns the policy (what actions to take) and a critic network that learns the value function (how good a state is).
  - Clipping: A mechanism that limits the size of policy updates, preventing large, destructive changes that could destabilize training.
  - Generalized Advantage Estimation (GAE): A technique that provides more stable and effective estimates of the advantage function, which measures how much better an action is compared to the average action in a given state.
- LiDAR (Light Detection and Ranging): A remote sensing method that uses pulsed laser light to measure distances to surfaces. In robotics, LiDAR is used to create 3D point clouds of the environment, which can then be processed to generate elevation maps.
- Elevation Map: A representation of terrain height, often as a grid in which each cell stores a height value. A robot-centric elevation map is centered around the robot, providing local terrain information relevant to its immediate surroundings.
- Domain Randomization: A technique used to bridge the sim-to-real gap (the discrepancy between simulation and real-world performance). By randomizing various simulation parameters (e.g., physics properties, sensor noise, textures), the RL policy learns to be robust to variations, making it more likely to perform well in the real world, where conditions are never perfectly known or constant.
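To make the domain-randomization idea concrete, here is a minimal sketch of how simulator parameters are commonly resampled at the start of each training episode. The parameter names and ranges are illustrative placeholders, not the paper's actual settings.

```python
import numpy as np

def randomize_episode(rng: np.random.Generator) -> dict:
    """Resample physical parameters so the policy never overfits to one setting.

    All names and ranges below are illustrative, not the paper's values.
    """
    return {
        "friction": rng.uniform(0.4, 1.2),        # ground friction coefficient
        "payload_mass": rng.uniform(-1.0, 3.0),   # extra mass added to the torso (kg)
        "motor_strength": rng.uniform(0.9, 1.1),  # scale on commanded torques
        "kp_scale": rng.uniform(0.95, 1.05),      # PD gain perturbation
        "obs_noise_std": 0.01,                    # Gaussian noise added to observations
    }

params = randomize_episode(np.random.default_rng(0))
print(params)
```

Because each episode sees a slightly different "physics", the learned policy must succeed across the whole distribution of parameters rather than a single idealized simulator.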
3.2. Previous Works
The paper discusses existing approaches to legged locomotion, highlighting their strengths and limitations, especially concerning humanoid robots on sparse footholds.
- Quadrupedal Robot Locomotion: Existing learning-based methods have effectively addressed sparse foothold traversal for quadrupedal robots (four-legged robots). These methods often model the robot's foot as a point, which simplifies foothold evaluation and planning. However, the paper emphasizes that such methods encounter great challenges when applied to humanoid robots due to the fundamental difference in foot geometry (humanoids have polygonal feet).
- Model-Based Hierarchical Controllers: Traditionally, locomotion on sparse footholds for legged robots has been tackled with model-based hierarchical controllers. These controllers decompose the complex task into stages:
  - Perception: Gathering information about the environment.
  - Planning: Generating a sequence of footsteps or trajectories.
  - Control: Executing the planned movements.
  While effective, these methods are sensitive to violations of model assumptions and can be computationally burdensome for online planning, especially for polygonal feet, which require additional half-space constraints (linear inequalities).
- Hybrid Methods (RL + Model-Based): Some works combine the strengths of RL and model-based controllers. For instance, RL can be used to generate trajectories that are then tracked by model-based controllers, or RL policies can track trajectories from model-based planners. The paper notes that these decoupled architectures can constrain adaptability and coordination.
- End-to-End Learning Frameworks: More recent approaches use end-to-end learning, where RL directly learns perceptive locomotion controllers for sparse footholds. Many of these focus on quadrupeds and rely on depth cameras for exteroceptive observations (observations from external sensors). However, depth cameras have narrow fields of view and often require an image-processing module to bridge the sim-to-real gap between depth images and terrain heightmaps. This limits robot movement (e.g., only moving backward) and adds complexity.
- Two-Stage Training Frameworks in RL: Previous two-stage training frameworks have been used in RL to bridge the sim-to-real gap primarily in the observation space. This typically means training in simulation with noisy observations to mimic reality, then fine-tuning. BeamDojo differentiates itself by introducing a novel two-stage training approach aimed specifically at improving sample efficiency, particularly by addressing the early-termination problem when learning to walk on sparse terrains.
3.3. Technological Evolution
The field of legged locomotion, especially for humanoid robots on challenging terrains, has evolved significantly.
- Early Stages (Model-Based Control): Initially, model-based control dominated. This involved precise mathematical models of the robot and environment to calculate optimal movements. While providing high precision for well-defined tasks, these methods struggled with real-world variability and unstructured environments due to their reliance on perfect models.
- Emergence of Learning-Based Methods (RL): With advances in computational power and deep learning, reinforcement learning emerged as a powerful tool. RL allowed robots to learn complex behaviors directly from trial and error, leading to more robust and adaptive locomotion across various tasks (walking, stair climbing, parkour). However, early RL applications often faced challenges such as sample inefficiency (requiring huge amounts of data) and sim-to-real gaps.
- Addressing Challenges in RL (PPO, GAE, Domain Randomization): Algorithms like PPO and techniques like GAE improved the stability and efficiency of RL. Domain randomization became crucial for transferring policies learned in simulation to the real world, making sim-to-real transfer more feasible.
- Specialization for Complex Terrains (Perception-Driven RL): The focus shifted toward enabling locomotion on increasingly complex and risky terrains. This necessitated better perceptive capabilities (e.g., depth cameras, LiDAR) and perception-aware RL policies. Initial efforts concentrated on quadrupedal robots due to their relative stability.
- Current Frontier (Humanoid Locomotion on Sparse Footholds): The current work, BeamDojo, represents a step forward by tackling the more difficult problem of humanoid locomotion on sparse footholds. It builds on prior RL successes but introduces specific innovations to overcome the unique challenges posed by humanoid kinematics, polygonal feet, and sparse reward signals.
3.4. Differentiation Analysis
Compared to the main methods in related work, BeamDojo offers several core differences and innovations:
- Humanoid-Specific Foothold Handling: Unlike most quadrupedal methods, which simplify feet to points, BeamDojo explicitly addresses the challenge of polygonal feet on humanoid robots. Its sampling-based foothold reward is tailored to evaluate the overlap of a flat foot surface with safe regions, a crucial distinction for humanoid stability.
- Addressing Sparse Rewards with a Double Critic: While other RL methods also encounter sparse-reward problems, BeamDojo introduces a double-critic architecture. This design decouples the learning of dense locomotion rewards (for general gait) from the sparse foothold reward (for precise placement), allowing both to be optimized effectively and preventing the sparse reward from destabilizing dense-reward learning. It is a modular, plug-and-play solution.
- Novel Two-Stage RL for Sample Efficiency: BeamDojo's two-stage RL approach is specifically designed to improve sample efficiency and overcome the early-termination problem prevalent on complex terrains. By initially training under soft terrain dynamics (flat terrain with perceptive feedback of the sparse terrain), it encourages broad exploration without frequent episode resets, a significant departure from standard one-stage or sim-to-real-focused two-stage methods.
- Robust LiDAR-Based Perception: Instead of depth cameras (which often have narrow fields of view and limit backward movement), BeamDojo uses a LiDAR-based elevation map. This enables robot-centric perception that supports both forward and backward movement and is more robust to environmental noise.
- Strong Sim-to-Real Transfer through Comprehensive Domain Randomization: The paper highlights the onboard LiDAR-based elevation map and carefully designed domain randomization (including specific elevation-map measurement noise) as the keys to a high zero-shot sim-to-real transfer success rate, which is crucial for real-world application.
- First Learning-Based Method for Fine-Grained Foothold Control: The authors claim that BeamDojo is the first learning-based method to achieve fine-grained foothold control on risky terrains with sparse footholds for humanoid robots, combining these specific innovations.
4. Methodology
4.1. Principles
The core principle behind BeamDojo is to develop a terrain-aware humanoid locomotion policy using reinforcement learning (RL). The primary objective is to enable humanoid robots to traverse challenging environments with sparse footholds by learning precise and stable movements. This is achieved by formulating the problem as a Markov Decision Process (MDP) and optimizing a policy to maximize discounted cumulative rewards.
The method specifically targets the unique challenges of humanoid locomotion on sparse footholds:
- Polygonal Foot Model: Unlike simpler point-foot models, humanoids have larger, polygonal feet that require a more sophisticated evaluation of foot placement.
- Sparse Reward Signals: The feedback for correct foothold placement is often sparse, making it difficult for the RL agent to learn.
- Inefficient Learning due to Early Termination: Missteps on risky terrain often terminate the training episode, which hinders exploration and makes learning inefficient.
- Real-world Deployment: Bridging the sim-to-real gap and ensuring robust performance in unpredictable real-world scenarios.

BeamDojo addresses these by introducing a novel sampling-based foothold reward, a double-critic architecture to handle mixed dense and sparse rewards, a two-stage RL training process for efficient exploration, and a LiDAR-based elevation map with domain randomization for robust sim-to-real transfer.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
The RL problem is formulated as a Markov Decision Process (MDP), denoted as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$.
- $\mathcal{S}$: The state space, which encompasses all possible configurations and conditions of the robot and its environment.
- $\mathcal{A}$: The action space, representing all possible movements or controls the robot can execute.
- $\mathcal{P}$: The transition dynamics $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, which describe the probability of the environment transitioning to a new state $s_{t+1}$ given the current state $s_t$ and action $a_t$.
- $r$: The reward function $r(s, a)$, which provides immediate feedback for taking action $a$ in state $s$.
- $\gamma$: The discount factor, a value between 0 and 1, which determines the present value of future rewards. A higher $\gamma$ means future rewards are weighted more heavily.

The primary objective is to optimize the policy $\pi$ to maximize the discounted cumulative reward:
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$
where $J(\pi)$ is the expected total discounted reward and $\mathbb{E}_{\pi}$ denotes the expectation over trajectories induced by the policy. Due to sensory limitations and environmental noise, the robot only has access to partial observations $o_t$, which contain incomplete information about the true state. This means the agent effectively operates within a Partially Observable Markov Decision Process (POMDP) framework.
4.2.2. Foothold Reward
To accommodate the polygonal foot model of the humanoid robot, BeamDojo introduces a sampling-based foothold reward that evaluates how well the foot is placed on sparse footholds. This evaluation is based on the overlap between the foot's placement and designated safe areas (e.g., stones, beams).
The approach samples points on the soles of the robot's feet, as depicted in Fig. 2. The following figure (Fig. 2 from the original paper) illustrates the sampling of points on the foot for the foothold reward:
Figure 2 (caption, translated): Illustration of the sampling-based foothold reward on sparse footholds. The left side shows the region beneath the robot's feet as it walks on unstable terrain; the magnified view on the right shows the sampled sole points, with green points indicating samples landing inside the safe area and red points indicating samples off the foothold. The reward is assigned according to how many sampled points land on safe area, promoting learning and adaptation.
For the $j$-th sample point on foot $i$, let $h_{i,j}$ denote the global terrain height at its corresponding position. The penalty-style foothold reward is defined as:
$$r_{\text{foothold}} = -\sum_{i \in \{1, 2\}} \mathbb{1}[\text{foot } i \text{ in contact}] \cdot \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}\{h_{i,j} < -\delta\},$$
where:
- $\mathbb{1}[\text{foot } i \text{ in contact}]$: An indicator that is 1 if foot $i$ is in contact with the terrain surface and 0 otherwise. This ensures the reward is only computed for feet that are currently on the ground.
- $M$: The total number of points sampled on each foot.
- $\mathbb{1}\{\cdot\}$: The indicator function, which returns 1 if the condition inside the braces is true and 0 otherwise.
- $h_{i,j}$: The terrain height at the $j$-th sampled point on foot $i$.
- $\delta$: A predefined depth tolerance threshold. When $h_{i,j} < -\delta$, the terrain height at that sample point is significantly low, implying improper foot placement outside a safe area (i.e., the foot is partially over a gap or off a foothold).

This reward encourages the humanoid robot to maximize the overlap between its foot placement and the safe footholds, thereby improving its terrain-awareness capabilities. Being a negative sum, it is a penalty the robot seeks to minimize, which amounts to maximizing the number of sampled points landing on safe areas.
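For illustration, the following is a minimal NumPy sketch of this sampling-based penalty, assuming the terrain heights under the sampled sole points have already been queried; the function name, array shapes, and the depth tolerance value are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def foothold_penalty(sample_heights, foot_contact, depth_tol=0.05):
    """Sampling-based foothold penalty (illustrative re-implementation).

    sample_heights: (2, M) terrain heights under the M sampled sole points of each foot,
                    expressed relative to the expected foothold surface.
    foot_contact:   (2,) booleans, True if the foot touches the ground this step.
    depth_tol:      depth tolerance; samples lower than -depth_tol count as off-foothold.
    """
    sample_heights = np.asarray(sample_heights, dtype=float)
    off_foothold = sample_heights < -depth_tol            # (2, M) indicator
    frac_off = off_foothold.mean(axis=1)                  # fraction of bad samples per foot
    return -float(np.sum(np.asarray(foot_contact) * frac_off))

# Example: the left foot is in contact, with 3 of its 8 samples hanging over a gap.
heights = np.zeros((2, 8))
heights[0, :3] = -0.3
print(foothold_penalty(heights, foot_contact=[True, False]))  # -> -0.375
```

Because the penalty scales with the fraction of off-foothold samples, partially correct placements still receive informative feedback, unlike a binary hit-or-miss reward.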
4.2.3. Double Critic for Sparse Reward Learning
The task-specific foothold reward $r_{\text{foothold}}$ described above is a sparse reward: it is only received when certain conditions are met (i.e., upon foot contact with the terrain). To optimize the policy effectively under such sparsity, it is crucial to balance it with dense locomotion rewards, which are continuous and provide the frequent feedback essential for gait regularization (learning smooth, natural walking patterns).
Inspired by previous works, BeamDojo adopts a double critic framework based on PPO (Proximal Policy Optimization). This framework decouples (separates) the learning process for different types of rewards.
Two separate critic networks, $V_{\phi_1}$ and $V_{\phi_2}$, are trained. Each critic independently estimates the value function for one of two distinct reward groups:
- (i) Regular Locomotion Reward Group: dense rewards (e.g., forward velocity, base height, orientation, joint effort) that have been widely used in quadruped and humanoid locomotion tasks. These provide continuous feedback on the general quality of movement.
- (ii) Task-Specific Foothold Reward Group: the sparse reward $r_{\text{foothold}}$, focusing on the precision of foot placement on sparse footholds.

The double critic process is illustrated in Fig. 3. The following figure (Fig. 3 from the original paper) illustrates the double critic and two-stage RL framework:
Figure 3 (caption, translated): Overview of the BeamDojo framework for training and deployment of agile humanoid walking. Part (a) shows the two training stages: Stage 1 trains under soft terrain dynamics, combining perceptive information with the sparse foothold reward; Stage 2 trains under hard terrain dynamics, jointly optimizing dense and sparse rewards to refine the policy. Part (b) shows real-world deployment, where the robot uses a LiDAR-generated elevation map and a feedback (PD) controller to walk stably.
Specifically, each value network $V_{\phi_i}$ is updated independently for its corresponding reward group using the temporal-difference (TD) loss:
$$\mathcal{L}(\phi_i) = \mathbb{E}_t\Big[\big(r_t^{\,i} + \gamma V_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t)\big)^2\Big],$$
where:
- $\mathcal{L}(\phi_i)$: The loss function for critic network $i$, which is minimized during training.
- $\phi_i$: The parameters (weights) of the $i$-th critic network.
- $r_t^{\,i}$: The reward received at time $t$ for reward group $i$.
- $\gamma$: The discount factor.
- $V_{\phi_i}(s_t)$: The value estimate (predicted cumulative future reward) of state $s_t$ by critic network $i$.

After computing the TD loss, the respective advantages are calculated using Generalized Advantage Estimation (GAE):
$$\delta_t^{\,i} = r_t^{\,i} + \gamma V_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t), \qquad A_t^{\,i} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}^{\,i},$$
where:
- $\delta_t^{\,i}$: The temporal-difference error for reward group $i$ at time $t$. It measures the discrepancy between the reward received plus the discounted estimated value of the next state, and the estimated value of the current state.
- $A_t^{\,i}$: The advantage estimate for reward group $i$ at time $t$. It quantifies how much better action $a_t$ was than the expected value in state $s_t$.
- $\lambda$: A balancing parameter (between 0 and 1) for GAE, controlling the bias-variance trade-off in the advantage estimation.

These advantages are then individually normalized and combined into an overall advantage:
$$A_t = \sum_{i} w_i \, \frac{A_t^{\,i} - \mu_{A^i}}{\sigma_{A^i}},$$
where:
- $w_i$: The weight assigned to each advantage component. This allows the framework to prioritize or scale the influence of each reward group.
- $\mu_{A^i}$: The batch mean of $A_t^{\,i}$ for reward group $i$.
- $\sigma_{A^i}$: The batch standard deviation of $A_t^{\,i}$ for reward group $i$. Normalization (subtracting the mean and dividing by the standard deviation) helps stabilize training.

This overall advantage is then used to update the actor network's policy (the part of the network that determines actions):
$$\mathcal{L}(\theta) = -\mathbb{E}_t\Big[\min\big(\rho_t(\theta)\, A_t,\ \operatorname{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\big)\Big], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$
where:
- $\mathcal{L}(\theta)$: The loss function for the actor network, minimized to update the policy.
- $\theta$: The parameters (weights) of the actor network.
- $\rho_t(\theta)$: The probability ratio between the new policy and the old policy for action $a_t$ in state $s_t$.
- $\operatorname{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: A clipping function that limits the ratio to the range $[1-\epsilon, 1+\epsilon]$.
- $\epsilon$: The PPO clipping hyperparameter, which controls how much the policy can change in a single update step, preventing overly aggressive updates.

This double-critic design makes it a modular, plug-and-play solution for specialized tasks with sparse rewards, effectively managing the difference in reward feedback frequencies within a mixed dense-sparse reward setting.
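The following is a minimal NumPy sketch of the double-critic advantage computation described above: GAE is run separately per reward group, each group's advantage is normalized independently, and a weighted sum forms the advantage used in the actor update. Group weights, rollout length, and the zero-valued critic predictions are illustrative.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one reward group over one rollout.

    values has length T + 1 (the last entry is the bootstrap value)."""
    T = len(rewards)
    adv, last = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def combined_advantage(reward_groups, value_groups, weights):
    """Normalize each group's advantage independently, then take a weighted sum."""
    total = 0.0
    for r, v, w in zip(reward_groups, value_groups, weights):
        a = gae(np.asarray(r), np.asarray(v))
        total = total + w * (a - a.mean()) / (a.std() + 1e-8)
    return total

T = 5
loco_r = np.random.uniform(0.5, 1.0, T)            # dense locomotion rewards
foot_r = np.array([0.0, 0.0, -1.0, 0.0, 0.0])      # sparse foothold penalty
loco_v = np.zeros(T + 1)                            # critic 1 predictions (with bootstrap)
foot_v = np.zeros(T + 1)                            # critic 2 predictions (with bootstrap)
print(combined_advantage([loco_r, foot_r], [loco_v, foot_v], weights=[1.0, 1.0]))
```

Normalizing per group before summing is the key step: the rare, large foothold penalties no longer swamp (or get swamped by) the dense locomotion signal.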
4.2.4. Learning Terrain-Aware Locomotion via Two-Stage RL
To address the early termination problem and promote thorough trial-and-error exploration in complex terrains, BeamDojo employs a novel two-stage reinforcement learning (RL) approach for terrain-aware locomotion in simulation. This process is also depicted in Fig. 3, which was shown previously.
Stage 1: Soft Terrain Dynamics Constraints Learning
In this initial stage, the robot is trained under soft terrain dynamics constraints.
- Setup: Each target task terrain (e.g., stepping stones) is mapped to a flat terrain of the same size. Both terrains share the same terrain noise, and their points correspond one-to-one. The key difference is that the flat terrain fills the gaps present in the real task terrain, making it traversable without falling into holes.
- Training: The humanoid robot traverses the flat terrain, receiving proprioceptive observations (internal sensor data such as joint angles and velocities). Crucially, it is simultaneously provided with perceptual feedback in the form of the elevation map of the true task terrain at the robot's current base position.
- Outcome: This setup allows the robot to "imagine" walking on the challenging terrain while physically moving on the safer flat terrain. Missteps are penalized (using the foothold reward from Section 4.2.2) but do not lead to episode termination. This relaxation of the terrain dynamics constraints lets the robot explore broadly and develop foundational skills for terrain-aware locomotion without constantly restarting, significantly improving sampling efficiency.
- Reward Decoupling: During this stage, locomotion rewards are derived primarily from the robot's movement on the flat terrain, while the foothold reward is calculated from the true task terrain (i.e., the heights $h_{i,j}$ are taken from the task terrain). These two reward components are learned separately using the double-critic framework (Section 4.2.3).
Stage 2: Hard Terrain Dynamics Constraints Learning
In the second stage, the policy learned in Stage 1 is fine-tuned directly on the actual task terrain.
- Setup: Unlike in Stage 1, missteps now result in immediate termination of the episode. This introduces hard terrain dynamics constraints.
- Training: The robot continues to optimize both the locomotion rewards and the foothold reward using the double-critic framework. The heights used for the foothold reward now come directly from the actual terrain the robot walks on.
- Outcome: This stage forces the robot to develop precise and safe locomotion strategies to avoid termination, refining its ability to navigate challenging terrains accurately.

This two-stage design is particularly effective because Stage 1 allows extensive, penalty-based exploration, making it easier to accumulate the successful foothold-placement samples that are difficult to obtain in conventional RL with early termination. Stage 2 then hones these skills into robust, precise real-world behavior.
4.2.5. Training in Simulation
To facilitate training, a comprehensive simulation environment and curriculum are designed.
4.2.5.1. Observation Space and Action Space
The policy's observation at time $t$, denoted $o_t$, is composed of four main components:
- $\mathbf{c}_t$: Commands specifying the desired velocity, $[v_x^c, v_y^c, \omega_{\text{yaw}}^c]$, i.e., the desired longitudinal velocity, lateral velocity, and angular velocity in the horizontal plane.
- Proprioceptive observations: internal sensor readings of the robot itself, including:
  - base linear velocity
  - base angular velocity
  - gravity direction in the robot's frame
  - joint positions (for 29 joints)
  - joint velocities (for 29 joints)
  - joint torques
  - contact information (which feet are in contact with the ground)
- Perceptive observations: an egocentric elevation map (a height map centered on the robot). The map samples points on a grid extending in both the longitudinal (forward/backward) and lateral (sideways) directions, providing local terrain-height information.
- $a_{t-1}$: The action taken at the last timestep. This provides temporal context, helping the policy account for its previous commands and their effects.

The action $a_t$ represents the target joint positions for the 12 lower-body joints of the humanoid robot and is output directly by the actor network. The upper-body joints are held at a default position for simplicity. A proportional-derivative (PD) controller then converts these target joint positions into torques, the actual forces applied by the robot's motors to track the desired positions.
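For concreteness, here is a minimal sketch of how a PD controller converts the policy's target joint positions into torques; the gains and torque limit are illustrative values, not the robot's actual settings.

```python
import numpy as np

def pd_torques(q_target, q, qd, kp=40.0, kd=1.0, tau_limit=60.0):
    """Convert policy actions (target joint positions) into joint torques.

    q_target: desired joint positions output by the actor (12 lower-body joints)
    q, qd:    measured joint positions and velocities
    kp, kd:   proportional / derivative gains (illustrative values)
    """
    tau = kp * (q_target - q) - kd * qd
    return np.clip(tau, -tau_limit, tau_limit)

q_target = np.zeros(12)
q = np.full(12, 0.1)       # joints 0.1 rad away from the target
qd = np.zeros(12)
print(pd_torques(q_target, q, qd))   # -> array of -4.0 N·m torques
```

Because the policy outputs positions rather than torques, the high-rate PD loop handles the fast dynamics while the learned policy only needs to run at a lower control frequency.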
4.2.5.2. Terrain and Curriculum Design
Inspired by previous research, five types of sparse foothold terrains are designed for the two-stage training and evaluation:
- Stones Everywhere: A general sparse-foothold terrain in which stones are scattered across an area; the center is a platform surrounded by stones (Fig. 4a). The curriculum progresses by decreasing stone size and increasing sparsity.
- Stepping Stones: Two lines of stepping stones along the longitudinal direction, connected by platforms at each end (Fig. 4b). Stone size decreases and sparsity increases as the curriculum progresses.
- Balancing Beams: Starts with two lines of separate stones. As the curriculum progresses, stone size decreases and the lateral distance shrinks, eventually forming a single line of balancing beams (Fig. 4c). This requires a distinct gait.
- Stepping Beams: A sequence of beams randomly distributed along the longitudinal direction with platforms at the ends (Fig. 4d). Requires high precision.
- Gaps: Several gaps with random distances (Fig. 4e). Requires the robot to take large steps.
The following figure (Fig. 4 from the original paper) illustrates the terrain settings in simulation:
Figure 4 (caption, translated): Terrain settings used in simulation for learning to walk on complex terrain: (a) Stones Everywhere; (b) and (c) Stepping Stones and Balancing Beams; (d) and (e) the more challenging Stepping Beams and Gaps. The settings are designed to progressively increase task complexity.
The training strategy is as follows:
- Stage 1: The robot is initially trained on the Stones Everywhere terrain with soft terrain constraints to learn a generalizable policy.
- Stage 2: The policy is fine-tuned on the Stepping Stones and Balancing Beams terrains with hard terrain constraints.

The commands used in these two stages are detailed in Table I. The following are the results from Table I of the original paper:
| Term | Value (stage 1) | Value (stage 2) |
|---|---|---|
| $v_x^c$ | U(−1.0, 1.0) m/s | U(−1.0, 1.0) m/s |
| $v_y^c$ | U(−1.0, 1.0) m/s | U(0.0, 0.0) m/s |
| $\omega_{\text{yaw}}^c$ | U(−1.0, 1.0) rad/s | U(0.0, 0.0) rad/s |
In Stage 2, only a single x-direction command ($v_x^c$) is provided, with no yaw command ($\omega_{\text{yaw}}^c = 0$). This forces the robot to learn to consistently face forward using its perceptual observations, rather than relying on continuous yaw corrections.
For evaluation, Stepping Stones, Balancing Beams, Stepping Beams, and Gaps terrains are used. The method shows zero-shot transfer capabilities on Stepping Beams and Gaps even without explicit training on them.
The curriculum is designed such that the robot progresses to the next terrain level after successfully traversing the current level three times consecutively. This prevents the robot from being sent back to easier levels until all levels are passed. The detailed settings for the terrain curriculum are provided in Appendix VI-B of the paper.
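Below is a minimal sketch of the curriculum rule just described: the robot advances one terrain level after three consecutive successful traversals and, consistent with the text, is never demoted to an easier level. The class layout and level count are illustrative.

```python
class TerrainCurriculum:
    """Promote a robot after 3 consecutive successful traversals (illustrative sketch)."""

    def __init__(self, num_levels: int, required_successes: int = 3):
        self.num_levels = num_levels
        self.required = required_successes
        self.level = 0
        self.streak = 0

    def report_episode(self, success: bool) -> int:
        self.streak = self.streak + 1 if success else 0
        if self.streak >= self.required and self.level < self.num_levels - 1:
            self.level += 1          # advance to a harder terrain level
            self.streak = 0          # a failure only resets the streak, never the level
        return self.level

cur = TerrainCurriculum(num_levels=8)
for outcome in [True, True, True, False, True, True, True]:
    cur.report_episode(outcome)
print(cur.level)   # -> 2
```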
4.2.5.3. Sim-to-Real Transfer
To enhance the robustness of the learned policy and facilitate sim-to-real transfer (applying a policy trained in simulation directly to a real robot), extensive domain randomization is employed. This involves injecting noise into various parameters:
- Observations: angular velocity, joint positions, joint velocities, and projected gravity.
- Humanoid Physical Properties: actuator offset, motor strength, payload mass, center-of-mass displacement, and Kp/Kd noise factors (for the PD controller).
- Terrain Dynamics: friction factor, restitution factor, and terrain height.

Additionally, to address the large sim-to-real gap between the ground-truth elevation map in simulation and the LiDAR-generated map in reality (due to odometry inaccuracies, noise, and jitter), four types of elevation-map measurement noise are introduced during height sampling in the simulator:
- Vertical Measurement: Random vertical offsets are applied per episode, and uniformly sampled vertical noise is added to each height sample at every timestep.
- Map Rotation: The map is rotated in roll, pitch, and yaw to simulate odometry inaccuracies. For yaw, a random yaw noise is sampled and the elevation map is resampled with this added noise. For roll and pitch, biases are sampled and linear interpolation generates a vertical height-map noise, which is then added to the original elevation map.
- Foothold Extension: Random foothold points adjacent to valid footholds are extended to become valid. This simulates the smoothing effect that can occur when processing LiDAR elevation data.
- Map Repeat: To simulate delays in elevation-map updates, the map from the previous timestep is randomly repeated.

The detailed domain randomization settings are provided in Appendix VI-C of the paper.
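To illustrate the first two noise types, here is a minimal sketch that applies a per-episode vertical offset, per-cell vertical noise, and a small roll/pitch tilt to a height map; the grid size, noise magnitudes, and the linear-ramp approximation of the tilt are illustrative simplifications, and the yaw, foothold-extension, and map-repeat components are omitted.

```python
import numpy as np

def corrupt_height_map(hmap, rng, z_offset_range=0.02, z_noise=0.01, tilt_range=0.03):
    """Apply simplified elevation-map measurement noise (illustrative values).

    hmap: (H, W) robot-centric height map sampled from the simulator.
    """
    H, W = hmap.shape
    noisy = hmap + rng.uniform(-z_offset_range, z_offset_range)        # per-episode vertical offset
    noisy = noisy + rng.uniform(-z_noise, z_noise, size=(H, W))        # per-cell vertical noise
    # Roll/pitch bias: a linear height ramp across the map approximates a small map tilt.
    roll, pitch = rng.uniform(-tilt_range, tilt_range, size=2)
    xs = np.linspace(-0.5, 0.5, W)
    ys = np.linspace(-0.5, 0.5, H)
    noisy = noisy + roll * xs[None, :] + pitch * ys[:, None]
    return noisy

rng = np.random.default_rng(0)
clean = np.zeros((11, 11))
print(corrupt_height_map(clean, rng).std())
```

Training against such corrupted maps means the policy never assumes a perfectly registered elevation map, which is exactly the condition it will face with onboard LiDAR odometry.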
4.2.6. Real-world Deployment
4.2.6.1. Hardware Setup
The experiments utilize a Unitree G1 humanoid robot.
- Weight: .
- Height: .
- Degrees of Freedom (DoF): 23 actuated DoF (6 in each leg, 5 in each arm, 1 in the waist).
- Onboard Computation: Jetson Orin NX.
- Perception: Livox Mid-360 LiDAR, which provides both IMU (Inertial Measurement Unit) data and feature points.
4.2.6.2. Elevation Map and System Design
The raw point cloud data from the LiDAR is noisy and subject to occlusions. To generate a robust elevation map:
- Odometry: Fast LiDAR-Inertial Odometry (FAST-LIO) is used. It fuses LiDAR feature points with the LiDAR's IMU data to produce precise odometry outputs (the robot's position and orientation).
- Mapping: The odometry outputs are then processed with robot-centric elevation mapping methods to create a grid-based representation of ground heights.
- Control Loop:
  - The elevation map is published at a fixed frequency.
  - The learned policy (from BeamDojo) runs at its own control frequency.
  - The policy's action outputs are sent to a PD controller running at a higher rate, ensuring smooth and precise actuation of the robot's joints.
5. Experimental Setup
5.1. Datasets
The experiments primarily use simulated terrains to train and evaluate the BeamDojo framework, with real-world validation. The terrains are specifically designed to present sparse foothold challenges for humanoid robots.
The five types of sparse foothold terrains designed for two-stage training and evaluation are:
- Stones Everywhere: A general sparse-foothold terrain with stones scattered across an area; the center is a platform surrounded by stones. As the curriculum progresses, stone size decreases and sparsity increases.
- Stepping Stones: Two lines of stepping stones in the longitudinal direction, connected by two platforms at each end. Each stone is uniformly distributed within two sub-square grids, with a curriculum progression similar to Stones Everywhere (decreasing stone size, increasing sparsity).
- Balancing Beams: Initially two lines of separate stones. Through curriculum progression, the stones become smaller and their lateral distance shrinks, eventually forming a single line of balancing beams. This terrain is particularly challenging because it requires the robot to keep its feet close together while maintaining its center of mass.
- Stepping Beams: A sequence of beams randomly distributed along the longitudinal direction, with platforms at either end. Along with Stones Everywhere and Stepping Stones, this terrain requires high precision in foothold placement.
- Gaps: Several gaps with random distances between them, requiring the robot to execute large steps to cross.
Terrain Characteristics:
- The Stones Everywhere terrain covers a fixed overall area.
- Stepping Stones and Balancing Beams have a fixed width and length, designed for single-direction commands.
- The depth of the gaps relative to the ground is fixed.
- All stones and beams exhibit bounded height variations to add realism.
- The depth tolerance threshold $\delta$ for the foothold reward is set such that a penalty is incurred whenever a sampled point falls below the expected surface by more than this tolerance.
5.2. Evaluation Metrics
The performance of BeamDojo and its baselines is evaluated using three primary metrics, along with detailed analysis of reward components.
- Success Rate ($R_{\mathrm{succ}}$):
  - Conceptual Definition: The percentage of trials in which the robot successfully traverses the entire terrain length without falling or terminating prematurely. It quantifies the overall reliability of the locomotion policy.
  - Mathematical Formula: $R_{\mathrm{succ}} = \frac{\text{Number of successful traversals}}{\text{Total number of attempts}} \times 100\%$
- Traverse Rate ($R_{\mathrm{trav}}$):
  - Conceptual Definition: The ratio of the distance the robot traveled before falling or failing to the total length of the terrain. This metric measures how far the robot gets even when it does not complete the traversal, indicating partial success or robustness.
  - Mathematical Formula: $R_{\mathrm{trav}} = \frac{\text{Distance traveled before failure}}{\text{Total terrain length}} \times 100\%$ (each terrain has a fixed total length).
- Foothold Error ($E_{\mathrm{foot}}$):
  - Conceptual Definition: The average proportion of foot samples that land outside the intended safe foothold areas. This directly quantifies the precision of the robot's foot placement; a lower value indicates more accurate stepping.
  - Mathematical Formula: $E_{\mathrm{foot}} = \frac{\text{Number of penalized foot samples}}{\text{Total number of foot samples}} \times 100\%$, where a sample is penalized exactly when it triggers the indicator term $\mathbb{1}\{h_{i,j} < -\delta\}$ in the foothold reward, aggregated over all contacts in an episode.
  - (A small illustrative sketch of computing these three metrics from episode logs is given after the reward tables below.)
- Reward Functions (detailed in Appendix VI-A): While not a direct evaluation metric, the composition of the reward functions is crucial for understanding what the policy optimizes. The rewards are split into two groups for the double critic.

The following are the results from Table VII of the original paper:

| Term | Equation | Weight |
|---|---|---|
| Group 1: Locomotion Reward Group | | |
| xy velocity tracking | $\exp\{-\Vert \mathbf{v}_{xy} - \mathbf{v}_{xy}^c \Vert^2 / \sigma\}$ | 1.0 |
| yaw velocity tracking | $\exp\{-(\omega_{\text{yaw}} - \omega_{\text{yaw}}^c)^2 / \sigma\}$ | 1.0 |
| base height | $(h - h^{\text{target}})^2$ | −10.0 |
| orientation | $g_x + g_y$ | −2.0 |
| z velocity | – | −2.0 |
| roll-pitch velocity | $\boldsymbol{\omega}_{xy}$ | −0.05 |
| action rate | – | −0.01 |
| smoothness | – | −1e−3 |
| stand still | – | −0.05 |
| joint velocities | $\dot{\boldsymbol{\theta}}$ | −1e−4 |
| joint accelerations | $\ddot{\boldsymbol{\theta}}$ | −2.5e−8 |
| joint position limits | $\mathrm{ReLU}(\cdot) + \mathrm{ReLU}(\cdot)$ | −5.0 |
| joint velocity limits | $\mathrm{ReLU}(\dot{\boldsymbol{\theta}} - \cdot)$ | −1e−3 |
| joint power | – | −2e−5 |
| feet ground parallel | – | −0.02 |
| feet distance | $\mathrm{ReLU}(p_{y,1} - p_{y,2} - d_{\min}) \cdot F_i$ | 0.5 |
| feet air time | – | 1.0 |
| feet clearance | – | −1.0 |
| Group 2: Foothold Reward Group | | |
| foothold | $r_{\text{foothold}}$ | 1.0 |

The following are the results from Table VIII of the original paper:

| Symbol | Description |
|---|---|
| $\sigma$ | Tracking shape scale, set to 0.25. |
| – | Threshold for determining zero-command in the stand-still reward, set to 0.1. |
| $\boldsymbol{\tau}$ | Computed joint torques. |
| $h^{\text{target}}$ | Desired base height relative to the ground, set to 0.725. |
| $\mathrm{ReLU}(\cdot)$ | Function that clips negative values to zero. |
| – | Spatial position and velocity of all sampled points on the $i$-th foot, respectively. |
| – | Target foot-lift height, set to 0.1. |
| – | Air time of the $i$-th foot. |
| – | Desired feet air time, set to 0.5. |
| $F_i$ | Indicator specifying whether foot $i$ makes first ground contact. |
| $d_{\min}$ | Minimum allowable distance between two feet, set to 0.18. |
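As referenced above, the following is a minimal sketch of computing $R_{\mathrm{succ}}$, $R_{\mathrm{trav}}$, and $E_{\mathrm{foot}}$ from per-episode logs; the log field names are illustrative.

```python
def evaluate(episodes, terrain_length):
    """Compute success rate, traverse rate, and foothold error from episode logs.

    Each episode is a dict with illustrative fields:
      'success' (bool), 'distance' (meters traveled before failure or completion),
      'bad_samples' and 'total_samples' (foot-sole sample counts over the episode).
    """
    n = len(episodes)
    r_succ = 100.0 * sum(e["success"] for e in episodes) / n
    r_trav = 100.0 * sum(min(e["distance"], terrain_length) for e in episodes) / (n * terrain_length)
    e_foot = 100.0 * sum(e["bad_samples"] for e in episodes) / sum(e["total_samples"] for e in episodes)
    return r_succ, r_trav, e_foot

logs = [
    {"success": True,  "distance": 6.0, "bad_samples": 4,  "total_samples": 160},
    {"success": False, "distance": 3.0, "bad_samples": 20, "total_samples": 90},
]
print(evaluate(logs, terrain_length=6.0))   # -> (50.0, 75.0, 9.6)
```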
5.3. Baselines
The BeamDojo framework, which integrates two-stage RL training and a double critic, is compared against four baselines to evaluate the effectiveness of its components. All methods are adapted to a two-stage structure for fairness. For Stage 1, training occurs on Stones Everywhere with curriculum learning. BeamDojo and BL 4 use soft terrain dynamics constraints in Stage 1, while other baselines use hard terrain dynamics constraints. Stage 2 involves fine-tuning on Stepping Stones and Balancing Beams with curriculum learning.
The baselines are:
- BL 1) PIM: This baseline uses PIM (Perceptive Internal Model), a one-stage method originally designed for humanoid locomotion on uneven terrains (such as walking up stairs). To make it comparable, the foothold reward $r_{\text{foothold}}$ is added to encourage accurate stepping on foothold areas. This baseline represents an existing state-of-the-art humanoid controller adapted to the foothold task.
- BL 2) Naive: A direct, simplified implementation without the key innovations of BeamDojo. It includes neither the two-stage RL approach nor the double-critic framework; its only addition is the foothold reward. This serves as a basic implementation highlighting the necessity of the proposed architectural and training advances.
- BL 3) Ours w/o Soft Dyn: An ablation that removes the first training stage with soft terrain dynamics constraints, i.e., the robot is trained with hard terrain dynamics constraints from the beginning, where missteps lead to early termination. This baseline assesses the contribution of the relaxed exploration phase.
- BL 4) Ours w/o Double Critic: An ablation that replaces the double critic with a single critic handling both the dense locomotion rewards and the sparse foothold reward. This reflects a more traditional RL design in which all rewards are processed by one value network, evaluating the importance of separating reward learning.

The training and simulation environments are implemented in IsaacGym, a high-performance GPU-based physics simulator designed for robot learning.
6. Results & Analysis
6.1. Core Results Analysis
The quantitative results demonstrate that BeamDojo consistently outperforms the baselines across various challenging terrains and difficulty levels. The evaluation metrics used are Success Rate () and Traverse Rate ().
The following are the results from Table II of the original paper:
| Method | Stepping Stones Rsucc (%, ↑) | Stepping Stones Rtrav (%, ↑) | Balancing Beams Rsucc (%, ↑) | Balancing Beams Rtrav (%, ↑) | Stepping Beams Rsucc (%, ↑) | Stepping Beams Rtrav (%, ↑) | Gaps Rsucc (%, ↑) | Gaps Rtrav (%, ↑) |
|---|---|---|---|---|---|---|---|---|
| Medium Terrain Difficulty | ||||||||
| PIM | 71.00 (±1.53) | 78.29(±2.49) | 74.67(±2.08) | 82.19(±4.96) | 88.33(±3.61) | 93.16(±4.78) | 98.00(±0.57) | 99.16 (±0.75) |
| Naive | 48.33(±6.11) | 47.79(±5.76) | 57.00(±7.81) | 71.59(±8.14) | 92.00(±2.52) | 92.67(±3.62) | 95.33 (±1.53) | 98.41(±0.67) |
| Ours w/o Soft Dyn | 65.33(±2.08) | 74.62(±1.37) | 79.00(±2.64) | 82.67(±2.92) | 98.67(±2.31) | 99.64(±0.62) | 96.33(±1.53) | 98.60(±1.15) |
| Ours w/o Double Critic | 83.00(±2.00) | 86.64(±1.96) | 88.67(±2.65) | 90.21(±1.95) | 96.33(±1.15) | 98.88(±1.21) | 98.00(±1.00) | 99.33(±0.38) |
| BeamDojo | 95.67(±1.53) | 96.11(±1.22) | 98.00(±2.00) | 99.91(±0.07) | 98.33(±1.15) | 99.28(±0.65) | 98.00(±2.65) | 99.21(±1.24) |
| Hard Terrain Difficulty | ||||||||
| PIM | 46.67(±2.31) | 52.88(±2.86) | 33.00(±2.31) | 45.28(±3.64) | 82.67(±2.31) | 90.68(±1.79) | 96.00(±1.00) | 98.27(±3.96) |
| Naive | 00.33(±0.57) | 21.17(±1.71) | 00.67(±1.15) | 36.25(±7.85) | 82.00(±3.61) | 88.91(±3.75) | 31.00(±3.61) | 62.70 (±4.08) |
| Ours w/o Soft Dyn | 42.00(±6.56) | 47.09 (±6.97) | 51.00(±4.58) | 72.93 (±4.38) | 87.33(±2.08) | 89.41(±1.75) | 93.00(±1.00) | 95.62(±2.50) |
| Ours w/o Double Critic | 55.67(±3.61) | 60.95(±2.67) | 70.33(±3.06) | 85.64(±3.24) | 94.67(±1.53) | 96.57(±1.42) | 94.33(±3.06) | 95.62(±2.50) |
| BeamDojo | 91.67(±1.33) | 94.26(±2.08) | 94.33(±1.53) | 95.15(±1.82) | 97.67(±2.08) | 98.54(±1.43) | 94.33(±1.15) | 97.00(±1.30) |
Key Observations from Table II:
- BeamDojo's Superiority: BeamDojo consistently achieves the highest Success Rate ($R_{\mathrm{succ}}$) and Traverse Rate ($R_{\mathrm{trav}}$) across all terrains (Stepping Stones, Balancing Beams, Stepping Beams, Gaps) and difficulty levels (Medium and Hard). For instance, on Hard Stepping Stones, BeamDojo achieves 91.67% $R_{\mathrm{succ}}$, significantly outperforming all baselines. On Balancing Beams (Medium), it reaches 98.00% $R_{\mathrm{succ}}$ and 99.91% $R_{\mathrm{trav}}$, indicating extremely robust performance.
- Struggles of the Baselines: The Naive implementation performs very poorly, especially on Hard Stepping Stones and Balancing Beams, with success rates close to 0%. This highlights the necessity of BeamDojo's specialized components. PIM, an existing humanoid controller, also degrades significantly on hard terrains, particularly Balancing Beams (33% $R_{\mathrm{succ}}$), demonstrating its limitations for fine-grained footholds.
- Importance of Soft Dynamics and the Double Critic: The ablation studies (Ours w/o Soft Dyn and Ours w/o Double Critic) show that removing either the soft terrain dynamics constraints or the double critic leads to a substantial drop in performance compared to the full BeamDojo framework, especially on the harder Stepping Stones and Balancing Beams. This confirms the critical role of both components.
- Zero-Shot Generalization: A remarkable finding is BeamDojo's zero-shot generalization capability. Although the robot was only explicitly trained on Stones Everywhere (Stage 1) and Stepping Stones / Balancing Beams (Stage 2), it maintains high success and traverse rates on Stepping Beams and Gaps, performing comparably to the best baselines on these unseen terrains.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Foot Placement Accuracy
The Foothold Error () comparison, as shown in Fig. 5, reveals that BeamDojo achieves highly accurate foot placement.
The following figure (Fig. 5 from the original paper) shows the foothold error comparison:
Figure 5 (caption, translated): Comparison of foothold error for the different methods on two medium-difficulty terrains: (a) Stepping Stones and (b) Balancing Beams. Our method achieves lower foothold error than the other methods in all tests, especially on Balancing Beams.
BeamDojo consistently exhibits lower foothold error values compared to other methods, largely attributed to the contribution of the double critic. In contrast, the Naive implementation shows higher error rates, indicating a significant portion of foot placements outside safe foothold areas. This underscores BeamDojo's precision in challenging terrains.
6.2.2. Learning Efficiency
BeamDojo demonstrates significantly faster convergence during training, as illustrated in Fig. 6, even though all designs were trained for 10,000 iterations to ensure convergence.
The following figure (Fig. 6 from the original paper) shows the learning efficiency:
Figure 6 (caption, translated): Learning efficiency of the different designs across the two training stages. The horizontal axis is the training step (k) and the vertical axis is the terrain level. Our method reaches terrain level 8 faster in both Stage 1 and Stage 2, indicating higher learning efficiency.
Both the two-stage training setup and the double critic contribute to this improved learning efficiency, with the two-stage setup being the more dominant factor. The Naive implementation struggles to reach higher terrain levels in both stages, highlighting its inefficiency. The two-stage learning allows for continuous attempts at foot placements, aiding in the accumulation of successful samples, while the double critic ensures that sparse foothold rewards updates are not destabilized by noisy locomotion signals in early training.
6.2.3. Gait Regularization
The double critic also plays a vital role in gait regularization, ensuring smoother and more natural movements.
The following are the results from Table III of the original paper:
| Designs | Smoothness (↓) | Feet Air Time (↑) |
|---|---|---|
| Naive | 1.7591 (±0.1316) | -0.0319 (±0.0028) |
| Ours w/o Soft Dyn | 0.9633 (±0.0526) | -0.0169 (±0.0014) |
| Ours w/o Double Critic | 1.2705 (±0.1168) | −0.0229 (±0.0033) |
| BeamDojo | 0.7603 (±0.0315) | −0.0182 (±0.0027) |
As seen in Table III, the Naive design and Ours w/o Double Critic show poorer performance on both the smoothness and feet air time metrics. In contrast, BeamDojo and Ours w/o Soft Dyn (both of which retain the double critic) demonstrate superior motion smoothness (lower values) and improved feet air time (closer to the target), leading to better feet clearance. This improvement stems from the double-critic framework normalizing the advantage estimates for dense and sparse rewards independently, preventing the sparse rewards from injecting noise into the learning of the regularization rewards.
6.2.4. Foot Placement Planning
The double critic also benefits the foot placement planning throughout the entire sub-process of lifting and landing a foot.
The following figure (Fig. 7 from the original paper) illustrates foot placement planning visualization:
Figure 7 (caption, translated): Visualization of foot placement planning. The yellow line corresponds to BeamDojo, and the red line to the variant without the double critic. Over the sequence from A to C, the method without the double critic only makes significant adjustments once the foot is already close to the target stone (point B).
As visualized in Fig. 7, BeamDojo enables smoother planning, allowing the foot to precisely reach the next foothold (yellow line). Conversely, the baseline without the double critic (Ours w/o Double Critic, red line) shows more reactive stepping, with significant adjustments only made when the foot is very close to the target stone (point B). This indicates that by learning the sparse foothold reward separately, the double critic helps the policy plan its motion over a longer time horizon.
6.3. Real-world Experiments
BeamDojo successfully achieves zero-shot transfer to real-world dynamics, enabling the humanoid robot to traverse challenging terrains effectively. The success rate and traversal rate are reported in Fig. 8 for various terrains.
The following figure (Fig. 8 from the original paper) illustrates the experimental results of BeamDojo for humanoid robot walking across different terrains:
Figure 8 (caption, translated): Real-world results of BeamDojo for humanoid walking across different terrains. The figure shows four walking scenarios with the corresponding success rates ($R_{\mathrm{succ}}$) and traverse rates ($R_{\mathrm{trav}}$), covering footholds of different widths on the Stepping Stones, Balancing Beams, Stepping Beams, and Gaps terrains, and comparing against the ablated training methods. The accompanying table lists the experimental results for each setting.
- BeamDojo achieves a high success rate in real-world deployment, demonstrating excellent precise-foot-placement capabilities. It also shows impressive generalization performance on Stepping Beams and Gaps, even though these were not part of the explicit training set.
- An ablation without height-map domain randomization (ours w/o HR) yields a significantly lower success rate, underscoring the critical importance of domain randomization for robust sim-to-real transfer.
- Notably, BeamDojo enables backward movement on risky terrains, a key advantage of using LiDAR for perception over a single depth camera.
6.3.1. Agility Test
To assess agility, the humanoid robot was given five commanded longitudinal velocities ($v_x^c$) on a stepping-stones course of fixed total length.
The following are the results from Table IV of the original paper:
| vx (m/s) | Time Cost (s) | Average Speed (m/s) | Error Rate (%, ↓) |
|---|---|---|---|
| 0.5 | 6.33(±0.15) | 0.45(±0.05) | 10.67(±4.54) |
| 0.75 | 4.33(±0.29) | 0.65(±0.05) | 13.53(±6.52) |
| 1.0 | 3.17(±0.58) | 0.88(±0.04) | 11.83(±8.08) |
| 1.25 | 2.91(±0.63) | 0.96(±0.03) | 22.74(±5.32) |
| 1.5 | 2.69(±0.42) | 1.04(±0.05) | 30.68(±6.17) |
Table IV shows that BeamDojo keeps the tracking error low for command velocities up to 1.0 m/s, where the robot reaches an average speed of 0.88 m/s, demonstrating the policy's agility. Beyond 1.0 m/s, however, performance degrades markedly: the error rate rises to 22.74% at 1.25 m/s and 30.68% at 1.5 m/s, since maintaining such high speeds on highly challenging terrain becomes increasingly difficult.
6.3.2. Robustness Test
The robustness of the precise foothold controller was evaluated through several real-world experiments, illustrated in Fig. 9.
The following figure (Fig. 9 from the original paper) shows various robustness tests:
The figure is a schematic showing the robot performing various maneuvers on sparse footholds, including being pushed, single-leg support, standing still, stepping, missteps, and recovery, illustrating the flexible gaits learned by the BeamDojo framework.
- Heavy Payload (Fig. 9a): The robot successfully carried a payload of approximately 1.5 times its torso weight, maintaining agile locomotion and precise foot placement despite a significant shift in its center of mass.
- External Force (Fig. 9b): The robot was subjected to external pushes from various directions. It was able to transition from a stationary pose, endure forces while on single-leg support, and recover to a stable two-leg standing position.
- Misstep Recovery (Fig. 9c): When traversing terrains without prior scanning (occlusions caused initial missteps), the robot exhibited robust recovery capabilities, indicating adaptability to unexpected disturbances.
6.4. Extensive Studies and Analysis
6.4.1. Design of Foothold Reward
The paper compares its sampling-based foothold reward (a continuous reward proportional to the number of safe points) with binary and coarse reward designs.
The following are the results from Table V of the original paper:
| Designs | Rsucc (%, ↑) | Efoot (%, ↓) |
|---|---|---|
| foothold-30% | 93.67(±1.96) | 11.43(±0.81) |
| foothold-50% | 92.71(±1.06) | 10.78(±1.94) |
| foothold-70% | 91.94(±2.08) | 14.35(±2.61) |
| BeamDojo | 95.67(±1.53) | 7.79(±1.33) |
Table V shows that BeamDojo's fine-grained continuous reward yields the highest success rate (95.67%) and the lowest foothold error (7.79%) on stepping stones compared with the coarse-grained variants (foothold-30%, foothold-50%, foothold-70%). This confirms that gradually encouraging the foot to maximize its overlap with the safe region leads to more accurate foot placements. Among the coarse-grained variants, foothold-50% achieves the lowest foothold error, suggesting that an intermediate threshold works better than either extreme.
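As a concrete illustration of the continuous reward discussed above, the following minimal sketch samples a fixed grid of points inside a rectangular foot outline and scores the foothold by the fraction of points that land on safe terrain. The function name, foot dimensions, grid resolution, and the `terrain_is_safe` callback are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sampling_foothold_reward(foot_pos_xy, foot_yaw, terrain_is_safe,
                             foot_length=0.2, foot_width=0.1, grid=(5, 3)):
    """Score a foothold by the fraction of points sampled inside the foot
    rectangle that land on safe terrain (continuous value in [0, 1])."""
    # Fixed grid of sample points in the foot frame.
    xs = np.linspace(-foot_length / 2, foot_length / 2, grid[0])
    ys = np.linspace(-foot_width / 2, foot_width / 2, grid[1])
    local_pts = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

    # Rotate by the foot yaw and translate to the foot position (world frame).
    c, s = np.cos(foot_yaw), np.sin(foot_yaw)
    rot = np.array([[c, -s], [s, c]])
    world_pts = local_pts @ rot.T + np.asarray(foot_pos_xy)

    # Reward: fraction of sampled points on safe terrain.
    safe = np.array([terrain_is_safe(p) for p in world_pts], dtype=float)
    return safe.mean()

# Example: a square stone covering |x| < 0.15 m, |y| < 0.15 m around the origin.
reward = sampling_foothold_reward(
    foot_pos_xy=[0.05, 0.0], foot_yaw=0.1,
    terrain_is_safe=lambda p: abs(p[0]) < 0.15 and abs(p[1]) < 0.15)
```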
6.4.2. Design of Curriculum
The effectiveness of the terrain curriculum (Section 4.2.5.2) is validated by an ablation study without curriculum learning.
The following are the results from Table VI of the original paper:
| Designs | Medium Rsucc (%) | Medium Rtrav (%) | Hard Rsucc (%) | Hard Rtrav (%) |
|---|---|---|---|---|
| w/o curriculum-medium | 88.33 | 90.76 | 2.00 | 18.36 |
| w/o curriculum-hard | 40.00 | 52.49 | 23.67 | 39.94 |
| BeamDojo | 95.67 | 96.11 | 82.33 | 86.87 |
Table VI clearly indicates that curriculum learning significantly improves both performance and generalization across terrains of varying difficulty. Without curriculum learning, training solely on hard terrain (w/o curriculum-hard) leads to very poor success rates (23.67%) and traverse rates (39.94%) on hard difficulty. Even training on medium difficulty without curriculum (w/o curriculum-medium) shows a dramatic drop in performance on hard terrain (2% Rsucc). BeamDojo's Rsucc of 82.33% on Hard Difficulty highlights the crucial role of progressive learning.
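A minimal sketch of the progression rule behind such a terrain curriculum is given below, assuming per-robot terrain levels that are promoted or demoted based on the fraction of terrain traversed in an episode; the thresholds are illustrative, not the paper's values.

```python
def update_terrain_level(level, traverse_rate, max_level,
                         promote_thresh=0.8, demote_thresh=0.4):
    """Promote a robot to a harder terrain level when it traverses most of
    its terrain, demote it when it fails early, otherwise keep the level."""
    if traverse_rate > promote_thresh:
        return min(level + 1, max_level)
    if traverse_rate < demote_thresh:
        return max(level - 1, 0)
    return level

# Usage: called at the end of each episode with the fraction of terrain covered.
assert update_terrain_level(level=2, traverse_rate=0.95, max_level=5) == 3
assert update_terrain_level(level=2, traverse_rate=0.20, max_level=5) == 1
```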
6.4.3. Design of Commands
In Stage 2, BeamDojo trains without an explicit heading command, requiring the robot to learn to face forward using its perceptive observations. This is compared against a design that includes a heading command (ours w/ heading command), which applies corrective yaw-rate commands based on the directional error.
In real-world trials on stepping stones, BeamDojo achieved a markedly higher success rate than the heading-command design. The heading-command approach performed poorly because the model overfits to angular velocities in simulation (making it sensitive to noisy real-world odometry) and requires precise manual calibration of the initial position in the real world. BeamDojo's design, which avoids continuous yaw correction, proves more robust.
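For reference, the heading-command baseline described above can be sketched as a clipped proportional yaw-rate correction of the heading error; the gain and limit below are illustrative assumptions.

```python
import numpy as np

def yaw_rate_from_heading(target_heading, current_yaw, gain=0.5, limit=1.0):
    """Clipped proportional yaw-rate command from the heading error,
    with the error wrapped to [-pi, pi]."""
    err = np.arctan2(np.sin(target_heading - current_yaw),
                     np.cos(target_heading - current_yaw))
    return float(np.clip(gain * err, -limit, limit))

# Usage: a robot facing 0.3 rad off the desired heading gets a small correction.
print(yaw_rate_from_heading(target_heading=0.0, current_yaw=0.3))  # -> -0.15
```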
6.4.4. Generalization to Non-Flat Terrains
BeamDojo demonstrates good generalization to other non-flat terrains such as stairs and slopes.
The following figure (Fig. 10 from the original paper) shows generalization tests on non-flat terrains:
The figure shows generalization tests on non-flat terrains: stairs of 15 cm height and 25 cm width on the left, and a 15-degree slope on the right.
Real-world experiments (Fig. 10) on stairs (25 cm wide, 15 cm high) and slopes (15-degree incline) achieve success rates of 8/10 and 10/10, respectively. The main adaptation required for these terrains is computing the base height reward relative to the foot height rather than the ground height, as sketched below. For these terrains, the Stage 1 pre-training (with soft dynamics) becomes unnecessary, since footholds are not sparse.
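A minimal sketch of that adaptation, with the base-height reward measured relative to the stance foot rather than the world ground plane; the target height, tolerance, and Gaussian shaping are assumptions for illustration, not the paper's values.

```python
import math

def base_height_reward(base_z, stance_foot_z, target_height=0.7, sigma=0.05):
    """Gaussian-shaped reward on the base height measured relative to the
    stance foot rather than the world ground plane."""
    err = (base_z - stance_foot_z) - target_height
    return math.exp(-(err / sigma) ** 2)

# Usage: on a 15 cm stair step, the reference moves up with the stance foot.
print(base_height_reward(base_z=0.85, stance_foot_z=0.15))  # err = 0 -> 1.0
```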
6.4.5. Failure Cases
The framework's performance limitations were investigated by varying stone sizes and step distances.
The following figure (Fig. 11 from the original paper) shows failure case analysis:
The figure analyzes success rates under different stone sizes and step distances: the left plot shows the effect of the minimum stone size (from 10 cm to 20 cm), and the right plot the effect of the maximum step distance (from 45 cm to 55 cm). Success rates drop as the stones shrink and the step distances grow.
As shown in Fig. 11, while tougher training settings improve adaptability, performance drops sharply on 10 cm stones (approximately half the foot length) and 55 cm steps (roughly equal to the leg length). In these extreme scenarios, the difficulty shifts towards maintaining balance on very small footholds and executing larger strides, which the current reward function does not adequately address.
6.5. Domain Randomization Settings
The following are the results from Table IX of the original paper:
| Term | Value |
|---|---|
| Observations | |
| angular velocity noise | U(−0.5, 0.5) rad/s |
| joint position noise | U(−0.05, 0.05) rad |
| joint velocity noise | U(−2.0, 2.0) rad/s |
| projected gravity noise | U(−0.05, 0.05) |
| Humanoid Physical Properties | |
| actuator offset | U(−0.05, 0.05) rad |
| motor strength noise | U(0.9, 1.1) |
| payload mass | U(−2.0, 2.0) kg |
| center of mass displacement | U(−0.05, 0.05) m |
| Kp, Kd noise factor | U(0.85, 1.15) |
| Terrain Dynamics | |
| friction factor | U(0.4, 1.0) |
| restitution factor | U(0.0, 1.0) |
| terrain height noise | U(−0.02, 0.02) m |
| Elevation Map | |
| vertical offset | U(−0.03, 0.03) m |
| vertical noise | U(−0.03, 0.03) m |
| map roll, pitch rotation noise | U(−0.03, 0.03) rad |
| map yaw rotation noise | U(−0.2, 0.2) rad |
| foothold extension probability | 0.6 |
| map repeat probability | 0.2 |
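To show how the observation-noise terms in Table IX might be applied per step, the sketch below perturbs each observation group with uniform noise in the listed ranges; the dictionary layout and joint count are illustrative assumptions, not the paper's code.

```python
import numpy as np

def randomize_observation(obs, rng):
    """Apply the per-step observation-noise ranges from Table IX to a dict of
    observation arrays, returning a noisy copy."""
    noisy = {k: np.array(v, dtype=float, copy=True) for k, v in obs.items()}
    noisy["ang_vel"] += rng.uniform(-0.5, 0.5, size=noisy["ang_vel"].shape)        # rad/s
    noisy["joint_pos"] += rng.uniform(-0.05, 0.05, size=noisy["joint_pos"].shape)  # rad
    noisy["joint_vel"] += rng.uniform(-2.0, 2.0, size=noisy["joint_vel"].shape)    # rad/s
    noisy["projected_gravity"] += rng.uniform(
        -0.05, 0.05, size=noisy["projected_gravity"].shape)
    return noisy

# Usage with an illustrative 23-DoF joint layout.
rng = np.random.default_rng(0)
obs = {"ang_vel": np.zeros(3), "joint_pos": np.zeros(23),
       "joint_vel": np.zeros(23), "projected_gravity": np.array([0.0, 0.0, -1.0])}
noisy_obs = randomize_observation(obs, rng)
```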
6.6. Hyperparameters
The following are the results from Table X of the original paper:
| Hyperparameter | Value |
|---|---|
| General | |
| num of robots | 4096 |
| num of steps per iteration | 100 |
| num of epochs | 5 |
| learning rate | 1e-8 |
| PPO | |
| clip range | 0.2 |
| entropy coefficient | 0.01 |
| discount factor | 0.99 |
| GAE balancing factor | 0.95 |
| desired KL-divergence | 0.01 |
| BeamDojo | |
| actor and double critic NN | MLP, hidden units |
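For readability, the Table X hyperparameters can be collected into a single training configuration; the key names below follow common PPO implementations and are an assumption, not the paper's code.

```python
# Illustrative grouping of the Table X hyperparameters (key names assumed).
ppo_config = {
    "num_envs": 4096,            # num of robots simulated in parallel
    "num_steps_per_env": 100,    # rollout steps per iteration
    "num_learning_epochs": 5,
    "learning_rate": 1e-8,       # value as listed in Table X
    "clip_param": 0.2,
    "entropy_coef": 0.01,
    "gamma": 0.99,               # discount factor
    "lam": 0.95,                 # GAE balancing factor
    "desired_kl": 0.01,          # target KL for adaptive learning-rate scheduling
}
```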
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces BeamDojo, a novel reinforcement learning (RL) framework designed to enable humanoid robots to achieve agile and robust locomotion on sparse foothold terrains. The key contributions include:
- Accurate Foot Placement: A sampling-based foothold reward tailored for polygonal feet ensures precise foot placements by continuously encouraging maximization of the contact area with safe regions.
- Efficient and Effective Training: A two-stage RL training process (soft constraints for exploration, hard constraints for fine-tuning) addresses the early termination problem and improves sample efficiency. Coupled with a double critic architecture, the framework balances sparse foothold rewards and dense locomotion rewards, yielding regular gait patterns and longer-horizon foot placement planning.
- Real-world Agility and Robustness: Extensive simulation and real-world experiments on a Unitree G1 humanoid demonstrate BeamDojo's high success rates and agile locomotion in physical environments, even under significant external disturbances (payloads, pushes) and real-world variability. The LiDAR-based elevation map also enables stable backward walking, an advantage over depth camera-based systems. Zero-shot generalization to unseen terrains such as Stepping Beams and Gaps further highlights the framework's robustness.
7.2. Limitations & Future Work
The authors acknowledge several limitations of the current BeamDojo framework and suggest directions for future work:
- Perception Module Limitations: Performance is constrained by inaccuracies in the LiDAR odometry, jitter, and map drift. These issues pose considerable challenges for real-world deployment, making the system less adaptable to sudden disturbances or dynamic changes in the environment (e.g., jittering stones, which are hard to simulate accurately).
- Underutilization of Elevation Map Information: The current method does not fully leverage all the information provided by the elevation map.
- Challenges with Significant Foothold Height Variations: The framework has not adequately addressed terrains with significant variations in foothold height.
- Future Work: The authors aim to develop a more generalized controller that enables agile locomotion across a broader range of complex terrains, including those requiring footstep planning (e.g., stairs) and those with large elevation changes.
7.3. Personal Insights & Critique
BeamDojo presents a comprehensive and well-engineered solution to a very challenging problem in humanoid robotics. The integration of multiple carefully designed components—specifically the polygonal foot reward, double critic, and two-stage training—is what makes this work stand out.
One of the most inspiring aspects is the methodical approach to tackling the sparse reward and early termination problems. The two-stage training with soft dynamics is a clever way to enable sufficient exploration in complex environments without constant resets, which is a major bottleneck in many RL applications. The double critic further refines this by ensuring that foundational locomotion skills are learned without being disrupted by the sparse, high-penalty foothold rewards. This detailed reward engineering and architectural design are critical to the observed success rates.
The robust sim-to-real transfer is another strong point. The extensive domain randomization, particularly the inclusion of elevation map measurement noise, shows a deep understanding of the practical challenges in real-world robotics. This level of detail in simulation design is often underestimated but is crucial for successful deployment.
However, the identified limitations also offer interesting avenues for future research. The reliance on the perception module highlights a common bottleneck in embodied AI; even the best control policy cannot compensate for poor or noisy sensory input. Future work could explore more advanced sensor fusion techniques or predictive perception to mitigate LiDAR inaccuracies and delays.
The current reward function's struggle with very small stones or large steps suggests that incorporating more explicit balance rewards or kinematic feasibility constraints directly into the reward structure, or perhaps through a more advanced model-predictive control layer, could further enhance performance in extreme scenarios. The idea of "not fully leveraging elevation map information" is also intriguing; it implies that there might be richer features or higher-level planning opportunities from the elevation map that could be exploited for even more intelligent and adaptive behaviors.
Overall, BeamDojo provides a significant step forward for humanoid locomotion on sparse terrains, demonstrating how thoughtful RL algorithm design and meticulous system integration can lead to highly capable and robust robotic systems.