Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning
TL;DR Summary
This paper introduces an Adaptive Humanoid Control (AHC) framework that learns adaptive locomotion controllers through multi-behavior distillation and reinforced fine-tuning, showing strong adaptability across various skills and terrains.
Abstract
Humanoid robots are promising to learn a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC) that adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback in performing adaptive behaviors on more diverse terrains, enhancing terrain adaptability for the controller. We conduct experiments in both simulation and real-world experiments in Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains. Project website: https://ahc-humanoid.github.io.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is Adaptive Humanoid Control via a two-stage framework: Multi-Behavior Distillation and Reinforced Fine-Tuning.
1.2. Authors
The authors and their affiliations are:
- Yingnan Zhao
- Xinniao Wang
- Dewey Wang
- Xinzhe Liu
- Dan Lu
- Qilong Han
- Peng Liu
- Chenjia Bai
Affiliations:
- College of Computer Science and Technology, Harbin Engineering University
- Institute of Artificial Intelligence (TeleAI), China Telecom
- School of Information Science and Technology, University of Science and Technology of China
- School of Information Science and Technology, ShanghaiTech University
- College of Computer Science and Technology, Harbin Institute of Technology
- Shenzhen Research Institute of Northwestern Polytechnical University
- National Engineering Laboratory for Modeling and Emulation in E-Government, Harbin Engineering University
The asterisk (*) next to Dan Lu and Chenjia Bai indicates corresponding authorship. The authors represent a mix of academic institutions and a telecommunications research institute in China, suggesting a collaborative effort between academia and industry.
1.3. Journal/Conference
The paper is published as a preprint, indicated by the arXiv link. Its publication status (e.g., in a specific journal or conference proceedings) is not explicitly stated in the provided text beyond its arXiv posting date. Given the nature of the research (robotics, reinforcement learning) and the typical publication cycle, it is likely intended for a major robotics or AI conference/journal.
1.4. Publication Year
The paper was published at (UTC): 2025-11-09T13:15:20.000Z.
1.5. Abstract
Humanoid robots possess the potential to learn diverse human-like locomotion behaviors, such as standing, walking, running, and jumping. However, current methods typically train separate policies for each skill, resulting in behavior-specific controllers that lack generalization and perform poorly on irregular terrains and in varied situations. To overcome this, the paper introduces Adaptive Humanoid Control (AHC), a two-stage framework. The first stage involves training several primary locomotion policies and then applying a multi-behavior distillation process to create a basic multi-behavior controller, which enables adaptive behavior switching based on the environment. The second stage employs reinforced fine-tuning, gathering online feedback from adaptive behaviors performed on more diverse terrains to enhance the controller's terrain adaptability. Experiments conducted in simulation and on real-world Unitree G1 robots demonstrate the method's strong adaptability across different situations and terrains.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2511.06371, and the PDF is available at https://arxiv.org/pdf/2511.06371v2.pdf. The paper is currently available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitation of existing humanoid robot control methods, which primarily rely on training independent policies for each specific skill (e.g., standing up, walking, jumping). While these behavior-specific controllers excel in their narrow domains, they exhibit limited generalization capabilities in terms of behavior diversity and terrain adaptability. This results in brittle performance when deployed on irregular terrains or in complex, dynamic situations where a robot might need to switch between skills or adapt to unexpected environmental changes.
This problem is important because humanoid robots, with their human-like morphology, are envisioned to operate in human environments, requiring a versatile set of locomotion abilities. The ability to seamlessly transition between skills (e.g., recovering from a fall and then walking) and robustly navigate diverse terrains is crucial for the autonomy and practical application of these robots. A significant challenge in directly training a multi-skill policy using Reinforcement Learning (RL) is the occurrence of gradient conflicts among different reward functions, which can hinder the convergence and effectiveness of the learning process.
The paper's entry point or innovative idea is to propose a two-stage framework called Adaptive Humanoid Control (AHC). Instead of attempting to learn all behaviors and adapt to all terrains simultaneously from scratch (which is difficult due to gradient conflicts), AHC first learns a foundational multi-behavior controller and then systematically enhances its terrain adaptability. This decoupled approach, combined with specific techniques like multi-behavior distillation and reinforced fine-tuning with gradient surgery, aims to overcome the limitations of prior work.
2.2. Main Contributions / Findings
The paper's primary contributions are summarized as follows:
-
Motion-Guided Policy Learning and Supervised Distillation for a Basic Multi-Behavior Policy: The authors avoid directly training a multi-behavior RL policy across diverse terrains. Instead, they first integrate human motion priors via Adversarial Motion Prior (AMP) into independent behavior-specific policy learning to obtain basic human-like controllers. These separate controllers are then combined into a single basic multi-behavior policy through supervised distillation. This initial stage enables adaptive behavior switching and sidesteps the difficulties of direct multi-behavior RL training.
-
Sample-Efficient RL Fine-Tuning for Terrain Adaptability: The basic multi-behavior policy obtained from the first stage is further refined with sample-efficient RL fine-tuning, which collects online feedback to continuously improve the terrain adaptability of each behavior on more diverse and complex terrains. Techniques such as gradient projection (PCGrad) and behavior-specific critics mitigate gradient conflicts and keep learning efficient in this multi-task RL setting.
-
Extensive Experimental Validation in Simulation and the Real World: The learned controller is evaluated through extensive experiments in both the IsaacGym simulator and on a real-world Unitree G1 humanoid robot. The results show that the AHC controller exhibits strong adaptability, enabling the robot to handle environmental state changes (e.g., standing up after a fall and then walking) and to locomote robustly on challenging terrains (e.g., stairs and slopes).

The key findings demonstrate that the proposed AHC framework enables humanoid robots to acquire and adaptively switch between diverse skills such as standing up and walking, and to perform robustly on various complex terrains, by systematically addressing the challenges of multi-skill learning and terrain generalization through its two-stage approach and integrated learning mechanisms.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Adaptive Humanoid Control (AHC) paper, a foundational understanding of several key concepts in robotics and reinforcement learning is necessary:
-
Humanoid Robots: Robots designed to mimic the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology allows them to interact with human-centric environments but also introduces challenges in balance, stability, and control due to their complex
degrees of freedom (DoF). -
Locomotion: The act of moving from one place to another. For humanoid robots, this includes a diverse set of behaviors like walking, running, standing up, jumping, climbing, etc.
-
Reinforcement Learning (RL): A paradigm of machine learning where an
agent learns to make optimal decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, iteratively adjusting its policy to maximize cumulative reward over time.
- Agent: The entity that perceives the environment and takes actions.
- Environment: The external system with which the agent interacts.
- State ($s$): A complete description of the environment at a given time.
- Action ($a$): The output of the agent, which affects the environment.
- Reward ($r$): A scalar feedback signal from the environment, indicating the desirability of an action taken from a state.
- Policy ($\pi$): A function that maps states to actions ($a = \pi(s)$, or a distribution $\pi(a \mid s)$), defining the agent's behavior.
- Value Function ($V(s)$ or $Q(s, a)$): Predicts the expected future reward from a given state or state-action pair.
-
Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An
MDP is formally defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:
- $\mathcal{S}$ is the set of all possible states.
- $\mathcal{A}$ is the set of all possible actions.
- $P(s' \mid s, a)$ is the state transition function, representing the probability of transitioning to state $s'$ given that the agent took action $a$ in state $s$.
- $R(s, a)$ is the reward function, defining the immediate reward received after taking action $a$ in state $s$.
- $\gamma \in [0, 1)$ is the discount factor, which determines the present value of future rewards; a higher $\gamma$ makes future rewards more influential. The goal in an MDP is to find a policy that maximizes the expected cumulative discounted reward.
-
Proximal Policy Optimization (PPO): A popular
RL algorithm known for its stability and performance. PPO is an on-policy algorithm that optimizes a stochastic policy by taking small, conservative steps. It uses a clipped surrogate objective to prevent excessively large policy updates, which can destabilize training. PPO often employs an actor-critic architecture.
Actor-Critic Architecture: A common
RL framework where two neural networks work in tandem:
- Actor: The policy network, which takes the current
stateas input and outputs theaction(or probability distribution over actions). - Critic: The value network, which takes the current
stateas input and outputs an estimate of thevalue function(expected future rewards) for that state. The critic's estimate helps the actor learn by providing a baseline foradvantage estimation.
-
Generalized Advantage Estimation (GAE): A method used in
actor-critic RL algorithms (such as PPO) to estimate the advantage function more accurately. The advantage function measures how much better an action $a$ is than the average action at state $s$. GAE balances the bias-variance trade-off in advantage estimation by using an exponentially weighted average of $n$-step returns (a short sketch follows at the end of this subsection).
Policy Distillation: A technique where knowledge from one or more "teacher" policies (often larger, more complex, or trained on specific tasks) is transferred to a "student" policy (often smaller, simpler, or designed for generalization). In this paper, it's used to combine multiple
behavior-specific policies into a single multi-behavior policy. The student policy learns by trying to mimic the outputs (actions or action probabilities) of the teacher policies.
Behavioral Cloning (BC): A form of
supervised learning where a model learns a policy by observing and imitating expert demonstrations. DAgger (Dataset Aggregation) is an iterative BC algorithm that addresses the covariate shift problem (where the learned policy drifts away from the expert's state distribution). DAgger repeatedly collects data by running the current policy, queries the expert for actions in the visited states, and adds these labeled states to the training dataset.
Multi-task Learning (MTL): A machine learning paradigm where multiple tasks are learned simultaneously by a single model. The goal is often to improve the generalization ability of the model by leveraging shared representations among related tasks. A key challenge in
MTL is gradient conflict, where the gradients from different tasks can pull the shared parameters in opposing directions.
Gradient Conflict: In
MTL, gradient conflict occurs when the gradients computed for different tasks point in substantially different directions in parameter space. This can lead to slow convergence, oscillations, or even divergence, since optimizing for one task may degrade performance on another.
Gradient Surgery / Projecting Conflicting Gradients (PCGrad): A technique to mitigate
gradient conflict in MTL. PCGrad modifies task-specific gradients by projecting a conflicting gradient onto the normal plane of the other task's gradient. This removes the conflicting component while preserving the non-conflicting direction, allowing shared parameters to learn from multiple tasks more harmoniously.
Adversarial Motion Prior (AMP): A method used in
RL for physics-based character control to generate naturalistic, human-like motions. AMP leverages a discriminator network, similar to a Generative Adversarial Network (GAN). The discriminator is trained to distinguish between motion sequences from a reference motion dataset (human motion capture data) and those generated by the RL policy. The RL policy then receives a style reward based on the discriminator's output, encouraging it to produce motions that are indistinguishable from the reference data, thereby incorporating human motion priors without complex handcrafted reward functions.
Mixture-of-Experts (MoE): A neural network architecture that consists of multiple "expert" sub-networks and a "gating network." The
gating network learns to assign different weights to the outputs of the experts for a given input, effectively allowing different parts of the network to specialize in different aspects of the task. This can enhance model capacity and allow diverse behaviors to be learned.
PD Controller (Proportional-Derivative Controller): A widely used feedback control loop mechanism in robotics. It calculates an
error value as the difference between a desired setpoint and a measured process variable, and then applies a correction based on proportional (P) and derivative (D) terms of this error.
- Proportional term ($K_p$): Responds to the current error, providing a correction proportional to its magnitude.
- Derivative term ($K_d$): Responds to the rate of change of the error, helping to dampen oscillations and improve stability.
In this paper,
PD controllers are used to convert the policy's predicted actions (target joint positions) into motor torques.
-
Domain Randomization: A technique used in
sim-to-real transfer in RL. Instead of trying to perfectly match the simulation to the real world, domain randomization trains the RL policy in simulations where various physical parameters (e.g., friction, mass, sensor noise) are randomly varied. This forces the policy to become robust to variations and makes it more likely to generalize to the real world, which can be seen as just another variation within the randomized domain.
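To make the RL quantities above concrete, here is a minimal NumPy sketch (not from the paper; all values are toy inputs) of the discounted return from the MDP definition and of Generalized Advantage Estimation:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^(t-1) * r_t for t = 1..T, as in the MDP objective."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: r_1..r_T; values: critic estimates V(s_1)..V(s_{T+1}).
    Returns advantages A_1..A_T; lam trades off bias vs. variance.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy usage: a 5-step episode with a constant critic estimate.
rew = [0.1, 0.2, 0.0, 0.5, 1.0]
val = [0.3] * 6
print(discounted_return(rew), gae(rew, val))
```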
3.2. Previous Works
The paper references several prior studies that form the landscape of humanoid locomotion and multi-behavior learning in robots:
-
Humanoid Locomotion with Deep Reinforcement Learning:
Makoviychuk et al. (2021) and Todorov, Erez, and Tassa (2012) (MuJoCo): These works highlight the advances in RL for robust and complex locomotion in simulation. Isaac Gym (Makoviychuk et al. 2021) is specifically mentioned as providing high-performance GPU-based physics simulation, which facilitates large-scale RL training.
-
Radosavovic et al. (2024) and related works: These demonstrate challenging behaviors such as coordinated upper- and lower-body control, whole-body locomotion, robust traversal over complex terrains, and full-body teleoperation.
- Huang et al. (2025b) (HoST): This work is directly compared against. HoST focuses on fall recovery for humanoids, achieving it through multiple critics and a force curriculum. The current paper adapts HoST's multiple-critics strategy for its own recovery policy but enhances it with AMP.
- Long et al. (2024b), Wang et al. (2025c), and others: These address locomotion on more extreme terrains using external sensors (depth cameras, LiDARs) or attention-based network designs for terrain perception.
- Note: The primary limitation highlighted by the authors for all of these works is their focus on a
single behavior.
-
Multi-Behavior Learning in Robots:
Zhuang et al. (2023) and Zhuang, Yao, and Zhao (2024): These works use policy distillation to integrate skills from multiple expert policies, enabling diverse behaviors for navigating complex terrains (e.g., Robot Parkour Learning, Humanoid Parkour Learning).
- Hoeller et al. (2024): Explores hierarchical frameworks to select among multiple skill policies for efficient multi-skill traversal (e.g., Anymal Parkour for quadrupedal robots).
- Xue et al. (2025) (HugWBC): Leverages input signals (gait frequency, foot contact patterns) to guide a policy to exhibit different behaviors under varying commands.
- Huang et al. (2025a) (MoELoco): Adopts an MoE architecture to reduce gradient conflicts in multi-skill RL, improving training efficiency for quadrupedal locomotion.
- Wang et al. (2025b) (MoRE): Further enhances multi-skill policy performance by incorporating AMP-based rewards and external sensor inputs for humanoids.
- Note: The authors differentiate their work by integrating
highly diverse behaviors (like recovery and locomotion) into a single unified policy that autonomously switches based on the robot's state, rather than relying on explicit control signals or combining highly similar behaviors.
3.3. Technological Evolution
The field of robot locomotion control has evolved significantly:
- Early Model-Based Control (Pre-RL Era): Methods like
whole-body control (WBC)(e.g., Santis and Khatib 2006) andmodel predictive control (MPC)(e.g., Li and Nguyen 2023) relied heavily on precise models of the robot and environment. While effective for well-defined tasks, they struggled with adaptability to unknown or dynamic environments. - Emergence of Reinforcement Learning (RL): With advances in computing power and
deep learning,RL(Ernst and Louette 2024) emerged as a powerful paradigm.RLallowed robots to learn complex, emergent behaviors directly from interaction, reducing the need for explicit modeling. Pioneers in this area used simulators likeMuJoCo (Todorov et al. 2012)and laterIsaac Gym (Makovychuk et al. 2021)forlarge-scale simulation. - Single-Skill Mastery: Much of the initial
RLsuccess focused on mastering individual, complex skills such as walking (Gu et al. 2024a), running, standing up (He et al. 2023b; Huang et al. 2025b), jumping (Tan et al. 2024), or squatting (Ben et al. 2025). Techniques likeAdversarial Motion Prior (AMP)(Peng et al. 2021; Escontrela et al. 2022) were introduced to make these behaviors more human-like. - Addressing Generalization and Adaptability (Multi-Skill/Terrain): The next frontier involved making robots more versatile. This led to efforts in:
-
Multi-task RLandpolicy distillationfor combining skills (Zhuang et al. 2023; Zhuang, Yao, and Zhao 2024). -
Mixture-of-Experts (MoE)architectures to handle diverse tasks (Huang et al. 2025a; Wang et al. 2025b). -
Terrain curriculum learningto generalize across varied terrains (Rudin et al. 2022). -
Techniques to address
gradient conflicts in multi-task RL (Chen et al. 2018; Hessel et al. 2019; Yu et al. 2020).

This paper's work fits within this technological timeline by building on the successes of single-skill
RL and AMP, and then specifically tackling the challenges of integrating highly diverse behaviors (like recovery and walking) into a unified controller that can autonomously switch and adapt to complex terrains. It leverages recent advances in multi-task RL, policy distillation, MoE, and gradient surgery to achieve this.
-
3.4. Differentiation Analysis
Compared to the main methods in related work, the AHC framework's core differences and innovations are:
-
Integrated Diverse Behaviors with Autonomous Switching: Unlike most prior works that focus on a single behavior (e.g., locomotion, recovery, specific parkour skills) or combine similar behaviors (e.g., stair climbing and gap jumping),
AHC integrates highly diverse behaviors (fall recovery and robust walking) into a single unified policy. Crucially, it enables the robot to autonomously switch between these behaviors based on its current state (e.g., detecting a fall, initiating recovery, and then transitioning back to walking). This holistic approach to multi-skill control is a significant differentiator.
Two-Stage Training Framework:
- Behavior Distillation for Foundation: Instead of attempting to learn all complex, adaptive behaviors directly through
online RL from scratch, AHC first creates a basic multi-behavior controller via motion-guided policy learning and supervised distillation. This addresses the inherent difficulty of gradient conflicts and poor convergence that often plague direct multi-skill RL training, effectively bootstrapping the learning process with foundational skills.
- Reinforced Fine-Tuning for Adaptability: Only after a basic
multi-behavior policy is established does the framework proceed to reinforced fine-tuning on diverse terrains. This staged approach is designed to be sample-efficient and to build complexity incrementally.
-
Specific Technical Innovations for Multi-Task Learning:
-
Mixture-of-Experts (MoE) Architecture: Employing MoE for the multi-behavior policy helps manage the complexity of integrating diverse skills by allowing different experts to specialize. This also inherently alleviates some gradient conflicts compared to a monolithic network.
Adversarial Motion Prior (AMP) Integration: AMP is used consistently across both stages (when training the behavior-specific policies and during fine-tuning) to ensure that the learned behaviors are human-like and naturalistic, providing a stable reward signal that is hard to achieve with handcrafted functions. It is integrated into both recovery and locomotion.
PCGrad for Gradient Conflict Mitigation: Applying Projecting Conflicting Gradients (PCGrad) during the RL fine-tuning stage directly addresses the core multi-task learning challenge of gradient conflict in the shared actor, allowing balanced and efficient learning across different behaviors and terrains.
Behavior-Specific Critics: Decoupling value function learning by using separate critics for each task (recovery and walking) prevents tasks with larger reward scales from dominating the gradient updates, leading to more accurate value estimation and stable training.

In essence,
AHC differentiates itself by providing a structured, robust, and effective pathway to integrated, adaptive humanoid control, specifically designed to handle the complexities of diverse behaviors and challenging terrains in a unified manner, leveraging state-of-the-art RL techniques.
-
4. Methodology
The Adaptive Humanoid Control (AHC) framework is a two-stage approach designed to learn an adaptive humanoid locomotion controller capable of performing various skills across different terrains. An overview of this two-stage framework is illustrated in Figure 2.
4.1. Preliminaries and Problem Definitions
4.1.1. Behavior-Specific Control as an MDP
In the first stage of AHC, each behavior-specific humanoid control problem (e.g., recovery or walking) is formulated as a Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
-
$\mathcal{S}$: The state space, which encompasses all possible configurations of the robot and its environment.
-
$\mathcal{A}$: The action space, representing all possible control commands the robot can execute.
-
$P$: The state transition function, which describes the probability distribution over the next state given the current state and action.
-
$R$: The reward function, which provides scalar feedback to the agent based on its actions and resulting states.
-
$\gamma$: The reward discount factor, which weighs the importance of immediate rewards versus future rewards.

During training, a behavior-specific policy learns to map a state $s_t$ to an action $\pmb{a}_t$ so as to maximize the discounted return, i.e., the expected sum of future rewards: $ \mathbb{E}\left[\sum_{t = 1}^{T}\gamma^{t - 1}R(s_t,\pmb {a}_t)\right] $
4.1.2. Adaptive Humanoid Control as Multi-Task RL
The adaptive humanoid control problem is formulated as a multi-task RL problem. Each task (i.e., each behavior) is considered an MDP $\mathcal{M}_i$, for tasks $i = 1, \dots, N$. Since all tasks are performed in a unified environment and the controller must adapt its behavior based on the current state, the state space is the union of the individual task state spaces ($\mathcal{S} = \bigcup_{i=1}^{N}\mathcal{S}_i$), under the assumption that the task state spaces are disjoint ($\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$ for $i \neq j$). This implies that each MDP differs primarily in its reward function and in the part of the state space relevant to that behavior.
The objective for the behavior-adaptive policy is to optimize the sum of expected discounted returns across all tasks:
$
\sum_{i = 1}^{N}\mathbb{E}_{P,\pi_i}\left[\sum_{t = 1}^{T}\gamma^{t - 1}R_i(s_t^i,a_t)\right], \quad s_t^i\in \mathcal{S}_i. \quad (1)
$
Here, $P$ denotes the state transition dynamics (the distribution over states), and $\pi_i$ is the policy for task $i$.
4.1.3. Inputs to Policies and Action Generation
Behavior-specific policies (teacher policies) take two types of input: privileged information and robot proprioception $s_t^{\mathrm{prop}}$. Privileged information includes data not directly available to the real robot (e.g., ground friction, motor controller gains, base mass, center-of-mass shift) but useful for efficient learning in simulation.
In contrast, the basic multi-behavior policy (the student policy obtained after distillation) and the RL fine-tuned policy are designed to operate using only the robot proprioception $s_t^{\mathrm{prop}}$, as this is the information accessible on a real robot.
The robot proprioception $s_t^{\mathrm{prop}}$ is a 69-dimensional vector comprising:
$
s_{t}^{\mathrm{prop}} = [\bar{\omega}_{t},\,\pmb{g}_{t},\,c_{t},\,\pmb{q}_{t},\,\dot{\pmb{q}}_{t},\,\pmb{a}_{t - 1}]\in \mathbb{R}^{69}, \quad (2)
$
where:
-
$\bar{\omega}_{t}$: The base angular velocity (the rate of rotation of the robot's main body).
-
$\pmb{g}_{t}$: The gravity vector expressed in the robot's base frame, indicating its orientation relative to gravity.
-
$c_{t}$: The velocity command, typically comprising desired linear velocities along the $x$- and $y$-axes and an angular velocity around the $z$-axis for locomotion.
-
$\pmb{q}_{t}$: The joint positions (angles) of the robot's 20 degrees of freedom.
-
$\dot{\pmb{q}}_{t}$: The joint velocities (rates of change of the joint angles).
-
$\pmb{a}_{t-1}$: The last action taken by the policy, providing a short-term memory of the control output.

The policy outputs an action $\pmb{a}_t$, which is converted into a PD controller target $\bar{\pmb{q}}_t^{\mathrm{target}}$. This target is used to compute the motor torques $T_t$ via a PD controller (a short code sketch follows at the end of this subsection):
$
\bar{\pmb{q}}_{t}^{\mathrm{target}} = \pmb{q}^{\mathrm{default}} + \alpha\, \pmb{a}_{t}, \qquad T_{t} = K_{p}\cdot (\bar{\pmb{q}}_{t}^{\mathrm{target}} - \pmb{q}_{t}) - K_{d}\cdot \dot{\pmb{q}}_{t}, \quad (3)
$
where:
-
$\pmb{q}^{\mathrm{default}}$: The default joint positions.
-
$\alpha$: A scalar used to bound the action range.
-
$K_p$: The stiffness coefficient (proportional gain) of the PD controller.
-
$K_d$: The damping coefficient (derivative gain) of the PD controller.
-
$\bar{\pmb{q}}_t^{\mathrm{target}}$: The target joint positions for the motors.
-
$\dot{\pmb{q}}_t$: The current joint velocities.

The overall methodology is divided into two main stages: Multi-Behavior Distillation and RL Fine-Tuning, which are detailed below.
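As a concrete illustration of the PD conversion in Eq. (3), below is a minimal NumPy sketch (not the authors' code). The gain values loosely mirror Table 5 in Appendix B, while the action-scaling value `alpha` is an assumption for illustration only.

```python
import numpy as np

def pd_torque(action, q, q_dot, q_default, kp, kd, alpha=0.25):
    """Convert a policy action into joint torques via a PD controller (Eq. 3).

    action    : policy output per joint (roughly in [-1, 1])
    q, q_dot  : measured joint positions and velocities
    q_default : default joint positions
    kp, kd    : stiffness and damping gains per joint
    alpha     : scalar bounding the action range (illustrative value)
    """
    q_target = q_default + alpha * action      # PD target position
    return kp * (q_target - q) - kd * q_dot    # motor torques

# Toy example with three joints (gains follow the hip/knee/ankle rows of Table 5).
kp = np.array([150.0, 200.0, 40.0])
kd = np.array([4.0, 6.0, 2.0])
tau = pd_torque(action=np.array([0.1, -0.2, 0.05]),
                q=np.array([0.02, -0.1, 0.0]),
                q_dot=np.zeros(3),
                q_default=np.zeros(3), kp=kp, kd=kd)
print(tau)
```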
4.2. Multi-Behavior Distillation
The first stage focuses on learning a basic multi-behavior controller by distilling knowledge from independently trained behavior-specific policies. This approach addresses the challenges of gradient conflict and gradient imbalance that arise when attempting to train diverse behaviors directly via online RL from scratch. The authors use Proximal Policy Optimization (PPO) to train the behavior-specific policies.
4.2.1. Falling Recovery Behavior Policy ($\pi_r^b$)
This policy is trained to enable the humanoid robot to recover robustly from various fallen postures.
- Methodology: Inspired by HoST (Huang et al. 2025b), the policy uses multiple critics (Mysore et al. 2022). In the surrogate loss for the policy gradient, the advantage is estimated with a weighted formulation: $ \hat{A} = \sum_{i = 0}^{n}\omega_{i}\cdot (\hat{A}_{i} -\mu_{\hat{A}_{i}}) / \sigma_{\hat{A}_{i}} $, where $\omega_i$ is a weighting coefficient, and $\mu_{\hat{A}_i}$ and $\sigma_{\hat{A}_i}$ are the batch mean and standard deviation of the advantage from reward group $i$, respectively. This multi-critic approach lets different critics specialize in different aspects of the recovery task (a small sketch of the weighting follows this list).
- Training Setup: The robot is initialized in supine (lying on its back) or prone (lying on its stomach) positions, with additional joint initialization noise to promote robust recovery from diverse abnormal postures.
- Adversarial Motion Prior (AMP) Integration: To mitigate interference from different initial postures and encourage natural standing-up motions, an AMP-based reward function is introduced. A discriminator judges whether an episode's motion is human-like (from the reference motion) or robot-generated, and its output guides the robot to recover smoothly; as a result, the policy learns behaviors such as using the arms to support the body against the ground while standing up. The detailed AMP formulation is in Appendix A.
- Focus: The policy specifically focuses on recovery on flat terrains.
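The weighted multi-critic advantage referenced above can be sketched as follows. This is an illustrative NumPy snippet, not the paper's implementation; the group weights are assumed values.

```python
import numpy as np

def combined_advantage(advantages, weights, eps=1e-8):
    """Combine per-critic advantages via per-group normalization and weighting.

    advantages: list of arrays, one advantage batch per reward group / critic
    weights   : weighting coefficients omega_i (assumed for illustration)
    """
    total = np.zeros_like(advantages[0])
    for adv, w in zip(advantages, weights):
        total += w * (adv - adv.mean()) / (adv.std() + eps)  # normalize each group
    return total

# Toy example: three reward groups with very different scales.
groups = [np.random.randn(8) * s for s in (1.0, 10.0, 0.1)]
print(combined_advantage(groups, weights=[1.0, 0.5, 0.5]))
```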
4.2.2. Walking Behavior Policy ($\pi_w^b$)
This policy is trained to enable the humanoid robot to perform fundamental locomotion.
- Methodology: A simple framework with specific reward functions is used.
- Training Setup: The policy learns to walk on flat terrain in response to a velocity command $c_t$.
- Adversarial Motion Prior (AMP) Integration: Similar to $\pi_r^b$, an AMP-based reward function is used to encourage human-like walking and accelerate convergence.
- Focus: While initially trained only for walking on flat terrain, the paper notes that after distillation and RL fine-tuning this behavior improves significantly in robustness and terrain adaptability.
4.2.3. Behavior Distillation
After training the behavior-specific policies $\pi_r^b$ and $\pi_w^b$, a behavior distillation process is performed using DAgger (Dataset Aggregation) to integrate their knowledge into a single Mixture-of-Experts (MoE)-based multi-behavior policy $\pi^d$.
- Purpose: This process eliminates the gradient conflicts that would arise from directly combining distinct reward landscapes in RL. The MoE module allows the policy to automatically assign different experts to distinct behaviors, and this prior knowledge is leveraged for efficient multi-behavior improvement and multi-terrain adaptability in the subsequent RL fine-tuning stage.
- DAgger Process: During training, the robot is initialized in either a fallen or a standing posture. The student policy is supervised by the appropriate teacher policy ($\pi_r^b$ or $\pi_w^b$) according to the behavior it should perform.
- Loss Function: The loss function for $\pi^d$ is computed as:
$
\mathcal{L}_{\pi^{d}}(s_{t}) = \begin{cases} \mathbb{E}_{s_{t},\pi^{d},\pi_{r}^{b}}\left[\left\lVert a_{t}^{\pi^{d}} - a_{t}^{\pi_{r}^{b}}\right\rVert_{2}^{2}\right], & s_{t}\in \mathcal{S}_{r}\\[4pt] \mathbb{E}_{s_{t},\pi^{d},\pi_{w}^{b}}\left[\left\lVert a_{t}^{\pi^{d}} - a_{t}^{\pi_{w}^{b}}\right\rVert_{2}^{2}\right], & s_{t}\in \mathcal{S}_{w} \end{cases} \quad (4)
$
where:
-
$a_t^{\pi^d}$: Actions sampled from the student policy $\pi^d$.
-
$a_t^{\pi_r^b}$: Actions sampled from the recovery teacher policy $\pi_r^b$.
-
$a_t^{\pi_w^b}$: Actions sampled from the walking teacher policy $\pi_w^b$.
-
$\mathcal{S}_r$: The state space corresponding to standing-up (recovery) behavior.
-
$\mathcal{S}_w$: The state space corresponding to walking behavior.

The loss minimizes the squared L2-norm difference between the actions of the student policy and those of the corresponding teacher policy.
- Inputs and Benefits: The distillation process uses the same domain randomization as the behavior-specific policy training and takes only proprioception as input (a code sketch of the distillation loop follows this subsection). This process not only integrates the basic behaviors into a single policy but also enhances them individually (e.g., walking becomes more robust because the policy learns to recover from near-falls, and natural standing poses facilitate smooth transitions into walking).
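Below is a minimal, hypothetical PyTorch sketch of the DAgger-style distillation loop described above (not the authors' implementation). The teacher is selected from the robot state, and the student is regressed onto the teacher's action as in Eq. (4); the base-height threshold follows Appendix B, while the network stand-ins are toy placeholders.

```python
import torch

def select_teacher(base_height, recovery_teacher, walking_teacher, threshold=0.5):
    """Pick the supervising teacher from the robot state
    (base height > 0.5 m => walking, else recovery, as in Appendix B)."""
    return walking_teacher if base_height > threshold else recovery_teacher

def distillation_step(student, teachers, obs, base_heights, optimizer):
    """One DAgger-style update: the student acts, the teacher labels the visited states."""
    losses = []
    for o, h in zip(obs, base_heights):
        teacher = select_teacher(h, *teachers)
        with torch.no_grad():
            a_teacher = teacher(o)                            # expert action
        a_student = student(o)                                # student action on same state
        losses.append(((a_student - a_teacher) ** 2).sum())   # squared L2, as in Eq. (4)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with linear stand-ins for the three policies (69-D obs, 20-D action).
obs_dim, act_dim = 69, 20
student = torch.nn.Linear(obs_dim, act_dim)
teachers = (torch.nn.Linear(obs_dim, act_dim), torch.nn.Linear(obs_dim, act_dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
batch = [torch.randn(obs_dim) for _ in range(4)]
heights = [0.7, 0.3, 0.8, 0.2]
print(distillation_step(student, teachers, batch, heights, opt))
```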
4.3. RL Fine-Tuning
In the second stage, the distilled multi-behavior policy is further fine-tuned to enhance its terrain adaptability for both the fall recovery and walking tasks on complex terrains; the result is the final fine-tuned AHC policy.
- Initialization: The policy is initialized with the parameters of the distilled policy $\pi^d$ from the first stage.
- Reward: AMP-based rewards (using the same reference motions) are again employed to maintain human-likeness.
- Efficiency: Leveraging the MoE module and the prior knowledge in $\pi^d$, gradient conflicts are alleviated, enabling efficient learning of adaptive behaviors.
- Training Setup: The policy is fine-tuned with PPO on two GPUs, each GPU handling one task (recovery or walking) under its own environment setup. The policies for the different tasks share the same set of parameters for the actor network.
4.3.1. Behavior-Specific Critics and Shared Actor
- Challenge: In PPO, while the policy gradient uses normalized advantages for stability, the value loss relies on unnormalized return targets. If tasks have vastly different reward scales, the value loss of the task with larger rewards can dominate the gradient updates, hindering learning for the other task.
- Solution: The AHC framework uses behavior-specific critics with a shared actor during fine-tuning.
  - Behavior-Specific Critics: Each task (recovery or walking) is assigned a separate critic. This isolates value function learning for each task's specific reward function, allowing more accurate value estimation and even customized critic architectures (e.g., multiple critics for the standing-up behavior, as used in $\pi_r^b$).
  - Shared Actor: The actor network, which outputs the actions, shares its parameters across all tasks. It is updated using policy gradients aggregated from all tasks, facilitating skill transfer and terrain adaptability.
4.3.2. Eliminating Gradient Conflict in Multi-Task Learning
Even with behavior-specific critics, the shared actor can still receive potentially conflicting gradients from the different tasks.
- Solution: Projecting Conflicting Gradients (PCGrad) (Yu et al. 2020) is applied to resolve these conflicts.
- Mechanism: For any pair of task gradients ($\mathbf{g}_i$ and $\mathbf{g}_j$), if they conflict (i.e., their cosine similarity is negative, meaning they point in opposing directions), the gradient of one task is projected onto the normal plane of the other. This removes the conflicting directional component while preserving progress in the non-conflicting subspace.
- Projection Formula: Given two task gradients $\mathbf{g}_i$ and $\mathbf{g}_j$, if they conflict, the projected gradient for task $i$ is computed as:
$
\mathbf{g}_i = \mathbf{g}_i - \frac{\mathbf{g}_i\cdot\mathbf{g}_j}{\lVert\mathbf{g}_j\rVert^2}\,\mathbf{g}_j \quad (5)
$
where:
  - $\mathbf{g}_i\cdot\mathbf{g}_j$: The dot product of the two gradients.
  - $\lVert\mathbf{g}_j\rVert^2$: The squared Euclidean norm (magnitude) of gradient $\mathbf{g}_j$. The term $\frac{\mathbf{g}_i\cdot\mathbf{g}_j}{\lVert\mathbf{g}_j\rVert^2}\mathbf{g}_j$ is the component of $\mathbf{g}_i$ parallel to $\mathbf{g}_j$; subtracting it projects $\mathbf{g}_i$ onto the plane orthogonal to $\mathbf{g}_j$.
- Implementation: PCGrad is integrated before the actor update step. Each task computes its local actor gradient on its dedicated GPU. All gradients are then sent to a main process, where PCGrad performs the gradient surgery. After the optimizer step with these conflict-free gradients, the updated parameters are broadcast back to all workers. This enables efficient multi-task RL learning without gradient conflicts.
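A minimal sketch of the PCGrad projection for the two-task case described above (illustrative PyTorch code, not the authors' implementation); with only two tasks, the projection order is chosen at random each step, as noted in Appendix B.

```python
import random
import torch

def pcgrad_two_tasks(g_walk, g_recover):
    """Project one task gradient onto the normal plane of the other if they conflict.

    g_walk, g_recover: flattened actor gradients for the two tasks.
    Returns the summed, conflict-free gradient for the shared actor.
    """
    g_i, g_j = g_walk.clone(), g_recover.clone()
    if random.random() < 0.5:            # random projection order in the two-task case
        g_i, g_j = g_j, g_i
    dot = torch.dot(g_i, g_j)
    if dot < 0:                          # negative cosine similarity => conflict
        g_i = g_i - dot / (g_j.norm() ** 2 + 1e-12) * g_j   # Eq. (5)
    return g_i + g_j

# Toy example with conflicting 3-D gradients.
g1 = torch.tensor([1.0, 0.0, 0.0])
g2 = torch.tensor([-0.5, 1.0, 0.0])
print(pcgrad_two_tasks(g1, g2))
```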
4.3.3. Terrain Curriculum
Following previous work (Rudin et al. 2022), a terrain curriculum is adopted to improve learning efficiency and adaptability on diverse terrains.
- Mechanism: An automatic level-adjustment mechanism dynamically modulates terrain difficulty based on task-specific performance.
- Terrain Types: The curriculum includes four terrain types for both tasks:
  - Flat: Basic, level ground.
  - Slope: Inclined surfaces with a capped maximum inclination.
  - Hurdle: Terrains with regularly spaced vertical obstacles (maximum height 0.1 m).
  - Discrete: Terrains composed of randomly placed rectangular blocks (width/length 0.5 m-2.0 m, heights 0.03 m-0.15 m, 20 obstacles).
- Setup: The terrain map is arranged into a grid of patches, with 10 difficulty levels and 3 columns per terrain type, allowing systematic progression in difficulty.
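An illustrative sketch of an automatic level-adjustment rule is shown below. The paper follows Rudin et al. (2022) but does not spell out the exact rule here, so the promotion/demotion thresholds are assumptions.

```python
def update_terrain_level(level, traversed_frac, max_level=9,
                         promote_at=0.8, demote_at=0.4):
    """Move an environment up or down the 10-level terrain curriculum based on
    performance (fraction of the terrain patch traversed). Thresholds are assumed."""
    if traversed_frac > promote_at:
        return min(level + 1, max_level)
    if traversed_frac < demote_at:
        return max(level - 1, 0)
    return level

# Example: a robot that crossed 85% of its patch is promoted to the next level.
print(update_terrain_level(level=3, traversed_frac=0.85))  # -> 4
```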
Appendix A. AMP Reward Formulation and Discriminator Objective
The Adversarial Motion Prior (AMP) mechanism (Peng et al. 2021; Escontrela et al. 2022) is adopted to provide a style reward that encourages natural and human-like behaviors.
A.1. AMP Input State and Temporal Context
AMP input state ($s_t^{\mathrm{amp}}$): Extracted by taking the 20 joint positions from the full observation.
AHCframework provides thediscriminatorwith atemporal contextby feeding a 5-step window ofAMP states. The input sequence for the discriminator is defined as: $ \tau_{t}=(s_{t- 3}^{\mathrm{amp}},s_{t- 2}^{\mathrm{amp}},s_{t- 1}^{\mathrm{amp}},s_{t}^{\mathrm{amp}},s_{t+ 1}^{\mathrm{amp}}). $ This sequence provides a richer understanding of the motion trajectory to the discriminator.
A.2. Discriminator Objective
The AMP consists of a discriminator that distinguishes between reference motion sequences (from a human motion dataset ) and on-policy rollouts (from the robot policy ).
- Training: The discriminator is trained to assign
higher scorestoreference sequencesandlower scorestopolicy-generated ones. - Objective Function: Its objective is formulated as a
least-squares GANloss with agradient penalty: $ \arg \max_{\phi}\mathbb{E}{\tau \sim \mathcal{M}}[(D{\phi}(\tau) - 1)^{2}] + \mathbb{E}{\tau \sim \mathcal{P}}[(D{\phi}(\tau) + 1)^{2}] +\frac{\alpha^{d}}{2}\mathbb{E}{\tau \sim \mathcal{M}}[||\nabla{\phi}D_{\phi}(\tau)||_{2}],\qquad (6) $ where:- : The parameters of the
discriminatornetwork. - : The loss term encouraging the discriminator to output a score of 1 for real motion sequences from the reference dataset .
- : The loss term encouraging the discriminator to output a score of -1 for fake motion sequences generated by the policy .
- : A
gradient penaltyterm that helps mitigate training instability by penalizing large gradients of the discriminator's output with respect to its input. - : A manually specified coefficient controlling the strength of the
gradient penalty. - : The scalar score predicted by the discriminator for the state sequence .
- : The parameters of the
A.3. Style Reward Function
The discriminator output $d$ is used to define a smooth surrogate reward function $r^{\mathrm{style}}$:
$
r^{\mathrm{style}}(s_{t}) = \alpha \cdot \max \left(0,1 - \frac{1}{4} (d - 1)^{2}\right), \quad (7)
$
where $\alpha$ is a scaling factor. This reward encourages the policy to produce locomotion that closely resembles the motions in the reference dataset.
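A minimal sketch of the style reward in Eq. (7) (illustrative code; the value of the scaling factor is an assumption):

```python
def style_reward(d, alpha=1.0):
    """AMP style reward of Eq. (7): maps the discriminator score d
    (trained towards +1 on reference motion, -1 on policy motion)
    into a bounded, smooth reward in [0, alpha]."""
    return alpha * max(0.0, 1.0 - 0.25 * (d - 1.0) ** 2)

# A score near +1 ("looks like reference motion") yields ~alpha; a score of -1 yields 0.
print(style_reward(0.9), style_reward(-1.0))
```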
A.4. Total Reward
The total reward used for policy optimization is the sum of the task reward $r_t^{\mathrm{task}}$ (designed for the specific objective, e.g., reaching a target velocity or standing up) and the style reward $r_t^{\mathrm{style}}$:
$
r_{t} = r_{t}^{\mathrm{task}} + r_{t}^{\mathrm{style}}. \quad (8)
$
In the AHC setup, each task (locomotion and recovery) is associated with an independent discriminator and its corresponding reference motion data.
Appendix B. Training Details
B.1. Multi-Behaviors Distillation Policy
-
Framework: A
teacher-student framework is used. Two behavior-specific policies ($\pi_r^b$ and $\pi_w^b$) are trained as teacher policies using PPO. These teachers have access to privileged information (simulation-only data) for efficient learning. Their learned skills are then distilled into a basic multi-behavior student policy ($\pi^d$) that operates on proprioception only.
Network Architecture:
- Actor-Critic: Each
behavior-specific policyuses anactor-critic architecture. - History Encoder: A
history encoder(shared between actor and critic) processes 10 steps of history observation using a 3-layerMLP(Multi-Layer Perceptron) with hidden dimensions[1024, 512, 128], outputting a 64-dimensionallatent embedding. - Actor Network: The
latent embeddingis concatenated with thecurrent observationand fed to theactor. The actor is a 3-layerMLPwith hidden sizes[512, 256, 128], outputtingmean actionswith alearnable diagonal Gaussian standard deviation. - Critic Network: The
latent embeddingandcurrent observationare also fed to thecritic. The critic consists ofindependent networks, each a 3-layerMLPwith hidden dimensions[512, 256]. Each critic corresponds to a specificreward group(as used in the multi-critic setup for recovery).
- Actor-Critic: Each
-
PPO Parameters: Standard
PPOformulation withclipped surrogate lossandGAE.Adam optimizerwith learning rate .- Rollouts: 32 environment steps per
PPO iteration. - Learning epochs: 5 epochs per
PPO iteration. - Minibatches: 4 minibatches per epoch.
Discount factor.GAE lambda.Clipping ratio: 0.2.Value loss coefficient: 1.0.
-
Reward Definitions: Detailed in Table 3. Recovery rewards are grouped into four categories, following
Huang et al. (2025b). -
MoE Actor for Distillation: The
student policyuses anMoE architecturefor itsactor networkto increase capacity.- Experts: Comprises 2
experts, each anMLPwith the same hidden dimensions as the behavior-specific policy's actor. - Gate Network: An
MLPwith hidden dimensions[512, 256, 128]that determines the mixing weights for the experts' output actions.
- Experts: Comprises 2
-
Distillation Loss: The total
distillation lossis a weighted sum of two components: $ \begin{array}{rl} & {\mathcal{L}{\mathrm{distill}} = \lambda{\mathrm{MSE}}\cdot \mathbb{E}{a^{d\sim \pi^{d}},a^{b\sim \pi^{b}}}\left[\left|\left|a^{d} - a^{b}\right|\right|^{2}\right]}\ & {\qquad +\lambda{\mathrm{KL}}\cdot \mathbb{E}\left[\mathrm{KL}\left(\pi^{d}\Vert \pi^{b}\right)\right]} \end{array} \quad (9) $ where:- : Weight for the
Mean Squared Error (MSE)loss, which penalizes the difference between thestudent policy's action() and theteacher policy's action(). - : Weight for the
Kullback-Leibler (KL) divergenceloss, which encourages thestudent policy's action distributionto be similar to theteacher policy's action distribution.
- : Weight for the
-
Distillation Procedure (Algorithm 1):
Algorithm 1: Behavior Cloning via Multi-Expert Distillation
Require: behavior-specific policies, multi-behavior policy, number of environments, rollout length, number of update epochs, minibatch size
1: Initialize storage
2: for each iteration do
3:   Collect rollouts in parallel environments:
4:   for t = 1 to rollout length do
5:     Observe the current state
6:     Select the behavior (teacher) policy based on the state
7:     Query the expert action
8:     Query the student action
9:     Store the transition in storage
10:  end for
11:  for epoch = 1 to number of update epochs do
12:    Sample minibatches from storage
13:    Compute the loss (Eq. (9))
14:    Update the student policy via gradient descent on the loss
15:  end for
16:  Clear storage
17: end for
- Parameters: the number of parallel environments, the rollout length, the number of update epochs, and the minibatch size are as specified for Algorithm 1.
- Behavior Selection: is selected based on the robot's base height: if base height
> 0.5m, walking policy is used; otherwise, recovery policy is used. - Student Learning Rate: .
B.2. RL Fine-tuned Policy
- Initialization: The
multi-behaviors policyis used to initialize thesecond-stage RL fine-tuning. The network architecture remains unchanged. - Learning Rate: The policy learning rate is reduced to (from in distillation) to prevent
policy collapseandcatastrophic forgettingof previously acquired skills during fine-tuning on diverse terrains. - Gradient Surgery (
PCGrad):- After computing task-specific
policy gradients( for walking, for recovery),PCGradis applied. - If
gradient conflictis detected (), one gradient is projected onto thenormal planeof the other. Stochastic Projection: Since there are only two tasks, the projection direction (locomotion onto recovery, or vice versa) israndomly selectedat each update step to avoid bias.
- After computing task-specific
- Domain Randomization and PD Gains: Both training stages use the
same domain randomization settings(Table 4) andjoint PD gains(Table 5).-
Domain randomization parametersare resampled at the beginning of each episode to enhance robustness to varying environmental and physical conditions and prevent overfitting.The following table details the reward functions used in both training stages, specifically for the walking and full recovery tasks.
-
The following are the results from Table 3 of the original paper:
| Term | Scale |
|---|---|
| Walking reward | |
| Track lin. vel. | 2.0 |
| Track ang. vel. | 2.0 |
| Joint acc. | -5e-7 |
| Joint vel. | -1e-3 |
| Action rate | -0.03 |
| Action smoothness | -0.05 |
| Angular vel. (x, y) | -0.05 |
| Orientation | -2.0 |
| Joint power | -2.5e-5 |
| Feet clearance | -0.25 |
| Feet stumble | -1.0 |
| Torques | -1e-5 |
| Arm joint deviations | -0.5 |
| Hip joint deviations | -0.5 |
| Hip joint deviations | -1.0 |
| Joint pos. limits | -2.0 |
| Joint vel. limits | -1.0 |
| Torque limits | -1.0 |
| Feet slippage | -0.25 |
| Collision | -15.0 |
| Feet air time | 1.0 |
| Stuck | -1.0 |
| Full Recovery reward | |
| Task Reward: Orientation | 1.0 |
| Task Reward: Head height | 1.0 |
| Style Reward: Hip joint deviation | -10.0 |
| Style Reward: Shoulder roll deviation | -0.2 |
| Style Reward: Thigh orientation | 10.0 |
| Style Reward: Feet distance | -10.0 |
| Style Reward: Angular vel. (x, y) | 25.0 |
| Style Reward: Foot displacement | |
| Regularization Reward: Joint acc. | |
| Regularization Reward: Joint vel. | |
| Regularization Reward: Action rate | -0.05 |
| Regularization Reward: Action smoothness | -0.05 |
| Regularization Reward: Torques | |
| Regularization Reward: Target rotation | |
| Regularization Reward: Joint pos. limits | -2.0 |
| Regularization Reward: Joint vel. limits | -1.0 |
| Post-Task Reward: Angular vel. (x, y) | 10.0 |
| Post-Task Reward: Base lin. vel. (x, y) | 10.0 |
| Post-Task Reward: Orientation | 10.0 |
| Post-Task Reward: Base height | 10.0 |
| Post-Task Reward: Target joint deviations | 10.0 |
The table lists the reward terms and their scales (weights) used for the Walking and Full Recovery tasks; the recovery terms are grouped into Task, Style, Regularization, and Post-Task rewards. Several recovery terms use a Gaussian-style formulation (the fou function), as detailed in Huang et al. (2025b). Together, these terms encourage desired behaviors (e.g., tracking the commanded velocity, maintaining orientation, proper joint movement) and penalize undesirable ones (e.g., high joint acceleration and velocity, large torques, collisions).
The following are the results from Table 4 of the original paper:
| Term | Randomization Range | Unit |
|---|---|---|
| Restitution | [0,1] | - |
| Friction coefficient | [0.1,1] | - |
| Base CoM offset | [-0.03, 0.03] | m |
| Mass payload | [-2, 5] | Kg |
| Link mass | [0.8, 1.2]× default value | Kg |
| Kp Gains | [0.8, 1.25] | Nm/rad |
| Kd Gains | [0.8, 1.25] | Nms/rad |
| Actuation offset | [-0.05, 0.05] | Nm |
| Motor strength | [0.8, 1.2]× motor torque | Nm |
| Actions delay | [0,100] | ms |
| Initial joint angle scale | [0.85,1.15]× default value | rad |
| Initial joint angle offset | [-0.1,0.1] | rad |
Table 4 lists the domain randomization settings and their respective ranges used during both training stages. These parameters are randomly varied in simulation to enhance the policy's robustness and transferability to the real world. Examples include varying physical properties like restitution (bounciness), friction coefficient, base center of mass (CoM) offset, mass payload, and link mass. Motor control parameters such as Kp and Kd gains, actuation offset, and motor strength are also randomized. Finally, actions delay and initial joint state parameters (angle scale and offset) are randomized to account for real-world uncertainties.
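As a small illustration of how such ranges could be resampled at the start of each episode, here is a hypothetical sketch (parameter names are illustrative and only a subset of Table 4 is shown):

```python
import numpy as np

# Subset of the Table 4 ranges, uniformly sampled per episode.
RANDOMIZATION_RANGES = {
    "friction_coefficient": (0.1, 1.0),
    "base_com_offset_m": (-0.03, 0.03),
    "mass_payload_kg": (-2.0, 5.0),
    "kp_gain_scale": (0.8, 1.25),
    "kd_gain_scale": (0.8, 1.25),
    "action_delay_ms": (0.0, 100.0),
}

def sample_domain_randomization(rng=None):
    """Draw one set of physical parameters for a new training episode."""
    if rng is None:
        rng = np.random.default_rng()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

print(sample_domain_randomization())
```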
The following are the results from Table 5 of the original paper:
| Joint | Kp | Kd |
|---|---|---|
| Hip | 150 | 4 |
| Knee | 200 | 6 |
| Ankle | 40 | 2 |
| Shoulder | 40 | 4 |
| Elbow | 100 | 4 |
Table 5 specifies the PD gains ($K_p$ for stiffness, $K_d$ for damping) used for each major joint group of the humanoid robot during both training and real-world deployment. These values govern how the policy's target joint positions are converted into motor torques.
5. Experimental Setup
The experiments for Adaptive Humanoid Control (AHC) were conducted in a two-part manner: first in a simulation environment for training and initial evaluation, and then on a real-world humanoid robot for validation.
5.1. Datasets
The paper does not use traditional "datasets" in the supervised learning sense for direct policy training. Instead, it relies on:
- Simulation Environments: The
IsaacGym simulator was used for all training and simulation-based evaluations. IsaacGym provides high-performance GPU-based physics simulation, enabling massively parallel deep reinforcement learning with 4096 parallel environments, which is crucial for efficiently training complex robot behaviors.
AMP:Retargeted motion capture datawas used forrecovery behaviors. This refers to human motion data that has been adapted to fit the robot's kinematics, providing naturalistic examples of how a human would stand up.LAFAN1 datawas used forlocomotion behaviors.LAFAN1is a dataset of human locomotion motions, providing diverse examples of human walking, running, etc. These "datasets" forAMPare essential because they provide thehuman motion priorsthat guide theRL policytowards generating natural and stable movements, which is particularly effective for standing-up and walking. They are effective for validating human-like movement generation, a core aspect of humanoid control.
5.2. Evaluation Metrics
For both the locomotion and fall recovery tasks, the paper primarily reports Success Rate (Succ.), along with Traversing Distance (Dist.) for the locomotion task.
5.2.1. Success Rate (Succ.)
-
Conceptual Definition:
Success ratequantifies the percentage of trials in which the robot achieves a predefined objective without premature termination. It focuses on the robot's ability to complete the task reliably under specified conditions. -
For Locomotion Task:
- Definition: The percentage of trials where the robot traverses the full
8mterrain within20swithout termination. - Context: During evaluation, the robot is given a fixed forward velocity command, uniformly sampled from , at the start of each episode. An episode terminates if the robot either
walks off the current8m \times 8mterrain patchorfalls irrecoverably. Formulti-behavior policies(like and ), falling does not trigger termination, as the robot is expected to recover and resume.
- Definition: The percentage of trials where the robot traverses the full
-
For Recovery Task:
- Definition: The percentage of trials where the robot successfully
stands up from a fallen posture and maintains balance without falling again within 10 s.
- Definition: The percentage of trials where the robot successfully
5.2.2. Traversing Distance (Dist.)
-
Conceptual Definition:
Traversing distancemeasures the average physical distance covered by the robot during a trial, regardless of whether the trial was successful or failed. It provides an indication of how far the robot can navigate before encountering an insurmountable obstacle or failing. -
Mathematical Formula: There is no single standardized mathematical formula for "traversing distance" as it depends on the simulation environment's coordinate system and how distance is recorded. Generally, it refers to the Euclidean distance covered by the robot's base or center of mass from its starting point until termination. Let be the initial position of the robot's base and be its position at termination. The 2D traversing distance would be: $ \text{Dist} = \sqrt{(x_t - x_0)^2 + (y_t - y_0)^2} $ The paper calculates this over all trials, including successful and failed ones.
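The two metrics can be computed from logged trials as in the following sketch (illustrative code; the trial record format is an assumption):

```python
import math

def evaluate_trials(trials):
    """Compute Success Rate and mean Traversing Distance over logged trials.

    Each trial is a dict with 'success' (bool) and 'start'/'end' 2-D base positions.
    The distance is averaged over all trials, successful or not.
    """
    succ = sum(t["success"] for t in trials) / len(trials)
    dist = sum(math.dist(t["start"], t["end"]) for t in trials) / len(trials)
    return succ, dist

trials = [
    {"success": True, "start": (0.0, 0.0), "end": (8.0, 0.3)},
    {"success": False, "start": (0.0, 0.0), "end": (3.1, -0.4)},
]
print(evaluate_trials(trials))  # -> (0.5, about 5.6)
```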
5.3. Baselines
To evaluate the effectiveness of the AHC framework, it was compared against several baseline methods, as well as intermediate policies from its own training stages:
-
HOMIE (Ben et al. 2025): This method focuses on
humanoid loco-manipulation. For comparison, its lower-body locomotion policy was re-trained on the specific terrain settings used in the AHC experiments. This baseline represents a state-of-the-art single-skill locomotion controller.
-
HoST (Huang et al. 2025b): This method focuses on
learning humanoid standing-up control across diverse postures. Its standing-up controller was likewise re-trained on the terrain settings used in AHC. This baseline represents a state-of-the-art single-skill recovery controller.
-
Fail Recovery Policy (): This is the
behavior-specific policy for fall recovery trained in the first stage of the AHC framework. It shows the performance of the recovery component before distillation and fine-tuning.
-
Walking Policy (): This is the
behavior-specific policy for walking trained in the first stage of the AHC framework. It shows the performance of the locomotion component before distillation and fine-tuning.
-
Basic Multi-Behaviors Policy (): This is the
basic multi-behavior policy obtained from the first-stage distillation process in AHC. It represents the combined knowledge of basic recovery and walking behaviors before RL fine-tuning for terrain adaptability.

These baselines and intermediate policies are representative because they cover both
single-skill (locomotion or recovery) state-of-the-art methods and the foundational components of the AHC framework itself, allowing for a detailed component-wise and holistic performance comparison.
5.4. General Experimental Setup
- Simulator: IsaacGym for all training and simulation evaluations, leveraging 4096 parallel environments.
- Training Iterations:
  - Behavior-specific policies: 10,000 iterations.
  - Policy distillation (to obtain the basic multi-behavior policy): 4,000 iterations.
  - RL fine-tuning (to obtain the final AHC policy): 10,000 additional iterations using online RL.
- Hardware: Two NVIDIA RTX 4090 GPUs for RL fine-tuning.
- Action Space: 20 Degrees of Freedom (DoF) action space (excluding waist joints).
- Policy Deployment: On a Unitree G1 humanoid robot.
- Motor Control: A PD controller converts target joint positions into torques (see the sketch after this list).
- Evaluation Environments: Four types of terrain patches:
  - Flat.
  - Slope: Inclination angle uniformly sampled within a specified range.
  - Hurdle: Obstacle heights uniformly sampled between 0.08m and 0.1m. The locomotion task uses 3 obstacles; the recovery task uses 8 obstacles (more cluttered).
  - Discrete: Randomly positioned rectangular blocks with height variations from 0.08m to 0.1m.
- Evaluation Trials: All evaluations were conducted using 1000 parallel environments for statistical significance.
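To illustrate the PD control step named in the setup above, here is a minimal sketch of converting policy joint-position targets into motor torques; the gains, default targets, and example values below are placeholders, not the paper's deployment parameters.

```python
import numpy as np

def pd_torques(q_target, q, dq, kp, kd):
    """PD law: torque = kp * (target position - position) - kd * velocity.

    q_target : desired joint positions output by the policy
    q, dq    : measured joint positions and velocities
    kp, kd   : proportional and derivative gains (per joint)
    """
    return kp * (q_target - q) - kd * dq

# Hypothetical example for a 20-DoF action space (gains are placeholders).
num_dof = 20
kp = np.full(num_dof, 40.0)   # stiffness
kd = np.full(num_dof, 1.0)    # damping
q_target = np.zeros(num_dof)  # policy output (default pose here)
q = np.random.uniform(-0.05, 0.05, num_dof)
dq = np.zeros(num_dof)
tau = pd_torques(q_target, q, dq, kp, kd)
```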
6. Results & Analysis
6.1. Core Results Analysis
The paper presents experimental results from both simulation and real-world deployments to validate the AHC framework's effectiveness.
The following are the results from Table 1 of the original paper:
| Method | Plane Succ. | Plane Dist. | Slope Succ. | Slope Dist. | Hurdle Succ. | Hurdle Dist. | Fail Recovery Succ. |
|---|---|---|---|---|---|---|---|
| HOMIE (Ben et al. 2025) | 0.802 | 6.421 | 0.599 | 4.795 | 0.407 | 3.259 | - |
| HoST (Huang et al. 2025b) | - | - | - | - | - | - | 0.801 |
| Fail Recovery Policy | - | - | - | - | - | - | 0.852 |
| Walking Policy | 0.825 | 6.602 | 0.627 | 4.981 | 0.428 | 3.456 | - |
| Basic Multi-Behaviors Policy | 0.898 | 7.184 | 0.698 | 5.584 | 0.521 | 4.168 | 0.885 |
| AHC (Ours) | 0.915 | 7.320 | 0.781 | 6.248 | 0.612 | 4.900 | 0.923 |
Analysis of Table 1 (performance evaluation on locomotion and fail recovery tasks across different terrains):
- Overall Superiority of AHC: The proposed AHC policy consistently outperforms all baselines and intermediate policies across both locomotion and recovery tasks on various terrain types. In locomotion, AHC achieves the highest Success Rate and Traversing Distance on all listed terrains (Plane: 0.915 Succ., 7.320 Dist.; Slope: 0.781 Succ., 6.248 Dist.; Hurdle: 0.612 Succ., 4.900 Dist.; Discrete results are not explicitly listed, but given the trend AHC would likely remain highest). For Fail Recovery, AHC achieves a 0.923 Success Rate on discrete terrain, surpassing HoST (0.801) and even its own first-stage Fail Recovery Policy (0.852).
- Advantage in Locomotion on Complex Terrains: AHC's significant performance gain in locomotion tasks, especially on hurdle and discrete terrains, is attributed to its ability to autonomously recover from falls and resume traversal. This multi-behavior capability is critical where obstacles frequently cause imbalances. HOMIE, a strong locomotion baseline, shows lower success rates and traversing distances on challenging terrains, likely because it lacks explicit recovery capabilities.
- Impact of AMP: The paper highlights that the Adversarial Motion Prior (AMP) mechanism, by providing motion priors, guides the policy towards stable and robust behaviors. This contributes to AHC's better performance compared to HoST, which lacks AMP integration.
- Benefits of Multi-Behavior Distillation: Comparing the Basic Multi-Behaviors Policy with the behavior-specific policies shows that distillation brings superior robustness in the locomotion task on complex terrains. For example, on Slope terrain, the basic multi-behavior policy achieves 0.698 Success and 5.584 Distance, significantly better than the walking policy (0.627 Succ., 4.981 Dist.). This indicates that the seamless integration of walking and recovery behaviors within a single policy (via distillation) inherently improves robustness, even before specific terrain adaptation.
- Effectiveness of RL Fine-Tuning: The RL fine-tuning stage, which trains on diverse terrains to produce the final AHC policy, yields further improvements. For instance, on Slope terrain, AHC achieves 0.781 Success and 6.248 Distance for locomotion, up from the basic policy's 0.698 Success and 5.584 Distance. This highlights the transferability of the two-stage training framework, where the basic multi-behavior policy is effectively adapted to challenging terrains.
6.2. AMP for Standing-Up
As can be seen from the results in Figure 3, the paper compares the recovery motions generated by AHC (which incorporates AMP) against HoST (which does not use AMP).

Figure 3: Comparison of recovery motions under AHC and HoST. We compare our AHC (with AMP) against the HoST (w/o AMP) in both lying and prone scenarios. AHC produces smoother recovery behaviors. This highlights the effectiveness of AMP in guiding the learning of naturalistic recovery motions.
- Qualitative Observation: Figure 3 visually demonstrates that AHC generates smoother and more natural recovery behaviors from both lying and prone positions. HoST, lacking AMP, exhibits uncoordinated and jerky motions, often relying on abrupt limb movements to stand up. In contrast, AHC produces a natural get-up motion, including leg folding, arm support, and trunk lifting, which are characteristic of human recovery. This provides strong visual evidence for the effectiveness of AMP in shaping the controller towards human-like dynamics.

As can be seen from the results in Figure 4, the paper further analyzes the joint acceleration during recovery.
Figure 4: Joint acceleration analysis of the left leg during recovery. Acceleration profiles of hip and knee joints from the left leg illustrate that our AHC results in stable joint actuation, with notably fewer abrupt fluctuations compared to HoST.
- Quantitative Observation: Figure 4 presents the acceleration profiles of the hip and knee joints of the left leg during recovery. The AHC policy exhibits stable joint actuation with notably fewer abrupt fluctuations compared to the HoST policy. This indicates that AMP leads not only to visually natural motions but also to mechanically smoother and more controlled movements, which are crucial for robustness and energy efficiency on real robots. These results confirm that AMP helps in learning stable recovery controllers, a feat difficult to achieve with hand-crafted reward functions alone. A small sketch of how such acceleration profiles can be derived from logged joint velocities follows.
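The kind of smoothness comparison shown in Figure 4 can, in principle, be reproduced by finite-differencing logged joint velocities. The sketch below is an independent illustration; the logging format, sampling rate, and fluctuation score are assumptions, not the paper's analysis script.

```python
import numpy as np

def joint_acceleration(joint_vel, dt):
    """Estimate joint accelerations by finite-differencing logged velocities.

    joint_vel : array of shape (T, num_joints), sampled every dt seconds
    returns   : array of shape (T - 1, num_joints)
    """
    return np.diff(joint_vel, axis=0) / dt

def fluctuation_score(acc):
    """Mean absolute change in acceleration between steps (lower = smoother)."""
    return np.mean(np.abs(np.diff(acc, axis=0)))

# Hypothetical velocity log for two joints (e.g., hip and knee) at 50 Hz.
dt = 0.02
t = np.arange(0.0, 3.0, dt)
joint_vel = np.stack([np.sin(2 * np.pi * t), 0.5 * np.cos(2 * np.pi * t)], axis=1)
acc = joint_acceleration(joint_vel, dt)
print("fluctuation score:", fluctuation_score(acc))
```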
6.3. Ablation on PCGrad and Behavior-Specific Critics
The paper conducts an ablation study to evaluate the individual contributions of PCGrad and behavior-specific critics in the second-stage fine-tuning. Four configurations are examined:
- AHC-SC-w/o-PC: Single shared critic without PCGrad.
- AHC-SC-PC: Single shared critic with PCGrad.
- AHC-BC-w/o-PC: Behavior-specific critics without PCGrad.
- AHC (Ours): Behavior-specific critics with PCGrad (the full proposed method).

The following are the results from Table 2 of the original paper:

| Method | Cosine Similarity (↑) |
|---|---|
| AHC-SC-w/o-PC | 0.247 |
| AHC-SC-PC | 0.519 |
| AHC-BC-w/o-PC | 0.334 |
| AHC (ours) | 0.535 |
Analysis of Table 2: Gradient Cosine Similarity between Tasks across Different Ablation Settings.
- Impact on Gradient Conflict: Cosine similarity measures the alignment of gradients, with higher values indicating less conflict.
  - PCGrad significantly reduces gradient conflict. Comparing AHC-SC-w/o-PC (0.247) with AHC-SC-PC (0.519), adding PCGrad nearly doubles the cosine similarity when using a shared critic.
  - Similarly, comparing AHC-BC-w/o-PC (0.334) with AHC (0.535), PCGrad again substantially increases similarity with behavior-specific critics.
- Impact of Behavior-Specific Critics: The use of behavior-specific critics itself also helps alleviate conflicts. AHC-BC-w/o-PC (0.334) shows higher similarity than AHC-SC-w/o-PC (0.247), suggesting that separating value learning inherently reduces some task interference.
- Combined Effect: The AHC (ours) configuration, combining both PCGrad and behavior-specific critics, achieves the highest cosine similarity (0.535), demonstrating their synergistic effect in mitigating gradient conflicts.

As can be seen from the results in Figure 5, the paper further examines value loss curves during fine-tuning.
Figure 5: Value loss curves during the second-stage fine-tuning. Policies equipped with behavior-specific critics (AHC-BC-w/o-PC and AHC) indicate more stable value learning compared to their shared-critic counterparts (AHC-SC).
- Analysis of Figure 5 (Value Loss Curves): Policies equipped with behavior-specific critics (AHC-BC-w/o-PC and AHC) consistently achieve lower value loss and exhibit more stable value learning compared to their shared-critic counterparts (AHC-SC-w/o-PC and AHC-SC-PC). This supports the hypothesis that decoupling value learning for each task helps mitigate optimization difficulties arising from reward scale discrepancies and leads to more accurate value function estimation.

As can be seen from the results in Figure 6, the paper visualizes the training episode return curves during fine-tuning.
Figure 6: Training episode return curves during second-stage fine-tuning. With PCGrad and behavior-specific critics, AHC achieves higher and more balanced returns across tasks.
- Analysis of Figure 6 (Training Episode Return Curves):
  - Shared-Critic Variants (AHC-SC): These configurations (blue and orange curves) tend to neglect the locomotion task (lower returns) due to its potentially smaller reward magnitude compared to the recovery task. This illustrates the gradient imbalance problem.
  - AHC (Ours): The full AHC configuration (green curve), which combines PCGrad and behavior-specific critics, maintains high and more balanced returns for both tasks. This indicates superior performance in multi-task learning by effectively optimizing for both locomotion and recovery simultaneously.
  - Convergence Speed: AHC also exhibits faster convergence compared to the other settings, reaching higher returns more quickly.
- Conclusion of Ablation: These results strongly highlight the effectiveness of incorporating both PCGrad and behavior-specific critics in facilitating balanced and efficient optimization during the second-stage fine-tuning, leading to a robust and adaptive multi-behavior controller. A minimal sketch of the underlying gradient operations follows.
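To make the quantities in this ablation concrete, the following is a minimal PyTorch-style sketch of measuring the cosine similarity between two task gradients and applying a PCGrad-style projection for the two-task case; it is an independent illustration, not the authors' implementation.

```python
import torch

def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened task gradients."""
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)

def pcgrad_combine(g_loco, g_recov):
    """PCGrad-style surgery for two tasks: if the gradients conflict
    (negative inner product), project each onto the normal plane of the
    other before summing them."""
    g1, g2 = g_loco.clone(), g_recov.clone()
    if torch.dot(g_loco, g_recov) < 0:
        g1 = g1 - torch.dot(g1, g_recov) / (g_recov.norm() ** 2 + 1e-12) * g_recov
        g2 = g2 - torch.dot(g2, g_loco) / (g_loco.norm() ** 2 + 1e-12) * g_loco
    return g1 + g2

# Hypothetical flattened actor gradients for the two tasks.
g_loco = torch.randn(1000)
g_recov = torch.randn(1000)
print("cosine similarity:", cosine_similarity(g_loco, g_recov).item())
combined = pcgrad_combine(g_loco, g_recov)
```

In a multi-task update, the combined gradient would replace the naive sum of task gradients before the optimizer step, which is what removes the destructive interference measured in Table 2.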
6.4. Deployment Results
The paper also presents qualitative results from deploying the trained AHC policy on a Unitree G1 humanoid robot in real-world settings, without any additional fine-tuning.
As can be seen from the results in Figure 7, the paper shows a sequence of deployment snapshots.

Figure 7: Snapshot of real-world deployment. The robot performs recovery and locomotion in diverse scenarios, including standing up from prone and lying positions on sloped terrain and recovering after external pushes during walking.
- Recovery Evaluation: The robot successfully recovers from various fallen postures (supine and prone) on both flat ground and inclined terrain. It also manages moderate external disturbances during recovery. After recovery, it stabilizes itself and smoothly transitions into a walking-ready posture, demonstrating natural and coordinated motion.
- Locomotion Evaluation: The robot walks stably on flat ground and inclined surfaces, effectively tracking velocity commands. When external pushes are applied in random directions during walking, the robot generally withstands the perturbations and continues. If a fall does occur, it autonomously performs the recovery maneuver and resumes locomotion.
- Significance: These real-world results are crucial, as they validate the policy's ability to bridge the sim-to-real gap. They also demonstrate that the AHC policy successfully integrates recovery and locomotion behaviors in a cohesive and robust manner, showcasing strong resilience and long-horizon autonomy in dynamic environments. The visual evidence in Figure 7 confirms these capabilities.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Adaptive Humanoid Control (AHC), a novel two-stage framework for learning an adaptive humanoid locomotion controller. The first stage, Multi-Behavior Distillation, trains behavior-specific policies using Adversarial Motion Priors (AMP) and then distills this knowledge into a basic multi-behavior policy with a Mixture-of-Experts (MoE) architecture. This initial step effectively addresses the challenges of gradient conflicts inherent in multi-skill learning by creating a foundational controller capable of basic adaptive behavior switching. The second stage, Reinforced Fine-Tuning, further enhances the controller's terrain adaptability on diverse terrains by utilizing online RL, integrating behavior-specific critics, and employing gradient projection (PCGrad) to mitigate gradient conflicts in the shared actor.
Extensive experiments in both the IsaacGym simulator and on a real-world Unitree G1 humanoid robot rigorously validate the effectiveness of AHC. The results demonstrate that the learned controller enables robust locomotion across challenging terrains (slopes, hurdles, discrete obstacles) and effective recovery from various types of falls. The AMP integration leads to smoother and more natural motions, while PCGrad and behavior-specific critics ensure balanced and efficient multi-task learning. The seamless sim-to-real transfer and the ability to autonomously switch between recovery and locomotion in real-world scenarios highlight the practical utility and resilience of the proposed AHC policy.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and propose future research directions:
- Perceptual Capabilities: Currently, the AHC framework relies primarily on proprioceptive information. A significant limitation is the lack of integration with external sensors (e.g., depth cameras, LiDARs) for environmental perception. This means the robot cannot actively perceive and reason about its surroundings beyond basic contact information.
- Limited Behavior Categories: While AHC successfully integrates recovery and locomotion, the range of behaviors is still limited to these two primary skills. To be truly versatile, humanoid robots would need to master a much broader set of human-like behaviors (e.g., manipulation, navigation in cluttered spaces, more dynamic movements such as jumping over larger gaps).

Based on these limitations, the authors suggest the following future work:

- Augmenting Perceptual Capabilities: Incorporating external sensors to enable the robot to perceive and understand its environment more comprehensively. This would allow for more intelligent and adaptive navigation, obstacle avoidance, and interaction with complex surroundings.
- Expanding Behavior Categories: Extending the AHC framework to include a wider range of behaviors, aiming for even greater generalization and versatility in humanoid robot control. This could involve exploring more complex loco-manipulation tasks or advanced social interactions.
7.3. Personal Insights & Critique
This paper presents a well-structured and technically sound approach to a critical problem in humanoid robotics. The two-stage framework is a particularly insightful design choice, effectively decoupling the complex problem of learning diverse skills and adapting to varied terrains. Attempting to solve both simultaneously often leads to optimization instabilities in RL, making the distillation-then-fine-tuning strategy a pragmatic and robust solution.
Inspirations and Transferability:
- Modular Learning: The idea of learning foundational skills (behavior-specific policies) and then distilling them into a unified multi-behavior controller is highly inspiring. This modular approach could transfer to other complex multi-task learning domains beyond robotics, where a single large model struggles to learn diverse sub-tasks. For instance, in natural language processing, a similar strategy could be used to combine specialized language models into a more general-purpose agent.
- Robustness via AMP and Domain Randomization: The use of AMP to infuse human-like priors and of domain randomization for sim-to-real transfer are well-established techniques, but their effective combination here highlights their continued importance. This reinforces the idea that realistic simulations with sufficient variability are key to real-world deployment.
- Addressing Gradient Conflict: The explicit and effective application of PCGrad and behavior-specific critics to mitigate gradient conflicts is a crucial takeaway. This is a common bottleneck in multi-task deep learning, and the quantitative results (cosine similarity, value loss, returns) clearly demonstrate the benefits of these techniques. These methods could be widely applied to any multi-task learning problem facing similar gradient challenges.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Reliance on Reference Motions: While AMP is powerful, it inherently relies on the availability and quality of reference motion data. Generating or acquiring high-quality motion capture data for all desired complex behaviors (especially loco-manipulation or novel interactions) can be challenging and expensive. This could limit the scalability of AMP for truly novel or non-human-like behaviors.
- Computational Cost: Training behavior-specific policies for each skill, followed by distillation and then RL fine-tuning across diverse terrains, is computationally intensive. The use of MoE also adds to model complexity, although it helps with gradient conflicts. While the paper mentions using IsaacGym with 4096 parallel environments and NVIDIA RTX 4090 GPUs, the overall training time and resource requirements could still be substantial, potentially limiting accessibility for researchers without significant computational resources.
- Scalability of PCGrad: While PCGrad is effective for a small number of tasks (here, two), its scalability to a very large number of highly diverse tasks might introduce computational overhead, as the number of gradient projections grows quadratically with the number of tasks. Future work would need to investigate more efficient gradient surgery or gradient weighting techniques for scenarios with many behaviors.
- Definition of Disjoint State Spaces: The paper assumes disjoint state spaces for the different MDPs in the multi-task formulation. While this simplifies the problem formulation, in reality the robot's state may contain ambiguous regions that could belong to multiple behaviors (e.g., "about to fall" could be a state for both walking and recovery). The current state-based switching (a base height threshold) is simple but might be brittle in edge cases. A more sophisticated, probabilistic, or context-aware behavior arbitration mechanism could enhance robustness (see the small sketch at the end of this section).
- The "Black Box" of MoE Gating: While MoE helps, the gating network itself can be a "black box." Understanding how it allocates tasks to experts, and whether it is truly making optimal decisions for dynamic behavior switching, could be an area for further analysis.

Overall, AHC represents a robust step towards general-purpose humanoid control, effectively demonstrating how a clever combination of existing and novel RL techniques can overcome significant challenges in multi-skill and multi-terrain adaptation for complex robotic systems.
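To make the arbitration concern above concrete, here is a minimal sketch of base-height-threshold behavior switching extended with a simple hysteresis band; the thresholds, state layout, and class names are illustrative assumptions, not the paper's deployment logic.

```python
RECOVERY, LOCOMOTION = "recovery", "locomotion"

class BehaviorArbiter:
    """Toy base-height switcher with hysteresis to avoid rapid mode flipping.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    def __init__(self, fall_height=0.35, upright_height=0.60):
        self.fall_height = fall_height        # switch to recovery below this
        self.upright_height = upright_height  # switch back to locomotion above this
        self.mode = LOCOMOTION

    def update(self, base_height):
        if self.mode == LOCOMOTION and base_height < self.fall_height:
            self.mode = RECOVERY
        elif self.mode == RECOVERY and base_height > self.upright_height:
            self.mode = LOCOMOTION
        return self.mode

arbiter = BehaviorArbiter()
for h in [0.70, 0.30, 0.45, 0.65]:  # simulated base heights over time
    print(h, arbiter.update(h))
```

The hysteresis band prevents rapid oscillation between modes when the base height hovers near a single threshold, which is one simple way to make a hard switching rule less brittle.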