Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning
TL;DR Summary
This paper introduces an Adaptive Humanoid Control (AHC) framework that learns adaptive locomotion controllers through multi-behavior distillation and reinforced fine-tuning, showing strong adaptability across various skills and terrains.
Abstract
Humanoid robots are promising to learn a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC) that adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback in performing adaptive behaviors on more diverse terrains, enhancing terrain adaptability for the controller. We conduct experiments in both simulation and real-world experiments in Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains. Project website: https://ahc-humanoid.github.io.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is Adaptive Humanoid Control via a two-stage framework: Multi-Behavior Distillation and Reinforced Fine-Tuning.
1.2. Authors
The authors and their affiliations are:
- Yingnan Zhao
- Xinniao Wang
- Dewey Wang
- Xinzhe Liu
- Dan Lu
- Qilong Han
- Peng Liu
- Chenjia Bai
Affiliations:
- College of Computer Science and Technology, Harbin Engineering University
- Institute of Artificial Intelligence (TeleAI), China Telecom
- School of Information Science and Technology, University of Science and Technology of China
- School of Information Science and Technology, ShanghaiTech University
- College of Computer Science and Technology, Harbin Institute of Technology
- Shenzhen Research Institute of Northwestern Polytechnical University
- National Engineering Laboratory for Modeling and Emulation in E-Government, Harbin Engineering University
The asterisk (*) next to Dan Lu and Chenjia Bai indicates corresponding authorship. The authors represent a mix of academic institutions and a telecommunications research institute in China, suggesting a collaborative effort between academia and industry.
1.3. Journal/Conference
The paper is published as a preprint, indicated by the arXiv link. Its publication status (e.g., in a specific journal or conference proceedings) is not explicitly stated in the provided text beyond its arXiv posting date. Given the nature of the research (robotics, reinforcement learning) and the typical publication cycle, it is likely intended for a major robotics or AI conference/journal.
1.4. Publication Year
The paper was published at (UTC): 2025-11-09T13:15:20.000Z.
1.5. Abstract
Humanoid robots possess the potential to learn diverse human-like locomotion behaviors, such as standing, walking, running, and jumping. However, current methods typically train separate policies for each skill, resulting in behavior-specific controllers that lack generalization and perform poorly on irregular terrains and in varied situations. To overcome this, the paper introduces Adaptive Humanoid Control (AHC), a two-stage framework. The first stage involves training several primary locomotion policies and then applying a multi-behavior distillation process to create a basic multi-behavior controller, which enables adaptive behavior switching based on the environment. The second stage employs reinforced fine-tuning, gathering online feedback from adaptive behaviors performed on more diverse terrains to enhance the controller's terrain adaptability. Experiments conducted in simulation and on real-world Unitree G1 robots demonstrate the method's strong adaptability across different situations and terrains.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2511.06371, and the PDF is available at https://arxiv.org/pdf/2511.06371v2.pdf. The paper is currently available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitation of existing humanoid robot control methods, which primarily rely on training independent policies for each specific skill (e.g., standing up, walking, jumping). While these behavior-specific controllers excel in their narrow domains, they exhibit limited generalization capabilities in terms of behavior diversity and terrain adaptability. This results in brittle performance when deployed on irregular terrains or in complex, dynamic situations where a robot might need to switch between skills or adapt to unexpected environmental changes.
This problem is important because humanoid robots, with their human-like morphology, are envisioned to operate in human environments, requiring a versatile set of locomotion abilities. The ability to seamlessly transition between skills (e.g., recovering from a fall and then walking) and robustly navigate diverse terrains is crucial for the autonomy and practical application of these robots. A significant challenge in directly training a multi-skill policy using Reinforcement Learning (RL) is the occurrence of gradient conflicts among different reward functions, which can hinder the convergence and effectiveness of the learning process.
The paper's entry point or innovative idea is to propose a two-stage framework called Adaptive Humanoid Control (AHC). Instead of attempting to learn all behaviors and adapt to all terrains simultaneously from scratch (which is difficult due to gradient conflicts), AHC first learns a foundational multi-behavior controller and then systematically enhances its terrain adaptability. This decoupled approach, combined with specific techniques like multi-behavior distillation and reinforced fine-tuning with gradient surgery, aims to overcome the limitations of prior work.
2.2. Main Contributions / Findings
The paper's primary contributions are summarized as follows:
-
Motion-Guided Policy Learning and Supervised Distillation for a Basic Multi-Behavior Policy: The authors avoid directly training a multi-behavior RL policy across diverse terrains. Instead, they first integrate human motion priors via Adversarial Motion Prior (AMP) into independent behavior-specific policy learning to obtain basic human-like controllers. These separate controllers are then combined into a single basic multi-behavior policy through supervised distillation. This initial stage enables adaptive behavior switching and sidesteps the difficulties of direct multi-behavior RL training.
-
Sample-Efficient RL Fine-Tuning for Terrain Adaptability: The basic multi-behavior policy obtained from the first stage is further refined with sample-efficient RL fine-tuning, which collects online feedback to continuously improve the terrain adaptability of each behavior on more diverse and complex terrains. Techniques such as gradient projection (PCGrad) and behavior-specific critics mitigate gradient conflicts and keep learning efficient in this multi-task RL setting.
-
Extensive Experimental Validation in Simulation and the Real World: The learned controller is evaluated through extensive experiments in both the IsaacGym simulator and on a real-world Unitree G1 humanoid robot. The results show that the AHC controller exhibits strong adaptability, enabling the robot to handle environmental state changes (e.g., standing up after a fall and then walking) and to locomote robustly on challenging terrains (e.g., stairs and slopes).

The key findings demonstrate that the proposed AHC framework enables humanoid robots to acquire and adaptively switch between diverse skills such as standing up and walking, and to perform robustly on various complex terrains, by systematically addressing the challenges of multi-skill learning and terrain generalization through its two-stage approach and integrated learning mechanisms.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Adaptive Humanoid Control (AHC) paper, a foundational understanding of several key concepts in robotics and reinforcement learning is necessary:
-
Humanoid Robots: Robots designed to mimic the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology allows them to interact with human-centric environments but also introduces challenges in balance, stability, and control due to their complex
degrees of freedom (DoF). -
Locomotion: The act of moving from one place to another. For humanoid robots, this includes a diverse set of behaviors like walking, running, standing up, jumping, climbing, etc.
-
Reinforcement Learning (RL): A paradigm of machine learning where an
agent learns to make optimal decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, iteratively adjusting its policy to maximize cumulative reward over time.
- Agent: The entity that perceives the environment and takes actions.
- Environment: The external system with which the agent interacts.
- State ($s$): A complete description of the environment at a given time.
- Action ($a$): The output of the agent, which affects the environment.
- Reward ($r$): A scalar feedback signal from the environment, indicating the desirability of an action taken from a state.
- Policy ($\pi$): A function that maps states to actions ($a = \pi(s)$, or a distribution $\pi(a \mid s)$), defining the agent's behavior.
- Value Function ($V(s)$ or $Q(s, a)$): Predicts the expected future reward from a given state or state-action pair.
-
Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An
MDP is formally defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:
- $\mathcal{S}$ is the set of all possible states.
- $\mathcal{A}$ is the set of all possible actions.
- $P(s' \mid s, a)$ is the state transition function, representing the probability of transitioning to state $s'$ given that the agent took action $a$ in state $s$.
- $R(s, a)$ is the reward function, defining the immediate reward received after taking action $a$ in state $s$.
- $\gamma \in [0, 1)$ is the discount factor, which determines the present value of future rewards; a higher $\gamma$ makes future rewards more influential. The goal in an MDP is to find a policy that maximizes the expected cumulative discounted reward.
-
Proximal Policy Optimization (PPO): A popular
RL algorithm known for its stability and performance. PPO is an on-policy algorithm that optimizes a stochastic policy by taking small, conservative steps. It uses a clipped surrogate objective to prevent excessively large policy updates, which can destabilize training. PPO often employs an actor-critic architecture.
Actor-Critic Architecture: A common
RL framework where two neural networks work in tandem:
- Actor: The policy network, which takes the current
stateas input and outputs theaction(or probability distribution over actions). - Critic: The value network, which takes the current
stateas input and outputs an estimate of thevalue function(expected future rewards) for that state. The critic's estimate helps the actor learn by providing a baseline foradvantage estimation.
-
Generalized Advantage Estimation (GAE): A method used in
actor-critic RL algorithms (such as PPO) to estimate the advantage function more accurately. The advantage function measures how much better an action $a$ is than the average action at state $s$. GAE balances the bias-variance trade-off in advantage estimation by using an exponentially weighted average of $n$-step returns (a short sketch follows at the end of this subsection).
Policy Distillation: A technique where knowledge from one or more "teacher" policies (often larger, more complex, or trained on specific tasks) is transferred to a "student" policy (often smaller, simpler, or designed for generalization). In this paper, it's used to combine multiple
behavior-specific policies into a single multi-behavior policy. The student policy learns by trying to mimic the outputs (actions or action probabilities) of the teacher policies.
Behavioral Cloning (BC): A form of
supervised learning where a model learns a policy by observing and imitating expert demonstrations. DAgger (Dataset Aggregation) is an iterative BC algorithm that addresses the covariate shift problem (where the learned policy drifts away from the expert's state distribution). DAgger repeatedly collects data by running the current policy, queries the expert for actions in the visited states, and adds these labeled states to the training dataset.
Multi-task Learning (MTL): A machine learning paradigm where multiple tasks are learned simultaneously by a single model. The goal is often to improve the generalization ability of the model by leveraging shared representations among related tasks. A key challenge in
MTL is gradient conflict, where the gradients from different tasks can pull the shared parameters in opposing directions.
Gradient Conflict: In
MTL, gradient conflict occurs when the gradients computed for different tasks point in substantially different directions in parameter space. This can lead to slow convergence, oscillations, or even divergence, since optimizing for one task may degrade performance on another.
Gradient Surgery / Projecting Conflicting Gradients (PCGrad): A technique to mitigate
gradient conflict in MTL. PCGrad modifies task-specific gradients by projecting a conflicting gradient onto the normal plane of the other task's gradient. This removes the conflicting component while preserving the non-conflicting direction, allowing shared parameters to learn from multiple tasks more harmoniously.
Adversarial Motion Prior (AMP): A method used in
RL for physics-based character control to generate naturalistic, human-like motions. AMP leverages a discriminator network, similar to a Generative Adversarial Network (GAN). The discriminator is trained to distinguish between motion sequences from a reference motion dataset (human motion capture data) and those generated by the RL policy. The RL policy then receives a style reward based on the discriminator's output, encouraging it to produce motions that are indistinguishable from the reference data, thereby incorporating human motion priors without complex handcrafted reward functions.
Mixture-of-Experts (MoE): A neural network architecture that consists of multiple "expert" sub-networks and a "gating network." The
gating network learns to assign different weights to the outputs of the experts for a given input, effectively allowing different parts of the network to specialize in different aspects of the task. This can enhance model capacity and allow diverse behaviors to be learned.
PD Controller (Proportional-Derivative Controller): A widely used feedback control loop mechanism in robotics. It calculates an
error value as the difference between a desired setpoint and a measured process variable, and then applies a correction based on proportional (P) and derivative (D) terms of this error.
- Proportional term ($K_p$): Responds to the current error, providing a correction proportional to its magnitude.
- Derivative term ($K_d$): Responds to the rate of change of the error, helping to dampen oscillations and improve stability.
In this paper,
PD controllers are used to convert the policy's predicted actions (target joint positions) into motor torques.
-
Domain Randomization: A technique used in
sim-to-real transfer in RL. Instead of trying to perfectly match the simulation to the real world, domain randomization trains the RL policy in simulations where various physical parameters (e.g., friction, mass, sensor noise) are randomly varied. This forces the policy to become robust to variations and makes it more likely to generalize to the real world, which can be seen as just another variation within the randomized domain.
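To make the RL quantities above concrete, here is a minimal NumPy sketch (not from the paper; all values are toy inputs) of the discounted return from the MDP definition and of Generalized Advantage Estimation:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^(t-1) * r_t for t = 1..T, as in the MDP objective."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: r_1..r_T; values: critic estimates V(s_1)..V(s_{T+1}).
    Returns advantages A_1..A_T; lam trades off bias vs. variance.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy usage: a 5-step episode with a constant critic estimate.
rew = [0.1, 0.2, 0.0, 0.5, 1.0]
val = [0.3] * 6
print(discounted_return(rew), gae(rew, val))
```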
3.2. Previous Works
The paper references several prior studies that form the landscape of humanoid locomotion and multi-behavior learning in robots:
-
Humanoid Locomotion with Deep Reinforcement Learning:
Makoviychuk et al. (2021) and Todorov, Erez, and Tassa (2012) (MuJoCo): These works highlight the advances in RL for robust and complex locomotion in simulation. Isaac Gym (Makoviychuk et al. 2021) is specifically mentioned as providing high-performance GPU-based physics simulation, which facilitates large-scale RL training.
-
Radosavovic et al. (2024) and related works: These demonstrate challenging behaviors such as coordinated upper- and lower-body control, whole-body locomotion, robust traversal over complex terrains, and full-body teleoperation.
- Huang et al. (2025b) (HoST): This work is directly compared against. HoST focuses on fall recovery for humanoids, achieving it through multiple critics and a force curriculum. The current paper adapts HoST's multiple-critics strategy for its own recovery policy but enhances it with AMP.
- Long et al. (2024b), Wang et al. (2025c), and others: These address locomotion on more extreme terrains using external sensors (depth cameras, LiDARs) or attention-based network designs for terrain perception.
- Note: The primary limitation highlighted by the authors for all of these works is their focus on a
single behavior.
-
Multi-Behavior Learning in Robots:
Zhuang et al. (2023) and Zhuang, Yao, and Zhao (2024): These works use policy distillation to integrate skills from multiple expert policies, enabling diverse behaviors for navigating complex terrains (e.g., Robot Parkour Learning, Humanoid Parkour Learning).
- Hoeller et al. (2024): Explores hierarchical frameworks to select among multiple skill policies for efficient multi-skill traversal (e.g., Anymal Parkour for quadrupedal robots).
- Xue et al. (2025) (HugWBC): Leverages input signals (gait frequency, foot contact patterns) to guide a policy to exhibit different behaviors under varying commands.
- Huang et al. (2025a) (MoELoco): Adopts an MoE architecture to reduce gradient conflicts in multi-skill RL, improving training efficiency for quadrupedal locomotion.
- Wang et al. (2025b) (MoRE): Further enhances multi-skill policy performance by incorporating AMP-based rewards and external sensor inputs for humanoids.
- Note: The authors differentiate their work by integrating
highly diverse behaviors (like recovery and locomotion) into a single unified policy that autonomously switches based on the robot's state, rather than relying on explicit control signals or combining highly similar behaviors.
3.3. Technological Evolution
The field of robot locomotion control has evolved significantly:
- Early Model-Based Control (Pre-RL Era): Methods like
whole-body control (WBC)(e.g., Santis and Khatib 2006) andmodel predictive control (MPC)(e.g., Li and Nguyen 2023) relied heavily on precise models of the robot and environment. While effective for well-defined tasks, they struggled with adaptability to unknown or dynamic environments. - Emergence of Reinforcement Learning (RL): With advances in computing power and
deep learning,RL(Ernst and Louette 2024) emerged as a powerful paradigm.RLallowed robots to learn complex, emergent behaviors directly from interaction, reducing the need for explicit modeling. Pioneers in this area used simulators likeMuJoCo (Todorov et al. 2012)and laterIsaac Gym (Makovychuk et al. 2021)forlarge-scale simulation. - Single-Skill Mastery: Much of the initial
RLsuccess focused on mastering individual, complex skills such as walking (Gu et al. 2024a), running, standing up (He et al. 2023b; Huang et al. 2025b), jumping (Tan et al. 2024), or squatting (Ben et al. 2025). Techniques likeAdversarial Motion Prior (AMP)(Peng et al. 2021; Escontrela et al. 2022) were introduced to make these behaviors more human-like. - Addressing Generalization and Adaptability (Multi-Skill/Terrain): The next frontier involved making robots more versatile. This led to efforts in:
-
Multi-task RLandpolicy distillationfor combining skills (Zhuang et al. 2023; Zhuang, Yao, and Zhao 2024). -
Mixture-of-Experts (MoE)architectures to handle diverse tasks (Huang et al. 2025a; Wang et al. 2025b). -
Terrain curriculum learningto generalize across varied terrains (Rudin et al. 2022). -
Techniques to address
gradient conflicts in multi-task RL (Chen et al. 2018; Hessel et al. 2019; Yu et al. 2020).

This paper's work fits within this technological timeline by building on the successes of single-skill
RL and AMP, and then specifically tackling the challenges of integrating highly diverse behaviors (like recovery and walking) into a unified controller that can autonomously switch and adapt to complex terrains. It leverages recent advances in multi-task RL, policy distillation, MoE, and gradient surgery to achieve this.
-
3.4. Differentiation Analysis
Compared to the main methods in related work, the AHC framework's core differences and innovations are:
-
Integrated Diverse Behaviors with Autonomous Switching: Unlike most prior works that focus on a single behavior (e.g., locomotion, recovery, specific parkour skills) or combine similar behaviors (e.g., stair climbing and gap jumping),
AHC integrates highly diverse behaviors (fall recovery and robust walking) into a single unified policy. Crucially, it enables the robot to autonomously switch between these behaviors based on its current state (e.g., detecting a fall, initiating recovery, and then transitioning back to walking). This holistic approach to multi-skill control is a significant differentiator.
Two-Stage Training Framework:
- Behavior Distillation for Foundation: Instead of attempting to learn all complex, adaptive behaviors directly through
online RL from scratch, AHC first creates a basic multi-behavior controller via motion-guided policy learning and supervised distillation. This addresses the inherent difficulty of gradient conflicts and poor convergence that often plague direct multi-skill RL training, effectively bootstrapping the learning process with foundational skills.
- Reinforced Fine-Tuning for Adaptability: Only after a basic
multi-behavior policy is established does the framework proceed to reinforced fine-tuning on diverse terrains. This staged approach is designed to be sample-efficient and to build complexity incrementally.
-
Specific Technical Innovations for Multi-Task Learning:
-
Mixture-of-Experts (MoE) Architecture: Employing MoE for the multi-behavior policy helps manage the complexity of integrating diverse skills by allowing different experts to specialize. This also inherently alleviates some gradient conflicts compared to a monolithic network.
Adversarial Motion Prior (AMP) Integration: AMP is used consistently across both stages (when training the behavior-specific policies and during fine-tuning) to ensure that the learned behaviors are human-like and naturalistic, providing a stable reward signal that is hard to achieve with handcrafted functions. It is integrated into both recovery and locomotion.
PCGrad for Gradient Conflict Mitigation: Applying Projecting Conflicting Gradients (PCGrad) during the RL fine-tuning stage directly addresses the core multi-task learning challenge of gradient conflict in the shared actor, allowing balanced and efficient learning across different behaviors and terrains.
Behavior-Specific Critics: Decoupling value function learning by using separate critics for each task (recovery and walking) prevents tasks with larger reward scales from dominating the gradient updates, leading to more accurate value estimation and stable training.

In essence,
AHC differentiates itself by providing a structured, robust, and effective pathway to integrated, adaptive humanoid control, specifically designed to handle the complexities of diverse behaviors and challenging terrains in a unified manner, leveraging state-of-the-art RL techniques.
-
4. Methodology
The Adaptive Humanoid Control (AHC) framework is a two-stage approach designed to learn an adaptive humanoid locomotion controller capable of performing various skills across different terrains. An overview of this two-stage framework is illustrated in Figure 2.
4.1. Preliminaries and Problem Definitions
4.1.1. Behavior-Specific Control as an MDP
In the first stage of AHC, each behavior-specific humanoid control problem (e.g., recovery or walking) is formulated as a Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
-
$\mathcal{S}$: The state space, which encompasses all possible configurations of the robot and its environment.
-
$\mathcal{A}$: The action space, representing all possible control commands the robot can execute.
-
$P$: The state transition function, which describes the probability distribution over the next state given the current state and action.
-
$R$: The reward function, which provides scalar feedback to the agent based on its actions and resulting states.
-
$\gamma$: The reward discount factor, which weighs the importance of immediate rewards versus future rewards.

During training, a behavior-specific policy learns to map a state $s_t$ to an action $\pmb{a}_t$ so as to maximize the discounted return, i.e., the expected sum of future rewards: $ \mathbb{E}\left[\sum_{t = 1}^{T}\gamma^{t - 1}R(s_t,\pmb {a}_t)\right] $
4.1.2. Adaptive Humanoid Control as Multi-Task RL
The adaptive humanoid control problem is formulated as a multi-task RL problem. Each task (i.e., each behavior) is considered an MDP $\mathcal{M}_i$, for tasks $i = 1, \dots, N$. Since all tasks are performed in a unified environment and the controller must adapt its behavior based on the current state, the state space is the union of the individual task state spaces ($\mathcal{S} = \bigcup_{i=1}^{N}\mathcal{S}_i$), under the assumption that the task state spaces are disjoint ($\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$ for $i \neq j$). This implies that each MDP differs primarily in its reward function and in the part of the state space relevant to that behavior.
The objective for the behavior-adaptive policy is to optimize the sum of expected discounted returns across all tasks:
$
\sum_{i = 1}^{N}\mathbb{E}_{P,\pi_i}\left[\sum_{t = 1}^{T}\gamma^{t - 1}R_i(s_t^i,a_t)\right], \quad s_t^i\in \mathcal{S}_i. \quad (1)
$
Here, $P$ denotes the state transition dynamics (the distribution over states), and $\pi_i$ is the policy for task $i$.
4.1.3. Inputs to Policies and Action Generation
Behavior-specific policies (teacher policies) take two types of input: privileged information and robot proprioception $s_t^{\mathrm{prop}}$. Privileged information includes data not directly available to the real robot (e.g., ground friction, motor controller gains, base mass, center-of-mass shift) but useful for efficient learning in simulation.
In contrast, the basic multi-behavior policy (the student policy obtained after distillation) and the RL fine-tuned policy are designed to operate using only the robot proprioception $s_t^{\mathrm{prop}}$, as this is the information accessible on a real robot.
The robot proprioception $s_t^{\mathrm{prop}}$ is a 69-dimensional vector comprising:
$
s_{t}^{\mathrm{prop}} = [\bar{\omega}_{t},\,\pmb{g}_{t},\,c_{t},\,\pmb{q}_{t},\,\dot{\pmb{q}}_{t},\,\pmb{a}_{t - 1}]\in \mathbb{R}^{69}, \quad (2)
$
where:
-
$\bar{\omega}_{t}$: The base angular velocity (the rate of rotation of the robot's main body).
-
$\pmb{g}_{t}$: The gravity vector expressed in the robot's base frame, indicating its orientation relative to gravity.
-
$c_{t}$: The velocity command, typically comprising desired linear velocities along the $x$- and $y$-axes and an angular velocity around the $z$-axis for locomotion.
-
$\pmb{q}_{t}$: The joint positions (angles) of the robot's 20 degrees of freedom.
-
$\dot{\pmb{q}}_{t}$: The joint velocities (rates of change of the joint angles).
-
$\pmb{a}_{t-1}$: The last action taken by the policy, providing a short-term memory of the control output.

The policy outputs an action $\pmb{a}_t$, which is converted into a PD controller target $\bar{\pmb{q}}_t^{\mathrm{target}}$. This target is used to compute the motor torques $T_t$ via a PD controller (a short code sketch follows at the end of this subsection):
$
\bar{\pmb{q}}_{t}^{\mathrm{target}} = \pmb{q}^{\mathrm{default}} + \alpha\, \pmb{a}_{t}, \qquad T_{t} = K_{p}\cdot (\bar{\pmb{q}}_{t}^{\mathrm{target}} - \pmb{q}_{t}) - K_{d}\cdot \dot{\pmb{q}}_{t}, \quad (3)
$
where:
-
$\pmb{q}^{\mathrm{default}}$: The default joint positions.
-
$\alpha$: A scalar used to bound the action range.
-
$K_p$: The stiffness coefficient (proportional gain) of the PD controller.
-
$K_d$: The damping coefficient (derivative gain) of the PD controller.
-
$\bar{\pmb{q}}_t^{\mathrm{target}}$: The target joint positions for the motors.
-
$\dot{\pmb{q}}_t$: The current joint velocities.

The overall methodology is divided into two main stages: Multi-Behavior Distillation and RL Fine-Tuning, which are detailed below.
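As a concrete illustration of the PD conversion in Eq. (3), below is a minimal NumPy sketch (not the authors' code). The gain values loosely mirror Table 5 in Appendix B, while the action-scaling value `alpha` is an assumption for illustration only.

```python
import numpy as np

def pd_torque(action, q, q_dot, q_default, kp, kd, alpha=0.25):
    """Convert a policy action into joint torques via a PD controller (Eq. 3).

    action    : policy output per joint (roughly in [-1, 1])
    q, q_dot  : measured joint positions and velocities
    q_default : default joint positions
    kp, kd    : stiffness and damping gains per joint
    alpha     : scalar bounding the action range (illustrative value)
    """
    q_target = q_default + alpha * action      # PD target position
    return kp * (q_target - q) - kd * q_dot    # motor torques

# Toy example with three joints (gains follow the hip/knee/ankle rows of Table 5).
kp = np.array([150.0, 200.0, 40.0])
kd = np.array([4.0, 6.0, 2.0])
tau = pd_torque(action=np.array([0.1, -0.2, 0.05]),
                q=np.array([0.02, -0.1, 0.0]),
                q_dot=np.zeros(3),
                q_default=np.zeros(3), kp=kp, kd=kd)
print(tau)
```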
4.2. Multi-Behavior Distillation
The first stage focuses on learning a basic multi-behavior controller by distilling knowledge from independently trained behavior-specific policies. This approach addresses the challenges of gradient conflict and gradient imbalance that arise when attempting to train diverse behaviors directly via online RL from scratch. The authors use Proximal Policy Optimization (PPO) to train the behavior-specific policies.
4.2.1. Falling Recovery Behavior Policy ($\pi_r^b$)
This policy is trained to enable the humanoid robot to recover robustly from various fallen postures.
- Methodology: Inspired by HoST (Huang et al. 2025b), the policy uses multiple critics (Mysore et al. 2022). In the surrogate loss for the policy gradient, the advantage is estimated with a weighted formulation: $ \hat{A} = \sum_{i = 0}^{n}\omega_{i}\cdot (\hat{A}_{i} -\mu_{\hat{A}_{i}}) / \sigma_{\hat{A}_{i}} $, where $\omega_i$ is a weighting coefficient, and $\mu_{\hat{A}_i}$ and $\sigma_{\hat{A}_i}$ are the batch mean and standard deviation of the advantage from reward group $i$, respectively. This multi-critic approach lets different critics specialize in different aspects of the recovery task (a small sketch of the weighting follows this list).
- Training Setup: The robot is initialized in supine (lying on its back) or prone (lying on its stomach) positions, with additional joint initialization noise to promote robust recovery from diverse abnormal postures.
- Adversarial Motion Prior (AMP) Integration: To mitigate interference from different initial postures and encourage natural standing-up motions, an AMP-based reward function is introduced. A discriminator judges whether an episode's motion is human-like (from the reference motion) or robot-generated, and its output guides the robot to recover smoothly; as a result, the policy learns behaviors such as using the arms to support the body against the ground while standing up. The detailed AMP formulation is in Appendix A.
- Focus: The policy specifically focuses on recovery on flat terrains.
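The weighted multi-critic advantage referenced above can be sketched as follows. This is an illustrative NumPy snippet, not the paper's implementation; the group weights are assumed values.

```python
import numpy as np

def combined_advantage(advantages, weights, eps=1e-8):
    """Combine per-critic advantages via per-group normalization and weighting.

    advantages: list of arrays, one advantage batch per reward group / critic
    weights   : weighting coefficients omega_i (assumed for illustration)
    """
    total = np.zeros_like(advantages[0])
    for adv, w in zip(advantages, weights):
        total += w * (adv - adv.mean()) / (adv.std() + eps)  # normalize each group
    return total

# Toy example: three reward groups with very different scales.
groups = [np.random.randn(8) * s for s in (1.0, 10.0, 0.1)]
print(combined_advantage(groups, weights=[1.0, 0.5, 0.5]))
```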
4.2.2. Walking Behavior Policy ($\pi_w^b$)
This policy is trained to enable the humanoid robot to perform fundamental locomotion.
- Methodology: A simple framework with specific reward functions is used.
- Training Setup: The policy learns to walk on flat terrain in response to a velocity command $c_t$.
- Adversarial Motion Prior (AMP) Integration: Similar to $\pi_r^b$, an AMP-based reward function is used to encourage human-like walking and accelerate convergence.
- Focus: While initially trained only for walking on flat terrain, the paper notes that after distillation and RL fine-tuning this behavior improves significantly in robustness and terrain adaptability.
4.2.3. Behavior Distillation
After training the behavior-specific policies $\pi_r^b$ and $\pi_w^b$, a behavior distillation process is performed using DAgger (Dataset Aggregation) to integrate their knowledge into a single Mixture-of-Experts (MoE)-based multi-behavior policy $\pi^d$.
- Purpose: This process eliminates the gradient conflicts that would arise from directly combining distinct reward landscapes in RL. The MoE module allows the policy to automatically assign different experts to distinct behaviors, and this prior knowledge is leveraged for efficient multi-behavior improvement and multi-terrain adaptability in the subsequent RL fine-tuning stage.
- DAgger Process: During training, the robot is initialized in either a fallen or a standing posture. The student policy is supervised by the appropriate teacher policy ($\pi_r^b$ or $\pi_w^b$) according to the behavior it should perform.
- Loss Function: The loss function for $\pi^d$ is computed as:
$
\mathcal{L}_{\pi^{d}}(s_{t}) = \begin{cases} \mathbb{E}_{s_{t},\pi^{d},\pi_{r}^{b}}\left[\left\lVert a_{t}^{\pi^{d}} - a_{t}^{\pi_{r}^{b}}\right\rVert_{2}^{2}\right], & s_{t}\in \mathcal{S}_{r}\\[4pt] \mathbb{E}_{s_{t},\pi^{d},\pi_{w}^{b}}\left[\left\lVert a_{t}^{\pi^{d}} - a_{t}^{\pi_{w}^{b}}\right\rVert_{2}^{2}\right], & s_{t}\in \mathcal{S}_{w} \end{cases} \quad (4)
$
where:
-
$a_t^{\pi^d}$: Actions sampled from the student policy $\pi^d$.
-
$a_t^{\pi_r^b}$: Actions sampled from the recovery teacher policy $\pi_r^b$.
-
$a_t^{\pi_w^b}$: Actions sampled from the walking teacher policy $\pi_w^b$.
-
$\mathcal{S}_r$: The state space corresponding to standing-up (recovery) behavior.
-
$\mathcal{S}_w$: The state space corresponding to walking behavior.

The loss minimizes the squared L2-norm difference between the actions of the student policy and those of the corresponding teacher policy.
- Inputs and Benefits: The distillation process uses the same domain randomization as the behavior-specific policy training and takes only proprioception as input (a code sketch of the distillation loop follows this subsection). This process not only integrates the basic behaviors into a single policy but also enhances them individually (e.g., walking becomes more robust because the policy learns to recover from near-falls, and natural standing poses facilitate smooth transitions into walking).
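Below is a minimal, hypothetical PyTorch sketch of the DAgger-style distillation loop described above (not the authors' implementation). The teacher is selected from the robot state, and the student is regressed onto the teacher's action as in Eq. (4); the base-height threshold follows Appendix B, while the network stand-ins are toy placeholders.

```python
import torch

def select_teacher(base_height, recovery_teacher, walking_teacher, threshold=0.5):
    """Pick the supervising teacher from the robot state
    (base height > 0.5 m => walking, else recovery, as in Appendix B)."""
    return walking_teacher if base_height > threshold else recovery_teacher

def distillation_step(student, teachers, obs, base_heights, optimizer):
    """One DAgger-style update: the student acts, the teacher labels the visited states."""
    losses = []
    for o, h in zip(obs, base_heights):
        teacher = select_teacher(h, *teachers)
        with torch.no_grad():
            a_teacher = teacher(o)                            # expert action
        a_student = student(o)                                # student action on same state
        losses.append(((a_student - a_teacher) ** 2).sum())   # squared L2, as in Eq. (4)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with linear stand-ins for the three policies (69-D obs, 20-D action).
obs_dim, act_dim = 69, 20
student = torch.nn.Linear(obs_dim, act_dim)
teachers = (torch.nn.Linear(obs_dim, act_dim), torch.nn.Linear(obs_dim, act_dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
batch = [torch.randn(obs_dim) for _ in range(4)]
heights = [0.7, 0.3, 0.8, 0.2]
print(distillation_step(student, teachers, batch, heights, opt))
```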
4.3. RL Fine-Tuning
In the second stage, the distilled multi-behavior policy is further fine-tuned to enhance its terrain adaptability for both the fall recovery and walking tasks on complex terrains; the result is the final fine-tuned AHC policy.
- Initialization: The policy is initialized with the parameters of the distilled policy $\pi^d$ from the first stage.
- Reward: AMP-based rewards (using the same reference motions) are again employed to maintain human-likeness.
- Efficiency: Leveraging the MoE module and the prior knowledge in $\pi^d$, gradient conflicts are alleviated, enabling efficient learning of adaptive behaviors.
- Training Setup: The policy is fine-tuned with PPO on two GPUs, each GPU handling one task (recovery or walking) under its own environment setup. The policies for the different tasks share the same set of parameters for the actor network.
4.3.1. Behavior-Specific Critics and Shared Actor
- Challenge: In PPO, while the policy gradient uses normalized advantages for stability, the value loss relies on unnormalized return targets. If tasks have vastly different reward scales, the value loss of the task with larger rewards can dominate the gradient updates, hindering learning for the other task.
- Solution: The AHC framework uses behavior-specific critics with a shared actor during fine-tuning.
  - Behavior-Specific Critics: Each task (recovery or walking) is assigned a separate critic. This isolates value function learning for each task's specific reward function, allowing more accurate value estimation and even customized critic architectures (e.g., multiple critics for the standing-up behavior, as used in $\pi_r^b$).
  - Shared Actor: The actor network, which outputs the actions, shares its parameters across all tasks. It is updated using policy gradients aggregated from all tasks, facilitating skill transfer and terrain adaptability.
4.3.2. Eliminating Gradient Conflict in Multi-Task Learning
Even with behavior-specific critics, the shared actor can still receive potentially conflicting gradients from the different tasks.
- Solution: Projecting Conflicting Gradients (PCGrad) (Yu et al. 2020) is applied to resolve these conflicts.
- Mechanism: For any pair of task gradients ($\mathbf{g}_i$ and $\mathbf{g}_j$), if they conflict (i.e., their cosine similarity is negative, meaning they point in opposing directions), the gradient of one task is projected onto the normal plane of the other. This removes the conflicting directional component while preserving progress in the non-conflicting subspace.
- Projection Formula: Given two task gradients $\mathbf{g}_i$ and $\mathbf{g}_j$, if they conflict, the projected gradient for task $i$ is computed as:
$
\mathbf{g}_i = \mathbf{g}_i - \frac{\mathbf{g}_i\cdot\mathbf{g}_j}{\lVert\mathbf{g}_j\rVert^2}\,\mathbf{g}_j \quad (5)
$
where:
  - $\mathbf{g}_i\cdot\mathbf{g}_j$: The dot product of the two gradients.
  - $\lVert\mathbf{g}_j\rVert^2$: The squared Euclidean norm (magnitude) of gradient $\mathbf{g}_j$. The term $\frac{\mathbf{g}_i\cdot\mathbf{g}_j}{\lVert\mathbf{g}_j\rVert^2}\mathbf{g}_j$ is the component of $\mathbf{g}_i$ parallel to $\mathbf{g}_j$; subtracting it projects $\mathbf{g}_i$ onto the plane orthogonal to $\mathbf{g}_j$.
- Implementation: PCGrad is integrated before the actor update step. Each task computes its local actor gradient on its dedicated GPU. All gradients are then sent to a main process, where PCGrad performs the gradient surgery. After the optimizer step with these conflict-free gradients, the updated parameters are broadcast back to all workers. This enables efficient multi-task RL learning without gradient conflicts.
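A minimal sketch of the PCGrad projection for the two-task case described above (illustrative PyTorch code, not the authors' implementation); with only two tasks, the projection order is chosen at random each step, as noted in Appendix B.

```python
import random
import torch

def pcgrad_two_tasks(g_walk, g_recover):
    """Project one task gradient onto the normal plane of the other if they conflict.

    g_walk, g_recover: flattened actor gradients for the two tasks.
    Returns the summed, conflict-free gradient for the shared actor.
    """
    g_i, g_j = g_walk.clone(), g_recover.clone()
    if random.random() < 0.5:            # random projection order in the two-task case
        g_i, g_j = g_j, g_i
    dot = torch.dot(g_i, g_j)
    if dot < 0:                          # negative cosine similarity => conflict
        g_i = g_i - dot / (g_j.norm() ** 2 + 1e-12) * g_j   # Eq. (5)
    return g_i + g_j

# Toy example with conflicting 3-D gradients.
g1 = torch.tensor([1.0, 0.0, 0.0])
g2 = torch.tensor([-0.5, 1.0, 0.0])
print(pcgrad_two_tasks(g1, g2))
```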
4.3.3. Terrain Curriculum
Following previous work (Rudin et al. 2022), a terrain curriculum is adopted to improve learning efficiency and adaptability on diverse terrains.
- Mechanism: An automatic level-adjustment mechanism dynamically modulates terrain difficulty based on task-specific performance.
- Terrain Types: The curriculum includes four terrain types for both tasks:
  - Flat: Basic, level ground.
  - Slope: Inclined surfaces with a capped maximum inclination.
  - Hurdle: Terrains with regularly spaced vertical obstacles (maximum height 0.1 m).
  - Discrete: Terrains composed of randomly placed rectangular blocks (width/length 0.5 m-2.0 m, heights 0.03 m-0.15 m, 20 obstacles).
- Setup: The terrain map is arranged into a grid of patches, with 10 difficulty levels and 3 columns per terrain type, allowing systematic progression in difficulty.
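An illustrative sketch of an automatic level-adjustment rule is shown below. The paper follows Rudin et al. (2022) but does not spell out the exact rule here, so the promotion/demotion thresholds are assumptions.

```python
def update_terrain_level(level, traversed_frac, max_level=9,
                         promote_at=0.8, demote_at=0.4):
    """Move an environment up or down the 10-level terrain curriculum based on
    performance (fraction of the terrain patch traversed). Thresholds are assumed."""
    if traversed_frac > promote_at:
        return min(level + 1, max_level)
    if traversed_frac < demote_at:
        return max(level - 1, 0)
    return level

# Example: a robot that crossed 85% of its patch is promoted to the next level.
print(update_terrain_level(level=3, traversed_frac=0.85))  # -> 4
```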
Appendix A. AMP Reward Formulation and Discriminator Objective
The Adversarial Motion Prior (AMP) mechanism (Peng et al. 2021; Escontrela et al. 2022) is adopted to provide a style reward that encourages natural and human-like behaviors.
A.1. AMP Input State and Temporal Context
AMP input state ($s_t^{\mathrm{amp}}$): Extracted by taking the 20 joint positions from the full observation.
AHCframework provides thediscriminatorwith atemporal contextby feeding a 5-step window ofAMP states. The input sequence for the discriminator is defined as: $ \tau_{t}=(s_{t- 3}^{\mathrm{amp}},s_{t- 2}^{\mathrm{amp}},s_{t- 1}^{\mathrm{amp}},s_{t}^{\mathrm{amp}},s_{t+ 1}^{\mathrm{amp}}). $ This sequence provides a richer understanding of the motion trajectory to the discriminator.
A.2. Discriminator Objective
The AMP consists of a discriminator that distinguishes between reference motion sequences (from a human motion dataset ) and on-policy rollouts (from the robot policy ).
- Training: The discriminator is trained to assign
higher scorestoreference sequencesandlower scorestopolicy-generated ones. - Objective Function: Its objective is formulated as a
least-squares GANloss with agradient penalty: $ \arg \max_{\phi}\mathbb{E}{\tau \sim \mathcal{M}}[(D{\phi}(\tau) - 1)^{2}] + \mathbb{E}{\tau \sim \mathcal{P}}[(D{\phi}(\tau) + 1)^{2}] +\frac{\alpha^{d}}{2}\mathbb{E}{\tau \sim \mathcal{M}}[||\nabla{\phi}D_{\phi}(\tau)||_{2}],\qquad (6) $ where:- : The parameters of the
discriminatornetwork. - : The loss term encouraging the discriminator to output a score of 1 for real motion sequences from the reference dataset .
- : The loss term encouraging the discriminator to output a score of -1 for fake motion sequences generated by the policy .
- : A
gradient penaltyterm that helps mitigate training instability by penalizing large gradients of the discriminator's output with respect to its input. - : A manually specified coefficient controlling the strength of the
gradient penalty. - : The scalar score predicted by the discriminator for the state sequence .
- : The parameters of the
A.3. Style Reward Function
The discriminator output $d$ is used to define a smooth surrogate reward function $r^{\mathrm{style}}$:
$
r^{\mathrm{style}}(s_{t}) = \alpha \cdot \max \left(0,1 - \frac{1}{4} (d - 1)^{2}\right), \quad (7)
$
where $\alpha$ is a scaling factor. This reward encourages the policy to produce locomotion that closely resembles the motions in the reference dataset.
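A minimal sketch of the style reward in Eq. (7) (illustrative code; the value of the scaling factor is an assumption):

```python
def style_reward(d, alpha=1.0):
    """AMP style reward of Eq. (7): maps the discriminator score d
    (trained towards +1 on reference motion, -1 on policy motion)
    into a bounded, smooth reward in [0, alpha]."""
    return alpha * max(0.0, 1.0 - 0.25 * (d - 1.0) ** 2)

# A score near +1 ("looks like reference motion") yields ~alpha; a score of -1 yields 0.
print(style_reward(0.9), style_reward(-1.0))
```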
A.4. Total Reward
The total reward used for policy optimization is the sum of the task reward $r_t^{\mathrm{task}}$ (designed for the specific objective, e.g., reaching a target velocity or standing up) and the style reward $r_t^{\mathrm{style}}$:
$
r_{t} = r_{t}^{\mathrm{task}} + r_{t}^{\mathrm{style}}. \quad (8)
$
In the AHC setup, each task (locomotion and recovery) is associated with an independent discriminator and its corresponding reference motion data.
Appendix B. Training Details
B.1. Multi-Behaviors Distillation Policy
-
Framework: A
teacher-student framework is used. Two behavior-specific policies ($\pi_r^b$ and $\pi_w^b$) are trained as teacher policies using PPO. These teachers have access to privileged information (simulation-only data) for efficient learning. Their learned skills are then distilled into a basic multi-behavior student policy ($\pi^d$) that operates on proprioception only.
Network Architecture:
- Actor-Critic: Each
behavior-specific policyuses anactor-critic architecture. - History Encoder: A
history encoder(shared between actor and critic) processes 10 steps of history observation using a 3-layerMLP(Multi-Layer Perceptron) with hidden dimensions[1024, 512, 128], outputting a 64-dimensionallatent embedding. - Actor Network: The
latent embeddingis concatenated with thecurrent observationand fed to theactor. The actor is a 3-layerMLPwith hidden sizes[512, 256, 128], outputtingmean actionswith alearnable diagonal Gaussian standard deviation. - Critic Network: The
latent embeddingandcurrent observationare also fed to thecritic. The critic consists ofindependent networks, each a 3-layerMLPwith hidden dimensions[512, 256]. Each critic corresponds to a specificreward group(as used in the multi-critic setup for recovery).
- Actor-Critic: Each
-
PPO Parameters: Standard
PPOformulation withclipped surrogate lossandGAE.Adam optimizerwith learning rate .- Rollouts: 32 environment steps per
PPO iteration. - Learning epochs: 5 epochs per
PPO iteration. - Minibatches: 4 minibatches per epoch.
Discount factor.GAE lambda.Clipping ratio: 0.2.Value loss coefficient: 1.0.
-
Reward Definitions: Detailed in Table 3. Recovery rewards are grouped into four categories, following
Huang et al. (2025b). -
MoE Actor for Distillation: The
student policyuses anMoE architecturefor itsactor networkto increase capacity.- Experts: Comprises 2
experts, each anMLPwith the same hidden dimensions as the behavior-specific policy's actor. - Gate Network: An
MLPwith hidden dimensions[512, 256, 128]that determines the mixing weights for the experts' output actions.
- Experts: Comprises 2
-
Distillation Loss: The total
distillation lossis a weighted sum of two components: $ \begin{array}{rl} & {\mathcal{L}{\mathrm{distill}} = \lambda{\mathrm{MSE}}\cdot \mathbb{E}{a^{d\sim \pi^{d}},a^{b\sim \pi^{b}}}\left[\left|\left|a^{d} - a^{b}\right|\right|^{2}\right]}\ & {\qquad +\lambda{\mathrm{KL}}\cdot \mathbb{E}\left[\mathrm{KL}\left(\pi^{d}\Vert \pi^{b}\right)\right]} \end{array} \quad (9) $ where:- : Weight for the
Mean Squared Error (MSE)loss, which penalizes the difference between thestudent policy's action() and theteacher policy's action(). - : Weight for the
Kullback-Leibler (KL) divergenceloss, which encourages thestudent policy's action distributionto be similar to theteacher policy's action distribution.
- : Weight for the
-
Distillation Procedure (Algorithm 1):
Algorithm 1: Behavior Cloning via Multi-Expert Distillation
Require: behavior-specific policies, multi-behavior policy, number of environments, rollout length, number of update epochs, minibatch size
1: Initialize storage
2: for each iteration do
3:   Collect rollouts in parallel environments:
4:   for t = 1 to rollout length do
5:     Observe the current state
6:     Select the behavior (teacher) policy based on the state
7:     Query the expert action
8:     Query the student action
9:     Store the transition in storage
10:  end for
11:  for epoch = 1 to number of update epochs do
12:    Sample minibatches from storage
13:    Compute the loss (Eq. (9))
14:    Update the student policy via gradient descent on the loss
15:  end for
16:  Clear storage
17: end for
- Parameters: the number of parallel environments, the rollout length, the number of update epochs, and the minibatch size are as specified for Algorithm 1.
- Behavior Selection: is selected based on the robot's base height: if base height
> 0.5m, walking policy is used; otherwise, recovery policy is used. - Student Learning Rate: .
B.2. RL Fine-tuned Policy
- Initialization: The
multi-behaviors policyis used to initialize thesecond-stage RL fine-tuning. The network architecture remains unchanged. - Learning Rate: The policy learning rate is reduced to (from in distillation) to prevent
policy collapseandcatastrophic forgettingof previously acquired skills during fine-tuning on diverse terrains. - Gradient Surgery (
PCGrad):- After computing task-specific
policy gradients( for walking, for recovery),PCGradis applied. - If
gradient conflictis detected (), one gradient is projected onto thenormal planeof the other. Stochastic Projection: Since there are only two tasks, the projection direction (locomotion onto recovery, or vice versa) israndomly selectedat each update step to avoid bias.
- After computing task-specific
- Domain Randomization and PD Gains: Both training stages use the
same domain randomization settings(Table 4) andjoint PD gains(Table 5).-
Domain randomization parametersare resampled at the beginning of each episode to enhance robustness to varying environmental and physical conditions and prevent overfitting.The following table details the reward functions used in both training stages, specifically for the walking and full recovery tasks.
-
The following are the results from Table 3 of the original paper:
| Term | Scale |
|---|---|
| Walking reward | |
| Track lin. vel. | 2.0 |
| Track ang. vel. | 2.0 |
| Joint acc. | -5e-7 |
| Joint vel. | -1e-3 |
| Action rate | -0.03 |
| Action smoothness | -0.05 |
| Angular vel. (x, y) | -0.05 |
| Orientation | -2.0 |
| Joint power | -2.5e-5 |
| Feet clearance | -0.25 |
| Feet stumble | -1.0 |
| Torques | -1e-5 |
| Arm joint deviations | -0.5 |
| Hip joint deviations | -0.5 |
| Hip joint deviations | -1.0 |
| Joint pos. limits | -2.0 |
| Joint vel. limits | -1.0 |
| Torque limits | -1.0 |
| Feet slippage | -0.25 |
| Collision | -15.0 |
| Feet air time | 1.0 |
| Stuck | -1.0 |
| Full Recovery reward | |
| Task Reward: Orientation | 1.0 |
| Task Reward: Head height | 1.0 |
| Style Reward: Hip joint deviation | -10.0 |
| Style Reward: Shoulder roll deviation | -0.2 |
| Style Reward: Thigh orientation | 10.0 |
| Style Reward: Feet distance | -10.0 |
| Style Reward: Angular vel. (x, y) | 25.0 |
| Style Reward: Foot displacement | |
| Regularization Reward: Joint acc. | |
| Regularization Reward: Joint vel. | |
| Regularization Reward: Action rate | -0.05 |
| Regularization Reward: Action smoothness | -0.05 |
| Regularization Reward: Torques | |
| Regularization Reward: Target rotation | |
| Regularization Reward: Joint pos. limits | -2.0 |
| Regularization Reward: Joint vel. limits | -1.0 |
| Post-Task Reward: Angular vel. (x, y) | 10.0 |
| Post-Task Reward: Base lin. vel. (x, y) | 10.0 |
| Post-Task Reward: Orientation | 10.0 |
| Post-Task Reward: Base height | 10.0 |
| Post-Task Reward: Target joint deviations | 10.0 |
The table lists the reward terms and their scales (weights) used for the Walking and Full Recovery tasks; the recovery terms are grouped into Task, Style, Regularization, and Post-Task rewards. Several recovery terms use a Gaussian-style formulation (the fou function), as detailed in Huang et al. (2025b). Together, these terms encourage desired behaviors (e.g., tracking the commanded velocity, maintaining orientation, proper joint movement) and penalize undesirable ones (e.g., high joint acceleration and velocity, large torques, collisions).
The following are the results from Table 4 of the original paper:
| Term | Randomization Range | Unit |
|---|---|---|
| Restitution | [0,1] | - |
| Friction coefficient | [0.1,1] | - |
| Base CoM offset | [-0.03, 0.03] | m |
| Mass payload | [-2, 5] | Kg |
| Link mass | [0.8, 1.2]× default value | Kg |
| Kp Gains | [0.8, 1.25] | Nm/rad |
| Kd Gains | [0.8, 1.25] | Nms/rad |
| Actuation offset | [-0.05, 0.05] | Nm |
| Motor strength | [0.8, 1.2]× motor torque | Nm |
| Actions delay | [0,100] | ms |
| Initial joint angle scale | [0.85,1.15]× default value | rad |
| Initial joint angle offset | [-0.1,0.1] | rad |
Table 4 lists the domain randomization settings and their respective ranges used during both training stages. These parameters are randomly varied in simulation to enhance the policy's robustness and transferability to the real world. Examples include varying physical properties like restitution (bounciness), friction coefficient, base center of mass (CoM) offset, mass payload, and link mass. Motor control parameters such as Kp and Kd gains, actuation offset, and motor strength are also randomized. Finally, actions delay and initial joint state parameters (angle scale and offset) are randomized to account for real-world uncertainties.
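As a small illustration of how such ranges could be resampled at the start of each episode, here is a hypothetical sketch (parameter names are illustrative and only a subset of Table 4 is shown):

```python
import numpy as np

# Subset of the Table 4 ranges, uniformly sampled per episode.
RANDOMIZATION_RANGES = {
    "friction_coefficient": (0.1, 1.0),
    "base_com_offset_m": (-0.03, 0.03),
    "mass_payload_kg": (-2.0, 5.0),
    "kp_gain_scale": (0.8, 1.25),
    "kd_gain_scale": (0.8, 1.25),
    "action_delay_ms": (0.0, 100.0),
}

def sample_domain_randomization(rng=None):
    """Draw one set of physical parameters for a new training episode."""
    if rng is None:
        rng = np.random.default_rng()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

print(sample_domain_randomization())
```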
The following are the results from Table 5 of the original paper:
| Joint | Kp | Kd |
|---|---|---|
| Hip | 150 | 4 |
| Knee | 200 | 6 |
| Ankle | 40 | 2 |
| Shoulder | 40 | 4 |
| Elbow | 100 | 4 |
Table 5 specifies the PD gains ($K_p$ for stiffness, $K_d$ for damping) used for each major joint group of the humanoid robot during both training and real-world deployment. These values govern how the policy's target joint positions are converted into motor torques.
5. Experimental Setup
The experiments for Adaptive Humanoid Control (AHC) were conducted in a two-part manner: first in a simulation environment for training and initial evaluation, and then on a real-world humanoid robot for validation.
5.1. Datasets
The paper does not use traditional "datasets" in the supervised learning sense for direct policy training. Instead, it relies on:
- Simulation Environments: The
IsaacGym simulator was used for all training and simulation-based evaluations. IsaacGym provides high-performance GPU-based physics simulation, enabling massively parallel deep reinforcement learning with 4096 parallel environments, which is crucial for efficiently training complex robot behaviors.
AMP:Retargeted motion capture datawas used forrecovery behaviors. This refers to human motion data that has been adapted to fit the robot's kinematics, providing naturalistic examples of how a human would stand up.LAFAN1 datawas used forlocomotion behaviors.LAFAN1is a dataset of human locomotion motions, providing diverse examples of human walking, running, etc. These "datasets" forAMPare essential because they provide thehuman motion priorsthat guide theRL policytowards generating natural and stable movements, which is particularly effective for standing-up and walking. They are effective for validating human-like movement generation, a core aspect of humanoid control.
5.2. Evaluation Metrics
For both the locomotion and fall recovery tasks, the paper primarily reports Success Rate (Succ.), along with Traversing Distance (Dist.) for the locomotion task.
5.2.1. Success Rate (Succ.)
-
Conceptual Definition:
Success ratequantifies the percentage of trials in which the robot achieves a predefined objective without premature termination. It focuses on the robot's ability to complete the task reliably under specified conditions. -
For Locomotion Task:
- Definition: The percentage of trials where the robot traverses the full
8mterrain within20swithout termination. - Context: During evaluation, the robot is given a fixed forward velocity command, uniformly sampled from , at the start of each episode. An episode terminates if the robot either
walks off the current8m \times 8mterrain patchorfalls irrecoverably. Formulti-behavior policies(like and ), falling does not trigger termination, as the robot is expected to recover and resume.
- Definition: The percentage of trials where the robot traverses the full
-
For Recovery Task:
- Definition: The percentage of trials where the robot successfully
stands up from a fallen posture and maintains balance without falling again within 10 s.
- Definition: The percentage of trials where the robot successfully
5.2.2. Traversing Distance (Dist.)
-
Conceptual Definition:
Traversing distancemeasures the average physical distance covered by the robot during a trial, regardless of whether the trial was successful or failed. It provides an indication of how far the robot can navigate before encountering an insurmountable obstacle or failing. -
Mathematical Formula: There is no single standardized mathematical formula for "traversing distance" as it depends on the simulation environment's coordinate system and how distance is recorded. Generally, it refers to the Euclidean distance covered by the robot's base or center of mass from its starting point until termination. Let be the initial position of the robot's base and be its position at termination. The 2D traversing distance would be: $ \text{Dist} = \sqrt{(x_t - x_0)^2 + (y_t - y_0)^2} $ The paper calculates this over all trials, including successful and failed ones.
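The two metrics can be computed from logged trials as in the following sketch (illustrative code; the trial record format is an assumption):

```python
import math

def evaluate_trials(trials):
    """Compute Success Rate and mean Traversing Distance over logged trials.

    Each trial is a dict with 'success' (bool) and 'start'/'end' 2-D base positions.
    The distance is averaged over all trials, successful or not.
    """
    succ = sum(t["success"] for t in trials) / len(trials)
    dist = sum(math.dist(t["start"], t["end"]) for t in trials) / len(trials)
    return succ, dist

trials = [
    {"success": True, "start": (0.0, 0.0), "end": (8.0, 0.3)},
    {"success": False, "start": (0.0, 0.0), "end": (3.1, -0.4)},
]
print(evaluate_trials(trials))  # -> (0.5, about 5.6)
```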
5.3. Baselines
To evaluate the effectiveness of the AHC framework, it was compared against several baseline methods, as well as intermediate policies from its own training stages:
-
HOMIE (Ben et al. 2025): This method focuses on
humanoid loco-manipulation. For comparison, its lower-body locomotion policy was re-trained on the specific terrain settings used in the AHC experiments. This baseline represents a state-of-the-art single-skill locomotion controller.
-
HoST (Huang et al. 2025b): This method focuses on
learning humanoid standing-up control across diverse postures. Its standing-up controller was likewise re-trained on the terrain settings used in AHC. This baseline represents a state-of-the-art single-skill recovery controller.
-
Fail Recovery Policy (): This is the
behavior-specific policy for fall recovery trained in the first stage of the AHC framework. It shows the performance of the recovery component before distillation and fine-tuning.
-
Walking Policy (): This is the
behavior-specific policy for walking trained in the first stage of the AHC framework. It shows the performance of the locomotion component before distillation and fine-tuning.
-
Basic Multi-Behaviors Policy (): This is the
basic multi-behavior policy obtained from the first-stage distillation process in AHC. It represents the combined knowledge of basic recovery and walking behaviors before RL fine-tuning for terrain adaptability.

These baselines and intermediate policies are representative because they cover both
single-skill (locomotion or recovery) state-of-the-art methods and the foundational components of the AHC framework itself, allowing for a detailed component-wise and holistic performance comparison.
5.4. General Experimental Setup
- Simulator: IsaacGym for all training and simulation evaluations, leveraging 4096 parallel environments.
- Training Iterations:
  - Behavior-specific policies: 10,000 iterations.
  - Policy distillation (to obtain the basic multi-behavior policy): 4,000 iterations.
  - RL fine-tuning (to obtain the final AHC policy): 10,000 additional iterations using online RL.
- Hardware: Two NVIDIA RTX 4090 GPUs for RL fine-tuning.
- Action Space: 20 Degrees of Freedom (DoF) action space (excluding waist joints).
- Policy Deployment: On a Unitree G1 humanoid robot.
- Motor Control: A PD controller converts target joint positions into torques (see the sketch after this list).
- Evaluation Environments: Four types of terrain patches:
  - Flat.
  - Slope: Inclination angle uniformly sampled within a specified range.
  - Hurdle: Obstacle heights uniformly sampled between 0.08m and 0.1m. The locomotion task uses 3 obstacles; the recovery task uses 8 obstacles (more cluttered).
  - Discrete: Randomly positioned rectangular blocks with height variations from 0.08m to 0.1m.
- Evaluation Trials: All evaluations were conducted using 1000 parallel environments for statistical significance.
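To illustrate the PD control step named in the setup above, here is a minimal sketch of converting policy joint-position targets into motor torques; the gains, default targets, and example values below are placeholders, not the paper's deployment parameters.

```python
import numpy as np

def pd_torques(q_target, q, dq, kp, kd):
    """PD law: torque = kp * (target position - position) - kd * velocity.

    q_target : desired joint positions output by the policy
    q, dq    : measured joint positions and velocities
    kp, kd   : proportional and derivative gains (per joint)
    """
    return kp * (q_target - q) - kd * dq

# Hypothetical example for a 20-DoF action space (gains are placeholders).
num_dof = 20
kp = np.full(num_dof, 40.0)   # stiffness
kd = np.full(num_dof, 1.0)    # damping
q_target = np.zeros(num_dof)  # policy output (default pose here)
q = np.random.uniform(-0.05, 0.05, num_dof)
dq = np.zeros(num_dof)
tau = pd_torques(q_target, q, dq, kp, kd)
```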
6. Results & Analysis
6.1. Core Results Analysis
The paper presents experimental results from both simulation and real-world deployments to validate the AHC framework's effectiveness.
The following are the results from Table 1 of the original paper:
| Method | Plane Succ. | Plane Dist. | Slope Succ. | Slope Dist. | Hurdle Succ. | Hurdle Dist. | Fail Recovery Succ. |
|---|---|---|---|---|---|---|---|
| HOMIE (Ben et al. 2025) | 0.802 | 6.421 | 0.599 | 4.795 | 0.407 | 3.259 | - |
| HoST (Huang et al. 2025b) | - | - | - | - | - | - | 0.801 |
| Fail Recovery Policy | - | - | - | - | - | - | 0.852 |
| Walking Policy | 0.825 | 6.602 | 0.627 | 4.981 | 0.428 | 3.456 | - |
| Basic Multi-Behaviors Policy | 0.898 | 7.184 | 0.698 | 5.584 | 0.521 | 4.168 | 0.885 |
| AHC (Ours) | 0.915 | 7.320 | 0.781 | 6.248 | 0.612 | 4.900 | 0.923 |
Analysis of Table 1 (performance evaluation on locomotion and fail recovery tasks across different terrains):
- Overall Superiority of AHC: The proposed AHC policy consistently outperforms all baselines and intermediate policies across both locomotion and recovery tasks on various terrain types. In locomotion, AHC achieves the highest Success Rate and Traversing Distance on all listed terrains (Plane: 0.915 Succ., 7.320 Dist.; Slope: 0.781 Succ., 6.248 Dist.; Hurdle: 0.612 Succ., 4.900 Dist.; Discrete results are not explicitly listed, but given the trend AHC would likely remain highest). For Fail Recovery, AHC achieves a 0.923 Success Rate on discrete terrain, surpassing HoST (0.801) and even its own first-stage Fail Recovery Policy (0.852).
- Advantage in Locomotion on Complex Terrains: AHC's significant performance gain in locomotion tasks, especially on hurdle and discrete terrains, is attributed to its ability to autonomously recover from falls and resume traversal. This multi-behavior capability is critical where obstacles frequently cause imbalances. HOMIE, a strong locomotion baseline, shows lower success rates and traversing distances on challenging terrains, likely because it lacks explicit recovery capabilities.
- Impact of AMP: The paper highlights that the Adversarial Motion Prior (AMP) mechanism, by providing motion priors, guides the policy towards stable and robust behaviors. This contributes to AHC's better performance compared to HoST, which lacks AMP integration.
- Benefits of Multi-Behavior Distillation: Comparing the Basic Multi-Behaviors Policy with the behavior-specific policies shows that distillation brings superior robustness in the locomotion task on complex terrains. For example, on Slope terrain, the basic multi-behavior policy achieves 0.698 Success and 5.584 Distance, significantly better than the walking policy (0.627 Succ., 4.981 Dist.). This indicates that the seamless integration of walking and recovery behaviors within a single policy (via distillation) inherently improves robustness, even before specific terrain adaptation.
- Effectiveness of RL Fine-Tuning: The RL fine-tuning stage, which trains on diverse terrains to produce the final AHC policy, yields further improvements. For instance, on Slope terrain, AHC achieves 0.781 Success and 6.248 Distance for locomotion, up from the basic policy's 0.698 Success and 5.584 Distance. This highlights the transferability of the two-stage training framework, where the basic multi-behavior policy is effectively adapted to challenging terrains.
6.2. AMP for Standing-Up
As can be seen from the results in Figure 3, the paper compares the recovery motions generated by AHC (which incorporates AMP) against HoST (which does not use AMP).

Figure 3: Comparison of recovery motions under AHC and HoST. We compare our AHC (with AMP) against the HoST (w/o AMP) in both lying and prone scenarios. AHC produces smoother recovery behaviors. This highlights the effectiveness of AMP in guiding the learning of naturalistic recovery motions.
- Qualitative Observation: Figure 3 visually demonstrates that AHC generates smoother and more natural recovery behaviors from both lying and prone positions. HoST, lacking AMP, exhibits uncoordinated and jerky motions, often relying on abrupt limb movements to stand up. In contrast, AHC produces a natural get-up motion, including leg folding, arm support, and trunk lifting, which are characteristic of human recovery. This provides strong visual evidence for the effectiveness of AMP in shaping the controller towards human-like dynamics.

As can be seen from the results in Figure 4, the paper further analyzes the joint acceleration during recovery.
Figure 4: Joint acceleration analysis of the left leg during recovery. Acceleration profiles of hip and knee joints from the left leg illustrate that our AHC results in stable joint actuation, with notably fewer abrupt fluctuations compared to HoST.
- Quantitative Observation: Figure 4 presents the acceleration profiles of the hip and knee joints of the left leg during recovery. The AHC policy exhibits stable joint actuation with notably fewer abrupt fluctuations compared to the HoST policy. This indicates that AMP leads not only to visually natural motions but also to mechanically smoother and more controlled movements, which are crucial for robustness and energy efficiency on real robots. These results confirm that AMP helps in learning stable recovery controllers, a feat difficult to achieve with hand-crafted reward functions alone. A small sketch of how such acceleration profiles can be derived from logged joint velocities follows.
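The kind of smoothness comparison shown in Figure 4 can, in principle, be reproduced by finite-differencing logged joint velocities. The sketch below is an independent illustration; the logging format, sampling rate, and fluctuation score are assumptions, not the paper's analysis script.

```python
import numpy as np

def joint_acceleration(joint_vel, dt):
    """Estimate joint accelerations by finite-differencing logged velocities.

    joint_vel : array of shape (T, num_joints), sampled every dt seconds
    returns   : array of shape (T - 1, num_joints)
    """
    return np.diff(joint_vel, axis=0) / dt

def fluctuation_score(acc):
    """Mean absolute change in acceleration between steps (lower = smoother)."""
    return np.mean(np.abs(np.diff(acc, axis=0)))

# Hypothetical velocity log for two joints (e.g., hip and knee) at 50 Hz.
dt = 0.02
t = np.arange(0.0, 3.0, dt)
joint_vel = np.stack([np.sin(2 * np.pi * t), 0.5 * np.cos(2 * np.pi * t)], axis=1)
acc = joint_acceleration(joint_vel, dt)
print("fluctuation score:", fluctuation_score(acc))
```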
6.3. Ablation on PCGrad and Behavior-Specific Critics
The paper conducts an ablation study to evaluate the individual contributions of PCGrad and behavior-specific critics in the second-stage fine-tuning. Four configurations are examined:
- AHC-SC-w/o-PC: Single shared critic without PCGrad.
- AHC-SC-PC: Single shared critic with PCGrad.
- AHC-BC-w/o-PC: Behavior-specific critics without PCGrad.
- AHC (Ours): Behavior-specific critics with PCGrad (the full proposed method).

The following are the results from Table 2 of the original paper:

| Method | Cosine Similarity (↑) |
|---|---|
| AHC-SC-w/o-PC | 0.247 |
| AHC-SC-PC | 0.519 |
| AHC-BC-w/o-PC | 0.334 |
| AHC (ours) | 0.535 |
Analysis of Table 2: Gradient Cosine Similarity between Tasks across Different Ablation Settings.
- Impact on Gradient Conflict: Cosine similarity measures the alignment of gradients, with higher values indicating less conflict.
  - PCGrad significantly reduces gradient conflict. Comparing AHC-SC-w/o-PC (0.247) with AHC-SC-PC (0.519), adding PCGrad nearly doubles the cosine similarity when using a shared critic.
  - Similarly, comparing AHC-BC-w/o-PC (0.334) with AHC (0.535), PCGrad again substantially increases similarity with behavior-specific critics.
- Impact of Behavior-Specific Critics: The use of behavior-specific critics itself also helps alleviate conflicts. AHC-BC-w/o-PC (0.334) shows higher similarity than AHC-SC-w/o-PC (0.247), suggesting that separating value learning inherently reduces some task interference.
- Combined Effect: The AHC (ours) configuration, combining both PCGrad and behavior-specific critics, achieves the highest cosine similarity (0.535), demonstrating their synergistic effect in mitigating gradient conflicts.

As can be seen from the results in Figure 5, the paper further examines value loss curves during fine-tuning.
Figure 5: Value loss curves during the second-stage fine-tuning. Policies equipped with behavior-specific critics (AHC-BC-w/o-PC and AHC) indicate more stable value learning compared to their shared-critic counterparts (AHC-SC).
- Analysis of Figure 5 (Value Loss Curves): Policies equipped with behavior-specific critics (AHC-BC-w/o-PC and AHC) consistently achieve lower value loss and exhibit more stable value learning compared to their shared-critic counterparts (AHC-SC-w/o-PC and AHC-SC-PC). This supports the hypothesis that decoupling value learning for each task helps mitigate optimization difficulties arising from reward scale discrepancies and leads to more accurate value function estimation.

As can be seen from the results in Figure 6, the paper visualizes the training episode return curves during fine-tuning.
Figure 6: Training episode return curves during second-stage fine-tuning. With PCGrad and behavior-specific critics, AHC achieves higher and more balanced returns across tasks.
- Analysis of Figure 6 (Training Episode Return Curves):
  - Shared-Critic Variants (AHC-SC): These configurations (blue and orange curves) tend to neglect the locomotion task (lower returns) due to its potentially smaller reward magnitude compared to the recovery task. This illustrates the gradient imbalance problem.
  - AHC (Ours): The full AHC configuration (green curve), which combines PCGrad and behavior-specific critics, maintains high and more balanced returns for both tasks. This indicates superior performance in multi-task learning by effectively optimizing for both locomotion and recovery simultaneously.
  - Convergence Speed: AHC also exhibits faster convergence compared to the other settings, reaching higher returns more quickly.
- Conclusion of Ablation: These results strongly highlight the effectiveness of incorporating both PCGrad and behavior-specific critics in facilitating balanced and efficient optimization during the second-stage fine-tuning, leading to a robust and adaptive multi-behavior controller. A minimal sketch of the underlying gradient operations follows.
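To make the quantities in this ablation concrete, the following is a minimal PyTorch-style sketch of measuring the cosine similarity between two task gradients and applying a PCGrad-style projection for the two-task case; it is an independent illustration, not the authors' implementation.

```python
import torch

def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened task gradients."""
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)

def pcgrad_combine(g_loco, g_recov):
    """PCGrad-style surgery for two tasks: if the gradients conflict
    (negative inner product), project each onto the normal plane of the
    other before summing them."""
    g1, g2 = g_loco.clone(), g_recov.clone()
    if torch.dot(g_loco, g_recov) < 0:
        g1 = g1 - torch.dot(g1, g_recov) / (g_recov.norm() ** 2 + 1e-12) * g_recov
        g2 = g2 - torch.dot(g2, g_loco) / (g_loco.norm() ** 2 + 1e-12) * g_loco
    return g1 + g2

# Hypothetical flattened actor gradients for the two tasks.
g_loco = torch.randn(1000)
g_recov = torch.randn(1000)
print("cosine similarity:", cosine_similarity(g_loco, g_recov).item())
combined = pcgrad_combine(g_loco, g_recov)
```

In a multi-task update, the combined gradient would replace the naive sum of task gradients before the optimizer step, which is what removes the destructive interference measured in Table 2.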
6.4. Deployment Results
The paper also presents qualitative results from deploying the trained AHC policy on a Unitree G1 humanoid robot in real-world settings, without any additional fine-tuning.
As can be seen from the results in Figure 7, the paper shows a sequence of deployment snapshots.

Figure 7: Snapshot of real-world deployment. The robot performs recovery and locomotion in diverse scenarios, including standing up from prone and lying positions on sloped terrain and recovering after external pushes during walking.
- Recovery Evaluation: The robot successfully recovers from various fallen postures (supine and prone) on both flat ground and inclined terrain. It also manages moderate external disturbances during recovery. After recovery, it stabilizes itself and smoothly transitions into a walking-ready posture, demonstrating natural and coordinated motion.
- Locomotion Evaluation: The robot walks stably on flat ground and inclined surfaces, effectively tracking velocity commands. When external pushes are applied in random directions during walking, the robot generally withstands the perturbations and continues. If a fall does occur, it autonomously performs the recovery maneuver and resumes locomotion.
- Significance: These real-world results are crucial, as they validate the policy's ability to bridge the sim-to-real gap. They also demonstrate that the AHC policy successfully integrates recovery and locomotion behaviors in a cohesive and robust manner, showcasing strong resilience and long-horizon autonomy in dynamic environments. The visual evidence in Figure 7 confirms these capabilities.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Adaptive Humanoid Control (AHC), a novel two-stage framework for learning an adaptive humanoid locomotion controller. The first stage, Multi-Behavior Distillation, trains behavior-specific policies using Adversarial Motion Priors (AMP) and then distills this knowledge into a basic multi-behavior policy with a Mixture-of-Experts (MoE) architecture. This initial step effectively addresses the challenges of gradient conflicts inherent in multi-skill learning by creating a foundational controller capable of basic adaptive behavior switching. The second stage, Reinforced Fine-Tuning, further enhances the controller's terrain adaptability on diverse terrains by utilizing online RL, integrating behavior-specific critics, and employing gradient projection (PCGrad) to mitigate gradient conflicts in the shared actor.
Extensive experiments in both the IsaacGym simulator and on a real-world Unitree G1 humanoid robot rigorously validate the effectiveness of AHC. The results demonstrate that the learned controller enables robust locomotion across challenging terrains (slopes, hurdles, discrete obstacles) and effective recovery from various types of falls. The AMP integration leads to smoother and more natural motions, while PCGrad and behavior-specific critics ensure balanced and efficient multi-task learning. The seamless sim-to-real transfer and the ability to autonomously switch between recovery and locomotion in real-world scenarios highlight the practical utility and resilience of the proposed AHC policy.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and propose future research directions:
- Perceptual Capabilities: Currently, the AHC framework relies primarily on proprioceptive information. A significant limitation is the lack of integration with external sensors (e.g., depth cameras, LiDARs) for environmental perception. This means the robot cannot actively perceive and reason about its surroundings beyond basic contact information.
- Limited Behavior Categories: While AHC successfully integrates recovery and locomotion, the range of behaviors is still limited to these two primary skills. To be truly versatile, humanoid robots would need to master a much broader set of human-like behaviors (e.g., manipulation, navigation in cluttered spaces, more dynamic movements such as jumping over larger gaps).

Based on these limitations, the authors suggest the following future work:

- Augmenting Perceptual Capabilities: Incorporating external sensors to enable the robot to perceive and understand its environment more comprehensively. This would allow for more intelligent and adaptive navigation, obstacle avoidance, and interaction with complex surroundings.
- Expanding Behavior Categories: Extending the AHC framework to include a wider range of behaviors, aiming for even greater generalization and versatility in humanoid robot control. This could involve exploring more complex loco-manipulation tasks or advanced social interactions.
7.3. Personal Insights & Critique
This paper presents a well-structured and technically sound approach to a critical problem in humanoid robotics. The two-stage framework is a particularly insightful design choice, effectively decoupling the complex problem of learning diverse skills and adapting to varied terrains. Attempting to solve both simultaneously often leads to optimization instabilities in RL, making the distillation-then-fine-tuning strategy a pragmatic and robust solution.
Inspirations and Transferability:
- Modular Learning: The idea of learning foundational skills (behavior-specific policies) and then distilling them into a unified multi-behavior controller is highly inspiring. This modular approach could transfer to other complex multi-task learning domains beyond robotics, where a single large model struggles to learn diverse sub-tasks. For instance, in natural language processing, a similar strategy could be used to combine specialized language models into a more general-purpose agent.
- Robustness via AMP and Domain Randomization: The use of AMP to infuse human-like priors and of domain randomization for sim-to-real transfer are well-established techniques, but their effective combination here highlights their continued importance. This reinforces the idea that realistic simulations with sufficient variability are key to real-world deployment.
- Addressing Gradient Conflict: The explicit and effective application of PCGrad and behavior-specific critics to mitigate gradient conflicts is a crucial takeaway. This is a common bottleneck in multi-task deep learning, and the quantitative results (cosine similarity, value loss, returns) clearly demonstrate the benefits of these techniques. These methods could be widely applied to any multi-task learning problem facing similar gradient challenges.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Reliance on Reference Motions: While AMP is powerful, it inherently relies on the availability and quality of reference motion data. Generating or acquiring high-quality motion capture data for all desired complex behaviors (especially loco-manipulation or novel interactions) can be challenging and expensive. This could limit the scalability of AMP for truly novel or non-human-like behaviors.
- Computational Cost: Training behavior-specific policies for each skill, followed by distillation and then RL fine-tuning across diverse terrains, is computationally intensive. The use of MoE also adds to model complexity, although it helps with gradient conflicts. While the paper mentions using IsaacGym with 4096 parallel environments and NVIDIA RTX 4090 GPUs, the overall training time and resource requirements could still be substantial, potentially limiting accessibility for researchers without significant computational resources.
- Scalability of PCGrad: While PCGrad is effective for a small number of tasks (here, two), its scalability to a very large number of highly diverse tasks might introduce computational overhead, as the number of gradient projections grows quadratically with the number of tasks. Future work would need to investigate more efficient gradient surgery or gradient weighting techniques for scenarios with many behaviors.
- Definition of Disjoint State Spaces: The paper assumes disjoint state spaces for the different MDPs in the multi-task formulation. While this simplifies the problem formulation, in reality the robot's state may contain ambiguous regions that could belong to multiple behaviors (e.g., "about to fall" could be a state for both walking and recovery). The current state-based switching (a base height threshold) is simple but might be brittle in edge cases. A more sophisticated, probabilistic, or context-aware behavior arbitration mechanism could enhance robustness (see the small sketch at the end of this section).
- The "Black Box" of MoE Gating: While MoE helps, the gating network itself can be a "black box." Understanding how it allocates tasks to experts, and whether it is truly making optimal decisions for dynamic behavior switching, could be an area for further analysis.

Overall, AHC represents a robust step towards general-purpose humanoid control, effectively demonstrating how a clever combination of existing and novel RL techniques can overcome significant challenges in multi-skill and multi-terrain adaptation for complex robotic systems.
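To make the arbitration concern above concrete, here is a minimal sketch of base-height-threshold behavior switching extended with a simple hysteresis band; the thresholds, state layout, and class names are illustrative assumptions, not the paper's deployment logic.

```python
RECOVERY, LOCOMOTION = "recovery", "locomotion"

class BehaviorArbiter:
    """Toy base-height switcher with hysteresis to avoid rapid mode flipping.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    def __init__(self, fall_height=0.35, upright_height=0.60):
        self.fall_height = fall_height        # switch to recovery below this
        self.upright_height = upright_height  # switch back to locomotion above this
        self.mode = LOCOMOTION

    def update(self, base_height):
        if self.mode == LOCOMOTION and base_height < self.fall_height:
            self.mode = RECOVERY
        elif self.mode == RECOVERY and base_height > self.upright_height:
            self.mode = LOCOMOTION
        return self.mode

arbiter = BehaviorArbiter()
for h in [0.70, 0.30, 0.45, 0.65]:  # simulated base heights over time
    print(h, arbiter.update(h))
```

The hysteresis band prevents rapid oscillation between modes when the base height hovers near a single threshold, which is one simple way to make a hard switching rule less brittle.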