Learning Robust and Agile Legged Locomotion Using Adversarial Motion Priors

Published: 06/28/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study presents the first blind locomotion system for legged robots that both traverses challenging terrains robustly and moves rapidly over natural terrains, using Adversarial Motion Priors (AMP) to achieve zero-shot generalization from flat-terrain motion data to the real world.

Abstract

Developing both robust and agile locomotion skills for legged robots is non-trivial. In this work, we present the first blind locomotion system capable of traversing challenging terrains robustly while moving rapidly over natural terrains. Our approach incorporates the Adversarial Motion Priors (AMP) in locomotion policy training and demonstrates zero-shot generalization from the motion dataset on flat terrains to challenging terrains in the real world. We show this result on a quadruped robot Go1 using only proprioceptive sensors consisting of the IMU and joint encoders. Experiments on the Go1 demonstrate the robust and natural motion generated by the proposed method for traversing challenging terrains while moving rapidly over natural terrains.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "Learning Robust and Agile Legged Locomotion Using Adversarial Motion Priors." It focuses on developing advanced locomotion skills for legged robots that combine both robustness in challenging terrains and agility on natural terrains.

1.2. Authors

The authors of the paper are Jinze Wu, Guiyang Xin, Chenkun Qi, and Yufei Xue, with affiliation markers indicating different research institutions or departments. Guiyang Xin and Chenkun Qi are noted as Members of IEEE, indicating their professional involvement and recognition in the field of electrical and electronics engineering, which often encompasses robotics and automation.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference it was published in, but the standard format for academic papers suggests it is likely intended for a peer-reviewed publication. The index terms "Legged Robots," "Reinforcement Learning," and "Machine Learning for Robot Control" are typical for robotics and AI conferences or journals.

1.4. Publication Year

The publication date (UTC) is given as 2023-06-28.

1.5. Abstract

The paper addresses the challenge of developing both robust and agile locomotion skills for legged robots. It introduces a novel blind locomotion system—meaning it relies solely on internal (proprioceptive) sensors without external perception—that can robustly traverse challenging terrains while moving rapidly over natural ones. The core methodology integrates Adversarial Motion Priors (AMP) into the locomotion policy training. A key finding is its zero-shot generalization capability, where the policy, trained on motion data from flat terrains, successfully applies to complex real-world environments. This system is demonstrated on a Unitree Go1 quadruped robot using only IMU and joint encoders. Experiments show the method generates natural and robust motion for traversing obstacles and agile movement over varied outdoor terrains.

The original source link provided is /files/papers/695a514dba758f5b2a1ecae0/paper.pdf, which appears to be a direct link to the PDF hosted alongside this analysis rather than a public URL. Given the formal structure and presentation, the paper is likely an officially published or accepted work.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the non-trivial task of developing locomotion skills for legged robots that are simultaneously robust (able to handle challenging terrains) and agile (able to move rapidly over natural terrains).

This problem is crucial in the current field because legged robots have the potential for versatile real-world applications (e.g., search and rescue, exploration, logistics) that require adapting to diverse and unpredictable environments. Current research faces specific challenges:

  • Robustness: Most existing methods for robust locomotion over challenging terrains rely on computationally intensive exteroceptive sensors like cameras and LIDARs. These sensors are susceptible to environmental factors (lighting, weather) and require complex foothold planning, making the systems less reliable and more complex.

  • Agility: Approaches for agile locomotion often use model predictive control (MPC) with hand-designed models, which struggle to balance model accuracy with computational complexity.

  • Combination Challenge: Combining these two capabilities—robustness and agility—into a single controller is a significant hurdle.

  • Naturalness of Motion: Even with Reinforcement Learning (RL) methods, many legged robot movements in experiments appear unnatural and jerky, indicating a lack of refined gait control.

    The paper's entry point and innovative idea revolve around using proprioceptive sensors (such as joint encoders and IMU), which are more robust and fundamental, to achieve both robust and agile locomotion. It leverages Reinforcement Learning (RL) to learn a blind locomotion controller (one that does not use external perception) and addresses the unnatural motion problem by incorporating Adversarial Motion Priors (AMP) into the policy training.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of legged robotics and Reinforcement Learning:

  • First Blind Locomotion System for Robust and Agile Skills: The paper presents what it claims to be the first blind locomotion system capable of traversing challenging terrains robustly while moving rapidly over natural terrains. This addresses the long-standing challenge of combining these two capabilities without relying on exteroceptive sensors.

  • Incorporation of Adversarial Motion Priors (AMP) for Natural Gaits: The approach successfully integrates Adversarial Motion Priors (AMP) into the locomotion policy training. This AMP technique introduces "soft gait priors" that replace complex auxiliary rewards with a style reward, enabling the robot to learn natural and smooth motion styles.

  • Zero-Shot Generalization: The work demonstrates zero-shot generalization from a motion dataset collected on flat terrains to challenging, real-world environments. This means the robot can adapt to unforeseen obstacles and disturbances without prior explicit training on those specific difficult terrains.

  • Teacher-Student Training Framework: The authors propose a teacher-student training framework that concurrently learns robust locomotion over challenging terrains and agile locomotion over even terrains, relying solely on proprioception. This framework effectively distills knowledge from a teacher (trained with privileged information in simulation) to a student (which operates blindly).

  • Demonstration on Real Hardware (Go1 Robot): The method is validated on a real quadruped robot, the Unitree Go1, using only proprioceptive sensors (IMU and joint encoders). The robot achieves robust traversal of obstacles up to 25 cm high (e.g., curbs, stairs, rocky ground, vegetation) and agile movements outdoors (sprinting up to 3.5 m/s and spinning at up to 5.8 rad/s), exhibiting natural and smooth gaits across all conditions.

    The key findings are that a single RL policy, trained with AMP and a teacher-student framework, can indeed achieve both robust and agile locomotion capabilities on a real robot using only proprioceptive feedback, showcasing impressive zero-shot generalization and natural motion. These findings solve the problem of requiring separate controllers or complex exteroceptive sensing for robust and agile behaviors, offering a more streamlined and reliable approach.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp several fundamental concepts related to robotics, control, and machine learning:

  • Legged Robots: These are robots that use articulated limbs (legs) for locomotion, mimicking biological organisms. They are designed to navigate complex and uneven terrains where wheeled or tracked robots might struggle, offering high mobility and adaptability. Examples include quadrupedal (four-legged) robots like the Unitree Go1 used in this paper.

  • Locomotion: Refers to the act or power of moving from place to place. In robotics, it involves the algorithms and control strategies that enable a robot to move its body and limbs to achieve desired velocities and navigate an environment.

  • Proprioceptive Sensors: These are internal sensors that provide information about the robot's own state, such as joint angles, joint velocities, body orientation, and angular velocity. They report on the robot's internal configuration and movement.

    • IMU (Inertial Measurement Unit): A device that measures and reports a body's velocity, orientation, and gravitational forces. It typically combines accelerometers, gyroscopes, and sometimes magnetometers.
    • Joint Encoders: Sensors attached to robot joints that measure the angular position or velocity of each joint.
    • Why they are important: Proprioceptive sensors are generally more robust to environmental conditions (e.g., lighting, weather) than exteroceptive sensors and are fundamental for low-level control and stability.
  • Exteroceptive Sensors: These are external sensors that gather information about the robot's environment.

    • Cameras: Provide visual data (images/videos) of the surroundings.
    • LIDARs (Light Detection and Ranging): Use pulsed laser light to measure distances to the surrounding environment, creating detailed 3D maps.
    • Why they are challenging: While providing rich environmental context, exteroceptive sensors can be computationally intensive, susceptible to environmental conditions (e.g., dust, fog, bright sunlight), and require complex processing for tasks like foothold planning.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent performs actions, receives rewards (or penalties) based on those actions, and observes new states. The goal is to learn a policy—a mapping from states to actions—that maximizes the cumulative reward over time.

    • Agent: The learner or decision-maker (e.g., the robot's controller).
    • Environment: The world with which the agent interacts (e.g., the simulated or real terrain).
    • State ($\mathbf{s}$ or $\mathbf{x}$): A complete description of the environment at a given time.
    • Action ($\mathbf{a}$): A decision or output from the agent that affects the environment.
    • Reward ($r$): A scalar feedback signal from the environment indicating the desirability of an action taken from a state.
    • Policy ($\pi_{\theta}$): A function parameterized by $\theta$ that dictates the agent's behavior, mapping observed states to actions.
    • Discount Factor ($\gamma$): A value between 0 and 1 that determines the present value of future rewards. A higher $\gamma$ makes the agent care more about long-term rewards.
  • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It assumes that the future state depends only on the current state and action, not on the sequence of events that preceded it (Markov property).

    • Partially Observable Markov Decision Process (POMDP): An MDP variant where the agent does not have direct access to the true underlying state of the environment. Instead, it receives observations that are related to the state but may be noisy or incomplete. This is typical for blind locomotion where external terrain features are unknown.
  • Sim-to-real Transfer: The process of training a robot control policy in a simulation environment and then deploying it on a physical robot. This is crucial because training in the real world is often expensive, time-consuming, and potentially damaging to the robot. However, sim-to-real gaps (differences between simulation and reality) can make direct transfer challenging. Techniques like dynamics randomization are used to bridge these gaps.

  • Model Predictive Control (MPC): An advanced control method that uses a model of the system to predict its future behavior over a finite time horizon. At each time step, an optimization problem is solved to determine the optimal control actions that minimize a cost function while satisfying constraints. Only the first part of the optimal control sequence is implemented, and the process is repeated at the next time step with updated measurements.

  • Generative Adversarial Networks (GAN): A class of machine learning frameworks where two neural networks, a generator and a discriminator, compete against each other.

    • Generator: Learns to create new data instances that resemble the training data (e.g., generate realistic images or robot motions).
    • Discriminator: Learns to distinguish between real data (from the training dataset) and fake data (generated by the generator).
    • The two networks are trained simultaneously in a zero-sum game: the generator tries to fool the discriminator, and the discriminator tries to not be fooled. This adversarial process drives both networks to improve.
  • Trajectory Optimization (TO): A method used to find a sequence of control inputs (actions) and corresponding states (trajectory) that optimize a certain objective function (e.g., minimize energy, maximize speed) while satisfying system dynamics and constraints. It's often used to generate physically plausible and efficient motions for robots.

3.2. Previous Works

The paper discusses various prior studies, broadly categorized into those focusing on robust locomotion, agile locomotion, and motion imitation using RL.

  • RL for Robust Locomotion (using privileged learning or exteroceptive sensors):

    • [1] Hwangbo et al. (2019): Used an actuator network to model complex actuator dynamics, aiding sim-to-real transfer for the ANYmal robot. This improved agility and dynamics.
    • [2] Lee et al. (2020): Extended the robustness of locomotion by training robots on diverse terrains using the privileged learning paradigm. Privileged learning involves training a "teacher" policy with access to additional information (privileged states) only available in simulation, then distilling this knowledge into a "student" policy that only uses deployable observations.
    • [3] Kumar et al. (2021) - RMA (Rapid Motor Adaptation): A similar teacher-student framework that further improved robustness. The student policy included a 1-D CNN adaptation module to infer privileged information from observations. This is a direct baseline for comparison in the current paper.
    • [4] Miki et al. (2022) & [5] Agarwal et al. (2022): Integrated proprioceptive and exteroceptive states (e.g., cameras, LIDARs) to improve locomotion efficiency in the wild. These works demonstrated robust locomotion but typically at low or moderate speeds and relied on external sensing. The current paper differentiates by achieving robustness blindly.
    • [6] Yu et al. (2022): Focused on Visual-locomotion for complex terrains with vision, again highlighting the reliance on exteroceptive sensing.
  • RL for Agile Locomotion (high-speed, but potentially unnatural):

    • [7] Margolis et al. (2022): Used RL to train a Mini Cheetah robot for high-speed movements over natural terrains. They achieved remarkable results in sprinting and spinning but noted that the locomotion gaits could be unnatural. This work is also a baseline (Concurrent in the paper) in terms of its high-speed training curriculum.
    • [8] Ji et al. (2022): Proposed concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion. This method also focused on agility but did not explicitly address naturalness of gait on challenging terrains.
  • RL for Motion Imitation (to achieve natural gaits):

    • [15] Peng et al. (2018) - DeepMimic: Used example-guided deep reinforcement learning for physics-based character skills. These methods often utilize motion tracking techniques, where an agent mimics a desired motion by explicitly tracking target poses.
    • [16] Peng et al. (2020): Further explored motion tracking in RL for legged robots.
    • [17] Florensa et al. (2017), [18] Peng et al. (2019), [19] Yang et al. (2020): Used hierarchical policies and latent space models to learn reusable skills from large motion datasets, often dividing the problem into pre-training and transfer stages. Many of these were verified in simulation with moderately challenging terrains.
    • [20] Bogdanovic et al. (2022) & [21] Fuchioka et al. (2022): Leveraged trajectory optimization (TO) to generate stable demonstrations for model-free RL for robust locomotion. These methods still rely on a pre-training stage to track reference motions, which can be decoupled from high-level tasks.
  • Generative Adversarial Imitation Learning (GAIL) and Adversarial Motion Priors (AMP):

    • [22] Ho and Ermon (2016) - GAIL: A general approach for imitating behaviors from unstructured demonstration datasets without motion tracking. It integrates GANs into RL with a discriminator measuring similarity between policy and demonstrations.
    • [23] Goodfellow et al. (2020): The foundational work on Generative Adversarial Networks (GANs).
    • [24] Peng et al. (2021) - AMP (Adversarial Motion Priors): Proposed AMP, using the GAIL framework to predict if a state transition is real (from a dataset) or fake (from the agent). This enabled simulated agents to perform high-level tasks with motion styles from unstructured datasets. This is the direct inspiration for the current paper's core methodology.
    • [25] Peng et al. (2022): Extended AMP to latent space (ASE) for reusable adversarial skill embeddings.
    • [26] Escontrela et al. (2022): Applied AMP to learn quadrupedal locomotion on flats without complex reward functions.
    • [27] Vollenweider et al. (2022): Extended AMP to multiple style control for wheeled-legged locomotion.
    • [28] Li et al. (2022): Showed AMP can enable quadruped robots to learn agile skills using a Wasserstein GAN formulation and rough partial demonstrations.
    • The key limitation of these prior AMP works: Most motion priors were collected on even terrains, leaving the effectiveness of AMP for robust locomotion over complex terrains unknown. This is precisely what the current paper investigates.

3.3. Technological Evolution

The field of legged robot locomotion control has evolved from traditional optimization-based approaches (MPC, trajectory optimization) that rely on hand-designed models and explicit environment sensing, to Reinforcement Learning (RL). Early RL efforts focused on either robustness (often with privileged learning or exteroceptive sensors) or agility (high-speed movements). However, these often resulted in unnatural or jerky gaits and struggled to combine both robust and agile capabilities in a single controller, especially without external sensing.

To address the unnatural gait problem, researchers turned to imitation learning, using motion tracking techniques or hierarchical policies to learn from demonstrations. Generative Adversarial Imitation Learning (GAIL) and more recently Adversarial Motion Priors (AMP) emerged as powerful tools to learn natural motion styles from unstructured datasets, essentially replacing complex reward engineering with a style reward derived from a discriminator.

This paper marks a significant step in this evolution by taking AMP, which was primarily shown effective for stylized motion on flat terrains in simulation, and applying it to achieve both robust and agile blind locomotion on a real robot over challenging, unseen terrains. It bridges the gap between stylized natural motion and real-world robustness and agility, showcasing zero-shot generalization with only proprioceptive sensors.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach presents several core differences and innovations:

  • Combines Robustness and Agility Blindly: Prior work typically achieved either robust locomotion (often relying on exteroceptive sensors [4], [5], [6] or privileged information in simulation [2], [3]) or agile locomotion (often with unnatural gaits [7], [8]). This paper is the first to achieve both robust and agile locomotion with a single policy on a real robot, using only proprioceptive sensors. This blind locomotion capability is a major differentiator.

  • Zero-Shot Generalization to Challenging Terrains with AMP: While AMP has been used for stylized motion [24], [26], [27], [28], its application for robust locomotion over complex, unseen terrains was largely unexplored, especially when the motion dataset only contains flat-terrain motions. This paper demonstrates zero-shot generalization from flat-terrain AMP data to real-world challenging terrains (stairs, curbs, vegetation), which is a novel finding for AMP's capabilities.

  • Concise Reward Functions via AMP: Instead of complex, hand-engineered auxiliary rewards to induce natural gaits or robustness (which can be tedious to tune), this method leverages the AMP style reward to implicitly encourage natural motion, significantly simplifying the reward function design. This contrasts with explicit motion tracking methods [15], [16], [20], [21] that require pre-training to track reference motions.

  • Effective Teacher-Student Framework for Blind Locomotion on Challenging Terrains: The paper utilizes a teacher-student training framework similar to RMA [3], but with a crucial difference in how the student's low-level network is handled. Unlike RMA which freezes the student's low-level network, this paper reuses the teacher's weights and continues training them for the student, which is shown to be more effective for traversing increasingly difficult terrains (demonstrated in the ablation study with 25 cm steps). This allows the student to better adapt to complex situations by continuously refining its control layers.

  • Proprioception-Only Sensing: Unlike works that combine proprioceptive and exteroceptive sensors [4], [5], [6], this paper strictly adheres to proprioceptive-only sensing, making the system inherently more robust to environmental conditions and less computationally demanding for perception. The memory encoder (LSTM) in the student policy is critical for inferring unobservable states from historical proprioceptive observations.

4. Methodology

4.1. Principles

The core idea of the method is to enable a quadruped robot to achieve both robust and agile locomotion using only proprioceptive sensors by integrating Adversarial Motion Priors (AMP) within a teacher-student Reinforcement Learning (RL) framework. The theoretical basis is that by learning from a dataset of natural-looking motions (even if collected on flat terrains) via an adversarial process, the robot can acquire a "soft gait prior" that promotes natural movement. Simultaneously, the teacher-student framework (inspired by privileged learning) helps the blind student policy implicitly infer unobservable environmental properties (like terrain roughness) from a history of proprioceptive observations, allowing it to adapt to challenging terrains and achieve sim-to-real transfer. The reward function is kept concise by using the AMP-derived style reward in place of complex, hand-engineered auxiliary rewards.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology can be broken down into three main parts: Reinforcement Learning Problem Formulation, Reward Terms Design (including AMP), and Motion Dataset Generation, followed by the Training details for both Teacher and Student policies.

4.2.1. Reinforcement Learning Problem Formulation

The control problem is formulated as a discrete-time dynamic system. The goal of RL is to find optimal parameters $\theta$ for a policy $\pi_{\theta}$ that maximize the expected discounted return over future trajectories:

$$J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$$

Here,

  • $J(\theta)$: The objective function representing the expected total discounted reward, which the RL agent aims to maximize.

  • $\theta$: The parameters of the policy $\pi_{\theta}$.

  • $\mathbb{E}_{\pi_{\theta}}[\cdot]$: The expectation taken over trajectories generated by following the policy $\pi_{\theta}$.

  • $\gamma$: The discount factor (a value between 0 and 1, exclusive), which weighs immediate rewards more heavily than future rewards.

  • $r_t$: The reward received at time step $t$.

    The paper focuses on blind locomotion, which is a Partially Observable Markov Decision Process (POMDP) because exteroceptive sensors are not used, meaning terrains are not fully observable. To handle this, privileged learning [14] is employed:

  1. A teacher policy is trained in simulation with access to privileged states (information only available in simulation, like accurate linear velocity or terrain properties).
  2. A student policy then learns to imitate the teacher's actions and infer the privileged states from a history of its proprioceptive observations via supervised learning.
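
As a small numerical illustration of the discounted-return objective defined above, here is a minimal Python sketch (not from the paper; the discount factor value is an assumption) that accumulates $\sum_t \gamma^t r_t$ over one finite rollout:

```python
# Minimal sketch: discounted return of a single sampled trajectory.
def discounted_return(rewards, gamma=0.99):  # gamma value is an assumption
    """Sum of gamma^t * r_t; the infinite-horizon objective is approximated
    by finite episodes (at most 1000 steps in the paper's training setup)."""
    ret, discount = 0.0, 1.0
    for r in rewards:
        ret += discount * r
        discount *= gamma
    return ret

print(discounted_return([1.0, 0.5, 0.2]))  # 1.0 + 0.99*0.5 + 0.99^2*0.2 = 1.691...
```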

State Space

  • Teacher Policy State ($x_t^{\mathrm{teacher}}$): Consists of three components:
    • Proprioceptive Observation ($o_t^p$): This is the only part accessible to the student. It includes:
      • Orientation of the gravity vector (in the robot's base frame).
      • Base angular velocity (in the robot's base frame).
      • Joint positions.
      • Joint velocities.
      • Previous action ($a_{t-1}$) selected by the current policy.
      • Desired base velocity command vector.
    • Privileged State ($s_t^p$): Information available only to the teacher in simulation. It includes:
      • Base linear velocity (computed from state estimation, which is typically noisy in reality).
      • Friction coefficients.
      • Restitution coefficients (for ground contact).
      • Contact forces.
      • External forces and their positions on the robot.
      • Collision states (of the trunk, thighs, and shanks).
    • Terrain Information ($i_t^e$): 187 measurements of terrain points sampled from a grid around the robot base, representing the vertical distance from each sampled point to the robot base.
  • Student Policy State: Only the proprioceptive observation ($o_t^p$) is accessible.

Action Space

  • The policy action $a_t$ is a 12-dimensional vector.
  • It is interpreted as a target joint position offset. This offset is added to a time-invariant nominal joint position to specify the target motor position for each of the robot's 12 joints (3 joints per leg × 4 legs).
  • Joint PD controllers (Proportional-Derivative controllers) then compute the torque commands from these target motor positions with fixed gains $K_p = 20$ and $K_d = 0.5$.
    • PD Controller: A feedback control loop mechanism widely used in industrial control systems. It calculates an "error" value as the difference between a measured process variable and a desired setpoint. The controller attempts to minimize the error by adjusting the process control inputs. In this case, it adjusts motor torques based on the difference between the target joint position and the current joint position (P-term) and the joint velocity (D-term).
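
A minimal Python sketch of this action interpretation (not the robot's actual control code; the joint ordering and nominal pose are placeholders) converts a policy action into joint torques via the stated PD law:

```python
import numpy as np

# Hedged sketch, assuming a 12-DoF quadruped and the paper's fixed gains.
KP, KD = 20.0, 0.5
Q_NOMINAL = np.zeros(12)  # placeholder nominal joint positions (rad)

def action_to_torque(action, q, qd):
    """action: 12-D target joint position offset output by the policy.
    q, qd: measured joint positions and velocities from the encoders."""
    q_target = Q_NOMINAL + action          # target motor position
    tau = KP * (q_target - q) - KD * qd    # PD torque command
    return tau
```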

4.2.2. Reward Terms Design

The paper aims for concise reward functions that induce robust and agile legged locomotion with a natural gait. The total reward $r_t$ is composed of three terms:

$$r_t = r_t^g + r_t^s + r_t^l$$

Here,

  • $r_t$: The total reward at time step $t$.
  • $r_t^g$: The task reward component, encouraging velocity tracking.
  • $r_t^s$: The style reward component, based on AMP, encouraging a natural gait.
  • $r_t^l$: The regularization reward component, ensuring smoothness and safety.

Task Reward ($r_t^g$)

This term focuses on tracking desired linear and angular velocities. Its specific equation and weight are provided in Table I.

Style Reward ($r_t^s$)

The style reward is crucial for promoting a natural and smooth trot gait. It evaluates the similarity between the demonstrator's behavior (from a reference dataset $\mathcal{D}$) and the agent's behavior ($\mathcal{A}$): the more similar they are, the higher the style reward. It is based on the Adversarial Motion Priors (AMP) framework.

The system uses a discriminator $D_{\varphi}$, a neural network with parameters $\varphi$, to predict whether a state transition $(s_t, s_{t+1})$ is a real sample from the reference dataset $\mathcal{D}$ or a fake sample produced by the agent $\mathcal{A}$. The AMP state $s_t^{AMP} \in \mathbb{R}^{31}$ includes:

  • Joint positions
  • Joint velocities
  • Base linear velocity
  • Base angular velocity
  • Base height with respect to the terrain.

To mitigate mode collapse (a common GAN issue where the generator produces a limited variety of samples), the reference dataset $\mathcal{D}$ contains only trot gait motion clips.

The training objective for the discriminator $D_{\varphi}$ is defined as a least squares GAN formulation with a gradient penalty:

$$\arg\min_{\varphi}\ \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[\left(D_{\varphi}(s_t, s_{t+1}) - 1\right)^2\right] + \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{A}}\left[\left(D_{\varphi}(s_t, s_{t+1}) + 1\right)^2\right] + \frac{\alpha^{gp}}{2}\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[\left\|\nabla_{\varphi} D_{\varphi}(s_t, s_{t+1})\right\|_2^2\right]$$

Here,

  • $\arg\min_{\varphi}$: The discriminator aims to find the parameters $\varphi$ that minimize this objective function.

  • $\mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}[\cdot]$: Expectation over state transitions sampled from the real motion dataset $\mathcal{D}$.

  • $\mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{A}}[\cdot]$: Expectation over state transitions generated by the agent's policy $\mathcal{A}$.

  • $D_{\varphi}(s_t, s_{t+1})$: The output of the discriminator, which tries to output 1 for real samples and -1 for fake samples (as per the least squares GAN formulation).

  • $(\cdot - 1)^2$: Encourages the discriminator to output 1 for real samples.

  • $(\cdot + 1)^2$: Encourages the discriminator to output -1 for fake samples.

  • $\alpha^{gp}$: A manually specified coefficient for the gradient penalty term ($\alpha^{gp} = 10$).

  • $\frac{\alpha^{gp}}{2}\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[\left\|\nabla_{\varphi} D_{\varphi}(s_t, s_{t+1})\right\|_2^2\right]$: The gradient penalty term, which encourages the discriminator's gradients on real data samples to be close to zero. This helps stabilize GAN training and prevent mode collapse.

    The style reward $r_t^s$ for the agent is then defined based on the discriminator's score:

$$r_t^s\left[(s_t, s_{t+1}) \sim \mathcal{A}\right] = \max\left[0,\ 1 - 0.25\left(d_t^{\mathrm{score}} - 1\right)^2\right]$$

Here,

  • $r_t^s[\dots]$: The style reward received by the agent for a state transition generated by its policy.

  • $d_t^{\mathrm{score}} = D_{\varphi}(s_t, s_{t+1})$: The score given by the discriminator for the agent's state transition.

  • The $\max[0, \dots]$ term ensures the reward is non-negative.

  • The term $1 - 0.25(d_t^{\mathrm{score}} - 1)^2$ scales the style reward to the range [0, 1]. If $d_t^{\mathrm{score}}$ is close to 1 (meaning the discriminator believes the agent's motion is very similar to the real data), the reward will be close to 1.
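
To make the adversarial objective and style reward concrete, here is a hedged PyTorch-style sketch (an illustration, not the authors' code; the network sizes follow Table III, while the optimizer and learning rate are assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical discriminator: input is a concatenated AMP state transition
# (s_t, s_{t+1}), each 31-D, so 62 inputs; hidden sizes follow Table III.
disc = nn.Sequential(nn.Linear(62, 1024), nn.ELU(),
                     nn.Linear(1024, 512), nn.ELU(),
                     nn.Linear(512, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)  # optimizer/lr are assumptions
ALPHA_GP = 10.0

def discriminator_update(real_trans, fake_trans):
    """real_trans: transitions sampled from the TO dataset D;
    fake_trans: transitions produced by the current policy A."""
    real_trans.requires_grad_(True)
    d_real, d_fake = disc(real_trans), disc(fake_trans)
    # Least squares GAN targets: +1 for real samples, -1 for fake samples.
    loss = ((d_real - 1) ** 2).mean() + ((d_fake + 1) ** 2).mean()
    # Gradient penalty (taken w.r.t. the real inputs, as in common AMP
    # implementations) keeps the discriminator's gradients small.
    grad = torch.autograd.grad(d_real.sum(), real_trans, create_graph=True)[0]
    loss = loss + 0.5 * ALPHA_GP * (grad.norm(2, dim=-1) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def style_reward(transition):
    """Map the discriminator score to the bounded style reward in [0, 1]."""
    with torch.no_grad():
        d_score = disc(transition)
    return torch.clamp(1.0 - 0.25 * (d_score - 1.0) ** 2, min=0.0)
```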

Regularization Reward ($r_t^l$)

This term is added to aid sim-to-real transfer over challenging terrains, imposing constraints on motion for smoothness and safety.

  • Smoothness: Induced by joint-level penalty (penalizing large joint torques and changes in joint positions) and stride duration bonus (encouraging consistent foot-ground contact timing).

  • Safety: Ensured by penalties for collisions (e.g., robot trunk with the ground, or self-collisions).

    The detailed reward terms and their weights are summarized in Table I.

The following are the results from Table I of the original paper:

Term Equation Weight
Task (linear velocity tracking) $\exp(-\|\mathbf{v}^{des}_{xy} - \mathbf{v}_{t,xy}\|^2 / 0.15)$ 1.0
Task (angular velocity tracking) $\exp(-(\omega^{des}_{z} - \omega_{t,z})^2 / 0.15)$ 0.5
Smoothness (joint torques) $-\|\boldsymbol{\tau}_t\|^2$ 1 × 10⁻⁴
Smoothness (joint accelerations) $-\|\ddot{\mathbf{q}}_t\|^2$ 2.5 × 10⁻⁷
Smoothness (joint position change) $-\|\mathbf{q}_{t-1} - \mathbf{q}_t\|^2$ 0.1
Smoothness (stride duration) $\sum_i \min(t_{air,i} - 0.5,\ 0)$ 1.0
Safety $-n_{collision}$ 0.1
Style $\max[0,\ 1 - 0.25(d^{\mathrm{score}}_t - 1)^2]$ 0.5
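
For illustration, here is a hedged Python sketch of how the per-step reward could be assembled from the terms above (helper names are hypothetical; the equations and weights follow the reconstruction of Table I, so treat them as approximate):

```python
import numpy as np

def task_reward(v_des_xy, v_xy, w_des_z, w_z):
    """Velocity-tracking terms with weights 1.0 and 0.5."""
    lin = np.exp(-np.sum((v_des_xy - v_xy) ** 2) / 0.15)
    ang = np.exp(-((w_des_z - w_z) ** 2) / 0.15)
    return 1.0 * lin + 0.5 * ang

def regularization_reward(tau, qdd, q_prev, q_curr, t_air, n_collisions):
    """Smoothness penalties, stride-duration bonus, and collision penalty."""
    return (-1e-4 * np.sum(tau ** 2)
            - 2.5e-7 * np.sum(qdd ** 2)
            - 0.1 * np.sum((q_prev - q_curr) ** 2)
            + 1.0 * np.sum(np.minimum(t_air - 0.5, 0.0))
            - 0.1 * n_collisions)

def total_reward(r_g, r_s, r_l):
    """r_s is the raw style reward in [0, 1]; Table I weights it by 0.5."""
    return r_g + 0.5 * r_s + r_l
```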

4.2.3. Motion Dataset Generation

The motion dataset $\mathcal{D}$ for AMP is generated using trajectory optimization (TO) techniques.

  • Method: The paper adopts a single TO formulation from previous work [11] to generate quadrupedal locomotion with a trotting gait on flat terrain.
  • Robot Model: The robot is first represented using a simplified centroidal dynamics model to reduce computational complexity.
  • Optimization Problem: This model is then transcribed into a nonlinear programming problem. This problem explicitly enforces friction cone constraints (ensuring feet don't slip) and kinematic constraints (ensuring joints operate within their physical limits).
  • Solver: The TO problem is solved using TOWR [29], an open-source trajectory optimizer that does not require a hand-tuned cost function, simplifying the optimization process.
  • Dataset Content: The dataset $\mathcal{D}$ includes various locomotion trajectories: forward, backward, lateral left, lateral right, left steering, right steering, and combined movements. The total duration of these motions is 30 seconds.
  • Benefit: Generating motion datasets using TO ensures that they fully match the state space of the simulated agent and the demonstrator, eliminating the need for complex motion retargeting techniques like those in [15].

4.2.4. Training

Simulation Environment and General Settings

  • Simulator: IsaacGym [9], a high-performance GPU-accelerated physics simulator, is used for training.
  • Parallelism: 4096 parallel agents are trained concurrently on different terrain types.
  • Training Time:
    • Teacher policy: 400 million simulated time steps.
    • Student policy: 200 million simulated time steps.
    • Total wall-clock time: seven hours for both stages.
  • Episode Duration: Each RL episode lasts a maximum of 1000 steps (equivalent to 20 seconds) and terminates early if specific termination criteria are met.
  • Control Frequency: 50 Hz in simulation.
  • Hardware: All training is performed on a single NVIDIA RTX 3090Ti GPU.

Termination Criteria

An episode terminates, and a new one starts, if any of the following conditions are met:

  • Trunk collisions with the ground.
  • Large body inclinations (robot tipping over too much).
  • Robot being trapped for a long period of time (not making progress).

Dynamics Randomization

To enhance the robustness of the policy and facilitate sim-to-real transfer, various dynamic parameters are randomized in each episode during training. Some of these randomized parameters are included in the privileged state $s_t^p$ to aid the teacher policy training. Observation noise (as described in [9]) is also added. The randomized parameters include:

  • Mass of the trunk and legs.

  • Mass and position of payload applied to the robot's body.

  • Ground friction and restitution coefficients.

  • Motor strength.

  • Joint-level PD gains.

  • Initial joint positions.

    The randomization ranges for these parameters are detailed in Table II.

The following are the results from Table II of the original paper:

Parameters Range[Min, Max] Unit
Link Mass [0.8, 1.2] × nominal value kg
Payload Mass [0, 3] kg
Payload Position [-0.1, 0.1] relative to base origin m
Ground Friction [0.05, 2.75] -
Ground Restitution [0.0, 1.0] -
Motor Strength [0.8, 1.2]× motor torque Nm
Joint Kp [0.8, 1.2]×20 -
Joint Kd [0.8, 1.2]×0.5 -
Initial Joint Positions [0.5, 1.5]×nominal value rad
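
A minimal Python sketch of the per-episode sampling over the Table II ranges (illustrative parameter names, not the authors' configuration code; the payload position is assumed to be a 3-D offset):

```python
import numpy as np

# Hedged sketch: resample randomized dynamics once per episode (Table II ranges).
def sample_episode_dynamics(rng, nominal_link_mass, nominal_q):
    return {
        "link_mass": rng.uniform(0.8, 1.2) * nominal_link_mass,   # kg
        "payload_mass": rng.uniform(0.0, 3.0),                    # kg
        "payload_pos": rng.uniform(-0.1, 0.1, size=3),            # m, relative to base origin
        "friction": rng.uniform(0.05, 2.75),
        "restitution": rng.uniform(0.0, 1.0),
        "motor_strength": rng.uniform(0.8, 1.2),                  # scales motor torque
        "kp": rng.uniform(0.8, 1.2) * 20.0,
        "kd": rng.uniform(0.8, 1.2) * 0.5,
        "init_q": rng.uniform(0.5, 1.5) * nominal_q,              # rad
    }

# Usage: params = sample_episode_dynamics(np.random.default_rng(0), 1.0, np.zeros(12))
```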

Training Curriculum

Given the complexity of blind locomotion over challenging terrains, an automated terrain curriculum is adopted and improved from [9].

  • Terrain Types: Five types of procedurally generated terrains are created: rough flats, slopes, waves, stairs, and discrete steps.

  • Curriculum Structure: A height-field map of 100 terrains arranged in a 20x10 grid is used. Each row contains the same type of terrain with increasing difficulty. Each terrain section is 8m long and wide.

    • Rough flats: Noise increases from ±1 cm to ±8 cm.
    • Slopes: Inclination increases from 0 deg to 30 deg.
    • Waves: Amplitude increases from 20 cm to 50 cm.
    • Stairs: Fixed width of 30 cm; step height increases from 5 cm to 23 cm.
    • Discrete obstacles: Two height levels, increasing from ±5 cm to ±15 cm.
  • Progression Logic (a code sketch of this logic appears after this list):

    • Initially, robots are assigned to all terrain types at the lowest difficulty.
    • A robot moves to a more difficult terrain once it adapts to its current difficulty, defined as achieving more than 85% of the average linear velocity tracking reward (meaning it successfully follows velocity commands).
    • If a robot fails to travel at least half the distance required by its commanded linear velocity by the end of an episode, it is reset to an easier terrain.
    • To prevent skill forgetting, robots that solve the hardest terrains are looped back to a randomly selected difficulty level of the current terrain type.
  • Command-Conditioned Policy: Robots learn to track desired velocity commands $\mathbf{v}_t^{\mathrm{des}} = (v_x, v_y, \omega_z)$, representing longitudinal velocity ($v_x$), lateral velocity ($v_y$), and yaw velocity ($\omega_z$) in the base frame.

    • During the terrain curriculum phase, longitudinal and lateral velocity commands are sampled from a small range $[-1\ \mathrm{m/s}, 1\ \mathrm{m/s}]$. Yaw velocity is computed from heading errors to guide the robot out of the terrains.
  • Concurrent High-Speed Training: After an agent on rough flat terrains successfully steps out of the roughest flats, its $(v_x, \omega_z)$ velocity command sampling schedule switches to a grid adaptive curriculum strategy [7]. This enables learning high-speed sprinting and spinning. This high-speed training terminates the terrain curriculum only for the agents engaged in it, not for the others.
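
A minimal Python sketch of the promotion/demotion rule referenced in the progression-logic bullet above (field names are hypothetical; the 85% and half-distance thresholds are from the paper):

```python
import random

# Hedged sketch of the automated terrain curriculum update applied at the end
# of each episode.
def update_terrain_level(agent, num_levels):
    if agent.lin_vel_tracking_reward > 0.85 * agent.max_lin_vel_tracking_reward:
        agent.level += 1                       # promote to a harder terrain
    elif agent.distance_traveled < 0.5 * agent.commanded_distance:
        agent.level = max(agent.level - 1, 0)  # demote to an easier terrain
    if agent.level >= num_levels:
        # Solved the hardest terrain: loop back to a random difficulty of the
        # same terrain type to prevent skill forgetting.
        agent.level = random.randint(0, num_levels - 1)
```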

    The following figure (Figure 2 from the original paper) illustrates the teacher-student training framework with adversarial motion priors.

Fig. 2. The teacher-student training framework with adversarial motion priors. The teacher policy is trained with RL using the proprioceptive observation $o_t^p$, the privileged state $s_t^p$, and the terrain information $i_t^e$; its total reward $r_t = r_t^g + r_t^s + r_t^l$ combines a task reward, an AMP-based style reward from the GAN discriminator (trained against trajectory-optimization reference motions), and a regularization reward. The student policy imitates the teacher's actions $a_t^{\mathrm{teacher}}$ and reconstructs the latent representation $l_t^{\mathrm{teacher}}$ from the proprioceptive observation $o_t^p$ alone (via a reconstruction loss).

4.2.5. Teacher Policy Training and Architecture

  • Training Algorithm: The teacher policy is trained using Proximal Policy Optimization (PPO) [31].

    • PPO: An on-policy RL algorithm that aims to strike a balance between ease of implementation, sample efficiency, and good performance. It iteratively optimizes a clipped surrogate objective function to update the policy while keeping the new policy close to the old one.
  • Synchronization: The training of the teacher policy and the discriminator ($D_{\varphi}$) are synchronized.

    • The teacher performs a rollout in the environment, generating a state transition $(s_t^{AMP}, s_{t+1}^{AMP})$.
    • This transition is fed to the discriminator $D_{\varphi}$ to get a score $d_t^{\mathrm{score}}$.
    • The score $d_t^{\mathrm{score}}$ is used to compute the trot gait style reward $r_t^s$ according to (4), which is then added to the other rewards ($r_t^g + r_t^l$) obtained by the teacher.
    • After collecting rollout data, the teacher policy parameters $\theta$ and the discriminator parameters $\varphi$ are optimized in every training step, using PPO for the policy and the objective in (3) for the discriminator.
  • Teacher Policy Architecture ($\pi_{\theta}^{\mathrm{teacher}}$): (As shown in Figure 2 and detailed in Table III)

    • Consists of Multi-Layer Perceptron (MLP) components.
    • Terrain Encoder ($E_{\theta_e}$): Compresses the terrain information $i_t^e \in \mathbb{R}^{187}$ into a low-dimensional latent representation $l_t^e \in \mathbb{R}^{16}$.
    • Privileged Encoder ($E_{\theta_p}$): Compresses the privileged state $s_t^p \in \mathbb{R}^{30}$ into a low-dimensional latent representation $l_t^p \in \mathbb{R}^{8}$.
    • Full Latent Representation: $l_t^{\mathrm{teacher}} \in \mathbb{R}^{24}$ is formed by concatenating $l_t^e$ and $l_t^p$. This compression preserves the essential information that the student later learns to reconstruct.
    • Low-Level Network: Takes the concatenated $l_t^{\mathrm{teacher}}$ and the proprioceptive observation $o_t^p \in \mathbb{R}^{45}$ as input. It has a tanh output layer that outputs the mean $\mu_t \in \mathbb{R}^{12}$ of a Gaussian distribution $a_t^{\mathrm{teacher}} \sim \mathcal{N}(\mu_t, \sigma)$. The variance $\sigma \in \mathbb{R}^{12}$ is determined by PPO.
    • Critic Network: An MLP with three hidden layers, providing target values $V_t$ for the Generalized Advantage Estimator (GAE) used in PPO.
  • Discriminator ($D_{\varphi}$): A single MLP with two hidden layers and a linear output layer.

    The following are the results from Table III of the original paper:

    Module Inputs Hidden Layers Outputs
    Low-Level (MLP) $l_t$, $o_t^p$ [256, 128, 64] $a_t$
    Critic (MLP) $x_t$ [512, 256, 128] $V_t$
    Memory (LSTM) $o_t^p$, $h_{t-1}$, $c_{t-1}$ [256, 256, 256] $h_t$
    $E_{\theta_p}$ (MLP) $s_t^p$ [64, 32] $l_t^p$
    $E_{\theta_e}$ (MLP) $i_t^e$ [256, 128] $l_t^e$
    $E_{\theta_m}$ (MLP) $h_t$ [256, 128] $l_t^{\mathrm{student}}$
    $D$ (MLP) $s_t^{AMP}$, $s_{t+1}^{AMP}$ [1024, 512] $d_t^{\mathrm{score}}$

All networks use ELU activations for hidden layers.
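
To make the teacher architecture concrete, below is a hedged PyTorch-style sketch of its forward pass (dimensions follow Table III and the surrounding text; this is an illustration, not the authors' implementation, and the log-std parameterization is an assumption):

```python
import torch
import torch.nn as nn

def mlp(sizes, out_act=None):
    """Small helper: MLP with ELU hidden activations, as stated above."""
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ELU()]
    layers += [nn.Linear(sizes[-2], sizes[-1])]
    if out_act is not None:
        layers += [out_act]
    return nn.Sequential(*layers)

class TeacherPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.terrain_enc = mlp([187, 256, 128, 16])    # i_t^e -> l_t^e
        self.priv_enc = mlp([30, 64, 32, 8])           # s_t^p -> l_t^p
        self.low_level = mlp([24 + 45, 256, 128, 64, 12],
                             out_act=nn.Tanh())        # (l_t, o_t^p) -> action mean
        self.log_std = nn.Parameter(torch.zeros(12))   # variance handled by PPO

    def forward(self, obs_p, priv, terrain):
        l_teacher = torch.cat([self.terrain_enc(terrain),
                               self.priv_enc(priv)], dim=-1)  # 24-D latent
        mu = self.low_level(torch.cat([l_teacher, obs_p], dim=-1))
        return mu, l_teacher
```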

4.2.6. Student Policy Training and Architecture

  • Goal: The student policy is trained to replicate the teacher's actions without access to the privileged state $s_t^p$ and terrain information $i_t^e$. This means the student operates in a POMDP setting and must infer unobservable states from its history of proprioceptive observations.

  • Memory Encoder: Since the student needs to infer unobservable states from historical observations, a memory encoder is used to capture sequential correlations.

    • Architecture: A Long Short Term Memory (LSTM) network [33] is chosen as the RNN architecture. LSTMs are suitable for processing sequential data and retaining past information within hidden states without requiring large memory buffers for storing raw historical observations.
      • The proprioceptive observation $o_t^p$, along with the LSTM's past hidden state $h_{t-1}$ and cell state $c_{t-1}$, is fed into the LSTM module.
      • The LSTM module outputs the current hidden state $h_t \in \mathbb{R}^{256}$.
      • This $h_t$ is then passed to an MLP encoder $E_{\theta_m}$ to produce the student's latent representation $l_t^{\mathrm{student}}$.
    • Memory Length: The student policy is trained with sequences of 50 proprioceptive observations ($o_t^p$), corresponding to 1 second of memory, to learn robust blind locomotion.
  • Training Losses (Supervised Fashion): The student is trained by minimizing two losses (see the sketch after this list):

    • Imitation Loss: Enables the student to mimic the teacher's actions $a_t^{\mathrm{teacher}}$. This is typically a mean squared error (MSE) loss between the student's actions and the teacher's actions.
    • Reconstruction Loss: Encourages the student's memory encoder to reconstruct the teacher's full latent representation $l_t^{\mathrm{teacher}}$. This helps the student learn to infer the necessary context (terrain and privileged information) that the teacher had access to.
  • Robustness Enhancement: The dataset aggregation strategy (DAgger) [32] is employed to generate samples by rolling out the student policy. This iterative process helps increase the robustness of the student policy by exposing it to states it might encounter during its own execution, mitigating distribution shift issues.

  • Curriculum: The same training curriculum (terrain progression and command scheduling) as the teacher's is applied to the student.

  • No Discriminator: No discriminator is trained during student training.

  • Low-Level Network: The student's low-level MLP network has the same structure as the teacher's low-level net. Critically, its learned weights are reused from the teacher's low-level net for initialization, and it continues to be trained (not frozen) during student training to speed up convergence and improve adaptation, unlike some prior methods.
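
A hedged PyTorch-style sketch of the student update described above (module sizes follow Table III; the loss weighting and other details are assumptions, and `low_level` stands in for the MLP copied from the teacher):

```python
import torch
import torch.nn as nn

# Hedged sketch of one student distillation step: an LSTM memory encoder
# followed by the imitation (action MSE) and reconstruction (latent MSE) losses.
class StudentPolicy(nn.Module):
    def __init__(self, low_level):
        super().__init__()
        # 3-layer LSTM with 256 hidden units, per Table III.
        self.lstm = nn.LSTM(input_size=45, hidden_size=256, num_layers=3, batch_first=True)
        # E_theta_m: hidden state -> 24-D latent matching the teacher's l_t^teacher.
        self.latent_enc = nn.Sequential(nn.Linear(256, 256), nn.ELU(),
                                        nn.Linear(256, 128), nn.ELU(),
                                        nn.Linear(128, 24))
        self.low_level = low_level  # initialized from the teacher and kept trainable

    def forward(self, obs_seq):
        h_seq, _ = self.lstm(obs_seq)              # obs_seq: (batch, 50, 45)
        l_student = self.latent_enc(h_seq[:, -1])  # latent from the last hidden state
        action = self.low_level(torch.cat([l_student, obs_seq[:, -1]], dim=-1))
        return action, l_student

def distillation_loss(student, obs_seq, a_teacher, l_teacher):
    a_student, l_student = student(obs_seq)
    imitation = nn.functional.mse_loss(a_student, a_teacher)
    reconstruction = nn.functional.mse_loss(l_student, l_teacher)
    return imitation + reconstruction              # equal weighting is an assumption
```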

5. Experimental Setup

5.1. Datasets

The paper primarily uses one type of dataset for training the Adversarial Motion Priors (AMP):

  • AMP Dataset: This dataset consists of quadrupedal locomotion trajectories generated using Trajectory Optimization (TO) techniques [11] specifically for flat terrain with a trotting gait.
    • Source: Generated synthetically using TO with TOWR [29].

    • Scale: Total duration of 30 seconds.

    • Characteristics: Contains various trot gait motion clips including forward, backward, lateral left, lateral right, left steering, right steering, and combined locomotion. Each state transition $(s_t, s_{t+1})$ includes joint positions, joint velocities, base linear velocity, base angular velocity, and base height with respect to the terrain.

    • Domain: Critically, it only contains motions on flat terrains.

    • Why chosen: This dataset is chosen because TO generates stable and dynamically feasible motions that perfectly match the simulated agent's state space, avoiding issues with motion capture data or motion retargeting. The trot gait is a common and efficient gait for quadrupedal robots. Its limited domain (flat terrains) allows the paper to test zero-shot generalization to challenging terrains.

      For the actual Reinforcement Learning training, the IsaacGym simulator is used to create diverse and challenging environments. These procedurally generated terrains (rough flats, slopes, waves, stairs, discrete steps) are not pre-recorded datasets but are generated dynamically during training based on a curriculum.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the robustness and agility of the proposed locomotion controller.

  • Success Rate (for Obstacle Traversal):

    1. Conceptual Definition: This metric quantifies the reliability of the robot in successfully navigating specific obstacles, particularly steps. It measures the proportion of attempts where the robot completes a defined traversal task without falling, getting stuck, or incurring significant undesirable events. It focuses on the robot's ability to maintain stability and complete the task under challenging conditions.
    2. Mathematical Formula: $\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$
    3. Symbol Explanation:
      • Number of Successful Trials: The count of trials where the robot successfully traversed the obstacle (defined as traversing the step with both front and hind legs).
      • Total Number of Trials: The total number of attempts made for a given obstacle type and height (10 trials per step height in this paper).
  • Average Running Speed (m/s):

    1. Conceptual Definition: This metric quantifies the robot's agility and efficiency over a given distance. It measures how quickly the robot can cover a certain distance on various terrains, indicating its ability to sustain desired linear velocities.
    2. Mathematical Formula: $ \text{Average Running Speed} = \frac{\text{Total Distance}}{\text{Time Elapsed}} $
    3. Symbol Explanation:
      • Total Distance: The fixed distance the robot is commanded to sprint (e.g., 10 m in this paper).
      • Time Elapsed: The total time taken by the robot to cover the specified total distance.
  • Maximum Yaw Velocity (rad/s):

    1. Conceptual Definition: This metric specifically evaluates the robot's agility in rotational movements (spinning). It quantifies how fast the robot can rotate around its vertical axis, demonstrating its dynamic control over angular motion.
    2. Mathematical Formula: No explicit formula is given, but it represents the peak angular velocity achieved around the Z-axis (yaw).
    3. Symbol Explanation:
      • rad/s: Radians per second, the unit for angular velocity.
  • Froude Number (Fr):

    1. Conceptual Definition: The Froude number is a dimensionless quantity used to compare the inertial forces with the gravitational forces in fluid dynamics, and analogously, in animal locomotion. For legged locomotion, it provides a way to characterize the gait and dynamic similarity of different animals or robots, independent of their absolute size or speed. A higher Froude number generally implies more dynamic, "running" or "bounding" gaits, while lower values correspond to more static, "walking" gaits.
    2. Mathematical Formula: For legged locomotion, the Froude number is typically defined as: $ \mathrm{Fr} = \frac{v^2}{gL} $
    3. Symbol Explanation:
      • $v$: The average forward speed of the robot (m/s).
      • $g$: The acceleration due to gravity (approximately $9.81\ \mathrm{m/s^2}$).
      • $L$: The characteristic length of the robot, often taken as the leg length or hip height (m). For quadrupeds, this is usually the hip height [35].
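
For a sense of scale, here is a small worked example of the Froude number (the hip height value is a placeholder assumption, not a figure from the paper):

```python
# Hypothetical worked example of the Froude number Fr = v^2 / (g * L).
g = 9.81          # m/s^2
v = 3.5           # m/s, the reported outdoor sprinting speed
L = 0.30          # m, assumed hip height (placeholder, not from the paper)
Fr = v ** 2 / (g * L)
print(round(Fr, 2))  # ~4.16 with these assumed values
```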

5.3. Baselines

The paper compares its method against several baselines, all of which use only proprioception for a fair comparison in the single-step tests. All RL methods were trained using the same curriculum strategy, reward functions, random seed, and low-level network architecture as the proposed method.

  • 1. RMA [3] (Rapid Motor Adaptation):

    • Description: This baseline uses a teacher-student training framework similar to the proposed method. The student policy incorporates a 1-D CNN adaptation module to implicitly infer privileged information and adapt its control. A key difference from the proposed method is that RMA freezes the student's low-level network (copied from the teacher) during student training, relying only on the adaptation module to infer context.
    • Representativeness: RMA is a state-of-the-art method for robust blind locomotion for legged robots, demonstrating strong sim-to-real transfer.
  • 2. Concurrent [8]:

    • Description: This method involves a policy concurrently trained with a state estimation network that explicitly estimates the body state. For comparison, no terrain information was provided during its training (though the original paper by Ji et al. might use it or estimate it).
    • Representativeness: Represents methods that use concurrent state estimation to handle partially observable environments for dynamic and robust locomotion, often focusing on agility.
  • 3. Domain Randomization (DR):

    • Description: This baseline involves directly training the policy using Reinforcement Learning without any privileged state or terrain information (i.e., no teacher-student distillation or explicit state estimation). It relies solely on extensive dynamics randomization and observation noise during training to achieve sim-to-real transfer and robustness.
    • Representativeness: Domain Randomization is a common and foundational technique for sim-to-real transfer in RL robotics.
  • 4. Built-in MPC:

    • Description: This refers to the Model Predictive Control (MPC) controller that is pre-installed or standard on the Unitree Go1 Edu robot.
    • Representativeness: Provides a strong baseline representing traditional optimization-based control methods, which often excel in known, structured environments but can struggle with unforeseen disturbances or complex, varied terrains without online adaptation.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results validate the effectiveness of the proposed method in achieving both robust and agile locomotion with natural gaits using only proprioceptive sensors and demonstrating zero-shot generalization.

6.1.1. Ablation Study for the Design of the Reward Terms

The ablation study evaluates the impact of the different reward components on the robot's gait motion. Three policies were trained with the same time and random seed:

  1. Total Reward: $r_t^g + r_t^l + r_t^s$ (proposed method).

  2. Task + Style Reward: $r_t^g + r_t^s$.

  3. Task + Regularization Reward: $r_t^g + r_t^l$.

    The following figure (Figure 3 from the original paper) shows the positions and velocities of the two joints (thigh and shank) of the front-right leg during forward walking (around 0.6 m/s) in simulation.

Fig. 3. Measured positions and velocities of the two joints (thigh and shank) of the front-right leg during forward walking (around 0.6 m/s) in simulation. Different line styles denote the different reward combinations, recorded over 2.5 seconds.

As can be seen from Figure 3:

  • The policy trained with the total reward terms ($r_t^g + r_t^l + r_t^s$, purple lines) produces joint positions and velocities closest to those generated from the Trajectory Optimization (TO) dataset (red lines). This indicates that the combination of task tracking, style imitation via AMP, and regularization successfully produces a gait that closely matches the desired natural trot. The resulting motion is described as natural and smooth.
  • The policies trained with only two reward terms ($r_t^g + r_t^s$, orange lines, and $r_t^g + r_t^l$, blue lines) generate jerky positions and velocities.
    • When the regularization reward ($r_t^l$) is omitted (policy $r_t^g + r_t^s$), the gait lacks smoothness, suggesting that safety and joint-level penalties are crucial for stable and fluid movement.

    • When the style reward ($r_t^s$) is omitted (policy $r_t^g + r_t^l$), the gait deviates significantly from the TO reference, confirming that AMP is essential for learning natural motion styles and preventing the "unnatural and jerky" behaviors often seen in RL locomotion.

      This ablation study strongly validates that all three components of the reward function are necessary for achieving a natural, smooth, and desired gait, with the AMP-based style reward being key for naturalness and the regularization reward for smoothness and safety.

6.1.2. Evaluation of the Robust Locomotion

Traversability for Challenging Obstacles

The robot's robustness was tested on significant obstacles. For example, Figure 4 shows the robot successfully traversing a 25 cm high step. This is a highly challenging obstacle for the Go1 robot, which stands 28 cm tall, implying the obstacle is almost as tall as the robot's standing height. The AMP dataset only contained motions with 8 cm foot clearance, meaning this 25 cm step represents a completely unseen challenge in the motion prior.

The following figure (Figure 4 from the original paper) illustrates the robot's locomotion over challenging steps.

Fig. 4. Locomotion over challenging steps. Our robot learns to detect huge steps of 25 cm by head collision and lifts base height and swing height to overcome obstructions.

Figure 4 illustrates the emergent foot-trapping reflex:

  1. The policy first identifies the head collision (i.e., the front of the robot body hitting the obstacle) purely from proprioceptive observations. This highlights the student policy's ability to implicitly infer contact and the presence of an obstacle from its memory of the proprioceptive stream.

  2. It then raises the body height to ensure the trunk clears the obstacle.

  3. Next, it lifts the foot as high as possible to step over.

    This demonstrates the policy's impressive adaptation and recovery behaviors without any exteroceptive sensors: it learns to negotiate obstacles robustly by interpreting changes in its own body configuration.
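
To illustrate the kind of history-based inference described above, here is a minimal PyTorch sketch of a student policy that estimates an environment/contact latent from a stream of proprioceptive observations. The class name, layer sizes, and dimensions (e.g., `obs_dim=45`) are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal PyTorch sketch (assumed architecture, not the paper's network) of a student
# policy inferring a latent context from a proprioceptive observation history.
import torch
import torch.nn as nn

class ProprioceptiveStudent(nn.Module):
    def __init__(self, obs_dim=45, latent_dim=16, action_dim=12, hidden=256):
        super().__init__()
        # Memory network: encodes the proprioceptive history into a latent that can
        # implicitly capture contacts/terrain (e.g., a head collision with a step).
        self.memory = nn.GRU(obs_dim, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent_dim)
        # Low-level network: maps the current observation plus the inferred latent
        # to joint position targets.
        self.low_level = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs_history):
        # obs_history: (batch, time, obs_dim) proprioceptive stream (IMU + joint encoders)
        _, h = self.memory(obs_history)
        latent = self.to_latent(h[-1])          # inferred environment/contact context
        action = self.low_level(torch.cat([obs_history[:, -1], latent], dim=-1))
        return action
```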

Comparison with Baselines on Step Traversal

The method was compared with several proprioception-only baselines in single-step tests (10 trials per step height, given a forward command of 0.4 m/s for 10 s). A test was successful if the robot traversed the step with both front and hind legs.

The following figure (Figure 5 from the original paper) shows the success rates of different methods for different step heights.

Fig. 5. Success rates of different methods for different step heights (left: stepping up; right: stepping down). Success rate was evaluated over 10 trials for each step height. Colors: red, ours; green, RMA; orange, Concurrent; blue, Built-in MPC; purple, Domain Randomization.

From Figure 5, the following observations can be made:

  • Superior Performance: The proposed method (red line) significantly outperformed all other baselines in both stepping up and stepping down. It maintained a near 100% success rate up to 25 cm step height, demonstrating exceptional robustness.
  • Effectiveness of Teacher-Student Framework: Teacher-student training frameworks (ours and RMA, green line) are generally more effective than methods that never receive terrain information during training (Concurrent, orange; Domain Randomization, purple; Built-in MPC, blue); these latter methods often failed once step heights exceeded 13 cm. This confirms that inferring terrain properties, even implicitly from proprioception, is crucial for negotiating large obstacles.
  • Advantage Over RMA: While RMA performed well up to 15 cm, its performance decreased rapidly beyond that. The paper attributes this difference to the student policy's low-level network handling:
    • Our method: Reuses the teacher's low-level network weights but continues training them for the student. This allows the student to refine its control layers and better adapt to increasingly difficult terrains, even when reconstruction errors of latent representations might be larger.
    • RMA: Freezes the student's low-level network after copying from the teacher. With fixed weights, the output becomes unstable as terrain difficulty increases and the adaptation module's inferred information becomes less accurate, leading to a rapid performance drop. This highlights the importance of allowing the entire student network to adapt.
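
The contrast between the two distillation strategies can be made concrete with a short PyTorch sketch. The module and function names (`student.low_level`, `teacher.low_level`, `init_student_from_teacher`) are hypothetical; this is a sketch of the design choice, not the authors' code.

```python
# Sketch (hypothetical names) of the two distillation choices discussed above.
import copy

def init_student_from_teacher(student, teacher, freeze_low_level):
    # Both approaches start by copying the teacher's low-level weights into the student.
    student.low_level.load_state_dict(copy.deepcopy(teacher.low_level.state_dict()))
    if freeze_low_level:
        # RMA-style: keep the copied low-level network fixed; only the adaptation
        # module is trained, so errors in the inferred latent cannot be compensated.
        for p in student.low_level.parameters():
            p.requires_grad = False
    # Proposed approach: leave requires_grad=True so the low-level layers keep training
    # alongside the memory/adaptation module and can adapt to harder terrain.
    return student
```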

Robustness in Diverse Environments

The robot's robustness was further evaluated in diverse real-world environments (shown in Figure 1 and the accompanying video), including large curbs, dense vegetation, moderately rocky terrains, loose rubble, and grassland. These terrains present significant challenges due to variations in material properties, deformability, and friction. Despite never having seen these specific terrains in simulation (which only used procedurally generated terrains), the policy demonstrated strong zero-shot generalization. The robot showed impressive recovery and adaptation behaviors to continue running over these uneven surfaces, indicating that the history of proprioceptive observations allows it to infer and react to changes in interaction dynamics.

The following figure (Figure 1 from the original paper) shows the robot achieving robust locomotion over challenging terrains and agile locomotion over natural terrains.

Fig. 1. Our robot can achieve robust locomotion over challenging terrains and agile locomotion over natural terrains. The robot can successfully traverse curbs, stairs, rocky terrain, and vegetation while running and spinning at high speed over natural terrains. Whether on even or complex terrains, at high or low speeds, the emergent behaviors learned by our robot exhibit a natural gait and smooth motion.

6.1.3. Evaluation of the Agile Locomotion

Outdoor Sprinting

The agility of the controller was tested through outdoor sprinting on three different real-world terrains over a 10 m distance, with a maximum longitudinal velocity command of 4 m/s.

The following are the results from Table IV of the original paper:

| Terrains | Time Elapsed (s) | Speed (m/s) |
|---|---|---|
| Plastic Track (Real) | 2.89 ± 0.04 | 3.46 ± 0.04 |
| Rocky Road (Real) | 3.19 ± 0.06 | 3.14 ± 0.06 |
| Grassland (Real) | 3.65 ± 0.10 | 2.74 ± 0.07 |

From Table IV:

  • High Speeds Achieved: The robot achieved a maximum average speed of 3.46 m/s on the plastic track, an even terrain, demonstrating high agility.
  • Generalization to Natural Terrains: On rocky road and grassland (terrains with more variation in friction, restitution, and deformation, and not seen during simulation training), the robot still maintained high average speeds of 3.14 m/s and 2.74 m/s, respectively. This again confirms the strong zero-shot generalization of the policy to unseen natural terrains.
  • Sim-to-Real Gaps: The paper notes that tracking errors remain (e.g., an actual speed of 3.46 m/s against a 4 m/s command), attributed to residual sim-to-real gaps, likely from the robot's motors or unrealistic dynamics in the simulator.
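
The average speeds in Table IV are consistent with simply dividing the 10 m sprint distance by the elapsed time; a quick sanity-check sketch:

```python
# Check that the Table IV speeds match average speed = distance / elapsed time.
distance_m = 10.0
elapsed_s = {"Plastic Track": 2.89, "Rocky Road": 3.19, "Grassland": 3.65}

for terrain, t in elapsed_s.items():
    print(f"{terrain}: {distance_m / t:.2f} m/s")
# -> approximately 3.46, 3.13, and 2.74 m/s, consistent with the reported averages
```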

High-Speed Spinning

The robot also demonstrated impressive high-speed spinning on flat laboratory ground, accelerating to a maximum yaw velocity of 5.8 rad/s and stopping safely.

Agility in Uneven Environments

The robot was able to run robustly over vegetated terrain and small stairs at high speed. These environments, with their tangles of branches and leaves and small steps, can significantly disrupt high-speed movement. The robot's ability to show impressive recovery and adaptation behaviors and continue running under such conditions further highlights the combined robustness and agility of the single policy.

Overall Comparison with Previous RL Methods

Table V provides an overall comparison of the proposed method's locomotion abilities with other RL controllers on different quadrupeds.

The following are the results from Table V of the original paper:

| Robot | Blind | Complex Terrains | Froude [35] | Speed (m/s) |
|---|---|---|---|---|
| A1 [36] | Yes | Unknown | 1.1 | 1.7 |
| MC [7] | Yes | Unknown | 5.1 | 3.9 |
| ANYmal [4] | No | Yes | 0.3 | 1.2 |
| Ours (Go1) | Yes | Yes | 4.2 | 3.5 |

From Table V:

  • Blind and Complex Terrains: The proposed method (Ours Go1) is one of the few that achieves blind locomotion (Yes) and can traverse complex terrains (Yes), setting it apart from Mini Cheetah (MC) [7] and A1 [36] (where complex terrain performance is Unknown) and ANYmal [4] (which is not blind).
  • Froude Number and Speed:
    • The Froude number for our robot is 4.2, indicating a highly dynamic, agile gait, comparable to MC [7] (5.1). This is significantly higher than A1 (1.1) and ANYmal (0.3), which represent more walking-like gaits.

    • Our robot achieves a speed of 3.5 m/s, which is highly competitive, especially considering its blindness and complex terrain capabilities. MC [7] achieves a higher speed of 3.9 m/s but without reported performance on complex terrains.

      These results conclusively demonstrate that the proposed method yields a highly competitive robot capable of both robust locomotion over challenging terrains and agile locomotion over natural terrains, all while operating blindly with a natural gait.
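
For context, the Froude number in legged locomotion is commonly defined as $Fr = v^2 / (g h)$, where $h$ is a characteristic leg (hip) height. The sketch below reproduces a value close to the reported 4.2 under the assumption that $h \approx 0.30$ m for the Go1; the exact leg length used in the paper is not stated here, so this is only an illustrative check.

```python
# Froude number Fr = v^2 / (g * h), with h a characteristic leg (hip) height.
# h = 0.30 m is an assumption for the Go1; it roughly reproduces the reported Fr ≈ 4.2.
g = 9.81   # gravitational acceleration, m/s^2
v = 3.5    # top speed from Table V, m/s
h = 0.30   # assumed characteristic leg height, m

froude = v**2 / (g * h)
print(f"Fr ≈ {froude:.1f}")   # ≈ 4.2
```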

6.2. Ablation Studies / Parameter Analysis

The paper presents an ablation study on the reward terms design (discussed in Section 6.1.1 and Figure 3) which confirmed the necessity of task reward, style reward, and regularization reward for generating natural, smooth, and robust gaits.

While not a formal ablation study in the results section, the comparison with RMA in the robust locomotion evaluation (Section 6.1.2 and Figure 5) can be seen as an analysis of a key architectural decision: how the student's low-level network is treated during training. The finding that continuing to train the student's low-level network (as in the proposed method) is superior to freezing it (as in RMA) when handling more difficult terrains highlights a design choice that significantly affects performance and generalization. This indicates that the learning process for the low-level control layers should remain active so they can adapt to the environmental context inferred by the memory network, especially when facing extreme challenges.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully demonstrates a pioneering Reinforcement Learning (RL)-based approach that enables legged robots to achieve both robust and agile locomotion simultaneously. A single policy, trained using Adversarial Motion Priors (AMP) within a teacher-student framework, effectively generates natural gaits. A critical contribution is the zero-shot generalization of these learned skills from flat-terrain motion data to diverse, challenging real-world environments. The system, deployed on a Unitree Go1 quadruped robot using only proprioceptive sensors (IMU and joint encoders), showcases impressive capabilities in traversing 25 cm high obstacles and sprinting up to 3.5 m/s while maintaining natural, smooth motions. To the best of the authors' knowledge, this is the first system to achieve such a combination of blind robustness and agility in legged locomotion.

7.2. Limitations & Future Work

The authors acknowledge a significant limitation inherent to blind locomotion:

  • Lack of Proactive Planning: Since the system relies solely on proprioception, it operates in a "trial and error" manner and cannot perform tasks that require planning ahead of time. Examples given include efficiently climbing stairs (which might require a high-level plan for foot placement and body trajectory) or avoiding holes (which requires foresight of upcoming negative obstacles). The robot reacts to direct physical interactions or inferred states rather than pre-emptively strategizing.

    For future work, the authors propose:

  • Integrating Vision Sensors (Exteroception): Extending the system with vision sensors (e.g., cameras, LIDARs) to incorporate exteroception. This would allow the robot to perceive and understand its environment more directly, enabling planning ahead and potentially improving locomotion efficiency in complex scenarios like stair climbing or hole avoidance.

7.3. Personal Insights & Critique

This paper offers several profound inspirations. The most significant is the demonstration that AMP can be a powerful tool not just for stylized animation but for real-world robot control, particularly in achieving natural gaits with zero-shot generalization. The finding that motion priors learned on flat terrains can transfer to challenging, unseen environments is remarkable and opens avenues for generating diverse and realistic robot behaviors from simpler, more accessible datasets. The elegance of replacing complex reward engineering with a style reward is also highly appealing, simplifying the RL process.

The teacher-student framework, combined with the continuous training of the student's low-level network, is a crucial insight. It highlights that while privileged information can kickstart learning, allowing the student's entire control architecture to remain adaptable is key for robust generalization to extreme conditions. This nuanced approach to sim-to-real transfer and blind locomotion is valuable.

However, a potential issue or area for improvement, as hinted by the authors' sim-to-real gap discussion, is the fidelity of the dynamics randomization in the IsaacGym simulator. While effective, even more sophisticated or adaptive domain randomization could further reduce the gap between commanded and actual speeds, especially for agile tasks. The paper does not delve deeply into the specifics of the sim-to-real gap beyond attributing it to motor behavior or unrealistic simulator dynamics. Further analysis or specific mitigation strategies for these gaps could enhance the practical applicability of the system.

The core limitation of blind locomotion is well-articulated by the authors. This method excels at reactive adaptation, but true intelligence in navigating complex environments often requires a predictive capacity. The suggested future work of integrating exteroception is a natural and necessary next step. However, it will introduce new challenges, such as sensor fusion, handling noisy visual data, and integrating high-level planning with the low-level proprioceptive control. The paper provides a strong foundation upon which such hybrid systems can be built.

The methods and conclusions could be highly transferable to other domains involving complex dynamic systems where natural motion or adaptive behavior under partial observability is desired, such as humanoid robotics, prosthetics control, or even virtual character animation with more realistic physics. The AMP approach, in particular, could be generalized to imbue various RL agents with naturalistic styles without arduous reward shaping.
