Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control
TL;DR Summary
This paper develops dynamic locomotion controllers for bipedal robots using deep reinforcement learning, going beyond single-skill limitations with a novel dual-history architecture that enhances adaptivity and robustness. The resulting controllers deliver superior performance across diverse skills in both simulation and on the real robot.
Abstract
This paper presents a comprehensive study on using deep reinforcement learning (RL) to create dynamic locomotion controllers for bipedal robots. Going beyond focusing on a single locomotion skill, we develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jumping and standing. Our RL-based controller incorporates a novel dual-history architecture, utilizing both a long-term and short-term input/output (I/O) history of the robot. This control architecture, when trained through the proposed end-to-end RL approach, consistently outperforms other methods across a diverse range of skills in both simulation and the real world. The study also delves into the adaptivity and robustness introduced by the proposed RL system in developing locomotion controllers. We demonstrate that the proposed architecture can adapt to both time-invariant dynamics shifts and time-variant changes, such as contact events, by effectively using the robot's I/O history. Additionally, we identify task randomization as another key source of robustness, fostering better task generalization and compliance to disturbances. The resulting control policies can be successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work pushes the limits of agility for bipedal robots through extensive real-world experiments. We demonstrate a diverse range of locomotion skills, including: robust standing, versatile walking, fast running with a demonstration of a 400-meter dash, and a diverse set of jumping skills, such as standing long jumps and high jumps.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control
1.2. Authors
Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, Koushil Sreenath
- Zhongyu Li: Affiliated with the University of California, Berkeley.
- Xue Bin Peng: Affiliated with Simon Fraser University.
- Pieter Abbeel: Affiliated with the University of California, Berkeley.
- Sergey Levine: Affiliated with the University of California, Berkeley.
- Glen Berseth: Affiliated with Université de Montréal and Mila - Quebec AI Institute.
- Koushil Sreenath: Affiliated with the University of California, Berkeley.
The authors represent prominent institutions known for their leading research in robotics, artificial intelligence, and control systems, suggesting a strong background in deep reinforcement learning and robotic locomotion.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. The abstract mentions that a preliminary version was presented at Robotics: Science and Systems [6]. Robotics: Science and Systems (RSS) is a highly reputable and selective conference in robotics research. Publication on arXiv indicates the work is shared for community review and dissemination before, or in parallel with, formal peer review.
1.4. Publication Year
2024
1.5. Abstract
This paper presents a comprehensive study on using deep reinforcement learning (RL) to develop dynamic locomotion controllers for bipedal robots. Unlike prior work focusing on single skills, this research develops a general control solution capable of handling a range of dynamic bipedal skills, including periodic walking and running, and aperiodic jumping and standing. The RL-based controller introduces a novel dual-history architecture that utilizes both long-term and short-term input/output (I/O) history of the robot. This architecture, trained via an end-to-end RL approach, consistently surpasses other methods across diverse skills in both simulation and real-world deployment. The study also investigates the adaptivity and robustness of the proposed RL system. It demonstrates that the architecture can adapt to time-invariant dynamics shifts and time-variant changes (e.g., contact events) by effectively leveraging the robot's I/O history. Task randomization is identified as another crucial source of robustness, fostering better task generalization and disturbance compliance. The resulting control policies are successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work significantly pushes the limits of agility for bipedal robots through extensive real-world experiments, showcasing robust standing, versatile walking, fast running (including a 400-meter dash), and diverse jumping skills (standing long jumps and high jumps).
1.6. Original Source Link
https://arxiv.org/abs/2401.16889 The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The overarching goal in bipedal robotics is to develop robots capable of operating reliably in diverse human environments. A significant bottleneck is the lack of a general control solution for diverse, agile, and robust legged locomotion skills (like walking, running, and jumping) for high-dimensional, human-sized bipedal robots.
Core Problem:
Previous research often focuses on single locomotion skills or struggles with the complexity of underactuated dynamics and distinct contact plans of bipedal robots. Bipedal robots, with their floating base and high-dimensional nonlinear dynamics, present challenges for motion planning and control due to:
- Underactuated Dynamics & Contacts: Reliance on contacts with the environment for movement leads to discontinuities in trajectories, making contact mode planning and stabilization difficult. Leveraging full-order dynamics models is computationally expensive.
- Diverse Locomotion Skills: Different skills (periodic like walking/running vs. aperiodic like jumping) have distinct stability requirements. Periodic skills can leverage orbital stability, while aperiodic skills require finite-time stability, which is further complicated by large impact forces during landing.
Why this problem is important: Human environments are predominantly tailored for bipedal locomotion. Addressing these control challenges is critical for enabling bipedal robots to operate effectively and safely in human-centric spaces, unlocking their full potential for complex, agile maneuvers.
Paper's Entry Point / Innovative Idea:
The paper leverages deep reinforcement learning (RL) to create controllers that can learn directly from the robot's full-order dynamics. Its innovative idea centers on a novel dual-history architecture for RL-based controllers and a multi-stage training framework that emphasizes task randomization. This approach aims to:
- Develop a general control solution: Go beyond single-skill focus to encompass a wide range of dynamic bipedal skills.
- Improve adaptivity: Enable controllers to leverage proprioceptive information (I/O history) to adapt to uncertain, potentially time-varying dynamics.
- Enhance robustness: Generalize to new environments and unexpected scenarios, demonstrating robust behaviors through task randomization.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of legged locomotion control for bipedal robots:
- Development of a New Framework for General Bipedal Locomotion Control: Introduces a general RL framework effective across periodic (walking, running), aperiodic (jumping), and stationary (standing) skills. The resulting controllers are zero-shot deployable on real robots.
- Novel Design Choices for RL-based Control Policy: Proposes a dual-history architecture for non-recurrent RL policies, integrating both long-term and short-term input/output (I/O) history. Combined with an end-to-end training strategy, this achieves state-of-the-art performance, validated in simulation and real-world experiments.
- Empirical Investigation of Adaptivity in RL Controllers: Conducts a detailed empirical study showing that RL-induced adaptivity covers both time-invariant dynamics shifts and time-variant changes (such as contact events) by effectively using the robot's I/O history.
- Improved Robustness in RL Controllers: Identifies task randomization as a key source of robustness, distinct from traditional dynamics randomization. It significantly enhances task generalization and disturbance compliance.
- Extensive Real-World Validation and Demonstrations: Successfully deploys the system on Cassie, a human-sized bipedal robot, demonstrating robust standing, versatile walking, fast running (including a 400-meter dash), and diverse jumping skills (standing long jumps and high jumps). This pushes the agility limits for bipedal robots.
Key Conclusions/Findings:
- The dual-history architecture effectively leverages both recent feedback (short history) and system identification / state estimation (long history) for superior control.
- End-to-end training of the base policy and history encoder is more effective than policy distillation methods (like Teacher-Student or RMA) for bipedal control.
- Adaptivity in RL controllers stems from the ability of the history encoder to capture meaningful information about time-varying events and time-invariant dynamics changes.
- Task randomization is a crucial, "orthogonal" source of robustness, enabling policies to generalize across tasks and exhibit compliant recovery behaviors not explicitly trained for.
- The developed skill-specific versatile policies can implicitly learn contact planning online and dynamically adjust maneuvers for stability and robustness.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Reinforcement Learning (RL)
Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent does not receive explicit instructions but learns through trial-and-error.
- Agent: The learner or decision-maker (in this paper, the robot's controller).
- Environment: The world with which the agent interacts (the physical robot and its surroundings).
- State ($s_t$): A representation of the environment's situation at time $t$.
- Observation ($o_t$): The partial information the agent perceives about the state (in POMDPs).
- Action ($a_t$): A decision made by the agent that affects the environment.
- Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of the agent's action at a given state. The agent's goal is to maximize the cumulative reward over time.
- Policy ($\pi$): A function that maps states (or observations) to actions. The goal of RL is to find an optimal policy that maximizes the expected cumulative reward.
- Deep Reinforcement Learning (DRL): Combines RL with deep neural networks to handle high-dimensional states and actions, often using deep learning models (like MLPs or CNNs) as function approximators for policies or value functions.
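To make the agent-environment loop above concrete, here is a minimal, self-contained Python sketch; the toy environment, its dynamics, and the random stand-in policy are purely illustrative and are not part of the paper.

```python
import numpy as np

class ToyEnv:
    """Illustrative 1-D environment: the agent tries to drive its state toward zero."""
    def reset(self):
        self.state = np.random.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        self.state += 0.1 * action          # simple toy dynamics
        reward = -abs(self.state)           # higher reward the closer to zero
        done = abs(self.state) > 2.0
        return self.state, reward, done

def random_policy(observation):
    return np.random.uniform(-1.0, 1.0)    # stand-in for a learned policy pi(a | o)

env = ToyEnv()
obs, total_return = env.reset(), 0.0
for t in range(100):                        # one episode of trial-and-error
    action = random_policy(obs)
    obs, reward, done = env.step(action)
    total_return += reward                  # RL seeks to maximize this cumulative reward
    if done:
        break
```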
3.1.2. Bipedal Robots
Bipedal robots are robots that walk on two legs, mimicking human or animal locomotion.
- Floating Base: The main body of the robot (e.g., torso or pelvis), which is not fixed to the ground and has 6 Degrees of Freedom (DoFs): 3 translational and 3 rotational. This makes control more complex due to underactuated dynamics.
- Underactuated Dynamics: Systems where the number of actuators (motors) is less than the number of Degrees of Freedom (DoFs). Bipedal robots are inherently underactuated during certain phases of locomotion (e.g., the flight phase in running/jumping, or even when standing with both feet on the ground, as torso motions are not directly actuated).
- Torque-controlled robots: Robots whose actuators (motors) are directly commanded with desired torques (forces that cause rotation). This offers finer control over interaction forces but requires more sophisticated controllers compared to position-controlled robots. Cassie is a torque-controlled robot.
3.1.3. Sim-to-Real Transfer
Sim-to-real transfer is the process of training a robot control policy in a simulated environment and then deploying it on a physical robot without significant retraining or fine-tuning. This is crucial because training directly on physical robots can be expensive, time-consuming, and risky.
- Dynamics Randomization: A common technique in sim-to-real transfer where physical parameters of the simulated robot and environment (e.g., mass, friction, motor gains, sensor noise, latency) are varied randomly during training. This forces the RL agent to learn policies that are robust to uncertainties and discrepancies between simulation and the real world.
- Zero-shot transfer: When a policy trained purely in simulation works effectively on a real robot without any additional training or fine-tuning on the hardware.
3.1.4. Partially Observable Markov Decision Process (POMDP)
A Partially Observable Markov Decision Process (POMDP) is a generalization of an MDP where the agent does not know the exact state of the environment. Instead, it makes observations that are probabilistically related to the state.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is defined by states, actions, transition probabilities, and rewards.
- Belief State: In a POMDP, the agent maintains a belief state, which is a probability distribution over the possible underlying states, given the history of observations and actions.
3.1.5. Input/Output (I/O) History
In the context of robot control, I/O history refers to a sequence of past inputs (actions sent to the robot, e.g., motor commands) and outputs (observable states or sensor readings from the robot, e.g., joint positions, velocities, base orientation) over a certain time window. This history provides the controller with context about recent dynamics, enabling it to infer unobservable states or adapt to changing system parameters.
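A minimal sketch of how such a rolling I/O history buffer might be maintained; the 66-step and 4-step window lengths follow the paper's long and short histories, while the class name, dimensions, and zero-padding scheme are assumptions for illustration.

```python
from collections import deque
import numpy as np

class IOHistory:
    """Rolling buffer of (output, input) pairs: robot observations and previous actions."""
    def __init__(self, obs_dim, act_dim, long_len=66, short_len=4):
        self.long = deque(maxlen=long_len)      # roughly 2 s of history at the policy rate
        self.short_len = short_len
        self.obs_dim, self.act_dim = obs_dim, act_dim

    def append(self, obs, prev_action):
        self.long.append(np.concatenate([obs, prev_action]))

    def long_history(self):
        # Returns a (long_len, obs_dim + act_dim) array, zero-padded until the buffer fills up.
        pad = [np.zeros(self.obs_dim + self.act_dim)] * (self.long.maxlen - len(self.long))
        return np.stack(pad + list(self.long))

    def short_history(self):
        # The most recent few timesteps, flattened, for direct real-time feedback.
        return self.long_history()[-self.short_len:].ravel()
```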
3.1.6. Low Pass Filter (LPF)
A Low Pass Filter (LPF) is an electronic or digital filter that passes low-frequency signals and attenuates (reduces the amplitude of) signals with frequencies higher than a certain cutoff frequency. In robotics, it's often used to smooth out noisy sensor data or control commands, preventing jerky movements and reducing wear on mechanical components.
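As a rough illustration of action smoothing, the sketch below implements a simple discrete first-order low-pass filter. The paper uses a Butterworth filter; the cutoff frequency and sampling rate here are illustrative placeholders, not the paper's values.

```python
import numpy as np

class FirstOrderLowPass:
    """Simple discrete first-order low-pass filter for smoothing action commands."""
    def __init__(self, cutoff_hz, sample_hz):
        # Euler discretization of an RC low-pass filter.
        rc = 1.0 / (2.0 * np.pi * cutoff_hz)
        dt = 1.0 / sample_hz
        self.alpha = dt / (rc + dt)
        self.y = None

    def __call__(self, x):
        self.y = x if self.y is None else self.y + self.alpha * (x - self.y)
        return self.y

# Example: smooth a noisy desired-motor-position command sampled at an assumed 33 Hz.
lpf = FirstOrderLowPass(cutoff_hz=4.0, sample_hz=33.0)   # illustrative cutoff
smoothed = [lpf(q) for q in np.random.randn(100)]
```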
3.2. Previous Works
The paper categorizes previous efforts into model-based optimal control (OC) and model-free reinforcement learning (RL).
3.2.1. Model-based Optimal Control (OC)
Model-based optimal control methods formulate locomotion as an optimization problem where the robot's dynamics model is explicitly used as a constraint.
- Challenges:
  - Computational Complexity: Full-order dynamics are too complex for online optimization.
  - Contact Planning: Making and breaking contact creates non-smooth trajectories and is difficult to optimize.
  - Scalability: Often task-specific, requiring significant effort to adapt to new skills.
- Approaches (see the LIP sketch after this list for intuition on reduced-order models):
  - Cascaded Optimization: Hierarchical control where high-level planning generates reference trajectories and low-level controllers handle real-time execution.
  - Reduced-Order Models: Simplifications of the robot dynamics (e.g., centroidal dynamics, the Linear Inverted Pendulum (LIP), the Angular Momentum LIP (ALIP), the Hybrid-LIP (H-LIP)) for online trajectory optimization.
    - LIP Model: A simplified walking model that assumes the center of mass (CoM) moves at a constant height and that horizontal motion is decoupled from vertical motion. This allows for simpler planning but limits dynamic capabilities.
  - Whole-Body Control (WBC): A reactive controller that translates reduced-order states to joint-level inputs, often by solving a Quadratic Program (QP) with various constraints.
  - Hybrid Zero Dynamics (HZD): Uses the robot's full-order model to design attractive periodic gaits offline, with online feedback controllers to enforce virtual constraints. It relies on the stability of periodic gaits, making it less suitable for aperiodic motions like jumping.
- Contact Planning:
  - Many studies pre-define fixed contact sequences (e.g., for walking, running, jumping).
  - Efforts to integrate contact planning with trajectory optimization involve mixed-integer programming or contact-implicit methods (using complementarity constraints or bilevel optimization), but these are often limited to offline optimization for bipedal robots.
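For intuition on the reduced-order models mentioned above, here is a minimal forward simulation of the Linear Inverted Pendulum, whose horizontal CoM dynamics reduce to $\ddot{x} = (g/z_0)(x - p)$ with constant CoM height $z_0$ and foot/CoP location $p$; all numerical values are illustrative.

```python
import numpy as np

def lip_rollout(x0, xdot0, p_foot, z0=0.9, g=9.81, dt=0.001, steps=500):
    """Integrate the Linear Inverted Pendulum: xddot = (g / z0) * (x - p_foot)."""
    omega2 = g / z0                       # squared natural frequency of the pendulum
    x, xdot, traj = x0, xdot0, []
    for _ in range(steps):
        xddot = omega2 * (x - p_foot)     # horizontal CoM acceleration
        xdot += xddot * dt
        x += xdot * dt
        traj.append(x)
    return np.array(traj)

# CoM starting 5 cm behind the stance foot with a small forward velocity.
com_trajectory = lip_rollout(x0=-0.05, xdot0=0.3, p_foot=0.0)
```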
3.2.2. Model-free Reinforcement Learning (RL)
Model-free RL allows robots to learn control policies through trial-and-error without an explicit dynamics model.
- Control Policy Structure:
  - Observation History: Varies from states-only history to I/O history (the robot's input and output).
  - History Length: Ranges from short (1-15 timesteps) to long (50+ timesteps).
  - Policy Architecture: MLPs for shorter histories, recurrent units (like LSTMs) for longer sequences. The paper notes that a short state history was reported to work better for bipedal humanoid control [79].
- Sim-to-Real Transfer Techniques in RL:
  - End-to-end training with history: Policies trained directly under randomized dynamics using a history of robot measurements or I/O. Applied to bipedal robots [12, 14, 13].
  - Policy Distillation (Teacher-Student / RMA): An expert policy (teacher) with access to privileged information supervises a student policy that only has proprioceptive feedback.
    - Teacher-Student (TS) / Rapid Motor Adaptation (RMA): A two-stage method. First, an expert policy is trained in simulation with access to ground-truth environment parameters (privileged information). Second, a student policy learns to mimic the expert's behavior, typically by inferring these privileged parameters from proprioceptive observations (like I/O history) using an encoder.
      - Expert Policy: A policy that directly uses information unavailable to the real robot (e.g., exact mass, friction, external forces). It learns how to react optimally if it knew these parameters.
      - Student Policy: The policy actually deployed on the robot. It learns to estimate the privileged parameters from its limited sensory observations and uses these estimates to adapt its behavior.
    - Adaptive Rapid Motor Adaptation (A-RMA): An extension of RMA in which, after the student policy learns to infer privileged information, the base MLP (which consumes the inferred information) is further fine-tuned with the encoder frozen. This allows the control part of the policy to compensate for inaccuracies in the inference mechanism.
- Scalability to Different Locomotion Skills:
  - Single-skill, fixed-task policies: Early RL efforts focused on basic skills like walking forward.
  - Single-skill, multi-task policies: Training policies to track varying commands (e.g., different walking velocities) without explicit reference motions.
  - Parameterized Reference Motions: Providing parameterized motions or using policy distillation from task-specific policies for bipedal robots.
  - Adversarial Motion Priors (AMP): Using adversarial training to learn diverse skills by matching the learned motion to a distribution of reference motions [92].
3.3. Technological Evolution
Early bipedal locomotion heavily relied on model-based optimal control, which provided theoretical guarantees but struggled with computational complexity, real-world uncertainties, and scaling to diverse, dynamic behaviors. Simplifications (e.g., LIP model, fixed contact sequences) were necessary, limiting agility and robustness.
The rise of deep learning and reinforcement learning offered a new paradigm. Initially successful in simulation, RL faced the sim-to-real gap. Techniques like dynamics randomization and policy distillation (e.g., Teacher-Student, RMA) bridged this gap, first for quadrupedal robots and then adapted for bipedal robots.
This paper's work fits into the current era of RL-driven robotics, building on these sim-to-real advancements. It pushes beyond single-skill focus and static robustness (from dynamics randomization) by introducing task randomization and a dual-history architecture to achieve truly versatile, dynamic, and robust control for human-sized bipedal robots. It moves towards more general and adaptive controllers that can handle the inherent complexity and uncertainties of real-world bipedal locomotion.
3.4. Differentiation Analysis
Compared to prior model-based and model-free methods, this paper's core innovations and differentiations are:
- Generalization across Diverse Skills:
  - Prior Model-based OC: Highly task-specific, requiring significant re-engineering for each new skill (e.g., HZD for walking vs. running). Aperiodic skills like jumping are particularly challenging.
  - Prior RL: Often focused on single skills or required separate policies/fine-tuning for different tasks within a skill.
  - This Paper: Develops a single, general RL framework that produces skill-specific versatile policies (e.g., one policy for all walking tasks, one for all running tasks, one for all jumping tasks) across periodic, aperiodic, and stationary skills using largely the same architecture and training scheme.
- Novel Dual-History Architecture:
  - Prior RL (History Usage): Varied from no history, to short history (MLP), to long states-only or I/O history (recurrent networks or CNNs). Ablation studies often contrasted long histories with immediate state feedback.
  - This Paper: Introduces a dual-history approach with both a long-term I/O history (encoded by a 1D CNN) and a short-term I/O history fed directly into the base MLP. This is shown to significantly outperform policies relying solely on long history, short history, or states-only history, by providing both contextual system identification and immediate real-time feedback.
- Emphasis on Task Randomization for Robustness:
  - Prior RL (Robustness): Primarily attributed to dynamics randomization (e.g., Teacher-Student, RMA) to bridge the sim-to-real gap. While effective, this mainly addresses parametric uncertainty.
  - This Paper: Identifies and empirically demonstrates task randomization as an "orthogonal" and crucial source of robustness. By training on a wide range of tasks within a skill, the robot learns generalized behaviors and compliant recovery maneuvers that go beyond adhering strictly to a single commanded task, even under unseen perturbations. This is a novel insight into RL-based robustness.
- End-to-End Training vs. Policy Distillation:
  - Prior RL (Sim-to-Real): Policy distillation (TS/RMA) is prevalent, especially for quadrupeds, relying on an expert policy and inferred privileged information.
  - This Paper: Advocates for and demonstrates the superiority of end-to-end training (jointly training the base policy and history encoder) over policy distillation for complex bipedal control. It argues that direct adaptive control, without explicit estimation of pre-selected parameters, allows the policy to implicitly learn more relevant time-varying information (like contact events) for control.
- Real-World Agility and Novel Capabilities:
  - Prior Work: While some bipedal robots achieved running or jumping, demonstrations of versatile and robust performance across multiple dynamic skills, especially with record-setting feats on Cassie (e.g., the 400-meter dash and the long and high jumps reported here), were limited.
  - This Paper: Achieves unprecedented levels of real-world agility, setting new benchmarks for Cassie and human-sized bipedal robots, including demonstrating online contact planning (implicit in the learned policies) for recovery.
4. Methodology
4.1. Principles
The core idea of this method is to leverage deep reinforcement learning (RL) to create general, dynamic, and robust locomotion controllers for bipedal robots. The theoretical basis or intuition behind it is that by exposing an RL agent to a wide variety of tasks and environmental uncertainties (dynamics randomization) within a simulation, and by providing it with a rich contextual observation (dual-history I/O), the agent can learn adaptive and robust control policies that are effectively model-free and direct adaptive. This means the robot learns to identify aspects of its own dynamics and environment changes implicitly from its past inputs and outputs and adjusts its control actions directly, rather than relying on an explicit model or estimated parameters.
4.2. Core Methodology In-depth (Layer by Layer)
The proposed RL system for general bipedal locomotion control is structured as follows:
4.2.1. Robot Model and Control Policy Structure
The experimental platform is Cassie, a human-sized, torque-controlled bipedal robot.
- Joints: 7 joints per leg, totaling 14 leg joints; 10 are actuated by motors and 4 are passive (associated with Cassie's springs).
- Floating Base: The robot's floating base has 6 DoFs (3 translational and 3 rotational).
- Generalized Coordinates: The full system's generalized coordinates $\mathbf{q}$ stack the floating-base pose with the leg joint coordinates.
- Observable States ($\mathbf{o}_t$): Only partial states are reliably measured or estimated. $\mathbf{o}_t$ contains the motor positions and velocities, the base orientation, and the estimated base linear velocity (from an EKF).

The robot's dynamics equation is given by:
$ \mathbf{M}(\mathbf{q})\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q}, \dot{\mathbf{q}})\dot{\mathbf{q}} + \mathbf{G}(\mathbf{q}) = \mathbf{B}\boldsymbol{\tau} + \boldsymbol{\tau}_s(\mathbf{q}) + \mathbf{F}_{\mathrm{ext}} $
Where:
- $\mathbf{M}(\mathbf{q})$ is the generalized mass matrix.
- $\mathbf{C}(\mathbf{q}, \dot{\mathbf{q}})$ is the centrifugal and Coriolis matrix.
- $\mathbf{G}(\mathbf{q})$ is the generalized gravity vector.
- $\mathbf{q}$ are the generalized coordinates, and $\dot{\mathbf{q}}$, $\ddot{\mathbf{q}}$ are their time derivatives.
- $\boldsymbol{\tau}$ are the motor torques (control inputs).
- $\mathbf{B}$ distributes the motor torques to the generalized coordinates.
- $\boldsymbol{\tau}_s(\mathbf{q})$ represents state-dependent spring torques (from the passive joints).
- $\mathbf{F}_{\mathrm{ext}}$ represents generalized external forces, including foot contact wrenches and joint-level friction/perturbations.

The control policy is a deep neural network with parameters $\theta$.

- Action Space: The policy outputs desired motor positions as the robot's action.
- Action Smoothing: The action is first smoothed by a Low Pass Filter (LPF), specifically a Butterworth low-pass filter with a fixed cut-off frequency. This helps prevent jerky movements.
- Control Loop: The policy is queried at the control rate (33 Hz, consistent with the 66-sample, 2-second I/O history described below). The filtered actions are then fed to joint-level PD controllers, running at a much higher rate, that compute the motor torques.

The following figure (Figure 3 from the original paper) illustrates the proposed RL-based controller architecture:
Fig. 3: The proposed RL-based controller architecture leverages a dual history of input (a) and output (o) (I/O) from the robot. The control policy processes a 2-second-long I/O history, which is first encoded by a 1D CNN along its time axis before being merged into a base MLP. In addition, a short history spanning 4 timesteps is fed directly into the base MLP, together with a skill-specific reference motion and variable commands that parameterize the tasks. The policy outputs desired motor positions as the robot's actions, which are smoothed by a low-pass filter (LPF). These filtered outputs are used by joint-level PD controllers, running at a much higher rate, to specify the motor torques. This architecture is general across locomotion skills such as standing, walking, running, and jumping. The figure also annotates Cassie's generalized coordinates, including the actuated joints (marked in red) and passive joints (marked in blue).
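To make the low-level control path concrete (filtered policy output → joint-level PD → motor torques), here is a minimal PD-law sketch; the gains, dimensions, and values are placeholders rather than the paper's settings.

```python
import numpy as np

def pd_torque(q_des, q, qdot, kp, kd, qdot_des=None):
    """Joint-level PD law: tau = Kp * (q_des - q) + Kd * (qdot_des - qdot)."""
    qdot_des = np.zeros_like(qdot) if qdot_des is None else qdot_des
    return kp * (q_des - q) + kd * (qdot_des - qdot)

# Illustrative use inside the fast inner loop: the desired positions come from the
# (low-pass-filtered) policy output, held constant between policy queries.
kp = np.full(10, 80.0)      # placeholder gains for Cassie's 10 actuated joints
kd = np.full(10, 2.0)
q, qdot = np.zeros(10), np.zeros(10)
q_des = 0.1 * np.ones(10)   # filtered policy action (desired motor positions)
tau = pd_torque(q_des, q, qdot, kp, kd)
```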
4.2.2. Dual-History Architecture
The policy's input at each timestep consists of four components:
- Given Command: A time-varying vector that parameterizes the task the robot needs to accomplish (e.g., desired velocity, target position).
- Reference Motion: A preview of the desired trajectory, sampled at future timesteps. This helps the robot anticipate and avoid being short-sighted. It includes desired motor positions and, for some skills, the base height.
- Robot's Short I/O History: A brief history spanning the last 4 timesteps of observations ($\mathbf{o}_{t:t-4}$) and actions ($\mathbf{a}_{t-1:t-4}$). This provides immediate feedback for real-time control.
- Robot's Long I/O History: A longer history spanning two seconds, comprising 66 observation-action pairs. This is crucial for system identification and state estimation, especially during ballistic movements.
Policy Representation Details:
- Base Network: A multilayer perceptron (MLP) with two hidden layers, each having 512 tanh units. It receives the command, reference motion, and short I/O history directly.
- Long-Term History Encoder: A 1D convolutional neural network (CNN) with two hidden layers (filter parameters including [4, 16, 2]), using ReLU activations and no padding. This CNN encodes the 66-timestep long I/O history into a latent representation, which is then fed into the base MLP.
- Output Layer: The base MLP's output layer consists of tanh units that specify the mean of the Gaussian distribution over the normalized action (desired motor positions). The standard deviation is fixed to a constant value.
4.2.3. Multi-Stage Training Framework
A multi-stage training strategy is developed to train versatile locomotion control policies in simulation for zero-shot transfer to hardware. This strategy provides a structured curriculum, as illustrated in the following figure (Figure 4 from the original paper):
Fig. 4: The multi-stage training framework for obtaining a versatile control policy that can be zero-shot transferred to the real world. It starts with a single-task training stage, in which the robot is encouraged to mimic a single reference motion with a fixed goal. This is followed by a task randomization stage, which expands the range of tasks the robot learns and fosters task generalization, resulting in a versatile policy. Once the robot is adept at various locomotion tasks and their transitions, extensive dynamics randomization is incorporated to enhance policy robustness for sim-to-real transfer. This framework is suitable for diverse bipedal locomotion skills, including walking, running, and jumping, and for learning from different sources of skill-specific reference motions such as trajectory optimization, human mocap, and animation.
- Stage 1: Single-Task Training:
  - Objective: To acquire a locomotion skill from scratch (e.g., walking forward, running forward, jumping in place) with a fixed command.
  - Focus: Mastering the basic skill while avoiding undesired strategies.
- Stage 2: Task Randomization:
  - Objective: To develop a versatile policy by encouraging the robot to perform a large variety of tasks within the acquired skill.
  - Mechanism: Introduces diverse commands to expand the range of tasks.
  - Combining the Standing Skill: For periodic skills (walking, running), a sub-stage is added to learn transitions to/from standing.
    - For Walking: After mastering diverse walking tasks, a standing command is issued after random walking intervals. The reference motion changes to standing and the smoothing rewards are increased. This allows learning transitions from walking to standing and back.
    - For Running: Similar to walking, but uses a separate reference motion for the transition from fast running to standing (retargeted human mocap) because this transition is more challenging.
    - For Jumping: Standing is inherently part of the post-landing phase, so no separate sub-stage is needed.
- Stage 3: Dynamics Randomization:
  - Objective: To robustify the policy for successful zero-shot transfer from simulation to hardware.
  - Mechanism: Introduces extensive randomization of dynamics parameters in simulation after the robot is proficient in a simple simulation environment.
4.2.4. Reference Motion
The framework can accommodate diverse sources of reference motion, crucial for shaping the robot's desired behaviors.
- Trajectory Optimization: For the walking skill, a library of 1331 diverse periodic walking gaits is generated based on the robot's full-order dynamics [8, 26]. These gaits span ranges of desired sagittal velocity, lateral velocity, and walking height. A reference motion is a set of Bézier trajectories for each actuated motor over a fixed walking period.
- Motion Capture: For the running skill, motion capture data from a human actor [100] is retargeted to Cassie's morphology using inverse kinematics. A single reference motion for periodic running and one for the running-to-standing transition are used.
- Animation: For the jumping skill, a hand-crafted animation from a 3D animation suite is used, providing a single jumping-in-place motion. Crucially, no trajectory optimization is performed to make this kinematically feasible motion dynamically feasible for the robot; RL is relied upon to learn dynamic feasibility.
4.2.5. Reward Function
The reward function incentivizes the robot to perform desired locomotion skills and complete tasks. It is a weighted summation of several reward components:
$ r_t = \sum_{i} w_i \, r_{i,t} $
Each individual reward component follows the format:
$ r_{i,t} = \exp\left( -\alpha_i \, \| \mathbf{x}_i - \hat{\mathbf{x}}_i \| \right) $
Where:
- $\mathbf{x}_i$ and $\hat{\mathbf{x}}_i$ are two vectors (e.g., the actual state vs. the desired state).
- $\| \mathbf{x}_i - \hat{\mathbf{x}}_i \|$ is the Euclidean distance between the vectors; minimizing this distance maximizes the reward.
- $\alpha_i$ is a scaling factor introduced for each term to normalize units, ensuring the output range is (0, 1].
- $w_i$ is the weight of each component (collectively, the weight vector).

The reward components are grouped into three key terms:
- Motion Tracking: Incentivizes following the provided reference motion.
  - Motor position reward.
  - Global pelvis height reward.
  - Global foot height reward: accounts for terrain height variations or target elevated heights (privileged information available in simulation).
- Task Completion: Ensures the robot accomplishes the assigned tasks.
  - Pelvis velocity tracking (desired linear and angular velocities).
  - Global pose tracking (position and orientation; a cosine term handles angular periodicity).
  - For jumping, desired landing targets are specified, and average-velocity terms shape the otherwise sparse position reward over the jumping timespan.
- Smoothing: Discourages jerky behaviors.
  - Foot impact: reduce impact forces.
  - Torque: reduce energy consumption.
  - Motor velocity: encourage smooth motions.
  - Joint acceleration: damp out accelerations.
  - Change of action: regulate action changes.
Reward Weights:
- Across Stages: Motion tracking dominates in Stage 1. Task completion takes precedence in Stage 2 (task randomization). Smoothing weights start low and are gradually increased in later stages.
- Across Skills: Weights are largely consistent, but foot height tracking and task completion receive higher weights for skills with a flight phase (running, jumping). The change-of-action weight is increased for the aggressive movements in running and jumping.
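A minimal sketch of how such a weighted exponential-tracking reward could be computed; the component choices, scaling factors, and weights shown are illustrative, not the paper's values.

```python
import numpy as np

def reward_term(x, x_hat, alpha):
    """Generic tracking term: exp(-alpha * ||x - x_hat||), which lies in (0, 1]."""
    return np.exp(-alpha * np.linalg.norm(np.asarray(x) - np.asarray(x_hat)))

def total_reward(terms, weights):
    """Weighted sum of individual reward components."""
    return sum(w * r for w, r in zip(weights, terms))

# Illustrative use with made-up quantities and scales:
q_motor, q_ref = np.zeros(10), 0.05 * np.ones(10)              # motor positions vs. reference
v_pelvis, v_cmd = np.array([0.9, 0.0]), np.array([1.0, 0.0])    # pelvis vs. commanded velocity
terms = [reward_term(q_motor, q_ref, alpha=5.0),                # motion tracking
         reward_term(v_pelvis, v_cmd, alpha=2.0)]               # task completion
r_t = total_reward(terms, weights=[0.4, 0.6])
```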
4.2.6. Episode Design
- Unified Approach: Consistent across all skills and stages.
- Episode Duration: 2500 timesteps (76 seconds).
- Variable Tasks (Stage 2): Commands randomized after 1-15 second intervals.
- Aperiodic Tasks (Stage 1): Episode length adjusted to cover full trajectory (e.g., for jumping) plus a significant extension for learning to maintain the final standing pose.
- Early Termination Conditions: In addition to standard conditions (e.g., robot falling), specific conditions are added to encourage desired behaviors:
  - Foot height tracking tolerance: Terminates the episode if the foot-height tracking error exceeds a tolerance. This is especially effective for flight phases. The tolerance starts tight and is relaxed later in training.
  - Task completion tolerance: Terminates the episode if the task error exceeds a tolerance, which is gradually reduced over training.
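A small sketch of an early-termination check of this kind; the threshold values and the fall condition are placeholder assumptions, and in the paper the tolerances are adjusted over the curriculum rather than fixed.

```python
def should_terminate(foot_height_err, task_err, pelvis_height,
                     foot_tol=0.05, task_tol=0.3, min_pelvis_height=0.55):
    """Illustrative early-termination check with placeholder thresholds."""
    if pelvis_height < min_pelvis_height:   # standard "robot has fallen" condition
        return True
    if foot_height_err > foot_tol:          # foot-height tracking tolerance (flight phases)
        return True
    if task_err > task_tol:                 # task-completion tolerance
        return True
    return False
```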
4.2.7. Dynamics Randomization
Applied in Stage 3 to ensure sim-to-real transfer. Parameters are sampled from uniform distributions at each episode.
- Modeling Uncertainty:
  - Ground friction coefficient: [0.3, 3.0]
  - Joint damping ratio
  - Spring stiffness (crucial for Cassie's passive joints)
  - Link mass
  - Link inertia
  - Pelvis (root) CoM position
  - Other link CoM positions
  - Motor PD gains (randomized independently per joint)
- Measurement Uncertainty:
  - Motor position noise mean
  - Motor velocity noise mean
  - Gyro rotation noise
  - Linear velocity estimation error
  - Communication delay (applied as a zero-order hold)
- Optional Randomization:
  - Randomized Perturbation: External wrenches (forces and torques) applied to the robot's pelvis at randomized time intervals (different intervals for walking and running). Excluded for jumping and for transitions to standing, as it hindered learning.
  - Randomized Terrain: Various types (waved, slopes, stairs, steps) generated from parameterized height maps. Used only after proficiency under the other dynamics randomization (e.g., for running).
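A minimal sketch of per-episode dynamics-parameter sampling; only the ground-friction range comes from the text above, and the remaining parameter names and ranges are illustrative placeholders.

```python
import numpy as np

def sample_dynamics(rng=np.random.default_rng()):
    """Sample one set of randomized dynamics parameters at the start of an episode.
    Only the friction range is taken from the text; the other ranges are placeholders."""
    return {
        "ground_friction": rng.uniform(0.3, 3.0),
        "link_mass_scale": rng.uniform(0.8, 1.2),        # placeholder multiplicative range
        "joint_damping_scale": rng.uniform(0.5, 2.5),    # placeholder
        "pd_gain_scale": rng.uniform(0.9, 1.1),          # placeholder, per joint in practice
        "comm_delay_s": rng.uniform(0.0, 0.025),         # placeholder zero-order-hold delay
    }

# A new simulation instance would be configured with these values each episode:
params = sample_dynamics()
```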
4.2.8. Training Details
- Simulator: MuJoCo, based on [101, 102].
- RL Algorithm: Proximal Policy Optimization (PPO) [103].
- Policy: The actor (control policy) described in Section 4.2.2.
- Value Function: A 2-layer MLP with access to ground-truth observations.
- Hyperparameters: Differ across stages and skills; detailed in Appendix D (Tables VII, VIII).
5. Experimental Setup
5.1. Datasets
The paper does not use traditional "datasets" in the sense of a fixed collection of samples. Instead, it generates data through continuous interaction with a simulated environment (powered by MuJoCo) during the reinforcement learning training process. The "dataset" for learning is thus the stream of observations, actions, and rewards experienced by the robot across millions of simulation steps, especially under dynamics randomization and task randomization.
The training environment is the simulated Cassie robot, which is a torque-controlled bipedal robot.
- Source: The Cassie robot model and simulation environment are based on previous work [101, 102].
- Characteristics: The simulation includes detailed physics models of the robot's body, joints, and interactions with the ground. It also incorporates simulated sensor noise, communication delays, and dynamics parameter variations as part of the dynamics randomization process.
- Domain: Robot locomotion control, specifically for human-sized bipedal robots.
- Effectiveness: The simulated environments are designed to be as close to the real world as possible, while also being sufficiently diverse (via randomization) to enable zero-shot transfer to the physical Cassie robot.
5.2. Evaluation Metrics
The paper evaluates the performance of the control policies using both quantitative metrics (primarily Mean Absolute Error) and qualitative observations from real-world experiments.
5.2.1. Mean Absolute Error (MAE)
- Conceptual Definition: Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. It quantifies the average absolute difference between actual values and desired (or reference) values. A lower MAE indicates better accuracy or tracking performance.
- Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $
- Symbol Explanation:
  - $N$: The total number of data points or observations.
  - $y_i$: The $i$-th actual or observed value (e.g., actual velocity, actual joint position).
  - $\hat{y}_i$: The $i$-th desired or reference value (e.g., commanded velocity, reference joint position).
  - $|\cdot|$: The absolute value.
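A tiny reference implementation of the MAE metric as defined above; the example numbers are arbitrary.

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    """MAE between actual values y and desired/reference values y_hat."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat))

# e.g., tracking error between estimated and commanded sagittal velocity
mae = mean_absolute_error([0.02, -0.05, 0.01], [0.0, 0.0, 0.0])
```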
5.2.2. Qualitative Observations
Beyond MAE, the paper relies heavily on qualitative assessments from real-world deployment, including:
- Drift: Observing if the robot maintains its position or path when commanded to stay in place.
- Stability: The robot's ability to maintain balance and prevent falls under various conditions.
- Compliance: How the robot reacts to external disturbances or unexpected terrain changes, e.g., by adjusting its gait rather than falling.
- Recovery Maneuvers: The robot's ability to regain stability after a perturbation, potentially by performing complex sequences of actions.
- Flight Phase: The presence and characteristics of periods where both feet are off the ground, indicative of dynamic running or jumping.
- Completion Time: For specific tasks like the 100-meter or 400-meter dash.
- Accuracy of Landing: For jumping tasks, how precisely the robot lands on a target.
5.3. Baselines
The paper conducts an extensive ablation study and comparisons against several baselines to evaluate its proposed policy architecture and training strategy. These baselines represent common design choices in RL-based locomotion control:
- Ours (Proposed Method):
  - Architecture: Dual-history architecture (short-term I/O fed directly to the base MLP, long-term I/O encoded by a 1D CNN before the MLP).
  - Action: Directly specifies desired motor positions.
  - Training: End-to-end training.
- Residual:
  - Architecture: Same as Ours, but the policy outputs a residual term that is added to a reference motor position.
  - Purpose: Tests the impact of residual learning, a common approach in prior work [11, 12].
- State Feedback Only:
  - Architecture: Same model structure and action space as Ours.
  - Observation: Relies solely on historical states (the robot's output history), omitting the robot's input history.
  - Purpose: Evaluates the importance of including the robot's input history in the I/O observations. This choice is common in prior work [12, 14, 22].
- Long History Only:
  - Architecture: Relies only on the long-term I/O history encoded by the CNN. The base MLP receives the latest observation (immediate state feedback) but no explicit short history.
  - Purpose: Serves as a baseline to demonstrate the added benefit of explicitly providing short history alongside long history [5, 74].
- Short History Only:
  - Architecture: Relies solely on the short-term I/O history, excluding the long-term I/O history CNN encoder.
  - Purpose: Represents an approach common in quadruped control and some bipedal work [13, 65, 70].
- RMA/Teacher-Student (Policy Distillation):
  - Mechanism: Two-phase training:
    - Expert (Teacher) Policy: Trained by RL with access to privileged environment information (encoded into an 8D extrinsics vector by an MLP encoder). This policy can only operate in simulation.
    - RMA (Student) Policy: Copies the base MLP from the expert policy and learns to use the long I/O history encoder to estimate the teacher's extrinsics vector.
  - Modification: In this study, all expert, RMA, and A-RMA policies are modified to include short I/O histories in their base MLP for a fairer comparison with Ours.
  - Purpose: Compares end-to-end training with policy distillation methods [71, 74].
- A-RMA (Adaptive Rapid Motor Adaptation):
  - Mechanism: An additional training phase after RMA. The long I/O history encoder's parameters remain fixed, while the base MLP is further updated through RL.
  - Purpose: Explores improvements over standard RMA by fine-tuning the control part of the policy [67].

The experimental design for these baselines involved training 3 policies per method for each locomotion skill (walking, running, jumping), using identical multi-stage training frameworks and hyperparameters but different random seeds. This resulted in distinct control policies for comprehensive evaluation.
5.4. Hyperparameters
The paper details the command ranges for different skills and the hyperparameters used in PPO training.
The following are the results from Table VI of the original paper:
| Task Parameters | Range |
|---|---|
| Walking | |
| Sagittal Velocity | [-1.5, 1.5] m/s |
| Lateral Velocity | [-0.6, 0.6] m/s |
| Turning Velocity | [-45, 45] deg/s |
| Walking Height | [0.65, 1.0] m |
| Running | |
| Sagittal Velocity | [2.0, 5.0] m/s |
| Lateral Velocity | [-0.75, 0.75] m/s |
| Turning Velocity | [-30, 30] deg/s |
| Jumping | |
| Sagittal Landing Location | [-0.5, 1.5] m |
| Lateral Landing Location | [-1.0, 1.0] m |
| Turning Direction at Landing | [-100, 100] deg |
| Elevation Change | [-0.5, 0.5] m |
The following are the results from Table VII of the original paper:
| PPO Training Iterations | Walking | Running | Jumping |
|---|---|---|---|
| Single-Task | 6000 | 6000 | 6000 |
| Task Randomization | 8000 | 18000 | 12000 |
| Combining Standing | 2000 | 5000 | N/A (inherent) |
| Dynamics Randomization | 8000 | 15000 | 20000 |
| Added Perturbation Training | 5000 | 5000 | N/A (not used) |
The following are the results from Table VIII of the original paper:
| Hyperparameter | Value |
|---|---|
| PPO iteration batch size | 65536 |
| PPO clip rate | 0.2 |
| Optimization step size (both actor and critic) | 1e-4 |
| Optimization batch size | 8192 |
| Optimization epochs | 2 |
| Discount factor ($\gamma$) | 0.98 |
| GAE smoothing factor ($\lambda$) | 0.95 |
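For convenience, the Table VIII values can be collected into a single configuration object; how such values are actually passed to the authors' training code is an assumption.

```python
# A minimal dictionary view of the PPO hyperparameters in Table VIII (values from the table;
# the key names and how they are consumed by a training loop are illustrative).
ppo_config = {
    "iteration_batch_size": 65536,
    "clip_rate": 0.2,
    "step_size": 1e-4,          # actor and critic
    "minibatch_size": 8192,
    "optimization_epochs": 2,
    "discount_gamma": 0.98,
    "gae_lambda": 0.95,
}
```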
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive analysis of the proposed RL-based controller's performance, focusing on its policy architecture, adaptivity, robustness, and real-world deployment. The results consistently validate the effectiveness of the proposed method across various dynamic bipedal locomotion skills.
6.1.1. Advantages of Policy Architecture
The analysis of learning performance in simulation (Stage 3, with randomized tasks and dynamics) and sim-to-real transfer for in-place walking demonstrates the superiority of the dual-history architecture and end-to-end training.
The following figure (Figure 6 from the original paper) displays the learning performance for training walking, running, and jumping policies with randomized tasks and dynamics parameters.
Fig. 6: Learning performance for training walking, running, and jumping policies with randomized tasks and dynamics parameters, using our method and various baselines. The y-axis shows the normalized return and the x-axis shows the number of samples. The shaded region represents the standard deviation across policies trained from three different random seeds, indicating the consistency of our method.
Key Observations from Learning Performance (Fig. 6):

- Choices of Action (Residual vs. Direct): Policies using a residual term (purple curves) consistently show deteriorated learning performance across all skills. This suggests that directly specifying desired motor positions (as in Ours) is more effective, as residual terms can introduce undesired movements and hinder the policy's ability to explore beyond the reference motions.
- Choices of Observation (I/O History vs. State Feedback Only): Omitting the robot's input (action) history (pink curves) leads to a decline in learning performance. This highlights the crucial role of utilizing both input and output history, which lets the control policy perform system identification and state estimation and enhances adaptivity to uncertain dynamics.
- History Length (Long vs. Short vs. Dual-History):
  - Long History Only (blue curves) generally performs worse than Short History Only (orange curves) or other methods.
  - The proposed dual-history approach (Ours, red curves), which provides direct access to the short I/O history in the base MLP while also using a long-history encoder, significantly improves learning performance. This indicates that short history provides critical real-time feedback that complements the contextual information from the long history.
- Comparison with Policy Distillation (RMA/Teacher-Student):
  - RMA (student) policies (green curves) show significant degradation compared to the expert policy (black curves) and Ours. This is attributed to unavoidable errors in estimating pre-selected environment parameters from the long history. RMA even fails to learn challenging skills like running.
  - A-RMA (dark green curves), which fine-tunes the base MLP after RMA, improves performance but still falls short of Ours, despite using more training samples. In cases where the encoder struggles (running), A-RMA effectively behaves like Short History Only.
  - Ours achieves performance similar to the expert policy (the theoretical upper bound for the student) but, unlike the expert, is deployable in the real world.

The following figure (Figure 7 from the original paper) compares the in-place walking performance of different policy architectures on real hardware.
Fig. 7: Real-world in-place walking experiments with different policy architectures. The top row shows snapshots of Cassie in the middle of walking. The bottom row compares the estimated sagittal velocity (black line), lateral velocity (red line), and pelvis yaw angle (blue line) over time. The policies are trained from the same random seed (corresponding to one of the seeds reported in Fig. 8) and deployed without any tuning. Our proposed method shows minimal drift in the sagittal and lateral directions, as well as in yaw angle, compared to other methods.

Case Study: In-place Walking Experiments (Real World):

- The proposed method (Ours) demonstrates notably lower tracking errors and successfully maintains in-place walking with minimal drift.
- Other methods (Long History Only, Short History Only, State Feedback Only) result in substantial drift (e.g., to the robot's left).
- RMA shows the most obvious sagittal shift and walks forward at high speed despite a zero-velocity command. A-RMA reduces this but still experiences considerable lateral movement.
- The Residual policy fails to maintain a stable gait.

The following figure (Figure 8 from the original paper) quantitatively compares the speed and orientation tracking errors for in-place walking across simulation and real-world deployments, for policies trained with different random seeds.
Fig. 8: Quantitative comparisons of the speed tracking error and orientation tracking error (MAE) for in-place walking, using policies trained from different random seeds with the same method. The figure shows results in (a) simulation and (b) real-world tests.

- In simulation (Fig. 8a), most policies show good tracking performance.
- In the real world (Fig. 8b), Ours exhibits minimal degradation and consistently better control performance, both in command tracking and in stabilizing the floating base. Other methods show significant speed drift and rotational tracking errors (e.g., Long History Only performs worst at stabilizing the pelvis rotation despite minimal oscillation in simulation).
Summary of Policy Architecture Advantages:

- Direct Motor Commands: Policies should directly specify motor-level commands rather than use a residual term.
- Full I/O History: Utilize a history of both the robot's input and output, not solely state feedback.
- Dual-History Approach: Combine long-term history (for system identification) with short-term history (for real-time feedback) in the base policy.
- End-to-End Training: Train the base policy and history encoder end-to-end for better performance and reduced complexity compared to policy distillation.
6.1.2. Source of Adaptivity
The paper investigates the latent representation from the long I/O history encoder to understand how the proposed method adapts to varying environment parameters.
The following figure (Figure 9 from the original paper) shows latent representations for periodic running and their changes under different dynamics.
Fig. 9: (a) (Top) Recorded latent representation after long-term I/O history encoder during running. (Bottom) Comparison of two selected dimensions (marked as red lines in the top plot) with recorded impact forces on each of the robot's feet. (b) The blue plot shows the robot's latent representation with default dynamics parameters during running. The red plots indicate changes in the same region under different dynamics. (c)(Top) Recorded latent representation after long-term I/O history encoder during jumping. (Bottom) Comparison of two selected dimensions (marked as red lines in the top plot) with recorded total impact forces on both of the robot's two feet. (d) The blue plot shows the robot's latent representation with default dynamics parameters during jumping. The red plots indicate changes in the same region under different dynamics.
Time-Varying Embedding (Adaptivity to Time-Varying Events):

- Periodic Running (Fig. 9a):
  - The latent embedding exhibits a periodic pattern once the gait stabilizes.
  - It captures time-varying disturbances, showing variations when a persistent backward perturbation force is applied.
  - Two latent dimensions show a strong correlation with the foot impact forces, effectively performing contact estimation. These signals accurately track take-off and landing events (they are zero when the foot is in swing phase).
  - Intriguingly, these latent values shift to a lower envelope during the perturbation, even though the ground impact force is unchanged, suggesting implicit learning of external forces and generalized dynamics without explicit engineering.
- Aperiodic Jumping (Fig. 9c):
  - The latent representation distinguishes between jumping phases (more varying signals) and standing phases (less varying signals).
  - Different jumping tasks result in distinct latent values during the jumping phases.
  - Two latent dimensions correlate with contact events: one with the take-off event (it increases and drops before the contact force reaches zero) and one with the landing event (it becomes active upon landing). This suggests the robot learns separate take-off and landing cues, which are more informative for control than a single binary contact variable.
Adaptive Embedding for Changes in Dynamics (Adaptivity to Time-Invariant Dynamics Shifts):

- Running (Fig. 9b) and Jumping (Fig. 9d):
  - Varying dynamics parameters (e.g., link CoM position, link mass, joint damping ratio, PD gains, ground friction) results in significant shifts of the latent embedding compared to the default model.
  - Despite these changes in the latent representation, control performance metrics (e.g., task completion error, motion tracking error) show minimal change, highlighting the controller's adaptivity to time-invariant dynamics shifts.
  - Latency (communication delay) causes noticeable changes in the latent embedding, but increased measurement noise (2x the training bounds) has little effect, suggesting the CNN encoder effectively filters out zero-mean noise.

Summary of Adaptivity:
The history encoder enables the proposed controller to adapt by capturing meaningful information from the I/O history about:
- Time-varying events (external perturbations, contact events).
- Time-invariant changes in dynamics parameters.
- Measurement noise, which it filters out.
This capability explains the strong performance in challenging training settings with extensive dynamics randomization.
6.1.3. Advantages of Versatile Policies and Source of Robustness
The paper identifies task randomization as a key source of robustness, distinct from dynamics randomization. A single versatile policy capable of diverse tasks significantly improves robustness.
The following figure (Figure 10 from the original paper) demonstrates robustness tests in simulation for walking, running, and jumping under out-of-distribution uncertainties.
Fig. 10: Robustness tests in simulation for walking, running, and jumping policies under out-of-distribution uncertainties that exceed training bounds. (a, b) Walking under consistent lateral force and CoM offset. (c, d) Running under consistent forward force and CoM offset. (e, f) Jumping under lateral force and CoM offset. Policies are: (i) Single Task (trained with dynamics randomization), (ii) Single-Task w/ Perturbation (trained with dynamics randomization + perturbations), (iii) Versatile (trained with task randomization + dynamics randomization). Versatile policies show compliant behaviors and task generalization to handle disturbances, especially in running and jumping (Fig. 10c, 10e).
Robustness Tests in Simulation (Fig. 10):

- Walking (Fig. 10a, 10b):
  - Under a consistent lateral pulling force: Single-Task (i) fails. Single-Task w/ Perturbation (ii) progresses with minor deviation. Versatile (iii, Ours), without perturbation training, stabilizes by using a learned side-walking skill to compensate, showing a compliant gait.
  - Under a CoM offset: Single-Task (i) fails. Single-Task w/ Perturbation (ii) uses learned stabilization to counter the backward force, walking forward at reduced speed. Versatile (iii) leverages backward walking gaits to offset the CoM shift.
- Running (Fig. 10c, 10d):
  - Under a constant forward perturbation: Single-Task (i) and Single-Task w/ Perturbation (ii) fail to maintain stable gaits. Versatile (iii) adapts by using faster running skills to overcome the perturbation.
  - Under a CoM offset: Similar results; the Versatile policy successfully handles the offset.
- Jumping (Fig. 10e, 10f):
  - Under a lateral perturbation (Fig. 10e) and a forward CoM offset (Fig. 10f), Versatile policies (without perturbation training) successfully adapt by using lateral or forward jump skills, respectively.
Conclusion on Robustness Sources:

- Dynamics Randomization (+ Perturbation Training): Allows policies to function within an expanded range of scenarios, but limits them to the trained task.
- Task Randomization: Enables policies to generalize learned tasks for greater robustness and compliance, finding better maneuvers even without extensive dynamics randomization.

The following figure (Figure 11 from the original paper) shows robust standing recovery with the versatile walking policy.
Fig. 11: Robust standing recoverywithversatile walking policy. (a)Single-task standing policyfails if pushed beyondsupport region. (b)Single-task standing policy trained with perturbationsstill fails after being pushed too far. (c)Versatile walking policy(trained withtask randomization) recovers by transitioning towalking gaitswhen pushed beyond itssupport regionwhile standing. This is an illustration oftask generalizationforrobustnessin the real world.Case Study: Robust Standing Experiments (Real World):
- When single-task standing policies (trained with or without perturbations) are pushed beyond their support region, they lose balance (Fig. 11a, 11b).
- A versatile walking policy (also trained with the standing skill) demonstrates intelligent recovery maneuvers (Fig. 11c). When pushed, it transitions to a walking gait, executes several steps (forward/backward), and then smoothly reverts to a standing pose. This occurs autonomously without human commands or explicit perturbation training, showcasing task generalization.

The following figure (Figure 12 from the original paper) illustrates other robust recovery maneuvers from lateral push, collision, and unstable landing.
Fig. 12: More robust recovery maneuvers by versatile policies in the real world. (a) When laterally perturbed while standing, the versatile walking policy utilizes varied walking skills to recover and return to a stand. (b) The versatile running policy shows robustness to a collision with a track guard, using side-stepping skills to disengage and maintain stability. (c) The versatile jumping policy executes a corrective hop after an unstable landing from a complex multi-axis jump to achieve a more stable configuration.

- The versatile walking policy recovers from lateral perturbations by using varied walking skills to lower its center of mass and return to a stand (Fig. 12a).
- The versatile running policy recovers from a collision with a track guard by using side-stepping skills and maintaining a stable stance (Fig. 12b).
- The versatile jumping policy performs a corrective hop after an unstable landing to achieve a more stable configuration (Fig. 12c).

These complex, long-horizon recovery maneuvers are emergent properties of versatile policies, not explicitly trained.
Understanding Robustness from Training Distributions:
The findings are conceptually illustrated by considering training distributions, analogous to invariant sets in control theory.
The following figure (Figure 13 from the original paper) conceptually illustrates training distributions.
Fig. 13: An illustration of the concept of training distributions using different methods to enhance robustness. During deployment, as conceptually illustrated by the red curve, we want the robot controlled by its RL-based policy to operate inside the training distribution of the robot's trajectories. When the training is focused on a single task, the training distribution is confined to nominal trajectories specific to that task, drawn as the yellow region. Incorporating extensive dynamics randomization, including simulated perturbations or varying terrains, can expand this distribution. However, this expansion is still centered around the fixed task. Task randomization significantly broadens the training distribution (to the orange region) by enabling the robot to learn and generalize various control strategies across different tasks (marked as different faded yellow regions). It is important to note that task randomization can be combined with dynamics randomization, further widening the training distribution and enhancing the policy's robustness.
- Single-task policies have a limited training distribution.
- Dynamics randomization expands this distribution but keeps it centered around the fixed task.
- Task randomization significantly broadens the training distribution by allowing the robot to learn and generalize various control strategies across different tasks. This enables the robot to remain within its enhanced training distribution even when faced with disturbances, using learned tasks for recovery.
- Combining task randomization with dynamics randomization further expands the training distribution and enhances robustness. Task randomization is an "orthogonal" way to improve robustness beyond pushing the limits of dynamics randomization (which can hinder learning if too extreme).
Summary of Robustness:
Versatile policies (trained with task randomization) show significant improvements in robustness compared to task-specific policies. This stems from their ability to generalize learned tasks and find better maneuvers to tackle unforeseen situations, leading to better stability and compliance to disturbances.
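To make the idea of task randomization concrete, the following is a minimal sketch of per-episode task sampling; the command fields and ranges are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def sample_walking_task(rng, randomize_task=True):
    """Resample the commanded task at the start of each training episode."""
    if not randomize_task:
        # Single-task training: a fixed command, so the training distribution
        # stays centered on one nominal trajectory.
        return dict(vx=0.5, vy=0.0, yaw_rate=0.0, height=0.9)
    # Task randomization: commands are drawn from wide ranges, so the policy
    # must learn many gaits it can later reuse as recovery maneuvers.
    return dict(
        vx=rng.uniform(-1.0, 1.0),        # sagittal velocity command (m/s)
        vy=rng.uniform(-0.5, 0.5),        # lateral velocity command (m/s)
        yaw_rate=rng.uniform(-0.5, 0.5),  # turning rate command (rad/s)
        height=rng.uniform(0.65, 1.0),    # walking height command (m)
    )

rng = np.random.default_rng(0)
task = sample_walking_task(rng)  # drawn anew each episode during training
```

Task randomization in this sense can be layered on top of dynamics randomization, matching the widened training distribution illustrated in Fig. 13.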
6.1.4. Dynamic Bipedal Locomotion in the Real World
Extensive real-world experiments on Cassie validate the adaptivity and robustness of the developed skill-specific versatile policies.
6.1.4.1. Walking Experiments
The following figure (Figure 14 from the original paper) illustrates the walking policy's performance in tracking variable commands and consistency over time.
Fig. 14: The walking policy tracking variable commands in the real world without any tuning. (a) Tracking variable sagittal velocity, lateral velocity, and walking height. (b, c) Consistency of tracking performance over a long timespan (492 and 325 days after initial testing). The tracking errors (MAE) are reported for each test, showing minimal degradation over time.
Tracking Performance:
- Variable Commands (Fig. 14a): The policy efficiently tracks varying, fast-changing commands for sagittal velocity, lateral velocity, and walking height, with the MAE reported for each dimension.
- The following figure (Figure 15 from the original paper) shows the robot tracking turning yaw commands.
Fig. 15: A snapshot from the real world demonstrating the robot reliably tracking various turning yaw commands using the same controller. The robot can execute full turns in both counterclockwise and clockwise directions.
- The policy reliably tracks varying turning commands (clockwise/counterclockwise) (Fig. 15).
- Consistency over a Long Timespan (Fig. 14b, 14c): The RL-based walking policy adapts to changing robot dynamics (due to wear and tear) and consistently performs well over extended periods (492 and 325 days after initial testing) with minimal tracking-error degradation. This highlights the adaptivity without hardware-specific tuning.
- Fast Walking (Fig. 16):
  - The robot can transition from standstill to fast forward walking, tracking the commanded speed, and quickly return to standing (Fig. 16a, 16c).
  - The following figure (Figure 16 from the original paper) shows fast forward and backward walking.
Fig. 16: Fast forward walking (a) and fast backward walking (b) demonstrations of Cassie in the real world without any tuning. The robot can transition from a stationary stance to a rapid gait and return to standing with a single command, even during dynamic maneuvers like fast walking. The recorded sagittal velocity for fast forward walking is shown in (c), with the average and peak speeds while tracking the command.
  - It also performs fast backward walking while tracking the commanded speed (Fig. 16b).
-
Robust Walking Maneuvers:
- Uneven Terrains (Untrained) (Fig. 17): The policy shows considerable robustness to varying elevation changes (backward walking on small stairs or declined slopes) despite no specific training on this terrain and lacking terrain elevation sensors. This is attributed to robustness to changes in contact timing or contact wrench.
  - The following figure (Figure 17 from the original paper) shows walking on uneven terrains.
Fig. 17: Robust walking on uneven terrains without any tuning. The robot can walk backward on stairs (a) and declined slopes (b). The sagittal velocity and pelvis height are consistently tracked. The robot lacks terrain elevation sensors, adapting through its I/O history and robustness to contact changes.
- Robustness to Random Perturbations:
  - Impulse Perturbation (Fig. 18a): A substantial lateral perturbation force causes a lateral velocity peak. The robot swiftly recovers by moving in the opposite lateral direction, compensating for the perturbation and restoring stable in-place walking.
  - The following figure (Figure 18 from the original paper) shows recovery from a lateral impulse and a comparison with model-based control.
Fig. 18: Robust walking under lateral impulse perturbation (a) by the RL-based walking policy versus a model-based controller (b) in the real world. In (a), the robot, despite being pushed laterally and accelerated, maintains a stable walking gait and compensates for the lateral impulse by walking in the opposite direction. The corresponding lateral velocity is recorded in the lower part of the figure. Additionally, the planar position (q_y, q_x) is estimated, with points appearing in progressively darker colors as they are recorded later in time. The model-based controller (b) fails to maintain control when subjected to a similar lateral perturbation (recorded in Vid. 3).
  - Persistent Perturbation (Fig. 19): Under a persistent lateral dragging force or random sagittal forces, the robot shows compliance, following the force directions while commanded to walk in place, without losing balance. This demonstrates potential for safe human-robot interaction.
  - The following figure (Figure 19 from the original paper) shows compliance to persistent perturbations.
Fig. 19: Robust walking under persistent and random external perturbation (a) with a lateral dragging force and (b) with a sagittal force at a lower height. The robot shows compliance to these forces, maintaining balance and returning to the commanded task after the force is removed. This demonstrates the advantages of the RL-based policy for safe human-robot interaction.
- Comparison with a Model-based Controller (Fig. 18b): The model-based controller fails to maintain control under the lateral perturbation, resulting in a crash, due to its inability to deal with large modeling errors from unmodeled external forces. It also fails under persistent perturbation and shows obvious lateral drifts without manual tuning.
6.1.4.2. Running Experiments
The versatile running policies achieve impressive real-world feats.
- Running a 400-meter Dash (Fig. 20):
  - The robot successfully completes a 400-meter dash in 2 minutes and 34 seconds.
  - It smoothly transitions from standing to running, accelerates to its average estimated cruising speed, and maintains the desired speed while responding to varying turning commands (MAE of 5.95 degrees).
  - Substantial flight phases are evident, distinguishing it from fast walking. Lateral running skills learned during training allow correction of lateral drifts.
  - The following figure (Figure 20 from the original paper) shows the 400-meter dash demonstration and recorded data.
Fig. 20: Cassie completing a 400-meter dash on a standard outdoor running track. (a) Snapshots showing the robot transitioning from standing to running, maintaining a dynamic gait with flight phases, and navigating turns. (b) Recorded data of sagittal velocity, lateral velocity, and turning yaw angle. The robot's peak and average speeds are reported while it tracks the commands.
- Tracking Varying Commands while Running (Fig. 21):
  - The same versatile policy reliably tracks varying sagittal velocity (Fig. 21a) and lateral velocity (Fig. 21b).
  - Command changes in one dimension do not affect control performance in the others, indicating decoupled control during fast running.
  - The robot performs a 90-degree sharp turn in 2 seconds (5 steps) with a natural running gait (Fig. 21c), despite not being specifically trained for sharp turns (only smooth turning rates). This demonstrates task generalization.
  - The following figure (Figure 21 from the original paper) illustrates tracking variable sagittal velocity, lateral velocity, and sharp turning.
Fig. 21: Versatile running policy tracking variable commands: (a) sagittal velocity, (b) lateral velocity, and (c) performing a 90-degree sharp turn in the real world. A command change in one dimension does not affect control performance in the others, showing decoupled control. The sharp-turn scenario was not specifically trained, demonstrating task generalization.
- Running a 100-meter Dash (Fig. 22, Table V):
  - Achieves a fastest time of 27.06 seconds (Table V).
  - Transitions from a stationary stand to a fast running gait within 1.8 seconds (Fig. 22a) with aggressive maneuvers.
  - Reaches its peak estimated speed during the cruising phase, showing notable flight phases.
  - The following figure (Figure 22 from the original paper) shows the 100-meter dash demonstration.
Fig. 22: Cassie completing a 100-meter dash in the real world. (a) Transition from a stationary stance to a rapid running gait within 1.8 seconds. (b) Cruising phase, maintaining a fast running gait at the peak estimated speed. (c) Recorded sagittal velocity during the dash, showing estimated and desired speeds. The robot completed the dash in 27.06 seconds.
The following are the results from Table V of the original paper:

| Trial | Completion Time (s) |
| --- | --- |
| 1 | 27.06 |
| 2 | 27.99 |
| 3 | 28.28 |
- Running on Uneven Terrains (Trained) (Fig. 23):
  - Successfully traverses terrains with different slopes (inclined and lateral sections) without explicit terrain height estimation or external sensors.
  - Maintains a stable running gait with flight phases even on challenging terrains, avoiding degradation to walking. This is the first demonstration of running (with flight phases) over large uneven terrains by a human-sized bipedal robot.
  - The following figure (Figure 23 from the original paper) shows running on uneven terrains.
Fig. 23: Cassie running on different types of terrains with different slopes. (a) Snapshots of the robot traversing an incline, a lateral slope, and another incline while maintaining flight phases. (b) Phase plots of the left/right thigh during terrain traversal, showing consistent gait adaptation.
- Robust Running Maneuvers (Fig. 24):
  - Recovers from an abrupt impulse perturbation (a safety cord causing a speed drop, leaning, and twisting) by maintaining stability and quickly returning to a stable running gait, owing to training on simulated perturbations and diverse running tasks.
  - The following figure (Figure 24 from the original paper) illustrates robust running maneuvers.
Fig. 24: Robust running maneuvers in the real world. (a) Recovery from an abrupt impulse perturbation while running at high speed. (b) Compensation for a lateral perturbation by exerting a lateral running gait. These demonstrate the robustness of versatile running policies against unexpected disturbances.
  - Compensates for lateral perturbations by exerting a lateral running gait.
6.1.4.3. Jumping Experiments
The versatile jumping policies achieve a large variety of different bipedal jumps.
- Jump and Turn (Flat-ground Policy) (Fig. 25a, Fig. 26c, Fig. 35):
  - Executes various target jumps by just changing the command: jumping in place while turning, jumping backward, and jumping forward.
  - Adjusts the take-off pose (leaning backward for rear targets, forward for forward jumps).
  - Capable of multi-axis jumps (forward, lateral, and turning simultaneously) (Fig. 26c).
  - The following figure (Figure 25 from the original paper) shows jump-and-turn and jumps to elevated platforms.
Fig. 25: Versatile jumping capabilities of Cassie in the real world without any tuning. (a) Flat-ground policy executing various target jumps: (i) in-place jump with a 60° turn, (ii) backward jump, (iii) forward jump. (b) Discrete-terrain policy jumping to elevated platforms: (i) high jump, (ii) long jump to a forward, elevated target, (iii) forward jump to an elevated target.
- Jump to Elevated Platforms (Discrete-terrain Policy) (Fig. 25b, Fig. 26):
  - Jumps precisely to targets at different positions and elevations.
  - Achieves standing long jumps over 1.4 meters and standing high jumps to a 0.44-meter elevated platform. These are novel capabilities for human-sized bipedal robots.
  - Adjusts take-off maneuvers for different targets and manages angular momentum after landing impacts.
  - The following figure (Figure 26 from the original paper) shows diverse bipedal jumps.
Fig. 26: Diverse bipedal jumps demonstrated in the real world. (a-c) Examples using the flat-ground policy: (a) lateral jump, (b) diagonal jump (ahead and to the left), (c) complex jump (forward, lateral, and turning simultaneously). (d-g) Examples using the discrete-terrain policy jumping to elevated platforms: (d) in-place jump onto an elevated platform, (e-g) forward jumps to platforms at different distances and elevations.
- Robust Jumping Maneuvers (Fig. 27):
  - Under an impulse perturbation at the jump's apex, the robot deviates and lands in an unstable pose (backward lean, toes pitched up).
  - It then executes a backward hop (a behavior learned through task randomization) to correct its pose and achieve a more stable landing configuration. This is a successful real-world recovery from a perturbation during a jump.
  - The following figure (Figure 27 from the original paper) shows robust jumping maneuvers.
Fig. 27: Robust jumping maneuvers in the real world. (a) Impulse perturbation applied at the apex of an in-place jump. (b) The robot's response: deviation from the nominal trajectory, an unstable landing pose, and a subsequent backward hop to correct its position and achieve a more stable configuration. This illustrates task generalization for agile recovery from unforeseen perturbations.

Emergent Behaviors:
- Online Contact Strategy: The robot develops its own contact strategy online, deviating from the reference motions and enhancing stability (e.g., small hops after landing in jumping, varying double-support phases in walking, not strictly following periodic contact in running). This aligns with contact-implicit optimization but is achieved online on a real robot.
- Unified Control Policy Challenges: Combining a highly dynamic skill (aperiodic jumping) with a stationary skill (standing) within a single policy can lead to oscillations in the stationary phase, indicating the challenge of learning a single unified policy across vastly different dynamic characteristics.
6.2. Data Presentation (Tables)
All relevant tables are transcribed and presented in Section 5.4. The dynamics randomization table (Table IV) is included in Section 4.2.7.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Ablation on Action Filter (LPF)
The following figure (Figure 28 from the original paper) shows the ablation study on the use of Low Pass Filter (LPF).
Fig. 28: Ablation study on the use of Low Pass Filter (LPF) as an action filter for the training of the in-place jumping skill from scratch. Without using the LPF, the training return (blue curve) is much lower than the one using LPF (ours, red curve). These two policies are obtained by using the exact hyperparameters and training settings. The underlying reason is that it is harder for the RL-based policy to damp out high-frequency jittering motion without the use of LPF.
- An LPF applied after the policy output helps smooth the actions (a minimal sketch is given after this list).
- Training a jumping-in-place policy without the LPF results in worse learning performance (a lower converged return) due to jittering motion.
- The LPF reduces the need for excessively high smoothing reward weights, which could otherwise lead to suboptimal stationary behaviors rather than dynamic skill learning.
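A minimal sketch of a first-order low-pass filter applied to the policy's action before it reaches the joint PD controllers; the cutoff frequency, loop rate, and action dimension here are illustrative assumptions rather than the paper's exact filter parameters.

```python
import numpy as np

class ActionLowPassFilter:
    def __init__(self, cutoff_hz=10.0, loop_hz=2000.0, action_dim=10):
        # Exponential-smoothing coefficient derived from the RC cutoff frequency.
        dt = 1.0 / loop_hz
        rc = 1.0 / (2.0 * np.pi * cutoff_hz)
        self.alpha = dt / (dt + rc)
        self.prev = np.zeros(action_dim)

    def __call__(self, raw_action):
        # Smooth out high-frequency jitter in the raw policy output.
        self.prev = self.alpha * np.asarray(raw_action) + (1.0 - self.alpha) * self.prev
        return self.prev
```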
6.3.2. Comparison of Different History Lengths
The following figure (Figure 29 from the original paper) shows the learning performance using different lengths of I/O history.
Fig. 29: Learning performance using different lengths of the robot's I/O history when training a single-task running policy with dynamics randomization. All of these policies used the proposed dual-history-based policy. When increasing the explicit length of robot history from 1 second (pink curve), 2 seconds (red curve, as used in this work), and 3 seconds (blue curve), we observe an increase in learning performance. However, if we keep increasing the history length, such as to 4 seconds (dark blue curve), the improvement of the learning performance may get saturated.
- Increasing the history length (e.g., from 1 s to 2 s, or 2 s to 3 s) generally enhances learning performance, as it provides more information for state estimation and dynamics parameter inference.
- However, continuing to increase the history length can lead to saturation (e.g., a 4 s history), as it may introduce redundant information that the robot needs to filter out.
- A 2-second history length is found to perform consistently well across skills. (A minimal sketch of such a history buffer follows this list.)
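The following is a minimal sketch of a fixed-length I/O history buffer whose window length can be varied for this ablation; the policy rate and observation/action dimensions are illustrative assumptions.

```python
from collections import deque
import numpy as np

class IOHistory:
    def __init__(self, seconds=2.0, policy_hz=33, obs_dim=42, act_dim=10):
        self.length = int(seconds * policy_hz)
        zero = np.zeros(obs_dim + act_dim)
        # Fixed-size buffer: old entries fall off as new ones are pushed.
        self.buf = deque([zero] * self.length, maxlen=self.length)

    def push(self, obs, prev_action):
        # Each entry pairs the latest observation with the previous action.
        self.buf.append(np.concatenate([obs, prev_action]))

    def as_array(self):
        # Shape (history_len, obs_dim + act_dim), consumed by the history encoder.
        return np.stack(self.buf)
```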
6.3.3. Comparison of Different Temporal Encoders
The following figure (Figure 30 from the original paper) compares learning performance using TCN and LSTM encoders with and without the dual-history approach.
Fig. 30: Learning performance using different neural network architectures to encode the robot's I/O history when training a single-task walking policy with dynamics randomization. For both Long Hist. Only methods, we still provide explicit immediate state feedback alongside the temporal encoder. As shown in Fig. 30a, using the proposed dual-history approach by providing an explicit short I/O history alongside the TCN encoder, the learning performance is much better than the TCN only. The TCN encodes 2-second robot I/O history and has 3 layers with filter sizes of [34,34,34], a kernel size of 5, a dilation base of 2, and a stride size of 2, with ReLU activation, as suggested in [71]. Fig. 30b shows that the dual-history approach will not help with the LSTM-based policy. The LSTM encoder has 1 layer of 128 units. However, both TCN with dual-history approach and Long Hist. Only perform better than the LSTM-based policy while using the hyperparameters tuned for LSTM. It suggests that LSTM may only learn to leverage a recent short history and converge to a more suboptimal policy.
- TCN (Non-recurrent): The dual-history approach significantly improves learning performance for TCN encoders, consistent with the 1D CNN results.
- LSTM (Recurrent): The dual-history approach does not significantly aid learning for LSTM-based policies. The LSTM tends to converge to suboptimal policies and is sensitive to hyperparameter tuning across different MDPs. LSTM-based policies also struggle to learn highly dynamic skills like jumping (Fig. 31), highlighting this sensitivity.
A minimal sketch of a dual-history policy with a TCN-style encoder is given below.
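This PyTorch sketch illustrates the dual-history idea with a TCN-style long-history encoder. The conv-layer sizes follow the figure caption (three layers of 34 filters, kernel size 5, dilation base 2, stride 2, ReLU); the added padding, the I/O dimension, the history lengths, and the MLP widths are assumptions made so the example runs end to end.

```python
import torch
import torch.nn as nn

class DualHistoryPolicy(nn.Module):
    def __init__(self, io_dim=52, long_len=66, short_len=4, act_dim=10):
        super().__init__()
        # Long-history encoder: 3 dilated 1D conv layers, 34 filters each,
        # kernel size 5, dilation base 2, stride 2, ReLU (padding is an assumption).
        layers, in_ch = [], io_dim
        for i in range(3):
            layers += [nn.Conv1d(in_ch, 34, kernel_size=5, stride=2,
                                 dilation=2 ** i, padding=2 * 2 ** i), nn.ReLU()]
            in_ch = 34
        self.long_encoder = nn.Sequential(*layers)
        # MLP base consuming the encoded long history plus the raw short history.
        self.base = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, long_hist, short_hist):
        # long_hist: (B, long_len, io_dim); short_hist: (B, short_len, io_dim)
        z = self.long_encoder(long_hist.transpose(1, 2)).flatten(1)
        return self.base(torch.cat([z, short_hist.flatten(1)], dim=-1))

policy = DualHistoryPolicy()
action = policy(torch.randn(8, 66, 52), torch.randn(8, 4, 52))  # (8, 10)
```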
6.3.4. Latent Visualization of Walking Policy
The following figure (Figure 32 from the original paper) shows latent visualization and adaptivity for the walking policy.
Fig. 32: Adaptivity of the walking policy demonstrated by latent representations. (a) Recorded latent representation after the long-term I/O history encoder during walking. The lower part of the figure compares two selected dimensions (marked as red lines) with the recorded impact forces on each of the robot's feet. (b) The blue plot shows the robot's latent representation with default dynamics parameters during walking. The red plots indicate changes in the same region under different dynamics. Despite significant environment changes, control performance metrics like task completion error and motion tracking error show little change.
- The latent embedding from the walking policy (Fig. 32a) also captures time-variant changes (external perturbations, contact events) and shows a periodic pattern.
- Time-invariant dynamics shifts (Fig. 32b) cause changes in the latent embedding, but control performance remains largely unaffected, confirming the adaptivity.
6.3.5. Saliency Map Analysis
- Saliency maps of the MLP base (which produces the final action) show that the robot focuses more on the short I/O history, especially the most recent observation, for both running and walking. This supports the importance of the direct short-history input.
- Saliency maps on the encoded long history show that different parts of the long-history embedding are attended to under external perturbation, indicating its use in adjusting the actions. This confirms that the long I/O history provides useful context. (A minimal gradient-based saliency sketch follows.)
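A minimal sketch of an input-gradient saliency analysis, assuming a PyTorch policy such as the DualHistoryPolicy sketch above. Note that this computes sensitivity with respect to the raw history inputs, whereas the paper reports saliency of the MLP base with respect to the short history and the encoded long history; the idea is the same.

```python
import torch

def saliency(policy, long_hist, short_hist):
    """Gradient magnitude of the action norm w.r.t. each input entry."""
    long_hist = long_hist.clone().requires_grad_(True)
    short_hist = short_hist.clone().requires_grad_(True)
    action = policy(long_hist, short_hist)
    # Reduce the action vector to a scalar so a single backward pass suffices.
    action.norm(dim=-1).sum().backward()
    # Larger values indicate inputs the policy is more sensitive to.
    return long_hist.grad.abs(), short_hist.grad.abs()
```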
6.3.6. Estimator Errors in High-Speed Running
The following figure (Figure 34 from the original paper) illustrates estimation errors in high-speed running.
Fig. 34: The large estimation error of the robot's onboard velocity estimator (based on an EKF) during high-speed running in simulation. The robot is controlled by the proposed running policy to track variable commands (black dashed line); the estimated sagittal velocity is recorded as the red line, while the robot's actual running speed is recorded as the blue line. The ground-truth speed is obtained from the simulator. Although accurate at slow speeds, the estimated velocity shows a significant error in the high-speed region compared to the ground truth, and the robot's actual speed tends to form the upper envelope of the estimated speed. In real-world experiments, only the state estimator's output is available to report, such as the running speed tracking results in Fig. 20b. This comparison indicates that the robot's actual running speed in the real world is faster than the reported estimate and closer to the command.
- The onboard EKF velocity estimator shows large estimation errors for the sagittal velocity during high-speed running compared to the ground truth; the actual speed tends to be higher than the estimated speed.
- This implies that the robot's actual running speed in real-world experiments is faster and closer to the command than the reported estimated values suggest, indicating even better tracking performance.
- This highlights the necessity of training with an inaccurate estimator to enable sim-to-real transfer, and points to developing reliable state estimators as future work. (A minimal sketch of quantifying the estimator gap follows this list.)
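The following is a minimal sketch of how the estimator gap shown in Fig. 34 could be quantified by comparing the onboard-estimated sagittal velocity against the simulator's ground truth; the high-speed threshold is an illustrative assumption.

```python
import numpy as np

def velocity_estimation_error(v_est, v_true, high_speed_threshold=3.0):
    """Summarize EKF velocity error, overall and in the high-speed region."""
    v_est, v_true = np.asarray(v_est), np.asarray(v_true)
    err = v_true - v_est
    high = v_true > high_speed_threshold
    return {
        "mae_all": float(np.abs(err).mean()),
        "mae_high_speed": float(np.abs(err[high]).mean()) if high.any() else 0.0,
        # A positive bias means the true speed exceeds the estimate, i.e. the
        # actual speed forms the upper envelope of the estimated signal.
        "bias_high_speed": float(err[high].mean()) if high.any() else 0.0,
    }
```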
6.4. Data Presentation (Figures)
All figures are integrated into the text at their most relevant points, as per the instructions.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work makes significant strides in bipedal robot locomotion control by introducing an RL-based framework that yields versatile, robust, and dynamic controllers. The core innovation lies in its dual-history architecture, which effectively integrates both long-term and short-term I/O history to enhance adaptivity. The framework employs a multi-stage training strategy that includes single-task training, task randomization, and dynamics randomization. Key findings show that the long I/O history encoder implicitly captures time-variant events (like contact forces) and time-invariant dynamics shifts, allowing the controller to adapt without explicit model parameters. Furthermore, task randomization is demonstrated as a crucial, orthogonal source of robustness, enabling task generalization and compliant recovery maneuvers from unforeseen disturbances, a capability distinct from dynamics randomization.
The effectiveness of the proposed method is rigorously validated on Cassie, a torque-controlled human-sized bipedal robot, through extensive real-world experiments. These demonstrations include:
- Robust standing and versatile walking with consistent performance over long periods (over a year).
- Fast running, including a 400-meter dash and running over challenging uneven terrains with flight phases.
- A diverse repertoire of jumping skills, such as standing long jumps (1.4 meters) and high jumps (0.44 meters elevated).

The work successfully bridges the sim-to-real gap and pushes the limits of agility for human-sized bipedal robots.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Learning a Unified Control Policy: While the current work achieves skill-specific versatile policies, learning a single, truly unified policy that encompasses all the different locomotion skills (walking, running, jumping, and potentially manipulation) remains challenging. The current approach of combining skills (e.g., standing with walking) can lead to catastrophic forgetting if not carefully managed.
- Adversarial Motion Prior (AMP): While AMP could potentially help learn unified policies without explicit motion-tracking rewards, applying it to aggressive real-world bipedal locomotion is challenging due to mode-collapse issues in GAN-styled methods and the large sim-to-real gap.
- Continual RL / Imitation Learning: Continual RL (to keep learning new skills) or imitation learning from offline datasets could be avenues toward unified policies, but their robustness and sim-to-real transfer for bipedal robots are open questions.
- Precision vs. Generalization: While the policies demonstrate generalization and adaptivity, achieving perfectly precise control (e.g., minimal errors in fast-running tasks) with a single policy handling wide variations remains an open question. There is a trade-off between generalization and precision.
- Oscillations in Standing: After large jumps, the robot occasionally oscillates while standing. This indicates the difficulty of learning both dynamic aperiodic skills and stationary skills with a single policy.
- Reliable State Estimators: The paper highlights large estimation errors from the onboard velocity estimator during high-speed running. Developing more reliable state estimators for dynamic bipedal locomotion skills using RL is an interesting future direction.

Future work suggestions include:
- Humanoid Robots: Extending the method to humanoid robots that leverage upper-body motions for agility and stability.
- Depth Vision Integration: Integrating depth vision directly into the locomotion controller by adding an additional depth encoder alongside the I/O history encoder.
- Loco-Manipulation Tasks: Combining bipedal locomotion with bimanual manipulation to tackle long-horizon loco-manipulation tasks.
7.3. Personal Insights & Critique
This paper represents a significant step forward in bipedal locomotion control. The dual-history architecture is a clever and effective design choice that addresses a fundamental challenge in RL-based control: how to best incorporate past information for adaptivity and real-time responsiveness. The empirical validation of I/O history for implicit system identification and contact estimation is a powerful demonstration of RL's emergent capabilities.
The most profound insight, in my opinion, is the explicit identification and emphasis on task randomization as a source of robustness. This idea, that exposing a robot to a wide range of tasks rather than just dynamics variations leads to more flexible and compliant behaviors, is very intuitive yet often overlooked in the RL community's focus on dynamics randomization. It suggests a shift in how we think about robustness in RL, moving from merely hardening against noise to fostering intelligent, adaptable reactions grounded in a broader behavioral repertoire. This could inspire new curriculum learning strategies for complex robotic tasks.
The real-world experiments on Cassie are exceptionally impressive, particularly the long-term consistency, the 400-meter dash, uneven terrain running, and diverse jumping feats. These demonstrations are a strong testament to the practical applicability and superior performance of the proposed method.
Potential Issues / Areas for Improvement:
- Explainability of Latent Space: While the paper shows that the latent space changes with dynamics shifts and correlates with contact events, a deeper understanding of which specific dynamics parameters or environmental features are encoded in each latent dimension could provide more interpretability and potentially guide future controller designs.
- Trade-off between Generalization and Precision: The paper briefly mentions this trade-off. While versatility is critical, many industrial or mission-critical applications also require extreme precision. Future work could explore integrating precision-focused fine-tuning or hierarchical control on top of these generalized policies.
- Unified Policy for All Skills: The current approach provides skill-specific versatile policies. While effective, the ultimate goal of a single policy for all locomotion and potentially manipulation remains a grand challenge. The paper acknowledges this, and the issue of catastrophic forgetting when simply combining skills with current methods is significant. Exploring more advanced continual learning or multi-task learning architectures would be crucial here.
- State Estimation Reliance: The reliance on an EKF for linear velocity estimation, which shows significant errors at high speeds, highlights a potential vulnerability. While the RL policy adapts to this noisy input, an RL-based state estimator learned jointly with the control policy could further improve performance and robustness, especially for highly dynamic tasks.
- Hyperparameter Sensitivity: The paper notes the LSTM's sensitivity to hyperparameters. While CNN-based non-recurrent policies seem more robust in this regard, a deeper analysis of the hyperparameter landscape for different history-encoder types could further inform design choices for future RL-based locomotion.

Overall, this paper provides a robust and highly impactful contribution to legged locomotion, setting new benchmarks and offering valuable insights into the design of adaptive and robust RL controllers for complex bipedal robots.