
Design and Control of a Bipedal Robotic Character

Published: 01/09/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces a bipedal robot that integrates expressive artistic movements with robust dynamic mobility for entertainment applications, utilizing a reinforcement learning control architecture to perform complex actions based on command signals, enhanced by an animation engine and an intuitive operator interface for real-time show performances.

Abstract

Legged robots have achieved impressive feats in dynamic locomotion in challenging unstructured terrain. However, in entertainment applications, the design and control of these robots face additional challenges in appealing to human audiences. This work aims to unify expressive, artist-directed motions and robust dynamic mobility for legged robots. To this end, we introduce a new bipedal robot, designed with a focus on character-driven mechanical features. We present a reinforcement learning-based control architecture to robustly execute artistic motions conditioned on command signals. During runtime, these command signals are generated by an animation engine which composes and blends between multiple animation sources. Finally, an intuitive operator interface enables real-time show performances with the robot. The complete system results in a believable robotic character, and paves the way for enhanced human-robot engagement in various contexts, in entertainment robotics and beyond.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is the "Design and Control of a Bipedal Robotic Character," focusing on unifying expressive, artist-directed motions with robust dynamic mobility for entertainment applications.

1.2. Authors

The authors are Ruben Grandia*, Espen Knoop*, Michael A. Hopkins†, Georg Wiedebach†, Jared Bishop‡, Steven Pickles‡, David Müller*, and Moritz Bächer*. Their affiliations are:

  • *Disney Research, Switzerland

  • †Disney Research, USA

  • ‡Walt Disney Imagineering R&D, USA

    This indicates a collaborative research effort primarily from Disney's research and development divisions, suggesting a strong focus on applications within entertainment robotics.

1.3. Journal/Conference

The paper is published on arXiv, a preprint server. While not a peer-reviewed journal or conference in its current form, arXiv is a widely respected platform for disseminating research rapidly. The affiliations with Disney Research and Walt Disney Imagineering R&D suggest that the work is likely intended for a high-impact robotics or computer graphics conference or journal, given the expertise of the authors in these fields.

1.4. Publication Year

The paper was published on 2025-01-09, indicating it is a very recent work.

1.5. Abstract

The paper addresses the challenge of designing and controlling legged robots for entertainment, where appealing to human audiences requires expressive, artist-directed motions alongside robust dynamic mobility. It introduces a novel bipedal robot with mechanical features driven by character intent. The core methodology involves a reinforcement learning (RL)-based control architecture that executes artistic motions robustly, conditioned on command signals. These command signals are generated at runtime by an animation engine that composes and blends various animation sources. An intuitive operator interface allows for real-time show performances. The complete system aims to create a believable robotic character, enhancing human-robot engagement in entertainment and other contexts.

Official Source Link: https://arxiv.org/abs/2501.05204 PDF Link: https://arxiv.org/pdf/2501.05204v1.pdf Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the gap between the remarkable dynamic locomotion capabilities of legged robots in challenging terrains and their limited ability to perform expressive, human-appealing motions required for entertainment applications. Most existing robotic systems prioritize utility and efficiency, leading to impressive feats in navigation and obstacle courses. However, as robots increasingly interact directly with humans in fields like collaborative robotics, companion robots, art, and entertainment, their success becomes dependent on subjective human perception. This introduces additional challenges: robots must not only satisfy complex kinematic and dynamic balancing requirements for whole-body locomotion but also perform motions that are expressive and "believable" to human audiences.

Prior research has explored animating robots (e.g., applying animation principles, learning from motion capture), but significant effort is often required to transfer these from simulation to reality. Other studies focus on sparking emotional responses through facial expressions or body language, but often with robots that have limited mobile capabilities. Existing humanoid platforms, while versatile, are typically designed as general-purpose platforms, and their expressive demonstrations are often separate projects, lacking a unified artistic vision from mechanical design to motion.

The paper's entry point is to unify expressive, artist-directed motions with robust dynamic mobility for legged robots. Its innovative idea is to co-develop the robot's mechanical design and its motion from an artistic vision, creating a unified character from the ground up, rather than adapting existing general-purpose robots. This character-driven approach seeks to overcome the limitations of purely utility-driven designs for entertainment contexts.

2.2. Main Contributions / Findings

The paper presents several primary contributions:

  • A Novel Workflow for Robotic Character Development: It introduces a comprehensive workflow that tightly integrates animation content creation, mechanical design, reinforcement learning (RL) control, and real-time puppeteering. This pipeline enables the rapid development of custom robotic characters that are both expressive and robust.

  • Character-Driven Mechanical Design: The paper presents a new bipedal robot whose morphology and kinematics (how it moves) are primarily dictated by creative intent and simplicity, rather than purely functional requirements like speed or load capacity. This is a significant departure from conventional robot design, where utility often takes precedence.

  • Divide-and-Conquer RL Control Architecture: The work proposes a reinforcement learning-based control architecture that breaks down complex behaviors into separate categories: perpetual motions (e.g., standing), periodic motions (e.g., walking), and episodic motions (e.g., specific emotional expressions). Each motion type is controlled by a specialized policy, conditioned on high-level command signals, allowing for robust execution of artist-authored animations under real-world conditions.

  • Intuitive Puppeteering Interface: A user-friendly interface is developed that leverages conditional policy inputs. This interface allows an operator to perform real-time robot shows by layering, blending, and switching between different motion elements (background animations, triggered animations, and joystick-driven commands), thereby creating believable and interactive performances.

    The key findings demonstrate that the complete system results in a believable robotic character capable of executing a wide range of expressive motions robustly. This approach paves the way for enhanced human-robot engagement, particularly in entertainment robotics, but also suggests broader applicability for more expressive autonomous robots. The system successfully addresses the challenge of combining artistic expressiveness with the dynamic stability required for legged locomotion.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp this paper, a beginner needs to understand several core concepts:

  • Legged Robotics:
    • Dynamic Locomotion: The ability of robots to move in a way that involves continuous changes in balance, often with phases of flight (e.g., running, jumping) or complex foot-ground interactions. Unlike static walking, dynamic locomotion is inherently unstable but allows for faster and more agile movements.
    • Balancing: The act of maintaining an upright posture and stability while moving or standing. For legged robots, this often involves precise control of the robot's center of mass (CoM) relative to its support polygon (the area on the ground enclosed by the contact points of its feet).
    • Degrees of Freedom (DoF): The number of independent parameters that define the configuration of a mechanical system. For a robot, this usually refers to the number of joints it has, each allowing movement along one or more axes (e.g., a hinge joint has 1 DoF, a spherical joint has 3 DoF). More DoF generally mean more kinematic dexterity.
  • Reinforcement Learning (RL):
    • Agent-Environment Interaction: RL involves an agent (the robot controller) learning to make decisions by interacting with an environment (the simulated or real world). The agent performs actions, receives observations (state) from the environment, and gets rewards for its actions.
    • Policy (π\pi): A strategy or function that the agent uses to map observed states to actions. In this paper, policies are often neural networks that take the robot's state and commands as input and output desired joint positions.
    • State (sts_t): A comprehensive description of the environment at a given time tt. For a robot, this includes joint positions, velocities, torso orientation, linear/angular velocities, previous actions, etc.
    • Action (ata_t): The output of the policy, which the agent executes in the environment. In this paper, actions are joint position setpoints for proportional-derivative (PD) controllers.
    • Reward (rtr_t): A scalar feedback signal from the environment that indicates how good or bad the agent's last action was. The goal of the agent is to maximize its cumulative reward over time. In this paper, rewards are designed to encourage imitation of reference motions, dynamic balancing, and discourage undesired behaviors (e.g., self-collision).
    • Proximal Policy Optimization (PPO): A popular and robust algorithm for training RL agents. PPO is an on-policy algorithm, meaning it learns from data generated by the current policy. It updates the policy iteratively, trying to take the largest possible improvement step without collapsing the policy's performance. It achieves this by clipping the objective function to prevent excessively large policy updates.
  • Kinematics and Dynamics:
    • Kinematics: The study of motion without considering the forces that cause it. It deals with positions, velocities, and accelerations of robot parts (joints, end-effectors) and their relationships.
    • Dynamics: The study of motion that considers the forces and torques causing it. This involves mass, inertia, gravity, and external forces, which are crucial for simulating realistic robot behavior and ensuring dynamic stability.
    • Inverse Dynamics: A method to calculate the joint torques required to achieve a desired motion (positions, velocities, accelerations) for a robot, considering its mass properties and external forces.
  • Actuators:
    • Quasi-Direct Drive Actuators: Motors that have a relatively low gear ratio compared to traditional geared motors. This allows them to have high bandwidth (fast response), low impedance (can be back-driven easily), and good force control, which is beneficial for dynamic and compliant interaction in legged robots.
    • Proportional-Derivative (PD) Control: A common feedback control loop mechanism. A PD controller calculates an output (e.g., motor torque) based on the current error (difference between desired and actual state) and the derivative of the error (rate of change of the error). The proportional (P) term provides a response proportional to the error, while the derivative (D) term helps dampen oscillations and anticipate future errors. A minimal code sketch follows this list.
  • Animation Principles:
    • Rigging: The process of creating a skeletal system (rig) for a 3D model, allowing animators to deform and pose it. A rig consists of bones and joints that define the range of motion.
    • Composition and Blending: Techniques used in animation to combine multiple animation sources or segments. Composition involves arranging different animated elements, while blending smoothly transitions between them, creating more natural and complex motions. Slerp (spherical linear interpolation) is a common blending method for orientations (quaternions) that ensures shortest path and constant speed.
    • Path Frame: A coordinate system that moves and rotates with the robot's intended path or trajectory. It helps in defining and controlling motions relative to the robot's current progression, making movements consistent even when the global position changes.
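Since PD position control is the interface through which every learned policy drives the joints in this work (policies output setpoints rather than torques), a minimal Python sketch may help. The gain values below are placeholders, not the robot's identified parameters (those appear later in Table VI).

```python
def pd_torque(q_setpoint, q, q_dot, kp=15.0, kd=0.6):
    """Proportional-derivative joint torque.

    q_setpoint : desired joint position (the policy action in this paper)
    q, q_dot   : measured joint position and velocity
    kp, kd     : illustrative gains; the identified values differ per actuator type
    """
    return kp * (q_setpoint - q) - kd * q_dot
```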

3.2. Previous Works

The paper contextualizes its work by referencing several key areas in robotics:

  • High-Performance Dynamic Locomotion: Early work focused on robots for utility, leading to systems like those that can hike mountains [28] or conquer obstacle courses [13]. These often use model-based control approaches.
  • Expressive Motions on Utility Robots: Some robots, originally designed for dynamic locomotion (e.g., Boston Dynamics robots), have been repurposed for expressive tasks like dancing [3, 1]. This shows the potential but highlights that expressiveness wasn't their primary design goal.
  • Robots for Social Interaction (HRI): Robots specifically designed for human interaction often prioritize social cues over dynamic mobility. Examples include humanoids like iCub [27], NAO [9], and Pepper [31], which are widely used for HRI research and motion control. However, these are often general-purpose platforms, and their motion repertoire might not be tied to a specific character vision.
  • Animating Robots (Artistic Control):
    • Principles of Animation: Van Breemen [45] proposed applying classical animation principles to make robots more lifelike. He also developed an animation engine for composing and blending animations based on external commands [44].
    • Software Architectures for Entertainment Robots: Fujita et al. [8] outlined software architectures for robots like the quadrupedal AIBO [7] and the humanoid SDR-4X [6]. Choregraphe [35] provided a graphical tool for easier behavior programming on NAO. These works laid the groundwork for blending and composing motions based on operator input.
  • Learning-Based Control for Legged Systems:
    • Imitation Learning from Animation/Motion Capture: Reinforcement Learning (RL) has become a popular method to synthesize closed-loop control policies directly from reference motions [32, 33]. This allows robots to learn complex skills from human or artist-generated data.
    • Combining Model-Based and RL: RL can also imitate solutions from model-based motion planners [19] or gait libraries [25].
    • Scaling RL to Diverse Motions: Techniques like mixture-of-experts or adversarial rewards [34, 5] have been explored to handle large and diverse motion datasets, overcoming the need for explicit tracking rewards.
  • Model-Based Optimization: Approaches like Model Predictive Control (MPC) [4, 41, 30, 24, 10] are used to convert animations into dynamically feasible trajectories and stabilize the system online. These are crucial for ensuring physical plausibility.

3.3. Technological Evolution

The field of legged robotics has evolved from a primary focus on utility and efficiency (e.g., navigation in unstructured terrain, robust locomotion) to increasing interest in human-robot interaction and expressiveness. Early robots were engineered to perform tasks, often in isolation from humans. The mid-2000s saw the rise of more general-purpose humanoid robots (iCub, NAO, Pepper) designed for social interaction, but their mechanical designs were often generic, and their expressive motions were often developed separately from the hardware.

More recently, advancements in reinforcement learning, particularly deep reinforcement learning, have enabled robots to learn complex, agile, and dynamic behaviors from data, bridging the sim-to-real gap. This has opened avenues for imitating artist-generated motions with high fidelity.

This paper's work fits into the current trajectory of integrating these advancements. It pushes the boundary by explicitly prioritizing creative intent in the mechanical design itself, rather than adapting an existing utility-focused platform. It then leverages advanced RL techniques to robustly execute artist-authored animations, combining these with real-time puppeteering to create a new paradigm for expressive robotic characters.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Character-Driven Mechanical Design: Unlike most legged robots (e.g., ANYmal, Boston Dynamics robots) designed primarily for functional objectives (speed, terrain navigation, efficiency), or general-purpose humanoids (iCub, NAO, Pepper), this paper introduces a robot whose morphology and kinematics are explicitly driven by artistic vision and creative intent. The "knee joints bend backwards as creatively envisioned" is a prime example of this philosophy. This contrasts with adapting existing hardware for expressive tasks.

  • Co-development of Design and Motion: The paper emphasizes that the mechanical design and the motion repertoire are "co-developed and driven by an artistic vision to create a unified character." This integrated approach ensures that the robot's physical form supports and enhances its expressive capabilities from the outset, rather than trying to force expressive motions onto a functionally optimized body.

  • Hybrid Control Architecture with Specialized RL Policies: While other works use RL for imitation learning [32, 33] or cluster motions [25], this paper adopts a "divide-and-conquer" strategy. It trains separate policies for distinct motion types (perpetual, periodic, episodic), each conditioned on carefully chosen high-level commands. This allows for specialized, robust control for each behavioral mode, which are then seamlessly blended and switched at runtime by a higher-level animation engine. This contrasts with approaches striving for a single, monolithic policy or directly tracking velocities which can lead to unnatural motions.

  • Real-time Puppeteering Interface: The paper introduces an intuitive puppeteering interface that combines fixed artist-created motions with real-time operator input (joystick, button triggers). This allows for dynamic, interactive, and "unscripted" performances that balance the precision of authored animations with the responsiveness of live control, providing more nuanced human-robot interaction than purely autonomous or pre-scripted systems.

  • Integration of "Show Functions": The robot incorporates specific show functions (actuated antennas, illuminated eyes, head lamp, speakers) that are critical for character expression but do not affect system dynamics. These are controlled separately and integrated into the animation engine, providing animators with additional expressive means, which is a key element for entertainment applications.

    In essence, this paper's innovation lies in its holistic, character-first approach, from mechanical design to control, explicitly tailored for the nuances of entertainment and believable human-robot engagement.

4. Methodology

4.1. Principles

The core idea behind this work is to create a believable robotic character by unifying artistic expressiveness with robust dynamic mobility. This is achieved through a multi-layered workflow that interweaves mechanical design, animation content creation, reinforcement learning (RL)-based control, and a real-time puppeteering interface. The theoretical basis is that a robot's mechanical form should be driven by its intended character (creative intent), and its movements should be robustly executable in the real world while being highly controllable by an operator.

The intuition is that complex, nuanced character performances are best built from a foundation of specialized, robust control policies for different types of motions, which can then be seamlessly blended and commanded by an animation engine and a human operator. This "divide and conquer" strategy in RL, combined with a character-first design philosophy, aims to overcome the limitations of purely utility-driven robots when deployed in human-centric entertainment contexts.

4.2. Core Methodology In-depth (Layer by Layer)

The overall workflow for character design and control is outlined in Figure 2 of the original paper, which is a diagram illustrating the framework.

Fig. 2 (diagram). Overview of the design and control framework, divided into four main parts: animation content creation (periodic and episodic motions), reinforcement learning (imitation rewards and command randomization with a simulation model), mechanical design (hardware modules), and runtime (real-time performances via remote control and the animation engine).

The framework starts with an iterative process involving mechanical design and animation. Classical animation tools define the character's general proportions and range of motion. A procedural gait generation tool creates physically plausible walking cycles, informing mechanical design by providing joint positions, velocities, and torques. This iterative loop allows for finding a balance between hardware limitations and creative intent.

Once initial motions and mechanical design converge, they are used to define a reinforcement learning (RL) problem. The mechanical design forms the simulation model, incorporating actuator models and domain randomization. Kinematic motion references from animation tools serve as imitation rewards in RL. Multiple policies are trained, each for a specific motion or motion type, conditioned on high-level commands and robust against disturbances.

During run-time, an animation engine receives user input from a remote control interface, generating commands for the control policies and triggering policy switches. Show functions (e.g., visual effects, audio) are synchronized with the robot's motion via the animation engine and state feedback, contributing to the character's expressiveness without directly affecting dynamics.

4.2.1. Mechatronic Design

The robot's mechanical design is fundamentally driven by creative intent and simplicity, rather than purely functional requirements.

The final bipedal robotic character design is shown in Figure 3.

Fig. 3. Mechanical design of our robotic character. The robot has 5 degrees of freedom per leg. The neck and head assembly has 4 degrees of freedom. The torso contains a custom communication board, a battery module, and an IMU. The onboard PC, radio receiver, and show function board are located in the head. Show functions consist of antennas, LED arrays as eyes, and a head lamp. A pair of speakers is located in the head and another pair in the bottom of the torso.

The robot features:

  • Degrees of Freedom (DoF): 5 DoF per leg and a 4 DoF neck and head assembly. This large workspace in the legs allows for dynamic locomotion and lower body motions, while the head can be independently posed.
  • Actuator Placement: Similar to ANYmal [16], actuators are directly placed at joints. Components connecting off-the-shelf actuators are 3D printed to expedite custom robot construction.
  • Ankle Design: The robot lacks active ankle roll actuators. Instead, it has rounded foot soles to allow for passive ankle roll. Urethane foam molding further dampens foot impacts.
  • Knee Joints: The knee joints bend backwards, aligning with the creative vision for the character.
  • Physical Specifications:
    • Total mass: 15.4 kg
    • Torso: 5.8 kg
    • Neck and head: 2.4 kg
    • Each leg: 3.6 kg
    • Height (excluding antennas): 0.66 m
    • Leg nominal length: 0.28 m
    • Leg extended length: 0.34 m
  • Actuator Types:
    • Stronger Actuators: Hip-adduction-abduction, hip-flexion-extension, and knee actuators:
      • Peak torque: 34 Nm
      • Maximum velocity: 20 rad s⁻¹
    • Mid-range Actuators: Hip-rotation, ankle, and lower neck actuators:
      • Peak torque: 24 Nm
      • Maximum velocity: 30 rad s⁻¹
      Both of these types are quasi-direct drive actuators [49], supporting high-bandwidth open-loop torque control suitable for dynamic locomotion.
    • Head Actuators: 3 actuators in the head:
      • High gear ratio
      • Peak torque: 4.8 Nm
      • Maximum velocity: 6.3 rad s⁻¹
  • Electronics:
    • Custom Communication Board: Microcontroller-driven, interfacing the onboard PC, actuators, and an Inertial Measurement Unit (IMU). Communication rate: 600 Hz.
    • Actuator Drives: Integrated, providing low-level control and reporting motor position via built-in encoders.
    • Onboard PC: Communicates with the operator's handheld controller via redundant wireless (WiFi and LoRa radios).
    • Battery: Removable, providing at least 1 hour of continuous operation.
  • Show Functions: A distinctive feature for expressiveness:
    • Actuated antennas
    • Illuminated eyes (LED arrays)
    • Head lamp
      These are treated separately from the main actuators as they don't affect system dynamics and are controlled open-loop, allowing for easy modification during design.
  • Audio: Stereo loudspeakers in both the body and the head.

4.2.2. Reinforcement Learning

The robot is controlled by multiple policies (neural networks that map observations to actions) conditioned on a time-varying control input $g_t$. At each time step $t$, the agent produces an action $a_t$ according to the policy $\pi(a_t \mid s_t, \phi_t, g_t)$.

  • $a_t$: The action taken by the agent at time $t$; these are joint position setpoints for proportional-derivative (PD) controllers.

  • $s_t$: The observable state of the robot at time $t$.

  • $\phi_t$: An optional phase signal, used for periodic and episodic motions.

  • $g_t$: Optional conditional control inputs, i.e., high-level commands from the operator or animation engine.

    The environment (simulation) then produces the next state $s_{t+1}$, updates the phase signal $\phi_t$, and returns a scalar reward $r_t = r(s_t, a_t, s_{t+1}, \phi_t, g_t)$. The reward function encourages close imitation of artist-specified kinematic reference motions and dynamic balancing.
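A minimal, gym-style sketch of this interaction loop, assuming hypothetical `policy` and `env` objects (the names, signatures, and return shapes are illustrative, not the paper's code):

```python
def rollout_step(policy, env, s_t, phi_t, g_t):
    """One agent-environment interaction as described above.

    policy(s, phi, g) -> joint position setpoints a_t for the PD controllers
    env.step(a)       -> next state s_{t+1}, updated phase phi_{t+1}, scalar reward r_t
    """
    a_t = policy(s_t, phi_t, g_t)              # action: joint position setpoints
    s_next, phi_next, r_t = env.step(a_t)      # simulate one control step
    return a_t, s_next, phi_next, r_t
```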

To structure character performances, three motion types are defined based on their temporal properties:

  • Perpetual motions: Do not have a clear start or end. The robot maintains balance and responds to continuous control inputs; no phase signal is required.

  • Periodic motions: Have a repetitive nature (e.g., walking). Policies receive a periodic phase signal that cycles indefinitely.

  • Episodic motions: Have a predefined duration (e.g., a specific dance move or emotion). Policies receive a monotonically-increasing phase signal. A transition to a new motion is forced once it ends.

    Each policy is trained as a separate RL problem. Control inputs ($g_t$) are randomized over their full range during training, enabling robust control with arbitrary inputs and seamless policy switching at runtime. The simulation model is also perturbed with randomized disturbances and model parameters to enhance robustness.
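Two small sketches of the mechanics just described: per-episode randomization of the command over its full range, and the phase signal that cycles for periodic motions but runs out for episodic ones. The wrap-around at 1.0 and the termination condition are assumptions about how the phase is normalized.

```python
import numpy as np

def reset_command(g_low, g_high, rng):
    """Draw a command g uniformly over its full range, so the trained policy
    remains valid for arbitrary runtime commands and policy switches."""
    return rng.uniform(g_low, g_high)

def advance_phase(phi, phi_rate, dt, episodic):
    """Periodic motions cycle the phase indefinitely; episodic motions let it
    grow monotonically and force a transition once the motion ends."""
    phi_next = phi + phi_rate * dt
    if episodic:
        return phi_next, phi_next >= 1.0       # (phase, motion finished?)
    return phi_next % 1.0, False
```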

For the presented character, specific policies trained include:

  • One perpetual policy for standing (controlling head and torso).
  • One periodic policy for walking (with separate head control).
  • Several episodic policies (each for a single animation sequence).

4.2.2.1. Animation Input

The interface with animation content is through kinematic motion references that define the character's time-varying target state:
$ x_t = (p_t, \theta_t, v_t, \omega_t, q_t, \dot{q}_t, c_t^L, c_t^R) $
Where:

  • $p_t$: Global position of the torso at time $t$.

  • $\theta_t$: Orientation of the torso at time $t$, represented as a quaternion.

  • $v_t$: Linear velocity of the torso at time $t$.

  • $\omega_t$: Angular velocity of the torso at time $t$.

  • $q_t$: Joint positions at time $t$.

  • $\dot{q}_t$: Joint velocities at time $t$.

  • $c_t^L$: Contact state of the character's left foot at time $t$ (e.g., in contact or not).

  • $c_t^R$: Contact state of the character's right foot at time $t$.

    For each motion type, a generator function $f$ maps a path frame $f_t$ and optional inputs to the kinematic target state:
$ x_t = f^{\mathrm{perp}}(f_t, g_t^{\mathrm{perp}}) $
$ (x_t, \dot{\phi}_t) = f^{\mathrm{peri}}(f_t, \phi_t, g_t^{\mathrm{peri}}) $
$ x_t = f^{\mathrm{epis}}(f_t, \phi_t) $
Where:

  • $f^{\mathrm{perp}}$, $f^{\mathrm{peri}}$, $f^{\mathrm{epis}}$: Generator functions for perpetual, periodic, and episodic motions, respectively.

  • $f_t$: Path frame at time $t$.

  • $g_t^{\mathrm{perp}}$: Control input for perpetual motion.

  • $\phi_t$: Phase signal.

  • $g_t^{\mathrm{peri}}$: Control input for periodic motion.

  • $\dot{\phi}_t$: Phase rate (output by the periodic motion generator).

    The reference generator for periodic motions additionally outputs the phase rate $\dot{\phi}_t$, allowing variation in stepping frequency. For episodic motions, the phase rate is determined by the motion's duration. Figure 4 illustrates the path frame.

    Fig. 4 (illustration). The robot shown in several poses while executing different motions (top), together with the corresponding path frame trajectory on the ground (bottom), illustrating its agility and dynamic stability.

For the perpetual standing motion, the control input $g_t^{\mathrm{perp}}$ consists of head and torso commands:
$ g_t^{\mathrm{perp}} = (\Delta h_t^{\mathrm{head}}, \Delta\theta_t^{\mathrm{head}}, h_t^{\mathrm{torso}}, \theta_t^{\mathrm{torso}}) $
Where:

  • $\Delta h_t^{\mathrm{head}}$: Head height offset relative to nominal.

  • $\Delta\theta_t^{\mathrm{head}}$: Head orientation offset relative to nominal.

  • $h_t^{\mathrm{torso}}$: Torso height (in path frame coordinates).

  • $\theta_t^{\mathrm{torso}}$: Torso orientation (ZYX-Euler angles, in path frame coordinates).

    For periodic motion (walking), the control input $g_t^{\mathrm{peri}}$ includes head commands as offsets and path frame velocities:
$ g_t^{\mathrm{peri}} = (\Delta h_t^{\mathrm{head}}, \Delta\theta_t^{\mathrm{head}}, v_t^{\mathcal{P}}, \omega_t^{\mathcal{P}}) $
Where:

  • $\Delta h_t^{\mathrm{head}}$: Head height offset relative to a nominal head motion within the walk cycle.

  • $\Delta\theta_t^{\mathrm{head}}$: Head orientation offset relative to a nominal head motion within the walk cycle.

  • $v_t^{\mathcal{P}}$: Linear velocity (in path frame coordinates).

  • $\omega_t^{\mathcal{P}}$: Angular rate (in path frame coordinates).

    The path frame is crucial for motion transitions:

  • Artist-designed motions are stored in path coordinates.

  • Mapped to world coordinates based on the path frame state.

  • During standing, the path frame slowly converges to the center of the two feet.

  • During walking, the next frame is computed by integrating the path velocity commands.

  • For episodic motions, the path frame trajectory relative to the starting location is part of the artistic input.

  • $f_t$ is projected to a maximum distance from the current torso state to prevent excessive deviation.

    Animation content generation:

  • Perpetual references: Inverse dynamics is used to find a pose satisfying $g_t^{\mathrm{perp}}$ while optimizing the remaining DoF to center the pressure within the support polygon.

  • Periodic walking motions: Artists provide reference gaits at various speeds. These are procedurally combined [14] into new gaits based on $g_t^{\mathrm{peri}}$. A model predictive controller [51] plans the desired center of mass and center of pressure, and an inverse dynamics controller [21] tracks these to obtain whole-body trajectories. This is done within an interactive tool that allows artist fine-tuning.

  • Episodic motions: Generated in Maya [2].

    Reference generators are densely sampled, and reference look-up during RL training is performed via interpolation of these samples to prevent slowing down training.
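The path frame update described earlier in this subsection can be summarized in a short sketch. The convergence rate `alpha` and clamp distance `max_dist` are made-up placeholders, and the update is a simplification of the described dynamics.

```python
import numpy as np

def update_path_frame(p_frame, yaw, feet_center, v_cmd, omega_cmd, torso_xy,
                      dt, standing, alpha=0.02, max_dist=0.3):
    """Path-frame dynamics sketched from the description above.

    standing : slowly pull the frame toward the center of the two feet
    walking  : integrate the commanded planar velocity and turn rate
    Finally, clamp the frame to a maximum distance from the torso.
    """
    if standing:
        p_frame = (1.0 - alpha) * p_frame + alpha * feet_center
    else:
        # rotate the path-frame velocity command into world coordinates
        c, s = np.cos(yaw), np.sin(yaw)
        p_frame = p_frame + np.array([[c, -s], [s, c]]) @ v_cmd * dt
        yaw = yaw + omega_cmd * dt
    # project to a maximum distance from the current torso position
    offset = p_frame - torso_xy
    dist = np.linalg.norm(offset)
    if dist > max_dist:
        p_frame = torso_xy + offset * (max_dist / dist)
    return p_frame, yaw
```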

4.2.2.2. Reward

The reward function $r_t$ combines motion-imitation rewards, regularization rewards, and a survival reward: $ r_t = r_t^{\mathrm{imitation}} + r_t^{\mathrm{regularization}} + r_t^{\mathrm{survival}} $

  • $r_t^{\mathrm{imitation}}$: Rewards for closely matching the simulated robot's pose to the target pose from the kinematic reference. Additional rewards are given for matching foot contact states.

  • $r_t^{\mathrm{regularization}}$: Penalizes joint torques and promotes action smoothness to mitigate vibrations and unnecessary actions.

  • $r_t^{\mathrm{survival}}$: A simple objective to keep the character "alive" and prevent early termination of the episode (e.g., if the head or torso touches the ground, or self-collision occurs).

    The detailed weighted reward terms are provided in Table I. A hat (e.g., $\hat{q}$) denotes target quantities from the reference pose $x_t$. The following are the results from Table I of the original paper:

| Name | Reward Term | Weight |
| --- | --- | --- |
| Imitation | | |
| Torso position xy | $\exp(-200.0 \cdot \|p_{xy} - \hat{p}_{xy}\|^2)$ | 1.0 |
| Torso orientation | $\exp(-20.0 \cdot \|\theta \ominus \hat{\theta}\|^2)$ | 1.0 |
| Linear velocity xy | $\exp(-8.0 \cdot \|v_{xy} - \hat{v}_{xy}\|^2)$ | 1.0 |
| Linear velocity z | $\exp(-8.0 \cdot (v_z - \hat{v}_z)^2)$ | 1.0 |
| Angular velocity xy | $\exp(-2.0 \cdot \|\omega_{xy} - \hat{\omega}_{xy}\|^2)$ | 0.5 |
| Angular velocity z | $\exp(-2.0 \cdot (\omega_z - \hat{\omega}_z)^2)$ | 0.5 |
| Leg joint positions | $-\|q_l - \hat{q}_l\|^2$ | 15.0 |
| Neck joint positions | $-\|q_n - \hat{q}_n\|^2$ | 100.0 |
| Leg joint velocities | $-\|\dot{q}_l - \hat{\dot{q}}_l\|^2$ | $1.0 \cdot 10^{-3}$ |
| Neck joint velocities | $-\|\dot{q}_n - \hat{\dot{q}}_n\|^2$ | 1.0 |
| Contact | $\sum_{i \in \{L,R\}} \mathrm{I}[c_i = \hat{c}_i]$ | 1.0 |
| Regularization | | |
| Joint torques | $-\|\tau\|^2$ | $1.0 \cdot 10^{-3}$ |
| Joint accelerations | $-\|\ddot{q}\|^2$ | $2.5 \cdot 10^{-6}$ |
| Leg action rate | $-\|a_{t,l} - a_{t-1,l}\|^2$ | 1.5 |
| Neck action rate | $-\|a_{t,n} - a_{t-1,n}\|^2$ | 5.0 |
| Leg action acc. | $-\|a_{t,l} - 2a_{t-1,l} + a_{t-2,l}\|^2$ | 0.45 |
| Neck action acc. | $-\|a_{t,n} - 2a_{t-1,n} + a_{t-2,n}\|^2$ | 5.0 |
| Survival | | |
| Survival | | 1.0 |

For episodic motions like the "excited motion" and the "jump", specific weights are increased during certain phases to emphasize key aspects of the motion:
$ \tilde{w}(\phi) = w_0 + \mathrm{I}[\phi_{\mathrm{start}} < \phi < \phi_{\mathrm{end}}] \cdot w_{\mathrm{extra}} $
Where:

  • $\tilde{w}(\phi)$: The modified weight of a reward term at phase $\phi$.
  • $w_0$: The base weight from Table I.
  • $\mathrm{I}[\cdot]$: An indicator function that is 1 if the condition inside is true, and 0 otherwise.
  • $\phi_{\mathrm{start}}$: The start phase of the emphasized segment.
  • $\phi_{\mathrm{end}}$: The end phase of the emphasized segment.
  • $w_{\mathrm{extra}}$: Additional weight added during the emphasized segment. For the "excited motion," the angular velocity tracking weight is increased when the robot shakes its torso. For the "jump," torso height and orientation weights are increased during the jump, along with the contact reward, to prevent "cheating" by keeping the toes on the ground.
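The phase-dependent weight $\tilde{w}(\phi)$ is a direct translation of the equation above; this one-liner is only for illustration.

```python
def phased_weight(phi, w0, w_extra, phi_start, phi_end):
    """Base weight plus an extra contribution inside the emphasized phase window."""
    return w0 + (w_extra if phi_start < phi < phi_end else 0.0)
```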

4.2.2.3. Policy

Policy actions $a_t$ are joint position setpoints for proportional-derivative (PD) controllers. The policy receives a state $s_t$ as input, in addition to an optional motion-specific phase ($\phi_t$) and control command ($g_t$):
$ s_t = (p_t^{\mathcal{P}}, \theta_t^{\mathcal{P}}, v_t^{\mathcal{T}}, \omega_t^{\mathcal{T}}, q_t, \dot{q}_t, a_{t-1}, a_{t-2}) $
Where:

  • $p_t^{\mathcal{P}}$: Torso's horizontal (xy-plane) position in path frame coordinates.
  • $\theta_t^{\mathcal{P}}$: Torso's orientation in path frame coordinates.
  • $v_t^{\mathcal{T}}$: Torso's linear velocity in body coordinates.
  • $\omega_t^{\mathcal{T}}$: Torso's angular velocity in body coordinates.
  • $q_t$: Joint positions.
  • $\dot{q}_t$: Joint velocities.
  • $a_{t-1}$: Actions from the previous time step.
  • $a_{t-2}$: Actions from two time steps ago.

    This state representation makes the policy invariant to the robot's global location. All policies are trained using Proximal Policy Optimization (PPO) [38].

Policy architectures and RL details (from Appendix A):

  • Training Time: 100,000 iterations (approx. 2 days on Nvidia RTX 4090), though nominal behavior is learned in the first 1,500 iterations (30 min).
  • PPO Hyperparameters: (Refer to Table IV below for details).
  • Network Architecture: Three fully connected layers of 512 hidden units with ELU activation functions for the actor (policy network). A separate critic network with the same number of hidden units.
  • Critic Input: Receives the simulation state without noise and the randomized friction parameter as privileged information (information only available during training, not deployment).
  • Input/Output Transformations:
    • Input Normalization: All inputs are normalized by their expected range.

    • Phase Feature Vector:

      • For walking: The phase $\phi$ is replaced with its first two harmonics $(\sin(k \cdot 2\pi\phi), \cos(k \cdot 2\pi\phi))$ for $k \in \{1, 2\}$. The first harmonic represents the walking cycle, and the second helps with the head-bob occurring at twice the gait frequency.
      • For episodic motions: $N = 50$ Gaussian basis functions $\exp(-(\phi - \phi_i)^2 / (2N^2))$, where the $\phi_i$ are equally spaced. These provide highly local, phase-dependent feedforward signals.
    • Output Transformation: A linear transform maps actions such that 0 corresponds to nominal joint positions and 1 to the expected range per joint.

    • Action Clipping: Joint position setpoints are clipped to a maximum deviation around the measured joint position, ensuring that maximum actuator torque can still be produced.

      The following are the results from Table IV of the original paper:

      Param. Value
      Num. iterations 100 000
      Batch size (envs. × steps) 8192 × 24
      Num. mini-batches 4
      Num. epochs 5
      Clip range 0.2
      Entropy coefficient 0.0
      Discount factor 0.99
      GAE discount factor 0.95
      Desired KL-divergence 0.01
      Max gradient norm 1.0
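Assuming a standard PyTorch setup, an actor with the architecture described above (three fully connected layers of 512 units with ELU activations) and the walking-phase feature encoding could be sketched as follows; input and output dimensions are placeholders, not the paper's values.

```python
import math
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: three hidden layers of 512 units with ELU activations,
    mapping the concatenated (state, phase features, command) vector to joint
    position setpoints. Dimensions are illustrative only."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def walking_phase_features(phi: float) -> list:
    """First two harmonics of the gait phase, as described above."""
    return [f(k * 2.0 * math.pi * phi) for k in (1, 2) for f in (math.sin, math.cos)]
```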

4.2.2.4. Low-level Control

The policy generates actions at a rate of 50 Hz, while the actuators operate at 600 Hz. This gap is bridged by:

  • First-order hold: Linear interpolation between the previous and current policy actions.
  • Low-pass filter: With a cut-off frequency of 37.5 Hz.

The low-level controller also implements the path frame dynamics and phase signal propagation described earlier. These aspects are consistent between the RL training environment and runtime.
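A sketch of how the 50 Hz actions could be upsampled to the 600 Hz actuator loop with a first-order hold followed by a single-pole low-pass filter; only the rates and the cut-off frequency come from the text, the discrete filter form is an assumption.

```python
import numpy as np

class ActionUpsampler:
    """Bridge 50 Hz policy actions to the 600 Hz low-level control loop."""
    def __init__(self, f_policy=50.0, f_ctrl=600.0, f_cutoff=37.5):
        self.n_sub = int(f_ctrl / f_policy)                       # 12 sub-steps
        dt = 1.0 / f_ctrl
        self.alpha = dt / (dt + 1.0 / (2.0 * np.pi * f_cutoff))   # single-pole LPF gain
        self.filtered = None

    def substeps(self, a_prev, a_curr):
        """Yield a filtered setpoint for each 600 Hz sub-step between two actions."""
        for k in range(1, self.n_sub + 1):
            held = a_prev + (a_curr - a_prev) * (k / self.n_sub)  # first-order hold
            self.filtered = held if self.filtered is None else \
                self.filtered + self.alpha * (held - self.filtered)
            yield self.filtered
```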

4.2.2.5. Simulation

A detailed simulation model is derived from the robot's CAD model, describing its physics, actuators, and environment interaction.

  • Physics Engine: Isaac Gym [26] is used for rigid body dynamics.

  • Actuator Models: Custom models [17, 42] are added, derived from first principles with parameters obtained from system identification experiments (see Appendix B).

  • Domain Randomization: Parameters of actuator models, mass properties, and frictional coefficients are randomized within observed ranges. Noise is added to the state received by the policy.

  • Disturbances: Random disturbance forces and torques are applied to the torso, head, hips, and feet. For walking policy training, terrain is also randomized.

  • Disturbance Parameters (from Appendix A): Details for applied disturbance forces are given in Table V. All parameters specify minimum and maximum magnitudes drawn from a uniform distribution. Forces and torques are applied per body (Hips, Feet, Pelvis, Head) for a random "on" duration, followed by a random "off" duration. These forces are gradually introduced over the first 1,500 iterations using a linear curriculum.

    The following are the results from Table V of the original paper:

    Param. Short / small Long / small Short / large
    Body Hips, Feet Pelvis, Head Pelvis
    Force [N] XY [0.0, 5.0] [0.0, 5.0] [90.0, 150.0]
    Z [0.0, 5.0] [0.0, 5.0] [0.0, 10.0]
    Torque [N m] XY [0.0, 0.25] [0.0, 0.25] [0.0, 15.0]
    Z [0.0, 0.25] [0.0, 0.25] [0.0, 15.0]
    Duration [s] On [0.25, 2.0] [2.0, 10.0] [0.1, 0.1]
    Off [1.0, 3.0] [1.0, 3.0] [12.0, 15.0]
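A sketch of how one disturbance event from Table V might be sampled, with a linear curriculum factor scaling the force magnitude during early training; the sign convention and the way forces are applied to the bodies are assumptions.

```python
import numpy as np

def sample_disturbance(rng, force_xy, force_z, dur_on, dur_off, curriculum=1.0):
    """Draw one push event: force magnitudes and on/off durations are drawn from
    uniform ranges (as in Table V); `curriculum` ramps from 0 to 1 over the first
    training iterations."""
    sign = lambda: rng.choice([-1.0, 1.0])
    force = curriculum * np.array([
        sign() * rng.uniform(*force_xy),
        sign() * rng.uniform(*force_xy),
        sign() * rng.uniform(*force_z),
    ])
    return force, rng.uniform(*dur_on), rng.uniform(*dur_off)
```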

4.2.3. Runtime

After training, the neural control policies' weights are frozen, and the policy networks are deployed onto the robot's onboard computer. The deployed policies and low-level controllers interact with the robot hardware and a standard state estimator [11] (which fuses IMU and actuator measurements) instead of a simulator.

The runtime system enables an operator to puppeteer the character using a remote control interface. An Animation Engine maps puppeteering commands (policy switching, triggered animation events, joystick input) to policy control commands, show function signals, and audio signals.

4.2.3.1. Perpetual & Periodic Motions

For standing and walking, show function and policy commands are generated by combining event-driven animation playback with live puppeteering. The robot configuration is defined as $c_t = (p_t^{\mathcal{P}}, \theta_t^{\mathcal{P}}, q_t)$, from which control inputs are extracted. An extended animation command is defined as $y_t = (\nu_t, c_t)$.

  • $\nu_t$: Represents all show function commands.

  • $c_t$: Represents the robot configuration.

    The show function parameters are summarized in Table II. The following are the results from Table II of the original paper:

Function Parameters Dimensionality Units
Antenna positions 2× 1 [rad]
Eye colors 2× 3 [RGB]
Eye radii 2× 1 [%]
Head lamp brightness 1 [%]

Figure 5 shows the high-level diagram of the animation pipeline, which computes a target output $y_t$ by combining three functional animation layers:

Fig. 5. The animation engine procedurally generates the animation command $y_t$ based on three layers: the background animation (looped, yielding $y_t^{\mathrm{bg}}$), triggered animations (blended in and out, yielding $y_t^{\mathrm{trig}}$, as illustrated by the green curve), and animations derived from joystick inputs $u_t$. In contrast to triggered clips, the background animation remains continuously active.

  1. Background Animation: $y_t^{\mathrm{bg}}$ is a looped animation (e.g., intermittent eye-blinking, antenna motion) that is always active in the absence of other inputs.

  2. Triggered Animations: Operator-triggered clips ($y_t^{\mathrm{trig}}$) from a library (e.g., yes-no animations, complex scan wipes) are blended on top of the background animation. The blending for show function commands ($\nu$) and configuration ($c$) is:
$ \nu_t^{\mathrm{bld}} = (1 - \beta)\,\nu_t^{\mathrm{bg}} + \beta\,\nu_t^{\mathrm{trig}} $
$ c_t^{\mathrm{bld}} = \mathrm{interp}(c_t^{\mathrm{bg}}, c_t^{\mathrm{trig}}, \alpha) $
Where:

    • $\nu_t^{\mathrm{bld}}$: Blended show function commands.
    • $\nu_t^{\mathrm{bg}}$: Background show function commands.
    • $\nu_t^{\mathrm{trig}}$: Triggered show function commands.
    • $\beta$: Blend ratio for show functions.
    • $c_t^{\mathrm{bld}}$: Blended configuration.
    • $c_t^{\mathrm{bg}}$: Background configuration.
    • $c_t^{\mathrm{trig}}$: Triggered configuration.
    • $\mathrm{interp}$: Interpolation function (linear for position/joint angles, slerp for body orientation).
    • $\alpha$: Blend ratio for configuration. $\beta$ and $\alpha$ ramp linearly from 0 to 1 at the beginning of an animation and back to 0 at the end, with $T_\beta = 0.1\,\mathrm{s}$ and $T_\alpha = 0.35\,\mathrm{s}$ (facial expressions blend faster).
  3. Joystick Animation: This layer modifies the blended animation state $y_t^{\mathrm{bld}}$ based on joystick input $u$ from the puppeteer.

    • While standing: The target robot configuration is computed as:
$ y_t = \mathcal{I}^{\mathrm{perp}}(y_t^{\mathrm{bld}}, u_t) $
Where $\mathcal{I}^{\mathrm{perp}}$ is a non-linear mapping that modifies the current animation state (gaze, posture) based on joystick inputs. Figure 6 provides examples.

    • While walking: The target robot configuration and path velocity commands are computed as:
$ (y_t, v_t^{\mathcal{P}}, \omega_t^{\mathcal{P}}) = \mathcal{I}^{\mathrm{peri}}(y_t^{\mathrm{bld}}, u_t) $
Where $\mathcal{I}^{\mathrm{peri}}$ is a similar mapping that also produces path velocity commands ($v_t^{\mathcal{P}}$, $\omega_t^{\mathcal{P}}$) for the periodic policy. Joystick axes for posture control are remapped to forward, lateral, and turning velocity commands. Show function modulation (e.g., antennas ducking, eyes narrowing) is applied based on the path velocity, conveying exertion.

      Fig. 6. A selection of joystick commands during standing. Posture control (left) moves the torso without affecting the gaze. Gaze control (right) changes the gaze primarily by moving the head, but also commands additive torso movement to extend the range. See Appendix C for details.
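A sketch of the layering logic above, assuming SciPy for the orientation slerp; the ramp function and the way the blend weights are applied are simplifications of the described behavior (with $T_\beta = 0.1$ s for show functions and $T_\alpha = 0.35$ s for the configuration).

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def blend_weight(t, t_start, t_end, T_ramp):
    """Ramp a blend weight linearly 0 -> 1 at the start of a triggered clip
    and back 1 -> 0 at its end (sketch of how beta and alpha could evolve)."""
    rise = np.clip((t - t_start) / T_ramp, 0.0, 1.0)
    fall = np.clip((t_end - t) / T_ramp, 0.0, 1.0)
    return float(min(rise, fall))

def blend_show_functions(nu_bg, nu_trig, beta):
    """Linear blend of show-function commands (antenna positions, eye colors, ...)."""
    return (1.0 - beta) * np.asarray(nu_bg) + beta * np.asarray(nu_trig)

def blend_orientation(quat_bg, quat_trig, alpha):
    """Spherical linear interpolation between two torso orientations (quaternions)."""
    slerp = Slerp([0.0, 1.0], Rotation.from_quat([quat_bg, quat_trig]))
    return slerp(alpha).as_quat()
```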

Once the animation output $y_t$ (which contains $\nu_t$ and $c_t$) is computed:

  • Show functions are directly controlled with $\nu_t$.
  • Policy command signals ($g_t$) are derived from $c_t$. For the head, $\Delta h_t^{\mathrm{head}}$ and $\Delta\theta_t^{\mathrm{head}}$ are derived by comparing $c_t$ to the robot's nominal configuration. For standing, $g_t^{\mathrm{perp}}$ is directly derived. For walking, the lower-body motion is entirely determined by the periodic policy and the commanded path velocities; leg joint positions are ignored as they are not policy inputs.

4.2.3.2. Episodic Motions

When an episodic motion is triggered, the animation engine initiates a transition to the corresponding policy and triggers an associated animation clip. This clip syncs with appropriate show function animations. No additional user input is accepted until the episodic motion concludes.

4.2.3.3. Audio Engine

An onboard audio engine processes and mixes all audio. An operator can trigger short sound clips (e.g., vocalizations). If an animation or episodic motion has associated audio, the animation engine sends a synchronous playback command. The system also supports sound effects modulated by robot actuator speeds to create artificial gear sounds.

4.2.4. Actuator Model (Appendix B)

The paper describes the actuator models used to augment the physics simulator. System identification was performed on single actuators on a test bench. The proportional-derivative motor torque equation for the quasi-direct drives (used in the legs, hips, and lower neck) is:
$ \tau_m = k_{\mathrm{P}}(a - \tilde{q}) - k_{\mathrm{D}}\dot{q} $
Where:

  • $\tau_m$: Motor torque.

  • $k_{\mathrm{P}}$: Proportional gain.

  • $k_{\mathrm{D}}$: Derivative gain.

  • $a$: Joint setpoint (desired position).

  • $\tilde{q}$: Joint position with an added encoder offset.

  • $\dot{q}$: Joint velocity.

    The encoder offset $\epsilon_q$ is added to the true joint position $q$:
$ \tilde{q} = q + \epsilon_q $
Where $\epsilon_q$ is drawn from a uniform distribution $\mathcal{U}(-\epsilon_{q,\mathrm{max}}, \epsilon_{q,\mathrm{max}})$ at the start of each RL episode to account for calibration inaccuracies.

The actuators used in the head employ a different PD equation: $ \tau_m = k_{\mathrm{P}}(a - \tilde{q}) + k_{\mathrm{D}}\frac{\mathrm{d}}{\mathrm{d}t}(a - \tilde{q}) $ Here, the derivative term operates on the numerical derivative of the setpoint error, rather than on the joint velocity.

Friction in the joint is modeled as:
$ \tau_f = \mu_s \tanh(\dot{q}/\dot{q}_s) + \mu_d \dot{q} $
Where:

  • $\tau_f$: Friction torque.

  • $\mu_s$: Static coefficient of friction.

  • $\dot{q}_s$: Activation parameter for static friction.

  • $\mu_d$: Dynamic coefficient of friction.

    The torque produced at the joint, $\tau$, is computed by applying torque limits to $\tau_m$ and subtracting the friction torque:
$ \tau = \mathrm{clamp}_{[\underline{\tau}(\dot{q}),\, \overline{\tau}(\dot{q})]}(\tau_m) - \tau_f $
Where:

  • $\mathrm{clamp}_{[\underline{\tau}(\dot{q}),\, \overline{\tau}(\dot{q})]}(\cdot)$: Clamps a value within the range defined by the minimum ($\underline{\tau}(\dot{q})$) and maximum ($\overline{\tau}(\dot{q})$) velocity-dependent torques. These limits consist of a constant limit torque $\tau_{\mathrm{max}}$ for braking and low velocities, and a linear decrease in available torque above an identified velocity $\dot{q}_{\tau_{\mathrm{max}}}$, reaching $0\,\mathrm{Nm}$ at $\dot{q}_{\mathrm{max}}$.

The measured joint position reported by the actuator model includes the encoder offset, backlash, and noise:
$ \hat{q} = \tilde{q} + 0.5 \cdot b \cdot \tanh(\tau_m/\tau_b) + \mathcal{N}(0, \sigma_q^2) $
Where:

  • $\hat{q}$: Measured joint position.

  • $b$: Total motor backlash, drawn from $\mathcal{U}(b_{\mathrm{min}}, b_{\mathrm{max}})$ at each episode start.

  • $\tau_b$: Activation parameter for backlash.

  • $\mathcal{N}(0, \sigma_q^2)$: Gaussian noise with mean 0 and variance $\sigma_q^2$.

    The noise model uses a standard deviation proportional to the motor velocity:
$ \sigma_q = \sigma_{q,0} + \sigma_{q,1}|\dot{q}| $
Where $\sigma_{q,0}$ and $\sigma_{q,1}$ are constants.

Finally, the reflected inertia of the actuator, $I_m$, is added to the simulation model by setting the armature value of the corresponding joint in Isaac Gym. This value is randomized with up to a 20% offset at the start of each episode.
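Putting the Appendix B equations together, a simplified per-joint actuator model might look like the following sketch; the asymmetric handling of braking versus driving torque limits is an interpretation of the description above, and backlash and measurement noise are omitted for brevity.

```python
import numpy as np

def actuator_torque(a, q, q_dot, kp, kd, tau_max, qd_tau_max, qd_max,
                    mu_s, qd_s, mu_d, eps_q=0.0):
    """Quasi-direct-drive actuator model: PD motor torque on the offset joint
    position, velocity-dependent torque limits, minus static + viscous friction."""
    q_tilde = q + eps_q                               # encoder offset
    tau_m = kp * (a - q_tilde) - kd * q_dot           # PD motor torque

    def driving_limit(v):
        """Available torque in the direction of motion: constant up to
        qd_tau_max, then decreasing linearly to zero at qd_max."""
        if abs(v) <= qd_tau_max:
            return tau_max
        return max(0.0, tau_max * (qd_max - abs(v)) / (qd_max - qd_tau_max))

    upper = driving_limit(q_dot) if q_dot >= 0.0 else tau_max   # braking keeps full torque
    lower = -tau_max if q_dot >= 0.0 else -driving_limit(q_dot)
    tau_m = float(np.clip(tau_m, lower, upper))

    tau_f = mu_s * np.tanh(q_dot / qd_s) + mu_d * q_dot         # joint friction
    return tau_m - tau_f
```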

The actuator gains and model parameters are detailed in Table VI. The following are the results from Table VI of the original paper:

| Param. | Unitree A1 | Unitree Go1 | Dynamixel XH540-V150 | Units |
| --- | --- | --- | --- | --- |
| $k_{\mathrm{P}}$ | 15.0 | 10.0 | 5.0 | [N m rad⁻¹] |
| $k_{\mathrm{D}}$ | 0.6 | 0.3 | 0.2 | [N m s rad⁻¹] |
| $\tau_{\mathrm{max}}$ | 34.0 | 23.7 | 4.8 | [N m] |
| $\dot{q}_{\tau_{\mathrm{max}}}$ | 7.4 | 10.6 | 0.2 | [rad s⁻¹] |
| $\dot{q}_{\mathrm{max}}$ | 20.0 | 28.8 | 7.0 | [rad s⁻¹] |
| $\mu_s$ | 0.45 | 0.15 | 0.05 | [N m] |
| $\mu_d$ | 0.023 | 0.016 | 0.009 | [N m s rad⁻¹] |
| $b_{\mathrm{min}}$ | 0.005 | 0.002 | 0.002 | [rad] |
| $b_{\mathrm{max}}$ | 0.015 | 0.005 | 0.005 | [rad] |
| $\epsilon_{q,\mathrm{max}}$ | 0.02 | 0.02 | 0.02 | [rad] |
| $\sigma_{q,0}$ | $1.80 \cdot 10^{-4}$ | $1.89 \cdot 10^{-4}$ | $4.31 \cdot 10^{-4}$ | [rad] |
| $\sigma_{q,1}$ | $3.61 \cdot 10^{-5}$ | $5.47 \cdot 10^{-5}$ | $2.43 \cdot 10^{-5}$ | [s] |
| $I_m$ | 0.011 | 0.0043 | 0.0058 | [kg m²] |

4.2.5. Puppeteering Interface (Appendix C)

The button layout of the remote controller used in this work (a Steam Deck [43]) is shown and annotated in Figure 12.

Fig. 12. Steam Deck [43] layout, annotating the D-pad, ABXY buttons, the L3 and R3 joysticks, the R1, R2, L1, and L2 buttons, and the LoRa radio module.

The effect of each button is described in Table VII. The following are the results from Table VII of the original paper:

Button Effect
Menu Trigger a safety mode called motion stop. This forces a transition to standing and freezes the joint setpoints with high position gains after waiting 0.5 s.
View Slowly move all joints to the default pose. Only available at startup or while in motion stop.
D-pad ↑↓ Move the head up-down.
←→ Roll the head left-right.
Left Joystick ↑↓ During walking: Longitudinal walking velocity. During standing: Up pitches the torso forward while the head remains stationary, and Down lowers the torso height.
←→ During walking: Turning rate. During standing: Torso yaw while the head remains stationary.
L3 Pressing the left joystick triggers a scanning animation.
Right Joystick ↑↓ Pitches the head. During standing, this additionally commands torso pitch.
←→ Yaws the head left-right. During standing, the end of the range additionally commands torso yaw.
R3 Pressing the right joystick toggles the audio level.
ABXY A Transition to standing.
B Fully tuck the neck in, turn off the eyes, and retract the antennas. While standing, the torso height is also lowered. Cancel all active animations.
Y Turn on the background animation layer.
Left Trackpad Trigger an episodic motion. Each quadrant of the trackpad maps to a different motion.
Right Trackpad Like the left trackpad. Reserved to trigger four additional episodic motions.
Backside L1 Turn the head lamp on and off.
R1 Single press: Start and stop walking, Hold: increase the walking velocity gain to 100 %. Without holding R1, all velocity commands are scaled to 50 % of the maximum.
L2 During walking: Lateral walking velocity. During standing: Roll the torso while the head remains stationary.
R2 Short press: trigger a happy animation. Long press: trigger an angry animation.
L4 Short press: trigger an anxious animation. Long press: trigger a curious animation.
L5 Short press: trigger a "yes" animation. Long press: trigger a "no" animation.
R4 Trigger an expressive audio clip.

Figure 11 illustrates how torso and head yaw offsets are computed based on left and right joystick inputs to achieve independent torso movement and coordinated gaze.

Fig. 11. Joystick mapping during standing for the torso and head yaw offsets depending on the left and right joystick inputs. The resulting gaze offset is the sum of the torso and local head offsets. The left plot shows the effect of the left joystick on torso, head, and gaze yaw; the right plot shows the corresponding effect of the right joystick. Yaw offsets are given in radians over the joystick range [-1, 1].

The gaze offset (total head yaw) is the sum of torso yaw offset and local head yaw offset. The left joystick primarily controls torso yaw while counter-rotating the head to maintain fixed gaze. The right joystick primarily controls head yaw, with additive torso movement at kinematic limits to extend range. This separation reduces cognitive load on the puppeteer.
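A minimal sketch of this yaw-mixing logic is given below. The linear gains, the neck limit, and the overflow handling are assumptions chosen for illustration; the actual mapping is the one plotted in Figure 11.

```python
def standing_yaw_mapping(left_x, right_x, torso_range=0.8, head_range=1.0):
    """Map joystick x-axes in [-1, 1] to (torso_yaw, local_head_yaw, gaze_yaw) in rad."""
    # Left stick: torso yaw, with the head counter-rotating so the gaze stays fixed.
    torso_yaw = torso_range * left_x
    head_yaw = -torso_yaw

    # Right stick: head yaw; once the neck would exceed its range,
    # the remainder is handed to the torso to extend the gaze range.
    head_target = head_yaw + head_range * right_x
    clipped = max(-head_range, min(head_range, head_target))
    torso_yaw += head_target - clipped
    head_yaw = clipped

    return torso_yaw, head_yaw, torso_yaw + head_yaw

# Pure left-stick input yaws the torso while the gaze stays at zero:
print(standing_yaw_mapping(left_x=1.0, right_x=0.0))   # (0.8, -0.8, 0.0)
```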

5. Experimental Setup

5.1. Datasets

The paper does not use traditional datasets in the sense of pre-collected images or text for training. Instead, its "data" for learning comes from:

  • Artist-Generated Kinematic Motion References: These are bespoke animations created by artists using tools like Maya [2] or a procedural gait generation tool [14]. These references define the robot's desired time-varying joint positions, velocities, orientations, and contact states (xt\pmb{x}_t).

    • Perpetual Motion References: Generated using inverse dynamics to satisfy high-level head/torso commands and maintain balance.
    • Periodic Motion References: Artist-provided reference gaits at various speeds, procedurally combined and refined using model predictive control and inverse dynamics.
    • Episodic Motion References: Direct artistic input from Maya.
  • Domain Randomization: During RL training, the simulation environment's parameters (actuator model parameters, mass properties, frictional coefficients) are randomized, and random disturbance forces and torques are applied. This acts as a synthetic dataset of varied environmental conditions, enabling the policy to generalize to real-world uncertainties; a small sampling sketch is given at the end of this subsection.

    These "datasets" are specifically designed for the unique character and its intended artistic motions, ensuring direct relevance to the paper's objectives. They are effective for validating the method's performance because the goal is to imitate these specific artistic motions robustly.

5.2. Evaluation Metrics

The paper primarily uses quantitative metrics to assess tracking performance and qualitative evaluations for robustness and expressiveness.

  1. Mean Absolute Tracking Error (MAE) of Joint Positions:

    • Conceptual Definition: MAE measures the average magnitude of the absolute differences between the predicted (simulated robot's) joint positions and the target (reference animation's) joint positions. It indicates how closely the robot's actual movements match the desired artistic motions. A lower MAE means better tracking fidelity; a small computation sketch is given after this list.
    • Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |q_{i, \text{sim}} - q_{i, \text{ref}}| $
    • Symbol Explanation:
      • NN: Total number of joint position samples (across all joints and time steps) considered.
      • qi,simq_{i, \text{sim}}: The ii-th simulated joint position.
      • qi,refq_{i, \text{ref}}: The ii-th reference (target) joint position.
      • |\cdot|: Absolute value.
  2. Tracking of Commanded Velocities:

    • Conceptual Definition: For walking motions, this metric assesses how accurately the robot's estimated torso linear and angular velocities (in the path frame) follow the commanded velocities from the operator interface. This measures responsiveness and control fidelity for dynamic locomotion.
    • Formula: While no explicit formula is given, it's typically evaluated by comparing the time-series of commanded velocities to measured velocities (as shown in Figure 7), often involving metrics like Root Mean Square Error (RMSE) or visual inspection of alignment.
  3. Actuator Torque Limits Utilization:

    • Conceptual Definition: This involves monitoring the actual joint torques during motion and comparing them against the velocity-dependent maximum torque limits of the actuators. It indicates if the robot is operating within its physical capabilities and if the motion demands high power from the motors. Reaching limits can lead to tracking errors or reduced performance.
    • Formula: No single formula, but involves comparing τ\tau (actual joint torque) with τ(q˙)\underline { { \tau } } ( \dot { q } ) and τ(q˙)\overline { { \tau } } ( \dot { q } ) (velocity-dependent minimum and maximum torque limits).
  4. Qualitative Robustness:

    • Conceptual Definition: Evaluated by observing the robot's ability to maintain balance and continue executing its motion under challenging conditions, such as external pushes or walking over small obstacles. This demonstrates the generalization and disturbance rejection capabilities learned by the RL policies.
  5. Qualitative Expressiveness and Believability:

    • Conceptual Definition: Assessed by human observation of the robot's performance, focusing on how natural, characterful, and engaging its motions are, and how well it interacts with operators and audiences. This is the ultimate subjective measure for entertainment applications.
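As referenced under the MAE metric above, here is a minimal computation sketch; the array shapes and values are illustrative only.

```python
import numpy as np

def joint_tracking_mae(q_sim, q_ref):
    """Mean absolute joint-position tracking error in rad.
    q_sim, q_ref: arrays of shape (timesteps, num_joints)."""
    return float(np.mean(np.abs(np.asarray(q_sim) - np.asarray(q_ref))))

# A constant 0.03 rad offset on every joint at every timestep gives an MAE of ~0.03 rad.
print(joint_tracking_mae(np.full((100, 5), 0.13), np.full((100, 5), 0.10)))
```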

5.3. Baselines

The paper compares its RL formulation against several alternative approaches, primarily within the context of controlling legged systems:

  1. Direct Velocity Tracking (without phase signal):

    • Description: This baseline approach involves training an RL policy to directly track commanded walking velocities using torso-related rewards, while de-emphasizing leg joint position, velocity, and contact rewards. The phase signal is explicitly removed from the policy inputs.
    • Purpose: To show that simply commanding velocities without proper phase information or contact incentives leads to poor, unnatural gaits.
    • Outcome (as described in results): Results in a policy that "rapidly shuffles the feet" and yields unnatural gaits, even with common penalties for foot clearance and slip.
  2. Phase Signal + Contact Reference Tracking:

    • Description: This approach provides the RL policy with a phase signal and corresponding contact reference, in addition to velocity commands. It activates the contact reward, similar to some prior works [39, 40].
    • Purpose: To compare against common RL practices for legged locomotion that explicitly use phase and contact information.
    • Outcome (as described in results): While this policy follows the stepping pattern well, it results in a "stiff and upright torso motion," indicating a lack of naturalness and expressiveness compared to the proposed method.
  3. Current and Future Kinematic Reference Poses as Policy Inputs (instead of phase signal):

    • Description: This alternative formulation provides the policy with the current and future kinematic reference poses as inputs, instead of just the phase signal, while keeping the rest of the formulation identical to the proposed method [33, 25].

    • Purpose: To investigate if explicitly providing future kinematic information (which the phase signal implicitly encodes) leads to better or different performance.

    • Outcome (as described in results): For both walking and episodic motions, this formulation converges to a similar reward and produces a "visually identical motion." The paper notes that the benefit of using a phase signal over explicit kinematic references is that it avoids the need to store and reproduce complex reference motions directly on the robot.

      These comparisons highlight the specific design choices made in the proposed RL formulation (e.g., the specific reward structure, the use of phase signals for different motion types, and the separation of policies) and justify their effectiveness in achieving both robust dynamic performance and expressive, natural motions for the robotic character.

6. Results & Analysis

The results section evaluates the individual control policies, demonstrates the animation engine's capabilities in translating user commands, and discusses the overall system deployment.

6.1. Core Results Analysis

6.1.1. Evaluating Control Policies

  • Standing: The robot exhibits expressive motion during standing, allowing direct control over its torso and head. Each policy input corresponds to a controllable dimension (e.g., torso yaw, head pitch), enabling a wide range of expressive poses. The accompanying video demonstrates the robot traversing the full range of each policy input, validating its versatility.

  • Walking: The system accurately tracks commanded walking velocities, demonstrating responsiveness and control. The walking style supports a maximum longitudinal velocity of $0.7\ \mathrm{m\,s^{-1}}$, lateral velocity of $0.4\ \mathrm{m\,s^{-1}}$, and turning rate of $1.8\ \mathrm{rad\,s^{-1}}$. Figure 7 illustrates the close alignment between commanded path velocities (dashed lines) and measured torso velocities (solid lines) in the path frame, confirming the robot's ability to follow complex locomotion commands.

    Fig. 7. Commanded path velocities (dashed) and measured torso velocities (solid) in the path frame. The top plot shows the linear velocities in m/s; the bottom plot shows the turning rate in rad/s.

  • Episodic Policies: These policies enable the robot to perform diverse and coordinated motions, such as a "happy dance," "excited motion," "jump," and "tantrum." The specialized nature of these policies allows for a high level of coordinated joint movement that would be challenging to achieve with more general control. Figure 8 shows the torque profiles for the neck pitch (NP) and left knee pitch (KP) actuators during the "jump" motion. It reveals that the knee actuators reach their velocity-dependent torque limits during push-off, and the head actuator also reaches its limit during the upward pitching movement. This demonstrates that the robot utilizes its full dynamic range for expressive actions while staying within physical limits.

    Fig. 8. Measured joint torques (solid) and velocity-dependent torque limits (dashed) computed by the actuator model for the measured joint velocities, during the episodic "jump" motion. The top plot shows the neck pitch (NP) actuator and the bottom plot shows the left knee pitch (KP) actuator.

  • Robustness: A key strength of the RL approach is its ability to learn robust behaviors. The robot successfully stabilizes itself under external pushes and can walk over small obstacles (demonstrated in the video). The policy dynamically deviates from the reference trajectory and contact schedule to recover balance, showcasing its adaptability and resilience in unstructured environments. This contrasts with traditional optimization-based approaches that often struggle to adapt to unexpected disturbances while maintaining real-time performance and adhering to contact references.

  • Policy Transitions: The transitions between different policies (e.g., standing to walking, or between episodic motions) are designed to be seamless. Figure 9 illustrates the continuity of joint position setpoints (actions) across policy switches. The policies receive previous actions as input and are trained to promote smoothness, making transitions virtually undetectable to an outside observer. For instance, when transitioning from walking to standing, the switch is delayed until the next double support phase, ensuring the swing phase is completed naturally and the standing policy begins with both feet on the ground. A minimal sketch of this gating logic is given after Figure 9.

    Fig. 9. Policy actions across policy transitions during a short motion sequence. The active policy is indicated with the background color of the plot (green: standing, red: walking, blue: episodic). The top plot shows the actions for the neck joints: neck yaw (NY), neck roll (NR), neck pitch (NP), and neck forward (NF). The middle plot shows actions from the left leg: hip yaw (HY), hip roll (HR), hip pitch (HP), knee pitch (KP), and ankle pitch (AP). The bottom plot shows the phase signal when applicable. The dashed red line shows the moment a request to transition from walking to standing is received. The transition is made when the gait reaches the next double support phase.
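As noted above, a minimal sketch of this transition gating follows. The phase intervals that count as double support are hypothetical; the real system derives them from the gait's contact schedule.

```python
def should_switch_to_standing(transition_requested, phase, double_support_windows):
    """Delay a walking-to-standing switch until the gait phase enters double support.
    `double_support_windows` is a list of (start, end) phase intervals in [0, 1)."""
    if not transition_requested:
        return False
    return any(start <= phase <= end for start, end in double_support_windows)

windows = [(0.0, 0.1), (0.5, 0.6)]  # hypothetical double-support windows
print(should_switch_to_standing(True, phase=0.55, double_support_windows=windows))  # True
print(should_switch_to_standing(True, phase=0.30, double_support_windows=windows))  # False
```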

6.1.2. Alternative RL Formulations

The paper compared its RL formulation against three alternatives:

  1. Direct velocity tracking (no phase signal): Resulted in "rapidly shuffl[ing] the feet" and an "unnatural" gait. This confirms the importance of structured gait control beyond simple velocity commands.
  2. Phase signal + contact reference tracking: Improved over direct velocity tracking but led to "stiff and upright torso motion," indicating that while it helps with stepping, it doesn't fully capture expressive body movements.
  3. Current and future kinematic reference poses as policy inputs: Produced visually identical motion to the proposed phase-signal-based approach and converged to similar rewards. This suggests that the phase signal effectively encodes the necessary information without the overhead of storing and reproducing full kinematic references on the robot.

6.1.3. Animation Engine

The animation engine effectively combines multiple layers to create rich and controllable performances:

  • Background animation: Provides a baseline of activity.

  • Triggered animation layer: Adds specific expressive clips.

  • Joystick commands: Allows real-time, direct operator control.

    The combination enables simple, intuitive, and expressive puppeteering, allowing operators to adapt performances to the robot's environment. For example, an operator can direct the robot's gaze towards a person, then trigger a stylized "yes" or "no" animation while simultaneously maintaining eye contact and modifying body posture. Figure 10 shows stills demonstrating the effectiveness of the puppeteering for interactive scenes. A minimal layer-composition sketch is given at the end of this subsection.

    Fig. 10. Stills demonstrating puppeteering of the robot in interactive scenes with an operator.

  • Gaze and Posture Control: The joystick mapping (Figure 11) provides independent control over gaze and body posture, reducing the cognitive load on the puppeteer. The left joystick controls torso posture (yaw and pitch) while counter-rotating the head to maintain a fixed gaze. The right joystick controls head gaze (yaw and pitch), with the torso following additively when the neck reaches its kinematic limits. This functional separation allows puppeteers to easily direct the robot's line of sight while conveying emotion through body posture. The interface was continuously refined with feedback from operators without robotics backgrounds.
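The following is a minimal sketch of how such layers could be composed; the per-joint blend of a background layer, a triggered clip, and additive joystick offsets is an assumption about the engine's structure, not its actual implementation.

```python
def compose_pose(background, triggered, trigger_weight, joystick_offsets):
    """Blend a background layer with a triggered clip and add joystick offsets.
    All arguments map joint names to targets; trigger_weight is in [0, 1]."""
    pose = {}
    for joint, base in background.items():
        overlay = triggered.get(joint, base)
        blended = (1.0 - trigger_weight) * base + trigger_weight * overlay
        pose[joint] = blended + joystick_offsets.get(joint, 0.0)
    return pose

pose = compose_pose(
    background={"neck_pitch": 0.1, "torso_yaw": 0.0},
    triggered={"neck_pitch": 0.4},        # e.g. a "yes" nod clip
    trigger_weight=0.5,
    joystick_offsets={"torso_yaw": 0.2},
)
print(pose)  # neck_pitch ~= 0.25, torso_yaw = 0.2
```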

6.1.4. System Deployment

The robot has been successfully deployed in public, with up to three robots operating simultaneously for about 10 hours without a single fall. This demonstrates the robustness and reliability of the overall system. Audiences found the robot captivating, often attributing sentience or perception capabilities to it (e.g., "can the robot really see me?"). However, some feedback indicated that the visible presence of a puppeteer could be a distraction or reduce the character's believability.

6.2. Data Presentation (Tables)

The following are the results from Table III of the original paper:

| Type | Name | MAE [rad] |
| --- | --- | --- |
| Perpetual | Standing | 0.035 |
| Periodic | Walking | 0.123 |
| Episodic | Excited Motion | 0.029 |
| Episodic | Happy Dance | 0.027 |
| Episodic | Jump | 0.043 |
| Episodic | Tantrum | 0.032 |

The Mean Absolute Error (MAE) for joint positions is consistently low across all motion types, indicating high fidelity in tracking the artist-designed reference motions. Standing and the episodic motions all stay below roughly 0.045 rad, reflecting accurate tracking even for dynamic, rapid movements. Walking, a more dynamic and contact-rich motion, shows the highest MAE at 0.123 rad, which remains within acceptable limits for robust execution.

6.3. Ablation Studies / Parameter Analysis

The paper includes a comparison of Alternative RL Formulations which serves as an implicit ablation study on the choice of policy inputs and reward structures:

  • Impact of Phase Signal and Contact Reward: By comparing the proposed approach (using a phase signal for periodic motions) with:

    1. A policy directly tracking velocities without a phase signal or leg joint/contact rewards.
    2. A policy using a phase signal with contact rewards.

    The authors demonstrate the progressive benefits: removing the phase signal and contact rewards leads to unnatural "shuffling" gaits, while adding them improves the stepping but still results in a "stiff and upright torso motion." Phase and contact information are therefore necessary but not sufficient for natural, expressive whole-body movements; the proposed method's nuanced reward design and input conditioning are essential.
  • Equivalence of Phase Signal vs. Full Kinematic Reference: The comparison with a formulation using current and future kinematic reference poses (instead of just a phase signal) showed that both achieved similar rewards and visually identical motions. This is a practical finding, as it justifies the use of the more compact phase signal, which is easier to implement and less resource-intensive to store and reproduce on the robot, without sacrificing performance. This suggests the phase signal effectively encapsulates the necessary temporal information for accurate motion tracking in this context.

  • Curriculum Learning for Disturbances: In Appendix A, the paper mentions that disturbance forces (Table V) are gradually introduced using a linear curriculum over the first 1,500 iterations of training. This is a form of parameter analysis in training, where gradually increasing the difficulty of the environment helps the policy learn robust behaviors incrementally, rather than being overwhelmed by strong disturbances from the start. A small illustration of such a ramp is given after this list.

    These comparisons and training strategies validate the efficacy of the paper's specific choices for policy input design and reward formulation, demonstrating that they are effective for achieving both robustness and expressiveness.
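As a small illustration of the linear curriculum mentioned above (assuming the disturbance scale simply ramps from 0 to 1 and then holds):

```python
def disturbance_scale(iteration, ramp_iterations=1500):
    """Linear curriculum: disturbance magnitudes ramp from 0 to full strength
    over the first `ramp_iterations` training iterations, then hold."""
    return min(1.0, max(0.0, iteration / ramp_iterations))

print(disturbance_scale(0))     # 0.0 -> no pushes at the start of training
print(disturbance_scale(750))   # 0.5 -> half-strength pushes
print(disturbance_scale(3000))  # 1.0 -> full-strength pushes
```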

7. Conclusion & Reflections

7.1. Conclusion Summary

This work successfully demonstrates a novel approach for designing and controlling bipedal robotic characters that seamlessly integrates expressive, artist-directed motions with robust dynamic mobility. The key contributions include a comprehensive workflow that unifies mechanical design, animation, reinforcement learning, and real-time puppeteering. A new robot was introduced, whose design prioritizes creative intent over purely functional requirements, enabling unique character forms. The reinforcement learning architecture employs specialized policies for perpetual, periodic, and episodic motions, robustly executing artistic content conditioned on high-level commands. Finally, an intuitive operator interface and animation engine allow for real-time show performances, blending various motion sources. The complete system yields believable robotic characters, fostering enhanced human-robot engagement beyond traditional utility-driven robotics, especially in entertainment.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Training Overhead for Multiple Policies: While separating behaviors into multiple specialized policies provides precise control, it leads to significant training overhead, particularly when scaling up the number of episodic motions.
  • Opportunity for Single Policy Learning: A future direction is to explore if a single, more generalized policy could learn several skills with the same level of accuracy as the specialized policies, potentially reducing training complexity and improving versatility. This is a common goal in more advanced hierarchical or multi-task reinforcement learning.
  • Operator Input Limitations: There's a natural limit to how many buttons and joysticks a human puppeteer can effectively use.
  • Embedding Autonomy: To further expand the character's expressive capabilities and potentially alleviate operator load, the authors suggest embedding more autonomy within the animation engine. This could involve the robot making some high-level decisions or stylistic choices on its own.
  • Audience Perception of Autonomy: The authors note that audiences sometimes assume the robot can perceive its environment or understand speech, even when it's being puppeteered. This highlights a gap between perceived and actual autonomy, which could be addressed by incorporating more sensing and intelligent decision-making.

7.3. Personal Insights & Critique

This paper presents a truly inspiring and refreshing perspective on robot design and control. My key insights and critiques are:

  • Interdisciplinary Synergy: The most powerful aspect is the deep integration of animation principles, mechanical engineering, and advanced reinforcement learning. This interdisciplinary synergy is crucial for applications where aesthetics and emotional connection are as important as physical performance. It's a testament to how artistic vision can drive technological innovation, not just adapt to it.
  • "Character-First" Design Philosophy: The decision to design the robot's morphology and kinematics based on creative intent is a paradigm shift for many robotics applications. It allows for fantastical and non-anthropomorphic characters, breaking free from the constraints of imitating human or animal forms when the goal is emotional resonance. This opens up vast possibilities for robotics in entertainment, storytelling, and even educational contexts.
  • Practicality of the Hybrid Control System: The layered control architecture (RL policies for robust low-level execution, animation engine for high-level composition and blending, puppeteering for real-time interaction) is highly practical. It leverages the strengths of RL for robust dynamic movement while retaining human artistic control for expressiveness. The seamless transitions between policies are a critical achievement, making the robot feel fluid and alive.
  • Scalability Challenges (Critique): While the "divide and conquer" RL strategy is effective for a defined set of motions, the acknowledged training overhead for scaling to a large number of episodic motions is a significant practical challenge. Learning a single, more generalized policy that can blend and execute diverse skills on-the-fly with comparable fidelity remains a complex open problem in RL, especially for high-DoF, dynamic systems. The paper correctly identifies this as future work.
  • Human-Robot Interaction Nuances: The observation that audiences attribute intelligence to the robot and are sometimes distracted by the puppeteer highlights a fascinating tension in HRI. As robots become more expressive, the line between machine and character blurs, raising questions about believability, agency, and perceived autonomy. Future work integrating sophisticated perception and AI-driven decision-making into the animation engine could address this, allowing the robot to respond more "intelligently" to its environment and human interaction, further enhancing immersion.
  • Cost and Complexity: While the paper emphasizes simplicity in mechanical design using off-the-shelf components where possible, custom bipedal robots are still inherently complex and expensive to build and maintain. The reliance on specialized tools (e.g., Maya, custom gait generators) and powerful simulation environments (Isaac Gym) also points to a high barrier to entry for smaller research groups or hobbyists.
  • Value Beyond Entertainment: The methodologies presented here, particularly the integration of artistic control with robust dynamic locomotion, have potential applications far beyond entertainment. They could inform the design of more intuitive assistive robots, expressive companion robots, or even educational robots that need to convey information or emotions through body language. The principles for crafting believable, dynamic interaction are broadly applicable.
