
Learning Smooth Humanoid Locomotion through Lipschitz-Constrained Policies

Published: 10/16/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces Lipschitz-Constrained Policies (LCP) to enhance humanoid robot locomotion control. LCP enforces smooth behaviors in a reinforcement learning framework, replacing traditional smoothing rewards, and integrates easily with automatic differentiation. Experiments in simulation and on real humanoid robots show that LCP produces smooth, robust locomotion controllers.

Abstract

Reinforcement learning combined with sim-to-real transfer offers a general framework for developing locomotion controllers for legged robots. To facilitate successful deployment in the real world, smoothing techniques, such as low-pass filters and smoothness rewards, are often employed to develop policies with smooth behaviors. However, because these techniques are non-differentiable and usually require tedious tuning of a large set of hyperparameters, they tend to require extensive manual tuning for each robotic platform. To address this challenge and establish a general technique for enforcing smooth behaviors, we propose a simple and effective method that imposes a Lipschitz constraint on a learned policy, which we refer to as Lipschitz-Constrained Policies (LCP). We show that the Lipschitz constraint can be implemented in the form of a gradient penalty, which provides a differentiable objective that can be easily incorporated with automatic differentiation frameworks. We demonstrate that LCP effectively replaces the need for smoothing rewards or low-pass filters and can be easily integrated into training frameworks for many distinct humanoid robots. We extensively evaluate LCP in both simulation and real-world humanoid robots, producing smooth and robust locomotion controllers. All simulation and deployment code, along with complete checkpoints, is available on our project page: https://lipschitz-constrained-policy.github.io.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is Learning Smooth Humanoid Locomotion through Lipschitz-Constrained Policies. It focuses on developing robust and smooth locomotion controllers for humanoid robots using reinforcement learning by imposing a Lipschitz constraint on the learned policy.

1.2. Authors

The authors are: Zixuan Chen*, Xialin He*, Yen-Jen Wang*, Qiayuan Liao, Yanjie Ze, Zhongyu Li, S. Shankar Sastry, Jiajun Wu, Koushil Sreenath, Saurabh Gupta, Xue Bin Peng. The affiliations are:

  • Simon Fraser University (1)
  • UIUC (2)
  • UC Berkeley (3)
  • Stanford University (4)
  • NVIDIA (5) (* denotes equal contribution)

1.3. Journal/Conference

The paper is published as a preprint on arXiv. While the specific journal or conference is not stated in the provided text, the authors' affiliations and the nature of the research suggest a high-impact venue in robotics, machine learning, or artificial intelligence. Conferences like Conference on Robot Learning (CoRL), International Conference on Robotics and Automation (ICRA), or IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) are typical venues for such work.

1.4. Publication Year

The paper was first posted to arXiv on October 15, 2024 (UTC).

1.5. Abstract

The abstract introduces Reinforcement Learning (RL) and sim-to-real transfer as a general framework for developing locomotion controllers for legged robots. It highlights the challenge of jittery behaviors in policies trained in simulation, which often leads to sim-to-real transfer failures. Current solutions, such as low-pass filters and smoothness rewards, are non-differentiable and require extensive manual tuning. To overcome this, the paper proposes Lipschitz-Constrained Policies (LCP), a novel method that imposes a Lipschitz constraint on the learned policy. This constraint is implemented as a gradient penalty, providing a differentiable objective easily integrated into automatic differentiation frameworks. The authors demonstrate that LCP effectively replaces traditional smoothing techniques, integrates seamlessly into RL training frameworks for various humanoid robots, and produces smooth and robust locomotion controllers in both simulation and real-world deployments.

The official source link is: https://arxiv.org/abs/2410.11825v3. This indicates the paper is a preprint, meaning it has been made publicly available before, or concurrently with, peer review. The provided PDF link is https://arxiv.org/pdf/2410.11825v3.pdf.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the difficulty of achieving smooth and robust locomotion controllers for humanoid robots that can successfully transfer from simulation to the real world.

This problem is important because humanoid robots are designed to operate in human environments, requiring reliable and adaptable mobility. Reinforcement Learning (RL) combined with sim-to-real transfer has emerged as a powerful framework for developing these controllers, alleviating the need for complex model-based designs. However, policies trained in simplified simulations often develop jittery or bang-bang control behaviors. These behaviors lead to rapid, high-frequency changes in actuator commands, which real-world motors cannot physically execute, causing sim-to-real transfer failures.

Prior research has addressed this using:

  1. Smoothness rewards: Penalizing high joint velocities, accelerations, or energy consumption.

  2. Low-pass filters: Applying filters to the policy's output actions.

    However, these existing methods face significant challenges:

  • Smoothness rewards require tedious, platform-specific tuning of numerous hyperparameters to balance smoothness with task performance. They are also typically non-differentiable with respect to the policy parameters, making optimization harder.

  • Low-pass filters can dampen exploration during RL training, potentially leading to sub-optimal policies. They are also generally non-differentiable.

    The paper's entry point is to find a general, differentiable, and effective method to enforce smooth behaviors in RL policies for humanoid locomotion, addressing the limitations of existing techniques. The innovative idea is to impose a Lipschitz constraint on the policy output, which can be implemented as a gradient penalty.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Proposal of Lipschitz-Constrained Policies (LCP): A novel and general method for encouraging smooth RL policy behaviors by imposing a Lipschitz constraint on the policy's output actions with respect to its input observations.

  • Differentiable Implementation: Demonstrating that the Lipschitz constraint can be effectively implemented as a gradient penalty, which is a differentiable objective. This allows for seamless integration with modern automatic differentiation frameworks and gradient-based optimization algorithms.

  • Replacement for Traditional Smoothing: Showing that LCP can effectively replace existing non-differentiable smoothing techniques like smoothness rewards and low-pass filters, simplifying the hyperparameter tuning process.

  • Generalizability Across Robots: Extensive evaluation demonstrating that LCP can be easily integrated into training frameworks for a diverse suite of humanoid robots (e.g., Fourier GR1T1, GR1T2, Unitree H1, Berkeley Humanoid).

  • Robust Sim-to-Real Transfer: Producing smooth and robust locomotion controllers that can be deployed zero-shot to real-world robots, even on challenging terrains and under external perturbations.

    The key conclusions and findings reached by the paper are:

  • LCP successfully encourages smooth policy behaviors, comparable to or better than smoothness rewards, without directly penalizing specific smoothness metrics (like DoF velocities or accelerations) in the reward function.

  • LCP maintains strong task performance, outperforming low-pass filters which can inhibit exploration.

  • The gradient penalty coefficient (\lambda_{\mathrm{gp}}) is a crucial hyperparameter; too small, and behaviors remain jittery; too large, and policies become overly smooth and sluggish, hindering task performance. An optimal balance can be found (e.g., \lambda_{\mathrm{gp}} = 0.002).

  • Applying the gradient penalty to the whole observation (including historical information) is more effective than applying it only to the current observation for maintaining smooth behaviors, especially in sim-to-real transfer frameworks that use observation histories (like ROA).

  • LCP enables humanoid robots to walk robustly on various real-world terrains (smooth, soft, rough) and recover from external forces, indicating its practical utility for real-world deployment.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make sequential decisions by interacting with an environment. The agent observes the state of the environment, takes an action, and receives a reward signal, which indicates the desirability of the action. The goal of the agent is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time.

  • Agent: The learner or decision-maker.
  • Environment: The world with which the agent interacts.
  • State (\mathbf{s}): A complete description of the environment at a given time step.
  • Action (\mathbf{a}): A choice made by the agent that affects the environment.
  • Reward (r): A scalar feedback signal from the environment to the agent, indicating the goodness or badness of the agent's action.
  • Policy (\pi): A function that maps states to probabilities of selecting each action, or directly to actions in deterministic cases. The policy defines the agent's behavior.
  • Trajectory (\tau): A sequence of states, actions, and rewards over time: (\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \dots).
  • Return: The total discounted reward from a given time step. The agent's objective is to maximize the expected return. The paper uses Proximal Policy Optimization (PPO) [48], a popular RL algorithm that optimizes the policy by taking multiple gradient steps on a clipped surrogate objective function, ensuring that new policies do not deviate too far from old policies, which helps with stability.

3.1.2. Sim-to-Real Transfer

Sim-to-real transfer refers to the process of training a robot controller in a simulated environment and then deploying it directly onto a physical robot without further training in the real world. This approach is highly desirable because training RL agents in the real world is often too slow, expensive, and potentially damaging to the robot. The main challenge in sim-to-real transfer is the domain gap: the discrepancies between the simulation (simplified physics, idealized sensors/actuators) and the real world (complex physics, sensor noise, actuator limits, unmodeled disturbances). Techniques like domain randomization (varying simulation parameters during training to expose the policy to a wider range of dynamics) and teacher-student frameworks (where a privileged teacher policy with access to true environment parameters trains a student policy that relies only on sensor observations) are commonly used to bridge this gap. This paper leverages Regularized Online Adaptation (ROA) [6], [32] for sim-to-real transfer.
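For intuition, a minimal sketch of per-episode domain randomization is shown below; the parameter names and ranges are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

# Illustrative per-episode domain randomization (hypothetical parameter names and
# ranges, not the paper's actual configuration).
RANDOMIZATION_RANGES = {
    "friction":        (0.5, 1.25),
    "added_base_mass": (-1.0, 3.0),   # kg added to the torso
    "motor_strength":  (0.8, 1.2),    # multiplier on nominal motor torque
}

def sample_episode_params(rng=None):
    """Resample physical parameters at the start of each training episode."""
    rng = rng or np.random.default_rng()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

print(sample_episode_params())
```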

3.1.3. Lipschitz Continuity

Lipschitz continuity is a mathematical property of a function that quantifies its smoothness or how fast it can change. Intuitively, a Lipschitz continuous function has a bounded rate of change; it cannot change arbitrarily quickly.

Formally, as defined in Definition III.1 of the paper: given two metric spaces (X, d_X) and (Y, d_Y), where d_X denotes the metric on the set X and d_Y the metric on the set Y, a function f: X \to Y is Lipschitz continuous if there exists a real constant K such that, for all \mathbf{x}_1 and \mathbf{x}_2 in X,
d_Y(f(\mathbf{x}_1), f(\mathbf{x}_2)) \leq K \, d_X(\mathbf{x}_1, \mathbf{x}_2)
Here:

  • f: The function (in this paper, the RL policy \pi mapping observations to actions).

  • X, Y: The input and output spaces of the function (e.g., observation space and action space).

  • d_X, d_Y: Metrics (distance functions) in the input and output spaces, typically Euclidean distance (\ell^2-norm).

  • K: The Lipschitz constant. It represents the maximum possible ratio of the change in output to the change in input. A smaller K implies a "smoother" function.

    A crucial corollary mentioned in the paper, relevant to LCP, is that if the gradient of a function is bounded, then the function is Lipschitz continuous. Specifically, if
\| \nabla_{\mathbf{x}} f(\mathbf{x}) \| \leq K
then f is Lipschitz continuous with constant K. This means that by penalizing the norm of the gradient of the policy, we can enforce Lipschitz continuity and thus encourage smoother behavior.
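As a quick illustration of this corollary (a toy example, not from the paper), consider a scalar function with a bounded derivative:

```latex
f(x) = \sin(x), \qquad |\nabla_x f(x)| = |\cos(x)| \le 1 = K
\;\Longrightarrow\;
|\sin(x_1) - \sin(x_2)| \le K \, |x_1 - x_2| \quad \text{for all } x_1, x_2,
```

so \sin is Lipschitz continuous with constant K = 1. LCP applies the same reasoning to the policy's log-probability as a function of the observation, enforcing the gradient bound with a penalty rather than by construction.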

3.1.4. Gradient Penalty

A gradient penalty is a regularization technique commonly used in machine learning, particularly in Generative Adversarial Networks (GANs). It works by penalizing the norm of the gradient of a network's output with respect to its input. The goal is to encourage the network to have gradients with norms close to a target value (often 1 in WGAN-GP), or, as in this paper, to simply keep the gradient norm bounded to enforce smoothness. By adding a term proportional to the gradient norm (or its square) to the loss function, the optimization process is encouraged to find parameters that result in a smoother function. The key advantage is that it provides a differentiable objective, unlike smoothness rewards or low-pass filters applied post-policy.

3.2. Previous Works

The paper contextualizes its work by reviewing existing methods in legged robot locomotion, sim-to-real transfer, learning smooth behaviors, and gradient penalties.

3.2.1. Legged Robot Locomotion

  • Model-based control: Traditional methods like Model Predictive Control (MPC) [12]-[14] require precise system structure and dynamics modeling, which is labor-intensive and challenging.
  • Learning-based methods: Recent model-free Reinforcement Learning (RL) approaches have shown great success in automating controller development for quadrupeds [15]-[18], bipedal robots [11], [19]-[21], and humanoids [7], [22]-[24]. These methods alleviate the need for meticulous dynamics modeling.

3.2.2. Sim-to-Real Transfer Techniques

A major hurdle for RL-based methods is bridging the domain gap between simulation and reality.

  • High-fidelity simulators: Developing more realistic simulators [25], [26] helps reduce the domain gap.
  • Domain randomization: Varying simulation parameters (e.g., friction, mass, motor strength) during training to make the policy robust to uncertainties in the real world [18], [22], [27], [28].
  • Teacher-student frameworks: Training a privileged teacher policy with full state information, then distilling its knowledge into an observation-based student policy [15], [20], [22], [24], [29]-[31]. The current framework also uses a teacher-student framework called Regularized Online Adaptation (ROA) [6], [16], [23], which trains a latent representation of the dynamics based on observation history.
  • Unified policies: Some work explores single policies for robots with different morphologies [34], but their validation on real humanoids and ease of integration are noted as limitations.

3.2.3. Learning Smooth Behaviors

Policies trained in simulation often exhibit jittery (bang-bang-like) behaviors due to simplified dynamics. These high-frequency action changes are not physically realizable by real actuators and lead to sim-to-real transfer failures.

  • Smoothness rewards: Common techniques include penalizing sudden changes in actions, DoF velocities, DoF accelerations [24], [29], [31], [32], [35], [36], and energy consumption [6], [9]. These require careful manual design and tuning of weights and are typically non-differentiable.
  • Low-pass filters: Applying filters to policy output actions to smooth them before execution [10], [11], [18], [37]. These can dampen exploration and are also generally not directly differentiable for policy training.

3.2.4. Gradient Penalty in Other Contexts

Gradient penalties are well-established regularization techniques.

  • GANs: Introduced in Wasserstein GAN (WGAN) [38] with weight clipping, and later refined by WGAN-GP [39] to penalize the norm of the discriminator's gradient to stabilize GAN training and prevent vanishing/exploding gradients. It has become widely used in GANs [40], [41].
  • Adversarial Imitation Learning: Used in systems like AMP [42], CALM [43], and ASE [44] to regularize an adversarial discriminator, enabling policies to imitate complex motions.
  • Differentiation: While prior work used gradient penalties for discriminators in GANs or imitation learning, this paper innovatively applies a gradient penalty directly to the RL policy itself as a regularizer to encourage smooth behaviors.

3.3. Technological Evolution

The field has evolved from traditional model-based control, which requires extensive manual modeling, to model-free reinforcement learning, which automates controller design. Early RL applications often struggled with sim-to-real transfer due to the reality gap and jittery policies. Domain randomization and teacher-student architectures helped bridge the gap, but the issue of policy smoothness persisted, leading to heuristic solutions like smoothness rewards and low-pass filters.

This paper's work (LCP) represents a significant step in this evolution by offering a more principled, mathematically grounded (Lipschitz continuity), and computationally efficient (differentiable gradient penalty) approach to policy smoothness. It moves away from empirical, non-differentiable smoothing heuristics towards a differentiable regularization technique that can be seamlessly integrated into modern RL frameworks.

3.4. Differentiation Analysis

Compared to the main methods in related work, LCP offers several core differences and innovations:

  • Differentiability: Unlike smoothness rewards (which are part of the environment dynamics and thus non-differentiable with respect to policy parameters) and low-pass filters (applied post-policy, making gradient propagation difficult), LCP's gradient penalty is a differentiable objective. This allows for direct optimization using gradient-based methods, which are standard in deep reinforcement learning.
  • Generality & Simplicity: LCP provides a general technique that can be applied across diverse humanoid robots with minimal changes, replacing the need for complex, robot-specific smoothness reward designs and their associated hyperparameter tuning. It requires only a single hyperparameter (\lambda_{\mathrm{gp}}) for tuning.
  • Theoretical Foundation: LCP grounds its approach in Lipschitz continuity, a well-defined mathematical concept for function smoothness. This offers a more theoretical basis compared to heuristic smoothness rewards that penalize arbitrary combinations of velocities and accelerations.
  • Direct Policy Regularization: Instead of indirectly influencing smoothness through rewards or post-processing actions with filters, LCP directly regularizes the policy's gradient, pushing it towards smoother action outputs with respect to observations.
  • Improved Exploration: Low-pass filters can dampen exploration because they filter out high-frequency actions that might be necessary for initial exploration. LCP, by directly regularizing the policy, aims to learn inherently smooth behaviors without restricting the action space during exploration.

4. Methodology

4.1. Principles

The core idea behind Lipschitz-Constrained Policies (LCP) is to leverage the mathematical property of Lipschitz continuity to enforce smooth behaviors in Reinforcement Learning (RL) policies. Lipschitz continuity provides a way to quantify how "smooth" a function is by bounding its rate of change. The key principle is derived from a corollary of Lipschitz continuity: if the gradient of a function is bounded, then the function itself is Lipschitz continuous. Therefore, by imposing a constraint on the gradient norm of the RL policy with respect to its input observations, the paper aims to encourage the policy to produce smooth output actions. This constraint is translated into a differentiable gradient penalty term added to the standard RL objective function.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Motivating Example: Gradient Magnitude and Smoothness

The paper begins by illustrating the motivation for LCP with a simple experiment (Figure 3). RL-based policies are known to produce jittery behaviors. A common way to mitigate this is using smoothness rewards. The smoothness of a function is inherently related to its derivatives. The authors compare the \ell^2-norm of the gradient of policies trained with and without smoothness rewards. The observation is that policies trained with smoothness rewards exhibit significantly smaller gradient magnitudes compared to those without. This direct correlation between smaller gradient magnitudes and smoother behaviors inspires LCP, which explicitly regularizes the policy's gradient.

The following figure (Figure 3 from the original paper) shows the gradient changes of policies trained with and without smoothness rewards during training.

Fig. 3: Gradient of policies trained with and without smoothness rewards. Policies with smoother behaviors also exhibit smaller gradient magnitudes.

4.2.2. Lipschitz Constraint as a Differentiable Objective

Traditional smoothness rewards are often complex to design, require extensive hyperparameter tuning, and are non-differentiable with respect to policy parameters, making gradient-based optimization challenging. LCP addresses this by formulating smoothness as a differentiable objective based on Lipschitz continuity.

As discussed in Section 3.1.3, if the gradient of a function is bounded, the function is Lipschitz continuous. Combining this with the gradient bound \|\nabla_{\mathbf{x}} f(\mathbf{x})\| \leq K (Equation 2 in the paper) leads to the formulation of a constrained policy optimization problem:
\max_{\pi} \; J(\pi) \quad \text{s.t.} \quad \max_{\mathbf{s}, \mathbf{a}} \left[ \lVert \nabla_{\mathbf{s}} \log \pi(\mathbf{a} \mid \mathbf{s}) \rVert^2 \right] \leq K^2
where:

  • J(\pi): The standard Reinforcement Learning (RL) objective (expected return) that the policy \pi aims to maximize. It is defined in Equation 3 as J(\pi) = \mathbb{E}_{p(\tau \mid \pi)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], where \pi is the policy, \mathbb{E} is the expectation, p(\tau \mid \pi) is the likelihood of a trajectory \tau given policy \pi, t is the timestep, T is the time horizon, \gamma is the discount factor, and r_t is the reward at timestep t.

  • \log \pi(\mathbf{a} \mid \mathbf{s}): The log-probability of taking action \mathbf{a} in state \mathbf{s} according to policy \pi. This is used because RL policies often output action distributions.

  • \nabla_{\mathbf{s}} \log \pi(\mathbf{a} \mid \mathbf{s}): The gradient of the log-probability of the action with respect to the input state \mathbf{s}. This measures how sensitive the policy's action output is to changes in the input state.

  • \| \cdot \|^2: The squared \ell^2-norm (Euclidean norm) of the gradient.

  • \max_{\mathbf{s}, \mathbf{a}}[\cdot]: The maximum value of the squared gradient norm over all possible states \mathbf{s} and actions \mathbf{a}.

  • K^2: A constant representing the squared upper bound for the gradient norm, derived from the Lipschitz constant K.

    Calculating the maximum gradient norm across all states is intractable. Following a heuristic from Schulman et al. [47], this constraint is approximated by an expectation over samples collected from policy rollouts:
\max_{\pi} \; J(\pi) \quad \text{s.t.} \quad \mathbb{E}_{\mathbf{s}, \mathbf{a} \sim \mathcal{D}} \left[ \lVert \nabla_{\mathbf{s}} \log \pi(\mathbf{a} \mid \mathbf{s}) \rVert^2 \right] \leq K^2,
where:

  • \mathbb{E}_{\mathbf{s}, \mathbf{a} \sim \mathcal{D}}[\cdot]: The expectation is taken over state-action pairs (\mathbf{s}, \mathbf{a}) sampled from a dataset \mathcal{D}.

  • \mathcal{D}: A dataset consisting of state-action pairs collected during policy rollouts (i.e., interactions with the environment).

    To facilitate optimization with gradient-based methods, this constrained optimization problem is reformulated into an unconstrained one by introducing a Lagrange multiplier \lambda:
\min_{\lambda \geq 0} \max_{\pi} \quad J(\pi) - \lambda \left( \mathbb{E}_{\mathbf{s}, \mathbf{a} \sim \mathcal{D}} \left[ \| \nabla_{\mathbf{s}} \log \pi(\mathbf{a} \mid \mathbf{s}) \|^2 \right] - K^2 \right).
Here, the Lagrange multiplier \lambda controls the trade-off between maximizing the RL objective and satisfying the gradient constraint. Minimizing over \lambda and maximizing over \pi expresses the Lagrangian dual problem.

Further simplification is achieved by treating \lambda_{\mathrm{gp}} as a manually specified, fixed coefficient (instead of an adaptively learned Lagrange multiplier). Since K^2 is a constant, it can be absorbed into the \lambda_{\mathrm{gp}} term. This leads to the final, simple, and differentiable gradient penalty (GP) objective:
\max_{\pi} \; J(\pi) - \lambda_{\mathrm{gp}} \, \mathbb{E}_{\mathbf{s}, \mathbf{a} \sim \mathcal{D}} \left[ \| \nabla_{\mathbf{s}} \log \pi(\mathbf{a} \mid \mathbf{s}) \|^2 \right].
Here:

  • \lambda_{\mathrm{gp}}: A manually specified gradient penalty coefficient that determines the strength of the smoothness regularization.

  • The objective is to maximize the RL return J(\pi) while simultaneously minimizing the expected squared gradient norm of the policy's log-probability with respect to its input state. This directly encourages the policy to be less sensitive to small changes in its input observations, thereby promoting smoother action outputs.

    This gradient penalty can be easily implemented using automatic differentiation frameworks (e.g., PyTorch, TensorFlow) by computing the gradient of the policy's log-probability with respect to the input observations and then adding its squared norm to the loss function.
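As a concrete illustration, here is a minimal PyTorch sketch of such a penalty; the toy policy class, layer sizes, and batch shapes are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch: a toy Gaussian policy and the LCP penalty
# E_{s,a}[ || grad_s log pi(a|s) ||^2 ]. Sizes and shapes are illustrative.
class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ELU(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def lcp_gradient_penalty(policy: nn.Module, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    obs = obs.detach().requires_grad_(True)               # differentiate w.r.t. observations
    log_prob = policy(obs).log_prob(actions).sum(dim=-1)  # log pi(a|s) per sample
    # Keep the graph so the penalty itself is differentiable w.r.t. policy parameters.
    grad, = torch.autograd.grad(log_prob.sum(), obs, create_graph=True)
    return grad.pow(2).sum(dim=-1).mean()

policy = GaussianPolicy(obs_dim=48, act_dim=19)
obs, actions = torch.randn(4096, 48), torch.randn(4096, 19)
penalty = lcp_gradient_penalty(policy, obs, actions)
loss = 0.002 * penalty              # in practice: ppo_loss + lambda_gp * penalty
loss.backward()
```

In a full training step this term is simply added to the PPO loss with weight \lambda_{\mathrm{gp}}, which is what makes LCP easy to drop into existing training code.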

4.2.3. Training Setup

The paper applies LCP to train locomotion policies for humanoid robots to walk and follow steering commands.

  • Observations (\mathbf{o}_t): The input to the policy at time t is \mathbf{o}_t = [\phi_t, \mathbf{c}_t, \mathbf{s}_t^{\mathrm{robot}}, \mathbf{a}_{t-1}, \mathbf{e}_t].

    • \phi_t \in \mathbb{R}^2: A gait phase variable, represented by its sine and cosine components (a periodic clock signal). This helps the robot synchronize its gait.
    • \mathbf{c}_t: The command input, specifying desired velocities.
    • \mathbf{s}_t^{\mathrm{robot}}: Measured joint positions and velocities of the robot.
    • \mathbf{a}_{t-1}: The previous action taken by the policy. This provides temporal context for the policy.
    • \mathbf{e}_t: Privileged information, available during simulation training but not real-world deployment. This includes base mass, center of mass, motor strengths, and root linear velocity. This information is used to train a latent representation for sim-to-real transfer. Observations \mathbf{o}_t are normalized with a running mean and standard deviation before being fed into the policy network.
  • Commands (\mathbf{c}_t): The command input is \mathbf{c}_t = [\mathbf{v}_{\mathrm{x}}^{\mathrm{cmd}}, \mathbf{v}_{\mathrm{y}}^{\mathrm{cmd}}, \mathbf{v}_{\mathrm{yaw}}^{\mathrm{cmd}}] in the robot frame.

    • \mathbf{v}_{\mathrm{x}}^{\mathrm{cmd}} \in [0~\mathrm{m/s}, 0.8~\mathrm{m/s}]: Desired linear velocity along the X-axis (forward/backward).
    • \mathbf{v}_{\mathrm{y}}^{\mathrm{cmd}} \in [-0.4~\mathrm{m/s}, 0.4~\mathrm{m/s}]: Desired linear velocity along the Y-axis (sideways).
    • \mathbf{v}_{\mathrm{yaw}}^{\mathrm{cmd}} \in [-0.6~\mathrm{rad/s}, 0.6~\mathrm{rad/s}]: Desired yaw velocity (rotational velocity around the vertical axis). During training, commands are randomly sampled from their respective ranges every 150 timesteps or upon environment reset to encourage robust command following.
  • Actions: The policy's output actions specify target joint rotations for all active joints. These target rotations are then converted into torque commands using Proportional-Derivative (PD) controllers with manually specified PD gains.

  • Training:

    • Policies are modeled using neural networks.
    • Training is performed using the Proximal Policy Optimization (PPO) algorithm [48].
    • Training is conducted solely in simulation with domain randomization [27] to enhance sim-to-real transfer.
    • Sim-to-real transfer is further facilitated by Regularized Online Adaptation (ROA) [6], [32].
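As a concrete illustration of the action-to-torque conversion described in the Actions item above, here is a minimal PD-control sketch; the gains and torque limit are hypothetical placeholders, not the paper's values.

```python
import numpy as np

# Minimal PD sketch: convert the policy's target joint rotations into torques.
# kp, kd, and the torque limit are illustrative placeholders.
def pd_torques(q_target, q, qd, kp=40.0, kd=1.0, tau_limit=80.0):
    tau = kp * (q_target - q) - kd * qd         # proportional on position error, derivative on velocity
    return np.clip(tau, -tau_limit, tau_limit)  # respect actuator torque limits

# Example: 19 actively controlled joints.
q_target = np.zeros(19)
q = np.random.uniform(-0.1, 0.1, 19)
qd = np.zeros(19)
print(pd_torques(q_target, q, qd))
```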

4.2.4. Regularized Online Adaptation (ROA)

As described in Appendix A, ROA is employed for sim-to-real transfer. In this framework:

  • An encoder \mu maps the privileged information \mathbf{e} (available only in simulation) to an environment extrinsic latent vector \mathbf{z}^\mu. This latent vector captures unknown environment parameters or dynamics.

  • An adaptation module \phi is trained to estimate this latent vector \mathbf{z}^\mu based only on the robot's recent history of proprioceptive observations (observations available on the real robot, such as joint positions and velocities). During real-world deployment, this adaptation module provides the latent vector to the policy.

    The full training loss for ROA combined with the Lipschitz constraint is given by:
L(\theta_\pi, \theta_\mu, \theta_\phi) = -L^{PPO}(\theta_\pi, \theta_\mu) + \lambda \left\| \mathbf{z}^\mu - \mathrm{sg}\!\left[ \mathbf{z}^\phi \right] \right\| + \left\| \mathrm{sg}\!\left[ \mathbf{z}^\mu \right] - \mathbf{z}^\phi \right\| + \lambda_{\mathrm{gp}} L_{\mathrm{gp}}(\pi),
where:

  • L(\theta_\pi, \theta_\mu, \theta_\phi): The total loss function for optimizing the policy network parameters \theta_\pi, encoder network parameters \theta_\mu, and adaptation module parameters \theta_\phi.

  • -L^{PPO}(\theta_\pi, \theta_\mu): The negative of the PPO objective. PPO is an actor-critic algorithm, and its loss typically includes a policy loss (for the actor) and a value loss (for the critic). The negative sign indicates maximization of the PPO objective. The policy in this context uses the latent vector from the encoder (or adaptation module) as part of its observation.

  • \lambda \left\| \mathbf{z}^\mu - \mathrm{sg}[\mathbf{z}^\phi] \right\|: Part of the ROA adaptation loss. \mathrm{sg} denotes the stop-gradient operator, so for this term gradients do not flow through \mathbf{z}^\phi; the adaptation module's prediction is treated as a fixed target, and the encoder's latent \mathbf{z}^\mu is regularized toward it.

  • \left\| \mathrm{sg}[\mathbf{z}^\mu] - \mathbf{z}^\phi \right\|: The other adaptation loss term. It penalizes the difference between the privileged latent vector \mathrm{sg}[\mathbf{z}^\mu] (treated as a fixed target) and the adaptation module's prediction \mathbf{z}^\phi. Together, these two terms train \phi to accurately infer environment parameters from proprioceptive history while keeping the encoder's latent consistent with what can be estimated from that history.

  • \lambda_{\mathrm{gp}} L_{\mathrm{gp}}(\pi): The Lipschitz constraint penalty term, where L_{\mathrm{gp}}(\pi) is defined as
L_{\mathrm{gp}}(\pi) = \mathbb{E}_{\mathbf{s}, \mathbf{a} \sim \mathcal{D}} \left[ \| \nabla_{\mathbf{s}} \log \pi(\mathbf{a} \mid \mathbf{s}) \|^2 \right]
This is the gradient penalty introduced in Equation 7, which encourages the policy \pi to be Lipschitz continuous with respect to its input observations. The state \mathbf{s} here encompasses the full policy input, including the latent vector provided by ROA.

The paper sets \lambda_{\mathrm{gp}} = 0.002 and \lambda = 0.1 during training.
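A minimal sketch of how this combined objective could be assembled in PyTorch is shown below; the variable names (ppo_objective, z_mu, z_phi, gp) are assumptions, and the individual terms are presumed to be computed elsewhere in the training step.

```python
import torch

# Sketch of the combined ROA + LCP objective above (assumed variable names).
def roa_lcp_loss(ppo_objective: torch.Tensor,
                 z_mu: torch.Tensor, z_phi: torch.Tensor,
                 gp: torch.Tensor,
                 lam: float = 0.1, lam_gp: float = 0.002) -> torch.Tensor:
    enc_term = torch.norm(z_mu - z_phi.detach(), dim=-1).mean()    # lambda * || z_mu - sg[z_phi] ||
    adapt_term = torch.norm(z_mu.detach() - z_phi, dim=-1).mean()  # || sg[z_mu] - z_phi ||
    return -ppo_objective + lam * enc_term + adapt_term + lam_gp * gp

# Dummy call with toy shapes, for illustration only.
z_mu, z_phi = torch.randn(4096, 8), torch.randn(4096, 8)
print(roa_lcp_loss(torch.tensor(1.0), z_mu, z_phi, torch.tensor(0.5)))
```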

4.2.5. Reward Curriculum

The paper mentions using a reward curriculum for training. The total reward expression is:
\mathbb{E} \left[ \sum_{t=1}^{T} \gamma^{t-1} \sum_i s_{t,i} r_{t,i} \right]
where:

  • r_{t,i}: The i-th reward term at time t.
  • s_{t,i}: A scaling factor applied to individual reward terms, dynamically adjusted as
s_{t,i} = \begin{cases} s_{\mathrm{current}} & \text{if } r_{t,i} < 0 \\ 1 & \text{if } r_{t,i} \geq 0 \end{cases}
Here, s_{\mathrm{current}} starts at 0.8 and is adjusted during training; an average episode length below 50 is taken as a sign that the robot is struggling to complete the task. The curriculum scales down negative (regularization) reward terms early in training to allow more exploration, and scales them up later to shape the desired behaviors, though the specific rule for adjusting s_{\mathrm{current}} is not fully detailed in the paper.
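A minimal sketch of the per-term scaling defined above is given below; how s_current itself is updated is not specified in the paper, so that part is left out.

```python
def scale_reward_terms(reward_terms, s_current=0.8):
    """Apply the curriculum scaling s_{t,i}: negative terms scaled by s_current, others left at 1."""
    return [r if r >= 0.0 else s_current * r for r in reward_terms]

# Example: positive tracking rewards are untouched; negative regularization terms are damped.
print(scale_reward_terms([1.2, -0.3, 0.5, -0.05], s_current=0.8))
```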

4.2.6. Training Details

  • Simulator: IsaacGym [26], a high-performance GPU-accelerated physics simulator, is used.

  • Parallel Environments: 4096 parallel environments are used, which is common for efficient RL training with IsaacGym.

  • Reward Functions: The reward functions consist of various components. The specific implementation details are referred to the codebase. However, a table of regularization rewards and their weights is provided in Appendix C.

    The following are the results from Table IV of the original paper:

    Name Weight
    Angular velocity xy penalty 0.2
    Joint torques 6e-7
    Collisions 10
    Linear velocity z -1.5
    Contact force penalty -0.002
    Feet Stumble -1.25
    Dof position limit -10
    Base Orientation -1.0

These regularization rewards encourage specific desired behaviors (e.g., small angular velocity in x-y plane for stability) and penalize undesired ones (e.g., high joint torques, collisions, falling, feet stumbling, joint limit violations). The negative weights typically indicate penalties.

5. Experimental Setup

5.1. Datasets

The experiments primarily involve training and evaluating locomotion policies on various simulated and real-world humanoid robots. The "datasets" are essentially the robotic platforms themselves, as policies are trained through interaction with these simulated and then real environments.

The paper evaluates its framework on four distinct humanoid robots:

  1. Fourier GR1T1 & Fourier GR1T2: These are human-sized robots with the same mechanical structure. They have 21 joints in total (12 in the lower body, 9 in the upper body). For control, the ankle roll joint is treated as passive due to minimal torque limit, so 19 joints are actively controlled.

  2. Unitree H1: This robot has 19 actively controlled joints (10 in the lower body, 9 in the upper body, and 1 ankle joint per leg).

  3. Berkeley Humanoid: A smaller robot, 0.85 m tall, with 12 degrees of freedom (6 joints in each leg, including 2 ankle joints).

    These robots were chosen because they represent a diverse suite of humanoid robots with varying morphologies and degrees of freedom, allowing the authors to demonstrate the generality and scalability of LCP as a smoothing technique. They are effective for validating the method's performance across different platforms.

5.2. Evaluation Metrics

To evaluate the effectiveness of LCP and compare it against other smoothing techniques, the paper records a suite of smoothness metrics and the mean task return. For all smoothness metrics, lower values indicate smoother behavior and are generally desirable for real-world transfer. For Task Return, higher values are better.

  1. Action Jitter (\mathrm{rad/s^3}):

    • Conceptual Definition: Action jitter quantifies the abruptness or choppiness of the robot's actions. It measures how rapidly the rate of change of actions itself changes. High action jitter indicates rapid and erratic fluctuations in motor commands, which is undesirable for real-world actuators and energy efficiency. It is defined as the third derivative of the output actions with respect to time.
    • Mathematical Formula: Let \mathbf{a}(t) be the action (e.g., target joint rotation) at time t. The action jitter \mathbf{J}_a(t) is given by the third derivative: $ \mathbf{J}_a(t) = \frac{d^3 \mathbf{a}(t)}{dt^3} $ The metric reported in the paper is typically the mean magnitude or RMS of this quantity over an episode or trial.
    • Symbol Explanation:
      • \mathbf{a}(t): The action vector at time t.
      • \frac{d^3}{dt^3}: The third derivative with respect to time.
      • \mathrm{rad/s^3}: Units for angular jerk (action jitter) for rotational joints.
  2. DoF Position Jitter (\mathrm{rad/s^3}):

    • Conceptual Definition: Similar to action jitter, DoF position jitter measures the jerk (third derivative) of the robot's joint positions. It indicates how erratic the joint movements themselves are. High values mean the robot's limbs are moving in a very choppy or spasmodic manner.
    • Mathematical Formula: Let \mathbf{q}(t) be the joint position vector (Degrees of Freedom) at time t. The DoF position jitter \mathbf{J}_q(t) is given by the third derivative: $ \mathbf{J}_q(t) = \frac{d^3 \mathbf{q}(t)}{dt^3} $ The metric reported is typically the mean magnitude or RMS over an episode or trial.
    • Symbol Explanation:
      • \mathbf{q}(t): The joint position vector (DoF) at time t.
      • \frac{d^3}{dt^3}: The third derivative with respect to time.
      • \mathrm{rad/s^3}: Units for angular jerk for rotational joints.
  3. DoF Velocity (\mathrm{rad/s}):

    • Conceptual Definition: DoF velocity refers to the angular velocity of the robot's joints. High joint velocities can indicate aggressive movements, potentially leading to higher energy consumption and wear and tear on actuators. Lower mean DoF velocities generally imply smoother, more controlled movements. The paper reports the mean over an episode.
    • Mathematical Formula: Let \mathbf{q}(t) be the joint position vector at time t. The DoF velocity \dot{\mathbf{q}}(t) is given by the first derivative: $ \dot{\mathbf{q}}(t) = \frac{d \mathbf{q}(t)}{dt} $ The metric reported in the paper is \mathbb{E}[\|\dot{\mathbf{q}}(t)\|] over time t.
    • Symbol Explanation:
      • \mathbf{q}(t): The joint position vector (DoF) at time t.
      • \frac{d}{dt}: The first derivative with respect to time.
      • \mathrm{rad/s}: Units for angular velocity.
  4. Energy (\mathrm{N \cdot rad / s}):

    • Conceptual Definition: Energy consumption measures the power expended by the robot's motors. In robotics, it is often approximated by the sum of absolute motor torques multiplied by joint velocities. Lower energy consumption indicates more efficient and smoother locomotion. The paper specifies units of \mathrm{N \cdot rad/s}, which is equivalent to Watts (W), representing power. The metric reported is the mean power consumed over an episode.
    • Mathematical Formula: For a single joint j, the instantaneous power P_j(t) is given by: $ P_j(t) = |\tau_j(t) \cdot \dot{q}_j(t)| $ where \tau_j(t) is the torque at joint j and \dot{q}_j(t) is the angular velocity of joint j. The total energy (mean power) over all joints is: $ \text{Energy} = \mathbb{E}\left[\sum_j P_j(t)\right] = \mathbb{E}\left[\sum_j |\tau_j(t) \cdot \dot{q}_j(t)|\right] $ The metric reported in the paper is the mean over an episode.
    • Symbol Explanation:
      • \tau_j(t): Torque applied at joint j at time t.
      • \dot{q}_j(t): Angular velocity of joint j at time t.
      • |\cdot|: Absolute value.
      • \mathrm{N \cdot rad / s}: Units for power (torque in Newton-meters times angular velocity in radians per second, which gives Watts).
  5. Base Acc (\mathrm{m/s^2}):

    • Conceptual Definition: Base acceleration measures the linear acceleration of the robot's base (torso or main body). High base acceleration indicates jerky or unstable overall body movements, which can be uncomfortable for observers and potentially less stable for the robot. Lower base acceleration suggests smoother and more stable locomotion.
    • Mathematical Formula: Let \mathbf{p}_{\mathrm{base}}(t) be the position vector of the robot's base in 3D space at time t. The base acceleration \ddot{\mathbf{p}}_{\mathrm{base}}(t) is given by the second derivative: $ \ddot{\mathbf{p}}_{\mathrm{base}}(t) = \frac{d^2 \mathbf{p}_{\mathrm{base}}(t)}{dt^2} $ The metric reported in the paper is typically the mean magnitude or RMS of this quantity over an episode or trial.
    • Symbol Explanation:
      • \mathbf{p}_{\mathrm{base}}(t): Position vector of the robot's base at time t.
      • \frac{d^2}{dt^2}: The second derivative with respect to time.
      • \mathrm{m/s^2}: Units for linear acceleration.
  6. Task Return (\uparrow):

    • Conceptual Definition: Task return is the cumulative discounted reward obtained by the policy over an episode, as defined by the RL objective (Equation 3). In this paper, it is calculated using linear and angular velocity tracking rewards, meaning the robot receives higher rewards for accurately following the given command velocities (\mathbf{v}_{\mathrm{x}}^{\mathrm{cmd}}, \mathbf{v}_{\mathrm{y}}^{\mathrm{cmd}}, \mathbf{v}_{\mathrm{yaw}}^{\mathrm{cmd}}). Higher task return indicates better performance in achieving the locomotion goal.
    • Mathematical Formula: $ J(\pi) = \mathbb{E}_{\tau \sim p(\cdot \mid \pi)}\left[ \sum_{t=0}^{T-1} \gamma^t r_t \right] $ where the reward r_t at each step is typically a sum of several terms, including tracking rewards (e.g., r_{\mathrm{tracking}} = w_x \exp(-\alpha_x \|\mathbf{v}_x - \mathbf{v}_x^{\mathrm{cmd}}\|^2) + \dots).
    • Symbol Explanation:
      • J(\pi): The expected return (or task return) of policy \pi.
      • \mathbb{E}_{\tau \sim p(\cdot \mid \pi)}[\cdot]: Expectation over trajectories \tau generated by policy \pi.
      • T: Time horizon of the episode.
      • \gamma: Discount factor (typically between 0 and 1), which makes immediate rewards more valuable than future rewards.
      • r_t: Reward received at timestep t.
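To make the metrics above concrete, the following sketch computes several of them from a logged rollout via finite differences; the array layout, timestep, and toy data are assumptions, not the authors' evaluation code.

```python
import numpy as np

# Sketch: approximate smoothness metrics from logged joint positions and torques.
def smoothness_metrics(q: np.ndarray, tau: np.ndarray, dt: float) -> dict:
    qd = np.gradient(q, dt, axis=0)       # DoF velocity        [rad/s]
    qdd = np.gradient(qd, dt, axis=0)     # DoF acceleration    [rad/s^2]
    jerk = np.gradient(qdd, dt, axis=0)   # DoF position jitter [rad/s^3]
    power = np.abs(tau * qd).sum(axis=1)  # sum_j |tau_j * qd_j| per timestep
    return {
        "dof_velocity": np.abs(qd).mean(),
        "dof_pos_jitter": np.abs(jerk).mean(),
        "energy": power.mean(),
    }

T, n_joints = 500, 19
t = np.linspace(0.0, 10.0, T)[:, None]
q = 0.3 * np.sin(2 * np.pi * 1.5 * t) * np.ones((T, n_joints))  # toy periodic gait
tau = 5.0 * np.random.randn(T, n_joints)                        # toy torques
print(smoothness_metrics(q, tau, dt=10.0 / T))
```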

5.3. Baselines

The paper compares LCP against the following baselines to demonstrate its effectiveness:

  • No smoothing: Policies are trained without any explicit smoothing techniques. This baseline is crucial for highlighting the inherent jittery behaviors of RL policies in simulation and the necessity of smoothing for sim-to-real transfer.
  • Smoothness rewards: This baseline incorporates additional reward terms into the RL objective function to encourage smooth behaviors, such as penalizing joint velocities, accelerations, or energy consumption. This is the most commonly used smoothing method.
  • Low-pass Filters: This baseline applies a low-pass filter to the policy's output actions before they are sent to the environment. This post-processing step aims to remove high-frequency components from the action signals, thereby smoothing the robot's movements.
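For reference, the low-pass filter baseline is often realized as a simple exponential moving average over actions; a minimal sketch under that assumption follows (the paper does not specify the exact filter or its cutoff).

```python
import numpy as np

# One common realization of an action low-pass filter: an exponential moving average.
class ActionLowPassFilter:
    def __init__(self, alpha: float = 0.8):
        self.alpha = alpha   # weight on the new action; smaller = stronger smoothing
        self.prev = None

    def __call__(self, action: np.ndarray) -> np.ndarray:
        if self.prev is None:
            self.prev = action
        else:
            self.prev = self.alpha * action + (1.0 - self.alpha) * self.prev
        return self.prev

lpf = ActionLowPassFilter(alpha=0.8)
for raw_action in np.random.randn(5, 19):  # five toy policy outputs for 19 joints
    filtered = lpf(raw_action)             # the filtered action is what gets executed
print(filtered)
```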

6. Results & Analysis

The experiments extensively evaluate LCP in both simulation and real-world humanoid robots, comparing its performance to commonly used smoothing techniques. The core goal is to show that LCP effectively produces smooth and robust locomotion controllers.

6.1. Core Results Analysis

6.1.1. Effectiveness of LCP for Producing Smooth Behaviors

The authors train policies with LCP using a gradient penalty coefficient of \lambda_{\mathrm{gp}} = 0.002 and compare their smoothness metrics against policies trained with and without smoothness rewards.

The following figure (Figure 4 from the original paper) shows the performance comparison of different smoothing techniques in terms of action rate, DoF acceleration, DoF velocity, and energy consumption.

Fig. 4: Performance of different smoothing techniques in terms of action rate, DoF acceleration, DoF velocity, and energy consumption during training. The red curves denote the proposed LCP, which produces smooth behaviors comparable to policies that are trained with explicit smoothness rewards.

  • Analysis of Figure 4:
    • The charts display the evolution of Action Rate, DoF Acceleration, DoF Velocity, and Energy during training. Lower values indicate smoother behavior.

    • The No Smoothing curve consistently shows the highest values across all smoothness metrics, confirming that policies trained without explicit smoothing exhibit highly jittery behaviors.

    • Smoothness Rewards effectively reduce these metrics, demonstrating their intended effect.

    • Crucially, LCP (ours) achieves comparable, and in some cases even lower, values for these smoothness metrics relative to Smoothness Rewards. This is a significant finding because LCP does not directly penalize these quantities in the reward function; instead, it enforces Lipschitz continuity through a gradient penalty. This suggests that bounding the policy's gradient is an effective proxy for achieving overall smooth behavior across various aspects of robot motion.

      This demonstrates that LCP can be an effective substitute for traditional smoothness rewards in eliciting smooth behaviors from a learned policy.

6.1.2. How LCP Affects Task Performance

Beyond smoothness, it's vital that smoothing techniques do not significantly degrade the robot's ability to perform its primary task (e.g., walking and command following). The paper compares the task performance of LCP with other smoothing methods.

The following are the results from Table I(a) of the original paper:

Method Action Jitter ↓ DoF Pos Jitter ↓ DoF Velocity ↓ Energy ↓ Base Acc ↓ Task Return ↑
(a) Ablation on Smooth Methods
LCP (ours) 3.21 ± 0.11 0.17 ± 0.01 10.65 ± 0.37 24.57 ± 1.17 0.06 ± 0.002 26.03 ± 1.51
Smoothness Reward 5.74 ± 0.08 0.19 ± 0.002 11.35 ± 0.51 25.92 ± 0.84 0.06 ± 0.002 26.56 ± 0.26
Low-pass Filter 7.86 ± 3.00 0.23 ± 0.04 11.72 ± 0.14 32.83 ± 5.50 0.06 ± 0.002 24.98 ± 1.29
No Smoothness 42.19 ± 4.72 0.41 ± 0.08 12.92 ± 0.99 42.68 ± 10.27 0.09 ± 0.01 28.87 ± 0.85
  • Analysis of Table I(a):
    • No Smoothing: Achieves the highest Task Return (28.87), indicating it's very effective at task completion in simulation. However, it exhibits by far the worst smoothness metrics (Action Jitter: 42.19, DoF Pos Jitter: 0.41, Energy: 42.68, DoF Velocity: 12.92, Base Acc: 0.09), confirming its unsuitability for real-world deployment.

    • Smoothness Reward: Significantly improves smoothness compared to No Smoothing (e.g., Action Jitter down to 5.74, Energy down to 25.92) while maintaining a high Task Return (26.56), which is very close to the No Smoothing baseline.

    • LCP (ours): Demonstrates superior smoothness (e.g., Action Jitter: 3.21, DoF Pos Jitter: 0.17, Energy: 24.57, DoF Velocity: 10.65) compared to Smoothness Reward, with even lower jitter and energy consumption. Its Task Return (26.03) is slightly lower than Smoothness Reward but still competitive and robust. This suggests LCP provides a better trade-off between smoothness and task performance than Smoothness Rewards, especially in terms of raw smoothness metrics.

    • Low-pass Filter: While providing some smoothing, it shows worse smoothness metrics than LCP and Smoothness Reward (e.g., Action Jitter: 7.86, Energy: 32.83) and also the lowest Task Return (24.98) among the smoothing methods. This supports the claim that low-pass filters can dampen exploration, leading to sub-optimal policies.

      The following figure (Figure 5 from the original paper) displays the task returns of different smoothing methods.

      Fig. 5: Task returns of different smoothing methods. LCP provides an effective alternative to other techniques.

  • Analysis of Figure 5: The plot visually reinforces the findings from Table I(a). No Smoothing achieves the highest task return but is impractical. Smoothness Reward and LCP achieve comparable and high task returns, both significantly outperforming Low-pass Filter. This confirms that LCP offers an effective alternative that balances smoothness and task performance well.

6.1.3. Effect of the GP Coefficient (\lambda_{\mathrm{gp}})

The gradient penalty coefficient \lambda_{\mathrm{gp}} is a critical hyperparameter in LCP. The authors perform an ablation study to understand its impact.

The following are the results from Table I(b) of the original paper:

Method Action Jitter ↓ DoF Pos Jitter ↓ DoF Velocity ↓ Energy ↓ Base Acc ↓ Task Return ↑
(b) Ablation on GP Weights (λgp)
LCP w. λgp = 0.0 42.19 ± 4.72 0.41 ± 0.08 12.92 ± 0.99 42.68 ± 10.27 0.09 ± 0.01 28.87 ± 0.85
LCP w. λgp = 0.001 3.69 ± 0.31 0.21 ± 0.05 11.44 ± 1.18 27.09 ± 4.44 0.06 ± 0.01 26.32 ± 1.20
LCP w. λgp = 0.002 (ours) 3.21 ± 0.11 0.17 ± 0.01 10.65 ± 0.37 24.57 ± 1.17 0.06 ± 0.002 26.03 ± 1.51
LCP w. λgp = 0.005 2.10 ± 0.05 0.15 ± 0.01 10.44 ± 0.70 26.24 ± 3.50 0.05 ± 0.002 23.92 ± 2.05
LCP w. λgp = 0.01 0.17 ± 0.01 0.07 ± 0.00 2.75 ± 0.12 5.89 ± 0.28 0.007 ± 0.00 16.11 ± 2.76
  • Analysis of Table I(b):
    • \lambda_{\mathrm{gp}} = 0.0: This is equivalent to No Smoothing, yielding high task return but extreme jitter.

    • Increasing \lambda_{\mathrm{gp}} from 0.0 to 0.001 to 0.002: As \lambda_{\mathrm{gp}} increases, all smoothness metrics (Action Jitter, DoF Pos Jitter, Energy, DoF Velocity, Base Acc) significantly decrease, indicating progressively smoother behaviors. The Task Return remains high and relatively stable (from 28.87 down to 26.03). This range (0.001-0.002) represents an effective balance.

    • \lambda_{\mathrm{gp}} = 0.002 (ours): This value strikes a good balance, providing very smooth behaviors with strong task performance.

    • Larger \lambda_{\mathrm{gp}} (0.005 and 0.01): Further increasing \lambda_{\mathrm{gp}} leads to even smoother behaviors (e.g., Action Jitter drops to 0.17 for \lambda_{\mathrm{gp}} = 0.01, Energy to 5.89). However, this comes at a substantial cost to Task Return (dropping to 16.11 for \lambda_{\mathrm{gp}} = 0.01). This confirms that excessively penalizing gradient norms results in overly smooth and sluggish policies that struggle with task completion.

      The following figure (Figure 6 from the original paper) shows the impact of different GP weights (i.e., \lambda_{\mathrm{gp}}) on task returns.

      Fig. 6: Task returns of LCP with different \lambda_{\mathrm{gp}}. Excessively large \lambda_{\mathrm{gp}} may hinder policy learning.

  • Analysis of Figure 6: This graph illustrates the learning curves (task returns over training iterations) for different \lambda_{\mathrm{gp}} values.
    • \lambda_{\mathrm{gp}} = 0.0 (no GP) reaches the highest task return quickly, but as established, is too jittery for real robots.

    • \lambda_{\mathrm{gp}} = 0.001 and \lambda_{\mathrm{gp}} = 0.002 (the selected value) show robust learning and achieve high task returns. Learning is slightly slower than without GP but stable.

    • \lambda_{\mathrm{gp}} = 0.005 and \lambda_{\mathrm{gp}} = 0.01 significantly hinder policy learning. They either learn much more slowly or converge to substantially lower task returns. This confirms that excessively large gradient penalties can impede the policy's ability to learn the primary task.

      These experiments suggest that \lambda_{\mathrm{gp}} = 0.002 is an effective balance between policy smoothness and task performance. Like other smoothing techniques, some tuning of the GP coefficient is required.

6.1.4. Which Components of the Observation Should GP Be Applied To?

Since the policies are trained using Regularized Online Adaptation (ROA), the policy's input consists of the current observation and a history of past observations. The authors investigate whether the gradient penalty should be applied to the whole input observation or only to the current observation.

The following are the results from Table I(c) of the original paper:

Method Action Jitter ↓ DoF Pos Jitter ↓ DoF Velocity ↓ Energy ↓ Base Acc ↓ Task Return ↑
(c) Ablation on GP Inputs
LCP w. GP on whole obs (ours) 3.21 ± 0.11 0.17 ± 0.01 10.65 ± 0.37 24.57 ± 1.17 0.06 ± 0.002 26.03 ± 1.51
LCP w. GP on current obs 7.16 ± 0.60 0.35 ± 0.03 13.70 ± 1.50 35.18 ± 4.84 0.09 ± 0.005 25.44 ± 3.73
  • Analysis of Table I(c):
    • LCP w. GP on whole obs (ours): This configuration (using \lambda_{\mathrm{gp}} = 0.002) achieves excellent smoothness metrics (e.g., Action Jitter: 3.21, Energy: 24.57) and strong task return (26.03).

    • LCP w. GP on current obs: When the gradient penalty is applied only to the current observation (excluding the history), the smoothness metrics significantly degrade (e.g., Action Jitter: 7.16, Energy: 35.18, DoF Velocity: 13.70), becoming much closer to the Low-pass Filter or even No Smoothing baselines for some metrics. The task return also slightly decreases.

      This finding suggests that regularizing the policy with respect to its entire input (including the observation history) is crucial. Changes in the historical observations can still lead to non-smooth policy outputs if only the current observation is regularized. By applying the gradient penalty to the whole observation, the policy is encouraged to be Lipschitz continuous with respect to all its inputs, leading to more robust smoothness.

6.1.5. Sim-to-Sim Transfer

Before real-world deployment, the models are tested in a different simulator, MuJoCo [25], to assess their robustness to domain shifts between simulators (from IsaacGym to MuJoCo).

The following are the results from Table II of the original paper:

| Robot | Action Jitter ↓ | DoF Pos Jitter ↓ | DoF Velocity ↓ | Energy ↓ | Base Acc ↓ | Task Return ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Fourier GR1 | 1.47 ± 0.43 | 0.34 ± 0.07 | 9.54 ± 1.53 | 36.38 ± 2.97 | 0.08 ± 0.004 | 24.33 ± 1.25 |
| Unitree H1 | 0.44 ± 0.03 | 0.10 ± 0.007 | 9.12 ± 0.38 | 76.22 ± 5.81 | 0.04 ± 0.005 | 21.74 ± 1.40 |
| Berkeley Humanoid | 1.77 ± 0.32 | 0.12 ± 0.01 | 7.92 ± 0.21 | 19.99 ± 0.36 | 0.06 ± 0.00 | 26.50 ± 0.57 |
  • Analysis of Table II:
    • The table shows the performance of LCP models (trained in IsaacGym) when transferred to MuJoCo.
    • For full-sized robots like Fourier GR1 and Unitree H1, there is a slight decrease in task return compared to IsaacGym performance (e.g., Fourier GR1's Task Return is 24.33, compared to roughly 26-27 in IsaacGym for similar settings). This suggests that the domain gap between IsaacGym and MuJoCo is more significant for larger robots, possibly due to more complex dynamics or different physics engine implementations.
    • However, the smoothness metrics remain low across all robots, indicating that the LCP policies maintain their smooth characteristics even in a different simulator.
    • The Berkeley Humanoid shows a relatively high task return (26.50) with good smoothness, suggesting better sim-to-sim transfer for smaller robots or those with simpler dynamics.
    • Overall, the results instill confidence for real-world deployments, showing that LCP produces policies that are robust across simulators.
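
A sim-to-sim check of this kind can be run with a short evaluation loop. The sketch below is a minimal example assuming an exported TorchScript policy and a MuJoCo XML of the robot; the file names, the observation builder, and the direct action-to-actuator mapping are placeholders, not details from the paper's release.

```python
import mujoco
import numpy as np
import torch

model = mujoco.MjModel.from_xml_path("robot.xml")   # hypothetical robot description
data = mujoco.MjData(model)
policy = torch.jit.load("lcp_policy.pt").eval()     # hypothetical exported policy

def build_obs(data):
    # Placeholder: a real observation would mirror the training setup
    # (velocity commands, joint states, base angular velocity, history, ...).
    return np.concatenate([data.qpos, data.qvel]).astype(np.float32)

for _ in range(1000):
    obs = torch.from_numpy(build_obs(data)).unsqueeze(0)
    with torch.no_grad():
        action = policy(obs).squeeze(0).numpy()
    data.ctrl[:] = action                            # assumes actuators map one-to-one to actions
    mujoco.mj_step(model, data)
```

In practice the policy would typically output PD position targets applied at a lower control rate than the physics step; those details are omitted here for brevity.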

6.2. Real World Deployment

The authors deploy LCP models (trained with $\lambda_{\mathrm{gp}} = 0.002$) zero-shot on four distinct real-world robots (Fourier GR1T1, GR1T2, Unitree H1, and Berkeley Humanoid).

The following figure (Figure 7 from the original paper) shows snapshots of the robots' behaviors over the course of one gait cycle.

Fig. 7: Real-world deployment. LCP is able to train effective locomotion policies on a wide range of robots, which can be directly transferred to the real world.

  • Analysis of Figure 7: The image montage visually confirms that LCP can train effective locomotion policies that successfully transfer to a wide range of real-world humanoid robots. The robots are shown performing basic walking motions, demonstrating stable and coordinated movements on real terrain.

6.2.1. Performance on Different Terrains

To evaluate the robustness of the learned policies, they are applied to walk on three types of real-world terrains: smooth, soft, and rough planes. Jitter metrics are used to evaluate performance.

The following are the results from Table III of the original paper:

(a) Smooth Plane

| Robot | Action Jitter ↓ | DoF Pos Jitter ↓ | DoF Velocity ↓ |
| --- | --- | --- | --- |
| Fourier GR1 | 1.12 ± 0.16 | 0.28 ± 0.13 | 10.82 ± 1.58 |
| Unitree H1 | 1.11 ± 0.07 | 0.14 ± 0.01 | 10.95 ± 0.53 |
| Berkeley Humanoid | 1.56 ± 0.10 | 0.10 ± 0.01 | 4.99 ± 0.60 |

(b) Soft Plane

| Robot | Action Jitter ↓ | DoF Pos Jitter ↓ | DoF Velocity ↓ |
| --- | --- | --- | --- |
| Fourier GR1 | 1.18 ± 0.17 | 0.24 ± 0.09 | 10.45 ± 1.42 |
| Unitree H1 | 1.18 ± 0.09 | 0.15 ± 0.01 | 11.80 ± 0.57 |
| Berkeley Humanoid | 1.66 ± 0.03 | 0.12 ± 0.01 | 6.78 ± 1.57 |

(c) Rough Plane

| Robot | Action Jitter ↓ | DoF Pos Jitter ↓ | DoF Velocity ↓ |
| --- | --- | --- | --- |
| Fourier GR1 | 1.18 ± 0.22 | 0.26 ± 0.11 | 11.61 ± 1.64 |
| Unitree H1 | 1.20 ± 0.09 | 0.14 ± 0.01 | 11.68 ± 0.84 |
| Berkeley Humanoid | 1.63 ± 0.11 | 0.11 ± 0.01 | 5.02 ± 0.48 |
  • Analysis of Table III:
    • The jitter metrics (Action Jitter, DoF Pos Jitter, DoF Velocity) remain low and comparable across all three types of terrains (smooth, soft, rough) for all robots.

    • This demonstrates that LCP-trained policies are robust to variations in terrain properties in the real world. The policies maintain their smooth characteristics even when encountering less predictable surfaces.

    • The Berkeley Humanoid consistently shows lower DoF Velocity compared to the larger robots, possibly due to its smaller size and lower inertia requiring less aggressive joint movements.

    • The standard deviations are relatively small, indicating consistent performance across different trials and models.

      This real-world evaluation confirms that LCP effectively generalizes to unseen real-world conditions and produces robust and smooth locomotion controllers.
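
For reference, jitter-style metrics like those in Table III can be computed from logged trajectories by finite differences. The sketch below is only illustrative: the paper's exact definitions (derivative order, normalization, units) may differ, and the function names are ours.

```python
import numpy as np

def jitter(signal, dt):
    """Mean magnitude of the discrete third difference of a (T, D) signal,
    one simple proxy for jitter; the paper's exact formula may differ."""
    third_diff = np.diff(signal, n=3, axis=0) / dt**3
    return np.mean(np.linalg.norm(third_diff, axis=-1))

def mean_dof_velocity(joint_pos, dt):
    """Average joint speed estimated from logged joint positions."""
    vel = np.diff(joint_pos, axis=0) / dt
    return np.mean(np.abs(vel))
```

Applying the same functions to actions and joint positions logged on the real robot allows a direct comparison against the simulation numbers reported earlier.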

6.2.2. External Forces

The robustness of LCP policies is further tested by applying external forces to the robots in the real world. The paper refers to a supplementary video for recovery behaviors. The qualitative finding is that LCP models can robustly recover from unexpected external perturbations. This is a crucial aspect of real-world robot deployment, where unforeseen disturbances are common. The inherent smoothness likely contributes to better stability and more controlled recovery motions.

6.3. Data Presentation (Tables)

All tables from the paper (Table I(a)-(c), Table II, Table III, and Table IV from the appendix) are transcribed in the relevant sections of this analysis, using Markdown tables for simple grids and HTML tables where merged cells are needed.

6.4. Ablation Studies / Parameter Analysis

The paper includes significant ablation studies and parameter analyses:

  • Ablation on Smooth Methods (Table I(a) and Figure 5): Directly compares LCP with Smoothness Reward, Low-pass Filter, and No Smoothing. This validates LCP as a superior or competitive alternative to existing smoothing techniques regarding both smoothness and task performance.

  • Ablation on GP Weights ($\lambda_{\mathrm{gp}}$) (Table I(b) and Figure 6): Investigates the impact of the Lipschitz constraint coefficient. This is a critical analysis showing the trade-off between smoothness and task performance and helping to identify an optimal operating point for $\lambda_{\mathrm{gp}}$. It highlights that hyperparameter tuning is still necessary, even with LCP.

  • Ablation on GP Inputs (Table I(c)): Compares applying the gradient penalty to the whole observation vs. only the current observation. This demonstrates the importance of considering the entire policy input context, especially in ROA-based transfer frameworks that utilize observation histories.

    These ablation studies are comprehensive and effectively verify the critical design choices and hyperparameters of LCP, enhancing confidence in the method's effectiveness and underlying principles.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Lipschitz-Constrained Policies (LCP), a novel, simple, and general method for training Reinforcement Learning (RL) controllers to produce smooth behaviors suitable for sim-to-real transfer. LCP works by approximating a Lipschitz constraint on the policy, which is implemented as a differentiable gradient penalty term added to the RL objective function during training. Through extensive simulation and real-world experiments on a diverse set of humanoid robots (Fourier GR1T1, GR1T2, Unitree H1, Berkeley Humanoid), the authors demonstrate that LCP effectively replaces traditional, non-differentiable smoothing techniques like smoothness rewards and low-pass filters. LCP-trained policies achieve superior smoothness metrics while maintaining high task performance, generalize well across different simulators and real-world terrains, and enable robust recovery from external perturbations. This work provides a more principled and easily integrable approach to developing smooth and robust locomotion controllers for legged robots.

7.2. Limitations & Future Work

The authors acknowledge the following limitations and suggest future research directions:

  • Limited to Basic Walking Behaviors: While LCP has proven effective for real-world locomotion experiments, the current evaluation is limited to basic walking behaviors.
  • Future Work - Dynamic Skills: Evaluating LCP on more dynamic and complex skills, such as running and jumping, would further validate the method's generality and robustness for a broader range of robotic capabilities. This would test whether the Lipschitz constraint can maintain smoothness without excessively limiting the policy's expressiveness for highly dynamic actions.
  • Tuning $\lambda_{\mathrm{gp}}$: Although LCP reduces the hyperparameter tuning complexity compared to smoothness rewards, the gradient penalty coefficient $\lambda_{\mathrm{gp}}$ still requires careful tuning to find the optimal balance between smoothness and task performance, as demonstrated by the ablation studies. Automating this tuning process could be a future improvement.

7.3. Personal Insights & Critique

This paper presents an elegant and effective solution to a pervasive problem in robot reinforcement learning: jittery policies and the sim-to-real gap. The idea of grounding smoothness in Lipschitz continuity and enforcing it via a differentiable gradient penalty is a significant conceptual leap from heuristic rewards or post-hoc filtering.

Strengths and Inspirations:

  • Mathematical Elegance: The direct translation of Lipschitz continuity into a gradient penalty is mathematically sound and provides a principled approach, which is often preferred over empirical tuning.
  • Practicality and Integrability: The gradient penalty is simple to implement in existing RL frameworks due to automatic differentiation, making it highly accessible for researchers and practitioners.
  • Generalizability: Demonstrating effectiveness across multiple, distinct humanoid robot platforms is a strong indicator of the method's generality and potential for widespread adoption.
  • Improved Sim-to-Real: LCP directly addresses a critical barrier to real-world deployment, making RL-trained controllers more viable for practical applications.
  • Reduced Heuristic Tuning: While $\lambda_{\mathrm{gp}}$ still needs tuning, it's arguably less complex than balancing multiple smoothness reward terms.

Potential Issues and Areas for Improvement:

  • Tuning $\lambda_{\mathrm{gp}}$ (Continued): The ablation study clearly shows the sensitivity of task performance to $\lambda_{\mathrm{gp}}$, which still requires manual tuning for each robot or task. Future work could explore adaptive Lagrange multiplier approaches (e.g., dual gradient ascent toward a target penalty level, as used in constrained RL) or meta-learning techniques to automate the selection of $\lambda_{\mathrm{gp}}$; a speculative sketch of such a scheme appears after this list.
  • Computational Cost: Calculating gradients with respect to inputs (especially whole observations including history) might introduce a slight additional computational cost during training, though modern GPU-accelerated frameworks likely mitigate this. The paper does not explicitly discuss the computational overhead.
  • Scope of "Smoothness": While Lipschitz continuity bounds the first derivative, true physical smoothness might involve higher-order derivatives (as jitter is defined as the third derivative). The paper implicitly shows that bounding the first derivative's magnitude effectively translates to smoother higher-order derivatives as well. However, for extremely dynamic tasks, perhaps higher-order gradient penalties could be explored, although they would be more complex to compute and might lead to excessive stiffness.
  • Interpretability: While LCP provides a smooth policy, the specific emergent gait or control strategy for achieving this smoothness is not deeply analyzed. Understanding how the policy achieves Lipschitz continuity (e.g., by activating certain joints synchronously, or maintaining certain joint limits) could provide further insights into humanoid locomotion.
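
As a purely speculative sketch of what an adaptive scheme could look like (not something proposed in the paper), $\lambda_{\mathrm{gp}}$ could be treated as a dual variable and nudged toward keeping the measured penalty near a target level:

```python
def update_lambda_gp(lambda_gp, measured_gp, gp_target, step_size=1e-4):
    """Dual-ascent style update: grow the weight when the measured gradient
    penalty exceeds the target, shrink it otherwise, never going negative."""
    lambda_gp += step_size * (measured_gp - gp_target)
    return max(lambda_gp, 0.0)
```

Note that choosing `gp_target` then replaces choosing $\lambda_{\mathrm{gp}}$ directly, so this shifts rather than removes the tuning burden.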

Transferability to Other Domains: The core idea of Lipschitz-Constrained Policies is highly transferable beyond humanoid locomotion:

  • Other Robotic Systems: Quadrupeds, manipulators, drones, or any robot where smooth control actions are crucial for real-world deployment, energy efficiency, or safety.

  • General Continuous Control Tasks: Any RL task involving continuous action spaces where sudden action changes are undesirable (e.g., autonomous driving, fluid control, chemical process control).

  • Safe RL: LCP could be integrated into safe RL frameworks to ensure that policies adhere to safety constraints related to control signal smoothness and robot dynamics.

  • Imitation Learning: Combined with imitation learning, LCP could help policies imitate human motions that are inherently smooth, rather than reproducing noisy or erratic reference trajectories.

    Overall, LCP is a robust and innovative contribution that offers a compelling, differentiable alternative to existing smoothing techniques, paving the way for more reliable and generalizable robot controllers.
