Paper status: completed

Learning Sim-to-Real Humanoid Locomotion in 15 Minutes

Published: 12/02/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a method using off-policy RL algorithms (FastSAC and FastTD3) that enables rapid humanoid locomotion training in just 15 minutes with a single RTX 4090 GPU, addressing high dimensionality and domain randomization challenges.

Abstract

Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: https://younggyo.me/fastsac-humanoid.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Learning Sim-to-Real Humanoid Locomotion in 15 Minutes".

1.2. Authors

The authors are Younggyo Seo*, Carmelo Sferrazza*, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel, affiliated with Amazon FAR (Frontier AI & Robotics).

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, with the identifier arXiv:2512.01996. ArXiv is a widely recognized platform for disseminating research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics, allowing researchers to share their work before, or in parallel with, peer review and formal publication.

1.4. Publication Year

The paper was published on 2025-12-01 at 18:55:17 (UTC).

1.5. Abstract

This paper addresses the challenge of achieving fast and reliable sim-to-real reinforcement learning (RL) for humanoid control, which has been difficult despite advances in massively parallel simulation that have reduced training times for robots. The authors introduce a practical recipe based on off-policy RL algorithms, specifically FastSAC and FastTD3, enabling the rapid training of humanoid locomotion policies in just 15 minutes using a single RTX 4090 GPU. Their recipe stabilizes off-policy RL algorithms at a massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. They demonstrate the rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization (e.g., randomized dynamics, rough terrain, push perturbations) and fast training of whole-body human-motion tracking policies.

The official source link is https://arxiv.org/abs/2512.01996, and the PDF link is https://arxiv.org/pdf/2512.01996v1.pdf. This indicates the paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the difficulty of achieving fast and reliable sim-to-real reinforcement learning (RL) for humanoid control. While massively parallel simulation has significantly reduced RL training times for robots from days to minutes, applying this to humanoid control remains challenging due to factors like high dimensionality and the need for extensive domain randomization to ensure robust real-world transfer. These challenges often push training for humanoids back into the multi-hour regime, making the iterative sim-to-real development cycle (training in simulation, deploying on hardware, correcting mismatches, and retraining) impractical and expensive. The paper's entry point is to simplify and accelerate this sim-to-real iteration process for humanoids to a matter of minutes.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • A Simple and Practical Recipe: Introduction of a recipe based on FastSAC and FastTD3, efficient variants of off-policy RL algorithms, that enables rapid training of humanoid locomotion policies.

  • Rapid Training Time: Demonstration that robust humanoid locomotion policies can be trained in just 15 minutes using a single RTX 4090 GPU, even under strong domain randomization.

  • Stabilization of Off-Policy RL at Scale: The recipe stabilizes off-policy RL algorithms at a massive scale (thousands of parallel environments) through carefully tuned design choices and minimalist reward functions.

  • Demonstration on Multiple Humanoids: Successful application of the recipe to Unitree G1 and Booster T1 robots for end-to-end locomotion control and whole-body human-motion tracking.

  • Open-Source Implementation: Provision of open-source code to support further research and development.

    The key findings are that off-policy RL algorithms, when carefully tuned and combined with minimalist reward designs, can significantly outperform traditional on-policy methods like PPO in terms of training speed and robustness for complex humanoid control tasks, thereby making sim-to-real iteration much more feasible.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, several foundational concepts in reinforcement learning and robotics are crucial:

  • Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent observes states, takes actions, receives rewards, and transitions to new states.
  • Agent: The entity that performs actions in the environment. In this paper, the agent is the humanoid robot's controller.
  • Environment: The setting in which the agent operates. Here, it's a simulated physical environment for humanoid robots.
  • State: A complete description of the environment at a given time. For a humanoid, this could include joint angles, velocities, orientation, etc.
  • Action: A decision made by the agent that affects the environment. For humanoids, these are often torque commands or desired joint positions for PD controllers.
  • Reward: A scalar feedback signal indicating the desirability of an agent's actions. The goal of RL is to maximize the total reward.
  • Policy ($\pi$): A mapping from states to actions, defining the agent's behavior. The goal of RL is to learn an optimal policy.
  • Sim-to-Real (Simulation-to-Real Transfer): The process of training an RL policy in a simulated environment and then deploying it successfully on a real physical robot. This is challenging because simulations often do not perfectly capture real-world physics, leading to the sim-to-real gap.
  • Domain Randomization: A technique used to bridge the sim-to-real gap. Instead of meticulously modeling every physical parameter, domain randomization involves randomizing various aspects of the simulation (e.g., friction, mass, sensor noise) during training. This forces the policy to learn to be robust to variations, making it more likely to perform well on a real robot whose parameters might differ from any specific simulation setting.
  • Off-policy Reinforcement Learning: A category of RL algorithms where the policy being learned (target policy) is different from the policy used to collect data (behavior policy). This allows algorithms to reuse past experience (stored in a replay buffer) more efficiently.
    • Soft Actor-Critic (SAC): An off-policy actor-critic algorithm that aims to maximize both the expected reward and the entropy of the policy. Maximizing entropy encourages exploration and leads to more robust policies.
    • Twin Delayed DDPG (TD3): An off-policy actor-critic algorithm designed to address overestimation bias in Q-learning. It uses two critic networks (clipped double Q-learning) and delays policy updates.
  • On-policy Reinforcement Learning: A category of RL algorithms where the policy being learned is the same as the policy used to collect data. This means that data collected with an older policy cannot be directly reused for updating a new policy, often leading to lower sample efficiency.
    • Proximal Policy Optimization (PPO): A popular on-policy algorithm known for its stability and good performance. It uses a clipped objective function to prevent excessively large policy updates.
  • Massively Parallel Simulation: The use of many simulation environments running concurrently, often on GPUs, to generate large amounts of training data quickly for RL agents. This significantly reduces wall-clock training time.
  • PD Controllers (Proportional-Derivative Controllers): A common feedback control loop mechanism widely used in robotics. It calculates an error value as the difference between a desired setpoint and a measured process variable (e.g., the current joint angle). The controller attempts to minimize the error by adjusting the control output (e.g., motor torque) based on proportional (P) and derivative (D) terms (see the sketch after this list).
    • P term: accounts for the current error.
    • D term: accounts for the rate of change of the error. PD-gain randomization refers to varying the proportional and derivative gains during domain randomization.
  • Entropy: In RL, entropy can be used to quantify the randomness or unpredictability of a policy. A policy with high entropy explores more, which can be beneficial for finding optimal solutions in complex environments.
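Because the policies discussed later output desired joint positions that a PD controller converts into torques, a tiny worked example may help. This is only a sketch of the control law; the gains and joint values below are hypothetical, not taken from the paper.

```python
def pd_torque(q_desired, q, q_dot, kp, kd):
    """PD control law: proportional term on the position error, derivative term damping the joint velocity."""
    return kp * (q_desired - q) - kd * q_dot

# Hypothetical single-joint example (illustrative values only):
kp, kd = 40.0, 1.0        # proportional and derivative gains (these are what PD-gain randomization perturbs)
q, q_dot = 0.10, 0.5      # current joint angle (rad) and joint velocity (rad/s)
q_desired = 0.25          # desired joint position, e.g., output by the policy
print(pd_torque(q_desired, q, q_dot, kp, kd))   # 40.0 * 0.15 - 1.0 * 0.5 = 5.5 (N*m)
```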

3.2. Previous Works

The paper builds upon and distinguishes itself from several prior works:

  • Massively Parallel Simulation Frameworks: The current work explicitly leverages advancements in frameworks like Isaac Gym (Makoviychuk et al., 2021), which have enabled GPU-based parallel environments to scale environment throughput to thousands. These frameworks have driven successes in various robot control tasks (Rudin et al., 2022; Agarwal et al., 2023). The paper acknowledges this trend as a key enabler for accelerating sim-to-real iterations.
  • On-policy RL for Sim-to-Real: PPO (Schulman et al., 2017) has been the de-facto standard for sim-to-real RL due to its ease of scaling with parallel environments. However, the paper's core innovation lies in challenging this standard by showing that off-policy methods can be even faster and more effective.
  • Scaling Off-policy RL: Recent works have started to demonstrate that off-policy RL methods can also scale effectively in large-scale training regimes (Li et al., 2023; Raffin, 2025; Shukla, 2025). The paper specifically references Seo et al. (2025), which reported the first sim-to-real deployment of humanoid control policies trained with FastTD3.
  • Humanoid Control with FastTD3: While Seo et al. (2025) demonstrated sim-to-real deployment with FastTD3, their results were limited to humanoids with only a subset of joints. This paper extends that work by developing a recipe for full-body humanoid control using FastSAC and FastTD3.
  • Reward Design: The paper contrasts its simple reward design with traditional heavy reward shaping approaches used in humanoid locomotion (Mittal et al., 2023; Lab, 2025), citing recent works (Zakka et al., 2025; Liao et al., 2025) that rely on simpler reward functions. The whole-body tracking rewards follow the structure of BeyondMimic (Liao et al., 2025).

3.3. Technological Evolution

The field of robotics reinforcement learning has evolved significantly. Initially, training RL policies for complex robots was extremely sample inefficient and required vast amounts of real-world interaction, which is expensive and time-consuming. The advent of highly realistic and massively parallel simulation frameworks, particularly GPU-accelerated ones, marked a turning point. These allowed for training policies in simulated environments at unprecedented speed, drastically reducing wall-clock training time from days to minutes. This parallelization initially favored on-policy algorithms like PPO due to their straightforward scalability. However, the sim-to-real gap remained a persistent challenge, necessitating techniques like domain randomization.

This paper's work fits within this technological timeline by addressing the remaining challenge of applying rapid training to high-dimensional systems like humanoids. It demonstrates that off-policy algorithms, which are often more sample efficient, can also be effectively scaled with parallel simulations. By doing so, it pushes the boundary of how quickly robust sim-to-real humanoid control policies can be developed, making the iterative development cycle much more feasible.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Off-policy RL for Full-Body Humanoid Control: While PPO has been the standard for sim-to-real due to its scalability, and previous off-policy work (Seo et al., 2025) was limited to partial humanoid control, this paper successfully scales FastSAC and FastTD3 to full-body humanoid control, encompassing all joints for locomotion and whole-body tracking.
  • Speed and Efficiency: The paper demonstrates training of robust sim-to-real policies in an unprecedented 15 minutes on a single RTX 4090 GPU, which is a significant speedup compared to prior multi-hour training regimes for humanoids.
  • Stabilized FastSAC: The work stabilizes and improves FastSAC, which previously exhibited training instabilities for humanoid control, through a carefully tuned set of hyperparameters and design choices.
  • Minimalist Reward Design: Unlike traditional approaches that rely on complex reward shaping with many terms, this paper uses a simple, minimalist reward design with fewer than 10 terms. This simplifies hyperparameter tuning and allows for rapid iteration, while still producing robust behaviors.
  • Robustness under Strong Domain Randomization: The recipe is shown to be effective under strong domain randomization, including randomized dynamics, rough terrain, and significant push perturbations, ensuring high sim-to-real transferability.

4. Methodology

4.1. Principles

The core idea behind the proposed method is to leverage the sample efficiency of off-policy Reinforcement Learning (RL) algorithms, specifically FastSAC and FastTD3, and combine them with massively parallel simulation and a minimalist reward design to achieve rapid sim-to-real training for humanoid control. The theoretical basis is that off-policy methods, by reusing data, can achieve faster training (wall-clock time) when scaled appropriately, especially in challenging, high-dimensional tasks where exploration is difficult and on-policy methods like PPO struggle. The minimalist reward function simplifies the hyperparameter tuning process, allowing for faster iteration and more robust policies.

4.2. Core Methodology In-depth (Layer by Layer)

The paper introduces a "simple and practical recipe" for training humanoid locomotion and whole-body tracking policies. This recipe primarily involves tuning FastSAC and FastTD3 for large-scale training and adopting a minimalist reward design.

4.2.1. FastSAC and FastTD3: Off-Policy RL for Humanoid Control

The recipe is based on off-policy RL algorithms, FastTD3 (an efficient variant of TD3) and FastSAC (an efficient variant of Soft Actor-Critic), tuned for large-scale training with massively parallel simulation. The paper explicitly states that it improves upon the Seo et al. (2025) recipe, particularly for FastSAC, to handle full-body humanoid control.

The general off-policy RL loop is described in the "Related Work" section with the following pseudo-code (reproduced from the paper):

1: Initialize actor $\pi_\theta$, two critics $Q_{\phi_1}$, $Q_{\phi_2}$, entropy temperature $\alpha$, replay buffer $\mathcal{B}$
2:
3: for each environment step do
4:     Sample $a \sim \pi_\theta(o)$ given the current observation $o$, and take action $a$
5:     Observe next state $o'$ and reward $r'$
6:     Store transition $\tau = (o, a, o', r')$ in replay buffer $\mathcal{B} \leftarrow \mathcal{B} \cup \{\tau\}$
7:     for $j = 1$ to num_updates do
8:         Sample mini-batch $B = \{\tau_k\}_{k=1}^{|B|}$ from $\mathcal{B}$
  • Initialize actor $\pi_\theta$, two critics $Q_{\phi_1}$, $Q_{\phi_2}$, entropy temperature $\alpha$, replay buffer $\mathcal{B}$: This step sets up the core components of the actor-critic framework.

    • $\pi_\theta$: The actor network, parameterized by $\theta$, which outputs actions given an observation.
    • $Q_{\phi_1}$, $Q_{\phi_2}$: Two critic networks, parameterized by $\phi_1$ and $\phi_2$, which estimate the Q-value (expected cumulative reward) of a state-action pair. Using two critics helps mitigate overestimation bias in Q-learning, a technique introduced in TD3.
    • $\alpha$: The entropy temperature parameter, used in SAC, which balances exploration (maximizing entropy) and exploitation (maximizing reward).
    • $\mathcal{B}$: The replay buffer, which stores past experiences (transitions) for off-policy learning.
  • for each environment step do: This loop represents the interaction with the simulation environment.

    • Sample $a \sim \pi_\theta(o)$ given the current observation $o$, and take action $a$: The actor policy is used to select an action $a$ based on the current observation $o$.
    • Observe next state $o'$ and reward $r'$: The environment executes the action $a$, resulting in a new observation $o'$ and an immediate reward $r'$.
    • Store transition $\tau = (o, a, o', r')$ in replay buffer $\mathcal{B} \leftarrow \mathcal{B} \cup \{\tau\}$: The observed experience (transition) is stored in the replay buffer for later use in training.
  • for $j = 1$ to num_updates do: This inner loop represents the training phase, where the actor and critic networks are updated multiple times for each environment step.

    • Sample mini-batch $B = \{\tau_k\}_{k=1}^{|B|}$ from $\mathcal{B}$: A batch of past experiences is randomly sampled from the replay buffer to train the networks.
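To make the structure of this loop concrete, here is a minimal, self-contained Python sketch under illustrative assumptions: the toy environment, policy, and replay buffer are placeholders that only mimic the data flow (parallel rollout into a replay buffer, followed by several gradient updates per simulation step). It is not the paper's released implementation.

```python
# Minimal sketch of the parallel off-policy loop; the environment, policy, and update are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
num_envs, obs_dim, act_dim = 1024, 48, 12         # illustrative sizes, not the paper's exact dimensions
batch_size, num_updates, total_steps = 8192, 2, 100

class ReplayBuffer:
    def __init__(self, capacity):
        self.data, self.capacity = [], capacity
    def add(self, transitions):                   # append one transition per parallel environment
        self.data.extend(transitions)
        self.data = self.data[-self.capacity:]
    def sample(self, n):
        idx = rng.integers(0, len(self.data), size=n)
        return [self.data[i] for i in idx]

def policy(obs):                                  # stand-in for the tanh-squashed actor pi_theta(o)
    return np.tanh(rng.normal(size=(obs.shape[0], act_dim)))

def env_step(obs, act):                           # stand-in for the massively parallel simulator
    next_obs = obs + 0.01 * rng.normal(size=obs.shape)
    reward = -np.linalg.norm(act, axis=-1)        # placeholder reward
    return next_obs, reward

buffer = ReplayBuffer(capacity=1_000_000)
obs = rng.normal(size=(num_envs, obs_dim))
for step in range(total_steps):
    act = policy(obs)                             # sample a ~ pi_theta(o)
    next_obs, reward = env_step(obs, act)         # observe o' and r'
    buffer.add(list(zip(obs, act, reward, next_obs)))
    obs = next_obs
    for _ in range(num_updates):                  # several gradient steps per simulation step
        batch = buffer.sample(batch_size)         # reusing past data is the off-policy advantage
        # ... FastSAC / FastTD3 actor-critic updates on `batch` would go here ...
```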

The paper then details specific design choices and hyperparameters that stabilize off-policy RL for full-body humanoid control:

  • Scaling Up Off-Policy RL with Massively Parallel Simulation:

    • The authors use massively parallel simulation to train FastSAC and FastTD3.
    • Batch Size: They find that using a large batch size up to 8K consistently improves performance.
    • Gradient Steps: Taking more gradient steps per simulation step usually leads to faster training. This highlights the sample-efficiency advantage of off-policy RL, as it reuses data rather than discarding it, which is crucial when simulation speed is a bottleneck.
  • Joint-Limit-Aware Action Bounds:

    • A common challenge for off-policy RL algorithms like SAC or TD3 is setting proper action bounds for their tanh policy.
    • The proposed technique sets action bounds based on the robot's joint limits when using PD controllers. Specifically, the difference between each joint's limit and its default position is calculated and used as the action bound for that joint. This reduces the need for manual tuning of action bounds (a consolidated sketch of these bounds and the hyperparameters below follows this list).
  • Observation and Layer Normalization:

    • Observation Normalization: Similar to Seo et al. (2025), observation normalization is found to be helpful for training. This typically involves scaling observations to have zero mean and unit variance.
    • Layer Normalization: Unlike Seo et al. (2025), layer normalization (Ba et al., 2016) is found to be helpful in stabilizing performance in high-dimensional tasks. Layer normalization normalizes the inputs across the features within a layer, rather than across the batch, which can be more stable, especially with varying batch sizes or sequential data.
  • Critic Learning Hyperparameters:

    • Q-value Averaging: Using the average of Q-values from the two critics is found to improve FastSAC and FastTD3 performance, as opposed to using Clipped double Q-learning (CDQ) (Fujimoto et al., 2018), which uses the minimum. This aligns with observations that CDQ can be harmful when used with layer normalization (Nauman et al., 2024).
    • Discount Factor $\gamma$:
      • A low discount factor $\gamma = 0.97$ is helpful for simple velocity tracking tasks.
      • A higher discount factor $\gamma = 0.99$ is helpful for challenging whole-body tracking tasks. The discount factor determines the present value of future rewards; a higher $\gamma$ means future rewards are considered more important.
    • Distributional Critic: Following prior work, a distributional critic, specifically C51 (Bellemare et al., 2017), is used. C51 predicts a distribution over possible Q-values rather than a single expected value. The paper notes that quantile regression (Dabney et al., 2018) for distributional critics is too expensive with large batch training.
  • FastSAC: Exploration Hyperparameters:

    • Standard Deviation $\sigma$ Bounds: A widely used implementation of SAC bounds the standard deviation $\sigma$ of pre-tanh actions to $e^2$. The authors find this can cause instability with a large initial temperature $\alpha$. Instead, they set the maximum $\sigma$ to 1.0.
    • Initial Entropy Temperature $\alpha$: They initialize $\alpha$ with a low value of 0.001 to prevent excessive exploration, which can lead to instability.
    • Auto-tuning for Maximum Entropy Learning: Using auto-tuning for maximum entropy learning (Haarnoja et al., 2018b) consistently outperforms fixed $\alpha$ values. This means the entropy temperature $\alpha$ is dynamically adjusted during training.
    • Target Entropy: The target entropy for auto-tuning is set to 0.0 for locomotion tasks and $-|\mathcal{A}|/2$ for whole-body tracking tasks, where $|\mathcal{A}|$ is the dimensionality of the action space.
  • FastTD3: Exploration Hyperparameters:

    • Mixed Noise Schedule: Following Li et al. (2023) and Seo et al. (2025), a mixed noise schedule is used, where the Gaussian noise standard deviation is randomly sampled from a range $[\sigma_{\min}, \sigma_{\max}]$.
    • Noise Range: Low values, specifically $(\sigma_{\min}, \sigma_{\max}) = (0.01, 0.05)$, are found to perform best.
  • Optimization Hyperparameters:

    • Optimizer: Adam optimizer (Kingma & Ba, 2015) is used.
    • Learning Rate: A learning rate of 0.0003 is used.
    • Weight Decay: A weight decay of 0.001 is used, as the 0.1 value from Seo et al. (2025) was found to be too strong a regularization for high-dimensional control tasks. Weight decay is a regularization technique that penalizes large weights, preventing overfitting.
    • Adam $\beta_2$: Similar to Zhai et al. (2023), using a lower $\beta_2 = 0.95$ slightly improves stability compared to the common $\beta_2 = 0.99$. $\beta_2$ is the Adam hyperparameter that controls the exponential decay rate of the second-moment (squared-gradient) estimate.
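To summarize the design choices above in one place, the following sketch collects the reported hyperparameter values in an illustrative dictionary and shows one plausible reading of the joint-limit-aware action bounds. The key names and the helper function are assumptions for illustration, not the paper's configuration schema.

```python
# Consolidated, illustrative view of the reported hyperparameters plus one plausible
# implementation of joint-limit-aware action bounds. Names are assumptions, not the paper's API.
import numpy as np

fastsac_recipe = {
    "batch_size": 8192,                  # large batches (up to 8K) consistently improve performance
    "learning_rate": 3e-4,               # Adam
    "adam_beta2": 0.95,                  # lower than the common 0.99 for slightly better stability
    "weight_decay": 1e-3,                # 0.1 was too strong for high-dimensional control
    "gamma_locomotion": 0.97,            # low discount for velocity tracking
    "gamma_whole_body_tracking": 0.99,   # higher discount for whole-body tracking
    "init_entropy_temperature": 0.001,   # small initial alpha, then auto-tuned
    "max_policy_std": 1.0,               # cap on the pre-tanh standard deviation (instead of e^2)
    "use_clipped_double_q": False,       # average the two critics' Q-values instead of taking the min
    "layer_norm": True,
    "critic": "C51 distributional critic",
}

def joint_limit_action_bounds(joint_lower, joint_upper, default_pose):
    """One plausible reading of the joint-limit-aware bounds: use the distance from each joint's
    default position to its nearer limit as a symmetric per-joint action bound, so a tanh policy
    scaled by this bound cannot command positions outside the joint limits."""
    joint_lower = np.asarray(joint_lower)
    joint_upper = np.asarray(joint_upper)
    default_pose = np.asarray(default_pose)
    return np.minimum(default_pose - joint_lower, joint_upper - default_pose)
```

Under this reading, a tanh actor emits actions in [-1, 1], which are scaled element-wise by these bounds and added to the default pose to obtain PD position targets that respect the joint limits.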

4.2.2. Simple Reward Design

The paper emphasizes a minimalist reward philosophy, aiming for fewer than 10 terms, unlike traditional reward shaping with 20+ terms. This simplifies hyperparameter tuning and promotes robust, rich behaviors.

  • Locomotion (Velocity Tracking) Rewards:

    • Linear and angular velocity tracking rewards: Encourage the humanoid to followcommanded x-y speed and yaw rate. These are the primary drivers for emergent locomotion.
    • Foot-height tracking term: Guides swing motion (Zakka et al., 2025; Shao et al., 2022).
    • Default-pose penalty: Avoids extreme joint configurations.
    • Feet penalties: Encourages parallel relative orientation and prevents foot crossing.
    • Per-step alive reward: Encourages the robot to remain in valid, non-fallen states.
    • Penalties for torso orientation: Keeps the torso near a stable upright orientation.
    • Penalty on the action rate: Smooths control outputs.
    • Episode Termination: The episode terminates upon ground contact by the torso or other non-foot body parts.
    • Symmetry Augmentation: Used to encourage symmetric walking patterns and faster convergence (Mittal et al., 2024).
    • Curriculum: All penalties are subject to a curriculum that ramps up their weights over training as episode length increases (Lab, 2025), simplifying exploration.
  • Whole-Body Tracking Rewards:

    • The reward structure from BeyondMimic (Liao et al., 2025) is adopted, which adheres to minimalist principles. This involves tracking goals with lightweight regularization and DeepMimic-style termination conditions (Peng et al., 2018a).
    • External Disturbances: External disturbances in the form of velocity pushes are introduced to further robustify sim-to-real performance.
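As an illustration of the minimalist philosophy, here is a hedged sketch of a locomotion reward with fewer than ten terms, mirroring the list above. The exponential tracking kernels, term weights, and state fields are hypothetical choices, not the paper's exact formulation.

```python
# Hedged sketch of a minimalist locomotion reward (< 10 terms); all constants are hypothetical.
import numpy as np

def locomotion_reward(state, cmd, action, prev_action):
    r = {}
    # Tracking terms: exponential kernels on the command-following error (a common formulation).
    r["lin_vel_tracking"] = np.exp(-np.sum((cmd["lin_vel_xy"] - state["lin_vel_xy"]) ** 2) / 0.25)
    r["ang_vel_tracking"] = np.exp(-(cmd["yaw_rate"] - state["yaw_rate"]) ** 2 / 0.25)
    r["foot_height"] = np.exp(-np.sum((state["swing_foot_height"] - cmd["foot_height"]) ** 2) / 0.01)
    # Regularization penalties (their weights would be ramped up by the curriculum during training).
    r["default_pose"] = -np.sum((state["joint_pos"] - state["default_joint_pos"]) ** 2)
    r["torso_orient"] = -np.sum(state["torso_gravity_xy"] ** 2)   # deviation from upright
    r["action_rate"] = -np.sum((action - prev_action) ** 2)       # smooths control outputs
    r["alive"] = 1.0                                              # per-step alive bonus
    # Hypothetical weights; the paper's list also includes feet orientation/crossing penalties.
    w = {"lin_vel_tracking": 1.0, "ang_vel_tracking": 0.5, "foot_height": 0.2,
         "default_pose": 0.05, "torso_orient": 0.5, "action_rate": 0.01, "alive": 0.2}
    return sum(w[k] * r[k] for k in r)
```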

5. Experimental Setup

5.1. Datasets

The paper does not use traditional datasets in the supervised learning sense. Instead, it trains Reinforcement Learning (RL) policies in simulated environments.

  • Robots: The experiments are conducted on two humanoid robot models: Unitree G1 and Booster T1.
  • Tasks:
    • Locomotion (Velocity Tracking): The RL policies are trained to maximize the sum of rewards by making robots achieve target linear and angular velocities while minimizing various penalties. Target velocity commands are randomly sampled every 10 seconds. With 20% probability, target velocities are set to zero to encourage learning to stand.
    • Whole-Body Tracking: The RL policies are trained to maximize tracking rewards by making robots follow human motion. Motion segments are randomly sampled for each episode.
  • Training Environment:
    • All robots are trained on a mix of flat and rough terrains to stabilize walking in sim-to-real deployment.
    • Domain Randomization Techniques: Various techniques are applied to robustify sim-to-real deployment:
      • Push perturbations: External forces applied to the robot. For locomotion, Push-Strong applies perturbations every 1 to 3 seconds, while other locomotion tasks apply them every 5 to 10 seconds. For whole-body tracking, velocity pushes are used (see the sketch after this list).
      • Action delay: Introduces latency in action execution.
      • PD-gain randomization: Varies the proportional and derivative gains of the PD controllers.
      • Mass randomization: Varies the mass properties of the robot.
      • Friction randomization: Varies the friction coefficients.
      • Center of mass randomization: Varies the center of mass (only for Unitree G1).
      • Joint position bias randomization: Used for whole-body tracking.
      • Body mass randomization: Used for whole-body tracking.
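The push-perturbation schedule is one of the few randomization settings with concrete numbers in the text (every 1 to 3 seconds for Push-Strong, every 5 to 10 seconds otherwise, with a 20-second maximum episode length reported in the Figure 3 caption). The sketch below shows one way such a schedule could be sampled; modeling the push as an instantaneous base-velocity change and the 1.0 m/s magnitude are assumptions, not values from the paper.

```python
# Sketch of a push-perturbation schedule; intervals come from the text, magnitudes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sample_push_schedule(episode_length_s=20.0, strong=False, max_push_speed=1.0):
    """Return (time, xy velocity delta) pairs for one episode.
    Push-Strong pushes every 1-3 s; other locomotion tasks push every 5-10 s."""
    lo, hi = (1.0, 3.0) if strong else (5.0, 10.0)
    t, pushes = 0.0, []
    while True:
        t += rng.uniform(lo, hi)
        if t >= episode_length_s:
            break
        delta_v = rng.uniform(-max_push_speed, max_push_speed, size=2)  # assumed x-y velocity push
        pushes.append((t, delta_v))
    return pushes

print(sample_push_schedule(strong=True))   # several (time, velocity) pushes in a 20 s Push-Strong episode
```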

5.2. Evaluation Metrics

The paper primarily evaluates performance using reward values accumulated during training and sim-to-real deployment success. No explicit mathematical formulas are provided for these metrics in the main text, so their conceptual definitions based on the context are as follows:

  • Linear Velocity Tracking Reward:

    • Conceptual Definition: This metric quantifies how well the robot maintains a desired linear velocity in its locomotion task. A higher value indicates closer adherence to the commanded velocity. It is a key component of the overall reward function for locomotion.
    • Mathematical Formula: (Not explicitly provided in the paper. Conceptually, it would be a function that penalizes deviations from target linear velocities, e.g., a negative squared error or a positive reward that decays with the error magnitude; an illustrative form is given after this list.)
    • Symbol Explanation: Not applicable as a formula is not given.
  • Sum of Tracking Rewards:

    • Conceptual Definition: This metric represents the total cumulative reward received by the agent in whole-body tracking tasks. It reflects how accurately the robot tracks human motion. Higher values indicate better motion tracking performance.
    • Mathematical Formula: (Not explicitly provided in the paper. Conceptually, it is the sum of rewards over an episode, where rewards are designed to penalize deviations from target joint positions, orientations, and other kinematic quantities of the human motion.)
    • **Symbol Explanation:** Not applicable as a formula is not given.
  • Wall-Clock Time:

    • Conceptual Definition: This refers to the actual time taken (in minutes, in this paper) for the RL policy to reach a certain performance level or complete training. This metric is crucial for evaluating the practical efficiency and iteration speed of the proposed method.
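Although the paper does not spell out these formulas, velocity-tracking rewards in this literature commonly use an exponential kernel on the tracking error; the following is an illustrative form, not the paper's definition:

$$r_{\text{lin}} = \exp\!\left(-\frac{\lVert v^{\mathrm{cmd}}_{xy} - v_{xy} \rVert^2}{\sigma^2}\right)$$

where $v^{\mathrm{cmd}}_{xy}$ is the commanded planar velocity, $v_{xy}$ is the measured base velocity, and $\sigma$ is a tracking-width hyperparameter; the reward approaches 1 as the tracking error goes to 0. The sum of tracking rewards is then simply these per-step terms accumulated over an episode.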

5.3. Baselines

The paper compares its FastSAC and FastTD3 methods primarily against PPO (Proximal Policy Optimization).

  • PPO (Proximal Policy Optimization): PPO (Schulman et al., 2017) is chosen as a representative baseline because it has been the de-facto standard on-policy algorithm for sim-to-real RL. It is widely used and is often the only supported algorithm in many RL frameworks due to its stability and ease of scaling with massively parallel environments. This makes it a strong and relevant benchmark to demonstrate the advantages of off-policy methods in terms of training speed and robustness.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that FastSAC and FastTD3 significantly accelerate the training of robust humanoid locomotion and whole-body tracking policies compared to PPO, especially under challenging domain randomization conditions.

The following figure (Figure 1 from the original paper) summarizes the results across both locomotion and whole-body tracking:

Figure 1: Summary of results. We introduce a simple recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that learns robust humanoid locomotion policies in 15 minutes on a single RTX 4090 GPU, with strong domain randomization including randomized dynamics, rough terrain, and push perturbations. We also show that our recipe based on off-policy RL algorithms is scalable and accelerates the training of whole-body tracking policies: trained with 4 × L40S GPUs and 16384 parallel environments, FastSAC and FastTD3 learn to complete the full sequence of dancing motion much faster than PPO under the same condition. For sim-to-real deployment with Unitree G1 and Booster T1, we used the checkpoints saved at the points marked with a star.

As shown in the "Summary of results" chart, FastSAC and FastTD3 achieve high performance much faster than PPO for both walking (locomotion) and dancing (whole-body tracking). Specifically, they learn robust policies in 15 minutes on a single RTX 4090 GPU for locomotion, and accelerate whole-body tracking with 4 × L40S GPUs. (The figure plots linear velocity tracking for locomotion and motion tracking for whole-body control on the G1 and T1 robots, with FastSAC and FastTD3 clearly outperforming PPO over the course of training.)

For locomotion (velocity tracking), Figure 3 illustrates the performance on Unitree G1 and Booster T1 robots across different terrains and with push perturbations.

Figure 3: Locomotion (velocity tracking) results. FastSAC and FastTD3 enable fast training of G1 and T1 humanoid locomotion policies with strong domain randomization such as rough terrain or Push-Strong that applies push perturbations to humanoid robots every 1 to 3 seconds (max episode length is 20 seconds). For non-Push-Strong tasks, we apply push perturbations every 5 to 10 seconds. We use a single RTX 4090 GPU for all locomotion experiments.
(The figure shows velocity-tracking results for the G1 and T1 humanoids on flat terrain, rough terrain, and under push perturbations, using FastSAC, FastTD3, and PPO. Each curve plots the linear velocity tracking reward against wall-clock time in minutes.)

The "Locomotion (velocity tracking) results" show that FastSAC and FastTD3 quickly enable G1 and T1 humanoids to track velocity commands within 15 minutes, significantly outperforming PPO in terms of wall-clock time. This performance is maintained even under strong domain randomization, such as rough terrain and Push-Strong perturbations. PPO struggles particularly with these strong perturbations, while FastSAC slightly outperforms FastTD3 in several locomotion setups, hypothesized to be due to FastSAC's efficient exploration via maximum entropy learning.

For whole-body tracking, Figure 5 presents the comparative results:

Figure 5: Whole-body tracking results. We show that FastSAC and FastTD3 are competitive or superior to PPO in whole-body motion tracking tasks. See Figure 6 for the sim-to-real deployment of FastSAC policies to real hardware. We use 4 × L40S GPUs for all whole-body tracking experiments.

(The figure compares motion-tracking performance of FastSAC, FastTD3, and PPO across the Dance, Box Lifting, and Push tasks; FastSAC and FastTD3 outperform PPO in most cases, most notably on the Dance task.)

The "Whole-body tracking results" indicate that FastSAC and FastTD3 are competitive with or superior to PPO in tasks like Dance, Box Lifting, and Push. FastSAC notably outperforms FastTD3 in the Dance task, which is a longer motion, further supporting the hypothesis that maximum entropy RL aids in learning more challenging tasks due to better exploration.

The paper also demonstrates the sim-to-real deployment of FastSAC policies on real Unitree G1 humanoid hardware, as shown in Figure 6.

Figure 6: Whole-body tracking examples. We demonstrate the sim-to-real deployment of whole-body tracking controllers for Unitree G1 trained with FastSAC (Top: Dance, Middle: Box Lifting, Bottom: Push). Videos are available at https://younggyo.me/fastsac-humanoid.

(The figure shows the Unitree G1 robot executing whole-body tracking motions, including dancing, box lifting, and pushing, demonstrating rapid sim-to-real deployment.)

These examples (Dance, Box Lifting, Push) confirm that the FastSAC policies not only train quickly in simulation but also translate effectively to real-world hardware, capable of executing long and complex motions.

6.2. Ablation Studies / Parameter Analysis

The paper includes ablation studies and parameter analysis, primarily for FastSAC, to justify the chosen design choices and hyperparameters.

Figure 2 from the original paper, titled "FastSAC: Analyses", shows the effects of various components:

Figure 2: FastSAC: Analyses. We investigate the effect of (a) Clipped double Q-learning, (b) number of update steps, (c) normalization techniques, and (d) discount factor $\gamma$ on a Unitree G1 locomotion task with rough terrain. We further investigate the effect of (e) discount factor $\gamma$ and (f) number of environments on a G1 whole-body tracking (WBT) task with a dancing motion. We use a single RTX 4090 GPU for locomotion experiments (a-d) and 4 × L40S GPUs for whole-body tracking (e-f).

(The figure shows how these factors affect Unitree G1 control: (a) clipped double Q-learning, (b) the number of update steps, (c) normalization techniques, and (d) the discount factor $\gamma$, run on a single RTX 4090 GPU, underscoring the importance of these choices for fast training.)

The analyses in Figure 2 were conducted on a Unitree G1 locomotion task with rough terrain (a-d) or a G1 whole-body tracking task with a dancing motion (e-f).

  • (a) Clipped double Q-learning (CDQ): The results suggest that using the average of Q-values (labeled Mean-Q in the plot) performs better than CDQ (Clipped-Q) for FastSAC. This supports the design choice to move away from CDQ, aligning with observations that CDQ can be detrimental when combined with layer normalization.

  • (b) Number of update steps: This plot shows the impact of taking different numbers of gradient steps per simulation step. Increasing the number of update steps (e.g., 4_update vs. 1_update) generally leads to faster training, confirming that off-policy RL benefits from more frequent updates using reused data. slow_sim indicates scenarios where simulation speed is a bottleneck, making off-policy methods more attractive.

  • (c) Normalization techniques: The analysis indicates that layer normalization (LN) is helpful for stabilizing performance in high-dimensional tasks, outperforming a scenario without layer normalization (NoLN). This justifies its inclusion in the recipe.

  • (d) Discount factor $\gamma$ on locomotion: For simple locomotion tasks, a lower discount factor of $\gamma = 0.97$ (gamma_0.97) yields better performance than $\gamma = 0.99$ (gamma_0.99). This suggests that for simpler tasks, immediate rewards are more critical, and distant future rewards should be discounted more heavily.

  • (e) Discount factor $\gamma$ on WBT (Whole-Body Tracking): Conversely, for challenging whole-body tracking tasks, a higher discount factor of $\gamma = 0.99$ (gamma_0.99) is more effective than $\gamma = 0.97$ (gamma_0.97). This implies that for complex, long-horizon tasks, considering future rewards more heavily is beneficial.

  • (f) Number of environments on WBT: The performance significantly improves with a higher number of parallel environments (4k_envs vs. 1k_envs). This underscores the benefit of massively parallel simulation for scaling off-policy RL, especially in demanding whole-body tracking tasks.

Figure 4 from the original paper, titled "Improvement from our FastSAC recipe", explicitly shows the impact of the new recipe for FastSAC compared to its previous configuration (Seo et al., 2025).

Figure 4: Improvement from our FastSAC recipe. While a version of FastSAC was previously considered as a baseline to FastTD3 (Seo et al., 2025) in the context of humanoid control, a straightforward implementation of FastSAC exhibited training instabilities. In this work, we have stabilized and improved FastSAC with a carefully tuned set of hyperparameters and design choices.

(The figure plots reward and episode length over 15 minutes of training for FastSAC versus its previous configuration: the top row shows reward and episode length for G1, the bottom row for T1. With the tuned hyperparameters, FastSAC substantially improves both metrics.)

This figure clearly demonstrates the "Improvement from our FastSAC recipe" by comparing the new recipe's performance against the previous FastSAC configuration. The plots for G1 and T1 robots, showing both reward and episode length, indicate that the carefully tuned hyperparameters and design choices (such as the use of layer normalization, disabling CDQ, and precise exploration and optimization hyperparameters) successfully stabilize FastSAC and significantly enhance its performance in the context of humanoid control. The new recipe achieves higher rewards and longer episode lengths much faster, resolving the training instabilities observed in prior FastSAC implementations for humanoids.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a significant advancement in sim-to-real humanoid control by introducing a "simple and practical recipe" that enables rapid policy training. By combining efficient off-policy RL algorithms, FastSAC and FastTD3, with massively parallel simulation and a minimalist reward design, the authors successfully reduce wall-clock training time for robust humanoid locomotion policies to just 15 minutes using a single RTX 4090 GPU. The work stabilizes FastSAC and demonstrates its effectiveness alongside FastTD3 on Unitree G1 and Booster T1 robots, even under strong domain randomization and for complex whole-body tracking tasks. This closes a critical gap, making the iterative sim-to-real development cycle much more feasible for humanoid robotics.

7.2. Limitations & Future Work

The authors intentionally maintained a simple, minimalist design for their recipe, which itself suggests avenues for future work.

  • Incorporating Recent Advances: They expect that incorporating recent advances in off-policy RL (D'Oro et al., 2023; Schwarzer et al., 2023; Nauman et al., 2024; Lee et al., 2024; Sukhija et al., 2025; Lee et al., 2025; Obando-Ceron et al., 2025) and humanoid learning could further improve the performance and stability of FastSAC and FastTD3.

  • Further Investigation of Algorithm Differences: The paper notes that FastSAC outperforms FastTD3 in certain challenging tasks (like the Dance motion) due to better exploration, suggesting that further investigation into the performance differences between FastSAC and FastTD3 across a more diverse range of tasks is an interesting future direction.

7.3. Personal Insights & Critique

The paper's contribution is highly impactful, particularly for the robotics community. The reduction of sim-to-real iteration time for complex humanoids to just 15 minutes is a game-changer, making rapid prototyping and deployment much more accessible. The emphasis on a minimalist reward design is particularly insightful; it challenges the traditional belief that complex behaviors require equally complex reward functions, demonstrating that simpler objectives can lead to robust and rich behaviors while simplifying the entire development process. This principle could be transferred to other complex RL tasks where reward shaping has become a bottleneck.

One potential area for further exploration, beyond what the authors mention, could be the transferability of this recipe to even more diverse hardware platforms or highly unstructured environments. While domain randomization is effective, understanding its limits and how to systematically generate randomized environments that are optimally calibrated to minimize the sim-to-real gap could be a valuable extension. Additionally, a deeper theoretical analysis of why the combination of layer normalization and mean Q-value averaging works better than CDQ could provide fundamental insights for off-policy RL algorithm design. The work provides an excellent blueprint, and its open-source nature will undoubtedly foster significant follow-up research.
