Learning Sim-to-Real Humanoid Locomotion in 15 Minutes
TL;DR Summary
This paper introduces a method using off-policy RL algorithms (FastSAC and FastTD3) that enables rapid humanoid locomotion training in just 15 minutes with a single RTX 4090 GPU, addressing high dimensionality and domain randomization challenges.
Abstract
Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: https://younggyo.me/fastsac-humanoid.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Learning Sim-to-Real Humanoid Locomotion in 15 Minutes".
1.2. Authors
The authors are Younggyo Seo*, Carmelo Sferrazza*, Juyue Chen, Guanya Shi, Rocky Duan, and Pieter Abbeel, affiliated with Amazon FAR (Frontier AI & Robotics).
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, with the identifier arXiv:2512.01996. ArXiv is a widely recognized platform for disseminating research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics, allowing researchers to share their work before, or in parallel with, peer review and formal publication.
1.4. Publication Year
The paper was published on 2025-12-01 (UTC timestamp: 2025-12-01T18:55:17Z).
1.5. Abstract
This paper addresses the challenge of achieving fast and reliable sim-to-real reinforcement learning (RL) for humanoid control, which has been difficult despite advances in massively parallel simulation that have reduced training times for robots. The authors introduce a practical recipe based on off-policy RL algorithms, specifically FastSAC and FastTD3, enabling the rapid training of humanoid locomotion policies in just 15 minutes using a single RTX 4090 GPU. Their recipe stabilizes off-policy RL algorithms at a massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. They demonstrate the rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization (e.g., randomized dynamics, rough terrain, push perturbations) and fast training of whole-body human-motion tracking policies.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2512.01996, and the PDF link is https://arxiv.org/pdf/2512.01996v1.pdf. This indicates the paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the difficulty of achieving fast and reliable sim-to-real reinforcement learning (RL) for humanoid control. While massively parallel simulation has significantly reduced RL training times for robots from days to minutes, applying this to humanoid control remains challenging due to factors like high dimensionality and the need for extensive domain randomization to ensure robust real-world transfer. These challenges often push training for humanoids back into the multi-hour regime, making the iterative sim-to-real development cycle (training in simulation, deploying on hardware, correcting mismatches, and retraining) impractical and expensive. The paper's entry point is to simplify and accelerate this sim-to-real iteration process for humanoids to a matter of minutes.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- **A Simple and Practical Recipe:** Introduction of a recipe based on FastSAC and FastTD3, efficient variants of off-policy RL algorithms, that enables rapid training of humanoid locomotion policies.
- **Rapid Training Time:** Demonstration that robust humanoid locomotion policies can be trained in just 15 minutes using a single RTX 4090 GPU, even under strong domain randomization.
- **Stabilization of Off-Policy RL at Scale:** The recipe stabilizes off-policy RL algorithms at a massive scale (thousands of parallel environments) through carefully tuned design choices and minimalist reward functions.
- **Demonstration on Multiple Humanoids:** Successful application of the recipe to Unitree G1 and Booster T1 robots for end-to-end locomotion control and whole-body human-motion tracking.
- **Open-Source Implementation:** Provision of open-source code to support further research and development.

The key findings are that off-policy RL algorithms, when carefully tuned and combined with minimalist reward designs, can significantly outperform traditional on-policy methods such as PPO in terms of training speed and robustness for complex humanoid control tasks, thereby making sim-to-real iteration much more feasible.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, several foundational concepts in reinforcement learning and robotics are crucial:
- **Reinforcement Learning (RL):** A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent observes states, takes actions, receives rewards, and transitions to new states.
- **Agent:** The entity that performs actions in the environment. In this paper, the agent is the humanoid robot's controller.
- **Environment:** The setting in which the agent operates. Here, it is a simulated physical environment for humanoid robots.
- **State:** A complete description of the environment at a given time. For a humanoid, this could include joint angles, velocities, orientation, etc.
- **Action:** A decision made by the agent that affects the environment. For humanoids, these are often torque commands or desired joint positions for PD controllers.
- **Reward:** A scalar feedback signal indicating the desirability of an agent's actions. The goal of RL is to maximize the total reward.
- **Policy (π):** A mapping from states to actions, defining the agent's behavior. The goal of RL is to learn an optimal policy.
- **Sim-to-Real (Simulation-to-Real Transfer):** The process of training an RL policy in a simulated environment and then deploying it successfully on a real physical robot. This is challenging because simulations often do not perfectly capture real-world physics, leading to a sim-to-real gap.
- **Domain Randomization:** A technique used to bridge the sim-to-real gap. Instead of meticulously modeling every physical parameter, domain randomization involves randomizing various aspects of the simulation (e.g., friction, mass, sensor noise) during training. This forces the policy to learn to be robust to variations, making it more likely to perform well on a real robot whose parameters might differ from any specific simulation setting.
- **Off-policy Reinforcement Learning:** A category of RL algorithms where the policy being learned (target policy) is different from the policy used to collect data (behavior policy). This allows algorithms to reuse past experience (stored in a replay buffer) more efficiently.
  - **Soft Actor-Critic (SAC):** An off-policy actor-critic algorithm that aims to maximize both the expected reward and the entropy of the policy. Maximizing entropy encourages exploration and leads to more robust policies.
  - **Twin Delayed DDPG (TD3):** An off-policy actor-critic algorithm designed to address overestimation bias in Q-learning. It uses two critic networks (clipped double Q-learning) and delays policy updates.
- **On-policy Reinforcement Learning:** A category of RL algorithms where the policy being learned is the same as the policy used to collect data. This means that data collected with an older policy cannot be directly reused for updating a new policy, often leading to lower sample efficiency.
  - **Proximal Policy Optimization (PPO):** A popular on-policy algorithm known for its stability and good performance. It uses a clipped objective function to prevent excessively large policy updates.
- **Massively Parallel Simulation:** The use of many simulation environments running concurrently, often on GPUs, to generate large amounts of training data quickly for RL agents. This significantly reduces wall-clock training time.
- **PD Controllers (Proportional-Derivative Controllers):** A common feedback control mechanism widely used in robotics. The controller computes an error between a desired setpoint and a measured process variable (e.g., current joint angle) and adjusts the control output (e.g., motor torque) based on proportional (P) and derivative (D) terms (a short sketch follows this list).
  - P term: accounts for the current error.
  - D term: accounts for the rate of change of the error.
  - PD-gain randomization refers to varying the proportional and derivative gains during domain randomization.
- **Entropy:** In RL, entropy can be used to quantify the randomness or unpredictability of a policy. A policy with high entropy explores more, which can be beneficial for finding optimal solutions in complex environments.
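To make the PD-control background concrete, here is a minimal sketch of a joint-space PD controller of the kind used as the low-level actuator model in such pipelines. The gain values and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pd_torque(q_des, q, qd, kp=40.0, kd=2.0):
    """Joint-space PD control: torque from position error and joint velocity.

    q_des : desired joint positions (e.g., policy action added to a default pose)
    q     : measured joint positions
    qd    : measured joint velocities
    kp,kd : proportional and derivative gains (these are what PD-gain
            domain randomization perturbs during training)
    """
    return kp * (q_des - q) - kd * qd

# Example: one joint commanded to 0.3 rad while at 0.1 rad and moving at 0.5 rad/s.
tau = pd_torque(np.array([0.3]), np.array([0.1]), np.array([0.5]))
print(tau)  # kp*0.2 - kd*0.5 = 8.0 - 1.0 = 7.0
```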
3.2. Previous Works
The paper builds upon and distinguishes itself from several prior works:
- **Massively Parallel Simulation Frameworks:** The current work explicitly leverages advancements in frameworks like Isaac Gym (Makoviychuk et al., 2021), which have enabled GPU-based parallel environments to scale environment throughput to thousands. These frameworks have driven successes in various robot control tasks (Rudin et al., 2022; Agarwal et al., 2023). The paper acknowledges this trend as a key enabler for accelerating sim-to-real iterations.
- **On-policy RL for Sim-to-Real:** PPO (Schulman et al., 2017) has been the de-facto standard for sim-to-real RL due to its ease of scaling with parallel environments. However, the paper's core innovation lies in challenging this standard by showing that off-policy methods can be even faster and more effective.
- **Scaling Off-policy RL:** Recent works have started to demonstrate that off-policy RL methods can also scale effectively in large-scale training regimes (Li et al., 2023; Raffin, 2025; Shukla, 2025). The paper specifically references Seo et al. (2025), which reported the first sim-to-real deployment of humanoid control policies trained with FastTD3.
- **Humanoid Control with FastTD3:** While Seo et al. (2025) demonstrated sim-to-real deployment with FastTD3, their results were limited to humanoids with only a subset of joints. This paper extends that work by developing a recipe for full-body humanoid control using FastSAC and FastTD3.
- **Reward Design:** The paper contrasts its simple reward design with traditional heavy reward shaping approaches used in humanoid locomotion (Mittal et al., 2023; Lab, 2025), citing recent works (Zakka et al., 2025; Liao et al., 2025) that rely on simpler reward functions. The whole-body tracking rewards follow the structure of BeyondMimic (Liao et al., 2025).

3.3. Technological Evolution

The field of robotics reinforcement learning has evolved significantly. Initially, training RL policies for complex robots was extremely sample inefficient and required vast amounts of real-world interaction, which is expensive and time-consuming. The advent of highly realistic and massively parallel simulation frameworks, particularly GPU-accelerated ones, marked a turning point. These allowed for training policies in simulated environments at an unprecedented speed, drastically reducing wall-clock training time from days to minutes. This parallelization initially favored on-policy algorithms like PPO due to their straightforward scalability. However, the sim-to-real gap remained a persistent challenge, necessitating techniques like domain randomization.
This paper's work fits within this technological timeline by addressing the remaining challenge of applying rapid training to high-dimensional systems like humanoids. It demonstrates that off-policy algorithms, which are often more sample efficient, can also be effectively scaled with parallel simulations. By doing so, it pushes the boundary of how quickly robust sim-to-real humanoid control policies can be developed, making the iterative development cycle much more feasible.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- **Off-policy RL for Full-Body Humanoid Control:** While PPO has been the standard for sim-to-real due to its scalability, and previous off-policy work (Seo et al., 2025) was limited to partial humanoid control, this paper successfully scales FastSAC and FastTD3 to full-body humanoid control, encompassing all joints for locomotion and whole-body tracking.
- **Speed and Efficiency:** The paper demonstrates training of robust sim-to-real policies in an unprecedented 15 minutes on a single RTX 4090 GPU, a significant speedup compared to prior multi-hour training regimes for humanoids.
- **Stabilized FastSAC:** The work stabilizes and improves FastSAC, which previously exhibited training instabilities for humanoid control, through a carefully tuned set of hyperparameters and design choices.
- **Minimalist Reward Design:** Unlike traditional approaches that rely on complex reward shaping with many terms, this paper uses a simple, minimalist reward design with fewer than 10 terms. This simplifies hyperparameter tuning and allows for rapid iteration, while still producing robust behaviors.
- **Robustness under Strong Domain Randomization:** The recipe is shown to be effective under strong domain randomization, including randomized dynamics, rough terrain, and significant push perturbations, ensuring high sim-to-real transferability.
4. Methodology
4.1. Principles
The core idea behind the proposed method is to leverage the sample efficiency of off-policy Reinforcement Learning (RL) algorithms, specifically FastSAC and FastTD3, and combine them with massively parallel simulation and a minimalist reward design to achieve rapid sim-to-real training for humanoid control. The theoretical basis is that off-policy methods, by reusing data, can achieve faster training (wall-clock time) when scaled appropriately, especially in challenging, high-dimensional tasks where exploration is difficult and on-policy methods like PPO struggle. The minimalist reward function simplifies the hyperparameter tuning process, allowing for faster iteration and more robust policies.
4.2. Core Methodology In-depth (Layer by Layer)
The paper introduces a "simple and practical recipe" for training humanoid locomotion and whole-body tracking policies. This recipe primarily involves tuning FastSAC and FastTD3 for large-scale training and adopting a minimalist reward design.
4.2.1. FastSAC and FastTD3: Off-Policy RL for Humanoid Control
The recipe is based on off-policy RL algorithms, FastTD3 (an efficient variant of TD3) and FastSAC (an efficient variant of Soft Actor-Critic), tuned for large-scale training with massively parallel simulation. The paper explicitly states that it improves upon the prior recipe (Seo et al., 2025), particularly for FastSAC, to handle full-body humanoid control.
The general off-policy RL loop is described in the "Related Work" section with the following pseudo-code (reproduced exactly from the paper):
```
1: Initialize actor π_θ, two critics Q_φ1, Q_φ2, entropy temperature α, replay buffer B
2:
3: for each environment step do
4:     Sample a ~ π_θ(o) given the current observation o, and take action a
5:     Observe next state o′ and reward r′
6:     Store transition τ = (o, a, o′, r′) in replay buffer B ← B ∪ {τ}
7:     for j = 1 to num_updates do
8:         Sample mini-batch B = {τ_k}, k = 1..|B|, from B
```
- **Initialize actor π_θ, two critics Q_φ1, Q_φ2, entropy temperature α, replay buffer B:** This step sets up the core components of the actor-critic framework.
  - π_θ: The actor network, parameterized by θ, which outputs actions given an observation.
  - Q_φ1, Q_φ2: Two critic networks, parameterized by φ1 and φ2, which estimate the Q-value (expected cumulative reward) of a state-action pair. Using two critics helps mitigate overestimation bias in Q-learning, a technique introduced in TD3.
  - α: The entropy temperature parameter, used in SAC, which balances exploration (maximizing entropy) and exploitation (maximizing reward).
  - B: The replay buffer, which stores past experiences (transitions) for off-policy learning.
- **for each environment step do:** This loop represents the interaction with the simulation environment.
  - **Sample a ~ π_θ(o) given the current observation o, and take action a:** The actor policy is used to select an action based on the current observation.
  - **Observe next state o′ and reward r′:** The environment executes the action a, resulting in a new observation and an immediate reward.
  - **Store transition τ = (o, a, o′, r′) in replay buffer B:** The observed experience (transition) is stored in the replay buffer for later use in training.
- **for j = 1 to num_updates do:** This inner loop represents the training phase, where the actor and critic networks are updated multiple times for each environment step.
  - **Sample mini-batch from B:** A batch of past experiences is randomly sampled from the replay buffer to train the networks.
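The same loop, scaled to thousands of vectorized environments, can be sketched as follows. This is a simplified illustration under assumed interfaces (a batched `envs` object and an `agent` with `act` and `update` methods), not the authors' open-source implementation.

```python
import numpy as np

num_envs, obs_dim, act_dim = 4096, 96, 29           # assumed sizes for illustration
buffer_size, batch_size, num_updates = 1_000_000, 8192, 2

class ReplayBuffer:
    """Minimal FIFO replay buffer holding transitions from all parallel envs."""
    def __init__(self, size, obs_dim, act_dim):
        self.obs = np.zeros((size, obs_dim), np.float32)
        self.act = np.zeros((size, act_dim), np.float32)
        self.rew = np.zeros((size,), np.float32)
        self.next_obs = np.zeros((size, obs_dim), np.float32)
        self.ptr, self.full, self.size = 0, False, size

    def add(self, o, a, r, o2):                      # o has shape (num_envs, obs_dim)
        n = len(o)
        idx = (self.ptr + np.arange(n)) % self.size
        self.obs[idx], self.act[idx], self.rew[idx], self.next_obs[idx] = o, a, r, o2
        self.ptr = (self.ptr + n) % self.size
        self.full = self.full or self.ptr < n        # wrapped around at least once

    def sample(self, batch_size):
        hi = self.size if self.full else self.ptr
        idx = np.random.randint(0, hi, batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx], self.next_obs[idx]

def train(envs, agent, total_env_steps=2000):
    buf = ReplayBuffer(buffer_size, obs_dim, act_dim)
    obs = envs.reset()
    for _ in range(total_env_steps):
        act = agent.act(obs)                         # sample a ~ π_θ(o) for every env
        next_obs, rew, done, _ = envs.step(act)      # one step of all parallel envs
        buf.add(obs, act, rew, next_obs)
        obs = next_obs
        for _ in range(num_updates):                 # several gradient steps per env step
            agent.update(buf.sample(batch_size))     # critic/actor/temperature updates
```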
The paper then details specific design choices and hyperparameters that stabilize off-policy RL for full-body humanoid control:

- **Scaling Up Off-Policy RL with Massively Parallel Simulation:**
  - The authors use massively parallel simulation to train FastSAC and FastTD3.
  - Batch Size: They find that using a large batch size of up to 8K consistently improves performance.
  - Gradient Steps: Taking more gradient steps per simulation step usually leads to faster training. This highlights the sample-efficiency advantage of off-policy RL, which reuses data rather than discarding it, and is especially valuable when simulation speed is the bottleneck.
- **Joint-Limit-Aware Action Bounds:**
  - A common challenge for off-policy RL algorithms like SAC or TD3 is setting proper action bounds for their Tanh policy.
  - The proposed technique sets action bounds based on the robot's joint limits when using PD controllers. Specifically, the difference between each joint's limit and its default position is calculated and used as the action bound for that joint (see the sketch below). This reduces the need for manual tuning of action bounds.
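A minimal sketch of this joint-limit-aware bound computation, under one plausible reading with asymmetric per-direction bounds; the joint arrays, default pose, and function names are illustrative placeholders, not taken from the released code.

```python
import numpy as np

# Assumed per-joint limits and default standing pose (radians); placeholder values.
joint_lower = np.array([-2.5, -0.5, -1.8])
joint_upper = np.array([ 2.5,  2.9,  1.8])
default_pos = np.array([ 0.0,  0.4,  0.0])

# Per-joint action bounds: distance from the default pose to each joint limit.
bound_pos = joint_upper - default_pos    # allowed motion in the positive direction
bound_neg = default_pos - joint_lower    # allowed motion in the negative direction

def action_to_pd_target(a):
    """Map a Tanh policy output a in [-1, 1] to a PD position target within joint limits."""
    return default_pos + np.where(a >= 0, a * bound_pos, a * bound_neg)
```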
- **Observation and Layer Normalization:**
  - Observation Normalization: As in prior work, observation normalization is found to be helpful for training. This typically involves scaling observations to have zero mean and unit variance.
  - Layer Normalization: Unlike the prior recipe, layer normalization (Ba et al., 2016) is found to be helpful in stabilizing performance in high-dimensional tasks (see the sketch below). Layer normalization normalizes the inputs across the features within a layer, rather than across the batch, which can be more stable, especially with varying batch sizes or sequential data.
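As an illustration of the kind of critic architecture this implies, here is a small MLP with layer normalization written in PyTorch; the layer sizes, activation, and input dimensions are assumptions for the sketch, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(o, a) estimator: MLP with LayerNorm after each hidden linear layer."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))

# Two critics, as in the pseudo-code above (dimensions are assumed).
q1, q2 = Critic(96, 29), Critic(96, 29)
```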
- **Critic Learning Hyperparameters:**
  - Q-value Averaging: Using the average of Q-values from the two critics is found to improve FastSAC and FastTD3 performance, as opposed to using Clipped Double Q-learning (CDQ) (Fujimoto et al., 2018), which uses the minimum. This aligns with observations that CDQ can be harmful when used with layer normalization (Nauman et al., 2024). A sketch of the two target-value variants follows this list.
  - Discount Factor γ:
    - A low discount factor is helpful for simple velocity tracking tasks.
    - A higher discount factor is helpful for challenging whole-body tracking tasks. The discount factor determines the present value of future rewards; a higher γ means future rewards are considered more important.
  - Distributional Critic: Following prior work, a distributional critic, specifically C51 (Bellemare et al., 2017), is used. C51 predicts a distribution over possible Q-values rather than a single expected value. The paper notes that quantile regression (Dabney et al., 2018) for distributional critics is too expensive with large-batch training.
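To make the mean-Q versus clipped-double-Q distinction concrete, here is a hedged sketch of the two ways a TD target can combine the two target critics (shown with plain expected-value critics for simplicity, not the C51 distributional form used in the paper; termination masking is omitted):

```python
import torch

def td_target(reward, next_q1, next_q2, gamma, use_cdq: bool, alpha=0.0, next_logp=None):
    """Bootstrap target for critic regression.

    use_cdq=True  -> clipped double Q-learning: min of the two target critics.
    use_cdq=False -> the averaging variant reported to work better with LayerNorm.
    The optional entropy term (alpha * log-prob) applies to the SAC-style objective.
    """
    next_q = torch.min(next_q1, next_q2) if use_cdq else 0.5 * (next_q1 + next_q2)
    if next_logp is not None:
        next_q = next_q - alpha * next_logp
    return reward + gamma * next_q

# Example with dummy values:
r = torch.tensor([1.0]); q1 = torch.tensor([10.0]); q2 = torch.tensor([12.0])
print(td_target(r, q1, q2, gamma=0.99, use_cdq=True))    # uses min  -> 1 + 0.99 * 10
print(td_target(r, q1, q2, gamma=0.99, use_cdq=False))   # uses mean -> 1 + 0.99 * 11
```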
- **FastSAC: Exploration Hyperparameters:**
  - Standard Deviation Bounds: A widely used implementation of SAC bounds the standard deviation of pre-tanh actions to a fixed range. The authors find this can cause instability with a large initial temperature. Instead, they set the maximum to 1.0.
  - Initial Entropy Temperature: They initialize the temperature with a low value of 0.001 to prevent excessive exploration, which can lead to instability.
  - Auto-tuning for Maximum Entropy Learning: Using auto-tuning for maximum entropy learning (Haarnoja et al., 2018b) consistently outperforms fixed temperature values. This means the entropy temperature is dynamically adjusted during training (see the sketch below).
  - Target Entropy: The target entropy for auto-tuning is set to 0.0 for locomotion tasks and to a value that scales with the dimensionality of the action space for whole-body tracking tasks.
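The standard temperature auto-tuning update can be sketched as below; this is the generic SAC formulation (Haarnoja et al., 2018b) with assumed variable names, shown only to illustrate what "auto-tuning toward a target entropy" means.

```python
import torch

# Learn log(alpha) so that alpha stays positive; start from the low initial temperature 0.001.
log_alpha = torch.tensor(-6.9078, requires_grad=True)   # log(0.001) ≈ -6.9078
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = 0.0                                     # locomotion setting reported in the paper

def update_temperature(logp_actions: torch.Tensor) -> float:
    """One gradient step on alpha given log-probs of freshly sampled actions."""
    alpha_loss = -(log_alpha * (logp_actions.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()   # current temperature used in actor/critic losses
```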
- **FastTD3: Exploration Hyperparameters:**
  - Mixed Noise Schedule: Following prior work, a mixed noise schedule is used, where the Gaussian exploration noise standard deviation is randomly sampled from a range rather than fixed.
  - Noise Range: Low noise values are found to perform best.
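A hedged sketch of such a mixed noise schedule; the per-environment sampling granularity and the range endpoints are assumptions for illustration only.

```python
import torch

def sample_exploration_noise(actions: torch.Tensor, sigma_min: float, sigma_max: float):
    """Add Gaussian exploration noise whose std is itself sampled from [sigma_min, sigma_max].

    actions: (num_envs, act_dim) deterministic TD3 actions in [-1, 1].
    """
    num_envs = actions.shape[0]
    # One noise scale per parallel environment (assumed granularity).
    sigma = torch.empty(num_envs, 1).uniform_(sigma_min, sigma_max)
    noisy = actions + sigma * torch.randn_like(actions)
    return noisy.clamp(-1.0, 1.0)
```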
- **Optimization Hyperparameters:**
  - Optimizer: The Adam optimizer (Kingma & Ba, 2015) is used.
  - Learning Rate: A learning rate of 0.0003 is used.
  - Weight Decay: A weight decay of 0.001 is used, as the previously used value of 0.1 was found to be too strong a regularization for high-dimensional control tasks. Weight decay is a regularization technique that penalizes large weights, preventing overfitting.
  - Adam β2: Similar to Zhai et al. (2023), using a lower β2 slightly improves stability compared to the common default of 0.999. β2 is the Adam hyperparameter that controls the exponential decay rate of the second-moment (squared-gradient) estimate.
4.2.2. Simple Reward Design
The paper emphasizes a minimalist reward philosophy, aiming for fewer than 10 terms, unlike traditional reward shaping with 20+ terms. This simplifies hyperparameter tuning and promotes robust, rich behaviors.

- **Locomotion (Velocity Tracking) Rewards:**
  - Linear and angular velocity tracking rewards: Encourage the humanoid to follow the commanded x-y speed and yaw rate. These are the primary drivers for emergent locomotion.
  - Foot-height tracking term: Guides swing motion (Zakka et al., 2025; Shao et al., 2022).
  - Default-pose penalty: Avoids extreme joint configurations.
  - Feet penalties: Encourage parallel relative orientation and prevent foot crossing.
  - Per-step alive reward: Encourages the robot to remain in valid, non-fallen states.
  - Penalties for torso orientation: Keep the torso near a stable upright orientation.
  - Penalty on the action rate: Smooths control outputs.
  - Episode Termination: The episode terminates upon ground contact by the torso or other non-foot body parts.
  - Symmetry Augmentation: Used to encourage symmetric walking patterns and faster convergence (Mittal et al., 2024).
  - Curriculum: All penalties are subject to a curriculum that ramps up their weights over training as episode length increases (Lab, 2025), simplifying exploration.
- **Whole-Body Tracking Rewards:**
  - The reward structure from BeyondMimic (Liao et al., 2025) is adopted, which adheres to minimalist principles. This involves tracking goals with lightweight regularization and DeepMimic-style termination conditions (Peng et al., 2018a).
  - External Disturbances: External disturbances in the form of velocity pushes are introduced to further robustify sim-to-real performance.
5. Experimental Setup
5.1. Datasets
The paper does not use traditional datasets in the supervised learning sense. Instead, it trains Reinforcement Learning (RL) policies in simulated environments.
- Robots: The experiments are conducted on two humanoid robot models: Unitree G1 and Booster T1.
- Tasks:
  - Locomotion (Velocity Tracking): The RL policies are trained to maximize the sum of rewards by making robots achieve target linear and angular velocities while minimizing various penalties. Target velocity commands are randomly sampled every 10 seconds. With 20% probability, target velocities are set to zero to encourage learning to stand.
  - Whole-Body Tracking: The RL policies are trained to maximize tracking rewards by making robots follow human motion. Motion segments are randomly sampled for each episode.
- Training Environment:
  - All robots are trained on a mix of flat and rough terrains to stabilize walking in sim-to-real deployment.
  - Domain Randomization Techniques: Various techniques are applied to robustify sim-to-real deployment:
    - Push perturbations: External forces applied to the robot. For locomotion, Push-Strong applies perturbations every 1 to 3 seconds, while other locomotion tasks apply them every 5 to 10 seconds. For whole-body tracking, velocity pushes are used.
    - Action delay: Introduces latency in action execution.
    - PD-gain randomization: Varies the proportional and derivative gains of the PD controllers.
    - Mass randomization: Varies the mass properties of the robot.
    - Friction randomization: Varies the friction coefficients.
    - Center of mass randomization: Varies the center of mass (only for Unitree G1).
    - Joint position bias randomization: Used for whole-body tracking.
    - Body mass randomization: Used for whole-body tracking.
5.2. Evaluation Metrics
The paper primarily evaluates performance using reward values accumulated during training and sim-to-real deployment success. No explicit mathematical formulas are provided for these metrics in the main text, so their conceptual definitions based on the context are as follows:
- Linear Velocity Tracking Reward:
  - Conceptual Definition: This metric quantifies how well the robot maintains a desired linear velocity in its locomotion task. A higher value indicates closer adherence to the commanded velocity. It is a key component of the overall reward function for locomotion.
  - Mathematical Formula: Not explicitly provided in the paper. Conceptually, it would be a function that penalizes deviations from target linear velocities, for instance a negative squared error or a positive reward inversely proportional to the error magnitude.
  - Symbol Explanation: Not applicable, as a formula is not given.
- Sum of Tracking Rewards:
  - Conceptual Definition: This metric represents the total cumulative reward received by the agent in whole-body tracking tasks. It reflects how accurately the robot tracks human motion. Higher values indicate better motion-tracking performance.
  - Mathematical Formula: Not explicitly provided in the paper. Conceptually, it is the sum of rewards over an episode, where rewards are designed to penalize deviations from target joint positions, orientations, and other kinematic quantities of the human motion.
  - Symbol Explanation: Not applicable, as a formula is not given.
- Wall-Clock Time:
  - Conceptual Definition: This refers to the actual time taken (in minutes, in this paper) for the RL policy to reach a certain performance level or complete training. This metric is crucial for evaluating the practical efficiency and iteration speed of the proposed method.
5.3. Baselines
The paper compares its FastSAC and FastTD3 methods primarily against PPO (Proximal Policy Optimization).
- PPO (Proximal Policy Optimization): PPO (Schulman et al., 2017) is chosen as a representative baseline because it has been the de-facto standard on-policy algorithm for sim-to-real RL. It is widely used and is often the only supported algorithm in many RL frameworks due to its stability and ease of scaling with massively parallel environments. This makes it a strong and relevant benchmark to demonstrate the advantages of off-policy methods in terms of training speed and robustness.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that FastSAC and FastTD3 significantly accelerate the training of robust humanoid locomotion and whole-body tracking policies compared to PPO, especially under challenging domain randomization conditions.
The following figure (Figure 1 from the original paper) summarizes the results across both locomotion and whole-body tracking:
As shown in the "Summary of results" chart, FastSAC and FastTD3 achieve high performance much faster than PPO for both walking (locomotion) and dancing (whole-body tracking). Specifically, they learn robust policies in 15 minutes on a single RTX 4090 GPU for locomotion, and accelerate whole-body tracking with GPUs.
Figure 1 (summary of results): Walking and whole-body tracking performance on the G1 and T1 robots within 15 minutes. The curves show linear velocity tracking and motion tracking over training, with FastSAC and FastTD3 significantly outperforming PPO.
For locomotion (velocity tracking), Figure 3 illustrates the performance on Unitree G1 and Booster T1 robots across different terrains and with push perturbations.

该图像是图表,展示了G1和T1人形机器人在不同环境下(平面、粗糙地面及施加推力时)使用FastSAC、FastTD3和PPO算法进行速度跟踪的结果。每条曲线代表在每分钟内的线性速度,横坐标为时间(分钟),纵坐标为线性速度跟踪值。
The "Locomotion (velocity tracking) results" show that FastSAC and FastTD3 quickly enable the G1 and T1 humanoids to track velocity commands within 15 minutes, significantly outperforming PPO in terms of wall-clock time. This performance is maintained even under strong domain randomization, such as rough terrain and Push-Strong perturbations. PPO struggles particularly with these strong perturbations, while FastSAC slightly outperforms FastTD3 in several locomotion setups, hypothesized to be due to FastSAC's efficient exploration via maximum entropy learning.
For whole-body tracking, Figure 5 presents the comparative results:

该图像是图表,展示了在不同任务(舞蹈、举箱、推)中,FastSAC、FastTD3 和 PPO 的动作跟踪时间表现。数据表明,FastSAC 和 FastTD3 在大部分情况下优于 PPO,尤其在舞蹈任务中表现最为显著。
The "Whole-body tracking results" indicate that FastSAC and FastTD3 are competitive with or superior to PPO in tasks like Dance, Box Lifting, and Push. FastSAC notably outperforms FastTD3 in the Dance task, which is a longer motion, further supporting the hypothesis that maximum entropy RL aids in learning more challenging tasks due to better exploration.
The paper also demonstrates the sim-to-real deployment of FastSAC policies on real Unitree G1 humanoid hardware, as shown in Figure 6.

该图像是一个插图,展示了Unitree G1机器人在全身跟踪控制下的各种动作示例,包括舞蹈、箱子举起和推——展示了快速仿真到真实部署的能力。
These examples (Dance, Box Lifting, Push) confirm that the FastSAC policies not only train quickly in simulation but also translate effectively to real-world hardware, capable of executing long and complex motions.
6.2. Ablation Studies / Parameter Analysis
The paper includes ablation studies and parameter analysis, primarily for FastSAC, to justify the chosen design choices and hyperparameters.
Figure 2 from the original paper, titled "FastSAC: Analyses", shows the effects of various components:

该图像是图表,展示了不同因素对 Unitree G1 机器人运动控制的影响,包括(a)夹闭双 Q 学习的效果,(b)更新步数的影响,(c)归一化技术的效果,及(d)折扣因子 eta 对的影响。图中使用了单块 RTX 4090 GPU 进行实验,数据表明这些因素在快速训练中的重要性。
The analyses in Figure 2 were conducted on a Unitree G1 locomotion task with rough terrain (a-d) or a G1 whole-body tracking task with a dancing motion (e-f).

- Clipped double Q-learning (CDQ): The results suggest that using the average of Q-values (labeled as Mean-Q in the plot) performs better than CDQ (Clipped-Q) for FastSAC. This supports the design choice to move away from CDQ, aligning with observations that CDQ can be harmful when combined with layer normalization.
- Number of update steps: This plot shows the impact of taking different numbers of gradient steps per simulation step. Increasing the number of update steps (e.g., 4_update vs. 1_update) generally leads to faster training, confirming that off-policy RL benefits from more frequent updates using reused data. slow_sim indicates scenarios where simulation speed is a bottleneck, making off-policy methods more attractive.
- Normalization techniques: The analysis indicates that layer normalization (LN) is helpful for stabilizing performance in high-dimensional tasks, outperforming a scenario without layer normalization (NoLN). This justifies its inclusion in the recipe.
- Discount factor on locomotion: For simple locomotion tasks, a lower discount factor yields better performance than a higher one. This suggests that for simpler tasks, immediate rewards are more critical, and distant future rewards should be discounted more heavily.
- Discount factor on WBT (Whole-Body Tracking): Conversely, for challenging whole-body tracking tasks, a higher discount factor is more effective. This implies that for complex, long-horizon tasks, considering future rewards more heavily is beneficial.
- Number of environments on WBT: The performance significantly improves with a higher number of parallel environments (4k_envs vs. 1k_envs). This underscores the benefit of massively parallel simulation for scaling off-policy RL, especially in demanding whole-body tracking tasks.

Figure 4 from the original paper, titled "Improvement from our FastSAC recipe", explicitly shows the impact of the new recipe for FastSAC compared to its previous configuration (Seo et al., 2025).
Figure 4 (improvement from the FastSAC recipe): Reward and episode length over 15 minutes for FastSAC versus its previous configuration on the G1 and T1 robots. The top row shows G1 reward and episode length; the bottom row shows the same for T1. The tuned hyperparameters substantially improve control performance.
This figure clearly demonstrates the "Improvement from our FastSAC recipe" by comparing the new recipe's performance against the previous FastSAC configuration. The plots for the G1 and T1 robots, showing both reward and episode length, indicate that the carefully tuned hyperparameters and design choices (such as the use of layer normalization, disabling CDQ, and precise exploration and optimization hyperparameters) successfully stabilize FastSAC and significantly enhance its performance in the context of humanoid control. The new recipe achieves higher rewards and longer episode lengths much faster, resolving the training instabilities observed in prior FastSAC implementations for humanoids.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents a significant advancement in sim-to-real humanoid control by introducing a "simple and practical recipe" that enables rapid policy training. By combining efficient off-policy RL algorithms, FastSAC and FastTD3, with massively parallel simulation and a minimalist reward design, the authors successfully reduce wall-clock training time for robust humanoid locomotion policies to just 15 minutes using a single RTX 4090 GPU. The work stabilizes FastSAC and demonstrates its effectiveness alongside FastTD3 on Unitree G1 and Booster T1 robots, even under strong domain randomization and for complex whole-body tracking tasks. This closes a critical gap, making the iterative sim-to-real development cycle much more feasible for humanoid robotics.
7.2. Limitations & Future Work
The authors intentionally maintained a simple, minimalist design for their recipe, which itself suggests avenues for future work.
- Incorporating Recent Advances: They expect that incorporating recent advances in off-policy RL (D'Oro et al., 2023; Schwarzer et al., 2023; Nauman et al., 2024; Lee et al., 2024; Sukhija et al., 2025; Lee et al., 2025; Obando-Ceron et al., 2025) and humanoid learning could further improve the performance and stability of FastSAC and FastTD3.
- Further Investigation of Algorithm Differences: The paper notes that FastSAC outperforms FastTD3 in certain challenging tasks (like the Dance motion) due to better exploration, suggesting that further investigation into the performance differences between FastSAC and FastTD3 across a more diverse range of tasks is an interesting future direction.

7.3. Personal Insights & Critique

The paper's contribution is highly impactful, particularly for the robotics community. The reduction of sim-to-real iteration time for complex humanoids to just 15 minutes is a game-changer, making rapid prototyping and deployment much more accessible. The emphasis on a minimalist reward design is particularly insightful; it challenges the traditional belief that complex behaviors require equally complex reward functions, demonstrating that simpler objectives can lead to robust and rich behaviors while simplifying the entire development process. This principle could be transferred to other complex RL tasks where reward shaping has become a bottleneck.
One potential area for further exploration, beyond what the authors mention, could be the transferability of this recipe to even more diverse hardware platforms or highly unstructured environments. While domain randomization is effective, understanding its limits and how to systematically generate randomized environments that are optimally calibrated to minimize the sim-to-real gap could be a valuable extension. Additionally, a deeper theoretical analysis of why the combination of layer normalization and mean Q-value averaging works better than CDQ could provide fundamental insights for off-policy RL algorithm design. The work provides an excellent blueprint, and its open-source nature will undoubtedly foster significant follow-up research.