DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion
TL;DR Summary
DreamControl introduces a novel method for learning whole-body skills in humanoid robots by combining diffusion models and reinforcement learning, enabling complex tasks and facilitating sim-to-real transfer.
Abstract
We introduce DreamControl, a novel methodology for learning autonomous whole-body humanoid skills. DreamControl leverages the strengths of diffusion models and Reinforcement Learning (RL): our core innovation is the use of a diffusion prior trained on human motion data, which subsequently guides an RL policy in simulation to complete specific tasks of interest (e.g., opening a drawer or picking up an object). We demonstrate that this human motion-informed prior allows RL to discover solutions unattainable by direct RL, and that diffusion models inherently promote natural looking motions, aiding in sim-to-real transfer. We validate DreamControl's effectiveness on a Unitree G1 robot across a diverse set of challenging tasks involving simultaneous lower and upper body control and object interaction. Project website at https://genrobo.github.io/DreamControl/
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion. It focuses on developing a novel method for learning autonomous whole-body skills for humanoid robots to interact with their environment.
1.2. Authors
The authors of the paper are Ovij Kalaria, Sudarshan Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, S. Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, and Jonathan Huang. Their affiliations are:
- 1: Microsoft
- 2: Georgia Institute of Technology
- 3: University of California, Berkeley
1.3. Journal/Conference
The paper is listed as an arXiv preprint (arXiv:2509.14353v3). This indicates it had not yet undergone formal peer review or publication in a journal or conference proceedings at the time of its release, though arXiv preprints are often precursors to publication in reputable venues. The Published at (UTC) timestamp, 2025-09-17T18:35:43.000Z, corresponds to the posting of this preprint version.
1.4. Publication Year
The publication year, based on the Published at (UTC) timestamp, is 2025.
1.5. Abstract
DreamControl is presented as a novel methodology for learning autonomous whole-body humanoid skills. The core innovation lies in combining diffusion models with Reinforcement Learning (RL). Specifically, a diffusion prior is trained on human motion data, which then guides an RL policy in simulation to accomplish specific tasks like opening a drawer or picking up an object. The authors demonstrate that this human motion-informed prior enables RL to discover solutions that are otherwise unattainable by direct RL approaches. Furthermore, diffusion models are shown to inherently promote natural-looking motions, which is beneficial for sim-to-real transfer. The effectiveness of DreamControl is validated on a Unitree G1 robot across a diverse set of challenging tasks requiring simultaneous lower and upper body control and object interaction.
1.6. Original Source Link
- Official Source Link: https://arxiv.org/abs/2509.14353v3 (arXiv preprint)
- PDF Link: https://arxiv.org/pdf/2509.14353v3.pdf (arXiv preprint)

The paper is currently a preprint on arXiv, meaning it has been publicly shared but has not yet undergone formal peer review and official publication by a journal or conference.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling humanoid robots to perform autonomous whole-body skills for scene interaction. This means moving beyond impressive but pre-programmed demonstrations (like dancing or kung-fu) to become truly universal assistants capable of dynamically interacting with their environment—e.g., picking up objects, opening drawers, or pressing buttons, utilizing their full humanoid form factor's mobility and range of motion.
This problem is crucial for the advancement of robotics because it unlocks a wider range of applications for humanoids in unstructured environments, transforming them from exhibition pieces to practical tools. However, several significant challenges exist in prior research:
- Whole-body Loco-Manipulation Complexity: These tasks require control at multiple timescales. Short-horizon control is needed to maintain stability and balance (due to high degrees of freedom (DoF), underactuation, and a high center of mass), while long-horizon motion planning (spanning tens of seconds) is required for intricate tasks such as grasping distant objects or coordinating both arms for bimanual manipulation.
- RL Exploration Problem: Directly applying Reinforcement Learning (RL) to long-horizon, high-dimensional bimanual manipulation tasks often fails or produces unnatural behaviors that do not generalize well to the real world, owing to the vast action space and sparse rewards.
- Data Scarcity: Modern deep learning approaches, especially those using imitation learning (IL) or diffusion policies, rely heavily on large datasets. However, teleoperation data for whole-body humanoid control is extremely labor-intensive and costly to collect, leading to a "100,000-year data gap" in robotics. Existing solutions often simplify the problem by fixing the lower body, separating upper and lower body control, or focusing solely on computer graphics applications.

The paper's entry point is to leverage the abundance of human motion data (far more accessible than robot teleoperation data) by training a diffusion prior on it. This prior then guides an RL policy in simulation, implicitly providing human-inspired motion plans that overcome the RL exploration problem and promote natural-looking motions, thereby aiding sim-to-real transfer.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel Two-Stage Methodology (DreamControl): Introducing a new approach that synergistically combines diffusion models and Reinforcement Learning for learning autonomous whole-body humanoid skills.
- Human Motion-Informed Diffusion Prior: Utilizing a diffusion prior (specifically OmniControl [12]) trained on human motion data to generate human-like motion plans that can be flexibly conditioned on text and spatiotemporal guidance. This significantly reduces reliance on expensive robot teleoperation data.
- Enhanced RL Exploration and Solution Discovery: Demonstrating that this human motion-informed prior provides RL with dense guidance, enabling it to discover robust and natural-looking solutions for challenging loco-manipulation tasks that are otherwise unattainable by direct RL or sparse-reward-only methods.
- Promotion of Natural Motions and Sim-to-Real Transfer: Showing that the diffusion prior inherently promotes natural and smoother robot movements, which helps bridge the sim-to-real gap and results in less "robotic" behaviors.
- Validation on Real Humanoid Hardware: Successfully validating DreamControl on a Unitree G1 robot across a diverse set of 11 challenging tasks, including object interaction, bimanual manipulation, and combined lower and upper body control, with successful sim-to-real deployment.
- Comprehensive Evaluation: Providing extensive simulation results against several baselines, human-ness comparisons using metrics such as Fréchet Inception Distance (FID) and jerk, and user studies that quantitatively and qualitatively demonstrate the superiority of DreamControl.

The key conclusion is that by pre-generating human-like motion plans and using them to guide RL, DreamControl can effectively tackle the long-horizon, high-dimensional challenges of whole-body humanoid control. This approach mitigates the data scarcity problem by leveraging abundant human data, leads to more natural and generalizable motions, and ultimately enables robust skill transfer to real robots. It thereby enables humanoids to perform complex, interactive tasks autonomously and naturally, a significant gap left by prior RL-based or imitation-learning-based methods that struggle with exploration or data requirements.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DreamControl, a reader should be familiar with the following fundamental concepts:
- Humanoid Robotics:
  - Degrees of Freedom (DoF): The number of independent parameters that define the configuration or state of a robot. A Unitree G1 robot has 27 DoF for its body joints and 6 DoF for each of its dexterous hands. More DoF means greater dexterity but also increased control complexity.
  - Whole-Body Loco-Manipulation: Tasks in which a humanoid robot must coordinate its locomotion (movement of the base and legs) and manipulation (movement of arms and hands) simultaneously to interact with the environment. Examples include bending down to pick an object while maintaining balance, or bracing against a surface to open a heavy door.
  - Sim-to-Real Transfer (Sim2Real): Training a robot control policy in a simulated environment and then deploying it successfully on a real physical robot. This is challenging due to discrepancies between simulation (e.g., simplified physics, perfect sensors) and reality (e.g., real-world friction, sensor noise, latency). Techniques like domain randomization and asymmetric actor-critic are often used to bridge this gap.
- Reinforcement Learning (RL):
  - Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, observing the state of the environment, taking an action, and receiving a reward and a new state.
  - Policy: The learned strategy or function that maps observed states of the environment to actions the agent should take. In deep RL, this is often represented by a neural network.
  - Reward: A scalar signal given by the environment to the agent, indicating the desirability of its actions. Dense rewards provide continuous feedback, while sparse rewards are given only upon task completion or specific events, making exploration harder.
  - Exploration-Exploitation Dilemma: A fundamental challenge in RL where the agent must balance exploring new, potentially better actions with exploiting known good actions to maximize immediate reward. Long-horizon and high-dimensional problems exacerbate the exploration challenge.
  - Proximal Policy Optimization (PPO) [70]: A popular on-policy RL algorithm that stabilizes and improves policy training by taking multiple small gradient steps and avoiding large, destabilizing updates, typically via a clipped objective function.
- Generative Models:
  - Generative Models: A class of machine learning models that learn to generate new data samples resembling the training data. Examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models.
  - Diffusion Models: Generative models that learn to reverse a gradual diffusion process (adding noise to data) in order to generate data from noise. They have shown great success in generating high-quality images, videos, and, more recently, motion.
  - Diffusion Prior: In DreamControl, a diffusion model (specifically OmniControl) trained on human motion data. It learns the distribution of natural human movements and can generate diverse motion sequences that serve as "priors" or reference trajectories for the RL policy.
  - Text-to-Motion Generation: The ability of a diffusion model to generate motion sequences from textual prompts (e.g., "pick up the bottle").
  - Spatiotemporal Guidance: Conditioning motion generation not just on text but also on specific spatial positions of joints at particular times (e.g., "wrist at point X at time T"), allowing fine-grained control over the generated motion so that it interacts with specific objects or scene elements.
- Imitation Learning (IL): A machine learning paradigm where an agent learns a policy by observing demonstrations of a task (e.g., human teleoperation data). The goal is to imitate the demonstrated behavior. While effective, it relies heavily on the availability and quality of demonstration data, which is a bottleneck for whole-body humanoid control.
- SMPL Parameterization [66]: A widely used statistical 3D model of the human body that represents various body shapes and poses with a compact set of parameters. Human motion datasets are often represented using SMPL parameters.
- Kinematic Optimization / Inverse Kinematics (IK):
  - Kinematics: The study of motion without considering the forces that cause it. Forward kinematics computes the position and orientation of a robot's end-effectors (e.g., hands, feet) given the joint angles.
  - Inverse Kinematics (IK): The inverse problem: computing the joint angles required for a robot's end-effectors to reach a desired position and orientation. IK is often solved via optimization.
  - Retargeting: Transferring a motion from one character or body (e.g., a human represented by SMPL) to another (e.g., a humanoid robot with a different kinematic structure), typically by solving an IK or optimization problem that maps keypoints or joint angles.
3.2. Previous Works
The paper organizes related work into three main strands: robot manipulation, RL controllers for legged robots, and character animation and motion models.
3.2.1. Recent Advances in Manipulation
- Imitation Learning (IL) with Diffusion Policies: Many modern deep learning approaches to robot manipulation are based on imitation learning [13]-[15]. Recent advancements leverage diffusion models [16]-[18] or flow matching [18] for policy parameterization [9], [10], [19]-[23]. These approaches aim to scale like Large Language Models (LLMs) but face the challenge of data scarcity for robot trajectories, which are costly to collect via teleoperation rigs [11].
  - Diffusion Policy [9]: A key influence, Diffusion Policy learns visuomotor policies by modeling the distribution of actions as a diffusion process, allowing it to generate long, consistent temporal data and handle multimodal action distributions. The core idea is to treat policy learning as a denoising process, in which noisy actions are iteratively refined towards a desired action distribution.
- On-Policy RL Approaches: Some methods use on-policy RL in simulation for scalability [4], [24], though sim-to-real transfer remains challenging.
  - Lin et al. [4]: Demonstrated robust bimanual manipulation skills on a humanoid robot using on-policy RL, but did not address whole-body skills and primarily focused on behavior cloning from teleoperated trajectories. DreamControl differs by using a diffusion prior over human motion to inform RL, reducing the need for extensive reward engineering.
3.2.2. RL Controllers for Legged Robots
- Legged Locomotion: Deep RL has significantly advanced legged robot control, starting with robust locomotion policies for quadrupeds [25]-[27] and extending to bipedal and humanoid form factors [28]-[34].
- Whole-Body Motion Tracking & Teleoperation: More recent works focus on tracking human teleoperator motions [1], [2], [35]-[42], including agile and extreme motions like KungFuBot [3] and ASAP [43]. These approaches enable a robot to mimic a human's movements.
- Autonomous Skill Execution: Beyond tracking, enabling fully autonomous execution of specific tasks (e.g., kicking, sitting) is a major challenge [5], [43], [45]-[50].
  - HumanPlus [48] and AMO [5]: Demonstrated whole-body autonomous task execution but required teleoperated trajectories for imitation learning.
  - R2S2 [50]: Focused on training a limited set of "primitive" skills and ensembling them, whereas DreamControl provides a recipe for training such primitive skills.
  - BeyondMimic [49]: Also leverages guided diffusion and RL, but its diffusion guidance is "coarse" and does not account for object interaction or long-range planning, making it orthogonal to DreamControl's fine-grained guidance.
3.2.3. Character Animation and Motion Models
- Physically Realistic Character Animation: A related literature exists on modeling humanoid movement in physically realistic character animation [8], [51]-[58]. Solving problems in this simplified synthetic setting, with access to privileged simulation states and no sim-to-real distribution shifts, has served as a stepping stone.
- Statistical Priors over Human Motion: This field has a rich history [59]-[61] and now leverages generative AI, including diffusion models and autoregressive transformers [6], [7], [12], [62], [63].
  - OmniGrasp [8]: Leveraged a human motion prior (PULSE [55]) in the form of a bottleneck VAE to directly predict actions, but it was noted as somewhat awkward to interpret as a prior on trajectories.
  - CloSd [7]: Generated motion plans via diffusion and used an RL-trained policy for execution in simulation. DreamControl extends this by employing richer, fine-grained guidance to handle a wider variety of tasks and by addressing sim2real aspects (e.g., removing explicit dependence on reference trajectories during policy rollouts).
  - OmniControl [12]: The specific diffusion model DreamControl builds upon. OmniControl is a diffusion transformer that can be flexibly conditioned on both text (e.g., "open the drawer") and spatiotemporal guidance (e.g., enforcing a wrist position at a specific time), allowing precise control over the generated human motions.
3.3. Technological Evolution
The field of humanoid robot control has evolved from focusing on basic locomotion and motion tracking (mimicking pre-defined or teleoperated movements) to tackling more complex autonomous interaction with diverse environments. Early efforts established robust legged locomotion for quadrupeds and bipeds using RL. Subsequently, teleoperation and motion tracking systems enabled humanoids to mimic human movements, but still relied on external input. The next frontier, whole-body loco-manipulation, demands autonomous planning and execution of tasks involving both movement and interaction. This paper fits into this evolution by addressing the data scarcity and exploration challenges of autonomous whole-body control. Generative AI (especially diffusion models), initially popular in computer graphics and character animation, is now being integrated into robotics to provide human-inspired motion priors, effectively bridging the gap between abundant human motion data and the specific needs of robot control. DreamControl marks a significant step towards general-purpose humanoid robots by enabling them to learn complex, natural, and autonomous skills without extensive robot-specific demonstrations.
3.4. Differentiation Analysis
Compared to the main methods in related work, DreamControl offers several core differences and innovations:
- Data Source & Efficiency:
  - Prior Work (IL-based): Many whole-body autonomous task execution methods (e.g., HumanPlus [48], AMO [5]) rely on teleoperated trajectories for imitation learning, which is data-intensive, costly, and hard to scale.
  - DreamControl: Eliminates the need for robot teleoperation data for initial motion generation. Instead, it leverages the far more abundant human motion data to train a diffusion prior, making the approach much more data-efficient for robots.
- Motion Generation & Guidance:
  - Prior Work (Direct RL / Coarse Guidance): Direct RL often struggles with exploration and can produce unnatural or suboptimal motions [8]. Some diffusion-based approaches (e.g., BeyondMimic [49]) use coarse guidance for diffusion policies, which may not suffice for complex object interaction or long-range planning.
  - DreamControl: Utilizes OmniControl [12], a diffusion prior with fine-grained spatiotemporal guidance and text conditioning. This allows precise control over generated trajectories, enabling them to seamlessly connect with specific environment objects (e.g., guiding a wrist to an object's location at a specific time). This rich guidance is crucial for solving interactive tasks.
- Role of Motion Prior:
  - Prior Work (Tracking/IL): Many methods aim to directly track reference trajectories at runtime, or use them for imitation learning.
  - DreamControl: Uses the diffusion prior to generate reference trajectories during training, which then implicitly guide the RL policy through the reward signal. The policy does not explicitly rely on these reference trajectories at test time, enabling fully autonomous task execution. This "Dream" aspect (the policy learns from, but does not strictly track, a dreamed motion) is a key conceptual difference.
- Naturalness and Sim-to-Real:
  - Prior Work (Direct RL): Can lead to unnatural or jerky motions that generalize poorly to the real world.
  - DreamControl: By drawing from human motion data, the diffusion prior inherently promotes natural-looking, less robotic movements that typically avoid extreme joint configurations, contributing significantly to sim-to-real transferability and human-robot interaction.
- Scope of Skills:
  - Prior Work: Often simplifies tasks (e.g., fixed lower body, separate body-part training) or focuses on specific aspects (e.g., locomotion, upper-body manipulation).
  - DreamControl: Tackles challenging whole-body loco-manipulation tasks requiring simultaneous lower and upper body control and object interaction, validating its versatility across a broad range of skills.

In essence, DreamControl innovates by creating a scalable, data-efficient pipeline that leverages the strengths of generative models for natural motion planning and reinforcement learning for robust, task-specific control, without the heavy burden of robot demonstration data that plagues many imitation learning approaches.
4. Methodology
4.1. Principles
The core idea of DreamControl is a two-stage methodology that combines the strengths of diffusion models for human-inspired motion planning and Reinforcement Learning (RL) for robust task execution. The theoretical basis is that diffusion models, trained on abundant human motion data, can generate natural-looking, long-horizon kinematic trajectories that are difficult for RL to discover through exploration alone. These generated trajectories then serve as implicit guidance (through the reward signal) for an RL policy trained in simulation to learn to complete specific tasks while maintaining human-like motion quality. This approach addresses the challenges of RL exploration in high-dimensional spaces and the data scarcity for robot teleoperation.
The overall pipeline is summarized in Fig. 2 (Image 2 from the original paper), which depicts:
- Stage 1: Trajectory Generation from Human Motion Prior: A human motion prior (a diffusion model like OmniControl) takes text commands and spatiotemporal guidance and generates human-like motion trajectories, which are then retargeted to the robot's form factor and filtered/refined.
- Stage 2: RL with Reference Trajectory: An RL policy is trained in simulation using the generated reference trajectories (and synthesized scenes) to provide dense tracking rewards, alongside sparse task-specific rewards for task completion. This policy can then be adapted for real-world deployment.
Fig. 2: Schematic of the DreamControl framework, integrating the motion diffusion model with reinforcement learning. The left side shows the generation of reference trajectories, the middle shows the RL policy trained on those reference trajectories, and the right shows the perception model and real-world deployment, involving RGB images and depth data.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Stage 1: Generating Reference Trajectory from a Human Motion Prior
This stage focuses on creating human-like motion plans without requiring expensive humanoid teleoperation data.
4.2.1.1. Leveraging Human Motion Data
- Rationale: Human motion data is widely available (e.g., motion capture datasets, video datasets), allowing high-quality priors to be learned for a multitude of tasks. Generating realistic human-like motions also facilitates sim-to-real transfer and enables more natural human-robot interaction.
- Diffusion Transformer: A diffusion transformer (OmniControl [12]) is chosen due to its success in modeling human motion and robot manipulation trajectories, its favorable scaling properties with large datasets, and its robustness in low data regimes [65].
4.2.1.2. OmniControl for Trajectory Generation
OmniControl [12] is a motion diffusion model that generates trajectories conditioned on:
- Text commands: Natural language descriptions of the task (e.g., "Pick up the bottle"). Refer to Table IX in the Appendix for the prompts used for each task in simulation and Table X for real robot deployment.
- Spatiotemporal guidance: Specifies that a joint or subset of joints should reach a prespecified spatial location at a prespecified time. This is crucial for linking the generated trajectory to object interaction points in the simulation. The spatial control signal is a tensor of per-frame, per-joint 3D targets (one entry per time-step and per SMPL joint); a spatial point is functional only if it is not set to the null placeholder value (a construction sketch follows this list).
  - Example (Pick task): The guidance might stipulate a wrist position at a certain time to approach an object. The Appendix provides detailed spatial guidance definitions for each task (e.g., Pick, Precise Punch, Press Button, Jump, Sit, Bimanual Pick, Pick from Ground, Pick and Place). For instance, for the Pick task, a target point is sampled and the spatial control signal for the wrist is set to that point for a duration around the interaction time; additional elbow targets are used to encourage a side grasp.
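To make the spatiotemporal guidance concrete, here is a minimal sketch of building such a control tensor. The shapes, the "unset" sentinel value, the joint index, and the hold window are assumptions for illustration, not the exact OmniControl interface.

```python
import numpy as np

# Illustrative spatial control signal of shape (T, J, 3):
# T time-steps, J SMPL joints, 3D target per entry.
T, J = 196, 22                     # assumed frame count and joint count
UNSET = np.zeros(3)                # placeholder meaning "no spatial constraint"

def make_spatial_guidance(joint_idx, target_xyz, t_interact, hold=10):
    """Hold a 3D target for one joint around the interaction time;
    all other entries remain unconstrained."""
    c = np.tile(UNSET, (T, J, 1))
    lo, hi = max(0, t_interact - hold), min(T, t_interact + hold)
    c[lo:hi, joint_idx] = target_xyz
    return c

guidance = make_spatial_guidance(joint_idx=21,                    # e.g., right wrist
                                 target_xyz=np.array([0.4, -0.1, 0.9]),
                                 t_interact=120)
```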
4.2.1.3. Post-Retargeting and Trajectory Filtering
- Retargeting: OmniControl generates human trajectories using the SMPL parameterization [66]. These are retargeted to the Unitree G1 robot's form factor using PyRoki [67] by solving an optimization problem that minimizes:
  - Relative keypoint positions.
  - Relative angles.
  - A scale factor to adjust for link length differences.
  - Additional residuals for physical plausibility: feet contact costs, self-collision costs, and foot orientation costs.
- Trajectory Filtering: Some generated trajectories may not be dynamically feasible or might collide with the scene. Task-specific filtering mechanisms are applied based on heuristics (see Table XI; a minimal filtering sketch follows this list):
  - Reject trajectories whose torso angle with the x-axis exceeds a task-specific threshold β_torso, preventing excessive bending or turning.
  - Reject trajectories whose pelvis height falls below a task-specific threshold β_pelvis, preventing unnecessary squatting.
  - Reject trajectories in which any body part collides with the scene.
  - The thresholds β_torso and β_pelvis vary by task (e.g., Pick: π/4 and 0.6; Precise Kick: π/2 and 0.5, per Table XI).
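The sketch below applies the three rejection heuristics; the trajectory is represented here as a dict of per-frame arrays, whose field names are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of the Table XI filtering heuristics.
def keep_trajectory(traj, beta_torso=np.pi / 4, beta_pelvis=0.6):
    if np.any(traj["torso_angle_x"] > beta_torso):   # excessive bending/turning
        return False
    if np.any(traj["pelvis_height"] < beta_pelvis):  # unnecessary squatting
        return False
    if np.any(traj["scene_collision"]):              # any body part hits the scene
        return False
    return True

example = {"torso_angle_x": np.array([0.1, 0.3]),
           "pelvis_height": np.array([0.75, 0.72]),
           "scene_collision": np.array([False, False])}
assert keep_trajectory(example)
```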
4.2.1.4. Trajectory Refinements
- Initialization Alignment: Generated trajectories may not start from a consistent pose. To address this for RL training, a refinement step prepends frames: the first 10 frames are held static at a fixed default joint pose and root pose, and the next 10 frames interpolate to the start pose of the generated trajectory, giving 216 frames in total.
- Non-functional Arm Disablement: To avoid unnecessary movement, non-functional arms are set to default joint angles (as per Table VI) for the entire trajectory duration. Table VII specifies which arm group is refined for each task (e.g., Pick refines G_left arm).
- Special Case for the Pick Task: Because the G1 robot is shorter than the SMPL model, Pick trajectories often involve the robot's hand passing through the platform. An additional optimization problem is solved via gradient descent to minimally modify the reference trajectory so that the right hand avoids collision with the platform.
  - The objective preserves the smoothness of the motion while ensuring collision avoidance: the step-wise displacement of the right hand in the modified trajectory is kept as close as possible to that of the original reference, i.e., an objective of the form $ \min_{q_{1:T}} \sum_{t=1}^{T-1} \left\| \left( p_{t+1}^{\mathrm{rh}} - p_{t}^{\mathrm{rh}} \right) - \left( \hat{p}_{t+1}^{\mathrm{rh}} - \hat{p}_{t}^{\mathrm{rh}} \right) \right\|^{2} $ subject to the right hand remaining collision-free. Where:
    - $q_{1:T}$ are the joint angles to be optimized.
    - $t$ is the time step and $T$ is the trajectory length.
    - $\hat{p}_{t}^{\mathrm{rh}}$ is the position of the right hand in the original reference trajectory.
    - $p_{t}^{\mathrm{rh}}$ is the position of the right hand in the modified trajectory.
    - $d(\cdot)$ is a function that maps 3D points to their closest distance from free space (so a positive value implies collision); it is used to enforce collision avoidance.
    - Minimizing the difference in step-wise hand movement between the original and modified trajectories preserves motion smoothness.
4.2.1.5. Trajectory Representation
Each reference trajectory is a sequence of target frames sampled at a fixed rate. Where:
- $\Delta t = 0.05\,s$ is the time step.
- $T = 196$ frames ($216$ after refinement) is the trajectory length, spanning 9.8 s.

Each frame contains:
- The position of the root (pelvis).
- The orientation of the root (represented as a quaternion).
- The target joint angles for the robot's body.
- The left and right hand states (0 for open, 1 for closed), manually labeled for each task; Table VII details the open/closed hand states for each task.

A critical parameter is $t_g$, the time at which the task-specific goal interaction occurs (e.g., object pickup time). This is used for scene synthesis.
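For concreteness, here is an illustrative container for one reference frame and a trajectory built from it. The field names, dtypes, and dimensions mirror the description above but are assumptions, not the paper's exact data schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ReferenceFrame:
    root_pos: np.ndarray      # (3,) pelvis position
    root_quat: np.ndarray     # (4,) pelvis orientation quaternion
    joint_angles: np.ndarray  # (27,) target body joint angles
    left_hand: int            # 0 = open, 1 = closed
    right_hand: int           # 0 = open, 1 = closed

DT = 0.05                     # seconds between frames
frame = ReferenceFrame(np.zeros(3), np.array([1.0, 0, 0, 0]), np.zeros(27), 0, 1)
trajectory = [frame] * 196    # ~9.8 s reference trajectory before refinement
```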
4.2.1.6. Out-of-Distribution Tasks (IK-based Optimization)
For novel tasks not well-represented in the OmniControl training data (e.g., pulling drawers), DreamControl employs a workaround:
- Instead of OmniControl, it generates a base trajectory (e.g., standing idle or squatting) and then uses IK-based optimization to align the wrist to a target trajectory defined for the specific task.
- For Open Drawer and Open Door, target wrist trajectories are defined with piecewise functions combining linear and quadratic interpolation segments, plus a circular motion for door opening.
- An optimization problem is solved via gradient descent to align the wrist to the target trajectory by adjusting the joint angles $q$ (a toy gradient-descent IK sketch follows this list): the loss $ \mathcal{L}(q) = \left\| \mathrm{FK}^{\mathrm{wrist}}(q) - p^{\mathrm{wrist}}_{\mathrm{target}} \right\|^{2} $ is minimized with updates $ q \leftarrow q - \alpha \nabla_q \mathcal{L}(q) $. Where:
  - $\mathcal{L}$ is the loss function.
  - $\mathrm{FK}^{\mathrm{wrist}}(\cdot)$ is the forward kinematics function for the wrist.
  - $p^{\mathrm{wrist}}_{\mathrm{target}}$ is the desired wrist position; the wrist position of the original re-targeted trajectory serves as the starting point.
  - $q$ are the joint angles to be optimized.
  - $\alpha$ is the learning rate for gradient descent.
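The following toy example shows IK-by-gradient-descent on a planar 2-link arm; the link lengths and the finite-difference gradients are stand-ins for the G1 kinematic chain and an analytic Jacobian, and are purely illustrative.

```python
import numpy as np

L1, L2 = 0.3, 0.25                       # assumed link lengths (meters)

def fk_wrist(q):
    """Planar forward kinematics of a 2-joint arm: returns wrist (x, y)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def ik_align(q0, target, lr=0.1, steps=200, eps=1e-4):
    """Minimize ||FK(q) - target||^2 with numerical gradients."""
    q = q0.astype(float).copy()
    for _ in range(steps):
        base = np.sum((fk_wrist(q) - target) ** 2)
        grad = np.zeros_like(q)
        for i in range(len(q)):                  # finite-difference gradient
            dq = np.zeros_like(q)
            dq[i] = eps
            grad[i] = (np.sum((fk_wrist(q + dq) - target) ** 2) - base) / eps
        q -= lr * grad                           # gradient-descent update
    return q

q_star = ik_align(np.array([0.2, 0.4]), target=np.array([0.35, 0.25]))
```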
4.2.2. Stage 2: RL with Reference Trajectory
Once the reference trajectories are generated, an RL policy is trained in simulation to follow these trajectories while completing the task.
4.2.2.1. Scene Synthesis
- For each generated kinematic trajectory, a corresponding scene is synthesized in simulation.
- Given the task-specific interaction time $t_g$ from Stage 1, the object of interest (e.g., pick object, button) is placed at a specific location relative to the robot's interacting body part (e.g., wrist): the object pose in the world frame is obtained by composing the pose of the relevant robot body-part link (e.g., the right wrist link for the Pick task) in the world frame at time $t_g$ with a constant offset specifying where the object should sit relative to that link (a placement sketch follows this list).
- Randomization: To promote robustness, the interaction timestamp, target positions, object mass, and friction are randomized within defined ranges (see Table XII in the Appendix).
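A minimal placement sketch, treating poses as 4x4 homogeneous transforms; the wrist pose and offset values below are made up for illustration.

```python
import numpy as np

def place_object(T_body_world_at_tg: np.ndarray, T_offset: np.ndarray) -> np.ndarray:
    """Object pose in world frame = body-part pose at t_g composed with a fixed offset."""
    return T_body_world_at_tg @ T_offset

T_wrist = np.eye(4)
T_wrist[:3, 3] = [0.45, -0.10, 0.90]     # wrist pose at t_g (illustrative)
T_off = np.eye(4)
T_off[:3, 3] = [0.05, 0.00, 0.00]        # assumed grasp offset
T_object_world = place_object(T_wrist, T_off)
```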
4.2.2.2. Action Space
- The simulated robot is a 27-DoF Unitree G1 equipped with two 7-DoF DEX 3-1 hands (sim) or Inspire hands (real).
- Hand control is restricted to discrete open/closed configurations per task (e.g., extending the right index finger for the button press). Table VII details the specific joint angles for the open and closed hand states of each task.
- The action space consists of:
  - Target joint positions for the body, $q^{\mathrm{target}}_t$.
  - Scalar controls for the left and right hands (negative for open, positive for closed).
- PD Control: The target joint angles are converted to joint torques using a Proportional-Derivative (PD) controller: $ \tau_t = K_p \left( q^{\mathrm{target}}_t - q_t \right) - K_d \, \dot{q}_t $ Where:
  - $\tau_t$ is the torque at time $t$.
  - $K_p$ and $K_d$ are the proportional and derivative gains, respectively.
  - $q^{\mathrm{target}}_t$ are the target joint angles from the policy.
  - $q_t$ are the current joint angles of the robot.
  - $\dot{q}_t$ are the current joint velocities of the robot.
  - Table VI in the Appendix lists the default angles, $K_p$, and $K_d$ gains for each joint.
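A minimal numeric sketch of the per-joint PD torque computation; the gains come from Table VI for two example joints, while the joint state values are made up.

```python
import numpy as np

def pd_torque(q_target, q, qd, kp, kd):
    """tau = Kp * (q_target - q) - Kd * qd, elementwise per joint."""
    return kp * (q_target - q) - kd * qd

kp = np.array([200.0, 40.0])        # e.g., left_hip_pitch, left_shoulder_pitch (Table VI)
kd = np.array([5.0, 10.0])
q_target = np.array([-0.2, 0.35])   # default angles from Table VI
q = np.array([-0.25, 0.30])         # current joint angles (illustrative)
qd = np.array([0.1, -0.05])         # current joint velocities (illustrative)
tau = pd_torque(q_target, q, qd, kp, kd)
```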
4.2.2.3. Observations
The observation space for the privileged policy includes:
- Proprioception:
  - Joint angles
  - Joint velocities
  - Root linear velocity
  - Root angular velocity
  - Projected gravity in the root frame
  - Previous action
- Target Trajectory Reference: A window of future reference frames, sampled every $\Delta t^{\mathrm{obs}} = 0.1\,s$ over a fixed number of future time steps. Each frame consists of:
  - Target joint angles
  - Target joint velocities
  - Relative root position with respect to the robot's root
  - Relative positions of 41 keypoints on the robot with respect to the robot's root (Table V lists the keypoints)
  - Target reference binary hand states
  - Note: this observation carries information similar to the raw reference frames but transformed into the robot's frame and augmented for easier policy learning. Using relative keypoints with respect to the robot's root, together with relative root positions, targets precise trajectory following, unlike methods that rely on velocity commands and may drift.
- Privileged Task-Specific Observations: Relative pose of the object, and object mass and friction (where relevant).
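The sketch below assembles these components into a single flat observation vector; all dimensions (27 body joints, 41 keypoints, 5 future reference frames, 7-D object pose) are illustrative assumptions rather than the paper's exact sizes.

```python
import numpy as np

def build_obs(prop, ref_frames, task_obs):
    """Concatenate proprioception, future reference frames, and privileged task info."""
    return np.concatenate([prop, *ref_frames, task_obs])

prop = np.zeros(27 + 27 + 3 + 3 + 3 + 29)                    # q, qd, v, w, gravity, prev action
ref = [np.zeros(27 + 27 + 3 + 41 * 3 + 2) for _ in range(5)]  # per future reference frame
task = np.zeros(7 + 2)                                        # object pose + mass/friction
obs = build_obs(prop, ref, task)
```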
4.2.2.4. Rewards
The total reward is a sum of weighted terms: $ r_t = \sum_{i=1}^{10} w_i \, r_{i,t} + w_{\mathrm{task}} \, r_{\mathrm{task},t} $ Where:
- $w_i$ are the weights for the 10 reward terms (listed in Table I).
- $r_{i,t}$ are the individual reward terms.
- $w_{\mathrm{task}}$ is the weight for the task-specific sparse reward.
- $r_{\mathrm{task},t}$ is the task-specific sparse reward.
Table I: Reward terms for reference tracking and smooth policy enforcement.
| Reward Term | Interpretation |
|---|---|
| Joint-angle tracking | Penalizes deviation from reference joint angles |
| Keypoint tracking | Penalizes deviation from reference keypoints (3D positions in world frame) |
| Root-position tracking | Penalizes deviation of robot root from reference root position |
| Root-orientation tracking | Penalizes deviation in orientation between robot and reference |
| Hand-state tracking | Penalizes deviation of hand states from reference |
| Torque / acceleration | Penalizes high torques and accelerations |
| Action rate | Penalizes high action rate changes |
| Foot slide | Penalizes foot sliding while in ground contact |
| Foot contact | Penalizes excessive foot-ground contacts (to discourage baby steps) |
| Feet orientation | Encourages feet to remain parallel to the ground (discourages heel sliding) |
Additionally, task-specific sparse rewards are crucial for task completion. These are detailed in Table XIII in the Appendix and are typically binary signals indicating success (e.g., object above a certain height for Pick, object within a target distance for Precise Punch). The interacting body-part link and the interaction time $t_g$ are relevant to these conditions.
The reward weights ($w_i$ and $w_{\mathrm{task}}$) are task-specific and listed in Table XIV in the Appendix.
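To make the reward composition concrete, here is a minimal sketch of the weighted sum; the term names and weight values are placeholders, not the Table XIV settings.

```python
def total_reward(tracking_terms: dict, weights: dict, r_task: float, w_task: float) -> float:
    """r = sum_i w_i * r_i  +  w_task * r_task."""
    dense = sum(weights[name] * value for name, value in tracking_terms.items())
    return dense + w_task * r_task

terms = {"joint_tracking": 0.8, "keypoint_tracking": 0.7, "action_rate": -0.1}
w = {"joint_tracking": 1.0, "keypoint_tracking": 2.0, "action_rate": 0.5}
r = total_reward(terms, w, r_task=1.0, w_task=5.0)
```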
4.2.2.5. Training
- Environment: Training is conducted in IsaacLab [69], which uses the IsaacSim simulator.
- Algorithm: All policies are trained with Proximal Policy Optimization (PPO) [70].
- Hardware: An NVIDIA RTX A6000 with 48 GB of vRAM is used.
- Parameters: For each task, training runs for 2000 iterations with 8192 parallel environments.
- Model Architecture: A simple fully-connected MLP is used for both the actor (policy) and the critic, each with hidden layers of sizes (512, 256, 256). The same observations are used for both policy and critic, unlike asymmetric actor-critic setups.
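A minimal PyTorch sketch of such an actor/critic MLP with hidden sizes (512, 256, 256); the observation/action dimensions and the ELU activation are assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Fully-connected network with hidden layers 512-256-256."""
    layers, dims = [], [in_dim, 512, 256, 256]
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.ELU()]
    layers.append(nn.Linear(dims[-1], out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 480, 29          # illustrative sizes
actor = mlp(obs_dim, act_dim)       # outputs body joint targets + hand scalars
critic = mlp(obs_dim, 1)            # state-value estimate
value = critic(torch.zeros(1, obs_dim))
```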
4.2.3. Sim2Real Deployment
To deploy on real hardware, the observation space is modified to remove simulator-privileged information:
- The trajectory reference observation is removed from the policy input (though the references remain available through the rewards).
- The linear velocity of the root is removed.
- Privileged scene-physics information (e.g., object mass, friction) is removed.
- A time encoding (a function of the elapsed time relative to the episode length) is added. The reward function remains the same, but the reference trajectory for root position and yaw is transformed to avoid privileged inputs for the critic. The resulting policy depends only on the relative position of the object/goal.
- Hardware Setup: Unitree G1 humanoid (27-DoF, waist lock mode allowing only yaw) with Inspire dexterous hands (6-DoF, binary open/close).
- Sensors: Onboard IMU (root orientation, gravity, angular velocity) and a neck-mounted RealSense D435i depth camera used to estimate the 3D object/goal position relative to the pelvis.
- Object Position Estimation: OWLv2 [72] provides 2D localization, and depth data is used for 3D lifting with object-specific offsets. Due to OWLv2 latency, object estimates are fixed after the first frame.
- Mitigation of Static Estimate Errors: The lower body is frozen during interactive tasks (except bimanual pick), and a penalty on root velocities is added to keep the base static. This perception bottleneck is noted as a limitation, with vision-based policies via student-teacher distillation [4] suggested as future work.
- Sim2Real Trajectory Generation/Refinement (Appendix D):
  - Specific prompts (Table X) are used, and generated trajectories are slowed down by a task-specific factor (e.g., 2.5 for Pick) to ensure safety and minimize the sim-real gap.
  - Refinements for Pick and Precise Punch add a cost that brings trajectories close to the goal points in the optimization problem, yielding smoother policies.
  - For Pick, the lower body is set to fixed default joint angles and the root height is adjusted to ensure a standing-still motion.
  - For Bimanual Pick and Squat, IK is used to remove feet slipping by adjusting leg/foot joints, and root roll/yaw are set to 0 for symmetry.
5. Experimental Setup
5.1. Datasets
The primary dataset mentioned for training the human motion prior (OmniControl) is HumanML3d [68].
- Source: HumanML3d is a large-scale, high-quality dataset of 3D human motions coupled with natural language descriptions, covering a wide variety of human activities.
- Characteristics: It provides diverse human motions, allowing the diffusion model to learn a rich distribution of natural human movements. The pairing of motion with text descriptions is crucial for text-conditioned motion generation.
- Domain: Human motion data.
- Effectiveness: It is effective for validating the method's performance in generating human-like motions and leveraging text-based instructions. The paper uses HumanML3d to evaluate the "human-ness" of its generated motions via FID scores.
- Data Sample Example: While the paper does not show a direct data sample from HumanML3d, the text prompts used for OmniControl (e.g., "a person walks to cup, grabs the cup from side and lifts up" for the Pick task, as seen in Table IX) correspond directly to the kind of language conditioning HumanML3d enables.

For the RL training phase, no external dataset is used. Instead, the RL policies are trained entirely within a simulated environment (IsaacSim via IsaacLab), where scenes are synthesized based on the reference trajectories generated in Stage 1. This highlights DreamControl's data-efficient nature in the RL phase, as it does not require robot-specific demonstrations.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
5.2.1. Success Rate (%)
- Conceptual Definition: This metric quantifies the percentage of trials (episodes) in which the robot successfully completes the designated task according to predefined criteria. It directly measures the task-solving capability of the learned policy.
- Mathematical Formula: $ \mathrm{Success\ Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
- Symbol Explanation:
  - Number of Successful Trials: the count of experimental runs where the robot achieved the task's success conditions.
  - Total Number of Trials: the total count of experimental runs performed.
  - The factor of 100% expresses the result as a percentage.
- Task-Specific Success Criteria: These are defined in the Appendix (e.g., for Pick, the object must be above a certain height; for Press Button, the robot's end-effector must be within a threshold distance of the button for a specific duration).
5.2.2. Fréchet Inception Distance (FID)
- Conceptual Definition: FID is a metric used to assess the quality of images or generated data. In the context of motion generation, it measures the "distance" between the distribution of generated motion trajectories and the distribution of real (ground-truth human) motion trajectories. A lower FID score indicates that the generated motions are more similar to real human motions, implying higher quality and naturalness. It captures both the fidelity and the diversity of the generated samples.
- Mathematical Formula: $ \mathrm{FID} = \| \mu_x - \mu_g \|_2^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
- Symbol Explanation:
  - $\mu_x$: the mean of the feature vectors for the real (ground-truth human) motion trajectories.
  - $\mu_g$: the mean of the feature vectors for the generated motion trajectories.
  - $\Sigma_x$: the covariance matrix of the feature vectors for the real motion trajectories.
  - $\Sigma_g$: the covariance matrix of the feature vectors for the generated motion trajectories.
  - $\| \cdot \|_2^2$: the squared Euclidean distance.
  - $\mathrm{Tr}(\cdot)$: the trace of a matrix (sum of diagonal elements).
  - $(\Sigma_x \Sigma_g)^{1/2}$: the matrix square root.
- Feature Vectors: In practice, motion trajectories are passed through a pre-trained feature extractor (e.g., an Inception v3 network for images, or a specialized motion encoder for motions) to obtain compact representations; FID is computed on these feature distributions.
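The following sketch computes FID between two sets of feature vectors. In practice the features would come from a pretrained motion encoder; here they are random arrays used only to show the calculation.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two feature sets with rows as samples."""
    mu_x, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_x @ sigma_g)          # matrix square root
    if np.iscomplexobj(covmean):                       # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))

rng = np.random.default_rng(0)
score = fid(rng.normal(size=(200, 16)), rng.normal(0.1, 1.0, size=(200, 16)))
```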
5.2.3. Jerk
- Conceptual Definition: Jerk is the third derivative of position with respect to time, or the rate of change of acceleration. In the context of robot motion, jerk measures the smoothness of movement. High jerk indicates abrupt, sudden changes in acceleration, which often results in jerky, unnatural, or uncomfortable motions; conversely, lower jerk indicates smoother, more fluid movements.
- Mathematical Formula: $ \mathrm{Jerk} = \frac{ \sum_{i} \sum_{t} \sum_{k} \left| \dddot{p}_{i,t,k}^{\mathrm{key,global}} \right| }{ N T K } $
- Symbol Explanation:
  - $\dddot{p}_{i,t,k}^{\mathrm{key,global}}$: the third derivative of the global position of the $k$-th keypoint in the $i$-th trajectory at time $t$, i.e., the instantaneous jerk for a specific keypoint at a specific time.
  - $N$: the total number of trajectories evaluated.
  - $T$: the total number of time-steps in each trajectory.
  - $K$: the total number of keypoints on the robot.
  - The numerator sums the absolute jerk values across all trajectories, time steps, and keypoints; dividing by $NTK$ normalizes by the total number of keypoint-time-trajectory samples.
5.2.4. User Study
- Conceptual Definition: A qualitative evaluation method where human participants are asked to assess the
naturalnessorhuman-likenessof generated robot motions. This provides direct human feedback, complementing quantitative metrics. - Methodology: 40 participants were shown side-by-side videos of trajectories from different methods (order randomized) and asked to select which looked more human-like.
- Result: Reported as the average human preference percentage.
5.3. Baselines
The paper compares DreamControl against three baseline methods to evaluate its effectiveness:
1. (a) TaskOnly:
   - Description: This baseline uses only task-specific (sparse) rewards. It provides no dense guidance to the RL policy during training, meaning the robot only receives a reward signal when it successfully completes the task or reaches specific task milestones.
   - Representativeness: This represents a naive RL approach in which the agent must discover the entire complex whole-body locomotion and manipulation strategy through pure exploration based on a very infrequent success signal. It highlights the difficulty of long-horizon, high-dimensional RL problems with sparse rewards.
2. (b) TaskOnly+ (task rewards with engineered dense terms):
   - Description: This baseline improves upon TaskOnly by incorporating task-specific rewards that include both sparse rewards (for task completion) and engineered dense rewards. These dense rewards are inspired by prior work like OmniGrasp [8] and are designed to encourage pre-grasp or pre-approach poses for the object or goal; for instance, a dense term rewards bringing the relevant robot body part close to its target position at the interaction time.
   - Representativeness: This baseline reflects a common strategy in RL where reward shaping is used to facilitate learning. It demonstrates whether careful reward engineering alone is sufficient to solve whole-body loco-manipulation tasks and generate natural motions.
3. (c) TrackingOnly:
   - Description: This baseline focuses exclusively on tracking rewards. The RL policy is rewarded primarily for accurately following the reference trajectories generated by Stage 1 (the diffusion prior), aiming to mimic the human-like motions without explicit task-specific rewards.
   - Representativeness: This baseline isolates the contribution of the diffusion prior in generating natural, dynamically feasible motions. It assesses how well the robot can track human-inspired movements when not explicitly optimized for task completion, akin to motion-tracking approaches but applied to generated rather than externally provided human motions.

DreamControl (Ours) combines the strengths of TrackingOnly (using tracking rewards from the diffusion prior) with task-specific sparse rewards (similar to TaskOnly's sparse component) to achieve both natural motion and task completion.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results strongly validate the effectiveness of DreamControl across a diverse set of 11 challenging tasks. The method consistently outperforms all baselines in terms of task success rates and generates significantly more human-like and smoother motions.
6.1.1. Simulation Success Rates
The following are the results from Table II of the original paper:
| Task / Method | (a) TaskOnly | (b) TaskOnly+ | (c) TrackingOnly | Ours |
|---|---|---|---|---|
| Pick | 0 | 15.1 | 87.5 | 95.4 |
| Bimanual Pick | 0 | 31.0 | 100 | 100 |
| Pick from Ground (Side Grasp) | 0 | 0 | 99.4 | 100 |
| Pick from Ground (Top Grasp) | 0 | 0 | 100 | 100 |
| Press Button | 0 | 99.8 | 99.1 | 99.3 |
| Open Drawer | 0 | 24.5 | 100 | 100 |
| Open Door | 0 | 15.4 | 100 | 100 |
| Precise Punch | 0 | 100 | 99.4 | 99.7 |
| Precise Kick | 0 | 97.6 | 96.1 | 98.6 |
| Jump | 0 | 0 | 100 | 100 |
| Sit | 0 | 100 | 100 | 100 |
Analysis:
- (a) TaskOnly (0% success across all tasks): This baseline, relying solely on sparse task-specific rewards, consistently fails. This highlights the severe RL exploration problem in high-dimensional whole-body control tasks without any dense guidance; the robot cannot discover meaningful motions to complete tasks by chance.
- (b) TaskOnly+ (variable success, often low): Adding engineered dense rewards improves performance on simpler tasks like Press Button (99.8%) and Precise Punch (100%). However, it still fails on tasks requiring complex coordinated whole-body motion, such as Pick from Ground (0%), Jump (0%), and Bimanual Pick (31%). For instance, in Jump, the robot might stretch its knees but cannot discover the full sequence of crouching and springing due to insufficient guidance from simple pelvis-target rewards. This demonstrates that while reward engineering helps, it is often insufficient for intricate loco-manipulation and can be very challenging to design effectively for complex, natural motions.
- (c) TrackingOnly (high success, but struggles with fine-grained interaction): This baseline performs significantly better overall, achieving 100% success on many tasks like Bimanual Pick, Pick from Ground, Open Drawer, Open Door, Jump, and Sit. This indicates that the human motion prior effectively generates dynamically feasible and goal-oriented motions. However, TrackingOnly struggles with fine-grained interactive tasks like Pick (87.5%), where precise object interaction is critical and sparse rewards (absent here) could make a difference.
- DreamControl (Ours) (highest overall performance): DreamControl achieves the best results on 9 of 11 tasks and matches the best on the other 2, with near-perfect or perfect success rates on most tasks (e.g., 95.4% for Pick; 100% for Bimanual Pick, Pick from Ground, Open Drawer, Open Door, Jump, and Sit). By combining dense guidance from tracking rewards with task-specific sparse rewards, DreamControl learns robust policies that not only follow natural motion plans but also reliably complete the required interactions, overcoming the limitations of the baselines and demonstrating the power of a human motion-informed prior guiding RL.
6.1.2. Human-ness Comparison
The evaluation of human-ness (naturalness and smoothness of motion) provides further insight into the quality of DreamControl's trajectories.
The following are the results from Table III of the original paper:
| Task | Method | FID ↓ | Jerk ↓ | User Study ↑ |
|---|---|---|---|---|
| Pick | TaskOnly+ | 0.240 | 211.2 | 15.0% |
| Ours | 0.320 | 147.5 | 85.0% | |
| Press Button | TaskOnly+ | 1.220 | 235.7 | 17.25% |
| Ours | 0.375 | 161.9 | 82.75% | |
| Precise Punch | TaskOnly+ | 0.417 | 229.9 | 7.5% |
| Ours | 0.084 | 199.8 | 92.5% | |
| Precise Kick | TaskOnly+ | 0.522 | 360.9 | 17.5% |
| Ours | 0.161 | 252.5 | 82.5% | |
| Jump | TaskOnly+ | 1.216 | 236.4 | 5.0% |
| Ours | 0.208 | 148.5 | 95.0% |
Analysis:
- Fréchet Inception Distance (FID): DreamControl consistently achieves significantly lower FID values than TaskOnly+ across almost all tasks. A lower FID indicates a closer alignment with the distribution of human motions, confirming that DreamControl generates more natural-looking trajectories.
  - Pick Task Exception: The Pick task is an exception, where TaskOnly+ shows a lower FID (0.240 vs. 0.320). The authors conjecture this is due to a domain gap: human demonstrations in HumanML3d typically involve waist-level picking, while the shorter G1 robot often performs shoulder-level picks. This structural difference makes the G1's motion inherently less comparable to the human dataset for this specific task, even if the motion is natural for the robot.
- Jerk: DreamControl consistently produces lower average absolute jerk values than TaskOnly+ across all tasks, quantitatively confirming that its motions are smoother and more fluid, without the abrupt accelerations that characterize unnatural or "robotic" movements.
- User Study: The user study results provide strong qualitative validation. Participants overwhelmingly preferred DreamControl's trajectories, selecting them as more human-like in all tasks (e.g., 85% for Pick, 95% for Jump). This subjective evaluation aligns with the jerk metric and generally with FID, reinforcing the naturalness benefit.
6.1.3. Visual Comparison (Jump Task)
Fig. 3: Trajectory comparison for the Jump task. The top row shows results from the TaskOnly+ baseline, while the bottom row shows trajectories from DreamControl. The yellow sphere marks the spatial control point used to guide the trajectories.
Analysis:
The visual comparison for the Jump task (Fig. 3) strikingly illustrates the difference in human-ness.
- TaskOnly+ (Top Row): The robot attempts to jump by lifting off the ground without bending its knees or otherwise preparing for the jump. This results in a less human-like motion that also fails to accomplish a proper jump; the motion appears stiff and inefficient.
- DreamControl (Bottom Row): The robot exhibits a smooth jumping motion in which it first bends (crouches) and then lifts off the ground. This behavior is characteristic of how humans jump, demonstrating the naturalness imparted by the human motion prior. The motion is fluid, effective, and visually appealing, further confirming the quantitative findings regarding jerk and user preference.

In summary, DreamControl not only enables robust task completion in challenging loco-manipulation scenarios but also ensures that these tasks are performed with natural, human-like, and smooth motions, a critical factor for both sim-to-real transfer and human-robot interaction.
6.2. Data Presentation (Tables)
The following are the results from Table IV, Table V, Table VI, Table VII, Table IX, Table X, Table XI, Table XII, Table XIII, Table XIV of the original paper:
6.2.1. Robot Details (Appendix Tables)
The following are the results from Table IV of the original paper:
| Legs | ||
|---|---|---|
| left_hip_pitch_joint right_hip_pitch_joint left_hip_roll_joint right_hip_roll_joint left_hip_yaw_joint right_hip_yaw_joint left_knee_joint right_knee_joint left_ankle_pitch_joint right_ankle_pitch_joint left_ankle_roll_joint right_ankle_roll_joint |
||
| Waist | ||
| waist_yaw_joint | ||
| (Left |Right) Arms | ||
| left_shoulder_pitch_joint right_shoulder_pitch_joint left_shoulder_roll_joint right_shoulder_roll_joint left_shoulder_yaw_joint right_shoulder_yaw_joint left_elbow_joint right_elbow_joint left_wrist_roll_joint right_wrist_roll_joint left_wrist_pitch_joint right_wrist_pitch_joint left_wrist_yaw_joint right_wrist_yaw_joint |
||
| (Left | Right) Hands | ||
| left_hand_index_0_joint right_hand_index_0_joint left_hand_index_1_joint right_hand_index_1_joint left_hand_middle_0_joint right_hand_middle_0_joint left_hand_middle_1_joint right_hand_middle_1_joint left_hand_thumb_0_joint right_hand_thumb_0_joint left_hand_thumb_1_joint right_hand_thumb_1_joint left_hand_thumb_2_joint right_hand_thumb_2_joint |
The following are the results from Table V of the original paper:
| Legs | |
|---|---|
| left_hip_pitch_link left_hip_roll_link | right_hip_pitch_link right_hip_roll_link |
| left_hip_yaw_link | right_hip_yaw_link |
| left_knee_link | right_knee_link |
| left_ankle_pitch_link | |
| left_ankle_roll_link | right_ankle_pitch_link right_ankle_roll_link |
| Waist & Torso | |
| pelvis | pelvis_contour_link |
| waist_yaw_link | waist_roll_link |
| torso_link logo_link | waist_support_link |
| Head & Sensors | |
| head_link | imu_link |
| d435_link | mid360_link |
| Arms | |
| left_shoulder_pitch_link | right_shoulder_pitch_link |
| left_shoulder_roll_link | right_shoulder_roll_link |
| left_shoulder_yaw_link | right_shoulder_yaw_link |
| left_elbow_link | right_elbow_link |
| left_wrist_roll_link | right_wrist_roll_link |
| left_wrist_pitch_link | right_wrist_pitch_link |
| left_wrist_yaw_link | right_wrist_yaw_link |
| left_rubber_hand | right_rubber_hand |
The following are the results from Table VI of the original paper:
| Joint name | Default angle | Kp | Kd |
|---|---|---|---|
| left_hip_pitch_joint | -0.2 | 200 | 5 |
| left_hip_roll_joint | 0 | 150 | 5 |
| left_hip_yaw_joint | 0 | 150 | 5 |
| left_knee_joint | 0.42 | 200 | |
| left_ankle_pitch_joint | -0.23 | 20 | 20 |
| left_ankle_roll_joint | 0 | 20 | |
| right_hip_pitch_joint | -0.2 | 200 | |
| right_hip_roll_joint | 0 | 150 | |
| right_hip_yaw_joint | 0 | 150 | 5 |
| right_knee_joint | 0.42 | 200 | 2 |
| right_ankle_pitch_joint | -0.23 | 20 | |
| right_ankle_roll_joint | 0 | 20 | 2 |
| waist_yaw_joint | 0 | 200 | 5 |
| left_shoulder_pitch_joint | 0.35 | 40 | 10 |
| left_shoulder_roll_joint | 0.16 | 40 | 10 |
| left_shoulder_yaw_joint | 0 | 40 | 10 |
| left_elbow_joint | 0.87 | 40 | 10 |
| left_wrist_roll_joint | 0 | 40 | 10 |
| left_wrist_pitch_joint | 0 | 40 | 10 |
| left_wrist_yaw_joint | 0 | 40 | 10 |
| left_hand_index_0_joint | 0 | 5 | 1.25 |
| left_hand_index_1_joint | 0 | 5 | 1.25 |
| left_hand_middle_0_joint | 0 | 5 | 1.25 |
| left_hand_middle_1_joint | 0 | 5 | 1.25 |
| left_hand_thumb_0_joint | 0 | 5 | 1.25 |
| left_hand_thumb_1_joint | 0 | 5 | 1.25 |
| left_hand_thumb_2_joint | 0 | 5 | 1.25 |
| right_shoulder_pitch_joint | 0.35 | 40 | 10 |
| right_shoulder_roll_joint | -0.16 | 40 | 10 |
| right_shoulder_yaw_joint | 0 | 40 | 10 |
| right_elbow_joint | 0.87 | 40 | 10 |
| right_wrist_roll_joint | 0 | 40 | 10 |
| right_wrist_pitch_joint | 0 | 40 | 10 |
| right_wrist_yaw_joint | 0 | 40 | 10 |
| right_hand_index_0_joint | 0 | 5 | 1.25 |
| right_hand_index_1_joint | 0 | 5 | 1.25 |
| right_hand_middle_0_joint | 0 | 5 | 1.25 |
| right_hand_middle_1_joint | 0 | 5 | 1.25 |
| right_hand_thumb_0_joint | 0 | 5 | 1.25 |
| right_hand_thumb_1_joint | 0 | 5 | 1.25 |
| right_hand_thumb_2_joint | 0 | 5 | 1.25 |
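The Kp/Kd values in Table VI are per-joint proportional-derivative gains applied around the listed default angles. The snippet below is a minimal sketch of how such gains are typically used to convert position targets into torques; it is not the paper's implementation, and the three-joint subset and measured-state values are illustrative only.

```python
import numpy as np

def pd_torques(q_target, q, qd, kp, kd):
    """PD joint-position control: tau = Kp * (q_target - q) - Kd * qd.

    q_target : desired joint angles (e.g., default angle plus policy offset), rad
    q, qd    : measured joint angles (rad) and velocities (rad/s)
    kp, kd   : per-joint gains, as listed in Table VI
    """
    return kp * (q_target - q) - kd * qd

# Hypothetical 3-joint subset (left hip pitch/roll/yaw from Table VI)
kp = np.array([200.0, 150.0, 150.0])
kd = np.array([5.0, 5.0, 5.0])
default_q = np.array([-0.2, 0.0, 0.0])

action = np.zeros(3)                 # policy output interpreted as an offset
q_target = default_q + action
q = np.array([-0.18, 0.01, 0.0])     # illustrative measured state
qd = np.array([0.1, 0.0, -0.05])

print(pd_torques(q_target, q, qd, kp, kd))
```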
6.2.2. Hand States and Trajectory Refinements
The following are the results from Table VII of the original paper:
| Task | Left Hand Config (Open — Close) | Right Hand Config (Open — Close) | Group set to default q |
|---|---|---|---|
| Pick | ACL — ACL | AOR — ACR | G_left arm |
| Precise Punch | ACL — ACL | ACR — ACR | - |
| Precise Kick | ACL — ACL | ACR — ACR | |
| Press Button | ACL — ACL | PBR — PBR | |
| Jump | ACL — ACL | ACR — ACR | |
| Sit | ACL — ACL | ACR — ACR | |
| Bimanual Pick | ACL — ACL | ACR — ACR | |
| Pick from Ground (Side Grasp) | AOL — ACL | ACR — ACR | G_right arm |
| Pick from Ground (Top Grasp) | ACL — ACL | ACR — ACR | G_left arm |
| Pick and Place | ACL — ACL | AOR — ACR | |
| Open Drawer | ACL — ACL | DOR — ACR | |
| Open Door | ACL — ACL | DOR — ACR |
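The labels in Table VII (ACL/ACR, AOL/AOR, DOR, PBR) denote preset open/close hand configurations between which the controller switches; their exact joint values are not reproduced in the extracted tables. The sketch below only illustrates the lookup pattern, with entirely hypothetical preset vectors over the seven finger joints per hand listed in Table IV.

```python
import numpy as np

# Hypothetical preset finger-joint vectors (7 joints per G1 hand). The labels
# follow Table VII, but these numeric values are invented placeholders.
HAND_PRESETS = {
    "ACL": np.zeros(7),
    "AOL": np.full(7, 0.8),
    "ACR": np.zeros(7),
    "AOR": np.full(7, 0.8),
    "DOR": np.array([0.3, 0.3, 0.3, 0.3, 0.5, 0.5, 0.5]),
    "PBR": np.array([0.0, 0.0, 0.6, 0.6, 0.6, 0.6, 0.6]),
}

def hand_targets(task_row, grasp_closed: bool):
    """Select the open or close preset for each hand from a Table VII row."""
    left_open, left_close = task_row["left"]
    right_open, right_close = task_row["right"]
    left = HAND_PRESETS[left_close if grasp_closed else left_open]
    right = HAND_PRESETS[right_close if grasp_closed else right_open]
    return left, right

pick_row = {"left": ("ACL", "ACL"), "right": ("AOR", "ACR")}
print(hand_targets(pick_row, grasp_closed=True))
```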
6.2.3. Task-Specific Prompts for Simulation
The following are the results from Table IX of the original paper:
| Task | Prompts |
|---|---|
| Pick | "a person walks to cup, grabs the cup from side and lifts up" |
| Precise Punch | "a person performs a single boxing punch with his right hand" |
| Precise Kick | "a person stands and kicks with his right leg" |
| Press Button | "a person walks towards elevator, presses elevator button" |
| Jump | "a person jumps forward" |
| Sit | "a person walks towards a chair, sits down" |
| Bimanual Pick | "a person raises the toolbox with both hands" |
| Pick from Ground (Side Grasp) | "a person raises the toolbox with the use of one hand" |
| Pick from Ground (Top Grasp) | "a person walks forward, bends down to pick something up off the ground" |
| Pick and Place | "a person picks the cup and puts it on another table" |
6.2.4. Task-Specific Prompts and Slow-Down Factors for Real Robot
The following are the results from Table X of the original paper:
| Task | Prompts | Slow down factor |
|---|---|---|
| Pick | "a person stands in place, grabs the cup from side and lifts up" | 2.5 |
| Precise Punch | "a person performs a single boxing punch with his right hand" | 1.5 |
| Press Button | "a person stands in place, presses elevator button" | 1.5 |
| Bimanual Pick | "a person raises the toolbox with both hands" | 1 |
| Squat | "a person squats in place and stands up" | 1 |
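A slow-down factor of s stretches the reference motion so it takes s times longer to execute on the real robot. The paper does not spell out the resampling scheme, so the sketch below assumes linear interpolation of joint targets at a fixed control timestep; the 50 Hz rate and 29-DoF reference are illustrative assumptions.

```python
import numpy as np

def slow_down(traj, factor, dt=0.02):
    """Time-stretch a reference trajectory by `factor` via linear interpolation.

    traj   : (T, D) array of reference joint targets sampled every `dt` seconds
    factor : slow-down factor from Table X (e.g., 2.5 for Pick)
    """
    T, D = traj.shape
    t_src = np.arange(T) * dt
    t_new = np.arange(0.0, t_src[-1] * factor + 1e-9, dt)
    # Map each new timestamp back to source time and interpolate per dimension.
    return np.stack(
        [np.interp(t_new / factor, t_src, traj[:, d]) for d in range(D)], axis=1
    )

ref = np.random.rand(100, 29)            # 2 s of a 29-DoF reference at 50 Hz
slowed = slow_down(ref, factor=2.5)
print(ref.shape, slowed.shape)           # stretched to roughly 2.5x the steps
```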
6.2.5. Trajectory Filtering Constants
The following are the results from Table XI of the original paper:
| Task | #Trajs before | #Trajs after | β_torso | β_pelvis |
|---|---|---|---|---|
| Pick | 100 | 67 | π/4 | 0.6 |
| Precise Punch | 100 | 100 | π/4 | 0.6 |
| Precise Kick | 100 | 66 | π/2 | 0.5 |
| Press Button | 100 | 96 | π/3 | 0.5 |
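Table XI lists per-task thresholds used to discard generated trajectories. The exact criteria are not reproduced here, so the sketch below assumes β_torso is a maximum torso tilt in radians and β_pelvis a minimum pelvis height in meters; both interpretations are assumptions rather than the paper's definitions.

```python
import numpy as np

def keep_trajectory(torso_tilt, pelvis_height, beta_torso, beta_pelvis):
    """Heuristic filter: reject trajectories with excessive torso tilt or a
    pelvis that drops too low at any timestep. Thresholds follow Table XI; the
    precise meaning of the two quantities is an assumption.
    """
    return bool(np.all(torso_tilt < beta_torso) and np.all(pelvis_height > beta_pelvis))

# Illustrative check with the Pick thresholds (beta_torso = pi/4, beta_pelvis = 0.6)
tilt = np.abs(np.random.normal(0.0, 0.2, size=200))   # rad, per timestep
height = 0.72 + 0.03 * np.random.randn(200)           # m, per timestep
print(keep_trajectory(tilt, height, np.pi / 4, 0.6))
```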
6.2.6. Environment Randomization Parameters
The following are the results from Table XII of the original paper:
| Task | Friction | Mass of object |
|---|---|---|
| Pick | U(0.7, 1) | U(0.1, 1) |
| Precise Punch | U(0.7, 1) | |
| Precise Kick | U(0.7, 1) | |
| Press Button | U(0.7,1) | |
| Jump | U(0.7, 1) | |
| Sit | U(0.7, 1) | |
| Bimanual Pick | U(0.7, 1) | U(0.1,5) |
| Pick from Ground (Side Grasp) | U(0.7, 1) | U(0.1, 1) |
| Pick from Ground (Top Grasp) | U(0.7, 1) | U(0.1, 0.5) |
| Pick and Place | U(0.7, 1) | U(0.1, 0.5) |
| Open Drawer | U(0.7, 1) | |
| Open Door | U(0.7, 1) |
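Here U(a, b) denotes a uniform distribution over [a, b]. A minimal sketch of per-episode randomization under these ranges is shown below; the parameter names and the dictionary structure are mine (a subset of Table XII), not the simulator's API.

```python
import random

# Subset of Table XII; (low, high) bounds of uniform distributions.
RANDOMIZATION = {
    "Pick":          {"friction": (0.7, 1.0), "object_mass": (0.1, 1.0)},
    "Bimanual Pick": {"friction": (0.7, 1.0), "object_mass": (0.1, 5.0)},
    "Open Drawer":   {"friction": (0.7, 1.0)},   # no object-mass range listed
}

def sample_episode_params(task):
    """Draw one set of physics parameters for an episode of `task`."""
    return {name: random.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION[task].items()}

print(sample_episode_params("Bimanual Pick"))
```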
6.2.7. Task-Specific Sparse Rewards
The following are the results from Table XIII of the original paper:
| Task | Body part link, b | Sparse reward, r_task,sparse | Description |
|---|---|---|---|
| Pick | right_wrist_yaw_link | | |
| Precise Punch | right_wrist_yaw_link | | |
| Precise Kick | right_ankle_roll_link | | |
| Press Button | right_wrist_yaw_link | | |
| Jump | pelvis | | |
| Sit | pelvis | | |
| Bimanual Pick | right_wrist_yaw_link, left_wrist_yaw_link | | |
| Pick from Ground (Side Grasp) | left_wrist_yaw_link | | |
| Pick from Ground (Top Grasp) | right_wrist_yaw_link | | |
| Pick and Place | right_wrist_yaw_link | | depends on the position of the goal |
| Open Drawer | right_wrist_yaw_link | | depends on the drawer open amount |
| Open Door | right_wrist_yaw_link | | depends on the door open amount |
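The sparse-reward expressions themselves did not survive extraction; only the tracked body links are reproduced above. As an illustration of the general structure such terms often take, the sketch below uses a distance-thresholded indicator on the listed link. This is a guess at the form, not the paper's formula, and the 5 cm threshold is arbitrary.

```python
import numpy as np

def sparse_reach_reward(body_pos, goal_pos, eps=0.05):
    """Return 1.0 once the tracked link (e.g., right_wrist_yaw_link) is within
    `eps` meters of the task goal, else 0.0. Illustrative only; the paper's
    exact sparse-reward terms are not reproduced in the extracted table.
    """
    return float(np.linalg.norm(body_pos - goal_pos) < eps)

print(sparse_reach_reward(np.array([0.42, 0.10, 0.95]),
                          np.array([0.45, 0.08, 0.97])))
```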
6.2.8. Reward Weights for All Tasks
The following are the results from Table XIV of the original paper:
In the original two-level header, the wr1 through wr10 columns are grouped under tracking and smoothness terms, followed by the sparse and dense task-reward weights w_task,sparse and w_task,dense.

| Task | wr1 | wr2 | wr3 | wr4 | wr5 | wr6 | wr7 | wr8 | wr9 | wr10 | w_task,sparse | w_task,dense |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pick | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 0.1 | 100 |
| Precise Punch | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 1 | 100 |
| Precise Kick | -0.2 | -0.1 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.15 | -1 for left, -0.3 for right | 1 | 100 |
| Press Button | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 1 | 100 |
| Jump | -0.2 | -0.1 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 1 | 100 |
| Sit | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 1 | 100 |
| Bimanual Pick | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 0.1 | 100 |
| Pick from Ground (Side Grasp) | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 0.1 | 100 |
| Pick from Ground (Top Grasp) | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.15 | -1 | 0.1 | 100 |
| Pick and Place | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 0.1 | 100 |
| Open Drawer | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 0.1 | 100 |
| Open Door | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 0.1 | 100 |
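Reading Table XIV as a weighted sum over the ten tracking/smoothness terms plus the sparse and dense task rewards, the total per-step reward could be assembled roughly as below. The individual term definitions are not reproduced in this analysis, so the `terms` dictionary is hypothetical; the weights shown are the Pick row of the table.

```python
import numpy as np

PICK_WEIGHTS = {
    "wr": [-0.2, -0.05, -0.2, -0.2, 0.3, -1.5e-7, -5e-3, -0.1, -0.5, -1.0],
    "w_task_sparse": 0.1,
    "w_task_dense": 100.0,
}

def total_reward(terms, weights):
    """Weighted sum of per-step reward terms.

    terms : dict with "r" (the 10 tracking/smoothness terms, in the same order
            as wr1..wr10 in Table XIV), "r_task_sparse", and "r_task_dense".
    """
    r = np.dot(weights["wr"], terms["r"])
    r += weights["w_task_sparse"] * terms["r_task_sparse"]
    r += weights["w_task_dense"] * terms["r_task_dense"]
    return r

example_terms = {"r": np.random.rand(10), "r_task_sparse": 0.0, "r_task_dense": 0.02}
print(total_reward(example_terms, PICK_WEIGHTS))
```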
6.3. Sim2Real Deployment Results
While no specific success rates for sim-to-real deployment are provided in a table, the paper states that policies for Pick (standing), Bimanual Pick (varying weights), Press Button (standing), Open Drawer (different positions), Precise Punch (standing), and Squat (varying depths) were successfully deployed on the Unitree G1 robot. This qualitative validation is demonstrated through visualizations in Fig. 1 (Image 1 from the original paper) and additional videos on the project website.
The successful sim-to-real transfer for a selection of tasks highlights DreamControl's ability to produce robust policies that are not overly sensitive to the sim-to-real gap. The adaptations to the observation space (removing privileged simulator information and adding a time encoding) and the slow-down factors applied to certain tasks (Table X) were crucial for real-world performance. Freezing the lower body for some interactive tasks, necessitated by perception bottlenecks (e.g., OWLv2 latency), indicates that the core control methodology is sound but that real-world perception remains a challenge.
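The time encoding added to the deployed observation space is described only briefly. One plausible form is a sinusoidal encoding of normalized episode time, sketched below as an assumption rather than the paper's exact formulation.

```python
import numpy as np

def time_encoding(step, episode_len):
    """Encode normalized episode time as (sin, cos) of a single phase.

    One plausible form of the 'time encoding' observation used at deployment;
    the actual encoding is not specified in this analysis.
    """
    phase = 2.0 * np.pi * step / episode_len
    return np.array([np.sin(phase), np.cos(phase)])

print(time_encoding(step=120, episode_len=400))
```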
The image is an illustration showing the Unitree G1 humanoid robot performing various skills trained via DreamControl. It depicts the robot opening a drawer, bimanually picking an object, picking up an object with one hand, and pressing an elevator button.
6.4. Ablation Studies / Parameter Analysis
The paper's comparison against TaskOnly, the hand-engineered dense-reward baseline, and TrackingOnly can be viewed as a form of ablation study, demonstrating the necessity and effectiveness of combining tracking rewards from the diffusion prior with task-specific sparse rewards.
- The TaskOnly baseline shows that sparse rewards alone are insufficient.
- The dense-reward baseline shows that even hand-engineered dense rewards often fall short for complex whole-body tasks and may not produce natural motions.
- The TrackingOnly baseline shows the power of the motion prior in generating dynamically feasible and natural motions, but also its limitation in precisely executing fine-grained interaction tasks without task-specific rewards.

DreamControl's superior performance, combining tracking and sparse task rewards, validates that both components are essential and contribute synergistically to robust, natural skill acquisition. The sim-to-real deployment section also details adaptations to the observation space and trajectory refinements (e.g., slow-down factors, lower-body freezing, additional optimization costs) that act as practical ablation or parameter tuning for real-world scenarios.
7. Conclusion & Reflections
7.1. Conclusion Summary
DreamControl introduces a novel and highly effective recipe for training autonomous whole-body humanoid skills. By ingeniously leveraging guided diffusion models for long-horizon human-inspired motion planning and Reinforcement Learning for robust task-specific control, the method successfully addresses critical challenges in humanoid robotics. Key contributions include the use of human motion data (instead of expensive robot demonstrations) to inform a diffusion prior, which not only enables RL to discover solutions unattainable by direct RL but also inherently promotes natural-looking, smooth motions. The extensive simulation experiments and real-world deployment on a Unitree G1 robot across 11 challenging tasks (including object interaction and loco-manipulation) firmly validate DreamControl's effectiveness, robustness, and potential for sim-to-real transfer.
7.2. Limitations & Future Work
The authors openly acknowledge several limitations and propose future research directions:
- Skill Composition: The current implementation does not yet support the composition of skills (e.g., combining "pick up" with "walk to table"), which would be crucial for more complex, multi-stage tasks.
- Dexterous Manipulation: The method currently uses discrete open/closed hand configurations. Extending it to dexterous manipulation with fine-grained finger control for more intricate object handling is a next step.
- Complex Object Geometries: The system currently handles relatively simple object geometries. Supporting more complex object geometries and interactions would enhance its generality.
- Broader Repertoire of Tasks: While a diverse set of tasks is demonstrated, scaling DreamControl to an even broader repertoire of tasks is an ongoing goal.
- Diverse Robot Morphologies: Validation is limited to the Unitree G1; applying DreamControl to more diverse robot morphologies would test its generalization capabilities.
- Perception Bottlenecks: For sim-to-real deployment, the reliance on static object estimates from OWLv2 due to latency is a limitation. The authors suggest addressing this with vision-based policies trained via student-teacher distillation (e.g., [4]) as future work.
7.3. Personal Insights & Critique
DreamControl represents a highly insightful and promising direction for humanoid robotics. The core idea of leveraging human motion priors to guide RL is elegant and addresses several fundamental problems simultaneously.
Strengths and Inspirations:
- Data Efficiency for Robots: The most significant insight is how DreamControl circumvents the robot data scarcity problem by tapping into the abundance of human motion data. This indirectly human-in-the-loop approach, where human motion informs the prior, is a scalable and powerful alternative to relying on expensive robot teleoperation. The concept could extend to other domains where robot-specific data is scarce by finding analogous human or animal motion data.
- Naturalness by Design: The inherent ability of diffusion models to generate natural and diverse motions is a major advantage. This naturalness translates directly to smoother robot movements, which is vital for sim-to-real transfer (jerky motions are often physically infeasible or damaging in reality) and for more acceptable human-robot interaction.
- Enhanced RL Exploration: The diffusion prior provides dense, meaningful guidance to RL, transforming sparse-reward exploration into a more tractable problem. This hybrid approach demonstrates how generative models can intelligently pre-condition or shape the learning landscape for RL agents.
- Modular and Adaptable: The two-stage design keeps the system modular. The motion prior can be improved independently, and the RL policy can be adapted to specific robot morphologies or task requirements. The sim-to-real adaptations also show a practical approach to bridging the reality gap, even with current perception limitations.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Fidelity of Retargeting for Complex Tasks: While PyRoki is used, the retargeting from human SMPL to the Unitree G1 might not perfectly preserve all dynamic characteristics or interaction nuances. The Pick task's FID discrepancy and the need for IK-based optimization for Open Drawer/Door suggest that the diffusion prior (trained on human data) may sometimes struggle to produce directly optimal or transferable motions when robot kinematics or environment constraints differ significantly from typical human scenarios.
- Heuristic Filtering and Refinements: The reliance on heuristic-based filtering and trajectory refinements (e.g., for collision avoidance, disabling non-functional arms, and feet slipping in sim2real) indicates that the raw output of the diffusion prior is not always dynamically feasible or perfectly suitable for direct RL tracking. While practical, this suggests room for improvement in the motion prior itself or in its integration with physics-based constraints. The authors acknowledge that more data might eliminate the need for such filtering.
- Generalization of OmniControl for Out-of-Distribution Tasks: For tasks like Open Drawer/Door, DreamControl falls back to IK-based optimization rather than pure diffusion generation. This highlights a limitation: while OmniControl is "zero-shot" for many tasks, it may not fully generalize to actions poorly represented in its training data or to tasks requiring very specific interaction dynamics. Improving the diffusion model's understanding of novel object interactions is crucial.
- Perception Dependency and Latency: Sim-to-real deployment is currently constrained by perception bottlenecks (e.g., OWLv2 latency and static object estimates), requiring the robot to remain still during interactive tasks. This is a practical limitation for fully dynamic loco-manipulation in the real world. Integrating end-to-end vision-based policies (as suggested for future work) is necessary to unlock DreamControl's full potential in truly unstructured environments.
- Hand Control Granularity: Restricting hand control to discrete open/closed states simplifies the problem. Dexterous manipulation would require more granular control of individual finger joints, which would significantly increase the action space and complexity.

Overall, DreamControl makes a substantial contribution by offering a robust and natural way to imbue humanoids with complex skills. Its modularity and strong performance in both simulation and real-world scenarios position it as a foundational work for future advancements in general-purpose humanoid robotics. The identified limitations also provide clear, actionable directions for continued research, particularly at the intersection of generative models, reinforcement learning, and real-world robot perception.