Paper status: completed

DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion

Published: 09/18/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DreamControl introduces a novel method for learning whole-body skills in humanoid robots by combining diffusion models and reinforcement learning, enabling complex tasks and facilitating sim-to-real transfer.

Abstract

We introduce DreamControl, a novel methodology for learning autonomous whole-body humanoid skills. DreamControl leverages the strengths of diffusion models and Reinforcement Learning (RL): our core innovation is the use of a diffusion prior trained on human motion data, which subsequently guides an RL policy in simulation to complete specific tasks of interest (e.g., opening a drawer or picking up an object). We demonstrate that this human motion-informed prior allows RL to discover solutions unattainable by direct RL, and that diffusion models inherently promote natural looking motions, aiding in sim-to-real transfer. We validate DreamControl's effectiveness on a Unitree G1 robot across a diverse set of challenging tasks involving simultaneous lower and upper body control and object interaction. Project website at https://genrobo.github.io/DreamControl/

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion. It focuses on developing a novel method for learning autonomous whole-body skills for humanoid robots to interact with their environment.

1.2. Authors

The authors of the paper are Ovij Kalaria, Sudarshan Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, S. Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, and Jonathan Huang. Their affiliations are:

  • 1: Microsoft
  • 2: Georgia Institute of Technology
  • 3: University of California, Berkeley

1.3. Journal/Conference

The paper is listed as an arXiv preprint (arXiv:2509.14353v3). This indicates it had not yet undergone formal peer review and publication in a journal or conference proceedings at the time of release, though such preprints are often precursors to publication in reputable venues. The Published at (UTC) timestamp is 2025-09-17T18:35:43.000Z, which corresponds to the most recent revision of the preprint.

1.4. Publication Year

The publication year, based on the Published at (UTC) timestamp, is 2025.

1.5. Abstract

DreamControl is presented as a novel methodology for learning autonomous whole-body humanoid skills. The core innovation lies in combining diffusion models with Reinforcement Learning (RL). Specifically, a diffusion prior is trained on human motion data, which then guides an RL policy in simulation to accomplish specific tasks like opening a drawer or picking up an object. The authors demonstrate that this human motion-informed prior enables RL to discover solutions that are otherwise unattainable by direct RL approaches. Furthermore, diffusion models are shown to inherently promote natural-looking motions, which is beneficial for sim-to-real transfer. The effectiveness of DreamControl is validated on a Unitree G1 robot across a diverse set of challenging tasks requiring simultaneous lower and upper body control and object interaction.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enabling humanoid robots to perform autonomous whole-body skills for scene interaction. This means moving beyond impressive but pre-programmed demonstrations (like dancing or kung-fu) to become truly universal assistants capable of dynamically interacting with their environment—e.g., picking up objects, opening drawers, or pressing buttons, utilizing their full humanoid form factor's mobility and range of motion.

This problem is crucial for the advancement of robotics because it unlocks a wider range of applications for humanoids in unstructured environments, transforming them from exhibition pieces to practical tools. However, several significant challenges exist in prior research:

  • Whole-body Loco-Manipulation Complexity: These tasks require multiple timescales of control. Short-horizon control is needed for maintaining stability and balance (due to high degrees of freedom (DoF), underactuation, and high center of mass). Concurrently, long-horizon motion planning (spanning tens of seconds) is required for intricate tasks like grasping distant objects or coordinating both arms for bimanual manipulation.

  • RL Exploration Problem: Directly applying Reinforcement Learning (RL) to long-horizon, high-dimensional bimanual manipulation tasks often fails or leads to unnatural behaviors that do not generalize well to the real world. This is due to the vast action space and sparse rewards.

  • Data Scarcity: Modern deep learning approaches, especially those using imitation learning (IL) or diffusion policies, rely heavily on large datasets. However, teleoperation data for whole-body humanoid control is extremely labor-intensive and costly to collect, leading to a "100,000-year data gap" in robotics. Existing solutions often simplify the problem by fixing the lower body, separating upper and lower body control, or focusing solely on computer graphics applications.

    The paper's innovative idea, or entry point, is to leverage the abundance of human motion data (which is far more accessible than robot teleoperation data) by training a diffusion prior on it. This prior then guides an RL policy in simulation, implicitly providing human-inspired motion plans to overcome the RL exploration problem and promote natural-looking motions, thereby aiding sim-to-real transfer.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Novel Two-Stage Methodology (DreamControl): Introducing a new approach that synergistically combines diffusion models and Reinforcement Learning for learning autonomous whole-body humanoid skills.

  • Human Motion-Informed Diffusion Prior: Utilizing a diffusion prior (specifically OmniControl [12]) trained on human motion data to generate human-like motion plans that can be flexibly conditioned by text and spatiotemporal guidance. This significantly reduces reliance on expensive robot teleoperation data.

  • Enhanced RL Exploration and Solution Discovery: Demonstrating that this human motion-informed prior provides RL with dense guidance, enabling it to discover robust and natural-looking solutions for challenging loco-manipulation tasks that are otherwise unattainable by direct RL or sparse-reward-only methods.

  • Promotion of Natural Motions and Sim-to-Real Transfer: Showing that the diffusion prior inherently promotes natural and smoother robot movements, which aids in bridging the sim-to-real gap and results in less "robotic" behaviors.

  • Validation on Real Humanoid Hardware: Successfully validating DreamControl on a Unitree G1 robot across a diverse set of 11 challenging tasks, including object interaction, bimanual manipulation, and combined lower and upper body control, with successful sim-to-real deployment.

  • Comprehensive Evaluation: Providing extensive simulation results with various baselines, human-ness comparisons using metrics like Fréchet Inception Distance (FID) and jerk, and user studies to quantitatively and qualitatively demonstrate the superiority of DreamControl.

    The key conclusions and findings are that by pre-generating human-like motion plans and using them to guide RL, DreamControl can effectively tackle the long-horizon, high-dimensional challenges of whole-body humanoid control. This approach mitigates the data scarcity problem by leveraging abundant human data, leads to more natural and generalizable motions, and ultimately enables robust skill transfer to real robots. The findings solve the problem of enabling humanoids to perform complex, interactive tasks autonomously and naturally, which was a significant gap in prior RL-based or imitation learning-based methods struggling with exploration or data requirements.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand DreamControl, a reader should be familiar with the following fundamental concepts:

  • Humanoid Robotics:
    • Degrees of Freedom (DoF): The number of independent parameters that define the configuration or state of a robot. A Unitree G1 robot has 27-DoF for its body joints and 6-DoF for each of its dexterous hands. More DoF means greater dexterity but also increased control complexity.
    • Whole-Body Loco-Manipulation: Refers to tasks where a humanoid robot needs to coordinate both its locomotion (movement of the base and legs) and manipulation (movement of arms and hands) simultaneously to interact with the environment. Examples include bending down to pick an object while maintaining balance, or bracing against a surface to open a heavy door.
    • Sim-to-Real Transfer (Sim2Real): The process of training a robot control policy in a simulated environment and then deploying it successfully on a real physical robot. This is challenging due to discrepancies between simulation (e.g., simplified physics, perfect sensors) and reality (e.g., real-world friction, sensor noise, latency). Techniques like domain randomization and asymmetric actor-critic are often used to bridge this gap.
  • Reinforcement Learning (RL):
    • Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, observing the state of the environment, taking an action, and receiving a reward and a new state.
    • Policy: The learned strategy or function that maps observed states of the environment to actions that the agent should take. In deep RL, this is often represented by a neural network.
    • Reward: A scalar signal given by the environment to the agent, indicating the desirability of its actions. Dense rewards provide continuous feedback, while sparse rewards are given only upon task completion or specific events, making exploration harder.
    • Exploration-Exploitation Dilemma: A fundamental challenge in RL where the agent must balance exploring new, potentially better actions with exploiting known good actions to maximize immediate reward. Long-horizon and high-dimensional problems exacerbate the exploration challenge.
    • Proximal Policy Optimization (PPO) [70]: A popular on-policy RL algorithm that aims to stabilize and improve the training process of policies. It updates the policy by taking multiple small gradient steps to avoid large changes that could destabilize training, typically using a clipped objective function.
  • Generative Models:
    • Generative Models: A class of machine learning models that learn to generate new data samples that resemble the training data. Examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models.
    • Diffusion Models: A type of generative model that learns to reverse a gradual diffusion process (adding noise to data) to generate data from noise. They have shown great success in generating high-quality images, videos, and, more recently, motion.
    • Diffusion Prior: In DreamControl, this refers to a diffusion model (specifically OmniControl) trained on human motion data. It learns the distribution of natural human movements and can generate diverse motion sequences that serve as "priors" or reference trajectories for the RL policy.
    • Text-to-Motion Generation: The ability of a diffusion model to generate motion sequences based on textual prompts (e.g., "pick up the bottle").
    • Spatiotemporal Guidance: The ability to condition motion generation not just on text, but also on specific spatial positions of joints at particular times (e.g., "wrist at point X at time T"). This allows for fine-grained control over the generated motion to interact with specific objects or scene elements.
  • Imitation Learning (IL): A machine learning paradigm where an agent learns a policy by observing demonstrations of a task (e.g., human teleoperation data). The goal is to imitate the demonstrated behavior. While effective, it heavily relies on the availability and quality of demonstration data, which is a bottleneck for whole-body humanoid control.
  • SMPL Parameterization [66]: A widely used statistical 3D model of the human body that can represent various body shapes and poses with a compact set of parameters. Human motion datasets are often represented using SMPL parameters.
  • Kinematic Optimization / Inverse Kinematics (IK):
    • Kinematics: The study of motion without considering the forces that cause it. Forward kinematics calculates the position and orientation of a robot's end-effectors (e.g., hands, feet) given the joint angles.
    • Inverse Kinematics (IK): The inverse problem: calculating the joint angles required for a robot's end-effectors to reach a desired position and orientation. IK is often solved via optimization.
    • Retargeting: The process of transferring a motion from one character or body (e.g., a human represented by SMPL) to another (e.g., a humanoid robot with a different kinematic structure). This typically involves solving an IK or optimization problem to map key points or joint angles.

3.2. Previous Works

The paper organizes related work into three main strands: robot manipulation, RL controllers for legged robots, and character animation and motion models.

3.2.1. Recent Advances in Manipulation

  • Imitation Learning (IL) with Diffusion Policies: Many modern deep learning approaches to robot manipulation are based on imitation learning [13]-[15]. Recent advancements leverage diffusion models [16]-[18] or flow matching [18] for policy parameterization [9], [10], [19]-[23]. These approaches aim to scale like Large Language Models (LLMs) but face the challenge of data scarcity for robot trajectories, which are costly to collect via teleoperation rigs [11].
    • Diffusion Policy [9]: A key influence, Diffusion Policy learns visuomotor policies by modeling the distribution of actions as a diffusion process. This allows it to generate long, consistent temporal data and handle multimodal action distributions. The core idea is to treat policy learning as a denoising process, where noisy actions are iteratively refined towards a desired action distribution.
  • On-Policy RL Approaches: Some methods use on-policy RL in simulation for scalability [4], [24], though sim-to-real transfer remains challenging.
    • Lin et al. [4]: Demonstrated robust bimanual manipulation skills on a humanoid robot using on-policy RL, but did not address whole-body skills and primarily focused on behavior cloning from teleoperated trajectories. DreamControl differs by using a diffusion prior over human motion to inform RL, reducing the need for extensive reward engineering.

3.2.2. RL Controllers for Legged Robots

  • Legged Locomotion: Deep RL has significantly advanced legged robot control, starting with robust locomotion policies for quadrupeds [25]-[27] and extending to bipedal and humanoid form factors [28]-[34].
  • Whole-Body Motion Tracking & Teleoperation: More recent works focus on tracking human teleoperator motions [1], [2], [35]-[42], including agile and extreme motions like KungFuBot [3] and ASAP [43]. These approaches enable a robot to mimic a human's movements.
  • Autonomous Skill Execution: Beyond tracking, enabling fully autonomous execution of specific tasks (e.g., kicking, sitting) is a major challenge [5], [43], [45]-[50].
    • HumanPlus [48] and AMO [5]: Demonstrated whole-body autonomous task execution but required teleoperated trajectories for imitation learning.
    • R2S2 [50]: Focused on training a limited set of "primitive" skills and ensembling them, whereas DreamControl provides a recipe for training such primitive skills.
    • BeyondMimic [49]: Also leverages guided diffusion and RL, but its diffusion guidance is "coarse" and doesn't account for object interaction or long-range planning, making it orthogonal to DreamControl's fine-grained guidance.

3.2.3. Character Animation and Motion Models

  • Physically Realistic Character Animation: A related literature exists on modeling humanoid movement in physically realistic character animation [8], [51]-[58]. Solving problems in this simplified synthetic setting, with access to privileged simulation states and no sim-to-real distribution shifts, has served as a stepping stone.
  • Statistical Priors over Human Motion: This field has a rich history [59]-[61] and now leverages generative AI, including diffusion models and autoregressive transformers [6], [7], [12], [62], [63].
    • OmniGrasp [8]: Leveraged a human motion prior (PULSE [55]) in the form of a bottleneck VAE to directly predict actions, but it was noted as somewhat awkward to interpret as a prior on trajectories.
    • CloSd [7]: Generated motion plans via diffusion and used an RL-trained policy for execution in simulation. DreamControl extends this by employing richer, fine-grained guidance to handle a wider variety of tasks and addresses sim2real aspects (e.g., removing explicit dependence on reference trajectories during policy rollouts).
    • OmniControl [12]: This is the specific diffusion model DreamControl builds upon. OmniControl is a diffusion transformer that can be flexibly conditioned on both text (e.g., "open the drawer") and spatiotemporal guidance (e.g., enforcing a wrist position at a specific time). This allows for precise control over the generated human motions.

3.3. Technological Evolution

The field of humanoid robot control has evolved from focusing on basic locomotion and motion tracking (mimicking pre-defined or teleoperated movements) to tackling more complex autonomous interaction with diverse environments. Early efforts established robust legged locomotion for quadrupeds and bipeds using RL. Subsequently, teleoperation and motion tracking systems enabled humanoids to mimic human movements, but still relied on external input. The next frontier, whole-body loco-manipulation, demands autonomous planning and execution of tasks involving both movement and interaction. This paper fits into this evolution by addressing the data scarcity and exploration challenges of autonomous whole-body control. Generative AI (especially diffusion models), initially popular in computer graphics and character animation, is now being integrated into robotics to provide human-inspired motion priors, effectively bridging the gap between abundant human motion data and the specific needs of robot control. DreamControl marks a significant step towards general-purpose humanoid robots by enabling them to learn complex, natural, and autonomous skills without extensive robot-specific demonstrations.

3.4. Differentiation Analysis

Compared to the main methods in related work, DreamControl offers several core differences and innovations:

  • Data Source & Efficiency:
    • Prior Work (IL-based): Many whole-body autonomous task execution methods (e.g., HumanPlus [48], AMO [5]) rely on teleoperated trajectories for imitation learning. This is data-intensive, costly, and hard to scale.
    • DreamControl: Eliminates the need for robot teleoperation data for initial motion generation. Instead, it leverages the more abundant human motion data to train a diffusion prior. This makes the approach far more data-efficient for robots.
  • Motion Generation & Guidance:
    • Prior Work (Direct RL/Coarse Guidance): Direct RL often struggles with exploration and can produce unnatural or suboptimal motions [8]. Some diffusion-based approaches (e.g., BeyondMimic [49]) use coarse guidance for diffusion policies, which may not be sufficient for complex object interaction or long-range planning.
    • DreamControl: Utilizes OmniControl [12], a diffusion prior with fine-grained spatiotemporal guidance and text conditioning. This allows for precise control over generated trajectories, enabling them to seamlessly connect with specific environment objects (e.g., guiding a wrist to an object's location at a specific time). This rich guidance is crucial for solving interactive tasks.
  • Role of Motion Prior:
    • Prior Work (Tracking/IL): Many methods aim to directly track reference trajectories at runtime, or use them for imitation learning.
    • DreamControl: Uses the diffusion prior to generate reference trajectories during training which then implicitly guide the RL policy through the reward signal. The policy does not explicitly rely on these reference trajectories at test time, enabling fully autonomous task execution. This "Dream" aspect (the policy learns from, but doesn't strictly track, a dream motion) is a key conceptual difference.
  • Naturalness and Sim-to-Real:
    • Prior Work (Direct RL): Can lead to unnatural or jerky motions that generalize poorly to the real world.
    • DreamControl: By drawing from human motion data, the diffusion prior inherently promotes natural-looking, less robotic movements that typically avoid extreme joint configurations. This contributes significantly to sim-to-real transferability and human-robot interaction.
  • Scope of Skills:
    • Prior Work: Often simplifies tasks (e.g., fixed lower body, separate body part training) or focuses on specific aspects (e.g., locomotion, upper-body manipulation).

    • DreamControl: Tackles challenging whole-body loco-manipulation tasks requiring simultaneous lower and upper body control and object interaction, validating its versatility across a broad range of skills.

      In essence, DreamControl innovates by creating a scalable, data-efficient pipeline that leverages the strengths of generative models for natural motion planning and reinforcement learning for robust, task-specific control, without the heavy burden of robot demonstration data that plagues many imitation learning approaches.

4. Methodology

4.1. Principles

The core idea of DreamControl is a two-stage methodology that combines the strengths of diffusion models for human-inspired motion planning and Reinforcement Learning (RL) for robust task execution. The theoretical basis is that diffusion models, trained on abundant human motion data, can generate natural-looking, long-horizon kinematic trajectories that are difficult for RL to discover through exploration alone. These generated trajectories then serve as implicit guidance (through the reward signal) for an RL policy trained in simulation to learn to complete specific tasks while maintaining human-like motion quality. This approach addresses the challenges of RL exploration in high-dimensional spaces and the data scarcity for robot teleoperation.

The overall pipeline is summarized in Fig. 2 (Image 2 from the original paper), which depicts:

  1. Stage 1: Trajectory Generation from Human Motion Prior: A human motion prior (a diffusion model like OmniControl) takes text commands and spatiotemporal guidance to generate human-like motion trajectories. These are then retargeted to the robot's form factor and filtered/refined.

  2. Stage 2: RL with Reference Trajectory: An RL policy is trained in simulation using the generated reference trajectories (and synthesized scenes) to provide dense tracking rewards, alongside sparse task-specific rewards for task completion. This policy can then be adapted for real-world deployment.


The image is a schematic diagram illustrating the integration of the motion diffusion model and reinforcement learning within the DreamControl framework. The left side displays the process of generating reference trajectories, the middle part showcases the reinforcement learning policy based on reference trajectories, and the right side describes the relationship between the perception model and real-world deployment, involving RGB images and depth data.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Stage 1: Generating Reference Trajectory from a Human Motion Prior

This stage focuses on creating human-like motion plans without requiring expensive humanoid teleoperation data.

4.2.1.1. Leveraging Human Motion Data

  • Rationale: Human motion data is widely available (e.g., motion capture datasets, video datasets), allowing for learning high-quality priors for a multitude of tasks. Generating realistic human-like motions also facilitates sim-to-real transfer and enables more natural human-robot interaction.
  • Diffusion Transformer: A diffusion transformer (OmniControl [12]) is chosen due to its success in modeling human motion and robot manipulation trajectories, its favorable scaling properties with large datasets, and its robustness in low data regimes [65].

4.2.1.2. OmniControl for Trajectory Generation

  • OmniControl [12] is a motion diffusion model that generates trajectories conditioned on:
    • Text commands ($\lambda^{\mathrm{text}}$): Natural language descriptions of the task (e.g., "Pick up the bottle"). Refer to Table IX in the Appendix for specific prompts used for each task in simulation and Table X for real robot deployment.
    • Spatiotemporal guidance ($\lambda^{\mathrm{spatial}}$): Specifies that a joint or subset of joints should reach a prespecified spatial location at a prespecified time. This is crucial for linking the generated trajectory to object interaction points in the simulation. The spatial control signal is a tensor in $\mathbb{R}^{L \times S \times 3}$, where $L=196$ is the number of time-steps and $S=22$ is the number of SMPL joints. A spatial point is functional only if it is not $(0,0,0)$.
      • Example (Pick task): The guidance might stipulate a wrist position at a certain time to approach an object. The Appendix provides detailed spatial guidance definitions for each task (e.g., Pick, Precise Punch, Press Button, Jump, Sit, Bimanual Pick, Pick from Ground, Pick and Place). For instance, for the Pick task, a target point $p^{\mathrm{target}} = \{p^x \sim \mathcal{U}(1.0, 1.2),\ p^y \sim \mathcal{U}(-0.4, 0.4),\ p^z = 1.1\}$ is sampled. The spatial control signal for the wrist, $\lambda^{\mathrm{right\,wrist}}$, is set to $p^{\mathrm{target}}$ for a duration around the interaction time $t_g'$. Additional elbow targets are used to encourage a side grasp. (A minimal sketch of assembling such a spatial control tensor is shown after this list.)
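The sketch below illustrates how a spatial control tensor of this form could be assembled for the Pick task. It is a minimal sketch, assuming zero entries mark inactive control points as described above; the SMPL joint index, window width, and sampling ranges are illustrative rather than the paper's exact values.

```python
import numpy as np

# Minimal sketch: build an OmniControl-style spatial control tensor of shape
# (L, S, 3), where non-zero entries mark joints that must reach a 3D target
# at that time step. Joint index and window width are illustrative.
L, S = 196, 22               # time-steps, SMPL joints
RIGHT_WRIST = 21             # assumed SMPL joint index (illustrative)

def make_pick_guidance(t_g: int, window: int = 10, rng=np.random.default_rng()):
    """Sample a Pick-task target and place it on the right wrist around t_g."""
    p_target = np.array([
        rng.uniform(1.0, 1.2),   # p^x ~ U(1.0, 1.2)
        rng.uniform(-0.4, 0.4),  # p^y ~ U(-0.4, 0.4)
        1.1,                     # p^z fixed at platform height
    ])
    guidance = np.zeros((L, S, 3))           # (0,0,0) = inactive control point
    lo, hi = max(0, t_g - window), min(L, t_g + window)
    guidance[lo:hi, RIGHT_WRIST] = p_target  # hold the wrist near the target
    return guidance, p_target

guidance, p_target = make_pick_guidance(t_g=120)
print(guidance.shape, p_target)
```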

4.2.1.3. Post-Retargeting and Trajectory Filtering

  • Retargeting: OmniControl generates human trajectories using the SMPL parameterization [66]. These are retargeted to the Unitree G1 robot's form factor using PyRoki [67]. This involves solving an optimization problem that minimizes:
    • Relative keypoint positions.
    • Relative angles.
    • A scale factor to adjust for link length differences.
    • Additional residuals for physical plausibility: feet contact costs, self-collision costs, and foot orientation costs.
  • Trajectory Filtering: Some generated trajectories may not be dynamically feasible or might collide with the scene. Task-specific filtering mechanisms are applied based on heuristics (see Table XI):
    • Reject trajectories if the torso angle with the x-axis (measured by $\operatorname{arccos}(\mathrm{axis}_x^{\mathrm{torso},x})$) is larger than $\beta^{\mathrm{torso}}$. This prevents excessive bending or turning.
    • Reject if the pelvis height is below $\beta^{\mathrm{pelvis}}$. This prevents unnecessary squatting.
    • Reject if any body part collides with the scene.
    • The specific thresholds for $\beta^{\mathrm{torso}}$ and $\beta^{\mathrm{pelvis}}$ vary by task (e.g., Pick: $\pi/4$, 0.6; Precise Kick: $\pi/2$, 0.5). (A minimal filtering sketch follows this list.)
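The following is a minimal sketch of the heuristic trajectory filter described above. The per-task thresholds come from Table XI (the defaults below are the Pick values), and the collision check is a placeholder for whatever scene-geometry query is available.

```python
import numpy as np

# Minimal sketch of the task-specific trajectory filter.
def keep_trajectory(torso_x_axis, pelvis_height, collides_with_scene,
                    beta_torso=np.pi / 4, beta_pelvis=0.6):
    """Return True if the retargeted trajectory passes all heuristics.

    torso_x_axis:  (L, 3) world-frame x-axis of the torso over the trajectory
    pelvis_height: (L,)   pelvis height over the trajectory
    """
    torso_angle = np.arccos(np.clip(torso_x_axis[:, 0], -1.0, 1.0))
    if np.any(torso_angle > beta_torso):     # excessive bending or turning
        return False
    if np.any(pelvis_height < beta_pelvis):  # unnecessary squatting
        return False
    if collides_with_scene:                  # any body part hits the scene
        return False
    return True
```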

4.2.1.4. Trajectory Refinements

  • Initialization Alignment: Generated trajectories may not start from a consistent pose. To address this for RL training, a refinement step prepends $N^{\mathrm{init}} = 20$ frames: the first 10 frames are static at a fixed default joint pose and root pose, and the next 10 frames interpolate to the start pose ($\alpha_0$) of the generated trajectory. This results in $\alpha^{\mathrm{refined}}$ (216 frames total).
  • Non-functional Arm Disablement: To avoid unnecessary movement, non-functional arms are set to default joint angles (as per Table VI) for the entire trajectory duration. Table VII specifies which arm group ($G^{\mathrm{left\,arm}}$ or $G^{\mathrm{right\,arm}}$) is refined for each task (e.g., Pick refines $G^{\mathrm{left\,arm}}$).
  • Special Case for Pick Task: Due to the G1 robot being shorter than the SMPL model, Pick trajectories often involve the robot's hand going through the platform. An additional optimization problem is solved via gradient descent to minimally modify the reference trajectory to avoid right hand collision with the platform.
    • The objective function aims to preserve the smoothness of the motion while ensuring collision avoidance:

      $$q^{\mathrm{ref,*}} = \operatorname*{argmin} \sum_{t=\{\Delta t, \dots, (L-1)\Delta t\}} \left| \, \| p_t^{\mathrm{righthand,*}} - p_{t-1}^{\mathrm{righthand,*}} \|_2 - \| p_t^{\mathrm{righthand}} - p_{t-1}^{\mathrm{righthand}} \|_2 \, \right| \quad \mathrm{s.t.}\ d(p_t^{\mathrm{righthand,*}}) = 0$$

      Where:
      • $q^{\mathrm{ref,*}}$ represents the joint angles to be optimized.
      • $t$ is the time step.
      • $L$ is the trajectory length.
      • $p_t^{\mathrm{righthand}}$ is the position of the right hand in the original reference trajectory.
      • $p_t^{\mathrm{righthand,*}}$ is the position of the right hand in the modified trajectory.
      • $d(\cdot)$ maps 3D points to their distance from free space (so $d(p) = 0$ means the point lies in free space, i.e., is collision-free).
      • The objective minimizes the difference in step-wise hand movement between the original and modified trajectories, preserving motion smoothness.

4.2.1.5. Trajectory Representation

Each reference trajectory $\alpha_i$ is a sequence of target frames:

$$\alpha_i = [\alpha_{i,0}, \alpha_{i,\Delta t}, \ldots, \alpha_{i,(L-1)\Delta t}]$$

Where:

  • $\Delta t = 0.05\,\mathrm{s}$ is the time step.
  • $L = 196$ (or $L = 216$ after refinement) is the trajectory length (spanning 9.8 s).

Each frame $\alpha_t$ contains:

$$\alpha_t = \{ p_t^{\mathrm{ref,root}},\ \bar{\theta}_t^{\mathrm{ref,root}},\ q_t^{\mathrm{ref}},\ s_t^{\mathrm{ref,left}},\ s_t^{\mathrm{ref,right}} \}$$

Where:

  • $p_t^{\mathrm{ref,root}} \in \mathbb{R}^3$ is the position of the root (pelvis).
  • $\bar{\theta}_t^{\mathrm{ref,root}} \in \mathbb{R}^4$ is the orientation of the root (represented as a quaternion).
  • $q_t^{\mathrm{ref}} \in \mathbb{R}^{27}$ are the target joint angles for the robot's body.
  • $s_t^{\mathrm{ref,left}}, s_t^{\mathrm{ref,right}} \in \{0, 1\}$ are the left and right hand states (0 for open, 1 for closed), manually labeled for each task based on $t_g$. Table VII details open/closed hand states for each task.

A critical parameter is $t_g$, the time at which the task-specific goal interaction occurs (e.g., object pickup time). This $t_g$ is used for scene synthesis. (A minimal data-structure sketch follows.)
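A minimal data-structure sketch of one reference frame and one trajectory, following the definitions above. Field names are illustrative; the paper stores the same quantities.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RefFrame:
    p_root: np.ndarray     # (3,) pelvis position in the world frame
    quat_root: np.ndarray  # (4,) pelvis orientation as a quaternion
    q_ref: np.ndarray      # (27,) target body joint angles
    s_left: int            # left hand state: 0 = open, 1 = closed
    s_right: int           # right hand state: 0 = open, 1 = closed

@dataclass
class RefTrajectory:
    frames: list[RefFrame]  # length L (196, or 216 after refinement)
    dt: float = 0.05        # seconds between frames
    t_g: int | None = None  # frame index of the goal interaction (e.g., pickup)

    @property
    def duration(self) -> float:
        return len(self.frames) * self.dt
```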

4.2.1.6. Out-of-Distribution Tasks (IK-based Optimization)

For novel tasks not well-represented in the OmniControl training data (e.g., pulling drawers), DreamControl employs a workaround:

  • Instead of OmniControl, it generates a base trajectory (e.g., standing idle or squatting) and then uses IK-based optimization to align the wrist to a target trajectory defined for the specific task.
  • For Open Drawer and Open Door, target trajectories for the wrist ($T_i^{\mathrm{wrist}}$) are defined with piecewise functions involving linear and quadratic interpolation, and with circular motion (for door opening).
  • An optimization problem is solved via gradient descent to align the wrist to $T_i^{\mathrm{wrist}}$ by adjusting the joint angles $q_i^{\mathrm{ref}}$:

    $$L = \sum_{i=0}^{L-1} (p_i^{\mathrm{wrist}} - w_i)^2, \qquad p_i^{\mathrm{wrist}} = fk^{\mathrm{wrist}}(p_i^{\mathrm{ref,root}}, \theta_i^{\mathrm{ref,root}}, q_i^{\mathrm{ref,*}}), \qquad q_i^{\mathrm{ref},j+1} = q_i^{\mathrm{ref},j} - \lambda^{\mathrm{lr}} \frac{\partial L}{\partial q_i^{\mathrm{ref,*}}}$$

    Where:
    • $L$ is the loss function.
    • $p_i^{\mathrm{wrist}}$ is the wrist position computed via forward kinematics.
    • $w_i$ is the target wrist position at step $i$ (taken from $T_i^{\mathrm{wrist}}$).
    • $fk^{\mathrm{wrist}}$ is the forward kinematics function for the wrist.
    • $q_i^{\mathrm{ref,*}}$ are the joint angles to be optimized.
    • $\lambda^{\mathrm{lr}}$ is the learning rate for gradient descent. (A gradient-descent sketch follows this list.)
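Below is a minimal sketch of this gradient-descent IK alignment. It assumes a differentiable forward-kinematics function (here the placeholder `fk_wrist`) is available, and uses plain SGD as the descent rule; step counts and learning rate are illustrative.

```python
import torch

# Minimal sketch of gradient-descent IK: optimize q_ref so the wrist
# (via forward kinematics) follows the target trajectory wrist_targets.
def align_wrist(q_ref, root_pos, root_quat, wrist_targets, fk_wrist,
                lr=1e-2, steps=200):
    """q_ref: (L, 27) initial joint angles; wrist_targets: (L, 3) targets w_i;
    fk_wrist: differentiable function mapping (root_pos, root_quat, q) -> (L, 3)."""
    q = q_ref.clone().requires_grad_(True)
    opt = torch.optim.SGD([q], lr=lr)          # plain gradient descent
    for _ in range(steps):
        opt.zero_grad()
        p_wrist = fk_wrist(root_pos, root_quat, q)   # (L, 3) wrist positions
        loss = ((p_wrist - wrist_targets) ** 2).sum()
        loss.backward()
        opt.step()
    return q.detach()
```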

4.2.2. Stage 2: RL with Reference Trajectory

Once the reference trajectories are generated, an RL policy is trained in simulation to follow these trajectories while completing the task.

4.2.2.1. Scene Synthesis

  • For each generated kinematic trajectory, a corresponding scene is synthesized in simulation.
  • Given the task-specific interaction time $t_g$ from Stage 1, the object of interest (e.g., pick object, button) is placed at a specific location relative to the robot's interacting body part (e.g., wrist):

    $$\mathbf{t}^{\mathrm{o,world}} = \mathbf{t}_{t_g}^{\mathrm{b,world}} + R_{t_g}^{\mathrm{b,world}} \mathbf{t}^{\mathrm{o,b}}, \qquad R^{\mathrm{o,world}} = R_{t_g}^{\mathrm{b,world}} R^{\mathrm{o,b}}$$

    Where:
    • $(\mathbf{t}^{\mathrm{o,world}}, R^{\mathrm{o,world}})$ is the pose (position and orientation) of the object in the world frame.
    • $(\mathbf{t}_{t_g}^{\mathrm{b,world}}, R_{t_g}^{\mathrm{b,world}})$ is the pose of the robot body-part link (e.g., the right wrist link for the Pick task) in the world frame at time $t_g$.
    • $(\mathbf{t}^{\mathrm{o,b}}, R^{\mathrm{o,b}})$ is the constant offset of the object relative to the robot body-part link where the object should be placed.
  • Randomization: To promote robustness, the timestamp $t_g$, target positions, object mass, and friction are randomized within defined ranges (see Table XII in the Appendix). (A minimal placement sketch follows this list.)
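The sketch below shows the object-placement rule: the object pose in the world frame is composed from the body-part pose at interaction time $t_g$ and a fixed object-to-body offset. The specific offset values and randomization range are illustrative, not the paper's exact Table XII settings.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Minimal sketch of the scene-synthesis placement rule above.
def place_object(t_b_world, R_b_world, t_o_b, R_o_b, rng=np.random.default_rng()):
    """t_b_world: (3,), R_b_world: (3,3) body-part pose at t_g;
    t_o_b: (3,), R_o_b: (3,3) constant object offset relative to the body part."""
    t_o_world = t_b_world + R_b_world @ t_o_b
    R_o_world = R_b_world @ R_o_b
    # Small random perturbation to promote robustness (range is illustrative)
    t_o_world = t_o_world + rng.uniform(-0.02, 0.02, size=3)
    return t_o_world, R_o_world

# Example: wrist pose at t_g with an object 5 cm in front of the palm (assumed)
t_o_world, R_o_world = place_object(
    t_b_world=np.array([0.5, -0.2, 1.1]),
    R_b_world=R.from_euler("z", 30, degrees=True).as_matrix(),
    t_o_b=np.array([0.05, 0.0, 0.0]),
    R_o_b=np.eye(3),
)
```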

4.2.2.2. Action Space

  • The simulated robot is a 27-DoF Unitree G1 equipped with two 7-DoF DEX 3-1 hands (sim) or Inspire hands (real).
  • Hand control is restricted to discrete open/closed configurations per-task (e.g., extending the right index finger for button-press). Table VII details the specific joint angles for open and closed hand states for each task.
  • The action space $a_t \in \mathbb{R}^{29}$ consists of:
    • $a_t^{\mathrm{body}} \in \mathbb{R}^{27}$: Target joint positions for the body.
    • $a_t^{\mathrm{left}}, a_t^{\mathrm{right}} \in \mathbb{R}$: Scalar controls for the left and right hands (negative for open, positive for closed).
  • PD Control: The target joint angles are converted to joint torques $\tau_t$ using a Proportional-Derivative (PD) controller:

    $$\tau_t = k_p (q_t^{\mathrm{commands}} - q_t^{\mathrm{robot}}) - k_d \dot{q}_t^{\mathrm{robot}}$$

    Where:
    • $\tau_t$ is the torque at time $t$.
    • $k_p$ and $k_d$ are the proportional and derivative gains, respectively.
    • $q_t^{\mathrm{commands}}$ are the target joint angles from the policy.
    • $q_t^{\mathrm{robot}}$ are the current joint angles of the robot.
    • $\dot{q}_t^{\mathrm{robot}}$ are the current joint velocities of the robot.
    • Table VI in the Appendix lists the default angles, $k_p$, and $k_d$ gains for each joint. (A minimal PD sketch follows this list.)
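A minimal sketch of the PD torque law above. The gains are per-joint vectors given in Table VI of the appendix; the values used here are placeholders.

```python
import numpy as np

# Minimal sketch of the per-joint PD controller: tau = kp*(q_cmd - q) - kd*qd.
def pd_torque(q_cmd, q_robot, qd_robot, kp, kd):
    """All arguments are length-27 arrays (one entry per body joint)."""
    return kp * (q_cmd - q_robot) - kd * qd_robot

n_joints = 27
kp = np.full(n_joints, 100.0)   # placeholder proportional gains (see Table VI)
kd = np.full(n_joints, 2.0)     # placeholder derivative gains (see Table VI)
tau = pd_torque(np.zeros(n_joints), np.zeros(n_joints), np.zeros(n_joints), kp, kd)
```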

4.2.2.3. Observations

The observation space for the privileged policy includes:

  • Proprioception:
    • Joint angles ($q_t^{\mathrm{robot}} \in \mathbb{R}^{27}$)
    • Joint velocities ($\dot{q}_t^{\mathrm{robot}} \in \mathbb{R}^{27}$)
    • Root linear velocity ($v_t^{\mathrm{root}} \in \mathbb{R}^3$)
    • Root angular velocity ($\omega_t^{\mathrm{root}} \in \mathbb{R}^3$)
    • Projected gravity in the root frame ($g_t \in \mathbb{R}^3$)
    • Previous action ($a_{t-1} \in \mathbb{R}^{29}$)
  • Target Trajectory Reference: A window of future reference frames $[\gamma_t, \gamma_{t+\Delta t^{\mathrm{obs}}}, \ldots, \gamma_{t+(K-1)\Delta t^{\mathrm{obs}}}]$, where:
    • $K$ is the number of future time steps.
    • $\Delta t^{\mathrm{obs}} = 0.1\,\mathrm{s}$ is the observation time step.
    • Each frame $\gamma_t$ consists of:
      • Target joint angles ($q_t^{\mathrm{ref}} \in \mathbb{R}^{27}$)
      • Target joint velocities ($\dot{q}_t^{\mathrm{ref}} \in \mathbb{R}^{27}$)
      • Relative root position ($p_t^{\mathrm{rel,root}}$) with respect to the robot's root.
      • Relative positions of 41 keypoints on the robot ($p_t^{\mathrm{rel,key}} \in \mathbb{R}^{3 \times 41}$) with respect to the robot's root. Table V lists the keypoints.
      • Target reference binary hand states ($s_t^{\mathrm{ref,left}}, s_t^{\mathrm{ref,right}}$).
    • Note: $\gamma_t$ contains information similar to $\alpha_t$ but transformed into the robot's frame and augmented for easier policy learning. This design (using keypoints relative to the robot's root and inputting relative root positions) aims for precise trajectory following, unlike methods that focus on velocity commands and may drift.
  • Privileged Task-Specific Observations: Relative pose of the object, mass, friction of the object (where relevant).
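The sketch below shows one way the privileged observation described above could be assembled: proprioception, a window of K future reference frames, and task-specific privileged quantities. The shapes follow the text; the concatenation order and dictionary keys are assumptions.

```python
import numpy as np

# Minimal sketch of assembling the privileged policy observation.
def build_observation(proprio, ref_window, privileged):
    """proprio: dict of arrays; ref_window: list of K reference-frame feature
    vectors; privileged: array of task-specific quantities (object pose, ...)."""
    parts = [
        proprio["q"],            # (27,) joint angles
        proprio["qd"],           # (27,) joint velocities
        proprio["v_root"],       # (3,) root linear velocity
        proprio["w_root"],       # (3,) root angular velocity
        proprio["gravity"],      # (3,) projected gravity in the root frame
        proprio["prev_action"],  # (29,) previous action
    ]
    parts += list(ref_window)    # K frames of reference features
    parts.append(privileged)
    return np.concatenate(parts)
```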

4.2.2.4. Rewards

The total reward $r_t$ is a sum of weighted terms:

$$r_t = \sum_{i=1}^{10} w_{r_i} r_{t,i} + w_{r^{\mathrm{task,sparse}}} r^{\mathrm{task,sparse}}$$

Where:

  • $w_{r_i}$ are the weights for the 10 reward terms (listed in Table I).
  • $r_{t,i}$ are the individual reward terms.
  • $w_{r^{\mathrm{task,sparse}}}$ is the weight for the task-specific sparse reward.
  • $r^{\mathrm{task,sparse}}$ is the task-specific sparse reward.

Table I: Reward terms for reference tracking and smooth policy enforcement.

Reward Term | Interpretation
$\Vert q^{\mathrm{robot}} - q^{\mathrm{ref}} \Vert_2$ | Penalizes deviation from reference joint angles
$\Vert p^{\mathrm{robot,key}} - p^{\mathrm{ref,key}} \Vert_2$ | Penalizes deviation from reference keypoints (3D positions in world frame)
$\Vert p^{\mathrm{robot,root}} - p^{\mathrm{ref,root}} \Vert_2$ | Penalizes deviation of the robot root from the reference root position
$\Vert \theta^{\mathrm{robot}} - \theta^{\mathrm{ref}} \Vert_2$ | Penalizes deviation in orientation between robot and reference
$\Vert (\sigma(a^{\mathrm{left}}), \sigma(a^{\mathrm{right}})) - (s^{\mathrm{ref,left}}, s^{\mathrm{ref,right}}) \Vert_2$ | Penalizes deviation of hand states from reference ($\sigma(x)=1$ if $x>0$, 0 if $x<0$)
$\Vert a_t \Vert_2^2 + \Vert q_t^{\mathrm{robot}} \Vert_2^2$ | Penalizes high torques and accelerations
$\Vert a_t - a_{t-1} \Vert_2$ | Penalizes high action-rate changes
$p^{\mathrm{foot,slide}}$ | Penalizes foot sliding while in ground contact
$n^{\mathrm{feet}}$ | Penalizes excessive foot-ground contacts (to discourage baby steps)
$o^{\mathrm{foot,orientation}}$ | Encourages feet to remain parallel to the ground (discourages heel sliding)

Additionally, task-specific sparse rewards are crucial for task completion. These are detailed in Table XIII in the Appendix and are typically binary signals indicating success (e.g., object above a certain height for Pick, object within a target distance for Precise Punch). The body-part link $b$ and interaction time $t_g$ are relevant here.

The reward weights ($w_{r_i}$ and $w_{r^{\mathrm{task,sparse}}}$) are task-specific and listed in Table XIV in the Appendix. (A minimal reward-combination sketch follows.)
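A minimal sketch of the total-reward computation above: a weighted sum of the ten dense tracking/regularization terms (Table I) plus a task-specific sparse reward. The weight values and the sign convention for the penalty terms are placeholders, not the paper's Table XIV settings.

```python
import numpy as np

# Minimal sketch of r_t = sum_i w_i * r_{t,i} + w_sparse * r_sparse.
def total_reward(dense_terms, sparse_term, dense_weights, sparse_weight):
    """dense_terms, dense_weights: length-10 arrays; sparse_term: 0/1 signal."""
    return float(np.dot(dense_weights, dense_terms) + sparse_weight * sparse_term)

r_t = total_reward(
    dense_terms=np.random.rand(10),   # stand-ins for the Table I terms
    sparse_term=1.0,                  # e.g., object lifted above its threshold
    dense_weights=np.full(10, -0.1),  # assumed: penalties enter with negative weights
    sparse_weight=10.0,
)
```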

4.2.2.5. Training

  • Environment: Training is conducted in IsaacLab [69], which uses IsaacSim simulation.
  • Algorithm: All policies are trained using Proximal Policy Optimization (PPO) [70].
  • Hardware: An NVIDIA RTX A6000 with 48 GB vRAM is used.
  • Parameters: For each task, training runs for 2000 iterations with 8192 parallel environments.
  • Model Architecture: A simple fully-connected MLP is used for both the actor (policy) and critic. The network architecture for both has hidden layers: (512, 256, 256). The same observations are used for both policy and critic, unlike asymmetric actor-critic setups.
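The following is a minimal sketch of the actor and critic networks described above: plain fully-connected MLPs with hidden layers (512, 256, 256). The observation/action sizes, the ELU activation, and the Gaussian-policy head are assumptions following common PPO practice, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(512, 256, 256)):
    """Fully-connected MLP with the hidden sizes stated in the text."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ELU()]  # ELU is an assumed activation
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 512, 29        # placeholder flattened observation/action sizes
actor = mlp(obs_dim, act_dim)     # outputs the mean of a Gaussian policy (assumed)
critic = mlp(obs_dim, 1)          # outputs the state-value estimate
```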

4.2.3. Sim2Real Deployment

To deploy on real hardware, the observation space is modified to remove simulator-privileged information:

  • The trajectory reference observation ($\gamma_t$) is removed from the policy input (though references remain available via the rewards).

  • The linear velocity of the root is removed.

  • Privileged scene-physics information (e.g., object mass, friction) is removed.

  • Time encoding $\langle t, \sin(2\pi t/T)\rangle$ (where $T$ is the episode length) is added. The reward function remains the same, but the reference trajectory for root position and yaw is transformed to avoid privileged inputs for the critic. The resulting policy depends only on the relative position of the object/goal.

  • Hardware Setup: Unitree G1 humanoid (27-DoF, waist lock mode allowing only yaw), Inspire dexterous hands (6-DoF, binary open/close).

  • Sensors: Onboard IMU (root orientation, gravity, angular velocity), RealSense D435i depth camera (neck-mounted, estimates 3D object/goal position relative to pelvis).

  • Object Position Estimation: OWLv2 [72] for 2D localization and depth data for 3D lifting with object-specific offsets. Due to OWLv2 latency, object estimates are fixed after the first frame.

  • Mitigation of Static Estimate Errors: The lower body is frozen during interactive tasks (except bimanual pick), and a penalty on root velocities is added to ensure the base remains static. This perception bottleneck is noted as a limitation, with vision-based policies via student-teacher distillation [4] suggested for future work.

  • Sim2Real Trajectory Generation/Refinement (Appendix D):

    • Specific prompts (Table X) are used, and generated trajectories are slowed down by a factor (e.g., 2.5 for Pick) to ensure safety and minimize sim-real gap.
    • Refinements for Pick and Precise Punch add a cost to bring trajectories close to goal points in the optimization problem to ensure smooth policies.
    • For Pick, the lower body is set to fixed default joint angles and root height adjusted to ensure a standing-still motion.
    • For Bimanual Pick and Squat, IK is used to eliminate foot slipping by adjusting the leg/foot joints, and the root roll/yaw are set to 0 for symmetry.

5. Experimental Setup

5.1. Datasets

The primary dataset mentioned for training the human motion prior (OmniControl) is HumanML3d [68].

  • Source: HumanML3d is a large-scale, high-quality dataset of 3D human motions coupled with natural language descriptions. It is a collection of various human activities.

  • Characteristics: It provides diverse human motions, allowing the diffusion model to learn a rich distribution of natural human movements. The pairing of motion with text descriptions is crucial for text-conditioned motion generation.

  • Domain: Human motion data.

  • Effectiveness: It is effective for validating the method's performance in generating human-like motions and leveraging text-based instructions. The paper uses HumanML3d to evaluate the "human-ness" of its generated motions via FID scores.

  • Data Sample Example: While the paper doesn't show a direct data sample from HumanML3d, the text prompts used for OmniControl (e.g., "a person walks to cup, grabs the cup from side and lifts up" for Pick task, as seen in Table IX) directly correspond to the kind of language conditioning HumanML3d enables.

    For the RL training phase, no external dataset is used. Instead, the RL policies are trained entirely within a simulated environment (IsaacSim via IsaacLab), where scenes are synthesized based on the reference trajectories generated in Stage 1. This highlights DreamControl's data-efficient nature in the RL phase, as it doesn't require robot-specific demonstrations.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

5.2.1. Success Rate (%)

  • Conceptual Definition: This metric quantifies the percentage of trials (episodes) in which the robot successfully completes the designated task according to predefined criteria. It directly measures the task-solving capability of the learned policy.
  • Mathematical Formula: $ \mathrm{Success\ Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  • Symbol Explanation:
    • $\text{Number of Successful Trials}$: The count of experimental runs where the robot achieved the task's success conditions.
    • $\text{Total Number of Trials}$: The total count of experimental runs performed.
    • $100\%$: A scaling factor to express the result as a percentage.
  • Task-Specific Success Criteria: These are defined in the Appendix (e.g., for Pick, the object must be above a certain height; for Press Button, the robot's end-effector must be within a threshold distance of the button for a specific duration).

5.2.2. Fréchet Inception Distance (FID)

  • Conceptual Definition: FID is a metric used to assess the quality of images or generated data. In the context of motion generation, it measures the "distance" between the distribution of generated motion trajectories and the distribution of real (ground-truth human) motion trajectories. A lower FID score indicates that the generated motions are more similar to real human motions, implying higher quality and naturalness. It captures both the visual fidelity and the diversity of the generated samples.
  • Mathematical Formula: $ \mathrm{FID} = \| \mu_x - \mu_g \|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right) $
  • Symbol Explanation:
    • $\mu_x$: The mean of the feature vectors for the real (ground-truth human) motion trajectories.
    • $\mu_g$: The mean of the feature vectors for the generated motion trajectories.
    • $\Sigma_x$: The covariance matrix of the feature vectors for the real motion trajectories.
    • $\Sigma_g$: The covariance matrix of the feature vectors for the generated motion trajectories.
    • $\|\cdot\|_2^2$: The squared Euclidean distance.
    • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
    • $(\cdot)^{1/2}$: The matrix square root.
    • Feature Vectors: In practice, motion trajectories are often passed through a pre-trained feature extractor (e.g., an Inception v3 network for images, or a specialized motion encoder for motions) to obtain compact representations. FID is calculated on these feature distributions.
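Below is a minimal sketch of the FID computation defined above, applied to feature vectors extracted from real and generated motions. The feature extractor itself (a pretrained motion encoder) is assumed to exist and is not shown.

```python
import numpy as np
from scipy import linalg

# Minimal sketch of FID between two sets of motion features.
def fid(feats_real, feats_gen):
    """feats_real, feats_gen: (N, D) arrays of motion feature vectors."""
    mu_x, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)
    covmean = covmean.real                      # discard tiny imaginary parts
    return float(np.sum((mu_x - mu_g) ** 2)
                 + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```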

5.2.3. Jerk

  • Conceptual Definition: Jerk is the third derivative of position with respect to time, or the rate of change of acceleration. In the context of robot motion, jerk measures the smoothness of movement. High jerk indicates abrupt, sudden changes in acceleration, which often results in jerky, unnatural, or uncomfortable motions. Conversely, lower jerk indicates smoother, more fluid movements.
  • Mathematical Formula: $ \mathrm{Jerk} = \frac{\sum_i \sum_t \sum_k | \dddot{p}_{i,t,k}^{\mathrm{key,global}} |}{N T K} $
  • Symbol Explanation:
    • $\dddot{p}_{i,t,k}^{\mathrm{key,global}}$: The third time derivative of the global position of the $k$-th keypoint in the $i$-th trajectory at time $t$. This represents the instantaneous jerk for a specific keypoint at a specific time.
    • $N$: The total number of trajectories evaluated (e.g., $N=1000$).
    • $T$: The total number of time-steps in each trajectory (e.g., $T=500$).
    • $K$: The total number of keypoints on the robot (e.g., $K=41$).
    • $\sum_i \sum_t \sum_k |\cdot|$: The sum of absolute jerk values across all trajectories, time steps, and keypoints.
    • $NTK$: Normalization factor, dividing by the total number of keypoint-time-trajectory samples.
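A minimal sketch of this metric using third finite differences of the global keypoint positions. The per-component absolute value and the normalization by the number of valid samples after differencing are implementation choices; the paper may instead use the norm of the jerk vector.

```python
import numpy as np

# Minimal sketch: mean absolute jerk over trajectories, time steps, keypoints.
def mean_abs_jerk(keypoints, dt=0.05):
    """keypoints: (N, T, K, 3) array of global keypoint positions."""
    jerk = np.diff(keypoints, n=3, axis=1) / dt**3   # (N, T-3, K, 3)
    return float(np.mean(np.abs(jerk)))

traj = np.random.rand(4, 500, 41, 3)   # toy stand-in: N=4, T=500, K=41 keypoints
print(mean_abs_jerk(traj))
```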

5.2.4. User Study

  • Conceptual Definition: A qualitative evaluation method where human participants are asked to assess the naturalness or human-likeness of generated robot motions. This provides direct human feedback, complementing quantitative metrics.
  • Methodology: 40 participants were shown side-by-side videos of trajectories from different methods (order randomized) and asked to select which looked more human-like.
  • Result: Reported as the average human preference percentage.

5.3. Baselines

The paper compares DreamControl against three baseline methods to evaluate its effectiveness:

  • 1. (a) TaskOnly:

    • Description: This baseline uses only task-specific (sparse) rewards. It provides no dense guidance to the RL policy during training, meaning the robot only receives a reward signal when it successfully completes the task or reaches specific task milestones.
    • Representativeness: This represents a naive RL approach where the agent must discover the entire complex whole-body locomotion and manipulation strategy through pure exploration based on a very infrequent success signal. It highlights the difficulty of long-horizon, high-dimensional RL problems with sparse rewards.
  • 2. (b) TaskOnly+:

    • Description: This baseline improves upon TaskOnly by incorporating task-specific rewards that include both sparse rewards (for task completion) and engineered dense rewards. These dense rewards are inspired by prior work like OmniGrasp [8] and are designed to encourage pre-grasp or pre-approach poses for the object or goal. For instance, a dense reward might guide the hand towards the object before actual grasping. The dense reward term is defined as $r_t^{\mathrm{dense}} = \|p_{t-\Delta t}^{\mathrm{b}} - p_{t_g}^{\mathrm{ref,b}}\|_2 - \|p_t^{\mathrm{b}} - p_{t_g}^{\mathrm{ref,b}}\|_2$, where $p_t^{\mathrm{b}}$ is the robot's body-part position and $p_{t_g}^{\mathrm{ref,b}}$ is the target body-part position at interaction time $t_g$.
    • Representativeness: This baseline reflects a common strategy in RL where reward shaping is used to facilitate learning. It demonstrates whether careful reward engineering alone is sufficient to solve whole-body loco-manipulation tasks and generate natural motions.
  • 3. (c) TrackingOnly:

    • Description: This baseline focuses exclusively on tracking rewards. The RL policy is rewarded primarily for accurately following the reference trajectories generated by Stage 1 (the diffusion prior). It aims to mimic the human-like motions without explicit task-specific rewards.

    • Representativeness: This baseline isolates the contribution of the diffusion prior in generating natural, dynamically feasible motions. It assesses how well the robot can track human-inspired movements when not explicitly optimized for task completion. It's akin to motion tracking approaches but applied to generated rather than externally provided human motions.

      DreamControl (Ours) combines the strengths of TrackingOnly (using tracking rewards from the diffusion prior) with task-specific sparse rewards (similar to TaskOnly's sparse component) to achieve both natural motion and task completion.

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results strongly validate the effectiveness of DreamControl across a diverse set of 11 challenging tasks. The method consistently outperforms all baselines in terms of task success rates and generates significantly more human-like and smoother motions.

6.1.1. Simulation Success Rates

The following are the results from Table II of the original paper:

Task | (a) TaskOnly | (b) TaskOnly+ | (c) TrackingOnly | Ours
Pick | 0 | 15.1 | 87.5 | 95.4
Bimanual Pick | 0 | 31.0 | 100 | 100
Pick from Ground (Side Grasp) | 0 | 0 | 99.4 | 100
Pick from Ground (Top Grasp) | 0 | 0 | 100 | 100
Press Button | 0 | 99.8 | 99.1 | 99.3
Open Drawer | 0 | 24.5 | 100 | 100
Open Door | 0 | 15.4 | 100 | 100
Precise Punch | 0 | 100 | 99.4 | 99.7
Precise Kick | 0 | 97.6 | 96.1 | 98.6
Jump | 0 | 0 | 100 | 100
Sit | 0 | 100 | 100 | 100

Analysis:

  • (a) TaskOnly (0% success across all tasks): This baseline, relying solely on sparse task-specific rewards, consistently fails. This highlights the severe RL exploration problem in high-dimensional whole-body control tasks without any dense guidance. The robot cannot discover meaningful motions to complete tasks by chance.
  • (b) TaskOnly+ (Variable success, often low): Adding engineered dense rewards improves performance on simpler tasks like Press Button (99.8%) and Precise Punch (100%). However, it still fails on tasks requiring complex coordinated whole-body motion, such as Pick from Ground (0%), Jump (0%), and Bimanual Pick (31%). For instance, in Jump, the robot might stretch its knees but cannot discover the full sequence of crouching and springing due to insufficient guidance from simple pelvis-target rewards. This demonstrates that while reward engineering helps, it is often insufficient for intricate loco-manipulation and can be very challenging to design effectively for complex, natural motions.
  • (c) TrackingOnly (High success, but struggles with fine-grained interaction): This baseline performs significantly better overall, achieving 100% success on many tasks like Bimanual Pick, Pick from Ground, Open Drawer, Open Door, Jump, and Sit. This indicates that the human motion prior effectively generates dynamically feasible and goal-oriented motions. However, TrackingOnly struggles with fine-grained interactive tasks like Pick (87.5%), where precise object interaction is critical and sparse rewards (absent here) could make a difference.
  • DreamControl (Ours) (Highest overall performance): DreamControl achieves the best results on 9 out of 11 tasks, matching the best on the other 2. It boasts near-perfect or perfect success rates on most tasks (e.g., 95.4% for Pick, 100% for Bimanual Pick, Pick from Ground, Open Drawer, Open Door, Jump, Sit). By combining the dense guidance from tracking rewards with task-specific sparse rewards, DreamControl is able to learn robust policies that not only follow natural motion plans but also reliably complete the required interactions. This synergistic approach overcomes the limitations of baselines, demonstrating the power of a human motion-informed prior guiding RL.

6.1.2. Human-ness Comparison

The evaluation of human-ness (naturalness and smoothness of motion) provides further insight into the quality of DreamControl's trajectories.

The following are the results from Table III of the original paper:

Task | Method | FID ↓ | Jerk ↓ | User Study ↑
Pick | TaskOnly+ | 0.240 | 211.2 | 15.0%
Pick | Ours | 0.320 | 147.5 | 85.0%
Press Button | TaskOnly+ | 1.220 | 235.7 | 17.25%
Press Button | Ours | 0.375 | 161.9 | 82.75%
Precise Punch | TaskOnly+ | 0.417 | 229.9 | 7.5%
Precise Punch | Ours | 0.084 | 199.8 | 92.5%
Precise Kick | TaskOnly+ | 0.522 | 360.9 | 17.5%
Precise Kick | Ours | 0.161 | 252.5 | 82.5%
Jump | TaskOnly+ | 1.216 | 236.4 | 5.0%
Jump | Ours | 0.208 | 148.5 | 95.0%

Analysis:

  • Fréchet Inception Distance (FID): DreamControl consistently achieves significantly lower FID values than TaskOnly+ across almost all tasks. A lower FID indicates a closer alignment with the distribution of human motions, confirming that DreamControl generates more natural-looking trajectories.
    • Pick Task Exception: The Pick task is an exception, where TaskOnly+ shows a lower FID (0.240 vs. 0.320). The authors conjecture this is due to a domain gap: human demonstrations in HumanML3D typically involve waist-level picking, while the shorter G1 robot often performs shoulder-level picks. This structural difference makes the G1's motion inherently less comparable to the human dataset for this specific task, even if the motion is natural for the robot.
  • Jerk: DreamControl consistently produces lower average absolute jerk values compared to TaskOnly+ across all tasks. This quantitatively confirms that DreamControl's motions are smoother and more fluid, lacking the abrupt accelerations that characterize unnatural or "robotic" movements.
  • User Study: The user study results provide strong qualitative validation. Participants overwhelmingly preferred DreamControl's trajectories, selecting them as more human-like in all tasks (e.g., 85% for Pick, 95% for Jump). This subjective evaluation aligns with the jerk metric and generally with FID, reinforcing the naturalness benefit.
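
To make the two quantitative metrics concrete, the sketch below computes average absolute jerk by finite differences and the Fréchet distance between Gaussians fitted to motion features. It assumes joint trajectories sampled at a fixed control rate; the feature extractor the authors use for FID is not reproduced here, so `feats_a`/`feats_b` stand in for whatever motion embeddings are chosen.

```python
import numpy as np
from scipy.linalg import sqrtm

def mean_abs_jerk(q: np.ndarray, dt: float) -> float:
    """Average absolute jerk of a joint trajectory.

    q: (T, D) joint positions sampled every dt seconds.
    Jerk is approximated by the third finite difference divided by dt**3.
    """
    jerk = np.diff(q, n=3, axis=0) / dt**3          # shape (T-3, D)
    return float(np.mean(np.abs(jerk)))

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_*: (N, F) motion features (e.g., pooled pose embeddings).
    FID = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):                    # discard numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Lower values of both quantities correspond to smoother, more human-like motion, matching the trends reported in Table III.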

6.1.3. Visual Comparison (Jump Task)

Fig. 3: Comparison of trajectories for the Jump task. The top row shows results from the TaskOnly+ baseline, while the bottom row illustrates trajectories from DreamControl. The yellow sphere represents the spatial control point used to guide the trajectory.

Analysis: The visual comparison for the Jump task (Fig. 3) strikingly illustrates the difference in human-ness.

  • TaskOnly+ (Top Row): The robot attempts to jump by lifting off the ground but without bending its knees or preparing for the jump. This results in a less human-like motion that also fails to accomplish a proper jump. The motion appears stiff and inefficient.

  • DreamControl (Bottom Row): The robot exhibits a smooth jumping motion where it first bends (crouches) and then lifts off the ground. This behavior is characteristic of how humans jump, demonstrating the naturalness imparted by the human motion prior. The motion is fluid, effective, and visually appealing. This visual evidence further confirms the quantitative findings regarding jerk and user preference.

    In summary, DreamControl not only enables robust task completion in challenging loco-manipulation scenarios but also ensures that these tasks are performed with natural, human-like, and smooth motions, which is a critical factor for both sim-to-real transfer and human-robot interaction.

6.2. Data Presentation (Tables)

The subsections below reproduce the appendix tables from the original paper (Tables IV–VII and IX–XIV).

6.2.1. Robot Details (Appendix Tables)

The following are the results from Table IV of the original paper:

| Group | Left | Right |
| --- | --- | --- |
| Legs | left_hip_pitch_joint | right_hip_pitch_joint |
| | left_hip_roll_joint | right_hip_roll_joint |
| | left_hip_yaw_joint | right_hip_yaw_joint |
| | left_knee_joint | right_knee_joint |
| | left_ankle_pitch_joint | right_ankle_pitch_joint |
| | left_ankle_roll_joint | right_ankle_roll_joint |
| Waist | waist_yaw_joint | |
| Arms | left_shoulder_pitch_joint | right_shoulder_pitch_joint |
| | left_shoulder_roll_joint | right_shoulder_roll_joint |
| | left_shoulder_yaw_joint | right_shoulder_yaw_joint |
| | left_elbow_joint | right_elbow_joint |
| | left_wrist_roll_joint | right_wrist_roll_joint |
| | left_wrist_pitch_joint | right_wrist_pitch_joint |
| | left_wrist_yaw_joint | right_wrist_yaw_joint |
| Hands | left_hand_index_0_joint | right_hand_index_0_joint |
| | left_hand_index_1_joint | right_hand_index_1_joint |
| | left_hand_middle_0_joint | right_hand_middle_0_joint |
| | left_hand_middle_1_joint | right_hand_middle_1_joint |
| | left_hand_thumb_0_joint | right_hand_thumb_0_joint |
| | left_hand_thumb_1_joint | right_hand_thumb_1_joint |
| | left_hand_thumb_2_joint | right_hand_thumb_2_joint |

The following are the results from Table V of the original paper:

| Group | Links |
| --- | --- |
| Legs | left_hip_pitch_link, right_hip_pitch_link, left_hip_roll_link, right_hip_roll_link, left_hip_yaw_link, right_hip_yaw_link, left_knee_link, right_knee_link, left_ankle_pitch_link, right_ankle_pitch_link, left_ankle_roll_link, right_ankle_roll_link |
| Waist & Torso | pelvis, pelvis_contour_link, waist_yaw_link, waist_roll_link, torso_link, logo_link, waist_support_link |
| Head & Sensors | head_link, imu_link, d435_link, mid360_link |
| Arms | left_shoulder_pitch_link, right_shoulder_pitch_link, left_shoulder_roll_link, right_shoulder_roll_link, left_shoulder_yaw_link, right_shoulder_yaw_link, left_elbow_link, right_elbow_link, left_wrist_roll_link, right_wrist_roll_link, left_wrist_pitch_link, right_wrist_pitch_link, left_wrist_yaw_link, right_wrist_yaw_link, left_rubber_hand, right_rubber_hand |

The following are the results from Table VI of the original paper:

| Joint name | Default angle | Kp | Kd |
| --- | --- | --- | --- |
| left_hip_pitch_joint | -0.2 | 200 | 5 |
| left_hip_roll_joint | 0 | 150 | 5 |
| left_hip_yaw_joint | 0 | 150 | 5 |
| left_knee_joint | 0.42 | 200 | |
| left_ankle_pitch_joint | -0.23 | 20 | 20 |
| left_ankle_roll_joint | 0 | 20 | |
| right_hip_pitch_joint | -0.2 | 200 | |
| right_hip_roll_joint | 0 | 150 | |
| right_hip_yaw_joint | 0 | 150 | 5 |
| right_knee_joint | 0.42 | 200 | 2 |
| right_ankle_pitch_joint | -0.23 | 20 | |
| right_ankle_roll_joint | 0 | 20 | 2 |
| waist_yaw_joint | 0 | 200 | 5 |
| left_shoulder_pitch_joint | 0.35 | 40 | 10 |
| left_shoulder_roll_joint | 0.16 | 40 | 10 |
| left_shoulder_yaw_joint | 0 | 40 | 10 |
| left_elbow_joint | 0.87 | 40 | 10 |
| left_wrist_roll_joint | 0 | 40 | 10 |
| left_wrist_pitch_joint | 0 | 40 | 10 |
| left_wrist_yaw_joint | 0 | 40 | 10 |
| left_hand_index_0_joint | 0 | 5 | 1.25 |
| left_hand_index_1_joint | 0 | 5 | 1.25 |
| left_hand_middle_0_joint | 0 | 5 | 1.25 |
| left_hand_middle_1_joint | 0 | 5 | 1.25 |
| left_hand_thumb_0_joint | 0 | 5 | 1.25 |
| left_hand_thumb_1_joint | 0 | 5 | 1.25 |
| left_hand_thumb_2_joint | 0 | 5 | 1.25 |
| right_shoulder_pitch_joint | 0.35 | 40 | 10 |
| right_shoulder_roll_joint | -0.16 | 40 | 10 |
| right_shoulder_yaw_joint | 0 | 40 | 10 |
| right_elbow_joint | 0.87 | 40 | 10 |
| right_wrist_roll_joint | 0 | 40 | 10 |
| right_wrist_pitch_joint | 0 | 40 | 10 |
| right_wrist_yaw_joint | 0 | 40 | 10 |
| right_hand_index_0_joint | 0 | 5 | 1.25 |
| right_hand_index_1_joint | 0 | 5 | 1.25 |
| right_hand_middle_0_joint | 0 | 5 | 1.25 |
| right_hand_middle_1_joint | 0 | 5 | 1.25 |
| right_hand_thumb_0_joint | 0 | 5 | 1.25 |
| right_hand_thumb_1_joint | 0 | 5 | 1.25 |
| right_hand_thumb_2_joint | 0 | 5 | 1.25 |
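
The default angles and per-joint Kp/Kd gains above are the kind of parameters consumed by a joint-space PD controller that turns policy targets into torques. The snippet below is a generic illustration of that convention, not the paper's exact control stack; the gain subset is copied from Table VI, and the policy offset is a hypothetical value.

```python
import numpy as np

# Subset of Table VI: joint -> (default angle [rad], Kp, Kd)
PD_PARAMS = {
    "left_hip_pitch_joint":      (-0.20, 200.0,  5.00),
    "left_shoulder_pitch_joint": ( 0.35,  40.0, 10.00),
    "left_elbow_joint":          ( 0.87,  40.0, 10.00),
    "left_hand_index_0_joint":   ( 0.00,   5.0,  1.25),
}

def pd_torque(q: float, qd: float, q_target: float, kp: float, kd: float) -> float:
    """Standard joint-space PD law: tau = Kp * (q_target - q) - Kd * qd."""
    return kp * (q_target - q) - kd * qd

# Example: the policy outputs a small offset around the default angle.
default, kp, kd = PD_PARAMS["left_elbow_joint"]
action_offset = 0.1                                   # hypothetical policy output [rad]
tau = pd_torque(q=0.80, qd=0.05, q_target=default + action_offset, kp=kp, kd=kd)
print(f"commanded elbow torque: {tau:.2f} N*m")
```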

6.2.2. Hand States and Trajectory Refinements

The following are the results from Table VII of the original paper:

| Task | Left Hand Config (Open — Close) | Right Hand Config (Open — Close) | Group set to default q |
| --- | --- | --- | --- |
| Pick | ACL — ACL | AOR — ACR | $G_{\text{left arm}}$ |
| Precise Punch | ACL — ACL | ACR — ACR | — |
| Precise Kick | ACL — ACL | ACR — ACR | — |
| Press Button | ACL — ACL | PBR — PBR | — |
| Jump | ACL — ACL | ACR — ACR | — |
| Sit | ACL — ACL | ACR — ACR | — |
| Bimanual Pick | ACL — ACL | ACR — ACR | — |
| Pick from Ground (Side Grasp) | AOL — ACL | ACR — ACR | $G_{\text{right arm}}$ |
| Pick from Ground (Top Grasp) | ACL — ACL | ACR — ACR | $G_{\text{left arm}}$ |
| Pick and Place | ACL — ACL | AOR — ACR | — |
| Open Drawer | ACL — ACL | DOR — ACR | — |
| Open Door | ACL — ACL | DOR — ACR | — |
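
Since DreamControl restricts the hands to discrete open/closed configurations, the labels above (ACL, AOR, PBR, DOR, ...) name predefined hand poses that are switched around the grasp time. The sketch below illustrates only that switching logic; the actual joint angles behind each label are not given in the paper, so the values here are placeholders.

```python
import numpy as np

# Hypothetical joint-angle presets for the right-hand configuration labels in
# Table VII (AOR = open, ACR = closed). The 7 entries correspond to the
# right-hand joints in Table IV; the angle values are placeholders only.
HAND_CONFIGS = {
    "AOR": np.zeros(7),        # open right hand (placeholder angles)
    "ACR": np.full(7, 0.8),    # closed right hand (placeholder angles)
}

def right_hand_targets(t: float, t_grasp: float,
                       open_cfg: str = "AOR", close_cfg: str = "ACR") -> np.ndarray:
    """Return right-hand joint targets: the open preset before the grasp time,
    the closed preset afterwards (discrete switching, no per-finger control)."""
    return HAND_CONFIGS[open_cfg] if t < t_grasp else HAND_CONFIGS[close_cfg]
```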

6.2.3. Task-Specific Prompts for Simulation

The following are the results from Table IX of the original paper:

| Task | Prompt |
| --- | --- |
| Pick | "a person walks to cup, grabs the cup from side and lifts up" |
| Precise Punch | "a person performs a single boxing punch with his right hand" |
| Precise Kick | "a person stands and kicks with his right leg" |
| Press Button | "a person walks towards elevator, presses elevator button" |
| Jump | "a person jumps forward" |
| Sit | "a person walks towards a chair, sits down" |
| Bimanual Pick | "a person raises the toolbox with both hands" |
| Pick from Ground (Side Grasp) | "a person raises the toolbox with the use of one hand" |
| Pick from Ground (Top Grasp) | "a person walks forward, bends down to pick something up off the ground" |
| Pick and Place | "a person picks the cup and puts it on another table" |

6.2.4. Task-Specific Prompts and Slow-Down Factors for Real Robot

The following are the results from Table X of the original paper:

| Task | Prompt | Slow-down factor |
| --- | --- | --- |
| Pick | "a person stands in place, grabs the cup from side and lifts up" | 2.5 |
| Precise Punch | "a person performs a single boxing punch with his right hand" | 1.5 |
| Press Button | "a person stands in place, presses elevator button" | 1.5 |
| Bimanual Pick | "a person raises the toolbox with both hands" | 1 |
| Squat | "a person squats in place and stands up" | 1 |
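
The slow-down factor stretches the reference motion in time before real-robot execution. One simple way to realize this, assuming joint targets sampled at a fixed control rate, is linear resampling onto a longer time grid, as sketched below; the paper does not specify its exact resampling scheme.

```python
import numpy as np

def slow_down(traj: np.ndarray, factor: float) -> np.ndarray:
    """Time-stretch a reference trajectory by `factor` via linear interpolation.

    traj: (T, D) joint targets sampled at a fixed control rate.
    factor: e.g. 2.5 for the real-robot Pick task (Table X).
    """
    T, D = traj.shape
    new_T = int(round(T * factor))
    t_old = np.linspace(0.0, 1.0, T)
    t_new = np.linspace(0.0, 1.0, new_T)
    return np.stack([np.interp(t_new, t_old, traj[:, d]) for d in range(D)], axis=1)
```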

6.2.5. Trajectory Filtering Constants

The following are the results from Table XI of the original paper:

| Task | #Trajs before | #Trajs after | $\beta_{\text{torso}}$ | $\beta_{\text{pelvis}}$ |
| --- | --- | --- | --- | --- |
| Pick | 100 | 67 | $\pi/4$ | 0.6 |
| Precise Punch | 100 | 100 | $\pi/4$ | 0.6 |
| Precise Kick | 100 | 66 | $\pi/2$ | 0.5 |
| Press Button | 100 | 96 | $\pi/3$ | 0.5 |
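
A plausible reading of these filtering constants is that $\beta_{\text{torso}}$ bounds the allowed torso tilt (in radians) and $\beta_{\text{pelvis}}$ sets a minimum pelvis height (in meters) for keeping a generated trajectory; the paper's exact rule may differ. Under that assumption, the filter reduces to a simple per-trajectory check:

```python
import numpy as np

def keep_trajectory(torso_tilt: np.ndarray, pelvis_height: np.ndarray,
                    beta_torso: float, beta_pelvis: float) -> bool:
    """Assumed filter rule: reject a generated trajectory if the torso ever tilts
    beyond beta_torso radians or the pelvis ever drops below beta_pelvis meters.
    For Pick (Table XI): beta_torso = pi/4, beta_pelvis = 0.6."""
    return bool(np.all(torso_tilt <= beta_torso) and np.all(pelvis_height >= beta_pelvis))
```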

6.2.6. Environment Randomization Parameters

The following are the results from Table XII of the original paper:

| Task | Friction | Mass of object |
| --- | --- | --- |
| Pick | U(0.7, 1) | U(0.1, 1) |
| Precise Punch | U(0.7, 1) | — |
| Precise Kick | U(0.7, 1) | — |
| Press Button | U(0.7, 1) | — |
| Jump | U(0.7, 1) | — |
| Sit | U(0.7, 1) | — |
| Bimanual Pick | U(0.7, 1) | U(0.1, 5) |
| Pick from Ground (Side Grasp) | U(0.7, 1) | U(0.1, 1) |
| Pick from Ground (Top Grasp) | U(0.7, 1) | U(0.1, 0.5) |
| Pick and Place | U(0.7, 1) | U(0.1, 0.5) |
| Open Drawer | U(0.7, 1) | — |
| Open Door | U(0.7, 1) | — |
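
These U(a, b) entries are per-episode uniform draws. A minimal sketch of applying such domain randomization at environment reset (with the simulator API itself omitted) might look as follows; the dictionary keys are illustrative names, not identifiers from the paper's codebase.

```python
import numpy as np

# Ranges copied from Table XII for a few representative tasks.
RANDOMIZATION = {
    "Pick":          {"friction": (0.7, 1.0), "object_mass": (0.1, 1.0)},
    "Bimanual Pick": {"friction": (0.7, 1.0), "object_mass": (0.1, 5.0)},
    "Open Drawer":   {"friction": (0.7, 1.0)},   # no object-mass range listed
}

def sample_episode_params(task: str, rng: np.random.Generator) -> dict:
    """Draw one value per randomized quantity at episode reset."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION[task].items()}

params = sample_episode_params("Pick", np.random.default_rng(0))
print(params)   # e.g. {'friction': 0.89..., 'object_mass': 0.33...}
```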

6.2.7. Task-Specific Sparse Rewards

The following are the results from Table XIII of the original paper:

| Task | Body part link, $b$ | Sparse reward, $r_{\text{sparse}}$ | Description |
| --- | --- | --- | --- |
| Pick | right_wrist_yaw_link | $1_{\text{object above } h_{\text{thres}} \text{ at } t \geq t_g + t_b}$ | |
| Precise Punch | right_wrist_yaw_link | $1_{\Vert p^b - p_{\text{goal}} \Vert_2 < d_{\text{thres}}}\, 1_{t \geq t_g - 0.1}\, 1_{t \leq t_g + 0.1}$ | $d_{\text{thres}} = 0.05$ |
| Precise Kick | right_ankle_roll_link | $1_{\Vert p^b - p_{\text{goal}} \Vert_2 < d_{\text{thres}}}\, 1_{t \geq t_g - 0.1}\, 1_{t \leq t_g + 0.1}$ | $d_{\text{thres}} = 0.1$ |
| Press Button | right_wrist_yaw_link | $1_{\Vert p^b - p_{\text{goal}} \Vert_2 < d_{\text{thres}}}\, 1_{t \geq t_g - 0.1}\, 1_{t \leq t_g + 0.1}$ | $d_{\text{thres}} = 0.05$ |
| Jump | pelvis | $1_{\Vert p^b - p_{\text{goal}} \Vert_2 < d_{\text{thres}}}\, 1_{t \geq t_g - 0.1}\, 1_{t \leq t_g + 0.1}$ | $d_{\text{thres}} = 0.1$ |
| Sit | pelvis | $1_{\Vert p^b - p_{\text{goal}} \Vert_2 < d_{\text{thres}}}\, 1_{t \geq t_g}$ | $d_{\text{thres}} = 0.05$ |
| Bimanual Pick | right_wrist_yaw_link, left_wrist_yaw_link | $1_{\text{object above } h_{\text{thres}} \text{ at } t \geq t_g + t_b}$ | |
| Pick from Ground (Side Grasp) | left_wrist_yaw_link | $1_{h_{\text{object}} > h_{\text{thres}} \text{ at } t \geq t_g + t_b}$ | |
| Pick from Ground (Top Grasp) | right_wrist_yaw_link | $1_{h_{\text{object}} > h_{\text{thres}} \text{ at } t \geq t_g + t_b}$ | $h_{\text{thres}} = 0.3$ |
| Pick and Place | right_wrist_yaw_link | $1_{\Vert p_{\text{object}} - p_{\text{goal}} \Vert_2 < d_{\text{thres}} \text{ at } t \geq t_b}$ | $d_{\text{thres}} = 0.1$, $p_{\text{goal}}$ is the position of the goal |
| Open Drawer | right_wrist_yaw_link | $1_{a_{\text{drawer}} > a_{\text{thres}} \text{ at } t \geq t_b}$ | $a_{\text{drawer}}$ is the drawer open amount, $a_{\text{thres}} = 0.05$ |
| Open Door | right_wrist_yaw_link | $1_{a_{\text{door}} > a_{\text{thres}} \text{ at } t \geq t_b}$ | $a_{\text{door}}$ is the door open amount, $a_{\text{thres}} = 0.05$ |
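
These indicator rewards translate directly into code. The sketch below reproduces two representative cases under the table's notation, where $t_g$ is the goal time from the reference plan and $t_b$ a small buffer; the function and variable names are mine, and the default `h_thres = 0.3` is the value listed only for the top-grasp variant.

```python
import numpy as np

def punch_sparse_reward(p_body, p_goal, t, t_g, d_thres=0.05):
    """Precise Punch (Table XIII): 1 if the right wrist is within d_thres of the
    goal position during the window [t_g - 0.1, t_g + 0.1], else 0."""
    in_window = (t >= t_g - 0.1) and (t <= t_g + 0.1)
    close = np.linalg.norm(np.asarray(p_body) - np.asarray(p_goal)) < d_thres
    return float(in_window and close)

def pick_sparse_reward(object_height, t, t_g, t_b, h_thres=0.3):
    """Pick-style tasks: 1 once the object sits above h_thres after t_g + t_b."""
    return float(t >= t_g + t_b and object_height > h_thres)
```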

6.2.8. Reward Weights for All Tasks

The following are the results from Table XIV of the original paper:

In the original table, $w_{r_1}$–$w_{r_{10}}$ are grouped under the tracking and smoothness reward terms, and the last two columns weight the task-specific sparse and dense rewards.

| Task | $w_{r_1}$ | $w_{r_2}$ | $w_{r_3}$ | $w_{r_4}$ | $w_{r_5}$ | $w_{r_6}$ | $w_{r_7}$ | $w_{r_8}$ | $w_{r_9}$ | $w_{r_{10}}$ | $w_{r_{\text{task,sparse}}}$ | $w_{r_{\text{task,dense}}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pick | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 0.1 | 100 |
| Precise Punch | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 1 | 100 |
| Precise Kick | -0.2 | -0.1 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.15 | -1 for left, -0.3 for right | 1 | 100 |
| Press Button | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 1 | 100 |
| Jump | -0.2 | -0.1 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 1 | 100 |
| Sit | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 1 | 100 |
| Bimanual Pick | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 0.1 | 100 |
| Pick from Ground (Side Grasp) | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 0.1 | 100 |
| Pick from Ground (Top Grasp) | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.15 | -1 | 0.1 | 100 |
| Pick and Place | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | -0.5 | -1 | 0.1 | 100 |
| Open Drawer | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 0.1 | 100 |
| Open Door | -0.2 | -0.05 | -0.2 | -0.2 | 0.3 | -1.5e-7 | -5e-3 | -0.1 | 0 | -1 | 0.1 | 100 |
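
These weights combine the individual reward terms into one scalar per step. The sketch below shows the generic weighted sum; the ten tracking/smoothness terms themselves are left abstract (zeros as placeholders) since their definitions appear elsewhere in the paper.

```python
import numpy as np

def total_reward(term_values: np.ndarray, term_weights: np.ndarray,
                 r_task_sparse: float, w_sparse: float,
                 r_task_dense: float = 0.0, w_dense: float = 0.0) -> float:
    """Weighted sum of the ten tracking/smoothness terms plus task rewards.

    term_values / term_weights: the r_1..r_10 terms and weights w_{r1}..w_{r10}
    from Table XIV; w_sparse and w_dense are the last two columns of that table.
    """
    return float(term_weights @ term_values
                 + w_sparse * r_task_sparse
                 + w_dense * r_task_dense)

# Example with the Pick row of Table XIV (w_sparse = 0.1, dense column = 100).
w_pick = np.array([-0.2, -0.05, -0.2, -0.2, 0.3, -1.5e-7, -5e-3, -0.1, -0.5, -1.0])
r_terms = np.zeros(10)                      # placeholder per-term values
r = total_reward(r_terms, w_pick, r_task_sparse=1.0, w_sparse=0.1)
```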

6.3. Sim2Real Deployment Results

While no table of sim-to-real success rates is provided, the paper states that policies for Pick (standing), Bimanual Pick (varying weights), Press Button (standing), Open Drawer (different positions), Precise Punch (standing), and Squat (varying depths) were successfully deployed on the Unitree G1 robot. This qualitative validation is demonstrated through the visualizations in Fig. 1 and additional videos on the project website.

The successful sim-to-real transfer for a selection of tasks highlights DreamControl's ability to generate robust policies that are not overly sensitive to the sim-to-real gap. The adaptations made to the observation space (removing privileged simulator information and adding time encoding) and the use of slow-down factors for certain tasks (Table X) were crucial for this real-world performance. The freezing of the lower body for some interactive tasks, necessitated by perception bottlenecks (e.g., OWLv2 latency), indicates that the core control methodology is sound, but real-world perception remains a challenge.
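
A rough sketch of the observation-space adaptation described above: privileged simulator quantities are dropped and a time encoding is appended so the policy can locate itself along the reference motion. The sinusoidal phase encoding and the exact feature list are assumptions; the paper only states that a time encoding is added.

```python
import numpy as np

def real_robot_observation(q, qd, base_ang_vel, projected_gravity,
                           t: float, episode_len: float) -> np.ndarray:
    """Assemble an observation from onboard-measurable quantities only.

    Privileged simulator signals (e.g., ground-truth object pose streams) are
    excluded; a phase encoding of normalized time is appended. The feature set
    and encoding used in the paper may differ; this is an illustrative layout.
    """
    phase = 2.0 * np.pi * np.clip(t / episode_len, 0.0, 1.0)
    time_enc = np.array([np.sin(phase), np.cos(phase)])
    return np.concatenate([q, qd, base_ang_vel, projected_gravity, time_enc])
```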

Fig. 1: Unitree G1 humanoid performing various skills trained via DreamControl, including (1) opening a drawer, (2) bimanual pick of a box, (3) an ordinary pick, and (4) pressing an elevator button.

6.4. Ablation Studies / Parameter Analysis

The paper's comparison against TaskOnly, TaskOnly+, and TrackingOnly can be viewed as a form of ablation study, demonstrating the necessity and effectiveness of combining tracking rewards from the diffusion prior with task-specific sparse rewards.

  • The TaskOnly baseline shows that sparse rewards alone are insufficient.

  • The TaskOnly+ baseline shows that even hand-engineered dense rewards often fall short for complex whole-body tasks and may not produce natural motions.

  • The TrackingOnly baseline shows the power of the motion prior in generating dynamically feasible and natural motions, but also its limitation in precisely executing fine-grained interaction tasks without task-specific rewards.

    DreamControl's superior performance, combining tracking and sparse task rewards, validates that both components are essential and synergistically contribute to the final robust and natural skill acquisition. The sim-to-real deployment section also details adaptations to the observation space and trajectory refinements (e.g., slow-down factors, lower body freezing, additional optimization costs) which act as a practical ablation or parameter tuning for real-world scenarios.

7. Conclusion & Reflections

7.1. Conclusion Summary

DreamControl introduces a novel and highly effective recipe for training autonomous whole-body humanoid skills. By ingeniously leveraging guided diffusion models for long-horizon human-inspired motion planning and Reinforcement Learning for robust task-specific control, the method successfully addresses critical challenges in humanoid robotics. Key contributions include the use of human motion data (instead of expensive robot demonstrations) to inform a diffusion prior, which not only enables RL to discover solutions unattainable by direct RL but also inherently promotes natural-looking, smooth motions. The extensive simulation experiments and real-world deployment on a Unitree G1 robot across 11 challenging tasks (including object interaction and loco-manipulation) firmly validate DreamControl's effectiveness, robustness, and potential for sim-to-real transfer.

7.2. Limitations & Future Work

The authors openly acknowledge several limitations and propose future research directions:

  • Skill Composition: The current implementation does not yet support the composition of skills (e.g., combining "pick up" with "walk to table"). This would be crucial for more complex, multi-stage tasks.
  • Dexterous Manipulation: The method currently uses discrete open/closed hand configurations. Extending it to dexterous manipulation with fine-grained finger control for more intricate object handling is a next step.
  • Complex Object Geometries: The system currently handles relatively simple object geometries. Supporting more complex object geometries and interactions would enhance its generality.
  • Broader Repertoire of Tasks: While demonstrating a diverse set of tasks, scaling DreamControl to an even broader repertoire of tasks is an ongoing goal.
  • Diverse Robot Morphologies: The validation is on a Unitree G1. Applying DreamControl to more diverse robot morphologies would test its generalization capabilities.
  • Perception Bottlenecks: For sim-to-real deployment, the reliance on static object estimates from OWLv2 due to latency is a limitation. The authors suggest addressing this with vision-based policies trained via student-teacher distillation (e.g., [4]) as future work.

7.3. Personal Insights & Critique

DreamControl represents a highly insightful and promising direction for humanoid robotics. The core idea of leveraging human motion priors to guide RL is elegant and addresses several fundamental problems simultaneously.

Strengths and Inspirations:

  • Data Efficiency for Robots: The most significant inspiration is how DreamControl circumvents the robot data scarcity problem by tapping into the abundance of human motion data. This human-in-the-loop (indirectly) approach, where human motion informs the prior, is a scalable and powerful paradigm shift from relying on expensive robot teleoperation. This concept could be applied to other domains where robot-specific data is scarce, by finding analogous human or animal motion data.
  • Naturalness by Design: The inherent ability of diffusion models to generate natural and diverse motions is a huge advantage. This naturalness directly translates to smoother robot movements, which is vital for sim-to-real transfer (as jerky motions are often physically infeasible or damaging in reality) and for more acceptable human-robot interaction.
  • Enhanced RL Exploration: The diffusion prior effectively provides dense, meaningful guidance to RL, transforming sparse-reward exploration into a more tractable problem. This hybrid approach demonstrates how generative models can intelligently pre-condition or shape the learning landscape for RL agents.
  • Modular and Adaptable: The two-stage design makes the system quite modular. The motion prior can be improved independently, and the RL policy can be adapted for specific robot morphologies or task requirements. The sim-to-real adaptations also show a practical approach to bridging the reality gap, even with current perception limitations.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Fidelity of Retargeting for Complex Tasks: While PyRoki is used, the retargeting from human SMPL to Unitree G1 might not perfectly preserve all dynamic characteristics or interaction nuances. The Pick task's FID discrepancy and the need for IK-based optimization for Open Drawer/Door suggest that the diffusion prior (trained on human data) might sometimes struggle with producing directly optimal or perfectly transferable motions for tasks where robot kinematics or environment constraints significantly differ from typical human scenarios.

  • Heuristic Filtering and Refinements: The reliance on heuristic-based filtering and trajectory refinements (e.g., for collision avoidance, disabling non-functional arms, feet slipping in sim2real) indicates that the raw output of the diffusion prior is not always dynamically feasible or perfectly suitable for direct RL tracking. While practical, this suggests room for improvement in the motion prior itself or its integration with physics-based constraints. The authors acknowledge that more data might eliminate the need for such filtering.

  • Generalization of OmniControl for Out-of-Distribution Tasks: For tasks like Open Drawer/Door, DreamControl falls back to IK-based optimization rather than pure diffusion generation. This highlights a limitation: while OmniControl is "zero-shot" for many tasks, it might not fully generalize to actions poorly represented in its training data or tasks requiring very specific interaction dynamics. Improving the diffusion model's understanding of novel object interactions is crucial.

  • Perception Dependency and Latency: The sim-to-real deployment is currently constrained by perception bottlenecks (e.g., OWLv2 latency, static object estimates), requiring the robot to remain still for interactive tasks. This is a practical limitation for fully dynamic loco-manipulation in the real world. Integrating end-to-end vision-based policies (as suggested for future work) is necessary to unlock the full potential of DreamControl in truly unstructured environments.

  • Hand Control Granularity: Restricting hand control to discrete open/closed states simplifies the problem. For dexterous manipulation, more granular control of individual finger joints would be required, which would significantly increase the action space and complexity.

    Overall, DreamControl makes a substantial contribution by offering a robust and natural way to imbue humanoids with complex skills. Its modularity and strong performance in both simulation and real-world scenarios position it as a foundational work for future advancements in general-purpose humanoid robotics. The identified limitations also provide clear, actionable directions for continued research, particularly at the intersection of generative models, reinforcement learning, and real-world robot perception.
