
Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning

Published: 09/17/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a reinforcement learning framework for humanoid robots to embrace bulky objects via whole-body manipulation. It combines human motion priors with neural signed distance fields to generate robust and natural motions, enhancing multi-contact stability and adaptability to objects of diverse shapes and sizes.

Abstract

Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping relying on end-effectors only remains limited in such scenarios due to inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation to achieve robust whole-body embracing. Our method leverages a teacher-student architecture to distill large-scale human motion data, generating kinematically natural and physically feasible whole-body motion patterns. This facilitates coordinated control across the arms and torso, enabling stable multi-contact interactions that enhance the robustness in manipulation and also the load capacity. The embedded NSDF further provides accurate and continuous geometric perception, improving contact awareness throughout long-horizon tasks. We thoroughly evaluate the approach through comprehensive simulations and real-world experiments. The results demonstrate improved adaptability to diverse shapes and sizes of objects and also successful sim-to-real transfer. These indicate that the proposed framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks of humanoid robots.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning

1.2. Authors

Chunxin Zheng, Kai Chen, Zhihai Bi, Yulin Li, Liang Pan, Jinni Zhou, Haoang Li, and Jun Ma, Senior Member, IEEE

1.3. Journal/Conference

The paper is available as a preprint on arXiv, dated September 16, 2025. As an arXiv preprint, it represents a submission to a scientific repository, typically made before or alongside peer review for a journal or conference. The "Senior Member, IEEE" designation for Jun Ma suggests that the authors are likely targeting a high-impact IEEE journal or conference in robotics and automation.

1.4. Publication Year

2025

1.5. Abstract

This paper presents a novel reinforcement learning (RL) framework for humanoid robots to perform whole-body manipulation (WBM) tasks, specifically focusing on embracing bulky objects. Traditional end-effector grasping is often limited in such scenarios due to stability and payload constraints. The proposed method integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation. It utilizes a teacher-student architecture to distill human motion data, generating natural and physically feasible whole-body motion patterns that enable coordinated control of arms and torso for stable multi-contact interactions, thereby enhancing robustness and load capacity. The NSDF provides accurate and continuous geometric perception, improving contact awareness for long-horizon tasks. The framework is evaluated through comprehensive simulations and real-world experiments, demonstrating improved adaptability to diverse object shapes and sizes, and successful sim-to-real transfer. The authors conclude that this framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks for humanoid robots.

https://arxiv.org/abs/2509.13534 The paper is available as a preprint on arXiv.

https://arxiv.org/pdf/2509.13534v1.pdf The paper is available as a PDF on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the challenge of enabling humanoid robots to manipulate bulky objects effectively. While humanoid robots are increasingly deployed in various domains, their ability to handle large and heavy items remains a significant hurdle.

This problem is important because traditional grasping methods that rely solely on end-effectors (like grippers) are inherently limited in stability and payload capacity when dealing with bulky objects. Humans, in contrast, naturally use their whole body—arms, torso, and even legs—to embrace and move such objects, distributing forces and maintaining balance. Replicating this whole-body manipulation (WBM) capability in humanoids would greatly expand their utility in industrial, logistics, and home service applications.

Specific challenges or gaps in prior research include:

  • Perception: Accurately modeling the precise geometry and spatial relationships between the robot's complex body and objects, especially during contact-rich manipulation.

  • Control Complexity: Managing the high degrees of freedom (DoF) of humanoid robots for coordinated, multi-contact interactions.

  • Anthropomorphic Behavior: Generating natural, human-like, and physically feasible whole-body motion patterns.

  • Stability: Maintaining balance during dynamic multi-contact interactions with heavy objects.

  • Long-horizon Tasks: Coordinating a sequence of actions (approaching, embracing, transporting) that require sustained planning and execution.

  • Sim-to-Real Transfer: Bridging the gap between behaviors learned in simulation and their successful deployment in the real world.

    The paper's innovative idea is to leverage Reinforcement Learning (RL) for WBM, specifically integrating a pre-trained human motion prior to guide natural movements and a neural signed distance field (NSDF) for precise contact awareness and geometric perception. This combination aims to overcome the limitations of traditional methods and existing RL approaches that struggle with contact-intensive, dynamically unstable, and long-horizon WBM tasks.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Novel RL Framework for Whole-Body Embracing: The authors propose the first RL framework (to their knowledge) that enables humanoid robots to actively engage with bulky objects using their entire body, including arms and torso, for manipulation. This addresses the limitations of end-effector-only grasping by enabling multi-contact interactions.

  • Integration of Human Motion Prior: The framework incorporates a human motion prior into the embracing policy training pipeline. This motion prior, distilled from large-scale human motion data, accelerates policy training convergence for multi-contact and long-horizon tasks and allows robots to acquire anthropomorphic skills.

  • NSDF for Enhanced Perception and Contact Awareness: A neural signed distance field (NSDF) representation of the humanoid robot is constructed. This NSDF precisely perceives robot-object interactions and is integrated into both the observation space and the reward function. This contact awareness guides the robot's upper body to maintain sustained contact, significantly enhancing manipulation robustness.

  • Comprehensive Evaluation and Sim-to-Real Transfer: The proposed approach is thoroughly validated through extensive simulations and real-world experiments on a humanoid robot (Unitree H1-2). The results demonstrate robust performance in complex whole-body embracing tasks, including adaptability to diverse object shapes, sizes, and masses, as well as successful sim-to-real transfer.

    The key findings are that the integrated framework:

  • Enables stable multi-contact interactions, distributing contact forces across the body, which improves manipulation robustness and load capacity.

  • Generates kinematically natural and physically feasible whole-body motion patterns.

  • Provides accurate and continuous geometric perception, crucial for contact awareness in long-horizon tasks.

  • Achieves high success rates when handling objects of varying sizes, masses, and shapes in simulation.

  • Successfully transfers learned policies from simulation to a physical humanoid robot.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following concepts:

  • Humanoid Robots: Robots designed to resemble the human body, typically with a torso, head, two arms, and two legs. They have many degrees of freedom (DoF), allowing for complex movements but also posing significant control challenges.
  • Whole-Body Manipulation (WBM): A robotics paradigm where a robot uses multiple parts of its body (e.g., arms, torso, legs) and potentially environmental contacts to interact with and manipulate objects. This contrasts with end-effector-only manipulation, which solely relies on grippers. WBM is crucial for handling large, heavy, or irregularly shaped objects that exceed the capabilities of single-point grasping.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns an optimal policy (a mapping from states to actions) through trial and error.
  • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm known for its stability and performance. PPO aims to update the policy in a way that avoids overly large policy changes in one step, ensuring more stable training. It often uses a clipping mechanism or adaptive KL penalty to constrain policy updates.
  • Variational Autoencoder (VAE): A type of generative model that learns a latent space representation of data. A VAE consists of an encoder that maps input data to a distribution (mean and variance) in the latent space, and a decoder that reconstructs the input data from a sample drawn from this latent distribution. It encourages the latent space to follow a specific prior distribution (e.g., a standard normal distribution) using a Kullback-Leibler (KL) divergence loss. In this paper, it's used for motion prior distillation.
  • Signed Distance Field (SDF): A mathematical representation of a 3D shape where each point in space is associated with the shortest distance to the surface of the shape. The sign of the distance indicates whether the point is inside (negative) or outside (positive) the shape. SDFs are useful for collision detection, path planning, and robot-environment interaction modeling.
  • Neural Signed Distance Field (NSDF): An SDF where the distance function is approximated by a neural network. This allows for a continuous and differentiable representation of complex geometries, often more compact and memory-efficient than traditional voxel-based SDFs. In this paper, NSDF is used for geometric perception and contact awareness; a minimal illustrative sketch of such a network appears after this list.
  • Motion Capture (Mocap) Data: Data recorded from human movements, typically using optical markers or inertial sensors. This data provides high-fidelity kinematic information about human poses and trajectories, which can be used to train robots to perform human-like motions.
  • SMPL Model: A skinned multi-person linear model that represents the human body as a mesh with shape and pose parameters. It's widely used in computer graphics and vision to represent and animate human motion, often serving as the basis for processing mocap data.
  • Teacher-Student Learning (Knowledge Distillation): A technique where a smaller, simpler "student" model learns to mimic the behavior of a larger, more complex "teacher" model. The teacher often has superior performance but might be too computationally expensive for deployment. The student learns from the teacher's outputs, distilling its knowledge into a more efficient form. In this paper, a teacher policy trained on mocap data is distilled into a VAE-based student policy to create a compact motion prior.
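To make the NSDF concept above concrete, the following is a minimal sketch of what such a network could look like in PyTorch. The architecture, layer sizes, and the `NeuralSDF` name are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class NeuralSDF(nn.Module):
    """Illustrative neural SDF: maps a 3D query point (expressed in a robot
    link's local frame) to the signed distance to that link's surface."""

    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # signed distance: negative inside, positive outside
        )

    def forward(self, query_points: torch.Tensor) -> torch.Tensor:
        # query_points: (N, 3) -> (N, 1) signed distances
        return self.net(query_points)


# Example query: distance from a hypothetical object-center point to one link.
sdf = NeuralSDF()
object_center_in_link_frame = torch.tensor([[0.30, 0.00, 0.10]])
distance = sdf(object_center_in_link_frame)  # untrained here, so the value is meaningless
```

A trained network of this form can be queried at any 3D point, providing the continuous, differentiable distance values that the paper feeds into the observation space and reward function.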

3.2. Previous Works

The paper references several categories of previous works in Whole-Body Manipulation and Reinforcement Learning for humanoid robot control.

Whole-Body Manipulation (WBM)

  • Traditional WBM (Model-based): Early WBM approaches, such as [13] for whole-arm grasping on single-arm systems and [5] incorporating contact models for fixed-base humanoid platforms (dual arms, objects, torso).
    • Limitation: These model-based methods often rely on simplified robot models (e.g., links as line segments) and struggle with the inherent complexity of contact-rich scenarios, where precise geometry and contact sequence determination are computationally expensive and difficult to specify. For example, Murooka et al. [7] worked on whole-body pushing manipulation but also faced high computational costs.
  • Learning-based WBM:
    • RL methods for WBM on bimanual platforms [9].
    • Enhancing contact perception with soft pneumatic sensors [8] to provide richer tactile feedback.
    • More recent state-of-the-art approaches combine RL with whole-body tactile sensing for compliant and human-like motion generation, though often limited to upper-body platforms [11].
    • Zhang et al. [10] explore plan-guided reinforcement learning for WBM.
    • Gu et al. [12] discuss humanoid locomotion and manipulation control problems.
    • Chen et al. [4] discuss geometry as distance fields for WBM.

Reinforcement Learning for Humanoid Robot Control

  • Task-driven Controllers: Early successes in RL for specific humanoid behaviors like agile running [14] and locomotion across challenging terrains [15].
    • Limitation: These methods often lack generalization to complex, long-horizon scenarios and may not guarantee bionicity (naturalness) without explicit gait rewards.
  • Behavior Cloning (BC) from Physics-based Animation: A significant trend where control policies are trained on large-scale human motion datasets (e.g., AMASS). This allows robots to imitate natural, human-like movements while maintaining physical feasibility [16], [17].
    • Examples: DeepMimic [16] pioneered example-guided RL for physics-based character skills. Perpetual Humanoid Control (PHC) [17] is a framework used for precise trajectory tracking. MaskedMimic [23] processes and filters mocap data for kinematically feasible trajectories.
    • Limitation: BC policies often require additional action generators [3], [18] or policy distillation [19] to adapt to specific tasks (locomotion, manipulation). They are also limited in interactive tasks due to the absence of environmental information (contact forces, terrain geometry) in human motion datasets.
  • Universal Motion Controllers: Recent advancements focus on universal motion controllers that generalize across tasks, often using pre-trained BC policies to construct comprehensive action spaces.
    • Luo et al. [20] proposed using human motion priors, distilled from BC policies, to create universal controllers that serve as a robust foundation for task-specific fine-tuning.
    • He et al. [19] and Luo et al. [20] discuss policy distillation and universal humanoid motion representations. PULSE [20] uses a VAE-based distillation approach to encode diverse motion behaviors into a latent space.
    • Other methods combine text descriptions with pre-trained BC policies [21] or distill various robotics skill policies to build action spaces for universal humanoid robot controllers [22].

3.3. Technological Evolution

The field has evolved from purely model-based control, which required detailed physical models and explicit contact planning, to learning-based approaches, especially Reinforcement Learning. Early RL focused on task-specific skills, often struggling with generalization and human-like motion. The advent of behavior cloning from human motion data marked a significant step towards anthropomorphic behaviors and physical feasibility. However, adapting these mimicry policies to interactive, contact-rich manipulation tasks remained challenging due to the lack of environmental context in mocap data.

This paper fits into the latest wave by combining the strengths of RL (for task adaptation and interaction) with human motion priors (for natural and stable kinematics) and enhancing perception through NSDFs (for precise contact awareness). This represents a step towards creating truly versatile and robust humanoid WBM capabilities, moving beyond simplified models and single-task policies towards more generalizable and human-like interaction.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Integrated RL Framework for Whole-Body Embracing: While WBM has been explored, this paper is, to the authors' knowledge, the first RL framework specifically designed for whole-body embracing of bulky objects by humanoid robots. Many previous RL WBM works focused on upper-body or bimanual manipulation, not the full whole-body multi-contact embracing.

  • Synergistic Combination of Human Motion Prior and NSDF:

    • Human Motion Prior: Unlike generic RL that might struggle to find physically feasible and natural movements in high-dimensional DoF spaces, the pre-trained human motion prior provides a strong inductive bias. It distills complex human motion into a compact latent action space, making RL training more efficient and yielding anthropomorphic skills. This is an evolution from simple behavior cloning by actively integrating the prior into the RL policy's action generation, rather than just using it for trajectory tracking or as a separate action generator.
    • NSDF for Contact Awareness: Model-based WBM often simplifies robot geometry or requires explicit contact point specification, leading to inaccuracies. Generic RL might learn contacts implicitly but less reliably. The NSDF explicitly provides accurate and continuous geometric perception of the robot's own body in relation to the object. This is integrated directly into both the observation space and the reward function, actively guiding the robot to establish and maintain stable multi-contacts, a level of contact awareness often missing in prior RL approaches.
  • Addressing Long-Horizon Tasks with Multi-stage Random Initialization: The paper explicitly tackles the difficulty of long-horizon tasks by introducing a multi-stage random initialization strategy. This significantly improves convergence speed and stability compared to fixed initialization, ensuring the RL agent learns all phases of the embracing task (approaching, embracing, transporting) effectively.

  • Robustness and Generalization: The combined approach demonstrates superior adaptability to diverse object shapes, sizes, and masses in simulation and successful sim-to-real transfer, indicating a more practical and robust solution than many prior RL methods that might struggle with variations outside the training distribution.

    In essence, the paper differentiates itself by cohesively integrating kinematic plausibility (from motion prior) with geometric accuracy and contact stability (from NSDF) within a Reinforcement Learning framework specifically designed for the complex, multi-contact, and long-horizon problem of bulky object embracing by humanoid robots.

4. Methodology

4.1. Principles

The core idea of the method is to enable humanoid robots to perform whole-body manipulation (WBM) tasks, specifically embracing bulky objects, by leveraging the strengths of Reinforcement Learning (RL), human motion priors, and advanced geometric perception. The theoretical basis and intuition behind this approach are:

  1. Anthropomorphic Motion: Humans naturally use their entire body for such tasks. By distilling human motion data into a motion prior, the robot can learn kinematically natural and physically feasible movements, which are often more stable and efficient than motions discovered purely by trial-and-error RL in a high-dimensional state-action space. This prior acts as a strong inductive bias, simplifying the RL search.
  2. Robust Contact Management: Whole-body embracing relies heavily on stable multi-contact interactions. Traditional RL might struggle to reliably establish and maintain these contacts. By introducing a Neural Signed Distance Field (NSDF), the robot gains continuous and accurate contact awareness, allowing it to actively perceive and optimize its body's proximity and contact with the object. This NSDF information is fed directly into the RL agent's observation space and reward function, explicitly guiding it toward stable multi-contact states.
  3. Hierarchical Control and Efficiency: The framework implicitly creates a hierarchical control structure. The motion prior handles the low-level, kinematically natural movements, while the task-specific RL policy focuses on adapting these movements for the specific embracing goal, responding to environmental cues (object position, target location, NSDF features). This distillation process makes RL training more efficient by starting from a behaviorally plausible base.
  4. Addressing Long-Horizon Tasks: RL often struggles with long-horizon tasks due to sparse rewards and exploration challenges. The multi-stage random initialization strategy explicitly addresses this by exposing the agent to various intermediate states, making the learning process more robust and efficient across different phases of the task (approaching, embracing, transporting).

4.2. Core Methodology In-depth (Layer by Layer)

The complete system architecture is illustrated in Fig. 2. The methodology is broken down into three main phases: Data Processing, Motion Prior Distillation, and Embracing Policy Training.

Fig. 2. Schematic of the training framework for the whole-body embracing policy, divided into three parts: (a) data processing, (b) motion prior distillation, and (c) embracing policy training. By cleaning motion capture data and employing a teacher-student architecture, the pipeline generates natural and feasible whole-body motion patterns that enhance the stability and load capacity of multi-contact interactions.

4.2.1. Data Processing

To enable humanoid robots to acquire fundamental movement skills, the initial step involves processing human motion capture (mocap) data.

  • Raw Mocap Data to SMPL: Raw mocap data, typically represented using the SMPL (Skinned Multi-Person Linear) human model, is first evaluated.
  • Filtering and Cleaning: This data undergoes processing and filtering to identify kinematically feasible trajectories. An example of such processing is MaskedMimic [23], which produces a cleaned human motion dataset, denoted as $\mathcal{D}_h$. This dataset contains only physically plausible human movement patterns.
  • Retargeting to Robot Platform: The retargeting framework from H20 [24] is then used to transfer these human motions onto the target humanoid robot platform. This process explicitly accounts for differences in body proportions and joint ranges of motion between the human and the robot.
  • Robot Motion Dataset: The output of this retargeting process is the robot motion dataset, denoted as $\mathcal{D}_r$. This dataset provides validated trajectories that serve as training targets for the subsequent mimic teacher policy.

4.2.2. Motion Prior Distillation

The goal of this module is to distill motion priors from the robot motion dataset $\mathcal{D}_r$. These priors should retain human-like dexterity while ensuring physical feasibility for humanoid robots. This is achieved using a teacher-student learning scheme similar to PULSE [20].

4.2.2.1. Teacher Policy

A motion-tracking teacher policy is trained using the PHC (Perpetual Humanoid Control) framework [17] on the retargeted robot motion dataset $\mathcal{D}_r$. This policy learns to precisely track motion trajectories.

The teacher policy $\pi_{\mathrm{teacher}}$ maps the current robot state and a reference motion (goal state) to an action:
$$\mathbf{a}_t^* = \pi_{\mathrm{teacher}}(\mathbf{s}_t^p, \mathbf{s}_t^g),$$
where:

  • $\mathbf{a}_t^*$ is the action generated by the teacher policy at time $t$.
  • $\pi_{\mathrm{teacher}}$ is the teacher policy, implemented as a multi-layer perceptron (MLP).
  • $\mathbf{s}_t^p$ is the proprioceptive state of the robot at time $t$, defined as:
    $$\mathbf{s}_t^p = \left\{ \mathbf{v}_t, \boldsymbol{\omega}_t, \mathbf{g}_t, \mathbf{q}_t, \dot{\mathbf{q}}_t, \mathbf{a}_{t-1}^* \right\} \in \mathbb{R}^{90}$$
    • $\mathbf{v}_t \in \mathbb{R}^3$: Root linear velocity of the robot.
    • $\boldsymbol{\omega}_t \in \mathbb{R}^3$: Root angular velocity of the robot.
    • $\mathbf{g}_t \in \mathbb{R}^3$: Projected gravity vector, indicating the robot's orientation relative to gravity.
    • $\mathbf{q}_t \in \mathbb{R}^{27}$: Joint positions (angles) of the robot's 27 joints.
    • $\dot{\mathbf{q}}_t \in \mathbb{R}^{27}$: Joint velocities of the robot's 27 joints.
    • $\mathbf{a}_{t-1}^* \in \mathbb{R}^{27}$: The previous action generated by the teacher policy.
  • $\mathbf{s}_t^g$ is the goal state, defined as:
    $$\mathbf{s}_t^g = \left\{ \widehat{\mathbf{p}}_{t+1} - \mathbf{p}_t, \widehat{\mathbf{p}}_{t+1} \right\} \in \mathbb{R}^{2 \times 3 \times 27}$$
    • $\widehat{\mathbf{p}}_{t+1} - \mathbf{p}_t$: Positional offset, representing the difference between the target position of each rigid body at $t+1$ and its current position at $t$.
    • $\widehat{\mathbf{p}}_{t+1}$: Target configurations (positions) for each of the robot's rigid bodies at $t+1$, taken from the reference motion trajectories in $\mathcal{D}_r$.

The policy is optimized using Proximal Policy Optimization (PPO) [25].
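As a rough illustration of the teacher policy's interface, the sketch below assembles the 90-dimensional proprioceptive state and passes it, together with the goal state, through an MLP. Hidden-layer sizes, activations, and the class name are assumptions; only the input and output dimensions follow the definitions above.

```python
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    """Sketch of the mimic teacher: an MLP mapping proprioception + goal state
    to 27 joint actions. Hidden sizes and activations are illustrative."""

    def __init__(self, prop_dim: int = 90, goal_dim: int = 2 * 3 * 27, act_dim: int = 27):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(prop_dim + goal_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, s_p: torch.Tensor, s_g: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s_p, s_g], dim=-1))


def proprioceptive_state(v, omega, g, q, q_dot, prev_action):
    """Concatenate s_t^p = {v, omega, g, q, q_dot, a_{t-1}*}: 3+3+3+27+27+27 = 90 dims."""
    return torch.cat([v, omega, g, q, q_dot, prev_action], dim=-1)
```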

4.2.2.2. Humanoid Motion Prior Distillation

After training the teacher policy, its complex action space is distilled into a compact latent representation using a VAE-based approach, specifically following PULSE [20]. This latent space is designed to naturally induce human-like behavior and be efficiently leveraged for downstream tasks.

A variational encoder-decoder framework is constructed to represent the humanoid motion space:

  • Encoder ($\mathcal{E}$): Infers a distribution over the latent motion variable $\mathbf{z}_t$ given the current proprioceptive state $\mathbf{s}_t^p$ and the goal state $\mathbf{s}_t^g$.

  • Decoder ($\mathcal{D}$): Maps the latent code $\mathbf{z}_t$ back to the action space.

  • Prior Network ($\mathcal{R}$): A learnable prior network that approximates the latent code distribution produced by the encoder.

    To facilitate sim-to-real transfer, the decoder input $\mathbf{s}_t^d$ is designed to exclude the root linear velocity component $\mathbf{v}_t$ from $\mathbf{s}_t^p$. This makes the representation more practical for real-world deployment, where precise linear velocity is harder to estimate reliably. The specific distributions are defined as:
    $$\mathcal{E}(\mathbf{z}_t \mid \mathbf{s}_t^p, \mathbf{s}_t^g) = \mathcal{N}(\mathbf{z}_t \mid \boldsymbol{\mu}_t^e, \boldsymbol{\sigma}_t^e), \qquad \mathcal{R}(\mathbf{z}_t \mid \mathbf{s}_t^d) = \mathcal{N}(\mathbf{z}_t \mid \boldsymbol{\mu}_t^p, \boldsymbol{\sigma}_t^p), \qquad \mathcal{D}(\mathbf{a}_t \mid \mathbf{s}_t^d, \mathbf{z}_t) = \mathcal{N}(\mathbf{a}_t \mid \boldsymbol{\mu}_t^d, \hat{\boldsymbol{\sigma}}_t^d),$$
    where:

  • $\mathcal{E}(\mathbf{z}_t \mid \mathbf{s}_t^p, \mathbf{s}_t^g)$: The encoder, outputting a Gaussian distribution for the latent code $\mathbf{z}_t$.

    • $\mathbf{z}_t$: The latent motion variable at time $t$.
    • $\mathbf{s}_t^p$: The proprioceptive state (as defined for the teacher policy).
    • $\mathbf{s}_t^g$: The goal state (as defined for the teacher policy).
    • $\mathcal{N}(\cdot \mid \boldsymbol{\mu}_t^e, \boldsymbol{\sigma}_t^e)$: A normal distribution with mean $\boldsymbol{\mu}_t^e$ and standard deviation $\boldsymbol{\sigma}_t^e$ (derived from the encoder's output).
  • $\mathcal{R}(\mathbf{z}_t \mid \mathbf{s}_t^d)$: The learnable prior network, approximating the latent code distribution.

    • $\mathbf{s}_t^d = \{ \boldsymbol{\omega}_t, \mathbf{g}_t, \mathbf{q}_t, \dot{\mathbf{q}}_t, \mathbf{a}_{t-1} \} \in \mathbb{R}^{87}$: The decoder input state, which excludes the root linear velocity $\mathbf{v}_t$ from $\mathbf{s}_t^p$ to ease sim-to-real transfer.
      • $\boldsymbol{\omega}_t$: Root angular velocity.
      • $\mathbf{g}_t$: Projected gravity vector.
      • $\mathbf{q}_t$: Joint positions.
      • $\dot{\mathbf{q}}_t$: Joint velocities.
      • $\mathbf{a}_{t-1}$: The previous action reconstructed by the student policy.
    • $\mathcal{N}(\cdot \mid \boldsymbol{\mu}_t^p, \boldsymbol{\sigma}_t^p)$: A normal distribution with mean $\boldsymbol{\mu}_t^p$ and standard deviation $\boldsymbol{\sigma}_t^p$ (derived from the prior network's output).
  • $\mathcal{D}(\mathbf{a}_t \mid \mathbf{s}_t^d, \mathbf{z}_t)$: The decoder, outputting a Gaussian distribution for the action $\mathbf{a}_t$.

    • $\mathbf{a}_t$: The action reconstructed by the student policy.

    • $\mathcal{N}(\cdot \mid \boldsymbol{\mu}_t^d, \hat{\boldsymbol{\sigma}}_t^d)$: A normal distribution with mean $\boldsymbol{\mu}_t^d$ and a fixed diagonal covariance $\hat{\boldsymbol{\sigma}}_t^d$.

The overall training objective $\mathcal{L}_{\mathrm{all}}$ for the encoder, decoder, and prior is:
$$\mathcal{L}_{\mathrm{all}} = \mathcal{L}_{\mathrm{action}} + \alpha \mathcal{L}_{\mathrm{regu}} + \beta \mathcal{L}_{\mathrm{KL}},$$
where:

  • $\mathcal{L}_{\mathrm{action}} = \| \mathbf{a}_t^* - \mathbf{a}_t \|_2^2$: The action reconstruction loss, the squared Euclidean distance between the teacher action $\mathbf{a}_t^*$ and the action $\mathbf{a}_t$ reconstructed by the student. This term encourages the student to mimic the teacher's output.

  • $\mathcal{L}_{\mathrm{regu}} = \| \boldsymbol{\mu}_t^e - \boldsymbol{\mu}_{t-1}^e \|_2^2$: The regularization term, which penalizes abrupt changes in the latent trajectory by enforcing temporal consistency between the encoder means at consecutive time steps.

  • $\mathcal{L}_{\mathrm{KL}}$: The Kullback-Leibler (KL) divergence term, which aligns the encoder's latent distribution $\mathcal{E}(\mathbf{z}_t \mid \mathbf{s}_t^p, \mathbf{s}_t^g)$ with the learned prior $\mathcal{R}(\mathbf{z}_t \mid \mathbf{s}_t^d)$.

  • $\alpha$ and $\beta$: Coefficients that weight the smoothness regularization and the KL regularization terms, respectively.

    Once this distillation is complete, the parameters of the decoder $\mathcal{D}$ and the prior $\mathcal{R}$ are frozen. This defines a new, compact, low-dimensional action space that inherently generates human-like movements and is then used for downstream task learning.
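To make the distillation objective concrete, here is a minimal sketch assuming diagonal Gaussian distributions for the encoder and prior as defined above. The coefficient values for `alpha` and `beta` are placeholders, not values reported in the paper.

```python
import torch

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) ), summed over latent dims."""
    kl = (torch.log(sigma_p / sigma_q)
          + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
          - 0.5)
    return kl.sum(dim=-1)


def distillation_loss(a_teacher, a_student, mu_e, mu_e_prev, sigma_e, mu_p, sigma_p,
                      alpha=0.01, beta=0.001):
    """L_all = L_action + alpha * L_regu + beta * L_KL (coefficients are placeholders)."""
    l_action = ((a_teacher - a_student) ** 2).sum(dim=-1)   # action reconstruction
    l_regu = ((mu_e - mu_e_prev) ** 2).sum(dim=-1)          # temporal smoothness of encoder means
    l_kl = kl_diag_gaussians(mu_e, sigma_e, mu_p, sigma_p)  # align encoder with learned prior
    return (l_action + alpha * l_regu + beta * l_kl).mean()
```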

4.2.3. Embracing Policy Training

This section formulates the task policy $\pi_{\mathrm{task}}$ for the WBM task, specifically for bulky objects. The task is divided into three stages: approaching the object, embracing the object, and transporting the object to the target location. A tailored PPO algorithm is used, incorporating stage-randomized initialization, task-specific rewards, and the humanoid motion prior distribution.

4.2.3.1. Observations and Actions

The task observation at time step $t$ is a concatenation of the decoder input $\mathbf{s}_t^d$ (from the motion prior distillation) and a set of task-specific features $\mathbf{s}_t^s$:
$$\mathbf{s}_t^{\mathrm{task}} = \{ \mathbf{s}_t^d, \mathbf{s}_t^s \},$$
where:

  • $\mathbf{s}_t^d$: The decoder input state (as defined in the motion prior distillation), which contains proprioceptive information such as the angular velocity, gravity vector, joint positions and velocities, and the previous action.
  • $\mathbf{s}_t^s$: The task-specific features, defined as:
    $$\mathbf{s}_t^s = \{ \hat{p}_t^{\mathrm{box}}, \hat{\theta}_t^{\mathrm{box}}, \hat{p}_t^{\mathrm{target}}, \hat{\theta}_t^{\mathrm{target}}, \mathbf{d}_t \} \in \mathbb{R}^{19}$$
    • $\hat{p}_t^{\mathrm{box}} \in \mathbb{R}^3$: Distance vector between the robot torso and the center of the object (box).

    • $\hat{\theta}_t^{\mathrm{box}} \in \mathbb{R}$: Direction-angle difference between the robot torso and the center of the object (box).

    • $\hat{p}_t^{\mathrm{target}} \in \mathbb{R}^3$: Distance vector between the robot torso and the final target position.

    • $\hat{\theta}_t^{\mathrm{target}} \in \mathbb{R}$: Direction-angle difference between the robot torso and the final target position.

    • $\mathbf{d}_t \in \mathbb{R}^{15}$: NSDF features, computed by a pre-trained network $f_{\theta}$ that evaluates the shortest distances between a selected target point (the center of the object) and a set of mesh segments (the robot's upper-body links).

      The NSDF provides continuous geometric perception to improve contact awareness. As illustrated in Fig. 3, the dots along the robot's links mark the closest points to the object's center, and the dashed lines show the NSDF direction.

Fig. 3. Illustration of the NSDF between the target object and the robot's upper body. The dots distributed along each link of the robot represent the closest point from the corresponding joint to the object's center of mass. The dashed lines indicate the direction of the NSDF between the object's center of mass and the robot's links.

The NSDF features $\mathbf{d}_t$ are formally defined as:
$$\mathbf{d}_t = f_{\theta}(\mathbf{p}_t^{\mathrm{target}}),$$
where:

  • $\mathbf{d}_t$: The NSDF feature vector.

  • $f_{\theta}$: A pre-trained neural network that represents the NSDF.

  • $\mathbf{p}_t^{\mathrm{target}}$: The target point for NSDF evaluation, chosen as the center of the object (box).

    The task policy $\pi_{\mathrm{task}}$ operates in the latent space of the motion prior, producing a latent action $\mathbf{z}_t^s$ given the current task state $\mathbf{s}_t^{\mathrm{task}}$:
    $$\mathbf{z}_t^s = \pi_{\mathrm{task}}(\mathbf{s}_t^{\mathrm{task}}), \qquad \mathbf{a}_t^s = \mathcal{D}(\mathbf{z}_t^s + \boldsymbol{\mu}_t^p),$$
    where:

  • $\mathbf{z}_t^s$: The latent action output by the task policy $\pi_{\mathrm{task}}$.

  • $\pi_{\mathrm{task}}$: The task policy for WBM.

  • $\mathbf{s}_t^{\mathrm{task}}$: The full task observation.

  • $\mathbf{a}_t^s$: The final robot actions.

  • $\mathcal{D}$: The frozen decoder from the motion prior distillation.

  • $\boldsymbol{\mu}_t^p$: The motion prior generated by the learnable prior network $\mathcal{R}$ (the mean of its output distribution). The latent action $\mathbf{z}_t^s$ is added to the motion prior $\boldsymbol{\mu}_t^p$ to form a new latent token, which is then decoded into the final robot actions $\mathbf{a}_t^s$. This allows the task policy to modulate the human-like motions provided by the prior in order to achieve the specific task goal.
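A minimal sketch of one control step of the embracing policy is shown below. The `task_policy`, `prior_net`, and `decoder` callables and their interfaces are assumed for illustration; the key point is that the task policy's latent output is added to the frozen prior mean $\boldsymbol{\mu}_t^p$ before decoding.

```python
import torch

def embrace_step(task_policy, prior_net, decoder, s_d, s_task):
    """One control step of the embracing policy (sketch with assumed interfaces):
    the task policy produces a latent action z_t^s, which is added to the mean
    mu_t^p of the frozen prior before the frozen decoder maps it to joint actions."""
    with torch.no_grad():
        z_task = task_policy(s_task)              # latent action z_t^s
        mu_prior, _sigma_prior = prior_net(s_d)   # mean of the frozen prior R(z | s_t^d)
        action = decoder(s_d, z_task + mu_prior)  # frozen decoder D, conditioned on s_t^d
    return action
```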

4.2.3.2. Reward Design

Multiple reward functions are designed to train the humanoid robot for the WBM task. These rewards encourage smoothness, enforce physical limitations, and guide task-specific behaviors. The weights for these rewards are detailed in Table II.

The following are the results from Table II of the original paper:

| Reward Term | Definition | Weight |
| --- | --- | --- |
| Torque | $\lVert \boldsymbol{\tau} \rVert$ | -1e-7 |
| Joint Acceleration | $\lVert \ddot{\mathbf{q}} \rVert_2^2$ | -2.5e-8 |
| Action Rate | $\lVert \mathbf{a}_{t-1} - \mathbf{a}_t \rVert_2^2$ | -0.5 |
| Feet Slippage | $\sum \lVert \mathbf{v}_{\mathrm{foot}} \rVert \cdot (\lVert \mathbf{f}_{\mathrm{contact}} \rVert > 1)$ | -0.05 |
| Feet Contact Force | $\sum [\max(0, \lVert \mathbf{f}_{\mathrm{contact}} \rVert - \tau_{\mathrm{threshold}})]$ | -1e-5 |
| Joint Angle Limitation | $\sum [-\min(0, \mathbf{q} - \mathbf{q}_{\mathrm{lower}}) + \max(0, \mathbf{q} - \mathbf{q}_{\mathrm{upper}})]$ | -1e-3 |

  • Smoothness Reward: This reward category encourages the policy to generate smooth and physically plausible actions.

    • $r_{\mathrm{smooth}} = r_{\mathrm{torque}} + r_{\mathrm{acc}} + r_{\mathrm{action}}$
      • $r_{\mathrm{torque}}$: Penalizes high joint torques ($\|\boldsymbol{\tau}\|$ in Table II), preventing jerky or energy-inefficient movements.
      • $r_{\mathrm{acc}}$: Penalizes large joint accelerations, contributing to smoother motion.
      • $r_{\mathrm{action}}$: Penalizes rapid changes in action values ($\|\mathbf{a}_{t-1} - \mathbf{a}_t\|_2^2$ in Table II), further promoting smoothness.
  • Physical Limitation Reward: These rewards are introduced to protect the robot during deployment and ensure stable operation.

    • $r_{\mathrm{limit}} = r_{\mathrm{dof}} + r_{\mathrm{slippage}} + r_{\mathrm{feet}}$
      • $r_{\mathrm{dof}}$: Encourages joint angles to stay within their physical limits and discourages configurations near the joint boundaries (penalizing violations of the lower and upper limits).
      • $r_{\mathrm{slippage}}$: Enhances stability by penalizing feet slippage, measured by $\sum \|\mathbf{v}_{\mathrm{foot}}\| \cdot (\|\mathbf{f}_{\mathrm{contact}}\| > 1)$, where $\mathbf{v}_{\mathrm{foot}}$ is the foot velocity and $\mathbf{f}_{\mathrm{contact}}$ is the contact force.
      • $r_{\mathrm{feet}}$: Prevents the robot from stepping too heavily on the ground by limiting the contact force on the feet, via $\sum [\max(0, \|\mathbf{f}_{\mathrm{contact}}\| - \tau_{\mathrm{threshold}})]$, where $\tau_{\mathrm{threshold}}$ is a contact-force threshold, leading to smoother movements.
  • Task Reward: The WBM task is decomposed into three distinct stages, and the task reward $r_{\mathrm{task}}$ changes dynamically with the current stage. A circular pick-up zone of radius $d_p = 0.35\ \mathrm{m}$ centered on the object acts as the transition boundary between approaching and manipulation. An annular region with inner radius $d_i = 3.5\ \mathrm{m}$ and outer radius $d_o = 4.0\ \mathrm{m}$ serves as the initial spawning area for the robot and the target delivery region for the object. A code sketch of the staged reward terms appears after this list.

    • $r_{\mathrm{task}} = r_{\mathrm{walk}} + r_{\mathrm{carry}} + r_{\mathrm{arm}} + r_{\mathrm{NSDF}}$

    • First Stage: Approaching the object (the robot is outside the pick-up zone).

      • $r_{\mathrm{walk}}$ encourages the robot to move closer to the object.
      • $r_{\mathrm{carry}}$ and $r_{\mathrm{arm}}$ are set to zero in this stage.
      • The walking reward is formulated as:
        $$r_{\mathrm{walk}} = \begin{cases} 1, & \text{in pick-up zone}, \\ \exp(-\| \sigma \hat{p}_t^{\mathrm{box}} \|_2) + \exp(-\| \sigma \hat{\theta}_t^{\mathrm{box}} \|_2) + \exp(-\| \sigma \hat{\nu}_t^{\mathrm{body}} \|_2), & \text{out of pick-up zone}, \end{cases}$$
        where:
        • 1: A constant reward received once the robot enters the pick-up zone.
        • $\exp(-\| \sigma \hat{p}_t^{\mathrm{box}} \|_2)$: An exponential reward component for reducing the distance $\hat{p}_t^{\mathrm{box}}$ to the object.
        • $\exp(-\| \sigma \hat{\theta}_t^{\mathrm{box}} \|_2)$: An exponential reward component for aligning the heading $\hat{\theta}_t^{\mathrm{box}}$ toward the object.
        • $\exp(-\| \sigma \hat{\nu}_t^{\mathrm{body}} \|_2)$: An exponential reward component on the robot's body velocity $\hat{\nu}_t^{\mathrm{body}}$, likely encouraging forward movement.
        • $\sigma$: A scaling factor.
    • Second and Third Stages: Embracing and transporting the object (the robot is inside the pick-up zone).

      • $r_{\mathrm{carry}}$ encourages the object to move closer to the final target.

      • The carrying reward is defined as:
        $$r_{\mathrm{carry}} = \begin{cases} 0, & \text{out of pick-up zone}, \\ \exp(-\| \sigma \hat{p}_t^{\mathrm{target}} \|_2) + \exp(-\| \sigma \hat{\theta}_t^{\mathrm{target}} \|_2) + \exp(-\| \sigma \hat{\nu}_t^{\mathrm{box}} \|_2), & \text{in pick-up zone}, \end{cases}$$
        where:
        • 0: No carrying reward when outside the pick-up zone.
        • $\exp(-\| \sigma \hat{p}_t^{\mathrm{target}} \|_2)$: An exponential reward component for reducing the distance $\hat{p}_t^{\mathrm{target}}$ of the object to the final target.
        • $\exp(-\| \sigma \hat{\theta}_t^{\mathrm{target}} \|_2)$: An exponential reward component for aligning the direction $\hat{\theta}_t^{\mathrm{target}}$ of the object toward the final target.
        • $\exp(-\| \sigma \hat{\nu}_t^{\mathrm{box}} \|_2)$: An exponential reward component on the object's velocity $\hat{\nu}_t^{\mathrm{box}}$, encouraging it to move.
        • $\sigma$: A scaling factor.
      • Additionally, $r_{\mathrm{arm}}$ is used to adjust the arm positions:
        $$r_{\mathrm{arm}} = \exp\big(-\sigma (\| \hat{p}_t^{\mathrm{lh}} \|_2 + \| \hat{p}_t^{\mathrm{rh}} \|_2)\big) + \exp\big(-\sigma (\| \hat{h}_t^{\mathrm{lh}} \|_2 + \| \hat{h}_t^{\mathrm{rh}} \|_2)\big),$$
        where:
        • $\exp(-\sigma (\| \hat{p}_t^{\mathrm{lh}} \|_2 + \| \hat{p}_t^{\mathrm{rh}} \|_2))$: Encourages the left-hand (lh) and right-hand (rh) end-effectors to stay close to the object ($\hat{p}_t^{\mathrm{lh}}$ and $\hat{p}_t^{\mathrm{rh}}$ are end-effector-to-object distances).
        • $\exp(-\sigma (\| \hat{h}_t^{\mathrm{lh}} \|_2 + \| \hat{h}_t^{\mathrm{rh}} \|_2))$: Encourages the hands to maintain an appropriate height or posture relative to the object ($\hat{h}_t^{\mathrm{lh}}$ and $\hat{h}_t^{\mathrm{rh}}$).
        • $\sigma$: A scaling factor.
      • The $r_{\mathrm{NSDF}}$ term (mentioned as part of $r_{\mathrm{task}}$ but not explicitly formulated) is derived from the NSDF features $\mathbf{d}_t$ in the observation space. It is crucial for guiding multi-contact interactions, rewarding sustained contact and a proper embrace posture based on the NSDF values.
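The staged rewards $r_{\mathrm{walk}}$ and $r_{\mathrm{carry}}$ referenced above can be sketched as simple piecewise functions. The scaling factor `sigma`, the argument conventions, and the boolean `in_pickup_zone` flag are illustrative assumptions; the paper does not report the scale values.

```python
import numpy as np

def walk_reward(p_box, theta_box, v_body, in_pickup_zone, sigma=1.0):
    """r_walk (sketch): a constant 1 inside the pick-up zone; otherwise exponential
    shaping on the torso-to-box distance, heading error, and body velocity."""
    if in_pickup_zone:
        return 1.0
    return (np.exp(-np.linalg.norm(sigma * np.asarray(p_box)))
            + np.exp(-abs(sigma * theta_box))
            + np.exp(-np.linalg.norm(sigma * np.asarray(v_body))))

def carry_reward(p_target, theta_target, v_box, in_pickup_zone, sigma=1.0):
    """r_carry (sketch): zero outside the pick-up zone; otherwise exponential
    shaping that drives the object toward the delivery target."""
    if not in_pickup_zone:
        return 0.0
    return (np.exp(-np.linalg.norm(sigma * np.asarray(p_target)))
            + np.exp(-abs(sigma * theta_target))
            + np.exp(-np.linalg.norm(sigma * np.asarray(v_box))))
```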

4.2.3.3. Random Initialization

To effectively address long-horizon tasks, the training environment uses randomized initialization of scenarios, categorized into three types:

  1. Approaching Phase Training: The target object is placed on a table, and the robot's position is randomly initialized within the initial spawning area (the annular region). This scenario primarily trains the robot to approach the object.

  2. Object Pickup Phase Training: The object remains on the table, but the robot's initial state is configured such that its upper joints are in a pre-hugging posture, and the robot is initialized directly in the pick-up zone. This setup specifically optimizes training for the embracing and lifting phase.

  3. Transport Phase Training: The robot's joints are directly initialized in a hugging posture, with the object already generated in its arms. This configuration facilitates training for the transporting phase, where the robot carries the object to the target location.

    During training, one of these three scenarios is randomly selected upon resetting the environment. This strategy ensures comprehensive learning across all task stages and helps overcome challenges associated with sparse rewards in long-horizon tasks.
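The multi-stage random initialization can be sketched as a sampler that, at each environment reset, draws one of the three scenarios. The zone radii (0.35 m pick-up zone, 3.5-4.0 m annulus) follow the paper; the function name and the returned dictionary keys are illustrative.

```python
import numpy as np

def sample_initial_state(rng=None):
    """Multi-stage random initialization (sketch): each environment reset samples one
    of the three training scenarios so every phase of the long-horizon task is visited."""
    rng = np.random.default_rng() if rng is None else rng
    stage = rng.choice(["approach", "pickup", "transport"])
    angle = rng.uniform(0.0, 2.0 * np.pi)
    if stage == "approach":
        # Robot spawns in the annulus (inner 3.5 m, outer 4.0 m); the object sits on the table.
        radius = rng.uniform(3.5, 4.0)
        return {"stage": stage, "robot_xy": radius * np.array([np.cos(angle), np.sin(angle)]),
                "object_on_table": True, "pre_hug_arms": False}
    if stage == "pickup":
        # Robot starts inside the 0.35 m pick-up zone with upper joints in a pre-hugging posture.
        radius = rng.uniform(0.0, 0.35)
        return {"stage": stage, "robot_xy": radius * np.array([np.cos(angle), np.sin(angle)]),
                "object_on_table": True, "pre_hug_arms": True}
    # Transport: joints initialized in a hugging posture with the object already in the arms.
    return {"stage": stage, "robot_xy": np.zeros(2),
            "object_in_arms": True, "pre_hug_arms": True}
```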

5. Experimental Setup

5.1. Datasets

The training environment for the proposed RL framework is constructed using the Isaac Sim simulator. This simulator provides a high-fidelity physics environment suitable for robotics simulation and RL training. For efficient data collection, 4096 parallel agents are deployed within this environment, allowing for rapid exploration and learning.

The target object used for training is a cylinder of $\Phi 42\ \mathrm{cm}$ (diameter) $\times\ 40\ \mathrm{cm}$ (height). The task setup is as follows:

  • Initial Object Placement: The object is initially placed 4 meters away from the robot.

  • Task Goal: The robot is required to embrace the object and transport it to a target location situated 4 meters away from the object's original position. This defines an integrated whole-body manipulation task.

    For sim-to-sim transfer experiments in MuJoCo, the policy was tested on objects with diverse sizes, masses, and geometries, even though it was trained solely on a fixed-size cylindrical object. This demonstrates the generalization capability of the policy. An illustration of this is shown in Fig. 6, where various object primitives (cylinder, cuboid, sphere) with different colors indicating training vs. unseen dimensions are placed on pedestals.

Fig. 6. Illustration of sim-to-sim experiments in MuJoCo. The illustration compares manipulation performance across three common object primitives: a cylinder, a cuboid, and a sphere. Objects are varied in both shape and weight to evaluate generalization. The red objects match the dimensions used during training, while those in blue represent unseen shapes for generalization testing. The green columns serve as pedestals for initial object placement.

5.2. Evaluation Metrics

The primary evaluation metric used in the experiments is the Success Rate.

  • Conceptual Definition: Success rate quantifies the percentage of trials in which the robot successfully completes the entire whole-body manipulation (WBM) task. For a trial to be considered successful, two conditions must be met:

    1. The object's center of mass must remain within a specified tolerance ($0.1\ \mathrm{m}$) of the target location.
    2. The object must not fall during the task execution. This metric provides a direct measure of the policy's robustness, reliability, and ability to achieve the ultimate task objective.
  • Mathematical Formula: The paper does not provide a specific formula for the success rate, but conceptually it is calculated as: $$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$ Where:

    • Number of Successful Trials: The count of experimental runs where the object reached the target location within tolerance and did not fall.

    • Total Number of Trials: The total number of experimental runs conducted for a given condition.

      For statistical reliability, success rates are computed over 30 independent trials for each test condition.
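For completeness, the metric can be computed as in the short sketch below; the trial record fields are assumed names.

```python
def success_rate(trials, tolerance=0.1):
    """Percentage of trials where the object ends within `tolerance` meters of the
    target and never falls. `trials` is a list of dicts with assumed field names."""
    successes = sum(
        1 for t in trials
        if t["final_distance_to_target"] <= tolerance and not t["object_fell"]
    )
    return 100.0 * successes / len(trials)
```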

5.3. Baselines

Due to the absence of suitable open-source baselines for the specific domain of whole-body embracing of bulky objects by humanoid robots using RL, the authors primarily conduct comparative experiments through ablation studies to validate the effectiveness of their proposed modules. The baselines for comparison are essentially versions of their own framework with certain key components removed:

  • Ablation Study 1: Without the NSDF Module:

    • Description: This baseline evaluates the framework's performance when the Neural Signed Distance Field (NSDF)-related observations are removed from the input space and the NSDF reward term is disabled.
    • Representativeness: This baseline is representative for understanding the critical role of geometric perception and contact awareness provided by NSDF in achieving stable multi-contact manipulation.
  • Ablation Study 2: Without Randomized Initialization:

    • Description: This baseline evaluates the framework's performance when the multi-stage random initialization strategy is replaced with a deterministic initialization from a fixed initial state at the beginning of each episode.

    • Representativeness: This baseline highlights the importance of effective state space exploration and curriculum learning for training RL policies to handle long-horizon tasks that involve sequential stages (approaching, embracing, transporting).

      These ablation baselines effectively demonstrate the individual contributions and necessity of the NSDF module and the randomized initialization strategy to the overall performance and convergence efficiency of the proposed framework.

6. Results & Analysis

The evaluation of the proposed framework is conducted through both simulation and real-world experiments. The training environment is Isaac Sim, utilizing 4096 parallel agents. A cylinder ($\Phi 42\ \mathrm{cm} \times 40\ \mathrm{cm}$) is the target object, and the robot is required to transport it 4 meters. Experiments are run on an NVIDIA RTX 4080 GPU.

6.1. Core Results Analysis

6.1.1. Effectiveness of the NSDF Module

The ablation study on the NSDF module clearly demonstrates its critical role in enabling effective whole-body manipulation.

  • With NSDF (Fig. 4a): After training for the same number of epochs, the policy with the NSDF module successfully establishes multi-point contact using multiple upper-body joints during object transportation. This leads to an increased number of contact points with the object, significantly enhancing manipulation stability, which is particularly crucial for bulky items.

  • Without NSDF (Fig. 4b): In contrast, the policy trained without the NSDF module shows inferior performance. During task execution, only the torso tends to move close to the object, while the arms fail to properly embrace it. This lack of coordinated multi-contact ultimately results in an inability to complete the transportation task successfully. The visual comparison highlights that the NSDF provides the necessary geometric perception and contact awareness for the robot to actively engage its entire upper body in a stable embrace.

Fig. 4. Effectiveness of the NSDF module. (a) Results with the NSDF module. (b) Results without the NSDF module. The red circles indicate contact regions. As shown, the model with NSDF generates more contact points during manipulation, significantly enhancing grasping stability and success rate.

6.1.2. Evaluation of Randomized Initialization

The randomized initialization strategy significantly improves the training process for long-horizon tasks.

  • Proposed Method (Red Curve in Fig. 5): The policy trained with the multi-stage random initialization demonstrates significantly improved convergence speed and stability throughout training. The reward curve shows a smoother and more consistent increase, reaching higher values. This indicates that the strategy effectively promotes balanced learning across all task phases (approaching, embracing, transporting).

  • Baseline (Blue Curve in Fig. 5): The baseline, which uses deterministic initialization from a fixed initial state, exhibits slower reward growth during the initial phases of training and greater instability in its reward progression. This suggests that without diversified initial states, the RL agent struggles more with exploration and integrating learning across the different stages of the long-horizon task.

Fig. 5. Comparison of mean rewards in the ablation study on multi-stage random initialization. The red curve represents the policy with the proposed multi-stage random initialization, while the blue curve denotes the baseline without this mechanism. The policy with randomized initialization (red) demonstrates smoother convergence and higher stability throughout training, whereas the baseline (blue) exhibits slower reward growth during the initial phase and greater instability.

6.1.3. Adaptability to Different Object Properties

The policy's robustness and generalization capability were evaluated through sim-to-sim transfer experiments in MuJoCo on objects with diverse sizes, masses, and geometries. Success rate is measured over 30 independent trials for each condition.

The following are the results from Table III of the original paper:

| Object | Size [cm] | Mass [kg] | Success rate [%] | NSDF |
| --- | --- | --- | --- | --- |
| Cylinder | Φ42 × 40 | 3 | 0 | w/o |
| Cuboid | 42 × 42 × 42 | 3 | 0 | w/o |
| Sphere | Φ42 | 3 | 0 | w/o |
| Cylinder | Φ42 × 40 | 1 | 100 | w/ |
| Cylinder | Φ42 × 40 | 3 | 100 | w/ |
| Cylinder | Φ42 × 40 | 7 | 93 | w/ |
| Cylinder | Φ50 × 40 | 3 | 100 | w/ |
| Cuboid | 42 × 42 × 42 | 1 | 100 | w/ |
| Cuboid | 42 × 42 × 42 | 3 | 100 | w/ |
| Cuboid | 42 × 42 × 42 | 7 | 80 | w/ |
| Cuboid | 30 × 30 × 30 | 3 | 100 | w/ |
| Sphere | Φ42 | 1 | 100 | w/ |
| Sphere | Φ42 | 3 | 100 | w/ |
| Sphere | Φ42 | 7 | 87 | w/ |
| Sphere | Φ30 | 3 | 100 | w/ |

Key observations from Table III:

  • Necessity of NSDF: Without the NSDF module (w/o), the policy fails completely, achieving a 0% success rate across all three object types (cylinder, cuboid, sphere) at the standard 3 kg mass. This corroborates the ablation study and underscores that NSDF-based geometric perception is essential for stable manipulation.
  • Strong Sim-to-Sim Transfer and Generalization (With NSDF):
    • Shape Adaptability: With NSDF (w/), the policy achieves a 100% success rate for objects of standard mass (3 kg) and size across all three tested primitives: cylinders (Φ42 × 40 cm), cuboids (42 × 42 × 42 cm), and spheres (Φ42 cm). This demonstrates excellent generalization to unseen shapes.

    • Mass Robustness: The policy remains robust to mass variations, maintaining high success rates even at 7 kg (cylinders 93%, spheres 87%, cuboids 80%). A modest drop for heavier objects is expected, and the authors attribute the lower rate for heavy cuboids to their flat surface morphology, which offers less stable multi-contact points than curved surfaces under high load.

    • Size Adaptability: Variations in object size do not degrade performance. The policy achieves 100% success for a larger cylinder (Φ50 × 40 cm) and a smaller sphere (Φ30 cm) at the standard mass (3 kg). This confirms the policy's adaptability beyond the training distribution of object dimensions.

      Overall, the simulation results demonstrate that the proposed framework, especially with the NSDF module, is highly robust and generalizable to significant variations in object properties (shape, size, mass), which is a crucial aspect for practical deployment.
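
The sim-to-sim evaluation protocol, i.e. the success rate over 30 independent trials per object configuration, can be expressed as a simple loop. In the sketch below, `run_episode` is a stubbed placeholder for an actual MuJoCo rollout, and the listed configurations merely mirror a few rows of Table III; none of this is the authors' published code.

```python
import numpy as np

def run_episode(shape, size_cm, mass_kg, seed):
    """Placeholder for one simulator rollout (e.g., a MuJoCo episode).
    Returns True on success. Stubbed with a random outcome so the loop runs;
    replace with the real policy rollout in the randomized environment."""
    rng = np.random.default_rng(seed)
    return bool(rng.random() < 0.9)

def success_rate(shape, size_cm, mass_kg, n_trials=30):
    """Success rate [%] over n_trials independent, seeded trials."""
    wins = sum(run_episode(shape, size_cm, mass_kg, seed=i) for i in range(n_trials))
    return 100.0 * wins / n_trials

# Object configurations mirroring a few rows of Table III (sizes in cm, mass in kg).
CONFIGS = [
    ("cylinder", (42, 40), 3.0),
    ("cuboid", (42, 42, 42), 7.0),
    ("sphere", (42,), 3.0),
]

for shape, size, mass in CONFIGS:
    print(f"{shape} {size} {mass} kg -> {success_rate(shape, size, mass):.0f}%")
```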

6.1.4. Real-World Experiment

The efficacy of the trained policy is further validated through a real-world experiment on the Unitree H1-2 humanoid robot platform.

  • Setup: The H1-2 robot is equipped with an onboard Intel i7 computer for policy inference, running at 50 Hz to ensure real-time control performance. Global poses of the robot and the target object are tracked using a high-precision motion capture system. The target object is a cylinder with dimensions Φ40 cm × 60 cm, similar to the training object but slightly different, representing a realistic test case.
  • Execution Sequence (Fig. 7):
    • 0 s - 2 s (Approaching): The H1 robot initiates motion and approaches the target object.

    • 4 s (Contact Initiation): The robot positions itself to initiate contact, entering the pick-up zone and preparing for lifting.

    • 6 s (Embracing & Transport): The robot successfully lifts and embraces the object with its whole body (arms and torso) and begins locomotion to transport it to the target location.

Fig. 7. Sim-to-real transfer of whole-body manipulation in H1-2 humanoid robot. The sequence illustrates the sim-to-real transfer of a whole-body manipulation task. From 0 to 2 s, the H1 robot approaches the target object. At 4 s, it positions itself to initiate contact and prepare for lifting. By 6 s, the robot successfully lifts the object and begins locomotion to complete the task.

The real-world experiment visually demonstrates successful sim-to-real transfer, showcasing the robot's ability to perform whole-body embracing of a bulky object. This validates the practical applicability of the proposed RL framework for multi-contact and long-horizon WBM tasks.
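
For reference, a fixed-rate inference loop consistent with the reported deployment setup (50 Hz policy inference on the onboard computer, with robot and object poses supplied by motion capture) might look like the sketch below. The `policy`, `get_observation`, and `send_joint_targets` callables are hypothetical placeholders for the actual robot SDK and mocap interfaces.

```python
import time

CONTROL_HZ = 50
DT = 1.0 / CONTROL_HZ

def control_loop(policy, get_observation, send_joint_targets, duration_s=10.0):
    """Run policy inference at a fixed 50 Hz rate.

    policy:             callable mapping an observation vector to joint targets.
    get_observation:    callable returning the current observation
                        (proprioception plus mocap poses of base and object).
    send_joint_targets: callable forwarding targets to the low-level controller.
    """
    t_next = time.perf_counter()
    t_end = t_next + duration_s
    while time.perf_counter() < t_end:
        obs = get_observation()
        action = policy(obs)
        send_joint_targets(action)
        # Sleep until the next control tick to hold the 50 Hz rate.
        t_next += DT
        time.sleep(max(0.0, t_next - time.perf_counter()))
```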

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully developed and evaluated an RL framework that enables humanoid robots to perform whole-body embracing tasks for bulky objects. The framework innovatively integrates a pre-trained human motion prior with a Neural Signed Distance Field (NSDF) representation. The human motion prior, derived via a teacher-student architecture from large-scale human motion data, provides kinematically natural and physically feasible movement patterns. The NSDF offers precise geometric perception and contact awareness. This combination facilitates coordinated control of robot arms and torso, distributing contact forces across multiple points, significantly enhancing operational stability and load capacity. Experimental validation, including ablation studies, sim-to-sim transfer across diverse object properties (shapes, sizes, masses), and successful sim-to-real transfer on a Unitree H1-2 robot, demonstrates the approach's effectiveness, robustness, and practicality for complex multi-contact and long-horizon WBM tasks.

7.2. Limitations & Future Work

The paper does not explicitly dedicate a section to "Limitations & Future Work." However, some implicit limitations and potential areas for future research can be inferred from the discussion:

  • Performance with Heavily Loaded Cuboidal Objects: The paper notes "The slightly lower performance observed for heavier cuboidal objects can be attributed to their flat surface morphology." This suggests a limitation where the current policy might struggle more with non-curved surfaces under high loads, indicating a potential area for improvement in contact strategy or perception for such geometries.
  • Scalability to More Complex Environments/Objects: While generalization to diverse object properties is shown, the experiments are conducted with relatively simple, single bulky objects. Future work could involve more complex manipulation scenarios, such as interacting with multiple objects, navigating cluttered environments, or handling objects with irregular/deformable surfaces.
  • Generalization to Unseen Tasks: The framework is tailored for "embracing bulky objects." While it leverages a universal motion prior, adapting it to entirely new WBM tasks (e.g., carrying objects with different postures, pushing, opening doors while carrying) might still require significant RL fine-tuning.
  • Real-time NSDF Generation/Adaptation: The NSDF is pre-trained. For dynamic or highly deformable objects, real-time NSDF generation or adaptive NSDF perception might be necessary, which is a complex challenge.
  • Robustness to Perception Noise/Errors: The real-world experiment relies on a high-precision motion capture system. In environments without such systems, perception noise from onboard sensors (e.g., cameras, depth sensors) could affect performance. Future work could explore robustness to sensor noise or integration with less precise perception systems.
  • Human-Robot Collaboration: The paper focuses on autonomous manipulation. Extending WBM to human-robot collaborative carrying or interaction would introduce new challenges related to intent prediction and physical interaction safety.

7.3. Personal Insights & Critique

This paper presents a highly practical and well-engineered solution to a critical problem in humanoid robotics. The integration of human motion priors with NSDF-enhanced Reinforcement Learning is a powerful combination that addresses several long-standing challenges.

Inspirations:

  • Leveraging Human Intelligence: The teacher-student framework for motion prior distillation is particularly inspiring. It demonstrates a sophisticated way to infuse human-like dexterity and physical plausibility into RL policies, bypassing the often-unnatural or inefficient motions discovered by RL from scratch in high-dimensional spaces. This approach could be transferred to other complex motor skills beyond manipulation, such as complex locomotion or tool use.
  • Explicit Contact Awareness: The explicit use of NSDF for contact awareness is a significant step forward. Rather than hoping RL implicitly learns to manage contacts, providing continuous geometric perception directly in the observation space and reward function makes the learning process more directed and robust. This paradigm could be adapted to other contact-rich robotics tasks, such as assembly, pushing, or even robot-robot interaction.
  • Curriculum for Long-Horizon Tasks: The multi-stage random initialization is a simple yet effective curriculum learning strategy that is often overlooked. Its demonstrated impact on convergence speed and stability underscores the importance of carefully structuring the RL environment for complex sequential tasks.

Potential Issues or Areas for Improvement:

  • Computational Cost of NSDF: While neural SDFs are efficient, the real-time inference cost of the NSDF network $f_{\theta}$, queried over multiple robot links of a complex humanoid, could still be a factor on less powerful onboard computers. Further optimization or more efficient query strategies might be beneficial.

  • Generalization of Motion Prior: The human motion prior is distilled from a fixed dataset. While it provides kinematically natural movements, its generalization to highly novel or constrained manipulation postures not present in the mocap data could be a subtle limitation. Could the prior itself be adapted online or fine-tuned for specific contexts?

  • Reward Function Tuning: RL reward functions are notoriously difficult to tune. The paper presents a comprehensive set of rewards, but the specific weights (-1e-7, -2.5e-8, etc.) are critical and often require extensive trial and error. A more systematic approach to reward design or auto-tuning could enhance robustness.

  • Dynamic Object Properties: The NSDF helps with geometric perception for static object shapes. If the object were to deform significantly during manipulation (e.g., a bag of groceries), or if its properties changed unexpectedly, the current NSDF approach might need adaptation. Integrating deformable object modeling with NSDF could be a future extension.

  • Energy Efficiency: While the smoothness rewards contribute to energy efficiency, a more explicit energy consumption metric could be incorporated into the reward function for highly demanding WBM tasks, especially for battery-operated humanoid robots.

    Overall, this paper provides a robust and well-validated framework, pushing the boundaries of what humanoid robots can achieve in whole-body manipulation. Its strength lies in the intelligent combination of established techniques to solve a complex, practical problem, making it a valuable contribution to the field.
