Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning
TL;DR Summary
This paper introduces a reinforcement learning framework for humanoid robots to embrace bulky objects via whole-body manipulation. It combines human motion priors with neural signed distance fields to generate robust and natural motions, enhancing multi-contact stability and adaptability to objects of diverse shapes and sizes.
Abstract
Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping relying on end-effectors only remains limited in such scenarios due to inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation to achieve robust whole-body embracing. Our method leverages a teacher-student architecture to distill large-scale human motion data, generating kinematically natural and physically feasible whole-body motion patterns. This facilitates coordinated control across the arms and torso, enabling stable multi-contact interactions that enhance the robustness in manipulation and also the load capacity. The embedded NSDF further provides accurate and continuous geometric perception, improving contact awareness throughout long-horizon tasks. We thoroughly evaluate the approach through comprehensive simulations and real-world experiments. The results demonstrate improved adaptability to diverse shapes and sizes of objects and also successful sim-to-real transfer. These indicate that the proposed framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks of humanoid robots.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning
1.2. Authors
Chunxin Zheng, Kai Chen, Zhihai Bi, Yulin Li, Liang Pan, Jinni Zhou, Haoang Li, and Jun Ma, Senior Member, IEEE
1.3. Journal/Conference
The paper is published as a preprint on arXiv, dated 2025-09-16. As an arXiv preprint, it signifies a submission to a scientific repository, typically before or alongside peer review for a journal or conference. The designation "Senior Member, IEEE" for Jun Ma suggests that the authors are likely targeting a high-impact IEEE journal or conference in robotics and automation.
1.4. Publication Year
2025
1.5. Abstract
This paper presents a novel reinforcement learning (RL) framework for humanoid robots to perform whole-body manipulation (WBM) tasks, specifically focusing on embracing bulky objects. Traditional end-effector grasping is often limited in such scenarios due to stability and payload constraints. The proposed method integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation. It utilizes a teacher-student architecture to distill human motion data, generating natural and physically feasible whole-body motion patterns that enable coordinated control of arms and torso for stable multi-contact interactions, thereby enhancing robustness and load capacity. The NSDF provides accurate and continuous geometric perception, improving contact awareness for long-horizon tasks. The framework is evaluated through comprehensive simulations and real-world experiments, demonstrating improved adaptability to diverse object shapes and sizes, and successful sim-to-real transfer. The authors conclude that this framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks for humanoid robots.
1.6. Original Source Link
https://arxiv.org/abs/2509.13534 The paper is available as a preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2509.13534v1.pdf The paper is available as a PDF on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the challenge of enabling humanoid robots to manipulate bulky objects effectively. While humanoid robots are increasingly deployed in various domains, their ability to handle large and heavy items remains a significant hurdle.
This problem is important because traditional grasping methods that rely solely on end-effectors (like grippers) are inherently limited in stability and payload capacity when dealing with bulky objects. Humans, in contrast, naturally use their whole body—arms, torso, and even legs—to embrace and move such objects, distributing forces and maintaining balance. Replicating this whole-body manipulation (WBM) capability in humanoids would greatly expand their utility in industrial, logistics, and home service applications.
Specific challenges or gaps in prior research include:

- Perception: Accurately modeling the precise geometry and spatial relationships between the robot's complex body and objects, especially during contact-rich manipulation.
- Control Complexity: Managing the high degrees of freedom (DoF) of humanoid robots for coordinated, multi-contact interactions.
- Anthropomorphic Behavior: Generating natural, human-like, and physically feasible whole-body motion patterns.
- Stability: Maintaining balance during dynamic multi-contact interactions with heavy objects.
- Long-horizon Tasks: Coordinating a sequence of actions (approaching, embracing, transporting) that requires sustained planning and execution.
- Sim-to-Real Transfer: Bridging the gap between behaviors learned in simulation and their successful deployment in the real world.

The paper's innovative idea is to leverage Reinforcement Learning (RL) for WBM, specifically integrating a pre-trained human motion prior to guide natural movements and a neural signed distance field (NSDF) for precise contact awareness and geometric perception. This combination aims to overcome the limitations of traditional methods and existing RL approaches, which struggle with contact-intensive, dynamically unstable, and long-horizon WBM tasks.
2.2. Main Contributions / Findings
The paper makes several primary contributions:

- Novel RL Framework for Whole-Body Embracing: The authors propose the first RL framework (to their knowledge) that enables humanoid robots to actively engage with bulky objects using their entire body, including arms and torso, for manipulation. This addresses the limitations of end-effector-only grasping by enabling multi-contact interactions.
- Integration of Human Motion Prior: The framework incorporates a human motion prior into the embracing policy training pipeline. This motion prior, distilled from large-scale human motion data, accelerates policy training convergence for multi-contact and long-horizon tasks and allows robots to acquire anthropomorphic skills.
- NSDF for Enhanced Perception and Contact Awareness: A neural signed distance field (NSDF) representation of the humanoid robot is constructed. This NSDF precisely perceives robot-object interactions and is integrated into both the observation space and the reward function. This contact awareness guides the robot's upper body to maintain sustained contact, significantly enhancing manipulation robustness.
- Comprehensive Evaluation and Sim-to-Real Transfer: The proposed approach is thoroughly validated through extensive simulations and real-world experiments on a humanoid robot (Unitree H1-2). The results demonstrate robust performance in complex whole-body embracing tasks, including adaptability to diverse object shapes, sizes, and masses, as well as successful sim-to-real transfer.

The key findings are that the integrated framework:

- Enables stable multi-contact interactions, distributing contact forces across the body, which improves manipulation robustness and load capacity.
- Generates kinematically natural and physically feasible whole-body motion patterns.
- Provides accurate and continuous geometric perception, crucial for contact awareness in long-horizon tasks.
- Achieves high success rates when handling objects of varying sizes, masses, and shapes in simulation.
- Successfully transfers learned policies from simulation to a physical humanoid robot.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following concepts:

- Humanoid Robots: Robots designed to resemble the human body, typically with a torso, head, two arms, and two legs. They have many degrees of freedom (DoF), allowing for complex movements but also posing significant control challenges.
- Whole-Body Manipulation (WBM): A robotics paradigm where a robot uses multiple parts of its body (e.g., arms, torso, legs) and potentially environmental contacts to interact with and manipulate objects. This contrasts with end-effector-only manipulation, which relies solely on grippers. WBM is crucial for handling large, heavy, or irregularly shaped objects that exceed the capabilities of single-point grasping.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns an optimal policy (a mapping from states to actions) through trial and error.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm known for its stability and performance. PPO updates the policy in a way that avoids overly large policy changes in a single step, ensuring more stable training. It typically uses a clipping mechanism or an adaptive KL penalty to constrain policy updates.
- Variational Autoencoder (VAE): A type of generative model that learns a latent-space representation of data. A VAE consists of an encoder that maps input data to a distribution (mean and variance) in the latent space, and a decoder that reconstructs the input data from a sample drawn from this latent distribution. It encourages the latent space to follow a specific prior distribution (e.g., a standard normal distribution) using a Kullback-Leibler (KL) divergence loss. In this paper, it is used for motion prior distillation.
- Signed Distance Field (SDF): A mathematical representation of a 3D shape where each point in space is associated with the shortest distance to the surface of the shape. The sign of the distance indicates whether the point is inside (negative) or outside (positive) the shape. SDFs are useful for collision detection, path planning, and robot-environment interaction modeling.
- Neural Signed Distance Field (NSDF): An SDF whose distance function is approximated by a neural network. This allows for a continuous and differentiable representation of complex geometries, often more compact and memory-efficient than traditional voxel-based SDFs. In this paper, the NSDF is used for geometric perception and contact awareness.
- Motion Capture (Mocap) Data: Data recorded from human movements, typically using optical markers or inertial sensors. This data provides high-fidelity kinematic information about human poses and trajectories, which can be used to train robots to perform human-like motions.
- SMPL Model: A skinned multi-person linear model that represents the human body as a mesh with shape and pose parameters. It is widely used in computer graphics and vision to represent and animate human motion, often serving as the basis for processing mocap data.
- Teacher-Student Learning (Knowledge Distillation): A technique where a smaller, simpler "student" model learns to mimic the behavior of a larger, more complex "teacher" model. The teacher often has superior performance but may be too computationally expensive for deployment. The student learns from the teacher's outputs, distilling its knowledge into a more efficient form. In this paper, a teacher policy trained on mocap data is distilled into a VAE-based student policy to create a compact motion prior.
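To make the NSDF concept concrete, here is a minimal PyTorch sketch of a per-link neural signed distance function. The architecture, layer sizes, and the `LinkNSDF` name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LinkNSDF(nn.Module):
    """Minimal sketch of a neural signed distance field for a single robot link.

    Maps a 3D query point (in the link's local frame) to a scalar signed
    distance: negative inside the link mesh, positive outside.
    """
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) query positions -> (N,) signed distances
        return self.mlp(points).squeeze(-1)

# Usage: query the distance from a point (e.g., an object's center) to this link.
nsdf = LinkNSDF()
query_point = torch.tensor([[0.30, 0.05, 0.10]])  # hypothetical point in the link frame
distance = nsdf(query_point)                      # differentiable w.r.t. the query point
```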
3.2. Previous Works
The paper references several categories of previous works in Whole-Body Manipulation and Reinforcement Learning for humanoid robot control.

Whole-Body Manipulation (WBM)

- Traditional (Model-based) WBM: Early WBM approaches include [13] for whole-arm grasping on single-arm systems and [5], which incorporates contact models for fixed-base humanoid platforms (dual arms, objects, torso).
  - Limitation: These model-based methods often rely on simplified robot models (e.g., links as line segments) and struggle with the inherent complexity of contact-rich scenarios, where precise geometry and contact sequence determination are computationally expensive and difficult to specify. For example, Murooka et al. [7] worked on whole-body pushing manipulation but also faced high computational costs.
- Learning-based WBM: RL methods for WBM on bimanual platforms [9].
  - Enhancing contact perception with soft pneumatic sensors [8] provides richer tactile feedback.
  - More recent state-of-the-art approaches combine RL with whole-body tactile sensing for compliant and human-like motion generation, though they are often limited to upper-body platforms [11]. Zhang et al. [10] explore plan-guided reinforcement learning for WBM. Gu et al. [12] discuss humanoid locomotion and manipulation control problems. Chen et al. [4] discuss geometry as distance fields for WBM.

Reinforcement Learning for Humanoid Robot Control

- Task-driven Controllers: Early successes in RL for specific humanoid behaviors such as agile running [14] and locomotion across challenging terrains [15].
  - Limitation: These methods often lack generalization to complex, long-horizon scenarios and may not guarantee bionicity (naturalness) without explicit gait rewards.
- Behavior Cloning (BC) from Physics-based Animation: A significant trend where control policies are trained on large-scale human motion datasets (e.g., AMASS). This allows robots to imitate natural, human-like movements while maintaining physical feasibility [16], [17].
  - Examples: DeepMimic [16] pioneered example-guided RL for physics-based character skills. Perpetual Humanoid Control (PHC) [17] is a framework used for precise trajectory tracking. MaskedMimic [23] processes and filters mocap data for kinematically feasible trajectories.
  - Limitation: BC policies often require additional action generators [3], [18] or policy distillation [19] to adapt to specific tasks (locomotion, manipulation). They are also limited in interactive tasks due to the absence of environmental information (contact forces, terrain geometry) in human motion datasets.
- Universal Motion Controllers: Recent advancements focus on universal motion controllers that generalize across tasks, often using pre-trained BC policies to construct comprehensive action spaces. Luo et al. [20] proposed using human motion priors, distilled from BC policies, to create universal controllers that serve as a robust foundation for task-specific fine-tuning. He et al. [19] and Luo et al. [20] discuss policy distillation and universal humanoid motion representations. PULSE [20] uses a VAE-based distillation approach to encode diverse motion behaviors into a latent space.
  - Other methods combine text descriptions with pre-trained BC policies [21] or distill various robotics skill policies to build action spaces for universal humanoid robot controllers [22].
3.3. Technological Evolution
The field has evolved from purely model-based control, which required detailed physical models and explicit contact planning, to learning-based approaches, especially Reinforcement Learning. Early RL focused on task-specific skills, often struggling with generalization and human-like motion. The advent of behavior cloning from human motion data marked a significant step towards anthropomorphic behaviors and physical feasibility. However, adapting these mimicry policies to interactive, contact-rich manipulation tasks remained challenging due to the lack of environmental context in mocap data.
This paper fits into the latest wave by combining the strengths of RL (for task adaptation and interaction) with human motion priors (for natural and stable kinematics) and enhancing perception through NSDFs (for precise contact awareness). This represents a step towards creating truly versatile and robust humanoid WBM capabilities, moving beyond simplified models and single-task policies towards more generalizable and human-like interaction.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

- Integrated RL Framework for Whole-Body Embracing: While WBM has been explored, this paper is, to the authors' knowledge, the first RL framework specifically designed for whole-body embracing of bulky objects by humanoid robots. Many previous RL-based WBM works focused on upper-body or bimanual manipulation, not full whole-body multi-contact embracing.
- Synergistic Combination of Human Motion Prior and NSDF:
  - Human Motion Prior: Unlike generic RL, which may struggle to find physically feasible and natural movements in high-dimensional DoF spaces, the pre-trained human motion prior provides a strong inductive bias. It distills complex human motion into a compact latent action space, making RL training more efficient and yielding anthropomorphic skills. This is an evolution from simple behavior cloning, because the prior is actively integrated into the RL policy's action generation rather than used only for trajectory tracking or as a separate action generator.
  - NSDF for Contact Awareness: Model-based WBM often simplifies robot geometry or requires explicit contact point specification, leading to inaccuracies, while generic RL might learn contacts implicitly but less reliably. The NSDF explicitly provides accurate and continuous geometric perception of the robot's own body in relation to the object. It is integrated directly into both the observation space and the reward function, actively guiding the robot to establish and maintain stable multi-contacts, a level of contact awareness often missing in prior RL approaches.
- Addressing Long-Horizon Tasks with Multi-stage Random Initialization: The paper explicitly tackles the difficulty of long-horizon tasks by introducing a multi-stage random initialization strategy. This significantly improves convergence speed and stability compared to fixed initialization, ensuring that the RL agent learns all phases of the embracing task (approaching, embracing, transporting) effectively.
- Robustness and Generalization: The combined approach demonstrates superior adaptability to diverse object shapes, sizes, and masses in simulation, as well as successful sim-to-real transfer, indicating a more practical and robust solution than many prior RL methods that might struggle with variations outside the training distribution.

In essence, the paper differentiates itself by cohesively integrating kinematic plausibility (from the motion prior) with geometric accuracy and contact stability (from the NSDF) within a Reinforcement Learning framework designed for the complex, multi-contact, and long-horizon problem of embracing bulky objects with humanoid robots.
4. Methodology
4.1. Principles
The core idea of the method is to enable humanoid robots to perform whole-body manipulation (WBM) tasks, specifically embracing bulky objects, by leveraging the strengths of Reinforcement Learning (RL), human motion priors, and advanced geometric perception. The theoretical basis and intuition behind this approach are:

- Anthropomorphic Motion: Humans naturally use their entire body for such tasks. By distilling human motion data into a motion prior, the robot can learn kinematically natural and physically feasible movements, which are often more stable and efficient than motions discovered purely by trial-and-error RL in a high-dimensional state-action space. This prior acts as a strong inductive bias, simplifying the RL search.
- Robust Contact Management: Whole-body embracing relies heavily on stable multi-contact interactions, which traditional RL may struggle to reliably establish and maintain. By introducing a Neural Signed Distance Field (NSDF), the robot gains continuous and accurate contact awareness, allowing it to actively perceive and optimize its body's proximity and contact with the object. This NSDF information is fed directly into the RL agent's observation space and reward function, explicitly guiding it toward stable multi-contact states.
- Hierarchical Control and Efficiency: The framework implicitly creates a hierarchical control structure. The motion prior handles the low-level, kinematically natural movements, while the task-specific RL policy focuses on adapting these movements to the embracing goal, responding to environmental cues (object position, target location, NSDF features). This distillation process makes RL training more efficient by starting from a behaviorally plausible base.
- Addressing Long-Horizon Tasks: RL often struggles with long-horizon tasks due to sparse rewards and exploration challenges. The multi-stage random initialization strategy explicitly addresses this by exposing the agent to various intermediate states, making the learning process more robust and efficient across the different phases of the task (approaching, embracing, transporting).
4.2. Core Methodology In-depth (Layer by Layer)
The complete system architecture is illustrated in Fig. 2. The methodology is broken down into three main phases: Data Processing, Motion Prior Distillation, and Embracing Policy Training.
The image is a schematic representation of the training framework for whole-body embracing policy, divided into three parts: (a) data processing, (b) motion prior distillation, and (c) embracing policy training. By cleaning motion capture data and utilizing a teacher-student architecture, it generates natural and feasible whole-body motion patterns, enhancing the stability and load capacity of multi-contact interactions.
4.2.1. Data Processing
To enable humanoid robots to acquire fundamental movement skills, the initial step involves processing human motion capture (mocap) data.

- Raw Mocap Data to SMPL: Raw mocap data, typically represented using the SMPL (Skinned Multi-Person Linear) human body model, is first evaluated.
- Filtering and Cleaning: This data undergoes processing and filtering to identify kinematically feasible trajectories. An example of such processing is MaskedMimic [23], which produces a cleaned human motion dataset containing only physically plausible human movement patterns.
- Retargeting to Robot Platform: The retargeting framework from H2O [24] is then used to transfer these human motions onto the target humanoid robot platform. This process explicitly accounts for differences in body proportions and joint ranges of motion between the human and the robot.
- Robot Motion Dataset: The output of this retargeting process is the robot motion dataset, which provides validated trajectories that serve as training targets for the subsequent mimic teacher policy.
4.2.2. Motion Prior Distillation
The goal of this module is to distill motion priors from the robot motion dataset. These priors should retain human-like dexterity while ensuring physical feasibility for humanoid robots. This is achieved using a teacher-student learning scheme similar to PULSE [20].
4.2.2.1. Teacher Policy
A motion-tracking teacher policy is trained with the PHC (Perpetual Humanoid Control) framework [17] on the retargeted robot motion dataset. This policy learns to precisely track motion trajectories.

The teacher policy maps the current robot state and a reference motion (goal state) to an action:
$ \mathbf{a}_t^* = \pi_{\text{teacher}}\left(\mathbf{s}_t^p, \mathbf{s}_t^g\right) $
Where:

- $\mathbf{a}_t^*$ is the action generated by the teacher policy at time $t$.
- $\pi_{\text{teacher}}$ is the teacher policy, implemented as a multi-layer perceptron (MLP).
- $\mathbf{s}_t^p$ is the proprioceptive state of the robot at time $t$, defined as:
$ \mathbf{s}_t^p = \left\{ \mathbf{v}_t, \omega_t, \mathbf{g}_t, \mathbf{q}_t, \dot{\mathbf{q}}_t, \mathbf{a}_{t-1}^* \right\} \in \mathbb{R}^{90} $
  - $\mathbf{v}_t$: Root linear velocity of the robot.
  - $\omega_t$: Root angular velocity of the robot.
  - $\mathbf{g}_t$: Projected gravity vector, indicating the robot's orientation relative to gravity.
  - $\mathbf{q}_t$: Joint positions (angles) of the robot's 27 joints.
  - $\dot{\mathbf{q}}_t$: Joint velocities of the robot's 27 joints.
  - $\mathbf{a}_{t-1}^*$: The previous action generated by the teacher policy.
- $\mathbf{s}_t^g$ is the goal state, defined as:
$ \mathbf{s}_t^g = \left\{ \widehat{\mathbf{p}}_{t+1} - \mathbf{p}_t, \widehat{\mathbf{p}}_{t+1} \right\} \in \mathbb{R}^{2 \times 3 \times 27} $
  - $\widehat{\mathbf{p}}_{t+1} - \mathbf{p}_t$: Positional offset, the difference between the target position of each rigid body at $t+1$ and its current position at $t$.
  - $\widehat{\mathbf{p}}_{t+1}$: Target configurations (positions) for each of the robot's rigid bodies at $t+1$, taken from the reference motion trajectories in the dataset.

The policy is optimized using Proximal Policy Optimization (PPO) [25].
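To make the teacher policy's input/output interface concrete, the following is a minimal sketch (not the authors' code) of how the proprioceptive and goal states could be assembled and fed to an MLP policy. Array dimensions follow the definitions above, while the helper and policy names are illustrative assumptions.

```python
import numpy as np

NUM_JOINTS = 27          # joint positions, velocities, and action dimensions
NUM_BODIES = 27          # rigid bodies whose positions are tracked

def build_proprioceptive_state(v, omega, gravity, q, qdot, prev_action):
    """Concatenate s_t^p = {v, omega, g, q, q_dot, a_{t-1}} -> R^90 (3+3+3+27+27+27)."""
    return np.concatenate([v, omega, gravity, q, qdot, prev_action])

def build_goal_state(target_body_pos, current_body_pos):
    """Concatenate s_t^g = {p_hat_{t+1} - p_t, p_hat_{t+1}} -> R^(2*3*27)."""
    offset = target_body_pos - current_body_pos          # (27, 3)
    return np.concatenate([offset.ravel(), target_body_pos.ravel()])

# Hypothetical usage with a trained MLP policy `teacher_mlp`:
# s_p = build_proprioceptive_state(v, omega, g, q, qdot, a_prev)   # shape (90,)
# s_g = build_goal_state(p_hat_next, p_now)                        # shape (162,)
# action = teacher_mlp(np.concatenate([s_p, s_g]))                 # joint-space action
```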
4.2.2.2. Humanoid Motion Prior Distillation
After training the teacher policy, its complex action space is distilled into a compact latent representation using a VAE-based approach, specifically following PULSE [20]. This latent space is designed to naturally induce human-like behavior and to be efficiently leveraged for downstream tasks.

A variational encoder-decoder framework is constructed to represent the humanoid motion space:

- Encoder: Infers a distribution over the latent motion variable $\mathbf{z}_t$ given the current proprioceptive state $\mathbf{s}_t^p$ and the goal state $\mathbf{s}_t^g$.
- Decoder: Maps the latent code back to the action space.
- Prior Network: A learnable prior network that approximates the latent code distribution produced by the encoder.

To facilitate sim-to-real transfer, the decoder input is designed to exclude the root linear velocity component from $\mathbf{s}_t^p$. This makes the representation more feasible for real-world scenarios, where precise linear velocity can be hard to estimate reliably. The specific distributions are defined as:
$ \mathcal{E}\left(\mathbf{z}_t \mid \mathbf{s}_t^p, \mathbf{s}_t^g\right) = \mathcal{N}\left(\boldsymbol{\mu}_t^e, \boldsymbol{\sigma}_t^e\right), \quad \mathcal{P}\left(\mathbf{z}_t \mid \mathbf{s}_t^d\right) = \mathcal{N}\left(\boldsymbol{\mu}_t^p, \boldsymbol{\sigma}_t^p\right), \quad \mathcal{D}\left(\mathbf{a}_t \mid \mathbf{s}_t^d, \mathbf{z}_t\right) = \mathcal{N}\left(\boldsymbol{\mu}_t^d, \boldsymbol{\Sigma}\right) $
Where:

- $\mathcal{E}$: The encoder, outputting a Gaussian distribution for the latent code.
  - $\mathbf{z}_t$: The latent motion variable at time $t$.
  - $\mathbf{s}_t^p$: The proprioceptive state (as defined for the teacher policy).
  - $\mathbf{s}_t^g$: The goal state (as defined for the teacher policy).
  - $\mathcal{N}(\boldsymbol{\mu}_t^e, \boldsymbol{\sigma}_t^e)$: A normal distribution with mean and standard deviation produced by the encoder.
- $\mathcal{P}$: The learnable prior network, approximating the latent code distribution.
  - $\mathbf{s}_t^d$: The decoder input state used for sim-to-real transfer. It excludes the root linear velocity from $\mathbf{s}_t^p$ and contains:
    - $\omega_t$: Root angular velocity.
    - $\mathbf{g}_t$: Projected gravity vector.
    - $\mathbf{q}_t$: Joint positions.
    - $\dot{\mathbf{q}}_t$: Joint velocities.
    - $\mathbf{a}_{t-1}$: The previous action reconstructed by the student policy.
  - $\mathcal{N}(\boldsymbol{\mu}_t^p, \boldsymbol{\sigma}_t^p)$: A normal distribution with mean and standard deviation produced by the prior network.
- $\mathcal{D}$: The decoder, outputting a Gaussian distribution for the action $\mathbf{a}_t$.
  - $\mathbf{a}_t$: The action reconstructed by the student policy.
  - $\mathcal{N}(\boldsymbol{\mu}_t^d, \boldsymbol{\Sigma})$: A normal distribution with mean $\boldsymbol{\mu}_t^d$ and a fixed diagonal covariance $\boldsymbol{\Sigma}$.

The overall training objective for the encoder, decoder, and prior is:
$ \mathcal{L} = \underbrace{\left\| \mathbf{a}_t^* - \mathbf{a}_t \right\|_2^2}_{\mathcal{L}_{\text{rec}}} + \alpha \underbrace{\left\| \boldsymbol{\mu}_t^e - \boldsymbol{\mu}_{t-1}^e \right\|_2^2}_{\mathcal{L}_{\text{reg}}} + \beta \underbrace{D_{\mathrm{KL}}\left( \mathcal{E}\left(\mathbf{z}_t \mid \mathbf{s}_t^p, \mathbf{s}_t^g\right) \,\|\, \mathcal{P}\left(\mathbf{z}_t \mid \mathbf{s}_t^d\right) \right)}_{\mathcal{L}_{\text{KL}}} $
Where:

- $\mathcal{L}_{\text{rec}}$: The action reconstruction loss, the squared Euclidean distance between the action $\mathbf{a}_t^*$ generated by the teacher policy and the action $\mathbf{a}_t$ reconstructed by the student policy. This term encourages the student to mimic the teacher's output.
- $\mathcal{L}_{\text{reg}}$: The regularization term, which penalizes abrupt changes in the latent trajectory. It enforces temporal consistency by minimizing the squared Euclidean distance between the mean of the encoder's output distribution at time $t$ ($\boldsymbol{\mu}_t^e$) and at time $t-1$ ($\boldsymbol{\mu}_{t-1}^e$).
- $\mathcal{L}_{\text{KL}}$: The Kullback-Leibler (KL) divergence term, which aligns the encoder's latent distribution with the learned prior, encouraging the latent space to follow the distribution learned by the prior network.
- $\alpha$ and $\beta$: Coefficients that coordinate the relative importance of the smoothness regularization and KL regularization terms, respectively.

Once this distillation is complete, the parameters of the decoder $\mathcal{D}$ and the prior $\mathcal{P}$ are frozen. This defines a new, compact, low-dimensional action space that inherently generates human-like movements and is then used for downstream task learning.
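For intuition, here is a minimal PyTorch-style sketch of the distillation objective described above (action reconstruction, latent smoothness, and a KL term against the learned prior). The module and variable names, the diagonal-Gaussian parameterization, and the default weights are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def distillation_loss(encoder, prior, decoder, s_p, s_g, s_d, a_teacher,
                      mu_enc_prev, alpha=0.1, beta=0.01):
    """One-step distillation loss: reconstruction + latent smoothness + KL(encoder || prior).

    encoder(s_p, s_g) -> (mu_e, sigma_e); prior(s_d) -> (mu_p, sigma_p);
    decoder(s_d, z) -> reconstructed action. alpha/beta are weighting coefficients.
    """
    mu_e, sigma_e = encoder(s_p, s_g)
    mu_p, sigma_p = prior(s_d)

    z = Normal(mu_e, sigma_e).rsample()          # reparameterized latent sample
    a_student = decoder(s_d, z)                  # reconstructed action

    rec = F.mse_loss(a_student, a_teacher)                        # mimic the teacher
    reg = F.mse_loss(mu_e, mu_enc_prev)                           # temporal smoothness of the latent
    kl = kl_divergence(Normal(mu_e, sigma_e),
                       Normal(mu_p, sigma_p)).sum(-1).mean()      # align encoder with prior

    return rec + alpha * reg + beta * kl, mu_e.detach()           # pass mu_e to the next step
```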
4.2.3. Embracing Policy Training
This section formulates the task policy for the WBM task, specifically for bulky objects. The task is divided into three stages: approaching the object, embracing the object, and transporting the object to the target location. A tailored PPO algorithm is used, incorporating stage randomized initialization, specific rewards, and the humanoid motion prior distribution.
4.2.3.1. Observations and Actions
The task observation at time step $t$ is a concatenation of the decoder input (from the motion prior distillation) and a set of task-specific features:

- The decoder input state $\mathbf{s}_t^d$ (as defined in Motion Prior Distillation), which contains proprioceptive information: angular velocity, projected gravity vector, joint positions and velocities, and the previous action.
- The task-specific features, consisting of:
  - The distance vector between the robot torso and the center of the object (box).
  - The direction angle difference between the robot torso and the center of the object (box).
  - The distance vector between the robot torso and the final target position.
  - The direction angle difference between the robot torso and the final target position.
  - The NSDF features, computed by a pre-trained network that evaluates the shortest distances between a selected target point (the center of the object) and a set of mesh segments (the robot's upper links).

The NSDF provides continuous geometric perception to improve contact awareness. As illustrated in Fig. 3, dots along the robot's links represent the closest points to the object's center, and dashed lines show the NSDF direction.

Fig. 3. Illustration of NSDF between the target object and the robot's upper body. The dots distributed along each link of the robot represent the closest point from the corresponding joint to the object's center of mass. The dashed lines indicate the direction of the NSDF between the object's center of mass and the robot's links.

The NSDF features are the outputs of the pre-trained NSDF network evaluated for the robot's upper links, where:

- The NSDF features vector collects the signed distances from the target point to each upper-body link.
- The pre-trained neural network represents the NSDF of the robot's body.
- The target point for NSDF evaluation is chosen as the center of the object (box).

The task policy operates in this latent space, producing a latent action given the current task state, where:

- The latent action is the output of the task policy.
- The task policy is the policy trained for the WBM task.
- The full task observation is the concatenation described above.
- The final robot actions are produced by the frozen decoder from the motion prior distillation.
- The motion prior is generated by the learnable prior network (the mean of its output distribution). The latent action is added to the motion prior to form a new latent token, which is then decoded by the frozen decoder into the final robot actions. This approach allows the task policy to modulate the human-like motions provided by the prior to achieve the specific task goal.
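A minimal sketch of how these pieces could fit together at inference time, under the assumption of per-link NSDF networks and a frozen PULSE-style prior and decoder; all function names, interfaces, and tensor shapes here are illustrative, not the paper's API.

```python
import torch

def nsdf_features(nsdf_nets, link_poses, object_center):
    """Signed distance from the object's center to each upper-body link.

    nsdf_nets: list of per-link NSDF networks; link_poses: list of (rotation, translation)
    pairs used to express the query point in each link's local frame.
    """
    feats = []
    for net, (rot, trans) in zip(nsdf_nets, link_poses):
        local_pt = rot.T @ (object_center - trans)     # query point in the link frame
        feats.append(net(local_pt.unsqueeze(0)).squeeze(0))
    return torch.stack(feats)                          # one distance per link

@torch.no_grad()
def embracing_policy_step(task_policy, prior, decoder, s_d, task_feats):
    """Task policy acts in the latent space; its output is added to the frozen prior's mean."""
    obs = torch.cat([s_d, task_feats], dim=-1)         # full task observation
    z_task = task_policy(obs)                          # latent action
    mu_prior, _ = prior(s_d)                           # motion prior (mean of prior distribution)
    latent = mu_prior + z_task                         # modulate the human-like prior
    return decoder(s_d, latent)                        # final joint-space action
```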
4.2.3.2. Reward Design
Multiple reward functions are designed to train the humanoid robot for the WBM task. These rewards encourage smoothness, enforce physical limitations, and guide task-specific behaviors. The weights for these rewards are detailed in Table II.

The following are the results from Table II of the original paper:

| Reward Terms | Definition | Weight |
| --- | --- | --- |
| Torque | $\lVert \boldsymbol{\tau} \rVert$ | -1e-7 |
| Joint Acceleration | $\lVert \ddot{\mathbf{q}} \rVert^2$ | -2.5e-8 |
| Action Rate | $\lVert \mathbf{a}_{t-1} - \mathbf{a}_t \rVert^2$ | -0.5 |
| Feet Slippage | $\sum \lVert \mathbf{v}_{\text{foot}} \rVert \cdot \mathbb{1}(\lvert \mathbf{f}_{\text{contact}} \rvert > 1)$ | -0.05 |
| Feet Contact Force | $\sum \max(0, \lvert \mathbf{f}_{\text{contact}} \rvert - f_{\text{th}})$ | -1e-5 |
| Joint Angle Limitation | $\sum -\min(0, \mathbf{q} - \mathbf{q}_{\text{lower}}) + \max(0, \mathbf{q} - \mathbf{q}_{\text{upper}})$ | -1e-3 |

Here $f_{\text{th}}$ denotes the contact force threshold.
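As an illustration of how the Table II terms combine into a single scalar penalty at each control step, here is a small sketch; the weights mirror the table, while the threshold value, shapes, and all variable names are assumptions.

```python
import numpy as np

WEIGHTS = {"torque": -1e-7, "joint_acc": -2.5e-8, "action_rate": -0.5,
           "feet_slip": -0.05, "feet_force": -1e-5, "joint_limit": -1e-3}

def regularization_reward(tau, qacc, a_prev, a_curr, v_foot, f_contact,
                          q, q_lower, q_upper, f_threshold=350.0):
    """Weighted sum of the smoothness / physical-limitation penalties from Table II."""
    terms = {
        "torque": np.linalg.norm(tau),
        "joint_acc": np.sum(qacc ** 2),
        "action_rate": np.sum((a_prev - a_curr) ** 2),
        # penalize foot velocity only while that foot is in contact
        "feet_slip": np.sum(np.linalg.norm(v_foot, axis=-1) * (np.abs(f_contact) > 1.0)),
        "feet_force": np.sum(np.maximum(0.0, np.abs(f_contact) - f_threshold)),
        "joint_limit": np.sum(-np.minimum(0.0, q - q_lower) + np.maximum(0.0, q - q_upper)),
    }
    return sum(WEIGHTS[k] * v for k, v in terms.items())
```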
- Smoothness Reward: This reward category encourages the policy to generate smooth and physically plausible actions.
  - Torque penalty: Penalizes high joint torques, preventing jerky or energy-inefficient movements.
  - Joint acceleration penalty: Penalizes large actuator accelerations, contributing to smoother motion.
  - Action rate penalty: Penalizes rapid changes in action values between consecutive steps, further promoting smoothness.
- Physical Limitation Reward: These rewards are introduced to protect the robot during deployment and ensure stable operation.
  - Joint angle limitation: Encourages joint angles to stay within their physical limits and discourages configurations near the joint boundaries (via penalties for exceeding the lower or upper limits).
  - Feet slippage penalty: Enhances the stability of the robot by penalizing foot velocity while the foot is in contact with the ground.
  - Feet contact force penalty: Prevents the robot from stepping too heavily on the ground by limiting the contact force on the robot's feet, leading to smoother movements.
- Task Reward: The WBM task is decomposed into three distinct stages, and the task reward changes dynamically based on the current stage. A circular pick-up zone centered around the object acts as the transition boundary between approaching and manipulation. An annular region around it serves as the initial spawning area for the robot and as the target delivery region for the object.
  - First Stage: Approaching the object (robot outside the pick-up zone).
    - The walking reward encourages the robot to move closer to the object; the carrying and arm-adjustment rewards are set to zero in this stage.
    - The walking reward consists of: a constant reward of 1 received once the robot enters the pick-up zone; an exponential reward component for reducing the distance to the object; an exponential reward component for aligning the torso direction towards the object; and an exponential reward component on the robot's body velocity, encouraging forward movement, combined through a scaling factor.
  - Second and Third Stages: Embracing and transporting the object (robot inside the pick-up zone).
    - The carrying reward encourages the object to move closer to the final target. It is zero outside the pick-up zone; inside, it consists of an exponential reward component for reducing the distance of the object to the final target, an exponential reward component for aligning the object's direction towards the final target, and an exponential reward component on the object's velocity, encouraging it to move, combined through a scaling factor.
    - Additionally, an arm-adjustment reward shapes the arm posture: it encourages the left-hand and right-hand end-effectors to stay close to the object and encourages the hands to maintain an appropriate height and posture relative to the object, again with a scaling factor.
    - A contact-related term, derived from the NSDF features in the observation space, is crucial for guiding multi-contact interactions: it rewards sustained contact and a proper embrace posture based on the NSDF values.
4.2.3.3. Random Initialization
To effectively address long-horizon tasks, the training environment uses randomized initialization of scenarios, categorized into three types:

- Approaching Phase Training: The target object is placed on a table, and the robot's position is randomly initialized within the initial spawning area (the annular region). This scenario primarily trains the robot to approach the object.
- Object Pickup Phase Training: The object remains on the table, but the robot's initial state is configured such that its upper joints are in a pre-hugging posture, and the robot is initialized directly in the pick-up zone. This setup specifically optimizes training for the embracing and lifting phase.
- Transport Phase Training: The robot's joints are directly initialized in a hugging posture, with the object already generated in its arms. This configuration facilitates training for the transporting phase, where the robot carries the object to the target location.

During training, one of these three scenarios is randomly selected upon resetting the environment. This strategy ensures comprehensive learning across all task stages and helps overcome the challenges associated with sparse rewards in long-horizon tasks.
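A minimal sketch of this reset logic, assuming a parallel-environment API in the style of Isaac-based RL frameworks; the scenario names and helper methods are illustrative, not the authors' code.

```python
import random

SCENARIOS = ("approach", "pickup", "transport")

def reset_environment(env):
    """Multi-stage random initialization: pick one task stage per episode reset."""
    scenario = random.choice(SCENARIOS)
    if scenario == "approach":
        env.place_object_on_table()
        env.spawn_robot_in_annulus()          # random pose in the spawning ring
    elif scenario == "pickup":
        env.place_object_on_table()
        env.spawn_robot_in_pickup_zone(arm_pose="pre_hugging")
    else:  # "transport"
        env.spawn_robot_in_pickup_zone(arm_pose="hugging")
        env.place_object_in_arms()            # start with the object already embraced
    return env.get_observation()
```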
5. Experimental Setup
5.1. Datasets
The training environment for the proposed RL framework is constructed using the Isaac Sim simulator. This simulator provides a high-fidelity physics environment suitable for robotics simulation and RL training. For efficient data collection, 4096 parallel agents are deployed within this environment, allowing for rapid exploration and learning.

The target object used for training is a cylinder of fixed size (42 cm in diameter and 40 cm in height, the training dimensions listed in Table III).

The task setup is as follows:

- Initial Object Placement: The object is initially placed 4 meters away from the robot.
- Task Goal: The robot is required to embrace the object and transport it to a target location situated 4 meters away from the object's original position. This defines an integrated whole-body manipulation task.

For sim-to-sim transfer experiments in MuJoCo, the policy was tested on objects with diverse sizes, masses, and geometries, even though it was trained solely on the fixed-size cylindrical object. This demonstrates the generalization capability of the policy. An illustration is shown in Fig. 6, where various object primitives (cylinder, cuboid, sphere) are placed on pedestals, with colors indicating training versus unseen dimensions.

Fig. 6. Illustration of sim-to-sim experiments in MuJoCo. The illustration compares the manipulation performance across three common object primitives: a cylinder, a cuboid, and a sphere. Objects are varied in both shape and weight to evaluate generalization. The red objects match the dimensions used during training, while those in blue represent unseen shapes for generalization testing. The green columns serve as pedestals for initial object placement.
5.2. Evaluation Metrics
The primary evaluation metric used in the experiments is the Success Rate.

- Conceptual Definition: The success rate quantifies the percentage of trials in which the robot successfully completes the entire whole-body manipulation (WBM) task. For a trial to be considered successful, two conditions must be met:
  - The object's center of mass must remain within a specified tolerance of the target location.
  - The object must not fall during task execution.
  This metric provides a direct measure of the policy's robustness, reliability, and ability to achieve the ultimate task objective.
- Mathematical Formula: The paper does not provide a specific mathematical formula for the success rate, but conceptually it is calculated as:
$ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  Where:
  - Number of Successful Trials: The count of experimental runs where the object reached the target location within tolerance and did not fall.
  - Total Number of Trials: The total number of experimental runs conducted for a given condition.

For statistical reliability, success rates are computed over 30 independent trials for each test condition.
5.3. Baselines
Due to the absence of suitable open-source baselines for the specific domain of whole-body embracing of bulky objects by humanoid robots using RL, the authors primarily conduct comparative experiments through ablation studies to validate the effectiveness of their proposed modules. The baselines for comparison are essentially versions of their own framework with key components removed:

- Ablation Study 1: Without the NSDF Module:
  - Description: This baseline evaluates the framework's performance when the Neural Signed Distance Field (NSDF)-related observations are removed from the input space and the NSDF reward term is disabled.
  - Representativeness: This baseline is representative for understanding the critical role of the geometric perception and contact awareness provided by the NSDF in achieving stable multi-contact manipulation.
- Ablation Study 2: Without Randomized Initialization:
  - Description: This baseline evaluates the framework's performance when the multi-stage random initialization strategy is replaced with deterministic initialization from a fixed initial state at the beginning of each episode.
  - Representativeness: This baseline highlights the importance of effective state-space exploration and curriculum-like training for RL policies that must handle long-horizon tasks with sequential stages (approaching, embracing, transporting).

These ablation baselines effectively demonstrate the individual contributions and necessity of the NSDF module and the randomized initialization strategy to the overall performance and convergence efficiency of the proposed framework.
6. Results & Analysis
The evaluation of the proposed framework is conducted through both simulation and real-world experiments. The training environment is Isaac Sim, utilizing 4096 parallel agents. The target object is the training cylinder (42 cm diameter, 40 cm height), which the robot must transport 4 meters. Experiments are run on an NVIDIA RTX 4080 GPU.
6.1. Core Results Analysis
6.1.1. Effectiveness of the NSDF Module
The ablation study on the NSDF module clearly demonstrates its critical role in enabling effective whole-body manipulation.

- With NSDF (Fig. 4a): After training for the same number of epochs, the policy with the NSDF module successfully establishes multi-point contact using multiple upper-body joints during object transportation. This increases the number of contact points with the object, significantly enhancing manipulation stability, which is particularly crucial for bulky items.
- Without NSDF (Fig. 4b): In contrast, the policy trained without the NSDF module shows inferior performance. During task execution, only the torso tends to move close to the object, while the arms fail to properly embrace it. This lack of coordinated multi-contact ultimately results in an inability to complete the transportation task. The visual comparison highlights that the NSDF provides the geometric perception and contact awareness needed for the robot to actively engage its entire upper body in a stable embrace.
Fig. 4. Effectiveness of the NSDF module. (a) Results with the NSDF module. (b) Results without the NSDF module. The red circles indicate contact regions. As shown, the model with NSDF generates more contact points during manipulation, significantly enhancing grasping stability and success rate.
6.1.2. Evaluation of Randomized Initialization
The randomized initialization strategy significantly improves the training process for long-horizon tasks.

- Proposed Method (red curve in Fig. 5): The policy trained with multi-stage random initialization demonstrates significantly improved convergence speed and stability throughout training. The reward curve shows a smoother and more consistent increase, reaching higher values. This indicates that the strategy effectively promotes balanced learning across all task phases (approaching, embracing, transporting).
- Baseline (blue curve in Fig. 5): The baseline, which uses deterministic initialization from a fixed initial state, exhibits slower reward growth during the initial phase of training and greater instability in its reward progression. This suggests that without diversified initial states, the RL agent struggles more with exploration and with integrating learning across the different stages of the long-horizon task.
Fig. 5. Comparison of mean rewards in the ablation study on multi-stage random initialization. The red curve represents the policy with the proposed multi-stage random initialization, while the blue curve denotes the baseline without this mechanism. The policy with initialization (red) demonstrates smoother convergence and higher stability throughout training. In contrast, the baseline (blue) exhibits slower reward growth during the initial phase and greater instability.
6.1.3. Adaptability to Different Object Properties
The policy's robustness and generalization capability were evaluated through sim-to-sim transfer experiments in MuJoCo on objects with diverse sizes, masses, and geometries. Success rate is measured over 30 independent trials for each condition.
The following are the results from Table III of the original paper:
| Object | Size [cm] | Mass [kg] | Success rate [%] | NSDF |
| --- | --- | --- | --- | --- |
| Cylinder | Φ42 × 40 | 3 | 0 | w/o |
| Cuboid | 42 × 42 × 42 | 3 | 0 | w/o |
| Sphere | Φ42 | 3 | 0 | w/o |
| Cylinder | Φ42 × 40 | 1 | 100 | w/ |
| Cylinder | Φ42 × 40 | 3 | 100 | w/ |
| Cylinder | Φ42 × 40 | 7 | 93 | w/ |
| Cylinder | Φ50 × 40 | 3 | 100 | w/ |
| Cuboid | 42 × 42 × 42 | 1 | 100 | w/ |
| Cuboid | 42 × 42 × 42 | 3 | 100 | w/ |
| Cuboid | 42 × 42 × 42 | 7 | 80 | w/ |
| Cuboid | 30 × 30 × 30 | 3 | 100 | w/ |
| Sphere | Φ42 | 1 | 100 | w/ |
| Sphere | Φ42 | 3 | 100 | w/ |
| Sphere | Φ42 | 7 | 87 | w/ |
| Sphere | Φ30 | 3 | 100 | w/ |
Key observations from Table III:

- Necessity of NSDF: When the NSDF module is absent (w/o NSDF), the policy completely fails, achieving a 0% success rate across all object types (cylinder, cuboid, sphere) at the standard mass of 3 kg. This strongly reiterates the essential role of the NSDF for stable manipulation, confirming its importance as seen in the ablation study.
- Strong Sim-to-Sim Transfer and Generalization (with NSDF):
  - Shape Adaptability: With the NSDF, the policy achieves a 100% success rate for objects of standard mass (3 kg) and size across all three tested primitives: cylinders (Φ42 × 40 cm), cuboids (42 × 42 × 42 cm), and spheres (Φ42 cm). This demonstrates excellent generalization to unseen shapes.
  - Mass Robustness: The policy shows notable robustness to mass variations, maintaining high success rates even at 7 kg: 93% for cylinders, 87% for spheres, and 80% for cuboids. The slight decrease for heavier objects is expected, but performance remains strong. The authors attribute the lower performance on heavier cuboidal objects to their flat surface morphology, which may offer less stable multi-contact points than curved surfaces under high load.
  - Size Adaptability: Variations in object size do not degrade performance. The policy achieves 100% success for a larger cylinder (Φ50 × 40 cm) and a smaller sphere (Φ30 cm) at the standard mass of 3 kg. This confirms the policy's adaptability beyond the training distribution of object dimensions.

Overall, the simulation results demonstrate that the proposed framework, especially with the NSDF module, is highly robust and generalizable to significant variations in object properties (shape, size, mass), which is a crucial aspect for practical deployment.
6.1.4. Real-World Experiment
The efficacy of the trained policy is further validated through a real-world experiment on the Unitree H1-2 humanoid robot platform.

- Setup: The H1-2 robot is equipped with an onboard Intel i7 computer for policy inference, running at 50 Hz to ensure real-time control performance. Global poses of the robot and the target object are tracked using a high-precision motion capture system. The target object is a cylinder similar in size to the training object but slightly different, representing a realistic test case.
- Execution Sequence (Fig. 7):
  - 0 s - 2 s (Approaching): The H1-2 robot initiates motion and approaches the target object.
  - 4 s (Contact Initiation): The robot positions itself to initiate contact, entering the pick-up zone and preparing for lifting.
  - 6 s (Embracing & Transport): The robot successfully lifts and embraces the object with its whole body (arms and torso) and begins locomotion to transport it to the target location.

Fig. 7. Sim-to-real transfer of whole-body manipulation in the H1-2 humanoid robot. The sequence illustrates the sim-to-real transfer of a whole-body manipulation task. From 0 to 2 s, the H1 robot approaches the target object. At 4 s, it positions itself to initiate contact and prepare for lifting. By 6 s, the robot successfully lifts the object and begins locomotion to complete the task.
The real-world experiment visually demonstrates successful sim-to-real transfer, showcasing the robot's ability to perform whole-body embracing of a bulky object. This validates the practical applicability of the proposed RL framework for multi-contact and long-horizon WBM tasks.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully developed and evaluated an RL framework that enables humanoid robots to perform whole-body embracing tasks for bulky objects. The framework innovatively integrates a pre-trained human motion prior with a Neural Signed Distance Field (NSDF) representation. The human motion prior, derived via a teacher-student architecture from large-scale human motion data, provides kinematically natural and physically feasible movement patterns. The NSDF offers precise geometric perception and contact awareness. This combination facilitates coordinated control of robot arms and torso, distributing contact forces across multiple points, significantly enhancing operational stability and load capacity. Experimental validation, including ablation studies, sim-to-sim transfer across diverse object properties (shapes, sizes, masses), and successful sim-to-real transfer on a Unitree H1-2 robot, demonstrates the approach's effectiveness, robustness, and practicality for complex multi-contact and long-horizon WBM tasks.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to "Limitations & Future Work." However, some implicit limitations and potential areas for future research can be inferred from the discussion:
- Performance with Heavily Loaded Cuboidal Objects: The paper notes that "the slightly lower performance observed for heavier cuboidal objects can be attributed to their flat surface morphology." This suggests a limitation where the current policy may struggle more with non-curved surfaces under high loads, indicating a potential area for improvement in the contact strategy or perception for such geometries.
- Scalability to More Complex Environments/Objects: While generalization to diverse object properties is shown, the experiments are conducted with relatively simple, single bulky objects. Future work could involve more complex manipulation scenarios, such as interacting with multiple objects, navigating cluttered environments, or handling objects with irregular or deformable surfaces.
- Generalization to Unseen Tasks: The framework is tailored for embracing bulky objects. While it leverages a universal motion prior, adapting it to entirely new WBM tasks (e.g., carrying objects with different postures, pushing, or opening doors while carrying) might still require significant RL fine-tuning.
- Real-time NSDF Generation/Adaptation: The NSDF is pre-trained. For dynamic or highly deformable objects, real-time NSDF generation or adaptive NSDF perception might be necessary, which is a complex challenge.
- Robustness to Perception Noise/Errors: The real-world experiment relies on a high-precision motion capture system. In environments without such systems, perception noise from onboard sensors (e.g., cameras, depth sensors) could affect performance. Future work could explore robustness to sensor noise or integration with less precise perception systems.
- Human-Robot Collaboration: The paper focuses on autonomous manipulation. Extending WBM to human-robot collaborative carrying or interaction would introduce new challenges related to intent prediction and physical interaction safety.
7.3. Personal Insights & Critique
This paper presents a highly practical and well-engineered solution to a critical problem in humanoid robotics. The integration of human motion priors with NSDF-enhanced Reinforcement Learning is a powerful combination that addresses several long-standing challenges.
Inspirations:

- Leveraging Human Intelligence: The teacher-student framework for motion prior distillation is particularly inspiring. It demonstrates a sophisticated way to infuse human-like dexterity and physical plausibility into RL policies, bypassing the often unnatural or inefficient motions discovered by RL from scratch in high-dimensional spaces. This approach could be transferred to other complex motor skills beyond manipulation, such as complex locomotion or tool use.
- Explicit Contact Awareness: The explicit use of the NSDF for contact awareness is a significant step forward. Rather than hoping RL implicitly learns to manage contacts, providing continuous geometric perception directly in the observation space and reward function makes the learning process more directed and robust. This paradigm could be adapted to other contact-rich robotics tasks, such as assembly, pushing, or even robot-robot interaction.
- Curriculum for Long-Horizon Tasks: The multi-stage random initialization is a simple yet effective curriculum-learning strategy that is often overlooked. Its demonstrated impact on convergence speed and stability underscores the importance of carefully structuring the RL environment for complex sequential tasks.
Potential Issues or Areas for Improvement:

- Computational Cost of NSDF: While neural SDFs are efficient, the real-time inference cost of the NSDF network over multiple robot links, especially for a complex humanoid, could still be a factor, particularly on less powerful onboard compute. Further optimization or efficient query strategies might be beneficial.
- Generalization of the Motion Prior: The human motion prior is distilled from a fixed dataset. While it provides kinematically natural movements, its generalization to highly novel or constrained manipulation postures not present in the mocap data could be a subtle limitation. Could the prior itself be adapted online or fine-tuned for specific contexts?
- Reward Function Tuning: RL reward functions are notoriously difficult to tune. The paper presents a comprehensive set of rewards, but the specific weights (-1e-7, -2.5e-8, etc.) are critical and often require extensive trial and error. A more systematic approach to reward design or auto-tuning could enhance robustness.
- Dynamic Object Properties: The NSDF helps with geometric perception of static object shapes. If the object were to deform significantly during manipulation (e.g., a bag of groceries), or if its properties changed unexpectedly, the current NSDF approach might need adaptation. Integrating deformable object modeling with the NSDF could be a future extension.
- Energy Efficiency: While the smoothness rewards contribute to energy efficiency, a more explicit energy consumption metric could be incorporated into the reward function for highly demanding WBM tasks, especially for battery-operated humanoid robots.

Overall, this paper provides a robust and well-validated framework, pushing the boundaries of what humanoid robots can achieve in whole-body manipulation. Its strength lies in the intelligent combination of established techniques to solve a complex, practical problem, making it a valuable contribution to the field.