
GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

Published: 11/07/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

GentleHumanoid integrates impedance control into a whole-body motion tracking policy for humanoid robots, achieving upper-body compliance. It employs a spring-based model to adapt to diverse human-robot interactions, reducing contact forces while ensuring successful task execution.

Abstract

Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction".

1.2. Authors

The authors are Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao, and C. Karen Liu. All authors are affiliated with Stanford University.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. The official publication venue is not yet specified, but the abstract and content suggest it is intended for a major robotics or AI conference/journal.

1.4. Publication Year

The paper was published on arXiv on 2025-11-06 at 18:59:33 (UTC).

1.5. Abstract

The paper introduces GentleHumanoid, a framework designed to enable humanoid robots to safely and naturally interact with humans and objects in human-centered environments. It addresses the limitation of most current reinforcement learning (RL) policies that prioritize rigid tracking and force suppression, and existing impedance-augmented approaches that are restricted to base or end-effector control, focusing on resisting extreme forces rather than enabling compliance. GentleHumanoid integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. Its core innovation is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls derived from human motion data). This formulation ensures kinematically consistent forces across the robot's shoulder, elbow, and wrist, exposing the policy to diverse interaction scenarios. Safety is further supported by task-adjustable force thresholds. The approach is evaluated in both simulation and on the Unitree G1 humanoid robot across tasks requiring varying levels of compliance, such as gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, GentleHumanoid consistently reduces peak contact forces while maintaining task success, leading to smoother and more natural interactions. The results represent a step towards humanoid robots capable of safe and effective collaboration with humans and object handling in real-world settings.

Original source: https://arxiv.org/abs/2511.04679 (the publication status is a preprint on arXiv).

PDF: https://arxiv.org/pdf/2511.04679v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of safe and natural physical interaction capabilities in humanoid robots, particularly in human-centered environments. Current state-of-the-art reinforcement learning (RL) policies for humanoid control often emphasize rigid position or velocity tracking, treating external forces as disturbances to be suppressed. This rigidity limits their applicability in tasks requiring adaptive compliance, such as handling delicate objects or engaging in physical contact with humans (e.g., hugging, offering assistance).

The problem is important because the successful deployment of humanoids in real-world, human-centric settings critically depends on their ability to interact safely and naturally. Existing solutions, such as impedance or admittance control, are typically restricted to base or end-effector control and primarily focus on resisting extreme forces, not on enabling the nuanced, multi-joint compliance needed for gentle interactions. Tasks like giving a comforting hug or assisting with a sit-to-stand transition require compliance across the entire upper-body kinematic chain, where multiple links (shoulders, elbows, hands) may be in contact simultaneously, with compliance levels ranging from gentle yielding to firm support, all while adhering to safety thresholds.

The paper's entry point or innovative idea is to address these challenges by integrating impedance control into a whole-body motion-tracking policy to achieve comprehensive upper-body compliance. This involves a novel unified spring-based formulation for interaction forces and safety-aware force thresholding to manage diverse contact scenarios.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. GentleHumanoid Framework: Introduction of a framework that seamlessly integrates impedance control with motion tracking for whole-body humanoid control featuring upper-body compliance.

  2. Unified Interaction Force Modeling: Development of a novel unified spring-based formulation that covers both resistive and guiding contacts. This formulation leverages human motion datasets to ensure kinematic consistency and capture diverse interaction scenarios across multiple links (shoulder, elbow, wrist).

  3. Safety-Aware Force Thresholding: Design of a force-thresholding mechanism that maintains interaction forces within safe limits, enabling comfortable and safer physical human-robot interaction. The thresholds are task-adjustable.

  4. Hugging Evaluation Setup: Creation of a custom pressure-sensing pad specifically tailored for evaluating hugging interactions, allowing for reliable measurement of distributed contact forces.

  5. Empirical Validation: Validation of the approach in both simulation and on the Unitree G1 humanoid robot. This validation demonstrates safer, smoother, and more adaptable performance compared to baseline methods across various tasks, including hugging, sit-to-stand assistance, and object manipulation.

    The key findings are that GentleHumanoid consistently reduces peak contact forces while maintaining task success, leading to smoother and more natural interactions. It exhibits superior compliance, stability, and adaptability compared to rigid RL tracking policies and end-effector focused force-adaptive policies, making it suitable for safe human-robot collaboration and delicate object handling.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following fundamental concepts:

  • Humanoid Robots: These are robots designed to resemble the human body, typically with a torso, head, two arms, and two legs. Their complex kinematics (study of motion without considering forces) and dynamics (study of motion considering forces) make their control challenging.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns an optimal policy (a mapping from states to actions) through trial and error.

    • Policy: In RL, a policy ($\pi$) is a function that maps an observed state of the environment to an action to be taken. For humanoid robots, this could mean mapping current joint angles and velocities to target joint positions or torques.
    • PPO (Proximal Policy Optimization): A popular RL algorithm used for training policies. It's a policy gradient method that tries to keep new policies close to old policies to ensure stable learning, preventing large, destructive updates.
  • Whole-Body Control (WBC): A control strategy for robots with many degrees of freedom (like humanoids) that coordinates the motion and forces across all joints and links simultaneously. The goal is to achieve complex tasks while maintaining balance, respecting joint limits, and managing contacts.

  • Impedance Control: A control strategy that regulates the relationship between a robot's position and the force it exerts or experiences when interacting with an environment. Instead of strictly controlling position or force, it controls the robot's "mechanical impedance" (its resistance to motion).

    • Stiffness ($K_p$): In impedance control, stiffness is analogous to a spring constant. A high stiffness means the robot strongly resists deviations from its desired position, requiring a large force to move it. A low stiffness means it yields easily to external forces.
    • Damping ($K_d$): Analogous to a damper, damping resists motion proportional to velocity. It helps stabilize the system and dissipates energy, preventing oscillations. Critical damping is a specific level of damping that returns the system to equilibrium as quickly as possible without oscillating.
  • Kinematic Chain: A series of rigid bodies (links) connected by joints. The upper body of a humanoid, from the shoulder to the hand, forms a kinematic chain. Kinematically consistent forces mean that forces applied at different points along this chain are related in a physically plausible way, reflecting how the human body naturally moves and reacts.

  • Sim-to-Real Transfer: The process of training a robot control policy in a simulated environment and then deploying it on a real physical robot. This is often challenging due to sim-to-real gap (discrepancies between simulation and reality), but crucial for training complex RL policies safely and efficiently.

  • Low-level PD controllers: Proportional-Derivative (PD) controllers are basic feedback control loops commonly used in robotics to track desired joint positions or velocities. They apply a torque proportional to the position error (P-term) and the rate of change of that error (D-term). In this paper, the RL policy outputs target joint positions, and the PD controllers ensure the robot's actual joints reach these targets.
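
As a minimal sketch of the last point, a single-joint PD law might look like the following; the gain values and names are illustrative, not taken from the paper:

```python
# Minimal joint-space PD controller sketch (illustrative gains, not from the paper).
def pd_torque(q_target, q, q_dot, kp=40.0, kd=2.0):
    """Torque = P-term on the position error minus a D-term damping the joint velocity."""
    return kp * (q_target - q) - kd * q_dot
```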

3.2. Previous Works

The paper categorizes related work into three main areas:

3.2.1. Humanoid Whole Body Control

  • Traditional Model-Based Methods: Early approaches like Model Predictive Control (MPC) (e.g., [13]-[15]) were used to generate stable humanoid behaviors. These methods rely on detailed mathematical models of the robot and its environment.
    • Key Idea: MPC works by predicting the robot's future behavior over a short horizon, solving an optimization problem to find control inputs that minimize a cost function (e.g., tracking error, energy consumption, instability) while respecting constraints (e.g., joint limits, balance). It then applies the first part of the optimal control sequence and repeats the process.
    • Limitations: These methods require extensive expert design and meticulous tuning, and can be computationally expensive, making them less adaptable to diverse, unmodeled interactions.
  • Learning-Based Methods (RL for WBC): More recent work leverages RL to learn whole-body control policies, often by learning from human motion data (e.g., [3]-[8]). These methods have shown impressive results for dynamic locomotion and manipulation.
    • Key Idea: Instead of explicitly modeling dynamics, RL policies learn to map observations directly to control actions through interaction with a simulated environment. Learning from human motion data allows robots to imitate natural, complex movements.
    • Limitations: Most of these RL policies prioritize rigid tracking and tend to treat external forces as disturbances to suppress, making them less suitable for tasks requiring adaptive compliance and safe physical interaction.

3.2.2. Force-adaptive Control

  • Classical Force-Adaptive Methods: Impedance and admittance control are classical techniques for regulating interaction forces. They have been extended to whole-body frameworks (e.g., [15]-[17]).
    • Key Idea: These controllers define a desired mechanical behavior (e.g., a certain stiffness or damping) for the robot's interaction with the environment, rather than strictly controlling position or force.
    • Limitations: They often require careful tuning and can be complex to apply to high-dimensional whole-body systems.
  • RL-Based Force-Adaptive Methods: Recent RL approaches incorporate impedance or admittance control (e.g., [9]-[11]) or aim to implicitly learn robustness to external disturbances and extreme forces (e.g., [12], [18]).
    • Limitations: These methods typically focus on end-effector interactions (e.g., controlling the force at the hand) rather than coordinated interactions involving multiple body parts like elbows and shoulders, which are crucial for tasks like carrying large objects or human physical interaction.

3.2.3. Human-Humanoid Interaction

  • Early Works: Explored human-in-the-loop strategies and haptic feedback for soft contact (e.g., [19], [20]).
  • Task-Specific Assistance: Applied traditional control to specific tasks like sit-to-stand transitions (e.g., [21], [22]).
    • Limitations: These approaches are usually tailored to a single scenario and often do not generalize across different interaction contexts.
  • Vision-Based Criteria: Focus on designing policies to avoid human collisions (e.g., [23]).
    • Limitations: These primarily focus on collision avoidance rather than enabling compliant physical interaction.

3.3. Technological Evolution

The field of humanoid robot control has evolved from purely model-based approaches, which relied on precise mathematical models and significant human engineering, to increasingly learning-based methods, particularly Reinforcement Learning (RL). Early RL efforts focused on dynamic locomotion and manipulation, achieving impressive feats but often prioritizing rigid trajectory tracking and suppressing external disturbances. This led to robots that were robust to unintended forces but stiff and unsafe for intended physical interaction.

The next step in this evolution involved integrating classical force-adaptive control (like impedance control) into RL frameworks to allow for more nuanced interactions. However, these integrations were often limited to end-effector control or focused on resisting extreme forces.

This paper's work (GentleHumanoid) fits into this timeline by pushing the boundaries of RL-based force-adaptive control towards whole-body, multi-link compliance, especially for the upper body. It moves beyond end-effector focus to coordinate forces across the entire kinematic chain, explicitly addresses safety through force thresholding, and exposes the RL policy to a diverse range of compliant interaction scenarios through a novel spring-based formulation using human motion data. This represents a significant step towards enabling humanoids to operate safely and naturally in complex human-centered environments.

3.4. Differentiation Analysis

Compared to the main methods in related work, GentleHumanoid presents several core differences and innovations:

  • Whole-Body Upper-Body Compliance vs. End-Effector/Base:

    • Previous RL with Impedance: Typically restricted to end-effector or base control (e.g., [9]-[11]), meaning compliance is only managed at the robot's hand or its main body, not coordinated across the entire arm.
    • GentleHumanoid: Introduces a framework for whole-body upper-body compliance, meaning forces and compliance are coordinated across the shoulder, elbow, and wrist simultaneously. This is crucial for tasks like hugging or carrying large objects where multiple points of contact are involved.
  • Enabling Compliance vs. Resisting Extreme Forces:

    • Previous Force-Adaptive RL: Often focused on implicitly learning robustness to external disturbances and extreme forces (e.g., [12], [18]), primarily designed for resisting rather than yielding.
    • GentleHumanoid: Explicitly aims to enable compliance, allowing the robot to yield gently or provide supportive forces as needed, suitable for delicate interactions.
  • Unified Spring-Based Formulation for Diverse Interactions:

    • Previous Interaction Modeling: Physics engines provide contact forces but are often noisy, localized, and lack coordination, and only occur during actual collisions.
    • GentleHumanoid: Introduces a novel unified spring-based formulation that models both resistive and guiding contacts. This allows for simulating diverse and kinematically consistent interaction forces by sampling from human motion data, which ensures coordination across the kinematic chain and exposes the policy to a much broader range of interaction scenarios during training than real physics alone.
  • Safety-Aware Force Thresholding:

    • Previous Approaches: While some methods aim for robustness, explicit task-adjustable force thresholds to guarantee safety and comfort in human interaction are less common.
    • GentleHumanoid: Integrates an adaptive force thresholding mechanism that caps maximum allowable forces, tunable for specific tasks (e.g., gentle hugging vs. firm support), and explicitly trains the policy to respect these limits.
  • Generalization Across Interaction Contexts:

    • Previous Human-Humanoid Interaction: Often tailored to specific scenarios (e.g., sit-to-stand assistance [21], [22]).

    • GentleHumanoid: Aims for a general motion-tracking policy capable of handling multiple interaction scenarios (hugging, sit-to-stand, object manipulation) by adapting its compliance level.

      In essence, GentleHumanoid provides a more holistic and nuanced approach to compliant physical interaction for humanoids by coordinating forces across the entire upper body, proactively ensuring safety, and learning from diverse, kinematically consistent interaction scenarios.

4. Methodology

4.1. Principles

The core idea behind GentleHumanoid is to achieve whole-body humanoid control with upper-body compliance by integrating impedance control principles into an RL-based motion-tracking policy. The theoretical basis is that a robot's motion is influenced by two types of forces: a driving force that pulls it towards a desired target motion, and an interaction force arising from physical contact with the environment (humans or objects). By modeling both these forces with spring-damper systems and training an RL policy to track reference dynamics governed by these forces, the robot learns to exhibit adaptable and safe compliant behavior. The key intuition is that instead of rigidly resisting all external forces, the robot should "yield" or "push back" with a controlled, human-like stiffness and damping, coordinated across its entire upper-body kinematic chain. Safety is ensured by capping the maximum allowable forces.

4.2. Core Methodology In-depth (Layer by Layer)

The GentleHumanoid framework aims to enable robust and safe whole-body humanoid control for diverse motions and compliant interactions. It frames this as learning a compliant motion-tracking policy where the humanoid follows target movements while adapting to interaction forces.

4.2.1. Problem Formulation

The motion of each link $i$ of the humanoid robot is modeled as being influenced by two types of forces: a driving force ($\pmb{f}_{\mathrm{drive},i}$) and an interaction force ($\pmb{f}_{\mathrm{interact},i}$). This is represented by the equation:

$$ M \ddot{\pmb{x}}_i = \pmb{f}_{\mathrm{drive},i} + \pmb{f}_{\mathrm{interact},i} $$

Here:

  • $\pmb{x}_i$ represents the 3D Cartesian position of link $i$.

  • $\ddot{\pmb{x}}_i$ is the 3D Cartesian acceleration of link $i$.

  • $M$ is a scalar virtual mass (in kg) assigned to each link. This is a conceptual mass used in the reference dynamics model, not the actual physical mass of the link. The paper sets $M = 0.1~\mathrm{kg}$.

  • $\pmb{f}_{\mathrm{drive},i}$ is the driving force for link $i$, which pulls the link towards its target motion.

  • $\pmb{f}_{\mathrm{interact},i}$ is the interaction force for link $i$, representing physical contact with humans or objects.

    For clarity, the index $i$ is omitted in subsequent equations, meaning the formulation applies to any given link. All link positions ($\pmb{x}$) and velocities ($\dot{\pmb{x}}$) are 3D Cartesian quantities expressed in the robot's root frame.

4.2.2. Impedance-Based Driving Force from Target Motion

The driving force ($\pmb{f}_{\mathrm{drive}}$) for each link is generated from its target motion using a virtual spring-damper system, a concept from classical impedance control. This force pulls the current link position towards its target trajectory:

$$ \pmb{f}_{\mathrm{drive}} = K_p (\pmb{x}_{\mathrm{tar}} - \pmb{x}_{\mathrm{cur}}) + K_d (\pmb{v}_{\mathrm{tar}} - \pmb{v}_{\mathrm{cur}}) $$

Where:

  • $\pmb{x}_{\mathrm{cur}}$ and $\pmb{v}_{\mathrm{cur}}$ are the current 3D Cartesian position and velocity of the link.

  • $\pmb{x}_{\mathrm{tar}}$ and $\pmb{v}_{\mathrm{tar}}$ are the corresponding target 3D Cartesian position and velocity from the desired motion.

  • $K_p$ is the impedance stiffness gain, controlling how strongly the link position tracks its target. A higher $K_p$ means a stiffer response, resisting deviation more.

  • $K_d$ is the impedance damping gain, controlling the resistance to velocity deviations.

  • To ensure stable and smooth behavior, the damping gain $K_d$ is set to the critical damping value $K_d = 2\sqrt{M K_p}$. This prevents oscillations and returns the system to its target state quickly without overshooting.

    The RL policy operates in joint space, producing actions (target joint positions) which are then tracked by low-level joint PD controllers. The RL policy learns to coordinate these compliant forces across multiple joints, mapping them into these joint-level actions to balance stability and adaptability.
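
A minimal sketch of this driving-force computation, assuming NumPy arrays for the 3D link states and the virtual mass $M = 0.1$ kg reported by the paper (function and variable names are ours):

```python
import numpy as np

M = 0.1  # virtual link mass [kg], as reported in the paper

def driving_force(x_tar, v_tar, x_cur, v_cur, kp):
    """Virtual spring-damper pulling a link toward its target state (Eq. 2)."""
    kd = 2.0 * np.sqrt(M * kp)  # critical damping, Kd = 2*sqrt(M*Kp)
    return kp * (x_tar - x_cur) + kd * (v_tar - v_cur)
```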

4.2.3. Interaction Force Modeling

When physical contact occurs, interaction forces are introduced. The paper designs a unified interaction force model to account for both multi-link coupling (forces affecting multiple parts of the arm) and diversity in force types. It distinguishes two cases, both modeled using the same spring formulation:

  1. Resistive contact: Forces generated when the humanoid actively presses against a human or object.

  2. Guiding contact: Forces applied by an external agent (e.g., a human pushing or pulling the humanoid's arm).

    The interaction force ($\pmb{f}_{\mathrm{interact}}$) is modeled as:

$$ \pmb{f}_{\mathrm{interact}} = K_{\mathrm{spring}} (\pmb{x}_{\mathrm{anchor}} - \pmb{x}_{\mathrm{cur}}) $$

Where:

  • $K_{\mathrm{spring}}$ is the stiffness of the interaction spring.

  • $\pmb{x}_{\mathrm{cur}}$ is the current link position.

  • $\pmb{x}_{\mathrm{anchor}}$ is the spring anchor position, which is defined differently for the two contact types:

    $$ \pmb{x}_{\mathrm{anchor}} = \begin{cases} \pmb{x}_{\mathrm{cur}}(t_0), & \text{resistive contact}, \\ \pmb{x}_{\mathrm{sample}}, & \text{guiding contact}. \end{cases} $$

  • For resistive contact, $\pmb{x}_{\mathrm{anchor}}$ is fixed at the link position $\pmb{x}_{\mathrm{cur}}(t_0)$ at the moment of initial contact $t_0$. This generates restoring forces that push the link back to its initial contact point if it moves away.

  • For guiding contact, $\pmb{x}_{\mathrm{anchor}}$ is a link position $\pmb{x}_{\mathrm{sample}}$ sampled from a human motion dataset. This represents an external agent steering the humanoid towards a new configuration, generating guiding forces that pull the humanoid towards this sampled target.

    The use of posture samples from real human motion data for guiding contact is crucial. It ensures that the guiding forces are kinematically valid and coordinated across the entire kinematic chain (e.g., shoulder, elbow, wrist), preventing unrealistic or destabilizing forces that might arise from independently applying forces to each link. During training, posture distributions are precomputed from motion datasets, and suitable postures close to the current multi-link positions are selected. A target position is then randomly sampled from these to serve as the spring anchor.
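
A hedged sketch of the two anchor choices and the resulting interaction force (Eqs. 3-4); the posture-sampling helper is a hypothetical stand-in for the paper's precomputed posture-distribution lookup:

```python
import numpy as np

def interaction_force(x_cur, x_anchor, k_spring):
    """Spring force pulling the link toward the anchor (Eq. 3)."""
    return k_spring * (x_anchor - x_cur)

def choose_anchor(contact_type, x_at_first_contact, sample_posture_fn):
    """Anchor selection per Eq. 4.
    resistive: anchor frozen at the link position when contact began.
    guiding:   anchor sampled from a human-motion posture distribution
               (sample_posture_fn is a hypothetical stand-in for that lookup)."""
    if contact_type == "resistive":
        return x_at_first_contact
    return sample_posture_fn()
```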

To increase interaction diversity during training, the framework randomizes both the stiffness ($K_{\mathrm{spring}}$) and the active links (which links experience interaction forces).

  • Stiffness is sampled from a uniform distribution: $K_{\mathrm{spring}} \sim \mathcal{U}(5, 250)$.
  • Active-contact sets are chosen with specific probabilities:
    • 40%: no external force.
    • 15%: both arms (all 6 relevant links: shoulder, elbow, hand for each arm) under force.
    • 30%: a single arm (left or right; its 3 links) under force (15% for each arm).
    • 15%: only a single link under force.

Anchors and selections are resampled every 5 seconds with a short transition window to ensure continuity. This extensive randomization exposes the policy to a broad range of interaction dynamics, promoting robust compliance.
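
The randomization scheme above could be sketched roughly as follows; the probabilities and stiffness range follow the text, while the link-name encoding is an assumption of ours:

```python
import numpy as np

LEFT_ARM = ["l_shoulder", "l_elbow", "l_hand"]
RIGHT_ARM = ["r_shoulder", "r_elbow", "r_hand"]

def sample_interaction_config(rng=None):
    """Resampled every 5 s during training: spring stiffness and the active link set."""
    if rng is None:
        rng = np.random.default_rng()
    k_spring = rng.uniform(5.0, 250.0)            # Kspring ~ U(5, 250)
    u = rng.random()
    if u < 0.40:                                  # 40%: no external force
        active = []
    elif u < 0.55:                                # 15%: both arms (6 links)
        active = LEFT_ARM + RIGHT_ARM
    elif u < 0.85:                                # 30%: one arm (15% each side)
        active = LEFT_ARM if rng.random() < 0.5 else RIGHT_ARM
    else:                                         # 15%: a single link
        active = [rng.choice(LEFT_ARM + RIGHT_ARM)]
    return k_spring, active
```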

The following figure (Figure 2 from the original paper) provides an overview of the GentleHumanoid framework:

Fig. 2: Overview of the GentleHumanoid framework. Reference dynamics: impedance-based dynamics integrate driving and interaction forces through the motion update $x_{i+1} = x_i + \Delta t \cdot \dot{x}_i$. Training: the policy is optimized using rewards that compare simulated states ($x^{\mathrm{sim}}, \dot{x}^{\mathrm{sim}}$) to the reference dynamics ($x^{\mathrm{ref}}, \dot{x}^{\mathrm{ref}}$). Deployment: the trained GentleHumanoid policy performs real-world tasks such as gentle hugging and handling large deformable objects.

The figure illustrates how the reference dynamics, incorporating both driving and interaction forces, guides the RL policy learning. The policy is optimized by comparing simulated states with these reference dynamics. The trained policy is then deployed for tasks like hugging or handling deformable objects.

The following figure (Figure 3 from the original paper) visualizes the resulting distribution of interaction forces:

Fig. 3: Interaction force distributions across upper-body links. Probability densities of force magnitudes are shown for the right shoulder (left), right elbow (middle), and right hand (right). Insets (top right) illustrate the corresponding force directions on a sphere.

This figure shows that the interaction forces span a wide range of directions on a sphere (as indicated by the insets) and have magnitudes from 0 to 25 N, demonstrating the diversity generated by the modeling approach.

4.2.4. Safety-Aware Force Thresholding

To prevent unbounded forces (which could arise from large tracking errors in the driving force calculation, Eq. 2) and ensure safe interaction, an adaptive force thresholding mechanism is introduced. This mechanism caps the maximum allowable force applied by the robot.

During training, a piecewise-constant safety threshold $\tau_{\mathrm{safe}}$ is sampled within a defined range: $F_1 \leq \tau_{\mathrm{safe}} \leq F_2$. This threshold is resampled every 5 seconds, encouraging the policy to be robust across different safety limits. The current $\tau_{\mathrm{safe}}$ is provided to the policy as part of its observation.

When the magnitude of the driving force ($\|\pmb{f}_{\mathrm{drive}}\|$) exceeds $\tau_{\mathrm{safe}}$, a scaling mechanism is applied to limit it:

$$ \pmb{f}_{\mathrm{drive,limited}} = \min\left( 1.0, \frac{\tau_{\mathrm{safe}}}{\|\pmb{f}_{\mathrm{drive}}\|} \right) \cdot \pmb{f}_{\mathrm{drive}} $$

Where:

  • $\pmb{f}_{\mathrm{drive,limited}}$ is the limited driving force.

  • $\tau_{\mathrm{safe}}$ is the sampled safety threshold.

  • $\|\pmb{f}_{\mathrm{drive}}\|$ is the magnitude of the original driving force.

  • The term $\frac{\tau_{\mathrm{safe}}}{\|\pmb{f}_{\mathrm{drive}}\|}$ acts as a scaling factor. If $\|\pmb{f}_{\mathrm{drive}}\|$ is less than or equal to $\tau_{\mathrm{safe}}$, this ratio is at least 1.0, so the min caps it at 1.0 and no scaling occurs. If $\|\pmb{f}_{\mathrm{drive}}\|$ exceeds $\tau_{\mathrm{safe}}$, the ratio is below 1.0, and the driving force is scaled down so that its magnitude equals $\tau_{\mathrm{safe}}$.

    This thresholding directly influences the robot's compliance:

  • Lower $\tau_{\mathrm{safe}}$ values: Lead to softer, safer behavior, ideal for gentle interactions like hugging fragile objects.

  • Higher $\tau_{\mathrm{safe}}$ values: Allow for firmer support, suitable for tasks such as sit-to-stand assistance.

    For interactions with humans and fragile objects, the paper sets $F_1 = 5$ N and $F_2 = 15$ N. These values are chosen based on ISO/TS 15066 safety guidelines and comfort studies for human-robot interaction. For example, 15 N over a small contact area (0.25 cm²) corresponds to 60 N/cm², which is still below ISO/TS 15066 pain-onset limits for the torso and arms. For more realistic hugging contacts (~16 cm²), this force range corresponds to 3–9 kPa, consistent with measurements of children's hugs.
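
In code, the clamping of Eq. 5 is a single rescaling step; a minimal sketch (names are ours):

```python
import numpy as np

def limit_driving_force(f_drive, tau_safe):
    """Scale the driving force down so its magnitude never exceeds tau_safe (Eq. 5)."""
    norm = np.linalg.norm(f_drive)
    scale = min(1.0, tau_safe / norm) if norm > 1e-9 else 1.0
    return scale * f_drive
```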

4.2.5. RL-based Control Policy

The RL policy's objective is to output joint position targets $\pmb{a}_t$ at 50 Hz (for low-level PD tracking) such that the humanoid follows a target motion sequence ($\pmb{m}_{\mathrm{tar}}$) while exhibiting compliant responses to interaction forces ($f_{\mathrm{interact}}$). The policy learns to reproduce the behavior encoded by the impedance-based reference dynamics.

4.2.5.1. Reference Dynamics Integration

The impedance-based reference dynamics are simulated using semi-implicit Euler integration with a fixed time step of 0.005 s. This simulation generates the reference state (link positions $\pmb{x}_t^{\mathrm{ref}}$ and velocities $\dot{\pmb{x}}_t^{\mathrm{ref}}$) that the RL policy is trained to track.

$$ \dot{\pmb{x}}_{t+1}^{\mathrm{ref}} = \dot{\pmb{x}}_t^{\mathrm{ref}} + \Delta t \cdot \frac{\pmb{f}_{\mathrm{drive}} + \pmb{f}_{\mathrm{interact}}}{M}, \qquad \pmb{x}_{t+1}^{\mathrm{ref}} = \pmb{x}_t^{\mathrm{ref}} + \Delta t \cdot \dot{\pmb{x}}_{t+1}^{\mathrm{ref}}. $$

Where:

  • $\pmb{x}_t^{\mathrm{ref}}$ and $\dot{\pmb{x}}_t^{\mathrm{ref}}$ are the link position and velocity in the reference dynamics model at time $t$. This is distinct from the actual robot link position $\pmb{x}^{\mathrm{sim}}$ in the simulator.
  • $\Delta t$ is the integration step size.
  • $\pmb{f}_{\mathrm{drive}}$ is the driving force (which is force-limited as per Eq. 5).
  • $\pmb{f}_{\mathrm{interact}}$ is the interaction force.
  • $M$ is the virtual mass. At each timestep, velocities and positions are updated according to the net driving and interaction forces. Semi-implicit Euler is chosen for its numerical stability. The RL agent observes its environment and outputs actions to align its behavior with this dynamics model.
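
A minimal sketch of the semi-implicit Euler update in Eq. 6, assuming the forces have already been computed and limited (step size and virtual mass follow the text):

```python
def reference_step(x_ref, v_ref, f_drive_limited, f_interact, dt=0.005, M=0.1):
    """Semi-implicit Euler: update the velocity first, then integrate position with it."""
    v_next = v_ref + dt * (f_drive_limited + f_interact) / M
    x_next = x_ref + dt * v_next
    return x_next, v_next
```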

4.2.5.2. Teacher-Student Architecture

A two-stage teacher-student training framework is employed for sim-to-real transfer. Both policies are trained using PPO [27].

  • Student Policy Observation: The student policy observes only information available during real-world deployment:

    $$ \pmb{o}_t = \left( \tau_{\mathrm{safe}}, \pmb{m}_{\mathrm{tar}}, \omega, \pmb{g}, \pmb{q}_t^{\mathrm{hist}}, \pmb{a}_{t-3:t-1} \right) $$

    Where:

    • $\tau_{\mathrm{safe}}$: Current force-safety limit.
    • $\pmb{m}_{\mathrm{tar}}$: Target motion information, including future root poses and target joint positions.
    • $\omega$: Root angular velocity.
    • $\pmb{g}$: Gravity vector expressed in the robot's root frame (projected gravity).
    • $\pmb{q}_t^{\mathrm{hist}}$: History of recent joint positions (proprioception).
    • $\pmb{a}_{t-3:t-1}$: History of the last three actions (joint position targets).
  • Teacher Policy Observation: The teacher policy receives additional comprehensive privileged information:

    $$ \pmb{o}_t^{\mathrm{priv}} = \left( \pmb{x}_t^{\mathrm{ref}}, \dot{\pmb{x}}_t^{\mathrm{ref}}, f_{\mathrm{interact}}, f_{\mathrm{interact}}^{\mathrm{sim}}, h_t, \tau_{t-1}, e_{\mathrm{cum}} \right) $$

    Where:

    • $\pmb{x}_t^{\mathrm{ref}}$ and $\dot{\pmb{x}}_t^{\mathrm{ref}}$: Link positions and velocities from the impedance-based reference dynamics (Eq. 6).

    • $f_{\mathrm{interact}}$: Interaction force predicted by the reference dynamics.

    • $f_{\mathrm{interact}}^{\mathrm{sim}}$: Actual interaction force measured in the simulation environment.

    • $h_t$: The robot's height.

    • $\tau_{t-1}$: Torques applied to the joints in the previous time step.

    • $e_{\mathrm{cum}}$: Cumulative tracking error.

      Both policies output joint position targets $\pmb{a}_t \in \mathbb{R}^{29}$ (the robot has 29 degrees of freedom), which are then tracked by low-level PD controllers.
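
For concreteness, the student observation of Eq. 7 could be assembled roughly as below; the exact dimensionalities are not listed in the analysis, so the shapes noted in the comments are assumptions:

```python
import numpy as np

def student_observation(tau_safe, m_tar, omega, gravity, q_hist, prev_actions):
    """Concatenate the deployment-time observation (Eq. 7).
    m_tar: flattened future root poses and target joint positions,
    q_hist: recent joint-position history,
    prev_actions: last three action vectors (assumed 3 x 29)."""
    return np.concatenate([
        np.atleast_1d(tau_safe),   # current force-safety limit (scalar)
        m_tar.ravel(),             # target motion information
        omega.ravel(),             # root angular velocity (3,)
        gravity.ravel(),           # projected gravity in the root frame (3,)
        q_hist.ravel(),            # proprioceptive joint-position history
        prev_actions.ravel(),      # history of the last three actions
    ])
```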

4.2.5.3. Motion Datasets

Diverse human motion data is used for training, covering human-human and human-object interactions. Specifically, the AMASS [29], InterX [30], and LAFAN [31] datasets are retargeted to the humanoid using GMR [28]. High-dynamic motions unsuitable for interaction scenarios are filtered out, resulting in approximately 25 hours of data sampled at 50 Hz.

4.2.5.4. Reward Design

The RL policy is trained using a reward function that combines terms for motion tracking, locomotion stability, and importantly, compliance.

The compliance reward consists of three terms:

  1. Reference Dynamics Tracking: This term encourages the robot to follow the compliant behavior specified by the reference dynamics by minimizing the discrepancy between the actual link state in simulation ($\pmb{x}_t^{\mathrm{sim}}, \dot{\pmb{x}}_t^{\mathrm{sim}}$) and the reference state ($\pmb{x}_t^{\mathrm{ref}}, \dot{\pmb{x}}_t^{\mathrm{ref}}$) from Eq. 6.

    $$ r_{\mathrm{dyn}} = \exp\left( -\frac{\|\pmb{x}_t^{\mathrm{sim}} - \pmb{x}_t^{\mathrm{ref}}\|_2}{\sigma_x} \right) + \exp\left( -\frac{\|\dot{\pmb{x}}_t^{\mathrm{sim}} - \dot{\pmb{x}}_t^{\mathrm{ref}}\|_2}{\sigma_v} \right) $$

    Where:

    • $\|\pmb{x}_t^{\mathrm{sim}} - \pmb{x}_t^{\mathrm{ref}}\|_2$ is the L2-norm (Euclidean distance) of the position error between the simulated and reference link positions.
    • $\|\dot{\pmb{x}}_t^{\mathrm{sim}} - \dot{\pmb{x}}_t^{\mathrm{ref}}\|_2$ is the L2-norm of the velocity error.
    • $\exp(\cdot)$ is the exponential function, used to create a smooth, positive reward term that approaches 1 as the corresponding error decreases.
    • $\sigma_x$ and $\sigma_v$ are scaling parameters that control the sensitivity of the reward to position and velocity errors, respectively.
  2. Reference Force Tracking: This term explicitly penalizes the discrepancy between the interaction force predicted by the reference dynamics ($f_{\mathrm{interact}}$) and the actual interaction force measured in the simulation environment ($f_{\mathrm{interact}}^{\mathrm{sim}}$). This helps in regulating force magnitudes and enforcing safety thresholds.

    $$ r_{\mathrm{force}} = \exp\left( -\frac{\|f_{\mathrm{interact}} - f_{\mathrm{interact}}^{\mathrm{sim}}\|_2}{\sigma_f} \right) $$

    Where:

    • $\|f_{\mathrm{interact}} - f_{\mathrm{interact}}^{\mathrm{sim}}\|_2$ is the L2-norm of the error between the reference interaction force and the simulated interaction force.
    • $\sigma_f$ is a scaling parameter for force error sensitivity.
  3. Unsafe Force Penalty: This term directly discourages forces that exceed the safety margin $\tau_{\mathrm{safe}}$, complementing the driving force thresholding (Eq. 5).

    $$ r_{\mathrm{pen}} = -\mathbb{I}\left( \|f_{\mathrm{interact}}\| > \tau_{\mathrm{safe}} + \delta_{\mathrm{tol}} \right) $$

    Where:

    • $\mathbb{I}(\cdot)$ is the indicator function, which equals 1 if the condition inside is true, and 0 otherwise.

    • $\|f_{\mathrm{interact}}\|$ is the magnitude of the interaction force.

    • $\tau_{\mathrm{safe}}$ is the safety threshold.

    • $\delta_{\mathrm{tol}}$ is a tolerance margin (set to 10 N empirically) that allows for minor deviations beyond $\tau_{\mathrm{safe}}$ before a penalty is incurred, preventing the policy from becoming overly conservative.

      The overall compliance reward is a weighted sum of these terms:

$$ r_{\mathrm{compliance}} = w_{\mathrm{dyn}} r_{\mathrm{dyn}} + w_{\mathrm{force}} r_{\mathrm{force}} + w_{\mathrm{pen}} r_{\mathrm{pen}} $$

The weights for each term, along with weights for motion tracking and locomotion stability rewards, are provided in Table I.

The following are the results from Table I of the original paper:

| Reward | Weight |
| --- | --- |
| **Compliance** | |
| Reference Dynamics Tracking | 2.0 |
| Reference Force Tracking | 2.0 |
| Unsafe Force Penalty | 6.0 |
| **Motion Tracking** | |
| Root Tracking | 0.5 |
| Joint Tracking | 1.0 |
| **Locomotion Stability** | |
| Survival | 5.0 |
| Feet Air Time | 10.0 |
| Impact Force | 4.0 |
| Slip Penalty | 2.0 |
| Action Rate | 0.1 |
| Joint Velocity | 5.0e-4 |
| Joint Limit | 1.0 |
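
Combining the three compliance terms with the Table I weights could look like the following sketch; the σ values are placeholders, since the paper analysis does not report them:

```python
import numpy as np

W_DYN, W_FORCE, W_PEN = 2.0, 2.0, 6.0   # weights from Table I
DELTA_TOL = 10.0                         # tolerance margin [N]

def compliance_reward(x_sim, x_ref, v_sim, v_ref, f_ref, f_sim, tau_safe,
                      sigma_x=0.1, sigma_v=0.5, sigma_f=5.0):  # placeholder sigmas
    """Weighted sum of reference-dynamics tracking, force tracking, and unsafe-force penalty."""
    r_dyn = (np.exp(-np.linalg.norm(x_sim - x_ref) / sigma_x)
             + np.exp(-np.linalg.norm(v_sim - v_ref) / sigma_v))
    r_force = np.exp(-np.linalg.norm(f_ref - f_sim) / sigma_f)
    r_pen = -float(np.linalg.norm(f_ref) > tau_safe + DELTA_TOL)
    return W_DYN * r_dyn + W_FORCE * r_force + W_PEN * r_pen
```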

4.2.6. Appendix A. External Force Application Logic

The interaction forces are applied to a subset of upper-body links (shoulders, wrists, hands) within the simulation. This procedure runs at every simulation step and involves:

  1. Activation and Gain Scheduling:

    • An active link is a force-application point currently enabled. A binary mask $\mathbf{m} \in \{0, 1\}^M$ (where $M$ is the number of candidate links) indicates the active set.
    • At the start of an interval, one of five modes is sampled to determine $\mathbf{m}$: no-force, all-links, left-only, right-only, or a random partial subset.
    • For each active link, an interaction spring gain $K_{\mathrm{spring}}(t)$ is assigned, which varies smoothly over time (piecewise-linear). Gains can increase, hold, then decrease to zero.
    • In parallel, a force safety threshold $\tau_{\mathrm{safe}}(t)$ is smoothly adjusted within a bounded range, used for clamping and reward shaping.
  2. Anchor (Interaction Spring Origin) Update:

    • Each active link maintains an anchor $\pmb{o}(t)$ in the robot's root frame.
    • Resistive contact: The anchor stays at its previously established location (relative to the root), simulating a resisting load at the contact site.
    • Guiding contact: The anchor smoothly moves towards a newly sampled surface point (from human motion data).
    • Updates are smooth to avoid discontinuities when active sets or targets change.
  3. One-Sided Projection:

    • Contact is modeled as one-sided: interaction forces act only when the link compresses towards the anchor along the intended direction of interaction.
    • If the link moves away from the anchor (i.e., leaves the contact side), the interaction force drops to zero. This prevents non-physical pulling in free space and mimics real unilateral contacts. The displacement from the link to the anchor is computed, and only its component along the intended direction is considered.
  4. Application in the Simulator:

    • Forces are applied in world coordinates at the active links.

    • To prevent excessive overall disturbance, the net wrench (combination of force and torque) about the torso is bounded. All per-link forces/torques are summed, and if the totals exceed preset limits, an opposite residual wrench is injected on the torso.

      The following are the results from Table II of the original paper:

| Parameter | Symbol | Typical value / range |
| --- | --- | --- |
| Max per-link force cap | $F_{\max}$ | 30 N |
| Safety threshold (per link) | $\tau_{\mathrm{safe}}(t)$ | 5–15 N (default 10 N) |
| Net force limit (about torso) | $T_F$ | 30 N |
| Net torque limit (about torso) | $T_M$ | 20 N·m |
| Interaction spring gain | $K_{\mathrm{spring}}(t)$ | 5–250 |
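
The one-sided projection in step 3 can be sketched as follows, assuming the intended contact direction is a fixed unit vector recorded at contact onset (this convention, and the function names, are our assumptions; the per-link cap follows Table II):

```python
import numpy as np

def one_sided_interaction_force(x_cur, anchor, k_spring, contact_dir, f_cap=30.0):
    """Apply the interaction spring only while the link compresses toward the anchor.
    contact_dir: unit vector of the intended interaction direction.
    f_cap: per-link force cap [N] (Table II)."""
    disp = anchor - x_cur
    depth = np.dot(disp, contact_dir)      # displacement component along the contact direction
    if depth <= 0.0:                       # link has left the contact side: no force
        return np.zeros(3)
    f = k_spring * depth * contact_dir     # spring acts only along that direction
    norm = np.linalg.norm(f)
    return f if norm <= f_cap else f * (f_cap / norm)
```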

4.2.7. Appendix B. Reference Dynamics Integration

All reference quantities are expressed in the robot's root frame. Let $\pmb{x}_t, \dot{\pmb{x}}_t$ be the current state of a link, and $\pmb{x}_t^{\mathrm{tar}}, \dot{\pmb{x}}_t^{\mathrm{tar}}$ be the target state. The reference dynamics used in this work are:

$$ M \ddot{\pmb{x}}_t = f_{\mathrm{drive}}(\pmb{x}_t^{\mathrm{tar}}, \dot{\pmb{x}}_t^{\mathrm{tar}}, \pmb{x}_t, \dot{\pmb{x}}_t) + f_{\mathrm{interact}}(\cdot) - D \dot{\pmb{x}}_t $$

Where:

  • $M \ddot{\pmb{x}}_t$ represents the product of virtual mass and acceleration, which equals the net force on the link.

  • $f_{\mathrm{drive}}(\pmb{x}_t^{\mathrm{tar}}, \dot{\pmb{x}}_t^{\mathrm{tar}}, \pmb{x}_t, \dot{\pmb{x}}_t)$ is the driving force term (as defined in Eq. 2 and force-limited by Eq. 5), dependent on the target and current positions/velocities.

  • $f_{\mathrm{interact}}(\cdot)$ is the interaction force term (as defined in Eqs. 3 and 4).

  • $D \dot{\pmb{x}}_t$ is an additional damping term ($D$ is the integration damping coefficient) applied to the current velocity $\dot{\pmb{x}}_t$ for enhanced stability during integration.

    This system is integrated using explicit Euler with a small, fixed number of substeps per simulator step (four substeps in their implementation). Acceleration and velocity are clipped at each step to prevent numerical instability or unphysical values.

The following are the results from Table III of the original paper:

| Parameter | Symbol | Value |
| --- | --- | --- |
| Virtual mass | $M$ | 0.1 kg |
| Integration damping | $D$ | 2.0 |
| Tracking stiffness | $K_p$ | Derived from $K_p = \tau_{\mathrm{safe}} / 0.05$ |
| Tracking damping | $K_d$ | $2\sqrt{M K_p}$ |
| Time step | $\Delta t$ | Same as the simulation dt = 0.02 s |
| Substeps per simulator step | $N_{\mathrm{sub}}$ | 4 |
| Velocity clip | $\|\dot{x}\|_{\max}$ | 4 m/s |
| Acceleration clip | $\|\ddot{x}\|_{\max}$ | 1000 m/s² |
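
A hedged sketch of this appendix-B integration loop with the Table III parameters (damping term, four substeps per 0.02 s simulator step, velocity and acceleration clipping); holding the forces constant over the substeps is our simplification:

```python
import numpy as np

def integrate_reference(x, v, f_drive, f_interact,
                        M=0.1, D=2.0, dt_sim=0.02, n_sub=4,
                        v_max=4.0, a_max=1000.0):
    """Explicit Euler with substeps and clipping (Appendix B, Table III)."""
    h = dt_sim / n_sub
    for _ in range(n_sub):
        a = (f_drive + f_interact - D * v) / M
        a = np.clip(a, -a_max, a_max)            # acceleration clip
        x = x + h * v                            # explicit Euler: position uses the old velocity
        v = np.clip(v + h * a, -v_max, v_max)    # velocity update and clip
    return x, v
```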

4.2.8. Appendix C. Autonomous Hugging Pipeline

To ensure a comfortable and personalized hugging experience, GentleHumanoid integrates an autonomous, shape-aware pipeline.

  1. Human Body Shape Estimation:

    • The human's location and height are obtained using a motion-capture system with markers.
    • An RGB camera on the G1's head provides input for single-image human shape estimation using a method like BEDLAM [32].
    • The reconstructed body mesh is rescaled to the subject's true height.
    • Waist points ($x'$) are then extracted from the mesh to serve as target contact points for the robot's hands.
  2. Hugging Motion Optimization:

    • The robot's default upper-body motion is optimized so that selected robot links (e.g., hands, elbows) reach the SMPL-derived waist targets.

    • The optimization variables are the upper-body joint angles $\mathbf{q}$ and a planar floating base $\mathbf{r} = (x, y, \psi)$, where the height $z$ is fixed at $z_0$.

    • Let $\mathbf{p}_\ell(\mathbf{q}, \mathbf{r})$ be the forward-kinematics position of link $\ell$. Let $\{\mathbf{b}_k\}$ be the target points on the waist, and $\Pi_{xy}$ denote the xy-projection.

    • The objective function to minimize is:

      $$ \min_{\mathbf{q}, \mathbf{r}} \sum_{(\ell, k) \in S} w_{\ell k} \left\| \mathbf{p}_\ell(\mathbf{q}, \mathbf{r}) - \mathbf{b}_k \right\|^2 + w_t \left\| \Pi_{xy}\big( \mathbf{p}_{\mathrm{torso}}(\mathbf{q}, \mathbf{r}) + \delta \mathbf{f}(\psi) \big) - \Pi_{xy}\big( \mathbf{b}_{\mathrm{front}} \big) \right\|^2 + \lambda_{\mathrm{reg}} \left\| \mathbf{q} - \mathbf{q}_0 \right\|^2 $$

    Where:

    • The first term is a sum over a set $S$ of link-target pairs (e.g., hands to back-waist, elbows to opposite-side waist), with weights $w_{\ell k}$ encoding their relative importance. It minimizes the squared distance between the forward-kinematics position of link $\ell$ and the target point $\mathbf{b}_k$.
    • The second term ensures the torso is properly oriented relative to the human. $\mathbf{p}_{\mathrm{torso}}(\mathbf{q}, \mathbf{r})$ is the torso position, $\delta \approx 5$ cm is a small forward offset, $\mathbf{f}(\psi) = [\cos\psi, \sin\psi, 0]^\top$ denotes the heading, $\Pi_{xy}$ projects the points onto the xy-plane, and $w_t$ is the weight for this term. It minimizes the squared distance between the projected torso position (with offset) and the projected front-waist target $\mathbf{b}_{\mathrm{front}}$.
    • The third term is a regularizer, $\lambda_{\mathrm{reg}} \|\mathbf{q} - \mathbf{q}_0\|^2$, which keeps the solution close to a neutral upper-body pose $\mathbf{q}_0$.
    • The result is an optimized motion sequence that serves as a personalized reference motion for the specific individual (a sketch of this optimization appears at the end of this subsection).
  3. Execution:

    • First, a locomotion policy is used to guide the robot to a stance 10 cm in front of the person with frontal alignment, using motion-capture markers to determine the robot-human relative pose.

    • Once in position, control switches to the GentleHumanoid policy to execute the optimized hug.

      The following figure (Figure 1(c) from the original paper) showcases the autonomous hugging pipeline:

      Fig. 1: Applications of the GentleHumanoid framework: (a) sit-to-stand support; (b) handshaking with a 5 N force limit, allowing the robot's hand to move naturally with the human's; (c) autonomous shape-aware hugging resulting in a comfortable embrace; and (d) balloon handling, showing safe object manipulation where baselines fail. Figure 1(c) specifically shows the autonomous comfortable embrace with shape estimation.
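
A hedged sketch of the pose optimization in step 2 using SciPy's least-squares solver; the forward-kinematics function, link/target sets, and weights are placeholders for the paper's actual implementation:

```python
import numpy as np
from scipy.optimize import least_squares

def solve_hug_pose(fk, pairs, b_front, q0, r0, w_t=1.0, lam_reg=0.1, delta=0.05):
    """pairs: list of (link_name, target_point, weight); fk(q, r, link) -> 3D position.
    Decision vector z = [upper-body joints q, planar base r = (x, y, yaw)]."""
    def residuals(z):
        q, r = z[:-3], z[-3:]
        res = []
        for link, b_k, w in pairs:                        # link-to-waist-target terms
            res.append(np.sqrt(w) * (fk(q, r, link) - b_k))
        psi = r[2]
        heading = np.array([np.cos(psi), np.sin(psi), 0.0])
        torso = fk(q, r, "torso") + delta * heading       # small forward offset along heading
        res.append(np.sqrt(w_t) * (torso[:2] - b_front[:2]))   # torso-facing term (xy only)
        res.append(np.sqrt(lam_reg) * (q - q0))                # stay near the neutral pose
        return np.concatenate(res)

    z0 = np.concatenate([q0, r0])
    return least_squares(residuals, z0).x
```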

4.2.9. Appendix D. Video to Humanoid

The framework also supports converting monocular RGB videos into humanoid motions.

  1. Human Motion Estimation: A phone records monocular RGB videos. PromptHMR [33] is used to estimate the corresponding human motion as an SMPL-X motion sequence.
  2. Retargeting: The estimated motion is then retargeted to the G1 humanoid using GMR.
  3. Execution: The retargeted motion is executed using the trained GentleHumanoid policy. This demonstrates robustness even with noisy reference motions (e.g., foot skating), successfully handling interactions with various objects like pillows, balloons, and baskets of different sizes and deformabilities.

5. Experimental Setup

5.1. Datasets

The paper utilizes diverse human motion datasets to train its policy, covering both human-human and human-object interactions.

  • Source: The AMASS [29], InterX [30], and LAFAN [31] datasets.

  • Processing: These datasets are retargeted to the Unitree G1 humanoid using GMR [28].

  • Filtering: High-dynamic motions that are not typical of interaction scenarios are filtered out.

  • Scale: Approximately 25 hours of data.

  • Sampling Frequency: 50 Hz.

  • Characteristics: These datasets provide a rich source of plausible human movements and interactions, crucial for exposing the RL policy to diverse guiding contact scenarios and ensuring kinematically consistent forces.

    The choice of these datasets is effective for validating the method's performance because they encompass a wide variety of human poses and interactions, enabling the policy to learn how to move naturally and adapt to different contact conditions as if interacting with a real human.

5.2. Evaluation Metrics

The paper does not explicitly list formal mathematical evaluation metrics in the main body. Instead, it relies on quantitative measurements of forces and qualitative observations of robot behavior to assess compliance, safety, and task success. Based on the experimental results, the implied evaluation metrics and their conceptual definitions are:

  1. Peak Contact Force (or Maximum Interaction Force):

    • Conceptual Definition: Quantifies the highest force exerted by the robot during a physical interaction. Lower peak forces indicate safer and gentler interaction, which is critical for human-robot collaboration and handling fragile objects. It's measured in Newtons (N).
    • Mathematical Formula: Not explicitly given, but it is typically the maximum value observed over time from force sensors or gauges: $F_{\mathrm{peak}} = \max_{t} \|\mathbf{f}_{\mathrm{contact}}(t)\|$
    • Symbol Explanation:
      • $F_{\mathrm{peak}}$: The maximum (peak) contact force.
      • $t$: Time.
      • $\mathbf{f}_{\mathrm{contact}}(t)$: The contact force vector measured at time $t$.
      • $\|\cdot\|$: The magnitude (e.g., Euclidean norm) of the force vector.
  2. Force Stability / Consistency:

    • Conceptual Definition: Assesses how consistently the interaction forces are maintained within a desired range or how quickly they stabilize after a perturbation. A stable force profile indicates predictable and controlled behavior, avoiding oscillations or uncontrolled force build-up. This is often observed by analyzing force profiles over time.
    • Mathematical Formula: Not explicitly given, but it could be quantified by metrics such as the standard deviation of the force within a stable phase, or the time to reach stability: $\sigma_F = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left( \|\mathbf{f}_{\mathrm{contact}}(t_i)\| - \bar{F} \right)^2}$
    • Symbol Explanation:
      • $\sigma_F$: Standard deviation of the contact force magnitude.
      • $N$: Number of samples in the stable phase.
      • $t_i$: Time instances within the stable phase.
      • $\bar{F}$: Mean contact force magnitude during the stable phase.
  3. Compliance Level:

    • Conceptual Definition: Measures the robot's ability to yield or deform in response to external forces, often reflected by the force required to displace a robot link by a certain amount. A higher compliance means the robot is "softer" and moves more readily with external forces.
    • Mathematical Formula: Not explicitly given, but implied by the relationship between applied force and resulting displacement, resembling a spring constant: $K_{\mathrm{effective}} = \frac{\Delta F}{\Delta x}$
    • Symbol Explanation:
      • $K_{\mathrm{effective}}$: The effective stiffness, an inverse measure of compliance.
      • $\Delta F$: Change in applied force.
      • $\Delta x$: Resulting change in position/displacement.
  4. Task Success Rate:

    • Conceptual Definition: Binary indicator (or percentage) of whether the robot successfully completes a given task (e.g., holding a balloon without dropping/popping it, maintaining a hug posture, providing sit-to-stand assistance).
    • Mathematical Formula: $\text{Success Rate} = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100\%$
    • Symbol Explanation: Self-explanatory.
  5. Balance Maintenance:

    • Conceptual Definition: Assesses the robot's ability to maintain its stability and upright posture during interactions, especially when external forces are applied or when handling objects. Loss of balance indicates failure.

    • Mathematical Formula: Not explicitly given, but often monitored by center of mass (CoM) position relative to support polygon, or root stability.

      The paper uses commercial force gauges (Mark-10 M5-10) and conformable, customized waist-mounted pressure-sensing pads with 40 calibrated capacitive taxels (tactile sensing elements) to measure contact forces and pressures. For the pressure pads, the effective contact area of each taxel is approximated as 6 mm × 6 mm to compute forces from the recorded pressure values.
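
The implied metrics are straightforward to compute from a logged force trace; a minimal NumPy sketch (the trace layout is an assumption):

```python
import numpy as np

def peak_force(forces):
    """forces: (T, 3) array of contact-force vectors over time -> peak magnitude [N]."""
    return np.linalg.norm(forces, axis=1).max()

def force_std(forces):
    """Sample standard deviation of the force magnitude over a stable phase."""
    return np.linalg.norm(forces, axis=1).std(ddof=1)

def success_rate(outcomes):
    """outcomes: iterable of booleans, one per trial -> percentage of successes."""
    outcomes = list(outcomes)
    return 100.0 * sum(outcomes) / len(outcomes)
```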

5.3. Baselines

The GentleHumanoid framework is compared against two baseline policies that represent different training strategies:

  1. Vanilla-RL:

    • Description: This is an RL-based motion tracking policy trained without any explicit force perturbations. It represents a common approach in prior whole-body tracking methods where the primary objective is to accurately follow a desired motion trajectory, and external forces are implicitly treated as disturbances to be suppressed.
    • Why Representative: It serves as a benchmark for how typical RL motion tracking policies perform in contact-rich scenarios when not specifically designed for compliance. It is expected to be relatively rigid.
  2. Extreme-RL:

    • Description: This is another RL-based motion tracking policy, but trained with end-effector force perturbations of up to 30 N. This baseline represents prior force-adaptive methods that aim to learn robustness to external disturbances, typically focusing on forces applied at the end-effector (e.g., the hand).

    • Why Representative: It tests whether simply training with high force perturbations, particularly at the end-effector, is sufficient to achieve whole-body compliance. It is expected to show some robustness but likely still exhibit rigidity or lack of coordination across the entire upper body.

      These baselines were chosen to clearly demonstrate the advantages of GentleHumanoid's unified spring-based formulation, multi-link compliance, and safety-aware force thresholding compared to policies that are either unaware of contact forces or only address them at the end-effector level.

6. Results & Analysis

6.1. Core Results Analysis

The experimental evaluation of GentleHumanoid was conducted in both simulation and real-world scenarios on the Unitree G1 humanoid, comparing its performance against Vanilla-RL and Extreme-RL baselines. The results consistently highlight GentleHumanoid's superior performance in terms of compliance, safety (reduced peak forces), and naturalness of interaction.

6.1.1. Simulation Results: Hugging Motion with External Pulling Force

To evaluate compliance in simulation, a hugging motion was used, and an external pulling force was simulated to mimic a human trying to disengage from an embrace. The interaction forces on the hand, elbow, and shoulder links were monitored.

The following figure (Figure 4 from the original paper) shows the force profiles:

Fig. 4: Forces applied by different upper-body links under external interaction. Force profiles over time are shown for the right hand (left), right elbow (middle), and right shoulder (right). Compared to the baselines (Vanilla-RL and Extreme-RL), GentleHumanoid maintains lower and more stable force levels across all links, showing safer and more compliant responses during contact.


  • Analysis:
    • Hand Link (Left Panel): GentleHumanoid stabilizes around 10 N, demonstrating controlled compliance. In contrast, Vanilla-RL settles above 20 N and Extreme-RL exceeds 13 N, indicating stiffer, higher-force responses.
    • Elbow and Shoulder Links (Middle and Right Panels): Similar trends are observed. While the baselines quickly saturate at 15–20 N with rigid responses, GentleHumanoid remains bounded near 7–10 N.
  • Conclusion: GentleHumanoid adapts smoothly to external interaction, producing compliant motions with significantly lower and more stable force levels across all monitored links (hand, elbow, shoulder). This suggests a more coordinated and gentle response compared to the baselines, which exhibit overly stiff behavior and higher peak forces.

6.1.2. Real-World Experiments on Unitree G1 Humanoid

Three real-world scenarios were used to evaluate compliance.

6.1.2.1. Static Pose with External Force (Wrist Interaction)

  • Scenario: External forces were applied at the robot's wrist while its base remained static. The ideal behavior is for the arm to yield softly and move with the external force, rather than resisting rigidly.

  • Measurement: A handheld force gauge (Mark-10, M5-10) was used to apply and record peak forces.

  • Results:

    • Baselines (Vanilla-RL, Extreme-RL): Both resisted stiffly. Instead of the arm yielding, the torso often shifted, leading to imbalance. Extreme-RL was particularly rigid, requiring a peak force of 51.14 N, while Vanilla-RL required 24.59 N.
    • GentleHumanoid: Responded smoothly and consistently, requiring much lower forces to reposition the arm while maintaining balance. A key observation was its posture-invariant compliance: the same external force was sufficient to modulate arm position across different configurations. The compliance level matched the user-specified force limit. For example, with the limit set to 10 N, interaction forces remained around that threshold while the robot kept its balance; the effective limits ranged from 5 N to 15 N.
  • Conclusion: GentleHumanoid's tunable force limits and virtual spring-damper dynamics provide a uniform, predictable, and safer interaction experience compared to the stiff and unpredictable responses of the baselines.

    The following figure (Figure 5 from the original paper) compares interaction forces:


Fig. 5: Comparison of interaction forces across policies. Top: GentleHumanoid with tunable force limits, which maintains safe interaction by keeping contact forces within specified thresholds across different postures. Bottom: baseline methods, Vanilla-RL and Extreme-RL, exhibit less consistent compliance, with higher peak forces or oscillatory responses. Force gauge readings (N) are highlighted for clarity. The figure visually confirms that GentleHumanoid (top row) can maintain forces within its set limits (e.g., 10 N or 15 N thresholds) across different arm postures, yielding to the applied force. The baselines (bottom row) show higher peak forces and less controlled responses.

6.1.2.2. Hugging a Mannequin

  • Scenario: The G1 robot executed a hugging motion with a mannequin, under two conditions: proper alignment and deliberate misalignment to assess safety under imperfect contact.

  • Measurement: Pressure-sensing pads attached to the mannequin measured distributed contact forces. GentleHumanoid's \tau_{\mathrm{safe}} was set to 10 N.

  • Results:

    • GentleHumanoid: Maintained bounded and stable forces even under misalignment.
    • Baselines (Vanilla-RL, Extreme-RL): Generated higher, less predictable forces or failed to sustain the motion. Vanilla-RL, especially, produced localized high-pressure peaks.
  • Conclusion: GentleHumanoid demonstrated superior ability to maintain gentle and stable contact during hugging, even with imperfect initial conditions, validating its safety-aware and compliant design.

    The following figure (Figure 6 from the original paper) illustrates the hugging evaluation:


Fig. 6: Evaluation of hugging interactions with and without misalignment. Top: experimental setup with custom pressure-sensing pads and real-time pressure visualization. Middle: pressure maps of peak-force frames for different controllers under correct hugging alignment (left) and misalignment (right). GentleHumanoid maintains moderate contact pressures, while baselines produce localized high-pressure peaks, especially under Vanilla-RL. Bottom: force profiles over time, where GentleHumanoid maintains bounded and stable forces, while baselines exhibit increasing or unstable peaks. The pressure maps (middle row) clearly show that GentleHumanoid distributes pressure more evenly and keeps it within moderate ranges, while baselines, particularly Vanilla-RL, create concentrated high-pressure spots. The force profiles (bottom row) confirm GentleHumanoid's stable and bounded force output versus the increasing or unstable peaks of the baselines.

6.1.2.3. Handling Deformable Objects (Balloons)

  • Scenario: The robot attempted to handle fragile objects, specifically balloons. The challenge is to maintain sufficient force to stabilize the object without applying excessive force that would damage it.

  • Setting: GentleHumanoid's force threshold was set to 5 N.

  • Results:

    • GentleHumanoid: Successfully held the balloon without damage.
    • Baselines (Vanilla-RL, Extreme-RL): Applied excessive pressure, eventually squeezing the balloon until the G1 lost balance and dropped it.
  • Conclusion: This experiment showcases GentleHumanoid's ability to handle delicate tasks requiring fine force control and compliance, where rigid or less-controlled policies fail.

    The following figure (Figure 1(d) from the original paper) shows balloon handling:


Fig. 1: GentleHumanoid applications: (a) sit-to-stand support; (b) handshaking with a 5 N force limit, allowing the robot's hand to move naturally with the human's; (c) autonomous shape-aware, comfortable embrace; and (d) balloon handling, showing safe object manipulation where baselines fail. Figure 1(d) demonstrates GentleHumanoid successfully manipulating a balloon, while the text indicates baselines failed this task.

6.1.3. More Applications

GentleHumanoid enables several applications where compliance is critical:

  • Locomotion Teleoperation: Integration with a locomotion teleoperation framework for the Unitree G1 allows users to control walking and trigger pre-defined compliant reference motions like hugging, sit-to-stand assistance, and object handling via joystick. This suggests potential for future full-body teleoperation (e.g., like TWIST [8]) in healthcare and assistive scenarios, ensuring safe interactions under direct physical contact.
  • Autonomous Shape-Aware Hugging Pipeline: The framework was integrated with vision-based human shape estimation (using BEDLAM [32] and motion capture for height) to customize hugging positions for individuals of different body shapes. It extracts waist points from a reconstructed body mesh to optimize the humanoid's hugging motion, aligning its hands with these target locations. This pipeline generates stable and comfortable hugging motions for participants of varying builds.
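The following is a minimal sketch of the waist-point extraction step described in the last bullet, under stated assumptions: it takes a reconstructed body mesh (e.g., vertices of an SMPL-style mesh from BEDLAM) and a measured height, and returns two hand-alignment targets. The waist heuristic (waist_ratio, band_halfwidth_m) and coordinate conventions are hypothetical and not taken from the paper.

```python
import numpy as np

def waist_hand_targets(vertices: np.ndarray, person_height_m: float,
                       waist_ratio: float = 0.6, band_halfwidth_m: float = 0.03):
    """Pick waist-level points from a reconstructed body mesh and derive left/right
    hand targets for a hugging motion.

    vertices: (V, 3) mesh vertices with +z up, z = 0 at the floor, +y to the person's left.
    waist_ratio and band_halfwidth_m are hypothetical heuristics, not values from the paper.
    """
    waist_z = waist_ratio * person_height_m
    band = vertices[np.abs(vertices[:, 2] - waist_z) < band_halfwidth_m]
    if band.shape[0] == 0:
        raise ValueError("No mesh vertices found in the waist band; widen the band.")
    center = band.mean(axis=0)
    # Outermost waist points on each side become the hand alignment targets.
    left_side = band[band[:, 1] >= center[1]]
    right_side = band[band[:, 1] < center[1]]
    left_target = left_side[np.argmax(left_side[:, 1])]
    right_target = right_side[np.argmin(right_side[:, 1])]
    return left_target, right_target
```

In practice, the targets would then be fed to the motion optimization that adapts the nominal hugging trajectory so the hands align with them.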

6.2. Data Presentation (Tables)

The following are the results from Table I of the original paper:

| Reward | Weight |
| --- | --- |
| **Compliance** | |
| Reference Dynamics Tracking | 2.0 |
| Reference Force Tracking | 2.0 |
| Unsafe Force Penalty | 6.0 |
| **Motion Tracking** | |
| Root Tracking | 0.5 |
| Joint Tracking | 1.0 |
| **Locomotion Stability** | |
| Survival | 5.0 |
| Feet Air Time | 10.0 |
| Impact Force | 4.0 |
| Slip Penalty | 2.0 |
| Action Rate | 0.1 |
| Joint Velocity | 5.0e-4 |
| Joint Limit | 1.0 |

This table details the weights assigned to various reward terms during Reinforcement Learning training. For Compliance, Unsafe Force Penalty has the highest weight (6.0), emphasizing safety. Reference Dynamics Tracking and Reference Force Tracking both have a weight of 2.0, ensuring the policy aligns with the designed compliant behavior and force regulation. Other significant weights include Feet Air Time (10.0) for locomotion stability and Survival (5.0) for basic task completion.
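As a small illustration of how such weights are typically combined, the sketch below stores the Table I weights in a config and computes a weighted-sum step reward. The dictionary keys and the placeholder term values are illustrative; they are not the paper's reward implementation.

```python
# Illustrative reward weights mirroring Table I; the term values passed in would come
# from the environment's own (unspecified) per-term reward functions.
REWARD_WEIGHTS = {
    "reference_dynamics_tracking": 2.0,
    "reference_force_tracking": 2.0,
    "unsafe_force_penalty": 6.0,
    "root_tracking": 0.5,
    "joint_tracking": 1.0,
    "survival": 5.0,
    "feet_air_time": 10.0,
    "impact_force": 4.0,
    "slip_penalty": 2.0,
    "action_rate": 0.1,
    "joint_velocity": 5.0e-4,
    "joint_limit": 1.0,
}


def total_reward(term_values: dict) -> float:
    """Weighted sum of per-term rewards/penalties for one environment step."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in term_values.items())
```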

The following are the results from Table II of the original paper:

| Parameter | Symbol | Typical value / range |
| --- | --- | --- |
| Max per-link force cap | F_max | 30 N |
| Safety threshold (per link) | τ_safe(t) | 5–15 N (default 10 N) |
| Net force limit (about torso) | T_F | 30 N |
| Net torque limit (about torso) | T_M | 20 N·m |
| Interaction spring gain | K_spring(t) | 5–250 |

Table II lists the external force application parameters used in simulation. The max per-link force cap is 30 N, and the safety threshold τ_safe is dynamic, ranging from 5 N to 15 N (with a default of 10 N). The net force and net torque limits about the torso are 30 N and 20 N·m respectively, ensuring overall stability. The interaction spring gain K_spring(t) varies widely, from 5 to 250. These parameters define the range and characteristics of the simulated interactions.
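A minimal sketch of how these ranges might be used during training is shown below: interaction parameters are randomized per episode and per-link virtual forces are clipped to the cap. The ranges come from Table II, but the sampling scheme itself is an assumption.

```python
import numpy as np

# Ranges taken from Table II; the per-episode sampling scheme is an assumption.
F_MAX_PER_LINK = 30.0          # max per-link force cap (N)
TAU_SAFE_RANGE = (5.0, 15.0)   # per-link safety threshold (N), default 10 N
K_SPRING_RANGE = (5.0, 250.0)  # interaction spring gain
NET_FORCE_LIMIT = 30.0         # net force limit about the torso (N)
NET_TORQUE_LIMIT = 20.0        # net torque limit about the torso (N*m)


def sample_interaction_params(rng: np.random.Generator) -> dict:
    """Draw per-episode interaction parameters from the Table II ranges."""
    return {
        "tau_safe": rng.uniform(*TAU_SAFE_RANGE),
        "k_spring": rng.uniform(*K_SPRING_RANGE),
    }


def cap_link_force(force: np.ndarray) -> np.ndarray:
    """Scale a per-link virtual interaction force down to the 30 N cap if needed."""
    norm = np.linalg.norm(force)
    return force if norm <= F_MAX_PER_LINK else force * (F_MAX_PER_LINK / norm)
```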

The following are the results from Table III of the original paper:

| Parameter | Symbol | Value |
| --- | --- | --- |
| Virtual mass | M | 0.1 kg |
| Integration damping | D | 2.0 |
| Tracking stiffness | K_p | Derived from K_p = τ_safe / 0.05 |
| Tracking damping | K_d | 2√(M K_p) |
| Time step | Δt | Same as simulation dt = 0.02 s |
| Substeps per simulator step | N_sub | 4 |
| Velocity clip | ‖ẋ‖_max | 4 m/s |
| Acceleration clip | ‖ẍ‖_max | 1000 m/s² |

Table III provides the reference dynamics and integration parameters. The virtual mass M is 0.1 kg, and an integration damping D of 2.0 is used for stability. The tracking stiffness K_p is derived from the safety threshold τ_safe, while the tracking damping K_d is set to the critical value 2√(M K_p). The simulation time step Δt is 0.02 s with N_sub = 4 substeps per simulator step. Velocity and acceleration are clipped at 4 m/s and 1000 m/s² respectively to maintain numerical stability.
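To make the interplay of these parameters concrete, here is a sketch of a generic mass-spring-damper reference-dynamics integrator using the Table III constants. The update rule (including how the integration damping D enters) is an assumption, not the paper's exact implementation.

```python
import numpy as np

# Constants mirroring Table III; the update rule below is a generic sketch.
M = 0.1          # virtual mass (kg)
D = 2.0          # integration damping (treated here as extra velocity damping)
DT = 0.02        # simulator time step (s)
N_SUB = 4        # substeps per simulator step
V_MAX = 4.0      # velocity clip (m/s)
A_MAX = 1000.0   # acceleration clip (m/s^2)


def step_reference_dynamics(x, v, x_ref, f_ext, tau_safe):
    """Advance a compliant per-link reference position by one simulator step.

    x, v     : current reference position and velocity, shape (3,)
    x_ref    : nominal target from the motion clip, shape (3,)
    f_ext    : external / virtual interaction force on the link (N), shape (3,)
    tau_safe : per-link safety threshold (N), which sets the tracking stiffness
    """
    kp = tau_safe / 0.05          # tracking stiffness, per Table III
    kd = 2.0 * np.sqrt(M * kp)    # critical tracking damping
    h = DT / N_SUB
    for _ in range(N_SUB):
        a = (kp * (x_ref - x) - kd * v - D * v + f_ext) / M
        a = np.clip(a, -A_MAX, A_MAX)              # acceleration clip
        v = np.clip(v + h * a, -V_MAX, V_MAX)      # velocity clip
        x = x + h * v
    return x, v
```

Note how a smaller τ_safe directly lowers K_p, making the reference yield more readily to the same external force, which is the mechanism behind the task-adjustable compliance.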

6.3. Ablation Studies / Parameter Analysis

The paper does not present explicit ablation studies where specific components of GentleHumanoid (e.g., the unified spring formulation, safety thresholding, or specific reward terms) are individually removed or altered to assess their impact. Instead, the analysis focuses on comparing the full GentleHumanoid framework against the Vanilla-RL (no force perturbations) and Extreme-RL (end-effector force perturbations) baselines. This comparison implicitly serves as a form of ablation by demonstrating the comprehensive benefits of the proposed approach over policies lacking its key features.

The impact of key hyperparameters is demonstrated through the tunable force limits (5 N to 15 N) and interaction spring gains (5 to 250), which are randomized during training and adjusted at deployment. These parameters directly affect the robot's compliance and safety, showcasing how the policy adapts to varying interaction requirements. The figures illustrate how GentleHumanoid respects these tunable force limits more effectively than the baselines, which often exceed them or fail to maintain stable contact.

7. Conclusion & Reflections

7.1. Conclusion Summary

GentleHumanoid represents a significant advancement in humanoid robot control, particularly for contact-rich interactions in human-centered environments. The paper successfully demonstrates a framework that integrates impedance control into a whole-body motion-tracking policy to achieve upper-body compliance. A core innovation is the unified spring-based formulation for modeling resistive and guiding contacts, which, by sampling from human motion datasets, ensures kinematically consistent and diverse interaction forces across multiple links (shoulder, elbow, wrist). The safety-aware force thresholding mechanism further guarantees that interactions remain within comfortable and safe limits.

Through comprehensive evaluation in both simulation and on the Unitree G1 humanoid robot, GentleHumanoid consistently outperforms baseline methods. It achieves significantly reduced peak contact forces, smoother interactions, and maintains task success in various scenarios, including gentle hugging, sit-to-stand assistance, and delicate object manipulation (e.g., balloon handling). These findings underscore the framework's potential to enable humanoids to safely and effectively collaborate with humans and handle objects in real-world applications.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  1. Motion Data Constraints: The current human motion data (AMASS, InterX, LAFAN) used to maintain kinematic consistency constrains the force distribution. For example, forces applied to the shoulder might be relatively small due to limited variation in recorded motions.

    • Future Work: Incorporate more diverse motion datasets, such as dancing, to improve coverage and expand the range of force distributions.
  2. Interaction Modeling Fidelity: The interaction modeling relies on simulated spring forces. While providing structured coverage and kinematic consistency, this approach does not fully capture the complexity of real human contact, such as frictional effects or the viscoelastic properties of human tissue.

    • Future Work: Develop more sophisticated interaction models that account for these real-world complexities.
  3. Sim-to-Real Discrepancies and Force Regulation Precision: Despite the safety-aware policy constraining interaction forces, real-world experiments revealed occasional overshoots of 1–3 N due to sim-to-real discrepancies.

    • Future Work: Integrate additional tactile sensing for more precise force regulation and robust sim-to-real transfer.
  4. Human Localization and Height Acquisition: Currently, human localization and height are obtained from a motion capture system.

    • Future Work: Replace the motion capture system with a vision-based pipeline to improve autonomy and practicality, especially for long-horizon tasks.
  5. Long-Horizon Interactions: The current evaluation focuses on shorter, more defined interaction tasks.

    • Future Work: Extend evaluations to long-horizon interactions where the humanoid must dynamically adapt its motion to human partners' behaviors, potentially by integrating richer sensing and general perception and reasoning systems like vision-language models.

7.3. Personal Insights & Critique

This paper presents a highly relevant and timely contribution to humanoid robotics, addressing a critical gap in enabling truly safe and natural human-robot physical interaction. The integration of impedance control with RL is a powerful approach, leveraging the strengths of both classical control (predictable force response) and modern RL (adaptability to diverse motions).

A key insight is the unified spring-based formulation for resistive and guiding contacts. This creative solution to generate diverse, kinematically consistent interaction forces for RL training is particularly clever, bypassing the limitations of noisy physics engine contacts and the difficulty of real-world data collection. The use of human motion datasets to ground these virtual interactions in plausible human-like movements is also a strong point.

The concept of task-adjustable force thresholds is crucial for practical deployment. It moves beyond a one-size-fits-all approach to safety, allowing the robot to be appropriately gentle or firm based on the task context, directly addressing comfort and safety requirements in different scenarios.

One potential area for deeper exploration, not fully covered by the paper's baselines, might be comparing GentleHumanoid with model-based impedance control for whole-body tasks. While RL offers generalization advantages, a rigorous comparison of control performance and sim-to-real gap with optimized model-based approaches could provide further insights into the specific benefits of the RL component for compliance.

The paper's method, particularly the unified spring-based interaction model and safety-aware thresholding, could be transferred or applied to other domains beyond humanoids. For instance, in mobile manipulation with legged robots or industrial robotic arms interacting with human co-workers or delicate assembly parts, similar principles of multi-link, safety-aware compliance could enhance safety and versatility. The autonomous hugging pipeline also highlights the potential for personalized physical assistance in elderly care or rehabilitation, suggesting a broader social impact for this research.

A potential issue or area for improvement, as hinted by the authors, is the current reliance on simulated spring forces for interaction modeling. While effective for training, its generalization to highly complex, dynamic, and textured real-world contacts (e.g., highly deformable objects with varying friction, or interacting with unpredictable human movements) might still present challenges. Future work on incorporating richer sensing (e.g., distributed tactile sensors on the robot's skin, not just on external pads) and real-time force adaptation based on these inputs would be critical for bridging this gap. The vision-based human pose estimation for autonomous tasks is also a promising direction, but the robustness of these systems in diverse lighting and occlusions will be key for true autonomy.
