Learning Human-Humanoid Coordination for Collaborative Object Carrying
TL;DR Summary
The COLA method enables effective human-humanoid collaboration in complex carrying tasks using proprioception-only reinforcement learning. It implicitly predicts object motion and human intent, achieving a 24.7% reduction in human effort while maintaining stability, validated across various terrains and objects in both simulation and real-world experiments.
Abstract
Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is Learning Human-Humanoid Coordination for Collaborative Object Carrying. The title signifies an approach that uses learning, specifically reinforcement learning, to enable humanoid robots to collaborate effectively with humans in tasks involving carrying objects. The emphasis is on coordination and compliance between the human and the humanoid.
1.2. Authors
The authors are:
- Yushi Du (Equal contribution, corresponding author) - Department of Electrical and Electronic Engineering, The University of Hong Kong; School of Computer Science and Technology, Beijing Institute of Technology
- Yixuan Li (Equal contribution) - School of Computer Science and Technology, Beijing Institute of Technology; Yuanpei College, Peking University
- Baoxiong Jia (Equal contribution, corresponding author) - School of Computer Science and Technology, Beijing Institute of Technology
- Yutang Lin - Yuanpei College, Peking University
- Pei Zhou - Department of Electrical and Electronic Engineering, The University of Hong Kong
- Wei Liang - School of Computer Science and Technology, Beijing Institute of Technology
- Yanchao Yang (Corresponding author) - Department of Electrical and Electronic Engineering, The University of Hong Kong
- Siyuan Huang
The affiliations suggest a collaborative effort between multiple institutions, with researchers from computer science, electrical engineering, and potentially other related fields. The presence of multiple corresponding authors indicates a significant collaborative research project.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. As an arXiv preprint, it has not yet undergone formal peer review, but it is common for research papers to be shared on arXiv before or during the review process for conferences or journals. The publication year is listed as 2025, which implies it's a forthcoming or very recent publication.
1.4. Publication Year
The paper was published at (UTC) 2025-10-16T04:36:25.000Z, indicating a publication year of 2025.
1.5. Abstract
The paper addresses the challenge of human-humanoid collaboration for collaborative object carrying, an area that has seen limited exploration for humanoids due to their complex whole-body dynamics, despite progress in compliant robot-human collaboration for robotic arms. The authors propose COLA, a proprioception-only reinforcement learning approach that integrates leader and follower behaviors into a single policy. The model is trained in a closed-loop environment with dynamic object interactions to implicitly predict object motion patterns and human intentions, facilitating compliant collaboration and load balance through coordinated trajectory planning.
Evaluations, including comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrate the approach's effectiveness, generalization, and robustness across diverse terrains and objects. Simulation results show a 24.7% reduction in human effort compared to baselines, while maintaining object stability. Real-world experiments confirm robust carrying for various object types (e.g., boxes, desks, stretchers) and movement patterns (e.g., straight-line, turning, slope climbing). Human user studies with 23 participants reported an average 27.4% improvement over baseline models. A key advantage is that COLA achieves compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, making it a practical solution for real-world deployment.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.14293
- PDF Link: https://arxiv.org/pdf/2510.14293v1.pdf

The paper is available as a preprint on arXiv.org.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
The core problem the paper addresses is the significant challenge of enabling humanoid robots to collaborate effectively and compliantly with humans, particularly in tasks like collaborative object carrying. While human-robot collaboration for robotic arms has advanced, extending this to humanoids is complex due to their intricate whole-body dynamics.
2.1.2. Importance and Gaps in Prior Research
Human-humanoid collaboration holds immense promise for various applications such as healthcare, domestic assistance, and manufacturing. However, current humanoid advancements in locomotion, teleoperation, and manipulation haven't translated well into effective collaboration. Existing human-humanoid collaboration methods often rely on model-based approaches or heuristic rules, which predefine subtasks or focus on limited-scope interactions like predicting horizontal velocity from haptic cues. These approaches generally neglect whole-body coordination capabilities and lack the ability to perform complex, dynamic collaborative tasks (e.g., picking up objects from the ground, carrying objects on slopes). They also struggle with:
- Adapting to diverse environments (e.g., maintaining object stability on varied terrains).
- Responding compliantly to human motions (e.g., standing up together), often without direct force sensing.
- Dynamically allocating roles (leader/follower) for efficiency.

The interdependency of these requirements makes collaborative carrying a particularly difficult task for humanoids.
2.1.3. Paper's Entry Point or Innovative Idea
The paper's innovative idea is to propose COLA, a proprioception-only reinforcement learning approach that learns human-humanoid coordination for collaborative object carrying. It addresses the limitations of previous work by:
- Unifying Leader and Follower Behaviors: Integrating both roles into a single policy, allowing for flexible role switching.
- Proprioception-Only Learning: Relying solely on the robot's internal proprioceptive feedback (joint positions, velocities, root orientation) for real-world deployment, eliminating the need for external sensors or complex interaction models.
- Implicit Prediction of Human Intentions and Object Dynamics: Training in a closed-loop environment with dynamic object interactions allows the model to implicitly predict object motion patterns and human intentions.
- Leveraging Key Insights:
  - Offsets between joint states and their targets serve as a proxy for estimating interaction forces.
  - The carried object's state encodes implicit collaboration constraints such as stability and coordination.
- Three-Step Training Framework: Utilizing a teacher-student framework where a teacher policy (with privileged information) guides a student policy (purely proprioceptive) for practical deployment.
2.2. Main Contributions / Findings
2.2.1. Primary Contributions
The primary contributions of the paper can be summarized as:
- Unified Residual Model for Whole-Body Collaboration: Proposing COLA, a proprioception-only residual model that enables compliant, coordinated, and generalizable whole-body collaborative carrying across diverse movement patterns.
- Three-Step Training Framework and Closed-Loop Environment: Developing a novel three-step training framework and a closed-loop training environment that explicitly models humanoid-object interactions, allowing the robot to implicitly learn object movements and assist humans through compliant collaboration.
- Demonstrated Effectiveness, Generalization, and Robustness: Validating the proposed policy through extensive simulation and real-world experiments, showing superior effort reduction and trajectory coordination compared to baseline approaches.
- Practical Solution for Real-World Deployment: Demonstrating that the method operates without external sensors or complex interaction models, making it suitable for practical deployment.
2.2.2. Key Conclusions and Findings
The key conclusions and findings include:
- Significant Human Effort Reduction: Simulation experiments show a 24.7% reduction in human effort (31.47% in another mention) compared to baselines, while maintaining object stability. This directly addresses the goal of easing the human partner's burden.
- Precise Coordination and Trajectory Tracking: The method achieves low linear velocity tracking error (0.102 m/s) and angular tracking error (0.098 rad/s) relative to human motion, indicating precise coordination.
- Robustness and Generalization: Real-world experiments validate robust collaborative carrying across diverse object types (e.g., boxes, desks, stretchers) and movement patterns (e.g., straight-line, turning, slope climbing), demonstrating the model's versatility.
- Implicit Intention Learning: The model implicitly learns to interpret human intentions through simple pushing and pulling actions, eliminating the need for explicit commands or remote controls.
- Compliance to External Forces: The model demonstrates compliant behavior, responding appropriately to external forces for movement initiation (e.g., moving when the applied force exceeds 15 N) and to vertical disturbances, showcasing agile full-body motions.
- Positive User Experience: Human user studies with 23 participants confirmed an average 27.4% improvement in compliance and height tracking compared to baseline models, validating its practical effectiveness and user acceptance.
- Effectiveness of Architecture: The residual teacher policy and distillation training are crucial for effective and compliant collaboration, outperforming end-to-end MLP and Transformer baselines. A compact MLP-based student policy is found to be more effective than a Transformer because it adapts more promptly to human movements.

These findings collectively demonstrate that COLA offers a practical and effective solution for enabling compliant human-humanoid collaborative carrying, addressing critical challenges in human-robot interaction.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp several foundational concepts in robotics, machine learning, and control theory.
3.1.1. Humanoid Robots
Humanoid robots are robots designed to resemble the human body, typically with a torso, head, two arms, and two legs. This morphology allows them to operate in human-centric environments and perform tasks requiring human-like mobility and manipulation. Their whole-body dynamics are complex because they are underactuated (have fewer actuators than degrees of freedom in certain movements) and high-dimensional, making stable locomotion and manipulation challenging, especially when interacting with the environment or humans.
3.1.2. Reinforcement Learning (RL)
Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
- Agent: The decision-maker (e.g., the humanoid robot).
- Environment: The world the agent interacts with (e.g., the physical space, objects, human partner).
- State: The current situation of the agent and environment (e.g., robot's joint angles, object's position, human's velocity).
- Action: A decision made by the agent (e.g., adjusting joint torques, changing speed).
- Reward: A scalar feedback signal from the environment that indicates the desirability of the agent's actions. The agent's goal is to learn a policy – a mapping from states to actions – that maximizes the total expected reward over time.
- Policy: The strategy that the agent uses to determine its next action based on the current state. A minimal interaction loop tying these pieces together is sketched below.
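To make the loop concrete, here is a minimal, generic agent-environment rollout in Python. It is not the paper's training code; `env` and `policy` are hypothetical stand-ins with a Gym-style interface, and the discount factor is illustrative.

```python
import numpy as np

def rollout(env, policy, num_steps=1000, gamma=0.99):
    """Minimal agent-environment interaction loop.

    `env` is any object with reset()/step(action) in the Gym style,
    and `policy` maps a state vector to an action vector.
    """
    state = env.reset()
    total_return, discount = 0.0, 1.0
    for _ in range(num_steps):
        action = policy(state)                 # policy: state -> action
        state, reward, done, _info = env.step(action)
        total_return += discount * reward      # accumulate discounted reward
        discount *= gamma
        if done:
            break
    return total_return
```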
3.1.3. Proprioception
Proprioception refers to the robot's internal sense of its own body's position, movement, and force. In robotics, this typically includes data from:
- Joint encoders: Measuring the angles and velocities of the robot's joints.
- Inertial Measurement Units (IMUs): Measuring orientation and angular velocity (e.g., root orientation, gravity vector).

Proprioception-only means the robot relies solely on these internal senses, without external sensors such as cameras (vision), LiDAR, or force/torque sensors at the end-effectors, for perceiving its environment and interacting with objects/humans. This is crucial for simplifying real-world deployment and reducing sensor dependence; a sketch of assembling such an observation is shown below.
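As an illustration of what "proprioception-only" input looks like in practice, the sketch below stacks a short history of internal measurements into a single observation vector. The class and field names are hypothetical; COLA's exact observation layout is described in Section 4.2.2.

```python
import numpy as np
from collections import deque

class ProprioBuffer:
    """Keeps a short history of proprioception-only measurements."""

    def __init__(self, history_len=25):
        self.frames = deque(maxlen=history_len)

    def add(self, joint_pos, joint_vel, root_quat, gravity, prev_action):
        # One frame: everything the robot can sense about its own body.
        frame = np.concatenate([joint_pos, joint_vel, root_quat, gravity, prev_action])
        self.frames.append(frame)

    def observation(self):
        # Flattened history, e.g. the input to a proprioception-only policy.
        return np.concatenate(list(self.frames))
```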
3.1.4. Whole-Body Control (WBC)
Whole-Body Control is a control strategy for complex robots (like humanoids) that coordinates all of the robot's joints and effectors simultaneously to achieve a desired task while respecting physical constraints (e.g., balance, joint limits). It contrasts with controlling individual limbs or joints in isolation. In the context of this paper, a WBC policy manages the robot's entire body to achieve both locomotion (movement) and manipulation (object handling) commands.
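For intuition, here is a minimal sketch of how a learned whole-body policy is typically wired to low-level PD position control at the joints (the same general structure COLA's Step 1 uses). The gains, function names, and the assumption of a plain `policy` callable are illustrative, not the paper's implementation.

```python
import numpy as np

def pd_torques(q_target, q, q_dot, kp=40.0, kd=1.0):
    """PD position control: motors drive joints toward the policy's target positions."""
    return kp * (q_target - q) - kd * q_dot

def wbc_step(policy, goal_command, proprio_history, q, q_dot):
    """One control step of a whole-body controller.

    `policy` maps [goal command, proprioceptive history] -> target joint positions,
    which are then tracked by the low-level PD loop above.
    """
    obs = np.concatenate([goal_command, proprio_history])
    q_target = policy(obs)              # action = target joint positions
    return pd_torques(q_target, q, q_dot)
```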
3.1.5. Compliance
In robotics, compliance refers to a robot's ability to yield or adapt to external forces or positional changes from its environment or interaction partners (e.g., humans). A compliant robot can absorb impacts and move naturally with a human, making interaction safer and more intuitive, as opposed to a stiff, position-controlled robot that resists any deviation from its programmed path. This is often achieved through impedance control or force control, where the robot's response to force or position errors is tuned.
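For intuition, a classical impedance-control law is sketched below: the commanded motion yields in proportion to the sensed external force, which is what makes the behavior compliant. This is a textbook illustration only; COLA achieves compliance through learning rather than an explicit impedance controller.

```python
import numpy as np

def impedance_control(x, x_desired, v, v_desired, f_ext,
                      stiffness=200.0, damping=20.0, mass=2.0):
    """Classic impedance law: the commanded acceleration yields to external force.

    Lower stiffness/damping -> more compliant behaviour; f_ext is the sensed
    or estimated interaction force acting on the end-effector.
    """
    spring = stiffness * (x_desired - x)
    damper = damping * (v_desired - v)
    accel_cmd = (spring + damper + f_ext) / mass
    return accel_cmd
```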
3.1.6. Residual Learning
Residual learning is a technique where a model learns to predict a residual (difference) from an existing baseline or simpler model, rather than learning the entire output from scratch. This can simplify the learning task for complex functions, as the residual might be easier to learn than the complete function. In this paper, a residual teacher policy learns to make corrective adjustments (residual actions) on top of a pre-trained Whole-Body Control (WBC) policy.
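A minimal sketch of the kind of residual composition COLA uses later (base WBC action plus a learned corrective term). The two policies are placeholders for trained networks; names are illustrative.

```python
import numpy as np

def collaborative_action(wbc_policy, residual_policy, wbc_obs, priv_obs):
    """Residual learning: the learned module only outputs a correction on top of the base action."""
    a_wbc = wbc_policy(wbc_obs)                                    # base whole-body action
    a_res = residual_policy(np.concatenate([wbc_obs, priv_obs]))   # small corrective term
    return a_wbc + a_res
```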
3.1.7. Teacher-Student Framework (Knowledge Distillation)
Knowledge distillation is a model compression technique where a smaller, simpler model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher). The teacher model typically has superior performance and might use more information (e.g., privileged information like ground truth object states). The student model is then deployed because it is more efficient and can operate with fewer inputs (e.g., proprioception-only). Behavioral cloning is a method used for distillation where the student learns by minimizing the difference between its outputs and the teacher's outputs for the same inputs.
3.1.8. Closed-Loop Environment
A closed-loop environment in simulation or control means that the output of the system feeds back as an input, creating a continuous feedback loop. In this paper, it means the robot's actions affect the object's state, and the object's state (along with human actions) influences the robot's subsequent decisions, creating dynamic, interactive learning.
3.2. Previous Works
The paper extensively references prior research in humanoid robot development and human-robot collaboration.
3.2.1. Humanoid Robot Development
Recent years have seen significant progress in:
- Agile Locomotion: Humanoids are learning to walk, run, and navigate complex terrains (e.g., [8, 14, 28, 33]). For instance, Styleloco [14] uses generative adversarial distillation for natural humanoid robot locomotion, and Humanoid Parkour Learning [33] explores dynamic movements.
- Teleoperation: Humans can control humanoids remotely for various tasks (e.g., [12, 26]). Clone [12] focuses on closed-loop whole-body humanoid teleoperation for long-horizon tasks, and Twist [26] is a teleoperated whole-body imitation system.
- Dexterous Manipulation: Humanoids are becoming more capable of handling objects with their hands (e.g., [19, 27]). Mimicdroid [19] focuses on in-context learning for humanoid robot manipulation from human play videos, and [27] explores generalizable humanoid manipulation with 3D diffusion policies.

These advancements highlight the growing capabilities of humanoids but often focus on individual skills rather than integrated collaboration.
3.2.2. Robot-Human Collaboration (General)
Robot-human collaboration is a long-standing research area [5, 17, 18, 25, 31, 32], but much of it has focused on robotic arms in confined workspaces.
- Robotic Arms: Compliant robot-human collaboration has been extensively developed for robotic arms, where force sensing and impedance control are often used to ensure safe and adaptable interaction [10, 11]. For instance, Impedance Learning-based Adaptive Force Tracking [11] focuses on robots on unknown terrains, and Learning Physical Collaborative Robot Behaviors from Human Demonstrations [18] explores learning from human examples.
- Intent Recognition: Predicting human intention is crucial for effective collaboration (e.g., [6, 13, 15, 16, 31]). Hybrid Recurrent Neural Network [6] and Multi-modal Policy Learning [13] are examples of intention recognition for human-robot collaboration. Robot Reading Human Gaze [16] highlights the importance of cues like eye tracking, and Closed-loop Open-vocabulary Mobile Manipulation [31] uses models like GPT-4V for intent.
3.2.3. Human-Humanoid Collaboration (Specific to Carrying)
Previous work on human-humanoid collaboration, particularly for object carrying, is limited:
- Model-Based Approaches: Some methods use heuristic rules or predefined subtasks (e.g., [1, 2, 17]). For example, [1] and [2] explore collaborative human-humanoid carrying using vision and haptic sensing, often breaking down tasks into basic walking patterns and primitive behaviors.
- Limited-Scope Learning: H2-COMPACT [3] proposes a learning-based model using haptic cues to predict horizontal velocity commands, but its scope is restricted.
- Force-Aware Control: While force regulation is crucial [11] and compliant control has been demonstrated in contact-rich manipulation [4, 24, 30], explicit force estimation for human-humanoid collaboration remains underexplored. FACET [24] focuses on force-adaptive control for legged robots, and Learning Unified Force and Position Control [30] addresses loco-manipulation.
3.2.4. Environment-Conditioned Locomotion
Prior work on environment-conditioned locomotion [14, 29, 33] has shown how robots can adapt their movement to different terrains. Falcon [29] focuses on force-adaptive humanoid loco-manipulation. While relevant, these works often don't fully integrate the challenges of dynamic human interaction and object stability during collaborative tasks.
3.3. Technological Evolution
The field has evolved from focusing on individual robotic capabilities (locomotion, manipulation, teleoperation) to increasingly complex interactions. Early human-robot collaboration for arms often relied on explicit programming or detailed force/position sensing. With humanoids, the complexity of whole-body dynamics necessitated model-based control or simpler heuristic rules for collaboration due to the difficulty of integrating all aspects. The rise of reinforcement learning has allowed for more adaptive and data-driven approaches, moving beyond explicit modeling to implicitly learning complex interaction dynamics. This paper's work represents a step in this evolution by:
- Leveraging advanced RL for whole-body control.
- Moving from explicit force sensing to implicit force estimation via proprioception.
- Integrating leader/follower roles within a single policy.
- Addressing whole-body coordination for complex collaborative carrying tasks, which was previously a gap.
3.4. Differentiation Analysis
Compared to the main methods in related work, COLA offers several core innovations:
- Whole-Body Coordination vs. Partial Coordination: Unlike prior human-humanoid collaboration methods [1, 2, 3, 17, 32] that neglect whole-body coordination or focus on limited aspects (e.g., horizontal velocity, predefined subtasks), COLA specifically enables whole-body coordination. This allows for complex tasks like picking objects from the ground or climbing slopes while carrying.
- Proprioception-Only for Real-World Deployment: Many force-aware control methods [11, 24, 30] rely on explicit force estimation using dedicated force/torque sensors. COLA differentiates itself by achieving compliant collaboration using proprioception-only inputs. It implicitly estimates interaction forces through joint state offsets, making it more practical for real-world deployment by reducing sensor requirements and complexity.
- Implicit Learning of Intentions and Object Dynamics: Instead of relying on multi-modal data or explicit intention prediction models [6, 13, 15, 16], COLA learns human intentions and object motion patterns implicitly within a closed-loop environment. This allows the robot to adapt its collaboration strategy in real time, which is difficult to encode with manually designed commands.
- Unified Leader/Follower Policy: Previous works often separate leader and follower roles or require explicit commands. COLA integrates both behaviors within a single policy, controlled by a simple velocity command (zero velocity for following), allowing for flexible role switching.
- Robustness and Generalization: By training in a closed-loop environment that explicitly models humanoid-object interactions and dynamic object interactions, COLA demonstrates superior generalization across diverse terrains, objects, and movement patterns compared to baselines. This addresses the challenge of adapting to diverse environments, a common limitation in prior work.

In essence, COLA moves beyond single-constraint solutions to integrate force interactions, implicit constraints, and dynamic coordination into a coherent framework for humanoid collaborative carrying, bridging the gap between advanced humanoid capabilities and practical, compliant human-humanoid collaboration.
4. Methodology
4.1. Principles
The core idea behind COLA is to leverage reinforcement learning to enable a humanoid robot to collaborate compliantly with a human partner for object carrying. The method is built on two key principles:
- Proxy for Interaction Forces: Offsets between joint states and their targets (i.e., the difference between the desired joint position/velocity and the actual one) serve as an implicit proxy for estimating interaction forces. This allows the robot to infer how much force is being applied by the human or object without needing dedicated force/torque sensors.
- Object State as Collaboration Constraints: The state of the carried object (e.g., its position, orientation, velocity) implicitly encodes critical collaboration constraints such as stability and coordination requirements. By learning to maintain desired object states, the robot inherently learns to collaborate effectively.

To achieve this, COLA employs a three-step training framework within a closed-loop environment that models the dynamic interactions between the humanoid, the object, and the human. This allows the robot to implicitly predict object motion patterns and human intentions, leading to compliant collaboration and load balance through coordinated trajectory planning. The ultimate goal is a proprioception-only policy for real-world deployment, reducing reliance on external sensors.
4.2. Core Methodology In-depth (Layer by Layer)
The COLA methodology is structured into three distinct learning steps: Whole-body controller training, Residual teacher policy training for collaboration, and Student policy distillation.
4.2.1. Task Definition
The task is defined as a humanoid assisting a human partner to transport an object that is challenging for a single person. The robot's objectives are:
- Coordinate Movement: Align its velocity with the human's velocity.
- Support Weight: Reduce the human's physical burden by supporting the object's weight.
- Stabilize Orientation: Maintain the object's orientation throughout transportation.
4.2.2. Step 1: Whole-body Control (WBC) Policy Training
In the first step, a foundational Whole-Body Control (WBC) policy is trained in a simulator without specific collaboration constraints. This policy is responsible for the robot's basic motor skills, locomotion, and manipulation.
- Goal Command ($\mathcal{G}$): The WBC policy receives a combined goal command $\mathcal{G}$, which includes both lower-body locomotion and upper-body end-effector commands.
  - Lower-body locomotion goal command: specifies the desired linear velocity $v^{\mathrm{lin}}_{t}$, angular velocity $v^{\mathrm{ang}}_{t}$, and root height $h^{\mathrm{root}}_{t}$ for the robot's base:

    $ \mathcal{G}^{\mathrm{lower}}_{t} \triangleq \left[ v^{\mathrm{lin}}_{t}, v^{\mathrm{ang}}_{t}, h^{\mathrm{root}}_{t} \right] $
  - Upper-body end-effector goal command: specifies the target position $p^{\mathrm{ee}}$ and orientation $r^{\mathrm{ee}}$ for the robot's end-effectors (e.g., hands):

    $ \mathcal{G}^{\mathrm{upper}} = \left[ p^{\mathrm{ee}}, r^{\mathrm{ee}} \right] $
  - The combined goal command is $ \mathcal{G} = [ \mathcal{G}^{\mathrm{lower}}, \mathcal{G}^{\mathrm{upper}} ] $.
- Observation Space ($\mathcal{O}^{\mathrm{wbc}}_{t}$): The WBC policy takes as input a history of the robot's proprioceptive observations, which includes:
  - Joint positions $q^{\mathrm{pos}}_{t-l:t}$: the positions of the robot's joints (excluding fingers) over a history of length $l$.
  - Joint velocities $q^{\mathrm{vel}}_{t-l:t}$: the velocities of the robot's joints over a history of length $l$.
  - Robot root orientation $\omega^{\mathrm{root}}_{t-l:t}$: the orientation of the robot's base in quaternion form over a history of length $l$.
  - Gravity vector $g_{t-l:t}$: the gravity vector expressed in the robot's root frame over a history of length $l$.
  - Previous actions $a^{\mathrm{prev}}_{t-(l+1):t-1}$: the actions taken by the robot in the preceding time steps.

  $ \mathcal{O}^{\mathrm{wbc}}_{t} \triangleq \left[ q^{\mathrm{pos}}_{t-l:t}, q^{\mathrm{vel}}_{t-l:t}, \omega^{\mathrm{root}}_{t-l:t}, g_{t-l:t}, a^{\mathrm{prev}}_{t-(l+1):t-1} \right] $

  where $l$ is the length of the history.
- Action Space ($\mathcal{A}^{\mathrm{wbc}}$): The action space for the WBC policy represents the target joint positions for the robot's joints. PD position control is used for actuation, meaning the robot's motors try to reach these target positions.
- Policy Function ($\mathcal{F}^{\mathrm{wbc}}$): The WBC policy is formally defined as a function that maps the goal command and proprioceptive observations to the action:

  $ \mathcal{F}^{\mathrm{wbc}} : \mathcal{G} \times \mathcal{O}^{\mathrm{wbc}} \to \mathcal{A}^{\mathrm{wbc}}, \quad \mathcal{A}^{\mathrm{wbc}} \in \mathbb{R}^{N} $

  where $\mathbb{R}^{N}$ denotes an $N$-dimensional real vector, representing the target positions for the $N$ joints.
- Training Details: The WBC policy is trained using Proximal Policy Optimization (PPO) with rewards following prior works [21, 29]. To improve robustness under payloads, external forces are applied to the humanoid's end-effectors during training, enhancing its force-adaptive capabilities.
4.2.3. Step 2: Residual Teacher Policy Training
In the second step, a residual teacher policy is trained on top of the pre-trained WBC policy within a closed-loop environment. This environment explicitly models the dynamic interaction between the human, object, and humanoid. The teacher policy has access to privileged information to accurately model object dynamics.

- Closed-Loop Training Environment: As illustrated in Figure 3, the environment includes the humanoid, a supporting base body (simulating the human carrier), and the carried object. The object is connected to the support body via a 6-DoF joint. The object is placed in the robot's hand, and the hand joints are fixed in a predefined grasp pose.

Figure 3 (schematic of the closed-loop training environment): the green arrow represents the goal velocity of the carried object, while the red arrow indicates its current velocity; the right side shows the corresponding dynamics of the humanoid interacting with the object.

- Privileged Information ($\mathcal{O}^{\mathrm{priv}}_{t}$): The teacher policy is granted access to privileged information about the carried object, which includes:
  - Linear velocity $\widetilde{v}^{\mathrm{lin}}_{t-l:t}$: the object's ground-truth linear velocity history.
  - Angular velocity $\widetilde{v}^{\mathrm{ang}}_{t-l:t}$: the object's ground-truth angular velocity history.
  - Position $\widetilde{p}_{t-l:t}$: the object's ground-truth position history.
  - Orientation $\widetilde{r}_{t-l:t}$: the object's ground-truth orientation history.

  $ \mathcal{O}^{\mathrm{priv}}_{t} \triangleq \left[ \widetilde{v}^{\mathrm{lin}}_{t-l:t}, \widetilde{v}^{\mathrm{ang}}_{t-l:t}, \widetilde{p}_{t-l:t}, \widetilde{r}_{t-l:t} \right] $

  with a history of length $l$.
- Teacher Observation Space ($\mathcal{O}^{\mathrm{teacher}}_{t}$): The teacher policy receives both the robot's proprioceptive observations ($\mathcal{O}^{\mathrm{wbc}}_{t}$) and the privileged information ($\mathcal{O}^{\mathrm{priv}}_{t}$):

  $ \mathcal{O}^{\mathrm{teacher}}_{t} \triangleq [\mathcal{O}^{\mathrm{wbc}}_{t}, \mathcal{O}^{\mathrm{priv}}_{t}] $
- Residual Action ($\mathcal{A}^{\mathrm{teacher}}$): The teacher policy does not directly output the full action. Instead, it outputs a residual action, a corrective adjustment to the WBC policy's output. The final collaborative action $\mathcal{A}^{\mathrm{collab}}$ is the sum of the WBC action $\mathcal{A}^{\mathrm{wbc}}$ and the residual action:

  $ \mathcal{A}^{\mathrm{collab}} = \mathcal{A}^{\mathrm{wbc}} + \mathcal{A}^{\mathrm{teacher}} $
- Policy Function ($\mathcal{F}^{\mathrm{teacher}}$): The teacher policy is defined as:

  $ \mathcal{F}^{\mathrm{teacher}} : [ \mathcal{O}^{\mathrm{wbc}}, \mathcal{O}^{\mathrm{priv}} ] \to \mathcal{A}^{\mathrm{teacher}}, \quad \mathcal{A}^{\mathrm{teacher}} \in \mathbb{R}^{N} $
- Reward Function: The teacher's learning is guided by a composite reward function that combines base whole-body control rewards (from the WBC training) with task-specific rewards for collaboration. These rewards are detailed in Table I and are crucial for learning compliant and coordinated carrying. The following are the results from Table I of the original paper:

  | Term | Expression | Weight |
  | --- | --- | --- |
  | Linear Vel. Tracking | $\varphi\big(v^{\mathrm{applied,lin}}, v^{\mathrm{robot,lin}}\big)$ | 1.0 |
  | Yaw Vel. Tracking | $\varphi\big(v^{\mathrm{ang,goal}}, v^{\mathrm{ang,robot}}\big)$ | 1.0 |
  | Z-axis Vel. Penalty | $-k_{v}\,\lvert v^{z}_{\mathrm{obj}} \rvert$ | 0.05 |
  | Height Diff. Penalty | $-k_{h}\,\lvert h^{\mathrm{human}}_{\mathrm{obj}} - h^{\mathrm{robot}}_{\mathrm{obj}} \rvert$ | 10.0 |
  | Force Penalty | $-\lVert F_{\mathrm{support\text{-}obj}} \rVert$ | 0.002 |

  The reward terms are:
  - Linear Vel. Tracking: rewards tracking the linear velocity of the carried object. The expression $\varphi(\cdot, \cdot)$ is a Gaussian-like function that gives a higher reward for smaller errors.
    - $v^{\mathrm{applied,lin}}$: the linear velocity applied to the center of mass of the object.
    - $v^{\mathrm{robot,lin}}$: the robot's linear velocity.
  - Yaw Vel. Tracking: rewards tracking the angular (yaw) velocity of the carried object.
    - $v^{\mathrm{ang,goal}}$: the goal angular velocity.
  - Z-axis Vel. Penalty: penalizes vertical velocity of the object to maintain stability.
    - $k_{v}$: a constant.
    - $v^{z}_{\mathrm{obj}}$: the object's velocity along the Z-axis.
  - Height Diff. Penalty: penalizes differences in height between the object ends held by the human and the humanoid.
    - $h^{\mathrm{human}}_{\mathrm{obj}}, h^{\mathrm{robot}}_{\mathrm{obj}}$: the heights of the object's two ends.
    - $k_{h}$: a constant.
  - Force Penalty: penalizes the horizontal force between the support body and the object, aiming to minimize the human's effort.
    - $F_{\mathrm{support\text{-}obj}}$: the horizontal force between the support body and the object.
- Goal Command Modification: During this step, the goal command of the model is modified based on the settings described in Section IV (Implementation Details), specifically for collaborative carrying. The velocity $v^{\mathrm{applied}}$ is applied to the supporting base body at the end of the object opposite the robot-held end, with its magnitude sampled from a range. Angular velocity control uses a PD controller to apply torque to the support body, and height control samples a target height for the support body and applies a PD-controlled force (a minimal sketch of this support-body control and the residual action composition follows below).
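A minimal sketch of one closed-loop training step as described above: PD-style forces drive the simulated human carrier (the support body) toward its sampled velocity and height targets, while the humanoid executes the WBC action plus the teacher's residual. The `sim` interface, gains, and dictionary keys are hypothetical placeholders, not the paper's API.

```python
import numpy as np

def support_body_forces(state, target, kp_ang=30.0, kp_h=400.0, kd_h=40.0):
    """PD-style control of the simulated human carrier (the 'support body')."""
    # Torque proportional to the yaw-velocity tracking error.
    yaw_torque = kp_ang * (target["ang_vel"] - state["ang_vel"])
    # PD force pulling the support body toward the sampled target height.
    lift_force = kp_h * (target["height"] - state["height"]) - kd_h * state["height_vel"]
    return yaw_torque, lift_force

def closed_loop_step(sim, wbc_policy, teacher_policy, wbc_obs, priv_obs, v_applied, target):
    """One training step: human-side inputs plus the humanoid's collaborative action."""
    yaw_torque, lift_force = support_body_forces(sim.support_state(), target)
    sim.apply_to_support_body(linear_velocity=v_applied,
                              torque=yaw_torque,
                              vertical_force=lift_force)
    # A_collab = A_wbc + A_teacher (residual correction from privileged observations).
    a_collab = wbc_policy(wbc_obs) + teacher_policy(np.concatenate([wbc_obs, priv_obs]))
    return sim.step(a_collab)   # hypothetical simulator call returning next obs and rewards
```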
4.2.4. Step 3: Knowledge Distillation (Student Policy Training)
In the final step, the expertise learned by the combined WBC and residual teacher policy (which produces $\mathcal{A}^{\mathrm{collab}}$) is distilled into a student policy $\mathcal{F}^{\mathrm{student}}$. This student policy is designed for real-world deployment and operates solely on proprioceptive observations ($\mathcal{O}^{\mathrm{wbc}}$), without access to privileged information.

- Student Observation Space: The student policy only receives the proprioceptive observations $\mathcal{O}^{\mathrm{wbc}}$.
- Student Action Space: The student policy outputs its action $\mathcal{A}^{\mathrm{student}}$, which is also a vector of target joint positions.
- Policy Function ($\mathcal{F}^{\mathrm{student}}$): The student policy is defined as:

  $ \mathcal{F}^{\mathrm{student}} : \mathcal{O}^{\mathrm{wbc}} \to \mathcal{A}^{\mathrm{student}}, \quad \text{where } \mathcal{A}^{\mathrm{student}} \in \mathbb{R}^{N} $
- Distillation Method: Behavioral cloning is used to distill the teacher policy into the student policy. The student is trained to mimic the teacher's behavior by minimizing the mean squared error between their outputs during interactions with the environment (a training sketch follows at the end of this subsection). The loss function for distillation is:

  $ \mathcal{L}_{\mathrm{distill}} = \mathbb{E} \left[ \| \mathcal{A}^{\mathrm{student}} - \mathcal{A}^{\mathrm{collab}} \|^{2} \right] $

  where:
  - $\mathcal{A}^{\mathrm{student}}$: the action output by the student policy.
  - $\mathcal{A}^{\mathrm{collab}}$: the collaborative action (the teacher's residual added to the WBC action, i.e., $\mathcal{A}^{\mathrm{wbc}} + \mathcal{A}^{\mathrm{teacher}}$).
  - $\mathbb{E}$: the expected value.
  - $\| \cdot \|^{2}$: the squared Euclidean norm, representing the squared difference between the student's action and the teacher's action.
- Role Allocation (COLA-F and COLA-L): The paper defines two experimental settings based on the goal command observation:
  - COLA-F (Follower): All networks receive a goal command input of zero, so the robot primarily follows the human's implicit cues.
  - COLA-L (Leader): The policy is provided with a sampled goal command (within the range used for WBC), allowing the robot to actively lead or pursue a specific trajectory while collaborating. Role allocation is effectively controlled via the velocity command, where zero velocity implies a follower role.

The overall training pipeline is illustrated in Figure 2 (from the original paper).

Figure 2 (schematic of the overall training pipeline): Step 1 trains the whole-body controller from goal commands and proprioceptive information; Step 2 trains the residual teacher for collaboration; Step 3 distills the student policy via behavioral cloning. The figure also shows real-world scenes of a human and the humanoid carrying objects together, illustrating the application setting.

The diagram shows the three steps: whole-body controller training, residual teacher policy training, and student policy distillation using behavioral cloning. The teacher uses privileged information and proprioception to output a residual action that adjusts the WBC action, forming the collaborative action. The student learns from the collaborative action using only proprioception for real-world deployment.
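A minimal PyTorch-style sketch of this distillation step, combining the (512, 256, 128) MLP sizes reported in the implementation details with the behavioral-cloning loss $\mathcal{L}_{\mathrm{distill}}$. The activation function, input/output dimensions, optimizer, and data handling are assumptions for illustration; only the hidden sizes and the MSE objective come from the paper.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=(512, 256, 128)):
    """MLP with the hidden sizes reported in the implementation details."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ELU()]   # ELU is an assumed activation choice
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def distill_step(student, teacher_action, proprio_obs, optimizer):
    """One behavioral-cloning update: match the teacher's collaborative action."""
    pred = student(proprio_obs)                    # student sees proprioception only
    loss = ((pred - teacher_action) ** 2).mean()   # L_distill = E[||A_student - A_collab||^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (dimensions are illustrative, not taken from the paper):
# student = make_mlp(in_dim=1050, out_dim=29)
# optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)
```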
4.2.5. Implementation Details
- Training Setup:
  - Platform: Isaac Lab simulator.
  - Hardware: a single RTX 4090D GPU.
  - Algorithm: PPO with 4096 parallel environments.
  - Network Architecture:
    - WBC actor and critic networks: three-layer Multi-Layer Perceptrons (MLPs) of size (512, 256, 128).
    - Residual teacher and student policy networks: two additional MLPs with the same dimensions (512, 256, 128) stacked on top of the WBC network.
  - Training Steps:
    - WBC: 350k environment steps (approx. 15k PPO updates).
    - Residual Teacher: 250k environment steps (approx. 10k PPO updates).
    - Distillation: 250k environment steps (approx. 10k PPO updates).
  - Total Training Time: 48 hours.
- Observation Space Details (Command Sampling):
  - Whole-body control commands are sampled from predefined ranges (a sampling sketch follows below).
  - End-effector goal command: represents the 6-DoF target pose (position and orientation) of the robot's wrist.
    - Since the task focuses on collaborative carrying rather than complex upper-body manipulation, large-range upper-body motions are not sampled.
    - End-effector positions are randomly sampled within a small cubic region near the default grasping pose.
    - End-effector orientations are sampled within a conical region around the nominal grasp orientation using Spherical Linear Interpolation (SLERP).
  - The WBC achieves low tracking errors for both the end-effector goal position and the end-effector goal orientation.
  - The carried object and support body are connected via a 6-DoF joint. Friction, damping, and joint limits ensure that support body movements are faithfully transmitted to the object. The following are the results from Table II of the original paper:

  | Term | Range |
  | --- | --- |
  | Base Lin. Vel. X (m/s) | (−0.8, 1.2) |
  | Base Lin. Vel. Y (m/s) | (−0.5, 0.5) |
  | Base Ang. Vel. (rad/s) | (−1.2, 1.2) |
  | Base Height (m) | (0.45, 0.9) |
  | End-effector Position (m) | 0.15 |
  | End-effector Orientation (rad) | π/6 |
  | Support Object Lin. Vel. (m/s) | (−0.6, 1.0) |
  | Support Object Ang. Vel. (rad/s) | (−0.8, 0.8) |
  | Support Object Height (m) | (0.5, 0.85) |

  *Note: End-effector Position denotes the side length of the cube from which the goal position is sampled; End-effector Orientation denotes the half-angle of the cone that defines the sampling range of orientation goals.*

  - Base Lin. Vel. X (m/s): linear velocity along the robot's forward/backward axis.
  - Base Lin. Vel. Y (m/s): linear velocity along the robot's sideways axis.
  - Base Ang. Vel. (rad/s): angular velocity around the robot's vertical (yaw) axis.
  - Base Height (m): desired height of the robot's base.
  - End-effector Position (m): the side length of the cubic region from which the end-effector goal position is sampled.
  - End-effector Orientation (rad): the half-angle of the cone defining the sampling range for end-effector orientation goals.
  - Support Object Lin. Vel. (m/s): linear velocity applied to the simulated human's side of the object.
  - Support Object Ang. Vel. (rad/s): angular velocity applied to the simulated human's side of the object.
  - Support Object Height (m): height of the simulated human's side of the object.
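A small sketch of command sampling in the spirit of Table II. The range values mirror the reconstructed table above and should be treated as illustrative; the cone-based orientation sampling is a simple stand-in for the SLERP-based scheme the paper describes.

```python
import numpy as np

# Ranges in the spirit of Table II (treat exact values as illustrative).
COMMAND_RANGES = {
    "base_lin_vel_x": (-0.8, 1.2),
    "base_lin_vel_y": (-0.5, 0.5),
    "base_ang_vel":   (-1.2, 1.2),
    "base_height":    (0.45, 0.9),
}

def sample_goal_command(rng, ee_cube_side=0.15, ee_cone_half_angle=np.pi / 6):
    cmd = {k: rng.uniform(lo, hi) for k, (lo, hi) in COMMAND_RANGES.items()}
    # End-effector position: offset inside a small cube around the default grasp pose.
    cmd["ee_pos_offset"] = rng.uniform(-ee_cube_side / 2, ee_cube_side / 2, size=3)
    # End-effector orientation: random axis with a tilt bounded by the cone half-angle
    # (a simple stand-in for the SLERP-based sampling described above).
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    cmd["ee_rot_vec"] = axis * rng.uniform(0.0, ee_cone_half_angle)
    return cmd

# Usage: rng = np.random.default_rng(0); print(sample_goal_command(rng))
```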
5. Experimental Setup
5.1. Datasets
The paper does not use traditional datasets in the supervised learning sense. Instead, it relies on a closed-loop training environment in a simulator (Isaac Lab) to generate continuous interaction data for reinforcement learning.
5.1.1. Closed-Loop Training Environment
The simulation environment is dynamically constructed to model the interactions:
- Components: Humanoid robot (G1 model with 29 joints, excluding fingers), a supporting base body (simulating the human carrier), and a carried object.
- Interaction Model: The object is connected to the support body via a 6-DoF joint. The object is placed in the robot's hand, and the hand joints are fixed in a predefined grasp pose.
- Dynamic Inputs:
  - A goal command $\mathcal{G}$ is randomly sampled (ranges defined in Table II) to guide the humanoid's movement.
  - A velocity $v^{\mathrm{applied}}$ is sampled and applied to the supporting base body (representing the human's side) at the object's opposite end. This applied velocity is updated at twice the frequency of the goal command to simulate dynamic human input.
  - For angular velocity control, a target angular velocity is set, and a PD controller applies torque to the support body.
  - For height control, a target height for the support body is randomly sampled, and a PD-controlled force adjusts its height. The robot is not required to maintain a fixed height, allowing for adaptive responses.

This dynamic, interactive simulation setup serves as the "data generation" mechanism, allowing the reinforcement learning agent to learn from continuous interaction rather than a static dataset.
5.1.2. Objects and Terrains
- Simulation: The paper implicitly states that various objects and terrains are used during simulation to test effectiveness, generalization, and robustness. The figures show diverse objects such as a rod, box, stretcher, and cart.
- Real-World: Real-world experiments use boxes, desks, and stretchers as carried objects, and movement patterns include straight-line walking, turning, and slope climbing.
5.2. Evaluation Metrics
The paper uses several quantitative metrics to evaluate performance, categorized into trajectory following, height tracking, and coordination/effort reduction.
5.2.1. Linear Velocity Tracking Error (Lin. Vel.)
- Conceptual Definition: This metric quantifies how accurately the robot's linear velocity matches the human's (or the desired object's) linear velocity during the collaborative carrying task. A lower value indicates better coordination in terms of forward/backward and sideways movement.
- Mathematical Formula: $ \text{Lin. Vel. Error} = \frac{1}{T} \sum_{t=1}^{T} | v_{\mathrm{robot}, t}^{\mathrm{lin}} - v_{\mathrm{human}, t}^{\mathrm{lin}} | $
- Symbol Explanation:
  - $T$: total number of time steps (duration of the episode).
  - $v_{\mathrm{robot}, t}^{\mathrm{lin}}$: the linear velocity of the robot at time step $t$.
  - $v_{\mathrm{human}, t}^{\mathrm{lin}}$: the linear velocity of the human (or the desired linear velocity of the object) at time step $t$.
  - $\| \cdot \|$: Euclidean norm, representing the magnitude of the difference.
5.2.2. Angular Velocity Tracking Error (Ang. Vel.)
- Conceptual Definition: This metric measures how well the robot's angular velocity (rotational movement, specifically yaw) aligns with the human's (or desired object's) angular velocity. A lower value signifies better rotational coordination.
- Mathematical Formula: $ \text{Ang. Vel. Error} = \frac{1}{T} \sum_{t=1}^{T} | \omega_{\mathrm{robot}, t} - \omega_{\mathrm{human}, t} | $
- Symbol Explanation:
  - $T$: total number of time steps.
  - $\omega_{\mathrm{robot}, t}$: the angular velocity of the robot at time step $t$.
  - $\omega_{\mathrm{human}, t}$: the angular velocity of the human (or the desired angular velocity of the object) at time step $t$.
  - $\| \cdot \|$: Euclidean norm.
5.2.3. Height Error (Height Err.)
- Conceptual Definition: This metric assesses the stability of height coordination during carrying. It measures the difference in vertical height between the object end held by the human and the object end held by the humanoid, indicating how level the object is maintained. A lower error implies greater object stability and better load balance.
- Mathematical Formula: $ \text{Height Err.} = \frac{1}{T} \sum_{t=1}^{T} | h_{\mathrm{human-end}, t} - h_{\mathrm{robot-end}, t} | $
- Symbol Explanation:
  - $T$: total number of time steps.
  - $h_{\mathrm{human\text{-}end}, t}$: the height of the object end held by the human at time step $t$.
  - $h_{\mathrm{robot\text{-}end}, t}$: the height of the object end held by the humanoid at time step $t$.
  - $| \cdot |$: absolute difference.
5.2.4. Average External Force (Avg. E.F.)
- Conceptual Definition: This metric quantifies the average horizontal interaction force between the human (or the simulated
support body) and the object. It directly reflects the physical effort required from the human to move the carried object along the intended direction. A lower force indicates that the robot is contributing more effectively to the carrying task, thereby reducing the human's burden and demonstrating stronger compliance. - Mathematical Formula: (Based on the paper's description, this would be the magnitude of the force applied by the human's simulated side to the object.) $ \text{Avg. E.F.} = \frac{1}{T} \sum_{t=1}^{T} | F_{\mathrm{human-obj}, t}^{\mathrm{horizontal}} | $
- Symbol Explanation:
  - $T$: total number of time steps.
  - $F_{\mathrm{human\text{-}obj}, t}^{\mathrm{horizontal}}$: the horizontal force exerted by the human (or supporting base body) on the object at time step $t$.
  - $\| \cdot \|$: Euclidean norm, representing the magnitude of the horizontal force vector.
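The four metrics above can be computed directly from logged trajectories. The sketch below mirrors the formulas, assuming per-time-step arrays (linear velocities and horizontal forces as 2D/3D vectors, yaw rates and heights as scalars); the function and argument names are illustrative.

```python
import numpy as np

def evaluation_metrics(v_robot, v_human, w_robot, w_human,
                       h_human_end, h_robot_end, f_horizontal):
    """Compute the four evaluation metrics from logged trajectories (one row per time step)."""
    lin_vel_err = np.mean(np.linalg.norm(v_robot - v_human, axis=1))
    ang_vel_err = np.mean(np.abs(w_robot - w_human))
    height_err  = np.mean(np.abs(h_human_end - h_robot_end))
    avg_ext_force = np.mean(np.linalg.norm(f_horizontal, axis=1))
    return {"Lin. Vel.": lin_vel_err, "Ang. Vel.": ang_vel_err,
            "Height Err.": height_err, "Avg. E.F.": avg_ext_force}
```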
5.2.5. Human User Study Metrics
For human user studies, participants rated compliance and height tracking on a scale of 1 to 5, where a higher score indicates better performance.
- Height Tracking (User Study): Qualitative assessment of how well the object's height is maintained and coordinated.
- Smoothness (User Study): Qualitative assessment of the fluidity and naturalness of the collaboration.
5.3. Baselines
The paper compares COLA against several baseline models to demonstrate its effectiveness and justify architectural choices.
5.3.1. Vanilla MLP
- Description: This baseline trains a simple Multi-Layer Perceptron (MLP) policy directly from scratch, initialized with the weights of the Whole-Body Controller (WBC). It is trained end-to-end with PPO to perform the collaborative carrying task.
- Purpose: To evaluate the benefit of the residual learning and teacher-student distillation framework compared to a direct, monolithic RL approach.
5.3.2. Explicit Goal Estimation
- Description: This baseline replaces the whole-body control command with a predicted one and removes the residual component from the teacher policy. The resulting policy is then distilled into a student policy. In other words, the model explicitly tries to predict the next WBC command from observations, rather than learning a residual adjustment.
- Purpose: To investigate whether explicitly predicting WBC commands is as effective as learning residual adjustments for collaborative tasks, especially in dynamic interaction scenarios. It tests the hypothesis that collaboration requires implicitly learning dynamic interactions rather than just predicting goal commands.
5.3.3. Transformer
- Description: This baseline replaces the student policy's original MLP-based architecture with a Transformer network. Transformers are known for their ability to process sequential data and capture long-range dependencies.
- Purpose: To evaluate the architectural choice of MLP versus Transformer for the student policy. It assesses whether the Transformer's temporal processing capabilities are beneficial or detrimental for real-time, compliant human-humanoid collaboration, where prompt adaptation to dynamic human movements may matter more than long-term memory.
5.3.4. Locomotion (Implicit Baseline)
Although not explicitly listed as a baseline for the quantitative comparison table, a "Locomotion" policy is mentioned in the Human User Study results (Table IV) and Compliance to External Forces analysis (Figure 5b). This likely refers to a basic WBC policy trained for locomotion without specific collaborative carrying capabilities, serving as a very naive baseline for interaction.
6. Results & Analysis
6.1. Core Results Analysis
The experiments evaluate COLA against baselines in both simulation and real-world scenarios, focusing on trajectory following, height tracking, human effort reduction, and compliance.
6.1.1. Simulation Results: Effectiveness and Compliance
The simulation results, presented in Table III, compare COLA (in both COLA-F and COLA-L settings with varying history lengths) against the Explicit Goal Estimation and Transformer baselines.
The following are the results from Table III of the original paper:
| Methods | Lin. Vel. (m/s) ↓ | Ang. Vel. (rad/s) ↓ | Height Err. (m) ↓ | Avg. E.F. (N) ↓ |
| --- | --- | --- | --- | --- |
| Explicit Goal Estimation | 0.235 | 0.335 | 0.102 | 19.067 |
| Transformer | 0.178 | 0.310 | 0.077 | 19.382 |
| COLA-F-History10 | 0.121 | 0.131 | 0.037 | 15.435 |
| COLA-F-History50 | 0.116 | 0.132 | 0.036 | 14.574 |
| COLA-F | 0.109 | 0.118 | 0.031 | 14.576 |
| COLA-L-History10 | 0.118 | 0.106 | 0.039 | 13.924 |
| COLA-L-History50 | 0.112 | 0.103 | 0.036 | 13.495 |
| COLA-L | 0.102 | 0.098 | 0.038 | 12.298 |
- Superior Trajectory Tracking: COLA consistently outperforms baselines on both Linear Velocity (Lin. Vel.) and Angular Velocity (Ang. Vel.) tracking errors. COLA-L achieves the lowest Lin. Vel. error of 0.102 m/s and Ang. Vel. error of 0.098 rad/s, demonstrating precise coordination with human movements.
- Better Height Stability: COLA also shows significantly lower Height Error (Height Err.), with COLA-F achieving 0.031 m and COLA-L remaining competitive at 0.038 m. This indicates superior object stability during carrying.
- Reduced Human Effort (Compliance): The Avg. E.F. metric shows that COLA drastically reduces the average external force required from the human.
  - COLA-L achieves the lowest Avg. E.F. of 12.298 N.
  - Compared to the best baseline (Explicit Goal Estimation at 19.067 N), COLA-L reduces human effort by roughly 35% relative to that baseline. The abstract mentions a 24.7% reduction compared to baseline approaches, and the "Overall" summary mentions a 31.47% reduction; these values vary depending on the specific baseline and calculation context, but all indicate a significant reduction.
  - The lower Avg. E.F. reflects COLA's stronger compliance and active participation in the carrying task.
6.1.2. Comparison with Baselines
- Explicit Goal Estimation: This baseline performs the poorest across all metrics, highlighting that collaborative carrying is more complex than just predicting whole-body control commands; the dynamic interactions require implicit learning within a closed-loop environment.
- Transformer: While better than Explicit Goal Estimation, the Transformer baseline is significantly outperformed by COLA. This suggests that the Transformer's temporal processing may introduce unnecessary complexity for prompt adaptation to human movements, which is critical for real-time collaboration.
- Vanilla MLP: The paper discusses Vanilla MLP in the text (but not in Table III). It achieves relatively high performance among baselines but struggles with Ang. Vel. and Height Err., indicating difficulty in inferring angular and vertical movements compared to linear ones. This further supports the need for the teacher-student distillation framework to learn complex interaction patterns.
6.1.3. COLA-L vs. COLA-F
COLA-L consistently outperforms COLA-F (lower Lin. Vel., Ang. Vel., and Avg. E.F.). This is attributed to the goal command provided to COLA-L, which helps the policy learn to collaborate more actively and precisely. The goal command provides informative cues, especially in the presence of noise and disturbances.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Architecture Choice (MLP vs. Transformer)
The comparison in Table III shows COLA (MLP-based student policy) outperforms the Transformer baseline. The Transformer also required twice the training steps to converge. This indicates that a compact MLP-based model is more effective. The authors hypothesize that the Transformer's long-term temporal processing introduces unnecessary complexity. For collaboration, the robot needs to adapt promptly to current human movements, and relying on outdated information (which a Transformer might over-process) could lead to hesitation and degraded cooperation.
6.2.2. History Length
The ablation study on history length for COLA-F and COLA-L (Table III shows History10 and History50 variants) reveals:
- A shorter history (History10 vs. the default History25 for COLA-F/L) provides insufficient information for implicit collaboration learning.
- Increasing the history length to 50 (History50) yields little improvement over History25.
- The authors chose 25 as a balance between performance and learning efficiency. This suggests that the task is not highly sensitive to long-term joint position changes, corroborating the finding that the MLP-based architecture is preferable to a Transformer that focuses on long sequences.
6.3. Real-World Experiments: Practical Value and Compliance
6.3.1. Qualitative Results
Figure 4 showcases the qualitative effectiveness of COLA in real-world scenarios.
Figure 4 (schematic images): a variety of human-humanoid collaborative carrying scenarios, including stretcher carrying, rod height tracking, box carrying on a slope, and cart operation. These scenes highlight the collaborative capability of the robot and the human in dynamic environments.
The images demonstrate successful collaborative carrying of diverse objects (rod, box, stretcher, cart) across various grasping poses and even on sloped terrains. This highlights the versatility and generalizability of the method. The model implicitly learns to interpret human intentions through force-based interaction (pushing/pulling), allowing the humanoid to infer desired movements and execute them autonomously.
6.3.2. Quantitative Real-World Metrics
Figure 5 presents quantitative results for compliance and height tracking in real-world settings.
Figure 5 (charts): quantitative results on the effectiveness of collaborative carrying. Panel (a) shows the robot's base velocity under different external forces, (b) shows the robot's height over time steps, and (c) and (d) compare the minimal real-world force required to move the robot and the height difference, respectively.
- Compliance to External Forces (Simulation - Figures 5a & 5b):
  - Figure 5a shows the robot's velocity response to an external force applied to its palm. COLA initiates movement when the force exceeds 15 N, while the baseline model remains stationary; forces below this threshold are interpreted as stabilization cues. This indicates COLA's ability to discern between stabilizing the object and initiating movement based on a force threshold.
  - Figure 5b illustrates the height response to vertical external forces on the end-effector. The Locomotion policy maintains a constant height, and Vanilla MLP squats to a fixed height, passively supporting the force. In contrast, both COLA settings (COLA-F and COLA-L) actively comply with vertical disturbances, demonstrating agile full-body motions.
- Minimal Force to Move Robot (Real-World - Figure 5c): Figure 5c compares the minimal force required to move the robot in the real world. COLA demonstrates stronger compliance by requiring less force to initiate movement than the baseline, directly reflecting a reduction in human effort.
- Height Difference (Real-World - Figure 5d): Figure 5d shows the height difference between the human-held end and the humanoid-held end of the object in real-world experiments. COLA reduces this height-tracking error by approximately three-quarters compared to the baseline, confirming stable object pose tracking in real-world collaborative tasks.
6.3.3. Human User Studies
A study with 23 participants rated COLA's performance on Height Tracking and Smoothness on a scale of 1 to 5.
The following are the results from Table IV of the original paper:
| Methods | Height Tracking ↑ | Smoothness ↑ |
| --- | --- | --- |
| Locomotion | 2.96 | 2.61 |
| Vanilla MLP | 3.09 | 3.09 |
| COLA | 3.96 | 3.96 |
- Superior User Ratings: COLA achieves the highest scores in both Height Tracking (3.96) and Smoothness (3.96), compared to the Locomotion and Vanilla MLP baselines. This confirms COLA's effectiveness and provides quantitative evidence of improved user experience in real-world collaborative scenarios. The abstract states an average improvement of 27.4% compared to baseline models, which is consistent with these higher scores.
6.3.4. Implicit Force Estimation from Joint States
Figure 6 illustrates how the humanoid's behavior is sensitive to forces applied at specific joints.
Figure 6 (schematic): two different collaborative behaviors. On the left, the robot maintains a stable posture when an external force is applied to its torso; on the right, it follows a smaller external force applied to its end-effector.
When forces are applied to the hand or arm during carrying, the humanoid tends to follow. Conversely, forces applied to the torso or legs result in the humanoid maintaining a stable stance. This observation suggests that COLA effectively learns interaction dynamics by interpreting offsets between joint states and their targets as cues for interaction forces and human intentions, without explicit force sensors.
6.4. Advantages and Disadvantages
Advantages:
- Reduced Human Effort: Quantitatively proven in both simulation and real-world experiments.
- High Compliance: The robot responds adaptively to human guidance and external forces, leading to intuitive interaction.
- Precise Coordination: Achieves low tracking errors for linear and angular velocities, and maintains object height stability.
- Proprioception-Only: Simplifies real-world deployment by eliminating external sensor requirements.
- Generalizable: Works across diverse objects, terrains, and movement patterns.
- Implicit Intention Learning: Interprets human intentions from physical interaction, removing the need for explicit commands.
- Unified Policy: Integrates leader and follower behaviors within a single policy.
Disadvantages/Observations:
- COLA-L (leader mode) generally outperforms COLA-F (follower mode), suggesting that explicit goal commands can enhance collaboration, even if they are sampled rather than directly provided by a human.
- The MLP-based student policy is found to be more effective than a Transformer, indicating that long-term temporal dependencies may be less critical than prompt adaptation for this specific task.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces COLA, a novel proprioception-only reinforcement learning approach for human-humanoid collaborative object carrying. The core innovation lies in its three-step residual learning framework, which enables the humanoid to function as both a leader and a follower in collaborative tasks. By leveraging a closed-loop training environment that explicitly models humanoid-object interactions, COLA implicitly learns object movements and human intentions from proprioceptive feedback alone. This allows for compliant collaboration, maintaining load balance through coordinated trajectory planning without requiring external sensors or complex interaction models. Extensive simulation and real-world experiments validate COLA's effectiveness, demonstrating significant human effort reduction (up to 31.47% in simulation and 27.4% in user studies), precise trajectory coordination, and robust generalization across various objects and terrains.
7.2. Limitations & Future Work
7.2.1. Limitations
The authors acknowledge the following limitations:
- Proprioception-Only: While a strength for deployment simplicity, relying solely on proprioception might limit the robot's understanding of complex human non-verbal cues or environmental context that visual or tactile sensors could provide.
- Implicit vs. Explicit Planning: The current model implicitly infers intentions and dynamics. This might not be sufficient for more complex scenarios where the humanoid needs to plan autonomously to assist humans, requiring a deeper understanding of the task goals and the human's long-term objectives.
7.2.2. Future Work
Based on these limitations, the authors suggest future research directions:
- Multi-Modal Perception: Exploring the integration of visual and tactile sensors to provide more informative cues for human-humanoid collaboration. This could enhance the robot's perception of the human's state, intentions, and the environment.
- Autonomous Planning: Enabling humanoids to plan autonomously to assist humans. This would involve higher-level reasoning capabilities beyond reactive compliance, allowing the robot to take initiative and proactively contribute to the collaborative task.
7.3. Personal Insights & Critique
7.3.1. Inspirations
The COLA paper offers several inspiring aspects:
- Elegance of the Proprioception-Only Approach: The idea that a robot can achieve complex, compliant collaboration solely through internal sensing is powerful. It highlights how rich information can be extracted from seemingly simple proprioceptive data when combined with sophisticated reinforcement learning and a carefully designed closed-loop training environment. This can significantly reduce hardware complexity and cost for real-world robotic deployments.
- Implicit Learning of Intentions: The ability to implicitly infer human intentions through physical interaction, rather than relying on explicit communication or complex intention recognition modules, is a major step towards more natural and intuitive human-robot interaction. This "learn by doing" approach in simulation provides a robust way for robots to adapt to diverse human behaviors.
- Teacher-Student Framework for Real-World Transfer: The teacher-student distillation framework is an effective strategy for bridging the gap between training in a privileged-information-rich simulation and deploying a robust, proprioception-only policy in the real world. This design pattern is highly transferable to other complex robotic tasks.
7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement
- Scalability to More Complex Human Intentions: While effective for collaborative carrying, the implicit learning of human intentions might have limitations. What if the human's intention is ambiguous, changes rapidly, or involves non-physical cues (e.g., verbal commands, gestures)? The current proprioception-only model might struggle here, reinforcing the authors' suggestion for multi-modal perception.
- Robustness to Diverse Human Biomechanics/Interaction Styles: The human user study involved 23 participants, which is a good start. However, human interaction styles, strengths, and physical characteristics vary widely. How well does the model generalize to individuals with very different interaction forces, gaits, or even disabilities? Further testing with a wider demographic could reveal limitations.
- Long-Term Carrying and Fatigue: The paper focuses on coordination and effort reduction. For very long-duration carrying tasks, human and robot fatigue becomes a factor. Does the robot adapt its compliance or effort contribution over time as the human tires? This could be an interesting area for future reward function design.
- Unexpected Disturbances: While the closed-loop environment models dynamic interactions, real-world environments are full of unexpected disturbances (e.g., uneven ground, sudden nudges from bystanders). How does the proprioception-only model handle these without additional environmental awareness?
- Safety Guarantees: For real-world human-humanoid collaboration, especially involving heavy objects, formal safety guarantees are paramount. While compliance improves safety, a reinforcement learning approach might not offer strict, verifiable safety boundaries. Future work could explore integrating formal methods or safety layers on top of the RL policy.
7.3.3. Transferability to Other Domains
The methodologies and insights from COLA are highly transferable:
- Other Collaborative Manipulation Tasks: The residual learning and teacher-student framework could be applied to other collaborative manipulation tasks where a humanoid assists a human (e.g., assembling large parts, pushing heavy doors, holding tools).
- Teleoperation with Force Feedback: The proprioception-only estimation of interaction forces could be used to provide implicit haptic feedback in teleoperation systems, enhancing the operator's sense of touch without needing physical force sensors on the robot.
- Human-Robot Co-assembly: In manufacturing, humanoids could assist in co-assembly lines, learning to provide compliant support for parts or tools, reducing strain on human workers.
- Rehabilitation Robotics: The compliant interaction capabilities could be valuable in rehabilitation scenarios, where humanoids assist patients with physical therapy exercises, adapting to their strength and movement patterns.

Overall, COLA represents a significant step towards more practical and intuitive human-humanoid collaboration, paving the way for wider adoption of humanoids in various assistive and industrial roles.