Learning Human-Humanoid Coordination for Collaborative Object Carrying
TL;DR Summary
The COLA method enables effective human-humanoid collaboration in complex carrying tasks using proprioception-only reinforcement learning. It implicitly predicts object motion and human intent, achieving a 24.7% reduction in human effort while maintaining stability, validated across various terrains and objects in both simulation and real-world experiments.
Abstract
Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is Learning Human-Humanoid Coordination for Collaborative Object Carrying. The title signifies an approach that uses learning, specifically reinforcement learning, to enable humanoid robots to collaborate effectively with humans in tasks involving carrying objects. The emphasis is on coordination and compliance between the human and the humanoid.
1.2. Authors
The authors are:
- Yushi Du (Equal contribution, corresponding author) - Department of Electrical and Electronic Engineering, The University of Hong Kong; School of Computer Science and Technology, Beijing Institute of Technology
- Yixuan Li (Equal contribution) - School of Computer Science and Technology, Beijing Institute of Technology; Yuanpei College, Peking University
- Baoxiong Jia (Equal contribution, corresponding author) - School of Computer Science and Technology, Beijing Institute of Technology
- Yutang Lin - Yuanpei College, Peking University
- Pei Zhou - Department of Electrical and Electronic Engineering, The University of Hong Kong
- Wei Liang - School of Computer Science and Technology, Beijing Institute of Technology
- Yanchao Yang (Corresponding author) - Department of Electrical and Electronic Engineering, The University of Hong Kong
- Siyuan Huang
The affiliations suggest a collaborative effort between multiple institutions, with researchers from computer science, electrical engineering, and potentially other related fields. The presence of multiple corresponding authors indicates a significant collaborative research project.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. As an arXiv preprint, it has not yet undergone formal peer review, but it is common for research papers to be shared on arXiv before or during the review process for conferences or journals. The publication year is listed as 2025, which implies it's a forthcoming or very recent publication.
1.4. Publication Year
The paper was published at (UTC) 2025-10-16T04:36:25.000Z, indicating a publication year of 2025.
1.5. Abstract
The paper addresses the challenge of human-humanoid collaboration for collaborative object carrying, an area that has seen limited exploration for humanoids due to their complex whole-body dynamics, despite progress in compliant robot-human collaboration for robotic arms. The authors propose COLA, a proprioception-only reinforcement learning approach that integrates leader and follower behaviors into a single policy. The model is trained in a closed-loop environment with dynamic object interactions to implicitly predict object motion patterns and human intentions, facilitating compliant collaboration and load balance through coordinated trajectory planning.
Evaluations, including comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrate the approach's effectiveness, generalization, and robustness across diverse terrains and objects. Simulation results show a 24.7% reduction in human effort compared to baselines, while maintaining object stability. Real-world experiments confirm robust carrying for various object types (e.g., boxes, desks, stretchers) and movement patterns (e.g., straight-line, turning, slope climbing). Human user studies with 23 participants reported an average 27.4% improvement over baseline models. A key advantage is that COLA achieves compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, making it a practical solution for real-world deployment.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.14293
- PDF Link: https://arxiv.org/pdf/2510.14293v1.pdf

The paper is available as a preprint on arXiv.org.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
The core problem the paper addresses is the significant challenge of enabling humanoid robots to collaborate effectively and compliantly with humans, particularly in tasks like collaborative object carrying. While human-robot collaboration for robotic arms has advanced, extending this to humanoids is complex due to their intricate whole-body dynamics.
2.1.2. Importance and Gaps in Prior Research
Human-humanoid collaboration holds immense promise for various applications such as healthcare, domestic assistance, and manufacturing. However, current humanoid advancements in locomotion, teleoperation, and manipulation haven't translated well into effective collaboration. Existing human-humanoid collaboration methods often rely on model-based approaches or heuristic rules, which predefine subtasks or focus on limited-scope interactions like predicting horizontal velocity from haptic cues. These approaches generally neglect whole-body coordination capabilities and lack the ability to perform complex, dynamic collaborative tasks (e.g., picking up objects from the ground, carrying objects on slopes). They also struggle with:
- Adapting to diverse environments (e.g., maintaining object stability on varied terrains).
- Responding compliantly to human motions (e.g., standing up together), often without direct force sensing.
- Dynamically allocating roles (leader/follower) for efficiency.

The interdependency of these requirements makes collaborative carrying a particularly difficult task for humanoids.
2.1.3. Paper's Entry Point or Innovative Idea
The paper's innovative idea is to propose COLA, a proprioception-only reinforcement learning approach that learns human-humanoid coordination for collaborative object carrying. It addresses the limitations of previous work by:
- Unifying Leader and Follower Behaviors: Integrating both roles into a single policy, allowing for flexible role switching.
- Proprioception-Only Learning: Relying solely on the robot's internal proprioceptive feedback (joint positions, velocities, root orientation) for real-world deployment, eliminating the need for external sensors or complex interaction models.
- Implicit Prediction of Human Intentions and Object Dynamics: Training in a closed-loop environment with dynamic object interactions allows the model to implicitly predict object motion patterns and human intentions.
- Leveraging Key Insights:
  - Offsets between joint states and their targets serve as a proxy for estimating interaction forces.
  - The carried object's state encodes implicit collaboration constraints such as stability and coordination.
- Three-Step Training Framework: Utilizing a teacher-student framework where a teacher policy (with privileged information) guides a student policy (purely proprioceptive) for practical deployment.
2.2. Main Contributions / Findings
2.2.1. Primary Contributions
The primary contributions of the paper can be summarized as:
- Unified Residual Model for Whole-Body Collaboration: Proposing COLA, a proprioception-only residual model that enables compliant, coordinated, and generalizable whole-body collaborative carrying across diverse movement patterns.
- Three-Step Training Framework and Closed-Loop Environment: Developing a novel three-step training framework and a closed-loop training environment that explicitly models humanoid-object interactions, allowing the robot to implicitly learn object movements and assist humans through compliant collaboration.
- Demonstrated Effectiveness, Generalization, and Robustness: Validating the proposed policy through extensive simulation and real-world experiments, showing superior effort reduction and trajectory coordination compared to baseline approaches.
- Practical Solution for Real-World Deployment: Demonstrating that the method operates without external sensors or complex interaction models, making it suitable for practical deployment.
2.2.2. Key Conclusions and Findings
The key conclusions and findings include:
- Significant Human Effort Reduction: Simulation experiments show a 24.7% reduction in human effort (31.47% in another mention) compared to baselines, while maintaining object stability. This directly addresses the goal of easing the human partner's burden.
- Precise Coordination and Trajectory Tracking: The method achieves low linear velocity tracking error (0.102 m/s) and angular tracking error (0.098 rad/s) relative to human motion, indicating precise coordination.
- Robustness and Generalization: Real-world experiments validate robust collaborative carrying across diverse object types (e.g., boxes, desks, stretchers) and movement patterns (e.g., straight-line, turning, slope climbing), demonstrating the model's versatility.
- Implicit Intention Learning: The model implicitly learns to interpret human intentions through simple pushing and pulling actions, eliminating the need for explicit commands or remote controls.
- Compliance to External Forces: The model demonstrates compliant behavior, responding appropriately to external forces for movement initiation (e.g., moving when the applied force exceeds 15 N) and to vertical disturbances, showcasing agile full-body motions.
- Positive User Experience: Human user studies with 23 participants confirmed an average 27.4% improvement in compliance and height tracking compared to baseline models, validating its practical effectiveness and user acceptance.
- Effectiveness of Architecture: The residual teacher policy and distillation training are crucial for effective and compliant collaboration, outperforming end-to-end MLP and Transformer baselines. A compact MLP-based student policy is found to be more effective than a Transformer because it adapts more promptly to human movements.

These findings collectively demonstrate that COLA offers a practical and effective solution for enabling compliant human-humanoid collaborative carrying, addressing critical challenges in human-robot interaction.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp several foundational concepts in robotics, machine learning, and control theory.
3.1.1. Humanoid Robots
Humanoid robots are robots designed to resemble the human body, typically with a torso, head, two arms, and two legs. This morphology allows them to operate in human-centric environments and perform tasks requiring human-like mobility and manipulation. Their whole-body dynamics are complex because they are underactuated (have fewer actuators than degrees of freedom in certain movements) and high-dimensional, making stable locomotion and manipulation challenging, especially when interacting with the environment or humans.
3.1.2. Reinforcement Learning (RL)
Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
- Agent: The decision-maker (e.g., the humanoid robot).
- Environment: The world the agent interacts with (e.g., the physical space, objects, human partner).
- State: The current situation of the agent and environment (e.g., robot's joint angles, object's position, human's velocity).
- Action: A decision made by the agent (e.g., adjusting joint torques, changing speed).
- Reward: A scalar feedback signal from the environment that indicates the desirability of the agent's actions. The agent's goal is to learn a policy – a mapping from states to actions – that maximizes the total expected reward over time.
- Policy: The strategy that the agent uses to determine its next action based on the current state. A minimal interaction loop tying these pieces together is sketched below.
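To make the loop concrete, here is a minimal, generic agent-environment rollout in Python. It is not the paper's training code; `env` and `policy` are hypothetical stand-ins with a Gym-style interface, and the discount factor is illustrative.

```python
import numpy as np

def rollout(env, policy, num_steps=1000, gamma=0.99):
    """Minimal agent-environment interaction loop.

    `env` is any object with reset()/step(action) in the Gym style,
    and `policy` maps a state vector to an action vector.
    """
    state = env.reset()
    total_return, discount = 0.0, 1.0
    for _ in range(num_steps):
        action = policy(state)                 # policy: state -> action
        state, reward, done, _info = env.step(action)
        total_return += discount * reward      # accumulate discounted reward
        discount *= gamma
        if done:
            break
    return total_return
```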
3.1.3. Proprioception
Proprioception refers to the robot's internal sense of its own body's position, movement, and force. In robotics, this typically includes data from:
- Joint encoders: Measuring the angles and velocities of the robot's joints.
- Inertial Measurement Units (IMUs): Measuring orientation and angular velocity (e.g., root orientation, gravity vector).

Proprioception-only means the robot relies solely on these internal senses, without external sensors such as cameras (vision), LiDAR, or force/torque sensors at the end-effectors, for perceiving its environment and interacting with objects/humans. This is crucial for simplifying real-world deployment and reducing sensor dependence; a sketch of assembling such an observation is shown below.
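As an illustration of what "proprioception-only" input looks like in practice, the sketch below stacks a short history of internal measurements into a single observation vector. The class and field names are hypothetical; COLA's exact observation layout is described in Section 4.2.2.

```python
import numpy as np
from collections import deque

class ProprioBuffer:
    """Keeps a short history of proprioception-only measurements."""

    def __init__(self, history_len=25):
        self.frames = deque(maxlen=history_len)

    def add(self, joint_pos, joint_vel, root_quat, gravity, prev_action):
        # One frame: everything the robot can sense about its own body.
        frame = np.concatenate([joint_pos, joint_vel, root_quat, gravity, prev_action])
        self.frames.append(frame)

    def observation(self):
        # Flattened history, e.g. the input to a proprioception-only policy.
        return np.concatenate(list(self.frames))
```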
3.1.4. Whole-Body Control (WBC)
Whole-Body Control is a control strategy for complex robots (like humanoids) that coordinates all of the robot's joints and effectors simultaneously to achieve a desired task while respecting physical constraints (e.g., balance, joint limits). It contrasts with controlling individual limbs or joints in isolation. In the context of this paper, a WBC policy manages the robot's entire body to achieve both locomotion (movement) and manipulation (object handling) commands.
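For intuition, here is a minimal sketch of how a learned whole-body policy is typically wired to low-level PD position control at the joints (the same general structure COLA's Step 1 uses). The gains, function names, and the assumption of a plain `policy` callable are illustrative, not the paper's implementation.

```python
import numpy as np

def pd_torques(q_target, q, q_dot, kp=40.0, kd=1.0):
    """PD position control: motors drive joints toward the policy's target positions."""
    return kp * (q_target - q) - kd * q_dot

def wbc_step(policy, goal_command, proprio_history, q, q_dot):
    """One control step of a whole-body controller.

    `policy` maps [goal command, proprioceptive history] -> target joint positions,
    which are then tracked by the low-level PD loop above.
    """
    obs = np.concatenate([goal_command, proprio_history])
    q_target = policy(obs)              # action = target joint positions
    return pd_torques(q_target, q, q_dot)
```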
3.1.5. Compliance
In robotics, compliance refers to a robot's ability to yield or adapt to external forces or positional changes from its environment or interaction partners (e.g., humans). A compliant robot can absorb impacts and move naturally with a human, making interaction safer and more intuitive, as opposed to a stiff, position-controlled robot that resists any deviation from its programmed path. This is often achieved through impedance control or force control, where the robot's response to force or position errors is tuned.
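For intuition, a classical impedance-control law is sketched below: the commanded motion yields in proportion to the sensed external force, which is what makes the behavior compliant. This is a textbook illustration only; COLA achieves compliance through learning rather than an explicit impedance controller.

```python
import numpy as np

def impedance_control(x, x_desired, v, v_desired, f_ext,
                      stiffness=200.0, damping=20.0, mass=2.0):
    """Classic impedance law: the commanded acceleration yields to external force.

    Lower stiffness/damping -> more compliant behaviour; f_ext is the sensed
    or estimated interaction force acting on the end-effector.
    """
    spring = stiffness * (x_desired - x)
    damper = damping * (v_desired - v)
    accel_cmd = (spring + damper + f_ext) / mass
    return accel_cmd
```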
3.1.6. Residual Learning
Residual learning is a technique where a model learns to predict a residual (difference) from an existing baseline or simpler model, rather than learning the entire output from scratch. This can simplify the learning task for complex functions, as the residual might be easier to learn than the complete function. In this paper, a residual teacher policy learns to make corrective adjustments (residual actions) on top of a pre-trained Whole-Body Control (WBC) policy.
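A minimal sketch of the kind of residual composition COLA uses later (base WBC action plus a learned corrective term). The two policies are placeholders for trained networks; names are illustrative.

```python
import numpy as np

def collaborative_action(wbc_policy, residual_policy, wbc_obs, priv_obs):
    """Residual learning: the learned module only outputs a correction on top of the base action."""
    a_wbc = wbc_policy(wbc_obs)                                    # base whole-body action
    a_res = residual_policy(np.concatenate([wbc_obs, priv_obs]))   # small corrective term
    return a_wbc + a_res
```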
3.1.7. Teacher-Student Framework (Knowledge Distillation)
Knowledge distillation is a model compression technique where a smaller, simpler model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher). The teacher model typically has superior performance and might use more information (e.g., privileged information like ground truth object states). The student model is then deployed because it is more efficient and can operate with fewer inputs (e.g., proprioception-only). Behavioral cloning is a method used for distillation where the student learns by minimizing the difference between its outputs and the teacher's outputs for the same inputs.
3.1.8. Closed-Loop Environment
A closed-loop environment in simulation or control means that the output of the system feeds back as an input, creating a continuous feedback loop. In this paper, it means the robot's actions affect the object's state, and the object's state (along with human actions) influences the robot's subsequent decisions, creating dynamic, interactive learning.
3.2. Previous Works
The paper extensively references prior research in humanoid robot development and human-robot collaboration.
3.2.1. Humanoid Robot Development
Recent years have seen significant progress in:
- Agile Locomotion: Humanoids are learning to walk, run, and navigate complex terrains (e.g., [8, 14, 28, 33]). For instance, Styleloco [14] uses generative adversarial distillation for natural humanoid robot locomotion, and Humanoid Parkour Learning [33] explores dynamic movements.
- Teleoperation: Humans can control humanoids remotely for various tasks (e.g., [12, 26]). Clone [12] focuses on closed-loop whole-body humanoid teleoperation for long-horizon tasks, and Twist [26] is a teleoperated whole-body imitation system.
- Dexterous Manipulation: Humanoids are becoming more capable of handling objects with their hands (e.g., [19, 27]). Mimicdroid [19] focuses on in-context learning for humanoid robot manipulation from human play videos, and [27] explores generalizable humanoid manipulation with 3D diffusion policies.

These advancements highlight the growing capabilities of humanoids but often focus on individual skills rather than integrated collaboration.
3.2.2. Robot-Human Collaboration (General)
Robot-human collaboration is a long-standing research area [5, 17, 18, 25, 31, 32], but much of it has focused on robotic arms in confined workspaces.
- Robotic Arms: Compliant robot-human collaboration has been extensively developed for robotic arms, where force sensing and impedance control are often used to ensure safe and adaptable interaction [10, 11]. For instance, Impedance Learning-based Adaptive Force Tracking [11] focuses on robots on unknown terrains, and Learning Physical Collaborative Robot Behaviors from Human Demonstrations [18] explores learning from human examples.
- Intent Recognition: Predicting human intention is crucial for effective collaboration (e.g., [6, 13, 15, 16, 31]). Hybrid Recurrent Neural Network [6] and Multi-modal Policy Learning [13] are examples of intention recognition for human-robot collaboration. Robot Reading Human Gaze [16] highlights the importance of cues like eye tracking, and Closed-loop Open-vocabulary Mobile Manipulation [31] uses models like GPT-4V for intent.
3.2.3. Human-Humanoid Collaboration (Specific to Carrying)
Previous work on human-humanoid collaboration, particularly for object carrying, is limited:
- Model-Based Approaches: Some methods use heuristic rules or predefined subtasks (e.g., [1, 2, 17]). For example, [1] and [2] explore collaborative human-humanoid carrying using vision and haptic sensing, often breaking down tasks into basic walking patterns and primitive behaviors.
- Limited-Scope Learning: H2-COMPACT [3] proposes a learning-based model using haptic cues to predict horizontal velocity commands, but its scope is restricted.
- Force-Aware Control: While force regulation is crucial [11] and compliant control has been demonstrated in contact-rich manipulation [4, 24, 30], explicit force estimation for human-humanoid collaboration remains underexplored. FACET [24] focuses on force-adaptive control for legged robots, and Learning Unified Force and Position Control [30] addresses loco-manipulation.
3.2.4. Environment-Conditioned Locomotion
Prior work on environment-conditioned locomotion [14, 29, 33] has shown how robots can adapt their movement to different terrains. Falcon [29] focuses on force-adaptive humanoid loco-manipulation. While relevant, these works often don't fully integrate the challenges of dynamic human interaction and object stability during collaborative tasks.
3.3. Technological Evolution
The field has evolved from focusing on individual robotic capabilities (locomotion, manipulation, teleoperation) to increasingly complex interactions. Early human-robot collaboration for arms often relied on explicit programming or detailed force/position sensing. With humanoids, the complexity of whole-body dynamics necessitated model-based control or simpler heuristic rules for collaboration due to the difficulty of integrating all aspects. The rise of reinforcement learning has allowed for more adaptive and data-driven approaches, moving beyond explicit modeling to implicitly learning complex interaction dynamics. This paper's work represents a step in this evolution by:
- Leveraging advanced RL for whole-body control.
- Moving from explicit force sensing to implicit force estimation via proprioception.
- Integrating leader/follower roles within a single policy.
- Addressing whole-body coordination for complex collaborative carrying tasks, which was previously a gap.
3.4. Differentiation Analysis
Compared to the main methods in related work, COLA offers several core innovations:
- Whole-Body Coordination vs. Partial Coordination: Unlike prior human-humanoid collaboration methods [1, 2, 3, 17, 32] that neglect whole-body coordination or focus on limited aspects (e.g., horizontal velocity, predefined subtasks), COLA specifically enables whole-body coordination. This allows for complex tasks like picking objects from the ground or climbing slopes while carrying.
- Proprioception-Only for Real-World Deployment: Many force-aware control methods [11, 24, 30] rely on explicit force estimation using dedicated force/torque sensors. COLA differentiates itself by achieving compliant collaboration using proprioception-only inputs. It implicitly estimates interaction forces through joint state offsets, making it more practical for real-world deployment by reducing sensor requirements and complexity.
- Implicit Learning of Intentions and Object Dynamics: Instead of relying on multi-modal data or explicit intention prediction models [6, 13, 15, 16], COLA learns human intentions and object motion patterns implicitly within a closed-loop environment. This allows the robot to adapt its collaboration strategy in real time, which is difficult to encode with manually designed commands.
- Unified Leader/Follower Policy: Previous works often separate leader and follower roles or require explicit commands. COLA integrates both behaviors within a single policy, controlled by a simple velocity command (zero velocity for following), allowing for flexible role switching.
- Robustness and Generalization: By training in a closed-loop environment that explicitly models humanoid-object interactions and dynamic object interactions, COLA demonstrates superior generalization across diverse terrains, objects, and movement patterns compared to baselines. This addresses the challenge of adapting to diverse environments, a common limitation in prior work.

In essence, COLA moves beyond single-constraint solutions to integrate force interactions, implicit constraints, and dynamic coordination into a coherent framework for humanoid collaborative carrying, bridging the gap between advanced humanoid capabilities and practical, compliant human-humanoid collaboration.
4. Methodology
4.1. Principles
The core idea behind COLA is to leverage reinforcement learning to enable a humanoid robot to collaborate compliantly with a human partner for object carrying. The method is built on two key principles:
- Proxy for Interaction Forces: Offsets between joint states and their targets (i.e., the difference between the desired joint position/velocity and the actual one) serve as an implicit proxy for estimating interaction forces. This allows the robot to infer how much force is being applied by the human or object without needing dedicated force/torque sensors.
- Object State as Collaboration Constraints: The state of the carried object (e.g., its position, orientation, velocity) implicitly encodes critical collaboration constraints such as stability and coordination requirements. By learning to maintain desired object states, the robot inherently learns to collaborate effectively.

To achieve this, COLA employs a three-step training framework within a closed-loop environment that models the dynamic interactions between the humanoid, the object, and the human. This allows the robot to implicitly predict object motion patterns and human intentions, leading to compliant collaboration and load balance through coordinated trajectory planning. The ultimate goal is a proprioception-only policy for real-world deployment, reducing reliance on external sensors.
4.2. Core Methodology In-depth (Layer by Layer)
The COLA methodology is structured into three distinct learning steps: Whole-body controller training, Residual teacher policy training for collaboration, and Student policy distillation.
4.2.1. Task Definition
The task is defined as a humanoid assisting a human partner to transport an object that is challenging for a single person. The robot's objectives are:
- Coordinate Movement: Align its velocity with the human's velocity.
- Support Weight: Reduce the human's physical burden by supporting the object's weight.
- Stabilize Orientation: Maintain the object's orientation throughout transportation.
4.2.2. Step 1: Whole-body Control (WBC) Policy Training
In the first step, a foundational Whole-Body Control (WBC) policy is trained in a simulator without specific collaboration constraints. This policy is responsible for the robot's basic motor skills, locomotion, and manipulation.
- Goal Command ($\mathcal{G}$): The WBC policy receives a combined goal command $\mathcal{G}$, which includes both lower-body locomotion and upper-body end-effector commands.
  - Lower-body locomotion goal command: specifies the desired linear velocity $v^{\mathrm{lin}}_{t}$, angular velocity $v^{\mathrm{ang}}_{t}$, and root height $h^{\mathrm{root}}_{t}$ for the robot's base:

    $ \mathcal{G}^{\mathrm{lower}}_{t} \triangleq \left[ v^{\mathrm{lin}}_{t}, v^{\mathrm{ang}}_{t}, h^{\mathrm{root}}_{t} \right] $
  - Upper-body end-effector goal command: specifies the target position $p^{\mathrm{ee}}$ and orientation $r^{\mathrm{ee}}$ for the robot's end-effectors (e.g., hands):

    $ \mathcal{G}^{\mathrm{upper}} = \left[ p^{\mathrm{ee}}, r^{\mathrm{ee}} \right] $
  - The combined goal command is $ \mathcal{G} = [ \mathcal{G}^{\mathrm{lower}}, \mathcal{G}^{\mathrm{upper}} ] $.
- Observation Space ($\mathcal{O}^{\mathrm{wbc}}_{t}$): The WBC policy takes as input a history of the robot's proprioceptive observations, which includes:
  - Joint positions $q^{\mathrm{pos}}_{t-l:t}$: the positions of the robot's joints (excluding fingers) over a history of length $l$.
  - Joint velocities $q^{\mathrm{vel}}_{t-l:t}$: the velocities of the robot's joints over a history of length $l$.
  - Robot root orientation $\omega^{\mathrm{root}}_{t-l:t}$: the orientation of the robot's base in quaternion form over a history of length $l$.
  - Gravity vector $g_{t-l:t}$: the gravity vector expressed in the robot's root frame over a history of length $l$.
  - Previous actions $a^{\mathrm{prev}}_{t-(l+1):t-1}$: the actions taken by the robot in the preceding time steps.

  $ \mathcal{O}^{\mathrm{wbc}}_{t} \triangleq \left[ q^{\mathrm{pos}}_{t-l:t}, q^{\mathrm{vel}}_{t-l:t}, \omega^{\mathrm{root}}_{t-l:t}, g_{t-l:t}, a^{\mathrm{prev}}_{t-(l+1):t-1} \right] $

  where $l$ is the length of the history.
- Action Space ($\mathcal{A}^{\mathrm{wbc}}$): The action space for the WBC policy represents the target joint positions for the robot's joints. PD position control is used for actuation, meaning the robot's motors try to reach these target positions.
- Policy Function ($\mathcal{F}^{\mathrm{wbc}}$): The WBC policy is formally defined as a function that maps the goal command and proprioceptive observations to the action:

  $ \mathcal{F}^{\mathrm{wbc}} : \mathcal{G} \times \mathcal{O}^{\mathrm{wbc}} \to \mathcal{A}^{\mathrm{wbc}}, \quad \mathcal{A}^{\mathrm{wbc}} \in \mathbb{R}^{N} $

  where $\mathbb{R}^{N}$ denotes an $N$-dimensional real vector, representing the target positions for the $N$ joints.
- Training Details: The WBC policy is trained using Proximal Policy Optimization (PPO) with rewards following prior works [21, 29]. To improve robustness under payloads, external forces are applied to the humanoid's end-effectors during training, enhancing its force-adaptive capabilities.
4.2.3. Step 2: Residual Teacher Policy Training
In the second step, a residual teacher policy is trained on top of the pre-trained WBC policy within a closed-loop environment. This environment explicitly models the dynamic interaction between the human, object, and humanoid. The teacher policy has access to privileged information to accurately model object dynamics.

- Closed-Loop Training Environment: As illustrated in Figure 3, the environment includes the humanoid, a supporting base body (simulating the human carrier), and the carried object. The object is connected to the support body via a 6-DoF joint. The object is placed in the robot's hand, and the hand joints are fixed in a predefined grasp pose.

Figure 3 (schematic of the closed-loop training environment): the green arrow represents the goal velocity of the carried object, while the red arrow indicates its current velocity; the right side shows the corresponding dynamics of the humanoid interacting with the object.

- Privileged Information ($\mathcal{O}^{\mathrm{priv}}_{t}$): The teacher policy is granted access to privileged information about the carried object, which includes:
  - Linear velocity $\widetilde{v}^{\mathrm{lin}}_{t-l:t}$: the object's ground-truth linear velocity history.
  - Angular velocity $\widetilde{v}^{\mathrm{ang}}_{t-l:t}$: the object's ground-truth angular velocity history.
  - Position $\widetilde{p}_{t-l:t}$: the object's ground-truth position history.
  - Orientation $\widetilde{r}_{t-l:t}$: the object's ground-truth orientation history.

  $ \mathcal{O}^{\mathrm{priv}}_{t} \triangleq \left[ \widetilde{v}^{\mathrm{lin}}_{t-l:t}, \widetilde{v}^{\mathrm{ang}}_{t-l:t}, \widetilde{p}_{t-l:t}, \widetilde{r}_{t-l:t} \right] $

  with a history of length $l$.
- Teacher Observation Space ($\mathcal{O}^{\mathrm{teacher}}_{t}$): The teacher policy receives both the robot's proprioceptive observations ($\mathcal{O}^{\mathrm{wbc}}_{t}$) and the privileged information ($\mathcal{O}^{\mathrm{priv}}_{t}$):

  $ \mathcal{O}^{\mathrm{teacher}}_{t} \triangleq [\mathcal{O}^{\mathrm{wbc}}_{t}, \mathcal{O}^{\mathrm{priv}}_{t}] $
- Residual Action ($\mathcal{A}^{\mathrm{teacher}}$): The teacher policy does not directly output the full action. Instead, it outputs a residual action, a corrective adjustment to the WBC policy's output. The final collaborative action $\mathcal{A}^{\mathrm{collab}}$ is the sum of the WBC action $\mathcal{A}^{\mathrm{wbc}}$ and the residual action:

  $ \mathcal{A}^{\mathrm{collab}} = \mathcal{A}^{\mathrm{wbc}} + \mathcal{A}^{\mathrm{teacher}} $
- Policy Function ($\mathcal{F}^{\mathrm{teacher}}$): The teacher policy is defined as:

  $ \mathcal{F}^{\mathrm{teacher}} : [ \mathcal{O}^{\mathrm{wbc}}, \mathcal{O}^{\mathrm{priv}} ] \to \mathcal{A}^{\mathrm{teacher}}, \quad \mathcal{A}^{\mathrm{teacher}} \in \mathbb{R}^{N} $
- Reward Function: The teacher's learning is guided by a composite reward function that combines base whole-body control rewards (from the WBC training) with task-specific rewards for collaboration. These rewards are detailed in Table I and are crucial for learning compliant and coordinated carrying. The following are the results from Table I of the original paper:

  | Term | Expression | Weight |
  | --- | --- | --- |
  | Linear Vel. Tracking | $\varphi\big(v^{\mathrm{applied,lin}}, v^{\mathrm{robot,lin}}\big)$ | 1.0 |
  | Yaw Vel. Tracking | $\varphi\big(v^{\mathrm{ang,goal}}, v^{\mathrm{ang,robot}}\big)$ | 1.0 |
  | Z-axis Vel. Penalty | $-k_{v}\,\lvert v^{z}_{\mathrm{obj}} \rvert$ | 0.05 |
  | Height Diff. Penalty | $-k_{h}\,\lvert h^{\mathrm{human}}_{\mathrm{obj}} - h^{\mathrm{robot}}_{\mathrm{obj}} \rvert$ | 10.0 |
  | Force Penalty | $-\lVert F_{\mathrm{support\text{-}obj}} \rVert$ | 0.002 |

  The reward terms are:
  - Linear Vel. Tracking: rewards tracking the linear velocity of the carried object. The expression $\varphi(\cdot, \cdot)$ is a Gaussian-like function that gives a higher reward for smaller errors.
    - $v^{\mathrm{applied,lin}}$: the linear velocity applied to the center of mass of the object.
    - $v^{\mathrm{robot,lin}}$: the robot's linear velocity.
  - Yaw Vel. Tracking: rewards tracking the angular (yaw) velocity of the carried object.
    - $v^{\mathrm{ang,goal}}$: the goal angular velocity.
  - Z-axis Vel. Penalty: penalizes vertical velocity of the object to maintain stability.
    - $k_{v}$: a constant.
    - $v^{z}_{\mathrm{obj}}$: the object's velocity along the Z-axis.
  - Height Diff. Penalty: penalizes differences in height between the object ends held by the human and the humanoid.
    - $h^{\mathrm{human}}_{\mathrm{obj}}, h^{\mathrm{robot}}_{\mathrm{obj}}$: the heights of the object's two ends.
    - $k_{h}$: a constant.
  - Force Penalty: penalizes the horizontal force between the support body and the object, aiming to minimize the human's effort.
    - $F_{\mathrm{support\text{-}obj}}$: the horizontal force between the support body and the object.
- Goal Command Modification: During this step, the goal command of the model is modified based on the settings described in Section IV (Implementation Details), specifically for collaborative carrying. The velocity $v^{\mathrm{applied}}$ is applied to the supporting base body at the end of the object opposite the robot-held end, with its magnitude sampled from a range. Angular velocity control uses a PD controller to apply torque to the support body, and height control samples a target height for the support body and applies a PD-controlled force (a minimal sketch of this support-body control and the residual action composition follows below).
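A minimal sketch of one closed-loop training step as described above: PD-style forces drive the simulated human carrier (the support body) toward its sampled velocity and height targets, while the humanoid executes the WBC action plus the teacher's residual. The `sim` interface, gains, and dictionary keys are hypothetical placeholders, not the paper's API.

```python
import numpy as np

def support_body_forces(state, target, kp_ang=30.0, kp_h=400.0, kd_h=40.0):
    """PD-style control of the simulated human carrier (the 'support body')."""
    # Torque proportional to the yaw-velocity tracking error.
    yaw_torque = kp_ang * (target["ang_vel"] - state["ang_vel"])
    # PD force pulling the support body toward the sampled target height.
    lift_force = kp_h * (target["height"] - state["height"]) - kd_h * state["height_vel"]
    return yaw_torque, lift_force

def closed_loop_step(sim, wbc_policy, teacher_policy, wbc_obs, priv_obs, v_applied, target):
    """One training step: human-side inputs plus the humanoid's collaborative action."""
    yaw_torque, lift_force = support_body_forces(sim.support_state(), target)
    sim.apply_to_support_body(linear_velocity=v_applied,
                              torque=yaw_torque,
                              vertical_force=lift_force)
    # A_collab = A_wbc + A_teacher (residual correction from privileged observations).
    a_collab = wbc_policy(wbc_obs) + teacher_policy(np.concatenate([wbc_obs, priv_obs]))
    return sim.step(a_collab)   # hypothetical simulator call returning next obs and rewards
```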
4.2.4. Step 3: Knowledge Distillation (Student Policy Training)
In the final step, the expertise learned by the combined WBC and residual teacher policy (which produces $\mathcal{A}^{\mathrm{collab}}$) is distilled into a student policy $\mathcal{F}^{\mathrm{student}}$. This student policy is designed for real-world deployment and operates solely on proprioceptive observations ($\mathcal{O}^{\mathrm{wbc}}$), without access to privileged information.

- Student Observation Space: The student policy only receives the proprioceptive observations $\mathcal{O}^{\mathrm{wbc}}$.
- Student Action Space: The student policy outputs its action $\mathcal{A}^{\mathrm{student}}$, which is also a vector of target joint positions.
- Policy Function ($\mathcal{F}^{\mathrm{student}}$): The student policy is defined as:

  $ \mathcal{F}^{\mathrm{student}} : \mathcal{O}^{\mathrm{wbc}} \to \mathcal{A}^{\mathrm{student}}, \quad \text{where } \mathcal{A}^{\mathrm{student}} \in \mathbb{R}^{N} $
- Distillation Method: Behavioral cloning is used to distill the teacher policy into the student policy. The student is trained to mimic the teacher's behavior by minimizing the mean squared error between their outputs during interactions with the environment (a training sketch follows at the end of this subsection). The loss function for distillation is:

  $ \mathcal{L}_{\mathrm{distill}} = \mathbb{E} \left[ \| \mathcal{A}^{\mathrm{student}} - \mathcal{A}^{\mathrm{collab}} \|^{2} \right] $

  where:
  - $\mathcal{A}^{\mathrm{student}}$: the action output by the student policy.
  - $\mathcal{A}^{\mathrm{collab}}$: the collaborative action (the teacher's residual added to the WBC action, i.e., $\mathcal{A}^{\mathrm{wbc}} + \mathcal{A}^{\mathrm{teacher}}$).
  - $\mathbb{E}$: the expected value.
  - $\| \cdot \|^{2}$: the squared Euclidean norm, representing the squared difference between the student's action and the teacher's action.
- Role Allocation (COLA-F and COLA-L): The paper defines two experimental settings based on the goal command observation:
  - COLA-F (Follower): All networks receive a goal command input of zero, so the robot primarily follows the human's implicit cues.
  - COLA-L (Leader): The policy is provided with a sampled goal command (within the range used for WBC), allowing the robot to actively lead or pursue a specific trajectory while collaborating. Role allocation is effectively controlled via the velocity command, where zero velocity implies a follower role.

The overall training pipeline is illustrated in Figure 2 (from the original paper).

Figure 2 (schematic of the overall training pipeline): Step 1 trains the whole-body controller from goal commands and proprioceptive information; Step 2 trains the residual teacher for collaboration; Step 3 distills the student policy via behavioral cloning. The figure also shows real-world scenes of a human and the humanoid carrying objects together, illustrating the application setting.

The diagram shows the three steps: whole-body controller training, residual teacher policy training, and student policy distillation using behavioral cloning. The teacher uses privileged information and proprioception to output a residual action that adjusts the WBC action, forming the collaborative action. The student learns from the collaborative action using only proprioception for real-world deployment.
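A minimal PyTorch-style sketch of this distillation step, combining the (512, 256, 128) MLP sizes reported in the implementation details with the behavioral-cloning loss $\mathcal{L}_{\mathrm{distill}}$. The activation function, input/output dimensions, optimizer, and data handling are assumptions for illustration; only the hidden sizes and the MSE objective come from the paper.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=(512, 256, 128)):
    """MLP with the hidden sizes reported in the implementation details."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ELU()]   # ELU is an assumed activation choice
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def distill_step(student, teacher_action, proprio_obs, optimizer):
    """One behavioral-cloning update: match the teacher's collaborative action."""
    pred = student(proprio_obs)                    # student sees proprioception only
    loss = ((pred - teacher_action) ** 2).mean()   # L_distill = E[||A_student - A_collab||^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (dimensions are illustrative, not taken from the paper):
# student = make_mlp(in_dim=1050, out_dim=29)
# optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)
```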
4.2.5. Implementation Details
- Training Setup:
  - Platform: Isaac Lab simulator.
  - Hardware: a single RTX 4090D GPU.
  - Algorithm: PPO with 4096 parallel environments.
  - Network Architecture:
    - WBC actor and critic networks: three-layer Multi-Layer Perceptrons (MLPs) of size (512, 256, 128).
    - Residual teacher and student policy networks: two additional MLPs with the same dimensions (512, 256, 128) stacked on top of the WBC network.
  - Training Steps:
    - WBC: 350k environment steps (approx. 15k PPO updates).
    - Residual Teacher: 250k environment steps (approx. 10k PPO updates).
    - Distillation: 250k environment steps (approx. 10k PPO updates).
  - Total Training Time: 48 hours.
- Observation Space Details (Command Sampling):
  - Whole-body control commands are sampled from predefined ranges (a sampling sketch follows below).
  - End-effector goal command: represents the 6-DoF target pose (position and orientation) of the robot's wrist.
    - Since the task focuses on collaborative carrying rather than complex upper-body manipulation, large-range upper-body motions are not sampled.
    - End-effector positions are randomly sampled within a small cubic region near the default grasping pose.
    - End-effector orientations are sampled within a conical region around the nominal grasp orientation using Spherical Linear Interpolation (SLERP).
  - The WBC achieves low tracking errors for both the end-effector goal position and the end-effector goal orientation.
  - The carried object and support body are connected via a 6-DoF joint. Friction, damping, and joint limits ensure that support body movements are faithfully transmitted to the object. The following are the results from Table II of the original paper:

  | Term | Range |
  | --- | --- |
  | Base Lin. Vel. X (m/s) | (−0.8, 1.2) |
  | Base Lin. Vel. Y (m/s) | (−0.5, 0.5) |
  | Base Ang. Vel. (rad/s) | (−1.2, 1.2) |
  | Base Height (m) | (0.45, 0.9) |
  | End-effector Position (m) | 0.15 |
  | End-effector Orientation (rad) | π/6 |
  | Support Object Lin. Vel. (m/s) | (−0.6, 1.0) |
  | Support Object Ang. Vel. (rad/s) | (−0.8, 0.8) |
  | Support Object Height (m) | (0.5, 0.85) |

  *Note: End-effector Position denotes the side length of the cube from which the goal position is sampled; End-effector Orientation denotes the half-angle of the cone that defines the sampling range of orientation goals.*

  - Base Lin. Vel. X (m/s): linear velocity along the robot's forward/backward axis.
  - Base Lin. Vel. Y (m/s): linear velocity along the robot's sideways axis.
  - Base Ang. Vel. (rad/s): angular velocity around the robot's vertical (yaw) axis.
  - Base Height (m): desired height of the robot's base.
  - End-effector Position (m): the side length of the cubic region from which the end-effector goal position is sampled.
  - End-effector Orientation (rad): the half-angle of the cone defining the sampling range for end-effector orientation goals.
  - Support Object Lin. Vel. (m/s): linear velocity applied to the simulated human's side of the object.
  - Support Object Ang. Vel. (rad/s): angular velocity applied to the simulated human's side of the object.
  - Support Object Height (m): height of the simulated human's side of the object.
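A small sketch of command sampling in the spirit of Table II. The range values mirror the reconstructed table above and should be treated as illustrative; the cone-based orientation sampling is a simple stand-in for the SLERP-based scheme the paper describes.

```python
import numpy as np

# Ranges in the spirit of Table II (treat exact values as illustrative).
COMMAND_RANGES = {
    "base_lin_vel_x": (-0.8, 1.2),
    "base_lin_vel_y": (-0.5, 0.5),
    "base_ang_vel":   (-1.2, 1.2),
    "base_height":    (0.45, 0.9),
}

def sample_goal_command(rng, ee_cube_side=0.15, ee_cone_half_angle=np.pi / 6):
    cmd = {k: rng.uniform(lo, hi) for k, (lo, hi) in COMMAND_RANGES.items()}
    # End-effector position: offset inside a small cube around the default grasp pose.
    cmd["ee_pos_offset"] = rng.uniform(-ee_cube_side / 2, ee_cube_side / 2, size=3)
    # End-effector orientation: random axis with a tilt bounded by the cone half-angle
    # (a simple stand-in for the SLERP-based sampling described above).
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    cmd["ee_rot_vec"] = axis * rng.uniform(0.0, ee_cone_half_angle)
    return cmd

# Usage: rng = np.random.default_rng(0); print(sample_goal_command(rng))
```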
5. Experimental Setup
5.1. Datasets
The paper does not use traditional datasets in the supervised learning sense. Instead, it relies on a closed-loop training environment in a simulator (Isaac Lab) to generate continuous interaction data for reinforcement learning.
5.1.1. Closed-Loop Training Environment
The simulation environment is dynamically constructed to model the interactions:
- Components: Humanoid robot (G1 model with 29 joints, excluding fingers), a supporting base body (simulating the human carrier), and a carried object.
- Interaction Model: The object is connected to the support body via a 6-DoF joint. The object is placed in the robot's hand, and the hand joints are fixed in a predefined grasp pose.
- Dynamic Inputs:
  - A goal command $\mathcal{G}$ is randomly sampled (ranges defined in Table II) to guide the humanoid's movement.
  - A velocity $v^{\mathrm{applied}}$ is sampled and applied to the supporting base body (representing the human's side) at the object's opposite end. This applied velocity is updated at twice the frequency of the goal command to simulate dynamic human input.
  - For angular velocity control, a target angular velocity is set, and a PD controller applies torque to the support body.
  - For height control, a target height for the support body is randomly sampled, and a PD-controlled force adjusts its height. The robot is not required to maintain a fixed height, allowing for adaptive responses.

This dynamic, interactive simulation setup serves as the "data generation" mechanism, allowing the reinforcement learning agent to learn from continuous interaction rather than a static dataset.
5.1.2. Objects and Terrains
- Simulation: The paper implicitly states that various objects and terrains are used during simulation to test effectiveness, generalization, and robustness. The figures show diverse objects such as a rod, box, stretcher, and cart.
- Real-World: Real-world experiments use boxes, desks, and stretchers as carried objects, and movement patterns include straight-line walking, turning, and slope climbing.
5.2. Evaluation Metrics
The paper uses several quantitative metrics to evaluate performance, categorized into trajectory following, height tracking, and coordination/effort reduction.
5.2.1. Linear Velocity Tracking Error (Lin. Vel.)
- Conceptual Definition: This metric quantifies how accurately the robot's linear velocity matches the human's (or the desired object's) linear velocity during the collaborative carrying task. A lower value indicates better coordination in terms of forward/backward and sideways movement.
- Mathematical Formula: $ \text{Lin. Vel. Error} = \frac{1}{T} \sum_{t=1}^{T} | v_{\mathrm{robot}, t}^{\mathrm{lin}} - v_{\mathrm{human}, t}^{\mathrm{lin}} | $
- Symbol Explanation:
  - $T$: total number of time steps (duration of the episode).
  - $v_{\mathrm{robot}, t}^{\mathrm{lin}}$: the linear velocity of the robot at time step $t$.
  - $v_{\mathrm{human}, t}^{\mathrm{lin}}$: the linear velocity of the human (or the desired linear velocity of the object) at time step $t$.
  - $\| \cdot \|$: Euclidean norm, representing the magnitude of the difference.
5.2.2. Angular Velocity Tracking Error (Ang. Vel.)
- Conceptual Definition: This metric measures how well the robot's angular velocity (rotational movement, specifically yaw) aligns with the human's (or desired object's) angular velocity. A lower value signifies better rotational coordination.
- Mathematical Formula: $ \text{Ang. Vel. Error} = \frac{1}{T} \sum_{t=1}^{T} | \omega_{\mathrm{robot}, t} - \omega_{\mathrm{human}, t} | $
- Symbol Explanation:
  - $T$: total number of time steps.
  - $\omega_{\mathrm{robot}, t}$: the angular velocity of the robot at time step $t$.
  - $\omega_{\mathrm{human}, t}$: the angular velocity of the human (or the desired angular velocity of the object) at time step $t$.
  - $\| \cdot \|$: Euclidean norm.
5.2.3. Height Error (Height Err.)
- Conceptual Definition: This metric assesses the stability of height coordination during carrying. It measures the difference in vertical height between the object end held by the human and the object end held by the humanoid, indicating how level the object is maintained. A lower error implies greater object stability and better load balance.
- Mathematical Formula: $ \text{Height Err.} = \frac{1}{T} \sum_{t=1}^{T} | h_{\mathrm{human-end}, t} - h_{\mathrm{robot-end}, t} | $
- Symbol Explanation:
  - $T$: total number of time steps.
  - $h_{\mathrm{human\text{-}end}, t}$: the height of the object end held by the human at time step $t$.
  - $h_{\mathrm{robot\text{-}end}, t}$: the height of the object end held by the humanoid at time step $t$.
  - $| \cdot |$: absolute difference.
5.2.4. Average External Force (Avg. E.F.)
- Conceptual Definition: This metric quantifies the average horizontal interaction force between the human (or the simulated
support body) and the object. It directly reflects the physical effort required from the human to move the carried object along the intended direction. A lower force indicates that the robot is contributing more effectively to the carrying task, thereby reducing the human's burden and demonstrating stronger compliance. - Mathematical Formula: (Based on the paper's description, this would be the magnitude of the force applied by the human's simulated side to the object.) $ \text{Avg. E.F.} = \frac{1}{T} \sum_{t=1}^{T} | F_{\mathrm{human-obj}, t}^{\mathrm{horizontal}} | $
- Symbol Explanation:
  - $T$: total number of time steps.
  - $F_{\mathrm{human\text{-}obj}, t}^{\mathrm{horizontal}}$: the horizontal force exerted by the human (or supporting base body) on the object at time step $t$.
  - $\| \cdot \|$: Euclidean norm, representing the magnitude of the horizontal force vector.
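The four metrics above can be computed directly from logged trajectories. The sketch below mirrors the formulas, assuming per-time-step arrays (linear velocities and horizontal forces as 2D/3D vectors, yaw rates and heights as scalars); the function and argument names are illustrative.

```python
import numpy as np

def evaluation_metrics(v_robot, v_human, w_robot, w_human,
                       h_human_end, h_robot_end, f_horizontal):
    """Compute the four evaluation metrics from logged trajectories (one row per time step)."""
    lin_vel_err = np.mean(np.linalg.norm(v_robot - v_human, axis=1))
    ang_vel_err = np.mean(np.abs(w_robot - w_human))
    height_err  = np.mean(np.abs(h_human_end - h_robot_end))
    avg_ext_force = np.mean(np.linalg.norm(f_horizontal, axis=1))
    return {"Lin. Vel.": lin_vel_err, "Ang. Vel.": ang_vel_err,
            "Height Err.": height_err, "Avg. E.F.": avg_ext_force}
```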
5.2.5. Human User Study Metrics
For human user studies, participants rated compliance and height tracking on a scale of 1 to 5, where a higher score indicates better performance.
- Height Tracking (User Study): Qualitative assessment of how well the object's height is maintained and coordinated.
- Smoothness (User Study): Qualitative assessment of the fluidity and naturalness of the collaboration.
5.3. Baselines
The paper compares COLA against several baseline models to demonstrate its effectiveness and justify architectural choices.
5.3.1. Vanilla MLP
- Description: This baseline trains a simple Multi-Layer Perceptron (MLP) policy directly from scratch, initialized with the weights of the Whole-Body Controller (WBC). It is trained end-to-end with PPO to perform the collaborative carrying task.
- Purpose: To evaluate the benefit of the residual learning and teacher-student distillation framework compared to a direct, monolithic RL approach.
5.3.2. Explicit Goal Estimation
- Description: This baseline replaces the whole-body control command with a predicted one and removes the residual component from the teacher policy. The resulting policy is then distilled into a student policy. In other words, the model explicitly tries to predict the next WBC command from observations, rather than learning a residual adjustment.
- Purpose: To investigate whether explicitly predicting WBC commands is as effective as learning residual adjustments for collaborative tasks, especially in dynamic interaction scenarios. It tests the hypothesis that collaboration requires implicitly learning dynamic interactions rather than just predicting goal commands.
5.3.3. Transformer
- Description: This baseline replaces the student policy's original MLP-based architecture with a Transformer network. Transformers are known for their ability to process sequential data and capture long-range dependencies.
- Purpose: To evaluate the architectural choice of MLP versus Transformer for the student policy. It assesses whether the Transformer's temporal processing capabilities are beneficial or detrimental for real-time, compliant human-humanoid collaboration, where prompt adaptation to dynamic human movements may matter more than long-term memory.
5.3.4. Locomotion (Implicit Baseline)
Although not explicitly listed as a baseline for the quantitative comparison table, a "Locomotion" policy is mentioned in the Human User Study results (Table IV) and Compliance to External Forces analysis (Figure 5b). This likely refers to a basic WBC policy trained for locomotion without specific collaborative carrying capabilities, serving as a very naive baseline for interaction.
6. Results & Analysis
6.1. Core Results Analysis
The experiments evaluate COLA against baselines in both simulation and real-world scenarios, focusing on trajectory following, height tracking, human effort reduction, and compliance.
6.1.1. Simulation Results: Effectiveness and Compliance
The simulation results, presented in Table III, compare COLA (in both COLA-F and COLA-L settings with varying history lengths) against the Explicit Goal Estimation and Transformer baselines.
The following are the results from Table III of the original paper:
| Methods | Lin. Vel. (m/s) ↓ | Ang. Vel. (rad/s) ↓ | Height Err. (m) ↓ | Avg. E.F. (N) ↓ |
| --- | --- | --- | --- | --- |
| Explicit Goal Estimation | 0.235 | 0.335 | 0.102 | 19.067 |
| Transformer | 0.178 | 0.310 | 0.077 | 19.382 |
| COLA-F-History10 | 0.121 | 0.131 | 0.037 | 15.435 |
| COLA-F-History50 | 0.116 | 0.132 | 0.036 | 14.574 |
| COLA-F | 0.109 | 0.118 | 0.031 | 14.576 |
| COLA-L-History10 | 0.118 | 0.106 | 0.039 | 13.924 |
| COLA-L-History50 | 0.112 | 0.103 | 0.036 | 13.495 |
| COLA-L | 0.102 | 0.098 | 0.038 | 12.298 |
- Superior Trajectory Tracking: COLA consistently outperforms baselines on both Linear Velocity (Lin. Vel.) and Angular Velocity (Ang. Vel.) tracking errors. COLA-L achieves the lowest Lin. Vel. error of 0.102 m/s and Ang. Vel. error of 0.098 rad/s, demonstrating precise coordination with human movements.
- Better Height Stability: COLA also shows significantly lower Height Error (Height Err.), with COLA-F achieving 0.031 m and COLA-L remaining competitive at 0.038 m. This indicates superior object stability during carrying.
- Reduced Human Effort (Compliance): The Avg. E.F. metric shows that COLA drastically reduces the average external force required from the human.
  - COLA-L achieves the lowest Avg. E.F. of 12.298 N.
  - Compared to the best baseline (Explicit Goal Estimation at 19.067 N), COLA-L reduces human effort by roughly 35% relative to that baseline. The abstract mentions a 24.7% reduction compared to baseline approaches, and the "Overall" summary mentions a 31.47% reduction; these values vary depending on the specific baseline and calculation context, but all indicate a significant reduction.
  - The lower Avg. E.F. reflects COLA's stronger compliance and active participation in the carrying task.
6.1.2. Comparison with Baselines
- Explicit Goal Estimation: This baseline performs the poorest across all metrics, highlighting that collaborative carrying is more complex than just predicting whole-body control commands; the dynamic interactions require implicit learning within a closed-loop environment.
- Transformer: While better than Explicit Goal Estimation, the Transformer baseline is significantly outperformed by COLA. This suggests that the Transformer's temporal processing may introduce unnecessary complexity for prompt adaptation to human movements, which is critical for real-time collaboration.
- Vanilla MLP: The paper discusses Vanilla MLP in the text (but not in Table III). It achieves relatively high performance among baselines but struggles with Ang. Vel. and Height Err., indicating difficulty in inferring angular and vertical movements compared to linear ones. This further supports the need for the teacher-student distillation framework to learn complex interaction patterns.
6.1.3. COLA-L vs. COLA-F
COLA-L consistently outperforms COLA-F (lower Lin. Vel., Ang. Vel., and Avg. E.F.). This is attributed to the goal command provided to COLA-L, which helps the policy learn to collaborate more actively and precisely. The goal command provides informative cues, especially in the presence of noise and disturbances.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Architecture Choice (MLP vs. Transformer)
The comparison in Table III shows COLA (MLP-based student policy) outperforms the Transformer baseline. The Transformer also required twice the training steps to converge. This indicates that a compact MLP-based model is more effective. The authors hypothesize that the Transformer's long-term temporal processing introduces unnecessary complexity. For collaboration, the robot needs to adapt promptly to current human movements, and relying on outdated information (which a Transformer might over-process) could lead to hesitation and degraded cooperation.
6.2.2. History Length
The ablation study on history length for COLA-F and COLA-L (Table III shows History10 and History50 variants) reveals:
- A shorter history (History10 vs. the default History25 for COLA-F/L) provides insufficient information for implicit collaboration learning.
- Increasing the history length to 50 (History50) yields little improvement over History25.
- The authors chose 25 as a balance between performance and learning efficiency. This suggests that the task is not highly sensitive to long-term joint position changes, corroborating the finding that the MLP-based architecture is preferable to a Transformer that focuses on long sequences.
6.3. Real-World Experiments: Practical Value and Compliance
6.3.1. Qualitative Results
Figure 4 showcases the qualitative effectiveness of COLA in real-world scenarios.
Figure 4 (schematic images): a variety of human-humanoid collaborative carrying scenarios, including stretcher carrying, rod height tracking, box carrying on a slope, and cart operation. These scenes highlight the collaborative capability of the robot and the human in dynamic environments.
The images demonstrate successful collaborative carrying of diverse objects (rod, box, stretcher, cart) across various grasping poses and even on sloped terrains. This highlights the versatility and generalizability of the method. The model implicitly learns to interpret human intentions through force-based interaction (pushing/pulling), allowing the humanoid to infer desired movements and execute them autonomously.
6.3.2. Quantitative Real-World Metrics
Figure 5 presents quantitative results for compliance and height tracking in real-world settings.
Figure 5 (charts): quantitative results on the effectiveness of collaborative carrying. Panel (a) shows the robot's base velocity under different external forces, (b) shows the robot's height over time steps, and (c) and (d) compare the minimal real-world force required to move the robot and the height difference, respectively.
- Compliance to External Forces (Simulation - Figures 5a & 5b):
  - Figure 5a shows the robot's velocity response to an external force applied to its palm. COLA initiates movement when the force exceeds 15 N, while the baseline model remains stationary; forces below this threshold are interpreted as stabilization cues. This indicates COLA's ability to discern between stabilizing the object and initiating movement based on a force threshold.
  - Figure 5b illustrates the height response to vertical external forces on the end-effector. The Locomotion policy maintains a constant height, and Vanilla MLP squats to a fixed height, passively supporting the force. In contrast, both COLA settings (COLA-F and COLA-L) actively comply with vertical disturbances, demonstrating agile full-body motions.
- Minimal Force to Move Robot (Real-World - Figure 5c): Figure 5c compares the minimal force required to move the robot in the real world. COLA demonstrates stronger compliance by requiring less force to initiate movement than the baseline, directly reflecting a reduction in human effort.
- Height Difference (Real-World - Figure 5d): Figure 5d shows the height difference between the human-held end and the humanoid-held end of the object in real-world experiments. COLA reduces this height-tracking error by approximately three-quarters compared to the baseline, confirming stable object pose tracking in real-world collaborative tasks.
6.3.3. Human User Studies
A study with 23 participants rated COLA's performance on Height Tracking and Smoothness on a scale of 1 to 5.
The following are the results from Table IV of the original paper:
| Methods | Height Tracking ↑ | Smoothness ↑ |
| --- | --- | --- |
| Locomotion | 2.96 | 2.61 |
| Vanilla MLP | 3.09 | 3.09 |
| COLA | 3.96 | 3.96 |
- Superior User Ratings: COLA achieves the highest scores in both Height Tracking (3.96) and Smoothness (3.96), compared to the Locomotion and Vanilla MLP baselines. This confirms COLA's effectiveness and provides quantitative evidence of improved user experience in real-world collaborative scenarios. The abstract states an average improvement of 27.4% compared to baseline models, which is consistent with these higher scores.
6.3.4. Implicit Force Estimation from Joint States
Figure 6 illustrates how the humanoid's behavior is sensitive to forces applied at specific joints.
Figure 6 (schematic): two different collaborative behaviors. On the left, the robot maintains a stable posture when an external force is applied to its torso; on the right, it follows a smaller external force applied to its end-effector.
When forces are applied to the hand or arm during carrying, the humanoid tends to follow. Conversely, forces applied to the torso or legs result in the humanoid maintaining a stable stance. This observation suggests that COLA effectively learns interaction dynamics by interpreting offsets between joint states and their targets as cues for interaction forces and human intentions, without explicit force sensors.
6.4. Advantages and Disadvantages
Advantages:
- Reduced Human Effort: Quantitatively proven in both simulation and real-world experiments.
- High Compliance: The robot responds adaptively to human guidance and external forces, leading to intuitive interaction.
- Precise Coordination: Achieves low tracking errors for linear and angular velocities, and maintains object height stability.
- Proprioception-Only: Simplifies real-world deployment by eliminating external sensor requirements.
- Generalizable: Works across diverse objects, terrains, and movement patterns.
- Implicit Intention Learning: Interprets human intentions from physical interaction, removing the need for explicit commands.
- Unified Policy: Integrates leader and follower behaviors within a single policy.
Disadvantages/Observations:
- COLA-L (leader mode) generally outperforms COLA-F (follower mode), suggesting that explicit goal commands can enhance collaboration, even if they are sampled rather than directly provided by a human.
- The MLP-based student policy is found to be more effective than a Transformer, indicating that long-term temporal dependencies may be less critical than prompt adaptation for this specific task.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces COLA, a novel proprioception-only reinforcement learning approach for human-humanoid collaborative object carrying. The core innovation lies in its three-step residual learning framework, which enables the humanoid to function as both a leader and a follower in collaborative tasks. By leveraging a closed-loop training environment that explicitly models humanoid-object interactions, COLA implicitly learns object movements and human intentions from proprioceptive feedback alone. This allows for compliant collaboration, maintaining load balance through coordinated trajectory planning without requiring external sensors or complex interaction models. Extensive simulation and real-world experiments validate COLA's effectiveness, demonstrating significant human effort reduction (up to 31.47% in simulation and 27.4% in user studies), precise trajectory coordination, and robust generalization across various objects and terrains.
7.2. Limitations & Future Work
7.2.1. Limitations
The authors acknowledge the following limitations:
- Proprioception-Only: While a strength for deployment simplicity, relying solely on proprioception might limit the robot's understanding of complex human non-verbal cues or environmental context that visual or tactile sensors could provide.
- Implicit vs. Explicit Planning: The current model implicitly infers intentions and dynamics. This might not be sufficient for more complex scenarios where the humanoid needs to plan autonomously to assist humans, requiring a deeper understanding of the task goals and the human's long-term objectives.
7.2.2. Future Work
Based on these limitations, the authors suggest future research directions:
- Multi-Modal Perception: Exploring the integration of visual and tactile sensors to provide more informative cues for human-humanoid collaboration. This could enhance the robot's perception of the human's state, intentions, and the environment.
- Autonomous Planning: Enabling humanoids to plan autonomously to assist humans. This would involve higher-level reasoning capabilities beyond reactive compliance, allowing the robot to take initiative and proactively contribute to the collaborative task.
7.3. Personal Insights & Critique
7.3.1. Inspirations
The COLA paper offers several inspiring aspects:
- Elegance of the Proprioception-Only Approach: The idea that a robot can achieve complex, compliant collaboration solely through internal sensing is powerful. It highlights how rich information can be extracted from seemingly simple proprioceptive data when combined with sophisticated reinforcement learning and a carefully designed closed-loop training environment. This can significantly reduce hardware complexity and cost for real-world robotic deployments.
- Implicit Learning of Intentions: The ability to implicitly infer human intentions through physical interaction, rather than relying on explicit communication or complex intention recognition modules, is a major step towards more natural and intuitive human-robot interaction. This "learn by doing" approach in simulation provides a robust way for robots to adapt to diverse human behaviors.
- Teacher-Student Framework for Real-World Transfer: The teacher-student distillation framework is an effective strategy for bridging the gap between training in a privileged-information-rich simulation and deploying a robust, proprioception-only policy in the real world. This design pattern is highly transferable to other complex robotic tasks.
7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement
- Scalability to More Complex Human Intentions: While effective for collaborative carrying, the implicit learning of human intentions might have limitations. What if the human's intention is ambiguous, changes rapidly, or involves non-physical cues (e.g., verbal commands, gestures)? The current proprioception-only model might struggle here, reinforcing the authors' suggestion for multi-modal perception.
- Robustness to Diverse Human Biomechanics/Interaction Styles: The human user study involved 23 participants, which is a good start. However, human interaction styles, strengths, and physical characteristics vary widely. How well does the model generalize to individuals with very different interaction forces, gaits, or even disabilities? Further testing with a wider demographic could reveal limitations.
- Long-Term Carrying and Fatigue: The paper focuses on coordination and effort reduction. For very long-duration carrying tasks, human and robot fatigue becomes a factor. Does the robot adapt its compliance or effort contribution over time as the human tires? This could be an interesting area for future reward function design.
- Unexpected Disturbances: While the closed-loop environment models dynamic interactions, real-world environments are full of unexpected disturbances (e.g., uneven ground, sudden nudges from bystanders). How does the proprioception-only model handle these without additional environmental awareness?
- Safety Guarantees: For real-world human-humanoid collaboration, especially involving heavy objects, formal safety guarantees are paramount. While compliance improves safety, a reinforcement learning approach might not offer strict, verifiable safety boundaries. Future work could explore integrating formal methods or safety layers on top of the RL policy.
7.3.3. Transferability to Other Domains
The methodologies and insights from COLA are highly transferable:
- Other Collaborative Manipulation Tasks: The residual learning and teacher-student framework could be applied to other collaborative manipulation tasks where a humanoid assists a human (e.g., assembling large parts, pushing heavy doors, holding tools).
- Teleoperation with Force Feedback: The proprioception-only estimation of interaction forces could be used to provide implicit haptic feedback in teleoperation systems, enhancing the operator's sense of touch without needing physical force sensors on the robot.
- Human-Robot Co-assembly: In manufacturing, humanoids could assist in co-assembly lines, learning to provide compliant support for parts or tools, reducing strain on human workers.
- Rehabilitation Robotics: The compliant interaction capabilities could be valuable in rehabilitation scenarios, where humanoids assist patients with physical therapy exercises, adapting to their strength and movement patterns.

Overall, COLA represents a significant step towards more practical and intuitive human-humanoid collaboration, paving the way for wider adoption of humanoids in various assistive and industrial roles.