Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning
TL;DR Summary
This paper presents a reinforcement learning training pipeline to develop a unified whole-body controller for humanoid badminton, enabling coordinated footwork and striking without reliance on motion priors or expert demonstrations. The approach is validated in both simulated and real-world experiments.
Abstract
Humanoid robots have demonstrated strong capabilities for interacting with static scenes across locomotion, manipulation, and more challenging loco-manipulation tasks. Yet the real world is dynamic, and quasi-static interactions are insufficient to cope with diverse environmental conditions. As a step toward more dynamic interaction scenarios, we present a reinforcement-learning-based training pipeline that produces a unified whole-body controller for humanoid badminton, enabling coordinated lower-body footwork and upper-body striking without motion priors or expert demonstrations. Training follows a three-stage curriculum: first footwork acquisition, then precision-guided racket swing generation, and finally task-focused refinement, yielding motions in which both legs and arms serve the hitting objective. For deployment, we incorporate an Extended Kalman Filter (EKF) to estimate and predict shuttlecock trajectories for target striking. We also introduce a prediction-free variant that dispenses with EKF and explicit trajectory prediction. To validate the framework, we conduct five sets of experiments in both simulation and the real world. In simulation, two robots sustain a rally of 21 consecutive hits. Moreover, the prediction-free variant achieves successful hits with comparable performance relative to the target-known policy. In real-world tests, both prediction and controller modules exhibit high accuracy, and on-court hitting achieves an outgoing shuttle speed up to 19.1 m/s with a mean return landing distance of 4 m. These experimental results show that our proposed training scheme can deliver highly dynamic while precise goal striking in badminton, and can be adapted to more dynamics-critical domains.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning
1.2. Authors
Chenhao Liu*, Leyun Jiang†, Yibo Wang†, Kairan Yao, Jinchen Fu and Xiaoyu Ren. All authors are affiliated with Beijing Phybot Technology Co., Ltd, Beijing, China. Chenhao Liu's email is liuchenhao@phybot.cn.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, indicated by the provided link https://arxiv.org/abs/2511.11218. arXiv is a widely recognized open-access preprint server for research articles in fields such as physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it is not a peer-reviewed journal or conference in its preprint form, it serves as an important platform for early dissemination of research findings and allows for community feedback before formal publication. Given the advanced topic and detailed experimental results, it is likely intended for submission to a top-tier robotics or AI conference/journal.
1.4. Publication Year
2025 (Published at UTC: 2025-11-14T12:22:19.000Z)
1.5. Abstract
This paper introduces a reinforcement learning (RL) based training pipeline that develops a unified whole-body controller for humanoid robots to play badminton. The controller enables coordinated lower-body footwork and upper-body striking without relying on motion priors or expert demonstrations. The training follows a three-stage curriculum: first, footwork acquisition, then precision-guided racket swing generation, and finally, task-focused refinement. For deployment, an Extended Kalman Filter (EKF) is incorporated to estimate and predict shuttlecock trajectories for target striking. A prediction-free variant is also introduced, which dispenses with explicit trajectory prediction. The framework is validated through five sets of experiments in both simulation and the real world. In simulation, two robots sustained a rally of 21 consecutive hits, and the prediction-free variant achieved comparable performance. Real-world tests demonstrated high accuracy for both prediction and controller modules, with outgoing shuttle speeds up to 19.1 m/s and a mean return landing distance of 4 m. These results highlight the training scheme's ability to achieve highly dynamic and precise goal striking in badminton, suggesting its adaptability to other dynamics-critical domains.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2511.11218
PDF Link: https://arxiv.org/pdf/2511.11218v2.pdf
Publication Status: Preprint (arXiv)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the challenge of enabling humanoid robots to perform highly dynamic, contact-rich interactions with fast-moving objects within tight reaction windows. While humanoids have made significant progress in locomotion and manipulation in static or quasi-static environments, real-world scenarios often demand agile responses to dynamic elements.
This problem is particularly important because it represents a crucial stepping stone towards developing general-purpose embodied agents that can operate effectively and safely in human-centric environments. Current research often focuses on either locomotion (how a robot moves) or manipulation (how it interacts with objects), but loco-manipulation (coordinated movement and manipulation) in highly dynamic contexts remains largely unexplored.
Badminton serves as an ideal testbed for this challenge due to several factors:
- Sub-second perception-action loops: Players (and robots) must react extremely quickly to incoming shots.
- Precise timing and orientation: Hitting a shuttlecock effectively requires accurately timing the swing and orienting the racket face within a specific 3D interception volume.
- Whole-body coordination: Successful badminton play necessitates blending rapid arm swings with stable and agile leg movements (footwork).

Existing challenges in robotic racket sports, particularly when comparing badminton to table tennis, further highlight the difficulty:
- Aerodynamic uncertainty: Shuttlecock trajectories are highly unpredictable due to strong drag and a unique "flip" regime, reducing decision time despite longer flight paths.
- Large swing amplitudes: Badminton requires much larger swings than table tennis, leading to significant whole-body disturbances and balance challenges.
- Footwork-strike co-evolution: Unlike some table tennis systems where base repositioning is decoupled, in badminton, lower-body motion directly influences hitting accuracy and must be tightly integrated with striking.

The paper's entry point, or innovative idea, is to address these challenges by proposing a multi-stage reinforcement learning training pipeline that develops a unified whole-body controller for humanoid robots, specifically for badminton. This approach aims to overcome the limitations of prior work that often relied on reference motions, decoupled lower-body and upper-body control, or simplified hitting mechanics. The innovation lies in fostering footwork-strike synergy through a structured RL curriculum that allows the policy to discover energy-efficient swings and coordinated movements without explicit motion priors, leading to direct applicability on real hardware.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- First Real-World Humanoid Badminton System: The authors present what they claim to be the first real-world humanoid robot capable of playing badminton. This autonomous system uses a unified whole-body controller to return machine-served shuttles on a 21-degree-of-freedom (DoF) humanoid. It achieves strong performance, including high racket swing speeds and outgoing shuttle speeds up to 19.1 m/s under sub-second reaction times, demonstrating the feasibility of humanoids performing highly dynamic, interactive tasks.
- Stage-wise Reinforcement Learning for Footwork-Strike Coordination: A novel three-stage curriculum is introduced for RL training. This curriculum systematically develops complex loco-manipulation skills:
  - S1 (Footwork acquisition): Focuses on learning stable footwork and approaching target regions.
  - S2 (Precision-guided swing generation): Builds upon S1 by introducing precise racket swing generation, incorporating timing and progressively tightening position and orientation accuracy.
  - S3 (Task-focused refinement): Removes generic locomotion-shaping regularizers to eliminate gradient interference and maximize hitting performance, leading to more energy-efficient and task-optimal behaviors. This structured approach is crucial for achieving whole-body coordination and footwork-strike synergy.
- Prediction-Free Variant for Enhanced Robustness: The paper explores a prediction-free variant of the controller. This variant represents a more end-to-end policy that implicitly infers timing and hitting targets directly from short-horizon shuttle observations (current and historical shuttle poses), rather than relying on an explicit trajectory predictor such as an EKF. This approach aims to improve robustness to aerodynamic variability and to simplify deployment by reducing reliance on external prediction modules or aerodynamic parameters. While currently validated primarily in simulation, it offers a promising direction for future real-world applications.

The key conclusions and findings are:
- The multi-stage RL curriculum successfully trains a unified whole-body controller that can execute large-amplitude swings while maintaining balance and reacting within sub-second windows to intercept fast incoming shuttles autonomously.
- In simulation, the system demonstrates high reliability, with two robots sustaining a rally of 21 consecutive hits.
- The prediction-free variant shows hitting performance in simulation comparable to the target-known policy, suggesting its potential for more robust and streamlined real-world deployment.
- Real-world experiments confirm the sim2real transferability of the learned policy, achieving high prediction accuracy and controller module accuracy. The robot successfully returns machine-served shuttles with impressive outgoing speeds and return shot quality.
- The system develops foot-racket co-timing and recovery behaviors without explicit hand-coding; these emerge naturally from the training process.

These findings address the problem of enabling humanoids to engage in highly dynamic and interactive tasks, specifically in the context of badminton, by providing a robust and generalizable reinforcement learning framework for whole-body control.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Humanoid Robots: These are robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. They are engineered to operate in human-centric environments, requiring advanced capabilities in balance, locomotion (walking, running), and manipulation (object interaction). The humanoid in this paper, Phybot C1, is a 1.28 m tall, 30 kg robot with 21 Degrees of Freedom (DoF). DoF refers to the number of independent parameters that define the configuration or state of a mechanical system. For example, a single joint that can only rotate in one plane has 1 DoF. A humanoid with 21 DoF has 21 such controllable movements, distributed across its hips, knees, ankles, waist, shoulders, and elbows.
- Reinforcement Learning (RL): RL is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent observes the state of the environment, takes an action, and receives a reward and a new state. The goal is to learn a policy, a mapping from states to actions, that yields the largest long-term reward.
  - Agent: The learning entity (e.g., the humanoid robot).
  - Environment: The world the agent interacts with (e.g., the badminton court, shuttlecock dynamics).
  - State ($s_t$): A representation of the environment at a given time (e.g., robot's joint angles, velocities, shuttlecock position).
  - Action ($a_t$): A decision made by the agent that influences the environment (e.g., desired joint positions for the robot).
  - Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of an action (e.g., positive reward for hitting the shuttlecock accurately, negative reward for falling).
  - Policy ($\pi_\theta$): The strategy the agent uses to choose actions given states.
  - Value Function: Estimates the expected cumulative reward from a given state or state-action pair.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. The paper uses a partially observable MDP, formally a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, R)$, where:
  - $\mathcal{S}$ is the set of all possible states. In this paper, the authors mention an unobserved state $s_t$.
  - $\mathcal{O}$ is the observation space; this is what the agent actually perceives from the environment.
  - $\mathcal{A}$ is the set of possible actions.
  - $P$ is the transition probability function, specifying the probability of moving to a new state given the current state and action: $P(s_{t+1} \mid s_t, a_t)$.
  - $R$ is the reward function, defining the reward received for taking action $a_t$ in state $s_t$. The policy aims to maximize the expected discounted sum of future rewards $\sum_t \gamma^t r_t$, where $\gamma$ is the discount factor (a value between 0 and 1 that determines the importance of future rewards).
- Proximal Policy Optimization (PPO): A popular RL algorithm that belongs to the family of actor-critic methods. PPO aims to balance ease of implementation, sample efficiency, and good performance. It uses a clipped surrogate objective that constrains policy updates, preventing excessively large policy changes that could lead to instability.
  - Actor-Critic: Actor-critic methods use two neural networks: an actor network that learns the policy (how to act) and a critic network that learns the value function (how good the current state or action is).
  - Asymmetric Actor-Critic: A variant in which the critic has access to more information (e.g., privileged information such as noise-free states or future targets) than the actor during training. This helps stabilize value estimation while keeping the actor's observations restricted to what is available during deployment.
- Extended Kalman Filter (EKF): A non-linear extension of the Kalman Filter, used for state estimation in dynamic systems where the system model and/or measurement model are non-linear. The EKF linearizes the system and measurement models around the current state estimate using Taylor series expansions. It is commonly used for trajectory prediction and sensor fusion. In this paper, it is used to estimate and predict the shuttlecock's trajectory given noisy motion capture measurements.
- Motion Capture (Mocap): A technology for digitally recording the movement of objects or people. It typically uses cameras to track markers placed on the subject. In this paper, Mocap provides precise position and orientation data for the robot and shuttlecock in the real world.
- Domain Randomization: A technique used in RL to improve the sim2real transferability of policies trained in simulation. By randomizing various physical parameters of the simulated environment (e.g., friction, mass, sensor noise, latency, aerodynamic parameters), the policy becomes more robust to discrepancies between the simulation and the real world, allowing zero-shot transfer (deployment without further real-world training or system identification).
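To make the clipped-update idea behind PPO concrete, the following is a minimal, illustrative sketch of the clipped surrogate objective in Python/NumPy. It is not the paper's training code; the array names and sizes are hypothetical.

```python
import numpy as np

def ppo_clipped_surrogate(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    log_prob_new / log_prob_old: log-probabilities of the taken actions under the
    current and behavior policies; advantages: estimated advantages (e.g., via GAE).
    """
    ratio = np.exp(log_prob_new - log_prob_old)              # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()             # pessimistic bound

# Tiny usage example with made-up numbers.
rng = np.random.default_rng(0)
lp_old = rng.normal(size=64)
lp_new = lp_old + 0.05 * rng.normal(size=64)
adv = rng.normal(size=64)
print(ppo_clipped_surrogate(lp_new, lp_old, adv))
```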
3.2. Previous Works
The paper discusses several previous works in whole-body loco-manipulation and robotic racket sports, highlighting both progress and existing gaps.
3.2.1. Whole-Body Loco-manipulation
- Decoupled Control: Historically, locomotion and manipulation tasks were often solved separately to reduce complexity.
  - [13] proposes an MPC controller for end-effector tracking on a quadruped, coupled with a separate RL-based controller for locomotion.
  - [14] assigns base movement, arm swing, and hand catching to three distinct policy modules for object catching.
  - [15] advocates for decoupling upper-body and lower-body control into separate agents for more precise end-effector tracking during robot movement.
  - Critical Reflection: While decoupling simplifies control, it can lead to suboptimal performance in tasks requiring tight coordination, as the base movement might not be fully optimized to support manipulation tasks dynamically.
- Unified Control Approaches: More recent efforts aim to unify lower- and upper-body control.
  - [17, 18, 19] leverage multi-critic architectures to facilitate the optimization of coordinated behaviors. Multi-critic approaches use multiple critic networks to evaluate different aspects of the policy, potentially leading to more stable and efficient learning of complex loco-manipulation tasks.
  - [20] introduces a physical feasibility reward design to guide unified policy learning.
  - [12, 21, 22] demonstrate that unified whole-body loco-manipulation is effective for dynamic tasks like racket sports and tossing, where leg contributions are crucial for power transmission and timing, not just locomotion.
  - Critical Reflection: These works pave the way for true whole-body coordination, recognizing that in dynamic tasks, the entire robot kinematic chain contributes to the overall objective.
3.2.2. Robotic Racket Sports
Robotic racket sports are considered key benchmarks due to the tight coupling between perception, target interception, and precise end-effector control.
- Table Tennis:
  - [8] achieved human-level table tennis with a highly modular and hierarchical policy architecture combining learned and engineered components, but noted its complexity and reliance on manual tuning and real-world data.
  - [10] demonstrated humanoid table tennis using an RL-based whole-body controller coupled with a model-based planner for ball trajectory and racket target planning. However, it relied on expert demonstrations for reference motions, which can impose style constraints.
  - [11] jointly trained an RL policy with an auxiliary learnable predictor for ball trajectories and racket targets. A key distinction here is that base motion was executed by a separate command, implying lower-body footwork and upper-body striking were not jointly optimized by a single policy.
  - Critical Reflection: Humanoid table tennis is challenging, but badminton significantly escalates these difficulties due to aerodynamics, larger swing amplitudes, and the necessity for deep footwork-strike coordination. The reliance on reference motions or decoupled control in previous table tennis works limits their applicability to the more dynamic badminton scenario.
- Badminton:
  - [12] introduced a perception-informed whole-body policy for badminton on a quadrupedal platform, achieving impressive shuttlecock tracking and striking.
  - Critical Reflection: While notable, quadrupedal robots have a much larger and more stable support polygon compared to humanoids, simplifying balance control during large-amplitude swings. This makes quadrupedal badminton a less direct challenge for whole-body coordination on a humanoid. The paper notes that simulated humanoid badminton policies in [12] did not exhibit badminton-style footwork, suggesting missing training signals for lower-body coordination.
3.3. Technological Evolution
The evolution of robotics control for dynamic tasks has moved from classical control methods (e.g., PID controllers, Model Predictive Control - MPC), which often rely on precise models, to learning-based approaches such as Reinforcement Learning. Early RL applications often focused on simpler tasks or required extensive expert demonstrations and reference motions. The trend has been towards:
- Unified Control: Integrating locomotion and manipulation into a single, whole-body controller to leverage the full kinematic chain.
- Increased Autonomy: Moving away from explicit motion priors or expert demonstrations, allowing the RL agent to discover optimal behaviors through interaction with the environment.
- Robustness and Sim2Real Transfer: Employing techniques like domain randomization to bridge the simulation-to-real-world gap and enable zero-shot deployment.
- End-to-End Learning: Reducing reliance on complex, hand-tuned prediction models by integrating perception and control more tightly, as seen in the prediction-free variant.

This paper's work fits squarely within this evolution by pushing the boundaries of unified whole-body control for humanoids in a highly dynamic, contact-rich task (badminton) without motion priors. It builds on RL advancements in loco-manipulation but specifically addresses the unique challenges of humanoid balance and footwork-strike synergy that previous table tennis or quadrupedal badminton systems did not fully tackle.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several core differences and innovations:
- Unified Whole-Body Controller without Motion Priors: Unlike [10], which used expert demonstrations for reference motions in table tennis, this paper's RL policy discovers energy-efficient swings and whole-body coordination purely through reinforcement learning. This simplifies implementation and avoids imposing potentially suboptimal style constraints.
- Deep Footwork-Strike Synergy: Many table tennis systems ([10, 11]) use separate commands for base motion or decouple lower-body and upper-body control. This paper explicitly fosters footwork-strike synergy on humanoids through its multi-stage curriculum, where lower-body motions are intrinsically linked to hitting accuracy and balance in a unified policy. This is critical for badminton's large-amplitude strokes.
- 3D Orientation-Aware Striking: While some table tennis works used virtual hit planes to simplify striking to a 2D problem, badminton inherently demands orientation-aware contacts throughout a 3D space. The proposed system addresses this directly.
- Humanoid-Specific Challenges: It specifically tackles the greater balance challenges of humanoids (narrow support polygon, high center of mass) during large-amplitude strokes, differentiating it from quadrupedal badminton ([12]), which has inherently simpler balance control.
- Multi-Stage RL Curriculum: The novel three-stage curriculum provides a structured way to progressively learn complex whole-body skills, starting from footwork, then swing precision, and finally task-focused refinement. This curriculum design is key to managing the credit assignment problem in complex RL tasks.
- Prediction-Free Variant: The introduction of a prediction-free variant that infers hitting targets implicitly from shuttle history is a significant step towards end-to-end control and greater robustness to aerodynamic variability, potentially simplifying deployment by removing explicit reliance on model-based predictors.

In essence, the paper provides a more integrated, autonomous, and physically challenging solution for robotic racket sports on humanoid platforms, specifically tailored to the dynamic demands of badminton.
4. Methodology
The core methodology revolves around training a unified whole-body controller for a humanoid robot to play badminton using a multi-stage reinforcement learning (RL) curriculum. This approach allows the robot to learn coordinated footwork and striking behaviors without relying on motion priors or expert demonstrations.
4.1. Principles
The core idea is to treat the complex task of playing badminton as a partially observable Markov Decision Process (MDP) and learn a parametric policy using reinforcement learning. The theoretical basis is that by carefully designing the observation space, action space, reward function, and a multi-stage training curriculum, the RL agent (the humanoid robot) can discover optimal whole-body coordination for dynamic interaction tasks. The intuition is to gradually introduce complexity, first teaching basic locomotion and target approach, then refining striking precision and swing dynamics, and finally optimizing for the task-specific objective while maintaining robustness.
For deployment, the system uses an Extended Kalman Filter (EKF) to predict shuttlecock trajectories, providing the RL policy with the necessary target information. A prediction-free variant is also explored, where the policy directly infers hitting targets from historical shuttlecock observations, aiming for greater end-to-end autonomy and robustness to aerodynamic uncertainty.
4.2. Core Methodology In-depth (Layer by Layer)
The system consists of several key components: the RL-based Dynamic Whole-Body Controller (with its multi-stage RL and multi-stage reward design), Model-based Hitting Target Generation and Prediction, and a Prediction-free variant.
4.2.1. RL-based Dynamic Whole-Body Controller
The problem is formulated as a partially observable Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, R)$.
- At time $t$, the (unobserved) state is $s_t \in \mathcal{S}$.
- The agent receives an observation $o_t \in \mathcal{O}$.
- Applies an action $a_t \in \mathcal{A}$.
- Transitions with probability $P(s_{t+1} \mid s_t, a_t)$.
- Accrues reward $r_t = R(s_t, a_t)$.

The goal is to learn a parametric policy $\pi_\theta(a_t \mid o_t)$ that maximizes the expected cumulative discounted reward: $ J(\theta) = \mathbb{E} \left[ \sum_t \gamma^t r_t \right] $ where $\gamma$ is the discount factor.
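As a concrete illustration of this objective, the following is a minimal sketch of the discounted return for a single episode; the reward values in the example are made up and not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of sum_t gamma^t * r_t for one episode."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last step backwards
        g = r + gamma * g
    return g

# Hypothetical per-step rewards around a single hit instant.
print(discounted_return([0.0, 0.0, 1.0, 0.5]))
```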
4.2.1.1. Observation Space
The actor's basic observation is an 87-dimensional vector. This vector combines:
- Proprioception information: data from the robot's own body sensors (e.g., joint positions, joint velocities, IMU data for base orientation and angular velocity).
- External sensing from Mocap: data about the robot's base position, orientation, and linear velocity obtained from the motion capture system.

To mitigate partial observability and provide more context, a history of past observations is stacked:
- Short/long history: 5 frames plus 20 frames of all joint positions, velocities, and actions, adding another 1470 features. This historical data helps the policy infer trends and compensate for noisy or incomplete instantaneous observations.

The critic's observation space is 98-dimensional. It includes all of the actor's observations plus privileged information:
- Noise-free base and joint states: ideal sensor readings without measurement noise, available in simulation for training but not in deployment.
- Racket speed: may be difficult to acquire accurately in the real world.
- Preemptive knowledge: information used to make the MDP well-posed for accurate value learning, such as the two subsequent hitting times, the next target position and orientation, and the number of remaining targets.

All quantities are expressed in the world frame.
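The sketch below illustrates how such observations could be assembled; the exact layout, ordering, and history padding are not specified in the paper, so the structure here is a hypothetical example of the stated ingredients (proprioception, Mocap base state, stacked joint/action history, and privileged critic-only terms).

```python
import numpy as np
from collections import deque

NUM_DOF = 21

class ObservationBuilder:
    """Illustrative actor/critic observation assembly (layout is hypothetical)."""

    def __init__(self, short_len=5, long_len=20):
        # A single rolling buffer standing in for the short + long history stacks.
        self.history = deque(maxlen=short_len + long_len)

    def push(self, q, dq, action):
        # Each history frame stores joint positions, joint velocities, and actions.
        self.history.append(np.concatenate([q, dq, action]))

    def actor_obs(self, proprio, base_state_mocap):
        frames = list(self.history)
        hist = np.concatenate(frames) if frames else np.zeros(0)
        return np.concatenate([proprio, base_state_mocap, hist])

    def critic_obs(self, actor_obs, privileged):
        # privileged: noise-free states, racket speed, next target pose/time, etc.
        return np.concatenate([actor_obs, privileged])
```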
4.2.1.2. Action Space
The policy outputs 21 joint position targets, one per DoF (degree of freedom). These targets are scaled by a unified action scale of 0.25. A low-level PD controller operating at 500 Hz tracks the desired joint positions, while policy inference runs at 50 Hz.
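A minimal sketch of the action-to-torque mapping at the low level follows, assuming the common convention that the scaled action offsets a default joint pose (a detail the paper does not spell out); the policy would refresh `policy_action` every 0.02 s while the PD law runs every 0.002 s.

```python
import numpy as np

POLICY_DT = 0.02      # 50 Hz policy inference
CONTROL_DT = 0.002    # 500 Hz low-level PD loop
ACTION_SCALE = 0.25   # unified action scale reported in the paper

def pd_torque(q_des, q, dq, kp, kd):
    """tau = Kp * (q_des - q) - Kd * dq, the usual position-tracking PD law."""
    return kp * (q_des - q) - kd * dq

def control_step(policy_action, q_default, q, dq, kp, kd):
    """Map a raw policy action to desired joint positions and PD torques.

    The offset-around-default-pose convention is an assumption for illustration.
    kp/kd would be per-joint gains such as those listed in Table I.
    """
    q_des = q_default + ACTION_SCALE * policy_action
    return pd_torque(q_des, q, dq, kp, kd)
```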
4.2.1.3. Episode Settings
Each episode (a single training run until termination or completion) contains six swing targets. This design encourages follow-through behavior and recovery/preparation between hits.
- Asymmetric Actor-Critic (PPO variant): Because the number of remaining targets within an episode varies (which influences the value function), an asymmetric actor-critic architecture is adopted [26]. The critic receives additional preemptive information (e.g., noise-free states, future targets) to stabilize value estimation, while the actor (which needs to be deployable) only uses observations available in the real world.
- Specific critic info: the two subsequent hitting times, the next target position and orientation, and the number of remaining targets.
4.2.1.4. Training Settings
- Algorithm: PPO [27] is used.
- Hardware: a single Nvidia RTX 4090.
- Parallelism: 4096 parallel environments in IsaacGym [28]. IsaacGym is a physics simulation platform optimized for GPU-based parallel reinforcement learning.
- Network Architecture: Both the actor and the critic are Multi-Layer Perceptrons (MLPs) with hidden layers of sizes (512, 256, 128) and ELU (Exponential Linear Unit) activations. ELU is an activation function similar to ReLU but allows negative values, which can help mitigate the vanishing gradient problem.
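For illustration, a PyTorch sketch of actor and critic MLPs with the reported hidden sizes and ELU activations; the input dimensions shown simply combine the stated base observation sizes with the stacked history and are an assumption about the exact concatenation.

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=(512, 256, 128)):
    """MLP with ELU activations, matching the reported hidden layer sizes."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ELU()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# Assumed input sizes: base observation plus stacked history features.
actor = make_mlp(in_dim=87 + 1470, out_dim=21)   # 21 joint position targets
critic = make_mlp(in_dim=98 + 1470, out_dim=1)   # scalar value estimate
```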
4.2.1.5. Three-stage curriculum
The training process is structured into three stages, with identical observation and action spaces across all stages. Only reward terms and their weights change between stages. Training in each stage continues from the checkpoint of the previous stage once its primary objective has converged. All training is carried out on a single RTX 4090.
- S1 — Footwork acquisition toward sampled hit regions:
  - Objective: Learn to approach target regions with a reasonable lower-limb gait and maintain a stable, forward-facing posture while traversing between six sampled hit locations within an episode. This stage primarily focuses on locomotion and base repositioning.
- S2 — Precision-guided swing generation:
  - Objective: Building on S1, introduce a sparse hit reward (active only at the scheduled hit instant) to enforce timing. Initially, loose pose accuracy is allowed to facilitate the emergence of a full swing; as training progresses, position and orientation accuracy are gradually tightened. Light swing-style regularization is added for human-like kinematics, and energy, torque, and collision constraints are strengthened for efficient, stable hits.
- S3 — Task-focused refinement:
  - Objective: Remove numerous locomotion-shaping regularizers (e.g., foot distance, step/contact symmetry, step height) and the target approach reward from S1 to avoid gradient interference with the primary hitting objective. Safety, energy, and hardware-limit constraint terms are retained. Domain randomization and observation noise are enabled to consolidate robustness for sim2real transfer.

The ablation study shows the necessity of this multi-stage approach: skipping S1 causes divergence, jumping from S1 to S3 makes the curriculum gap too large and leads to failure, and S3 significantly improves performance and robustness over S1+S2.
4.2.1.6. Multi-Stage Reward Design
A modular reward function is designed, comprising a locomotion-style term, an arm hitting term, and a global regularization term.

- S1 — Footwork acquisition:
  - Main Shaping: Encourages reaching the target region without requiring a precise approach.
    - Target approach reward: $ r_{\mathrm{track}} = \exp \bigl( - 4 \max ( d - 0.3, 0 ) \bigr) $ where $d$ is the 2D Euclidean distance between the projected end-effector target position and the robot's base position. The reward decays exponentially once the distance exceeds 0.3 m, encouraging the robot to get close to the target area.
  - Standard Style Shaping:
    - Base height and orientation: encourages a stable, upright posture.
    - Acceleration regularization: penalizes jerky movements.
    - Contact-aware footstep terms: air time (penalizes feet staying off the ground too long), touchdown speed (penalizes hard landings), step height (encourages natural step height), foot posture (maintaining proper foot orientation), and no-double-flight (prevents both feet from being in the air simultaneously for too long).
    - Simple gait symmetry shaping: encourages balanced left/right leg movements.
    - Face alignment: orients the robot towards the incoming shuttle.
  - Safety Constraints: action rate, joint position/velocity/acceleration, torque, and energy penalties, following legged_gym conventions [29].
  - Convergence: S1 typically converges within 1k iterations, yielding reliable target-region tracking with a natural gait.

- S2 — Precision-guided swing generation:
  - Reward Changes: Lower the weight of regional tracking and introduce a hit-instant reward with a large weight (activated six times per episode).
  - Hit Reward: Comprises two terms, hitting precision and racket swinging speed. Unlike [12], which separates position and orientation, this paper combines them.
    - Define the target racket normal $n^*$ as the $z$-axis of the target end-effector orientation.
    - Let $v_{\mathrm{ee}}$ be the end-effector linear velocity; the effective speed component in the direction of the target is $v_{\mathrm{ee}} \cdot n^*$.
    - Position error $e_{\mathrm{ee,pos}}$: the Euclidean distance between the current and target end-effector positions.
    - Orientation error $e_{\mathrm{ee,ori}}$: the angle between the current and target racket normals.
    - The hit reward, active only at the scheduled hit instant, is (an illustrative implementation is sketched after this list): $ r_{\mathrm{swing}} = \exp \Big( -\frac{e_{\mathrm{ee,pos}}^2}{\sigma_p} \Big) \exp \Big( -\frac{e_{\mathrm{ee,ori}}}{\sigma_r} \Big) + 0.3\, r_v, \quad r_v = 1 - \exp \Big( -\frac{\max(0,\, v_{\mathrm{ee}} \cdot n^*)}{\sigma_v} \Big), $ where $\sigma_p$ is the position tolerance, $\sigma_r$ the orientation tolerance, and $\sigma_v$ the racket speed tolerance.
  - Scheduled Tightening: Initial tolerances are wide to allow a full swing to emerge, then gradually tightened as training progresses.
  - Racket Speed Sigma: Kept fixed, balancing swing speed against accuracy and deployment stability.
  - Swing-Style Regularization:
    - Racket y-axis alignment: $ r_{y\text{-}\mathrm{align}} = \big( \hat{\mathbf{y}}_{\mathrm{ee}}^\top \hat{\mathbf{y}}_{\mathrm{world}} \big)^2, \quad \hat{\mathbf{y}}_{\mathrm{ee}} = R(q_{\mathrm{ee}})\, \mathbf{e}_y, $ where $\hat{\mathbf{y}}_{\mathrm{ee}}$ is the racket's local y-axis expressed in the world frame and $\hat{\mathbf{y}}_{\mathrm{world}}$ is the world's y-axis. This encourages a human-like backswing and a forward swing along the reverse of the backswing for a complete kinetic chain.
    - Default holding pose: $ r_{\mathrm{hold}} = - \sum_{j \in \mathcal{A}_{\mathrm{arm}}} \big( q_j - q_j^{\mathrm{hold}} \big)^2, $ where $\mathcal{A}_{\mathrm{arm}}$ is the set of arm joint indices. This improves deploy-time stability and recoverability when no shuttle is launched.
  - Additional Penalties: collision penalties are added, and energy and torque costs are strengthened.
  - Progression: The policy first learns to bring the racket near the target, then develops an early backswing, and finally a smooth backswing-swing-recovery with peak velocity near the scheduled hit instant.

- S3 — Task-focused refinement:
  - Reward Changes: Removes the target approach reward and many gait-shaping terms (e.g., foot distance, contact and step symmetry, step height) to prevent gradient conflict with the hitting objective.
  - Retained Terms: Global regularization terms and the hit rewards are kept.
  - Robustness: Domain randomization and observation noise are enabled to consolidate robustness.
  - Outcome: Hitting metrics improve by 3-5%, while energy and torque costs decrease by approximately 20%, indicating more task-optimal and efficient behavior.

- Global Regularization (all stages): Includes action rate penalties, joint position/velocity/acceleration limits, and torque limits, following legged_gym practices.
4.2.2. Model-based Hitting Target Generation and Prediction
4.2.2.1. Generate Shuttlecock Trajectory for Training
A physics-based simulation approach is used to generate badminton flight trajectories, following the dynamics model from [30].
- The shuttlecock's flight state is updated using the equation: $ m \frac{d\mathbf{v}}{dt} = m\mathbf{g} - m \frac{\| \mathbf{v} \| \mathbf{v}}{L} $ where:
  - $m$: shuttlecock mass.
  - $\mathbf{v}$: shuttlecock velocity.
  - $\mathbf{g}$: gravitational acceleration.
  - $L$: aerodynamic characteristic length, defined in terms of the air density $\rho$, the cross-sectional area of the shuttlecock $S$, and the drag coefficient $C_D$.
  - The computed aerodynamic length used in this work is 3.4.
- Trajectory Filtering: Generated trajectories must meet specific conditions:
  - Hitting zone: interception points must lie within a predefined box in $x$, $y$, and $z$ (in meters); the $y$-direction range is asymmetric due to the right-hand racket setting.
  - Minimum traversal time: 0.8 seconds. This filtering ensures realistic and relevant training data.
- Training Dataset: Selected trajectories are combined with the corresponding interception-point data, including position, orientation (trajectory tangential line as racket normal), and timing features. These are stored as tensors for training.
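A minimal sketch of generating one trajectory with this drag model by explicit Euler integration; the step size and stopping criteria are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])
L_AERO = 3.4  # aerodynamic characteristic length reported in the paper

def simulate_shuttle(p0, v0, dt=0.002, t_max=3.0):
    """Forward-integrate m*dv/dt = m*g - m*|v|*v / L (mass cancels out)."""
    p, v = np.array(p0, float), np.array(v0, float)
    t, ts, ps = 0.0, [0.0], [p.copy()]
    while t < t_max and p[2] > 0.0:
        a = G - np.linalg.norm(v) * v / L_AERO
        v = v + a * dt
        p = p + v * dt
        t += dt
        ts.append(t)
        ps.append(p.copy())
    return np.array(ts), np.array(ps)

# Example serve with plausible initial conditions.
times, positions = simulate_shuttle(p0=[6.5, 0.0, 1.5], v0=[-18.0, 0.0, 12.0])
```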
4.2.2.2. Shuttlecock Trajectory Prediction for Deployment
In deployment, an EKF-based trajectory prediction algorithm provides real-time shuttlecock flight estimation and hitting point prediction.
- EKF Implementation: Adheres to the same badminton dynamics model as used for trajectory generation.
- Phases: Operates through a measurement update (incorporating new sensor data) and a prediction step (forecasting the state).
- Hitting Target Prediction: When the predicted height first falls into a predefined interception box, the spatial coordinates and time of traversal are designated as the predicted hitting target, i.e., the target end-effector position and the target hitting time.
- Racket Orientation: The corresponding predicted velocity vector at the interception point is converted into a quaternion representation to guide the robot's end-effector orientation.
- Activation and Update: Prediction is activated once sufficient trajectory history has been collected, and it continuously updates the hitting target in a rolling manner, feeding it to the controller.
4.2.3. Prediction-free variant
This variant modifies only the observation space of the policy; the reward function, action space, and three-stage training schedule remain identical.
- Actor Observation: The actor no longer receives the commanded hitting position, orientation, and time. Instead, it receives a sliding window of world-frame shuttle positions: the current frame and the previous five frames, sampled at 50 Hz (i.e., roughly a 100 ms window).
- Implicit Inference: From this short history, the actor must implicitly infer the intended interception pose and timing.
- Critic Access: The critic retains privileged access to the "actual" interception point, which helps stabilize value learning.
- Training Data Generation: For each commanded target, the entire shuttle trajectory (integrated by the forward badminton dynamics) is stored, not just the single interception point. This allows the reward and the critic to know the ground-truth hit specification.
- Advantages:
  - Deployment no longer relies on a hand-tuned predictor (e.g., an EKF plus a parametric aerodynamic model).
  - The controller conditions directly on measured shuttle positions, making the overall pipeline more end-to-end.
  - Explicit dependence on the predictor and on aerodynamic parameters is reduced.
- Domain Randomization for Robustness: During training, aerodynamic parameters are randomized per shot, exposing the policy to an "aerodynamic patch" whose exact coefficients are unknown. This mimics how humans infer landing tendencies from a brief flight segment.
- Current Status: Provides initial simulation evidence of comparable performance, but real-robot validation is left for future work.
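A small sketch of the sliding-window shuttle observation used by the prediction-free actor; the padding behavior before enough frames are available is an assumption for illustration.

```python
import numpy as np
from collections import deque

class ShuttleWindow:
    """Six most recent world-frame shuttle positions at 50 Hz (current + 5 past)."""

    def __init__(self, n_frames=6):
        self.buf = deque(maxlen=n_frames)

    def push(self, p_shuttle_world):
        self.buf.append(np.asarray(p_shuttle_world, float))

    def as_obs(self):
        frames = list(self.buf)
        while len(frames) < self.buf.maxlen:
            # Pad with the oldest available frame (or zeros) before enough history exists.
            frames.insert(0, frames[0] if frames else np.zeros(3))
        return np.concatenate(frames)   # 18-dimensional observation slice
```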
The following are the results from Table I of the original paper:
| Joint | Stiffness (N·m/rad) | Damping (N·m·s/rad) |
| --- | --- | --- |
| hip_pitch | 100 | 10 |
| hip_roll | 100 | 10 |
| hip_yaw | 100 | 10 |
| knee | 50 | 10 |
| ankle_pitch | 50 | 5 |
| ankle_roll | 5 | 5 |
| waist_yaw | 100 | 10 |
| shoulder_pitch | 50 | 10 |
| shoulder_roll | 50 | 10 |
| shoulder_yaw | 5 | 5 |
| elbow_pitch | 50 | 10 |
Table I: Per-joint PD gains. This table lists the proportional (stiffness, in N·m/rad) and derivative (damping, in N·m·s/rad) gains for each joint (DoF) of the humanoid robot. These gains are used by the low-level PD controller to track the joint position targets output by the RL policy. Different joints have different stiffness and damping values, reflecting their roles and dynamic requirements (e.g., hip and waist joints have higher stiffness for base stability, while ankle joints might have lower values to allow more compliance).
The following are the results from Table II of the original paper:
| Actuator model (PhyArc series) | 47 | 68 | 78 | 102 |
| Size | 47×68 | 68×75 | 78×79 | 102×54.8 |
| Reducer type | Cycloid | Cycloid | Cycloid | Cycloid |
| Reduction ratio | 25 | 25 | 25 | 25 |
| Rated speed (RPM) | 100 | 100 | 100 | 100 |
| Rated torque (N·m) | 3 | 28 | 40 | 60 |
| No-load speed (RPM) | 343.8 | 181.5 | 120 | 124.2 |
| Rated power (W) | 216 | 720 | 720 | 1080 |
| Peak power (W) | 864 | 4608 | 4608 | 5760 |
| Peak torque (N·m) | 12 | 96 | 136 | 244 |
| Rotor inertia | 0.00719 | 0.0339 | 0.0634 | 0.1734 |
Table II: Actuator constants. This table summarizes the specifications of the four PhyArc actuator modules used in the robot. It includes physical size, cycloidal reducer type, reduction ratio, rated and no-load speeds (RPM), rated and peak torque (N·m), rated and peak power (W), and rotor inertia. These constants are crucial for setting torque/velocity limits, defining DoF properties, estimating power consumption, and performing safety checks during experiments, especially in simulation for realistic dynamics. Cycloidal reducers are known for high precision, low backlash, and high torque density.
5. Experimental Setup
The framework is validated through five sets of experiments: two in simulation and three in the real world.
5.1. Datasets
The training data for the RL policy is generated through a physics-based simulation of badminton flight trajectories.
- Source: A physics-based simulation approach following the badminton dynamics model in [30].
- Scale: 2 million raw trajectories were generated, from which 196,940 met specific criteria and were selected for robot training.
- Characteristics and Domain:
  - Initial Conditions: The initial positions and velocities for trajectory generation are randomly sampled within specific ranges to ensure diversity in the training data, as detailed in [12]: $ \begin{array}{lll} p_{x,t_0} \sim U(5, 8), & p_{y,t_0} \sim U(-2, 2), & p_{z,t_0} \sim U(-0.5, 2.5), \\ v_{x,t_0} \sim U(-25, -13), & v_{y,t_0} \sim U(-3, 3), & v_{z,t_0} \sim U(9, 18), \end{array} $ where:
    - $p_{x,t_0}, p_{y,t_0}, p_{z,t_0}$: initial position coordinates (x, y, z) of the shuttlecock at time $t_0$.
    - $v_{x,t_0}, v_{y,t_0}, v_{z,t_0}$: initial velocity components (x, y, z) of the shuttlecock at time $t_0$.
    - $U(a, b)$: a uniform distribution over the interval $[a, b]$.
  - Filtering Criteria: Trajectories are filtered so that interception points fall within a specific hitting zone in $x$, $y$, and $z$; a minimum traversal time of 0.8 seconds is also required. The asymmetric $y$-range is due to the right-hand racket setting. (A small sampling-and-filtering sketch follows this list.)
  - Output Data: The selected trajectories are combined with the corresponding interception-point data (position, orientation as the trajectory tangential line, and timing features) to form the training dataset.
  - Distribution: The majority of selected trajectories reach the hitting zone within a time interval of [0.8, 1.4] seconds.
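As referenced above, the sampling-and-filtering loop can be sketched as follows; the hitting-zone bounds are passed in as placeholders (the exact box is not reproduced here), and the trajectory integrator is assumed to be the drag model of Section 4.2.2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_conditions():
    """Draw a serve from the uniform ranges reported for the training set."""
    p0 = np.array([rng.uniform(5, 8), rng.uniform(-2, 2), rng.uniform(-0.5, 2.5)])
    v0 = np.array([rng.uniform(-25, -13), rng.uniform(-3, 3), rng.uniform(9, 18)])
    return p0, v0

def keep_trajectory(times, positions, zone_x, zone_y, zone_z, min_time=0.8):
    """Keep a trajectory if it first enters the hitting zone no earlier than min_time."""
    for t, p in zip(times, positions):
        in_zone = (zone_x[0] <= p[0] <= zone_x[1] and
                   zone_y[0] <= p[1] <= zone_y[1] and
                   zone_z[0] <= p[2] <= zone_z[1])
        if in_zone:
            return t >= min_time
    return False
```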
The following are the results from Figure 8 of the original paper:
This image is a 3D illustration of generated shuttlecock flight trajectories. Lines of different colors represent different trajectories, and the bounding box marks the range of interception points used for robot training.
Figure 8: Trajectory generation. Shuttlecock trajectories are filtered to ensure interception points within a bounded region for robot training. This figure visualizes the range of generated shuttlecock trajectories in 3D space, highlighting the specific hitting zone (a bounding box) from which trajectories are selected for training. Different colored lines represent individual flight paths, demonstrating the diversity and coverage of the generated data.
-
The following are the results from Figure 9 of the original paper:

This image is a 3D plot showing an example shuttlecock flight trajectory (gold) and its interception point (red). The annotations give the interception position (0.188, -0.701, 1.541) m, the time of interception (0.963 s), and the orientation as a quaternion.
Figure 9: Individual trajectory analysis. An example of a sampled shuttle flight trajectory (gold) with the selected interception point (red). The corresponding target frame at the intercept is drawn, with one axis aligned with the incoming-flight direction. The annotation reports the intercept position, orientation, and time-to-intercept. This figure provides a detailed view of a single simulated shuttlecock trajectory, showing its path, the designated interception point (red sphere), and the coordinate frame at that point. It also lists the numerical values for intercept position, orientation (as a quaternion), and time-to-intercept, which are critical components of the target tuple used for training.
The following are the results from Figure 10 of the original paper:

This image is a chart showing the distribution of shuttlecock crossing times for the interception height band (z ∈ [1.5, 1.6]). Frequency is shown as blue bars and probability as an orange curve, indicating how many trajectories fall in each time interval.
Figure 10: Training trajectory statistics. Distribution of the shuttlecock interception time. This histogram shows the distribution of shuttlecock interception times for the 196,940 trajectories selected for training. The x-axis represents the interception time in seconds, and the y-axis shows the frequency (blue bars) and probability (orange curve). This illustrates that most trajectories fall within a range suitable for the robot's reaction capabilities, primarily between 0.8 and 1.4 seconds.
- Rationale for Dataset Choice: These synthetic datasets are crucial because real-world collection of diverse shuttlecock trajectories with precise ground-truth interception points for RL training is prohibitively difficult and time-consuming. The physics-based simulation allows for generating a vast and varied dataset covering the operational space of the robot, ensuring that the policy is exposed to a wide range of scenarios during training. The filtering process ensures the generated data is relevant and feasible for the robot's capabilities.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the performance of the system:
- Rally Length:
  - Conceptual Definition: Measures the number of consecutive successful returns between two robots in simulation. It quantifies the system's ability to maintain a continuous game-play sequence, demonstrating robustness, recovery, and high-quality returns.
  - Mathematical Formula: Not explicitly provided; it is a count of successful hits.
  - Symbol Explanation: the count of successfully returned shuttlecocks before an error (fall, out-of-bounds, net, missed hit).
- Position Error at Impact ($e_{\mathrm{ee,pos}}$):
  - Conceptual Definition: The Euclidean distance between the actual point of contact (racket center) and the designated target interception position at the moment of impact. It quantifies the accuracy of the robot's spatial positioning for hitting.
  - Mathematical Formula: $ e_{\mathrm{ee,pos}} = \| p_{\mathrm{ee}} - p_{\mathrm{ee}}^* \| $
  - Symbol Explanation:
    - $p_{\mathrm{ee}}$: the actual 3D position of the end-effector (racket center) at impact.
    - $p_{\mathrm{ee}}^*$: the target 3D position for the end-effector at impact.
    - $\| \cdot \|$: Euclidean norm (distance).
- Orientation Error at Impact ($e_{\mathrm{ee,ori}}$):
  - Conceptual Definition: The angular difference between the actual orientation of the racket face and the target orientation of the racket face at the moment of impact. It quantifies the accuracy of the robot's racket angle for hitting the shuttlecock in the desired direction.
  - Mathematical Formula: $ e_{\mathrm{ee,ori}} = \mathrm{ang\_err}_z (q_{\mathrm{ee}}, q_{\mathrm{ee}}^*) $, where $\mathrm{ang\_err}_z$ is the angle between the current and target racket normals.
  - Symbol Explanation:
    - $q_{\mathrm{ee}}$: the actual orientation (as a quaternion) of the end-effector (racket).
    - $q_{\mathrm{ee}}^*$: the target orientation (as a quaternion) of the end-effector (racket).
    - $\mathrm{ang\_err}_z$: a function that computes the angle between the z-axes of the two input quaternions, representing the normal direction of the racket face.
- Executed Swing Speed ($v_{\mathrm{swing}}$):
  - Conceptual Definition: The linear speed of the racket at the moment of impact with the shuttlecock. It indicates the power and dynamics of the robot's swing. The effective component in the target direction is used for the reward, but the total speed is often reported.
  - Mathematical Formula: $ v_{\mathrm{swing}} = \| v_{\mathrm{ee}} \| $
  - Symbol Explanation:
    - $v_{\mathrm{ee}}$: the linear velocity vector of the end-effector (racket center) at impact.
    - $\| \cdot \|$: Euclidean norm (magnitude of the velocity vector).
- EKF Prediction Error (Position & Time):
  - Conceptual Definition: Measures the accuracy of the Extended Kalman Filter in predicting the shuttlecock's future position and time of arrival at the interception point. This is crucial for the controller to plan its actions.
  - Mathematical Formula (Position Error): $ e_{\mathrm{pred,pos}} = \| p_{\mathrm{sh,pred}} - p_{\mathrm{sh,true}} \| $
  - Mathematical Formula (Time Error): $ e_{\mathrm{pred,time}} = | t_{\mathrm{sh,pred}} - t_{\mathrm{sh,true}} | $
  - Symbol Explanation:
    - $p_{\mathrm{sh,pred}}$: the predicted 3D position of the shuttlecock at interception.
    - $p_{\mathrm{sh,true}}$: the ground-truth 3D position of the shuttlecock at interception.
    - $t_{\mathrm{sh,pred}}$: the predicted time of shuttlecock interception.
    - $t_{\mathrm{sh,true}}$: the ground-truth time of shuttlecock interception.
    - $\| \cdot \|$: Euclidean norm; $| \cdot |$: absolute value.
- Outgoing Shuttle Speed:
  - Conceptual Definition: The speed of the shuttlecock immediately after being hit by the robot. This indicates the power and effectiveness of the robot's return shot.
  - Mathematical Formula: The paper uses an elastic interaction model to compute the outgoing velocity (a sketch of this computation follows this list): $ v_{\mathrm{racket},n} = (v_{\mathrm{racket}} \cdot n_{\mathrm{racket}})\, n_{\mathrm{racket}}, \quad v_{\mathrm{shuttle},n} = (v_{\mathrm{incoming}} \cdot n_{\mathrm{racket}})\, n_{\mathrm{racket}}, \quad v_{\mathrm{out}} = v_{\mathrm{incoming}} - 2 v_{\mathrm{shuttle},n} + 2 v_{\mathrm{racket},n}. $ The outgoing shuttle speed is then $\| v_{\mathrm{out}} \|$.
  - Symbol Explanation:
    - $v_{\mathrm{racket}}$: velocity of the racket at impact.
    - $n_{\mathrm{racket}}$: normal vector of the racket face at impact.
    - $v_{\mathrm{incoming}}$: incoming velocity of the shuttlecock.
    - $v_{\mathrm{racket},n}$: component of the racket velocity normal to the racket face.
    - $v_{\mathrm{shuttle},n}$: component of the incoming shuttle velocity normal to the racket face.
    - $v_{\mathrm{out}}$: outgoing shuttle velocity vector.
- Mean Return Landing Distance:
  - Conceptual Definition: The average distance from the interception area at which the robot's returned shots land. This metric assesses the quality and depth of the return shots.
  - Mathematical Formula: Not explicitly provided; it implies measuring the landing spot of returned shuttles relative to a reference point on the court.
  - Symbol Explanation: an average distance in meters.
- Hit Success Rate:
  - Conceptual Definition: The percentage of shuttlecocks successfully returned by the robot. A successful hit is defined by a position error below a fixed threshold and an orientation error below 0.2 rad.
  - Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Hits}}{\text{Total Number of Attempts}} \times 100\% $
  - Symbol Explanation: a percentage value.
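As referenced in the outgoing-speed metric, the elastic interaction model can be transcribed directly as follows; the numbers in the usage example are made up.

```python
import numpy as np

def outgoing_velocity(v_incoming, v_racket, n_racket):
    """Elastic reflection model: mirror the normal component of the incoming
    velocity and add twice the racket's normal velocity component."""
    n = n_racket / np.linalg.norm(n_racket)
    v_racket_n = np.dot(v_racket, n) * n
    v_shuttle_n = np.dot(v_incoming, n) * n
    return v_incoming - 2.0 * v_shuttle_n + 2.0 * v_racket_n

v_out = outgoing_velocity(v_incoming=np.array([-15.0, 0.0, -3.0]),
                          v_racket=np.array([8.0, 0.0, 4.0]),
                          n_racket=np.array([1.0, 0.0, 0.5]))
outgoing_speed = np.linalg.norm(v_out)   # the reported outgoing shuttle speed
```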
5.3. Baselines
The paper primarily focuses on presenting its novel multi-stage RL training pipeline and does not conduct direct comparisons against external baseline methods from other research groups for the overall badminton task. Instead, it uses internal comparisons and ablations:
- Target-Known Policy vs. Prediction-Free Policy (Internal Comparison): This is a key comparison within the paper's own framework.
  - Target-Known Policy: the main policy, which receives the explicit planned interception position, orientation, and time from the EKF.
  - Prediction-Free Policy: an alternative variant that learns to infer the impact pose and timing solely from short-history shuttle positions.
  - Representativeness: This comparison evaluates the effectiveness and robustness of the end-to-end learning approach versus a modular perception-control pipeline.
- Ablation Studies on Curriculum Stages (Internal Ablation):
  - The paper implicitly compares the full three-stage curriculum (S1+S2+S3) against ablated versions (e.g., S1+S2, skipping S1, skipping S2).
  - Representativeness: This ablation is critical for validating the design choices of the multi-stage curriculum and demonstrating the necessity of each stage for reliable convergence and optimal performance.

While there are references to other works such as humanoid table tennis ([10, 11]) or quadrupedal badminton ([12]), the paper highlights how its humanoid badminton task differs significantly in complexity and whole-body coordination requirements, making direct quantitative comparison difficult given the absence of other published real-world humanoid badminton systems. Therefore, the core baseline for this work is the performance of its own Target-Known policy and the incremental improvements achieved through its multi-stage curriculum and the Prediction-Free variant.
The following are the results from Table III of the original paper:
| Hyperparameter | Setting |
| discount factor | 0.99 |
| GAE lambda | 0.95 |
| learning rate | adaptive |
| KLD target | 0.01 |
| clip param | 0.2 |
| control dt (s) | 0.02 |
| num. envs | 4096 |
| actor MLP size | (512, 256, 128) |
| critic MLP size | (512, 256, 128) |
| network activation | elu |
| optimizer | AdamW |
Table III: Hyperparameter configuration. This table details the hyperparameters used for the PPO algorithm during training. Key parameters include:
- discount factor (0.99): determines the importance of future rewards; a value close to 1 emphasizes long-term rewards.
- GAE lambda (0.95): the Generalized Advantage Estimation (GAE) parameter, used to balance bias and variance in advantage estimation.
- learning rate (adaptive): adjusts the step size during optimization.
- KLD target (0.01): the Kullback-Leibler divergence target, a constraint used in PPO to limit policy updates.
- clip param (0.2): PPO's clipping parameter, which limits how far the new policy can deviate from the old policy during an update step.
- control dt (0.02 s): the time step of the policy's control loop (corresponding to 50 Hz).
- num. envs (4096): the number of parallel simulation environments used in IsaacGym for efficient data collection.
- actor/critic MLP size ((512, 256, 128)): the architecture of the Multi-Layer Perceptron (MLP) networks for the actor and critic, i.e., the number of neurons in each hidden layer.
- network activation (elu): the activation function used in the neural networks.
- optimizer (AdamW): the optimization algorithm used to update network weights.

The following are the results from Table IV of the original paper:

| Domain randomization | Range | Observation noise | Range |
| --- | --- | --- | --- |
| friction range | [0.5, 1.0] | dof_pos | 0.05 |
| push interval | 5 | dof_vel | 0.2 |
| max push vel_xy | 0.5 | lin_vel | 0.2 |
| max push ang_vel | 0.5 | ang_vel | 0.1 |
| added base_com range | [-0.08, 0.08] | gravity | 0.1 |
| join_friction range | [0.01, 1.0] | | |
| added inertia range | [0.01, 0.1] | | |

Table IV: Domain randomization and observation noise. This table details the parameters for domain randomization and observation noise applied during training. These techniques are crucial for improving sim2real transferability.
- Domain Randomization:
  - friction range ([0.5, 1.0]): randomizes friction coefficients of surfaces.
  - push interval (5): frequency of external pushes applied to the robot.
  - max push vel_xy (0.5): maximum horizontal velocity change from external pushes.
  - max push ang_vel (0.5): maximum angular velocity change from external pushes.
  - added base_com range ([-0.08, 0.08]): randomly shifts the center of mass of the robot's base.
  - join_friction range ([0.01, 1.0]): randomizes friction in the robot's joints.
  - added inertia range ([0.01, 0.1]): randomly adds inertia to robot links.
- Observation Noise:
  - dof_pos (0.05): noise added to joint position observations.
  - dof_vel (0.2): noise added to joint velocity observations.
  - lin_vel (0.2): noise added to linear velocity observations.
  - ang_vel (0.1): noise added to angular velocity observations.
  - gravity (0.1): perturbation of the gravity vector.

The following are the results from Table V of the original paper:
| Term | S1 weight | S2 weight | S3 weight |
| --- | --- | --- | --- |
| base_height | 5 | 5 | 5 |
| base_ang_vel_xy | -10 | -10 | -10 |
| base_orientation | -50 | -50 | -50 |
| contact_no_vel | -10 | -10 | -10 |
| feet_orientation | 10 | 10 | 10 |
| feet_no_fly | -2 | -2 | -2 |
| feet_height | 10 | 10 | — |
| feet_distance | 2 | 2 | — |
| air_time & land_time | -500 | -500 | -500 |
| sym_contact_forces | 1 | 1 | — |
| sym_step | -5 | -5 | — |
| face_the_net | 8 | 5 | 5 |
| target_approach | 30 | 15 | — |
| hit reward | — | 4000 | 4000 |
| other hit-related terms | — | 5, 10 | 5, 10 |
| action_rate | -0.8 | -0.8 | -0.8 |
| dof_pos_limit | -30 | -30 | -30 |
| dof_vel_limit | -0.1 | -0.1 | -0.1 |
| dof_torque_limit | -0.1 | -0.1 | -0.5 |
| dof_acc | -5×10⁻⁵ | -5×10⁻⁵ | -5×10⁻⁵ |
| dof_vel | -1×10⁻³ | -1×10⁻³ | -1×10⁻³ |
| dof_torque | -1×10⁻⁴ | -1×10⁻⁴ | -1×10⁻⁴ |
| momentum_positive | 5 | 5 | 5 |
| energy | — | -0.01 | -0.01 |
| collision | — | -10 | -10 |
-
Table V: Reward weights by stage. This table provides a detailed breakdown of the reward terms and their respective weights across the three training stages (S1, S2, S3). It explicitly shows how the curriculum gradually shifts focus (a minimal configuration sketch follows this list):
- S1 (Footwork acquisition): Heavy emphasis on locomotion terms, particularly base orientation, air time/land time (for stable stepping), and target_approach. Hitting rewards are not active.
- S2 (Precision-guided swing generation): The target_approach weight is reduced, and crucially, the hitting rewards are introduced with high weights, the primary one at 4000. Energy and collision penalties are also added.
- S3 (Task-focused refinement): Many gait-shaping locomotion terms are removed (e.g., feet_height, feet_distance, sym_contact_forces, sym_step), allowing the policy to optimize more directly for the hitting objective. The dof_torque_limit penalty is increased, further encouraging energy efficiency.
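The sketch below shows one way the per-stage weights could be organized so the curriculum simply swaps weight tables between stages. The dictionary layout and the `total_reward` helper are assumptions; the numeric weights are a subset copied from Table V, and the unnamed hitting-reward terms are marked as placeholders.

```python
# Per-stage reward weights (subset of Table V). "hit_primary" / "hit_aux_*" stand in
# for the hitting-reward terms whose exact names are not given in this summary.
REWARD_WEIGHTS = {
    "S1": {"base_height": 5, "base_orientation": -50, "air_time_land_time": -500,
           "face_the_net": 8, "target_approach": 30,
           "action_rate": -0.8, "dof_torque_limit": -0.1},
    "S2": {"base_height": 5, "base_orientation": -50, "air_time_land_time": -500,
           "face_the_net": 5, "target_approach": 15,
           "hit_primary": 4000, "hit_aux_1": 5, "hit_aux_2": 10,
           "energy": -0.01, "collision": -10,
           "action_rate": -0.8, "dof_torque_limit": -0.1},
    "S3": {"base_height": 5, "base_orientation": -50, "air_time_land_time": -500,
           "face_the_net": 5,
           "hit_primary": 4000, "hit_aux_1": 5, "hit_aux_2": 10,
           "energy": -0.01, "collision": -10,
           "action_rate": -0.8, "dof_torque_limit": -0.5},  # tightened from -0.1 in S1/S2
}

def total_reward(stage: str, reward_terms: dict) -> float:
    """Weighted sum over whichever reward terms are active in the given stage."""
    return sum(w * reward_terms.get(name, 0.0) for name, w in REWARD_WEIGHTS[stage].items())
```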
6. Results & Analysis
6.1. Core Results Analysis
The experiments are designed to test accuracy, agility, robustness, and deployability in both simulation and the real world. A hit is counted as successful when the racket's position error falls below a small threshold and its orientation error stays below 0.2 rad, tolerances motivated by the racket's geometry (a sketch of this check follows).
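A minimal sketch of such a success check is given below, comparing the racket-face position and normal against the commanded target. The 0.10 m position tolerance is an assumed placeholder, since the exact value is not stated in this summary; only the 0.2 rad orientation bound comes from the paper.

```python
import numpy as np

def hit_success(racket_pos, target_pos, racket_normal, target_normal,
                pos_tol: float = 0.10, ang_tol: float = 0.2) -> bool:
    """Return True if the racket is close enough to the target pose at impact.
    pos_tol (m) is an assumed placeholder; ang_tol (rad) matches the 0.2 rad bound."""
    pos_err = np.linalg.norm(np.asarray(racket_pos, float) - np.asarray(target_pos, float))
    n1 = np.asarray(racket_normal, float); n1 /= np.linalg.norm(n1)
    n2 = np.asarray(target_normal, float); n2 /= np.linalg.norm(n2)
    ang_err = np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))
    return pos_err < pos_tol and ang_err < ang_tol
```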
6.1.1. Simulation Results
The following are the results from Figure 3 of the original paper:

This illustration shows two humanoid robots playing badminton. Panel (a) depicts the positions and motion trajectories of the two robots across 21 hits, annotating the hitting points and the shuttlecock's flight paths. Panel (b) shows the prediction-free policy and panel (c) the target-known policy, with red spheres marking the commanded hitting positions and green spheres marking successful hits.
Figure 3: Simulation results. Figure (a) illustrates the Two-Robot Rally scenario, where two identical humanoid robots sustain a rally of 21 consecutive returns. Figure (b) demonstrates the Prediction-Free policy: the robot infers the optimal impact position and orientation solely from the first five recorded shuttlecock positions after serving. Figure (c) presents the Target-Known policy, where a predetermined hitting position is provided. The red sphere indicates the designated hitting location, while the green sphere confirms successful impact execution by the robot.
6.1.1.1. Two-Robot Rally
- Setup: Two identical humanoids, both running the Target-Known policy, face each other on a scaled badminton court. The outgoing shuttle velocity after each hit is computed assuming an elastic interaction (a minimal sketch of such a model appears after this list). The rally continues as long as the shuttle is returned successfully (clears the net, lands in bounds, no falls, no misses).
- Results: The robots sustained a rally of 21 consecutive returns.
- Analysis: This demonstrates that the controller is capable of repeatedly:
  - Repositioning: moving to intercept the shuttle.
  - Returning shuttles with high quality: the hits are accurate enough to continue the rally.
  - Recovering posture: maintaining balance and preparing for the next shot over extended exchanges.
  This is a strong validation of the whole-body coordination and robustness of the learned policy.
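The elastic interaction mentioned in the setup could be modeled as below: the shuttle's velocity relative to the racket is reflected about the racket-face normal. This is only a sketch under a fully elastic assumption; the paper does not spell out its contact model, so the function and the restitution parameter are illustrative.

```python
import numpy as np

def elastic_return_velocity(shuttle_vel, racket_vel, racket_normal, restitution: float = 1.0):
    """Reflect the shuttle's relative velocity about the racket-face normal.
    restitution = 1.0 corresponds to a fully elastic hit; this is an assumed model,
    not necessarily the one used in the paper's simulator."""
    n = np.asarray(racket_normal, float)
    n = n / np.linalg.norm(n)
    v_rel = np.asarray(shuttle_vel, float) - np.asarray(racket_vel, float)
    v_rel_reflected = v_rel - (1.0 + restitution) * np.dot(v_rel, n) * n
    return v_rel_reflected + np.asarray(racket_vel, float)
```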
6.1.1.2. Target-Known vs. Prediction-Free Comparison
- Setup: Compares the Target-Known policy (receiving the planned interception position, orientation, and time) with the Prediction-Free policy (observing the current and a five-frame history of shuttle positions) under identical simulation conditions. The Prediction-Free policy also experiences moderate variations in aerodynamic characteristic length to simulate different shuttles. Each policy executes twenty random hits.
- Metrics: Position error, orientation error at impact, and executed swing speed.

The following are the results from Figure 4 of the original paper:
This chart compares the target-known and prediction-free policies. The top panel compares position error, the middle panel compares orientation error, and the bottom panel compares swing speed; the two policies show clear differences across trials.
Figure 4: Comparison between the target-known and prediction-free policies. The top part of this figure shows the position error for both strategies. The middle section shows the orientation error comparison, where the orientation corresponds to the normal direction of the racket face. The bottom part compares swing velocity.
- Results (from Figure 4):
  - Position Error: Both policies achieve very low position error; the Target-Known policy is slightly more accurate, and the Prediction-Free policy remains within acceptable bounds.
  - Orientation Error: Both show good orientation accuracy, with the Prediction-Free variant exhibiting a modest increase in error.
  - Swing Velocity: Both policies achieve comparable swing velocities.
- Analysis: The results indicate that explicit target information is not indispensable. The Prediction-Free policy can infer both the hitting target and timing on its own, with only a modest drop in performance compared to the Target-Known policy. This provides initial evidence for the viability of a more end-to-end control strategy that is robust to aerodynamic variability and does not rely on a separate predictor (a sketch of the fixed-rate shuttle-position history such a policy consumes is shown below).
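As an illustration only, the buffer below keeps the most recent five shuttle positions sampled at the 50 Hz control rate and flattens them into an observation vector. The class name and the zero-padding convention are assumptions; the paper does not describe its exact observation layout.

```python
from collections import deque
import numpy as np

class ShuttleHistoryBuffer:
    """Hold the latest N shuttle positions sampled at the control rate (50 Hz here),
    flattened into a fixed-size observation vector and zero-padded until full."""
    def __init__(self, history_len: int = 5, dim: int = 3):
        self.history_len, self.dim = history_len, dim
        self._frames = deque(maxlen=history_len)

    def push(self, shuttle_pos) -> None:
        """Append the newest shuttle position measurement."""
        self._frames.append(np.asarray(shuttle_pos, dtype=np.float32))

    def observation(self) -> np.ndarray:
        """Oldest-to-newest positions, zero-padded at the front while the buffer fills."""
        pad = [np.zeros(self.dim, dtype=np.float32)] * (self.history_len - len(self._frames))
        return np.concatenate(pad + list(self._frames))
```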
6.1.2. Sim2Real Transfer
The controller is trained entirely in simulation and deployed to hardware zero-shot (without real-world training).
- Techniques for Transfer: Domain randomization and observation noise are applied during training (detailed in Table IV) to cover key dynamics and sensing variations. Constraint terms in the reward function discourage brittle, high-torque solutions. The staged curriculum encourages a complete kinetic chain for hitting.
- Outcome: The learned controller transfers successfully to the real robot in the Mocap arena without system identification or manual parameter adjustment, validating the effectiveness of the sim2real strategy.
6.1.3. Real Robot Deployment
6.1.3.1. EKF Prediction Accuracy Study
- Objective: Quantify the accuracy of the EKF-based shuttlecock trajectory prediction.
- Setup: 20 authentic badminton flight trajectories were collected using Mocap. Partial segments served as measurement input for the EKF, and predictions were compared against ground-truth contact positions and timing.

The following are the results from Figure 5 of the original paper:
This chart shows the total position error and the interception-time prediction error as a function of time relative to interception. The top plot shows the mean position error and its standard deviation (purple curve); the bottom plot shows the absolute interception-time error, with the orange curve representing the mean time error. The interception time is indicated by a dashed line.
Figure 5: EKF Prediction Accuracy. The predicted striking position error (top) and striking time error (bottom) were evaluated over 20 shuttlecock trajectories. The shaded regions represent the standard deviation. At 0.6 s before interception, the mean position error was already smaller than the radius of the racket.
- Results (from Figure 5):
  - Position Error: By 0.6 s before impact, the mean predicted position error was already smaller than the racket's radius, and it converged sharply in the final moments before contact.
  - Timing Error: The mean interception timing prediction error was small well before the hit and rapidly converged thereafter.
- Analysis: The EKF demonstrates high accuracy in predicting both shuttlecock position and timing well within the reaction window needed by the robot, confirming its reliability as a perception module (an illustrative prediction sketch follows Figure 11 below).

The following are the results from Figure 11 of the original paper:
This chart compares the total position error and the interception-time prediction error as a function of time relative to interception under different aerodynamic conditions. The upper part shows the variation of the total position error and the lower part the absolute time error, with error bands for each case; the interception time is also marked.
Figure 11: EKF prediction accuracy under varying aerodynamic characteristic lengths. A scaling factor perturbs the aerodynamic characteristic length to emulate variations in shuttle aerodynamics. The figure shows a sensitivity analysis of the EKF's prediction accuracy under this perturbation: even so, the mean position error (top) still converges to a small value well before interception, indicating reasonable robustness to uncertainties in the aerodynamic parameters.
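For intuition, the sketch below rolls a drag-affected point-mass model forward from a current position/velocity estimate and picks the first downward crossing of an assumed hitting height as the predicted interception. The quadratic-drag form a = g - |v|*v / L parameterized by a characteristic length L, the value L = 4.0 m, and the 1.2 m hitting height are assumptions for illustration; the paper's EKF and shuttle model are not reproduced here.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity (m/s^2)

def propagate_shuttle(pos, vel, char_length: float = 4.0, dt: float = 0.002, t_max: float = 2.0):
    """Integrate a point-mass shuttle model with quadratic drag, a = g - |v| * v / L,
    where L is the aerodynamic characteristic length. Returns (time, position) samples."""
    p, v = np.asarray(pos, float), np.asarray(vel, float)
    samples, t = [(0.0, p.copy())], 0.0
    while t < t_max:
        a = G - np.linalg.norm(v) * v / char_length
        v = v + a * dt
        p = p + v * dt
        t += dt
        samples.append((t, p.copy()))
    return samples

def predict_interception(samples, hit_height: float = 1.2):
    """Return (time, position) where the trajectory first descends through hit_height."""
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if p0[2] >= hit_height > p1[2]:
            alpha = (p0[2] - hit_height) / (p0[2] - p1[2])
            return t0 + alpha * (t1 - t0), p0 + alpha * (p1 - p0)
    return None
```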
6.1.3.2. Virtual-Target Swinging
- Objective: Quantify the robot's swing error and speed without a real shuttle, isolating the controller's motor performance.
- Setup: The robot was instructed to execute hitting motions towards 71 randomly sampled target positions. Mocap measured the racket face center, which was compared to the commanded target.

The following are the results from Figure 6 of the original paper:
This chart shows the total position error during striking and the racket speed at impact. The upper panel shows the total position error (in millimeters) with a reference line at the mean of 24.29 mm; the lower panel shows the racket speed at impact (in meters per second), with a mean of 5.32 m/s. Both panels include box plots visualizing the distributions of error and speed across hitting-point indices.
Figure 6: Virtual-Target Swinging. The upper portion of this figure depicts the Euclidean distance error between the racket center and the designated hitting position at the moment of impact, while the lower portion illustrates the corresponding racket speed at impact.
- Results (from Figure 6):
  - Mean Euclidean distance error: 24.29 mm.
  - Average racket speed at impact: 5.32 m/s.
- Analysis: These results confirm the robot's ability to precisely control the racket contact point and generate sufficient swing speed, demonstrating its practical feasibility for high-precision swinging actions.

The following are the results from Figure 12 of the original paper:
This chart shows the racket-center trajectories around the designated hitting position at (50, -250, 1540) mm. The blue lines represent the raw trajectories, and the green spheres mark the actual racket-center positions the robot passed through; 20 motion points are included, and the overall paths depict the robot's swinging motion.
Figure 12: Trajectory of the racket center. For a designated hitting position at (50, -250, 1540) mm, the robot executed 20 swinging motions. The green spheres represent the positions of the racket center as it passed through the plane during each swing. This figure visualizes the repeatability of the racket center's trajectory across the 20 swings aimed at a single fixed target; the tight clustering of green spheres indicates high precision.
The following are the results from Figure 13 of the original paper:

This chart shows the error distribution in the X-Y plane. The target position (black star), the closest hit point (blue circle), and the farthest hit point (red circle) are marked. The reported error statistics include the mean, standard deviation, maximum, and minimum; the red dashed circle represents the mean-square-error circle.
Figure 13: Swing Error Analysis. This figure presents a statistical analysis of the swing error over 20 repeated swings to a fixed virtual target, reporting the mean Euclidean distance error along with its standard deviation and the maximum and minimum errors. The mean error is close to the racket's center, and the low standard deviation indicates good repeatability.
6.1.3.3. Real-World Shuttle Hitting
- Objective: Integrate the prediction and control modules to perform autonomous shuttlecock hitting in the real world.
- Setup: A ball machine serves shuttles in a Mocap arena. The interception area is constrained for safety and field-of-view reasons.
- Results:
  - Success Rate: The robot successfully returned 42 out of 46 shuttles (91.3% success rate); the 4 misses struck the edge of the racket frame.
  - Interception Range: The robot intercepts shuttles over a sizable area above the ground, a relatively large range given the robot's height.
  - Outgoing Shuttle Speed: Up to 19.1 m/s.
  - Return Shot Quality: Produces steep, high-arcing return trajectories and a mean return landing distance of 4 m from the interception area.
- Analysis: This demonstrates the system's overall effectiveness in a complex real-world task, showcasing coordinated and agile motion, a high success rate, and quality returns.

The following are the results from Figure 7 of the original paper:
This illustration shows the robot in two postures during real-world shuttle hitting, located within the adapted interception area. The robot's hitting-height range is annotated, and the boundaries of the hitting area are highlighted.
Figure 7: Real-World Shuttle Hitting. This figure captures the robot's actual hitting postures at the two opposing boundaries of the interception area. It visually confirms the robot's ability to reach and strike shuttles across a significant interception area, illustrating its whole-body coordination and flexibility in adapting its posture to different target locations.
6.2. Ablation Studies / Parameter Analysis
The paper includes an ablation study on the multi-stage curriculum (as mentioned in Methodology Section 4.2.1.5).
- S1 (Footwork acquisition): Found to be essential. Removing it causes training to diverge immediately.
- S2 (Precision-guided swing generation): Also essential. Skipping it (jumping from S1 directly to S3) creates too large a curriculum gap, and training fails to converge reliably.
- S3 (Task-focused refinement): While S1+S2 already yields a hardware-deployable policy, S3 is crucial for breaking performance plateaus.
  - Benefits of S3: Increases the primary hitting reward by 3-5%, and leads to lower action rates, reduced joint velocity and acceleration penalties, lower energy consumption, cleaner foot-contact force profiles, and reduced joint-torque usage. Notably, energy and torque costs decrease by approximately 20%.
- Analysis: This detailed ablation demonstrates the necessity of each stage in the curriculum. The progressive introduction of complexity, followed by the removal of locomotion-shaping regularizers in S3, is key to achieving optimal, energy-efficient, and robust hitting performance. The reduction in energy and torque costs in S3 indicates that the policy learns more task-optimal and physically efficient solutions once extraneous regularizers are removed.
6.3. Striking Motion Analysis
The following are the results from Figure 14 of the original paper:

This image shows three snapshots of the robot performing a real-world strike. The left panel shows the robot stepping in place before the shuttle is launched; the middle panel shows the approach and backswing phase as the shuttle is launched (a yellow box highlights the shuttle, a green arrow shows the racket swing direction, and a white arrow indicates the stepping motion); the right panel shows the hit and follow-through: the robot steps toward the shuttle while swinging the racket, then completes the follow-through.
Figure 14: Real-world striking motion. Three snapshots of a successful return in the real world. Left: in-place stepping before the shuttle is launched. Middle: approach and backswing phase as the shuttle is launched (the yellow box highlights the shuttle, the green arrow indicates the racket swing direction, and the white arrow indicates the stepping motion). Right: hit and follow-through; the robot simultaneously takes a step and swings the racket toward the shuttle, then completes the motion with a follow-through. This figure breaks a successful real-world hit into three key phases, illustrating the whole-body coordination achieved by the RL policy: a preparatory phase (in-place stepping), the approach and backswing (coordinated leg movement and arm preparation), and the impact and follow-through (simultaneous stepping and swinging, with the non-racket arm aiding balance).
- Key Observations from Motion Analysis:
  - Resting State: When no shuttle is incoming, the robot maintains an active in-place stepping behavior, holding the racket in front of its body, indicating readiness to react.
  - Coordinated Approach: Upon shuttle launch, the controller initiates corrective steps (short or long strides) while simultaneously performing a backswing. This foot-racket co-timing is an emergent behavior, not hand-coded.
  - Whole-Body Impulse: At the hitting instant, both legs and the arm accelerate in a coordinated manner to generate whole-body impulse for high racket speed.
  - Balance Mechanism: The non-racket arm swings in the opposite direction during the fast stroke, acting as a counterweight that helps counteract angular momentum and maintain balance.
  - Recovery: After striking, the robot executes a recovery motion to prepare for the next potential shot.
- Emergent Behaviors:
  - Tiptoe reaching: Observed for higher interception heights.
  - Recentering: The robot recenters near the middle-right of the court between hits, reflecting the sampled target distribution and mimicking human re-centering behavior.
- Analysis: These observations highlight the sophisticated, human-like behaviors that emerge from the multi-stage RL training, particularly the synergy between lower-body footwork and upper-body striking, which is crucial for dynamic tasks like badminton. The emergent balance mechanisms and recovery behaviors further underscore the effectiveness of the whole-body control approach.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work successfully demonstrates the first real-world humanoid badminton system, powered by a unified whole-body reinforcement learning (RL) policy. The core innovation lies in its three-stage curriculum: starting with footwork acquisition, progressing to precision-guided swing generation, and culminating in task-focused refinement. This curriculum enables the humanoid to learn coordinated stepping and striking without relying on expert demonstrations or motion priors, producing complex, human-like badminton behaviors within tight sub-second time windows.
Key achievements include:
- In simulation, two identical humanoids sustained a rally of 21 consecutive returns, exhibiting high hitting accuracy (within tight position-error and 0.2 rad orientation-error tolerances).
- The prediction-free variant showed comparable performance to the target-known policy in simulation, demonstrating robustness to aerodynamic variations without an explicit shuttle predictor.
- In real-world tests, the zero-shot transferred policy achieved fast swing speeds and returned ball-machine serves with outgoing shuttle speeds up to 19.1 m/s. The framework represents a significant step towards enabling agile, interactive whole-body response tasks on humanoids in dynamic environments.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Mocap Arena Constraints: Current tests are limited by the Mocap arena's ceiling height, restricting shuttle trajectories to low arcs from a ball machine. This constrains the robot's potential and makes long multi-ball rallies or play with human partners difficult.
  - Future Work: Deploying in higher-ceiling venues and with broader interception bands would allow for more dynamic and varied shuttle trajectories, revealing the full potential of the controller.
- Limited Stroke Repertoire: The learned striking behavior is stereotyped, predominantly forehand-like. It lacks the diverse repertoire of human strokes (e.g., backhand hits, lunges, jumps, smashes). The feasible interception area is also currently restricted to a relatively narrow band.
  - Future Work: Expanding the training data distribution and reward shaping to encourage a wider variety of strokes and interception areas.
- Prediction-Free Variant Deployment: While promising in simulation, the prediction-free variant has not yet been validated on hardware. It relies on a strictly timed 50 Hz history of shuttle positions, requiring robust buffering and time-stamping for accurate velocity and timing inference by the actor.
  - Future Work: Deploying the prediction-free policy on hardware and performing a thorough analysis of its robustness in real-world conditions.
- Vision-based Perception: The current system relies on an external Mocap system for shuttlecock tracking.
  - Future Work: Moving towards pure vision-based operation would require reliable visual odometry and policies that actively keep the shuttle within the field of view during aggressive motion. A head-mounted camera with learning signals to align the view with the shuttle trajectory is suggested as a practical path.
- Higher-Level Strategy: The current controller acts as a motor primitive. It does not make strategic decisions about where to hit or when to swing. Current failures sometimes occur when consecutive targets are too far apart for the available maneuver time.
  - Future Work: Training a higher-level policy via multi-agent training or self-play to decide interception points, swing timing, and racket-face orientation. This would involve reward shaping that explicitly encourages high-speed legged maneuvers to cover larger distances.
- Adaptability to Other Sports:
  - Future Work: The framework could be adapted to other dynamics-critical domains like tennis or squash by replacing the shuttlecock dynamic model and adjusting interception styles and reward tunings to reflect sport-specific behaviors.
7.3. Personal Insights & Critique
This paper presents a truly impressive achievement in humanoid robotics, particularly in the domain of dynamic loco-manipulation. The successful zero-shot transfer from simulation to a real humanoid for such a complex task as badminton is a testament to the power of well-designed RL curricula and domain randomization.
Inspirations:
- Curriculum Learning for Complex Tasks: The three-stage curriculum is a masterclass in breaking down a formidable RL problem into manageable, progressive steps. This structured approach, particularly the refinement stage (S3) where locomotion regularizers are strategically removed, is a valuable blueprint for training other complex whole-body behaviors. It highlights that optimal task performance may require shedding generic regularizers once fundamental skills are acquired.
- Emergent Behaviors: The observation that foot-racket co-timing, tiptoe reaching, and recentering emerged without explicit programming is a powerful demonstration of RL's ability to discover intelligent and adaptive strategies. This reinforces the idea of empowering agents to learn from task-specific rewards rather than strict imitation learning.
- Sim2Real Robustness: The rigorous application of domain randomization and observation noise is a key takeaway. The fact that the policy transfers zero-shot with high efficacy suggests that these sim2real techniques are maturing to a point where they can reliably bridge the reality gap for highly dynamic, contact-rich tasks.
- Prediction-Free Control: The prediction-free variant is a forward-looking contribution. Humans do not compute explicit Kalman filter predictions; they react to sensory input and infer intentions. This end-to-end approach, even if currently only validated in simulation, offers a path to more robust systems that are less sensitive to model inaccuracies or parameter tuning.
Critique & Areas for Improvement:
- Generalization of Stroke Styles: While the paper successfully demonstrates a formidable forehand-like stroke, the lack of diverse stroke types (e.g., backhand, smashes, drop shots) is a significant limitation for actual badminton play. Real badminton involves anticipating and executing a wide array of shots. Future work could explore incorporating style-conditioning inputs or more diverse reward landscapes to encourage a broader repertoire.
- Dealing with Occlusions and Multiple Objects: The current Mocap-based perception and single-shuttle environment simplify perception. In a real game, occlusions (net, opponent, the robot's own body) and managing multiple shuttlecocks (in practice scenarios) would introduce substantial perception challenges. The prediction-free variant helps with aerodynamic uncertainty but not with perceptual uncertainty from occlusions.
- Strategy and Adversarial Play: The robot currently reacts to machine-served shuttles. Integrating a higher-level strategic policy (as suggested by the authors) would be the next frontier. Playing against an opponent (human or robot) introduces adversarial dynamics and the need for game theory and long-term planning, which are entirely different RL challenges.
- Energy Efficiency for Extended Play: While the paper notes a reduction in energy and torque costs in S3, the long-term energy budget for sustained humanoid operation is critical. Further optimization of energy consumption would be valuable for practical deployment over extended periods.
- Human-Robot Interaction Safety: For deployment in human environments, safety guarantees and human-robot interaction protocols (e.g., avoiding accidental collisions with humans or other objects) would need rigorous development beyond the collision penalties in the reward function.

Overall, this paper pushes the boundaries of humanoid capabilities in dynamic environments, setting a new benchmark for whole-body control in robotic sports. The insights gained from its multi-stage curriculum and sim2real transfer strategy are broadly applicable across the field of legged robotics and interactive AI.