Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning
TL;DR Summary
This paper presents a reinforcement learning training pipeline to develop a unified whole-body controller for humanoid badminton, enabling coordinated footwork and striking without reliance on motion priors or expert demonstrations. The approach is validated in both simulated and real-world experiments.
Abstract
Humanoid robots have demonstrated strong capabilities for interacting with static scenes across locomotion, manipulation, and more challenging loco-manipulation tasks. Yet the real world is dynamic, and quasi-static interactions are insufficient to cope with diverse environmental conditions. As a step toward more dynamic interaction scenarios, we present a reinforcement-learning-based training pipeline that produces a unified whole-body controller for humanoid badminton, enabling coordinated lower-body footwork and upper-body striking without motion priors or expert demonstrations. Training follows a three-stage curriculum: first footwork acquisition, then precision-guided racket swing generation, and finally task-focused refinement, yielding motions in which both legs and arms serve the hitting objective. For deployment, we incorporate an Extended Kalman Filter (EKF) to estimate and predict shuttlecock trajectories for target striking. We also introduce a prediction-free variant that dispenses with EKF and explicit trajectory prediction. To validate the framework, we conduct five sets of experiments in both simulation and the real world. In simulation, two robots sustain a rally of 21 consecutive hits. Moreover, the prediction-free variant achieves successful hits with comparable performance relative to the target-known policy. In real-world tests, both prediction and controller modules exhibit high accuracy, and on-court hitting achieves an outgoing shuttle speed up to 19.1 m/s with a mean return landing distance of 4 m. These experimental results show that our proposed training scheme can deliver highly dynamic while precise goal striking in badminton, and can be adapted to more dynamics-critical domains.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning
1.2. Authors
Chenhao Liu*, Leyun Jiang†, Yibo Wang†, Kairan Yao, Jinchen Fu and Xiaoyu Ren. All authors are affiliated with Beijing Phybot Technology Co., Ltd, Beijing, China. Chenhao Liu's email is liuchenhao@phybot.cn.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, indicated by the provided link https://arxiv.org/abs/2511.11218. arXiv is a widely recognized open-access preprint server for research articles in fields such as physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it is not a peer-reviewed journal or conference in its preprint form, it serves as an important platform for early dissemination of research findings and allows for community feedback before formal publication. Given the advanced topic and detailed experimental results, it is likely intended for submission to a top-tier robotics or AI conference/journal.
1.4. Publication Year
2025 (Published at UTC: 2025-11-14T12:22:19.000Z)
1.5. Abstract
This paper introduces a reinforcement learning (RL) based training pipeline that develops a unified whole-body controller for humanoid robots to play badminton. The controller enables coordinated lower-body footwork and upper-body striking without relying on motion priors or expert demonstrations. The training follows a three-stage curriculum: first, footwork acquisition, then precision-guided racket swing generation, and finally, task-focused refinement. For deployment, an Extended Kalman Filter (EKF) is incorporated to estimate and predict shuttlecock trajectories for target striking. A prediction-free variant is also introduced, which dispenses with explicit trajectory prediction. The framework is validated through five sets of experiments in both simulation and the real world. In simulation, two robots sustained a rally of 21 consecutive hits, and the prediction-free variant achieved comparable performance. Real-world tests demonstrated high accuracy for both prediction and controller modules, with outgoing shuttle speeds up to 19.1 m/s and a mean return landing distance of 4 m. These results highlight the training scheme's ability to achieve highly dynamic and precise goal striking in badminton, suggesting its adaptability to other dynamics-critical domains.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2511.11218
PDF Link: https://arxiv.org/pdf/2511.11218v2.pdf
Publication Status: Preprint (arXiv)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the challenge of enabling humanoid robots to perform highly dynamic, contact-rich interactions with fast-moving objects within tight reaction windows. While humanoids have made significant progress in locomotion and manipulation in static or quasi-static environments, real-world scenarios often demand agile responses to dynamic elements.
This problem is particularly important because it represents a crucial stepping stone towards developing general-purpose embodied agents that can operate effectively and safely in human-centric environments. Current research often focuses on either locomotion (how a robot moves) or manipulation (how it interacts with objects), but loco-manipulation (coordinated movement and manipulation) in highly dynamic contexts remains largely unexplored.
Badminton serves as an ideal testbed for this challenge due to several factors:
- Sub-second perception-action loops: Players (and robots) must react extremely quickly to incoming shots.
- Precise timing and orientation: Hitting a shuttlecock effectively requires accurately timing the swing and orienting the racket face within a specific 3D interception volume.
- Whole-body coordination: Successful badminton play necessitates blending rapid arm swings with stable and agile leg movements (footwork).

Existing challenges in robotic racket sports, particularly when comparing badminton to table tennis, further highlight the difficulty:
- Aerodynamic uncertainty: Shuttlecock trajectories are highly unpredictable due to strong drag and a unique "flip" regime, reducing decision time despite longer flight paths.
- Large swing amplitudes: Badminton requires much larger swings than table tennis, leading to significant whole-body disturbances and balance challenges.
- Footwork-strike co-evolution: Unlike some table tennis systems where base repositioning is decoupled, in badminton, lower-body motion directly influences hitting accuracy and must be tightly integrated with striking.

The paper's entry point, or innovative idea, is to address these challenges by proposing a multi-stage reinforcement learning training pipeline that develops a unified whole-body controller for humanoid robots, specifically for badminton. This approach aims to overcome the limitations of prior work that often relied on reference motions, decoupled lower-body and upper-body control, or simplified hitting mechanics. The innovation lies in fostering footwork-strike synergy through a structured RL curriculum that allows the policy to discover energy-efficient swings and coordinated movements without explicit motion priors, leading to direct applicability on real hardware.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- First Real-World Humanoid Badminton System: The authors present what they claim to be the first real-world humanoid robot capable of playing badminton. This autonomous system uses a unified whole-body controller to return machine-served shuttles on a 21-degree-of-freedom (DoF) humanoid. It achieves strong performance, including high racket swing speeds and outgoing shuttle speeds up to 19.1 m/s under sub-second reaction times, demonstrating the feasibility of humanoids performing highly dynamic, interactive tasks.
- Stage-wise Reinforcement Learning for Footwork-Strike Coordination: A novel three-stage curriculum is introduced for RL training. This curriculum systematically develops complex loco-manipulation skills:
  - S1 (Footwork acquisition): Focuses on learning stable footwork and approaching target regions.
  - S2 (Precision-guided swing generation): Builds upon S1 by introducing precise racket swing generation, incorporating timing and progressively tightening position and orientation accuracy.
  - S3 (Task-focused refinement): Removes generic locomotion-shaping regularizers to eliminate gradient interference and maximize hitting performance, leading to more energy-efficient and task-optimal behaviors. This structured approach is crucial for achieving whole-body coordination and footwork-strike synergy.
- Prediction-Free Variant for Enhanced Robustness: The paper explores a prediction-free variant of the controller. This variant represents a more end-to-end policy that implicitly infers timing and hitting targets directly from short-horizon shuttle observations (current and historical shuttle poses), rather than relying on an explicit trajectory predictor such as an EKF. This approach aims to improve robustness to aerodynamic variability and to simplify deployment by reducing reliance on external prediction modules or aerodynamic parameters. While currently validated primarily in simulation, it offers a promising direction for future real-world applications.

The key conclusions and findings are:
- The multi-stage RL curriculum successfully trains a unified whole-body controller that can execute large-amplitude swings while maintaining balance and reacting within sub-second windows to intercept fast incoming shuttles autonomously.
- In simulation, the system demonstrates high reliability, with two robots sustaining a rally of 21 consecutive hits.
- The prediction-free variant shows hitting performance in simulation comparable to the target-known policy, suggesting its potential for more robust and streamlined real-world deployment.
- Real-world experiments confirm the sim2real transferability of the learned policy, achieving high prediction accuracy and controller module accuracy. The robot successfully returns machine-served shuttles with impressive outgoing speeds and return shot quality.
- The system develops foot-racket co-timing and recovery behaviors without explicit hand-coding; these emerge naturally from the training process.

These findings address the problem of enabling humanoids to engage in highly dynamic and interactive tasks, specifically in the context of badminton, by providing a robust and generalizable reinforcement learning framework for whole-body control.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Humanoid Robots: These are robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. They are engineered to operate in human-centric environments, requiring advanced capabilities in balance, locomotion (walking, running), and manipulation (object interaction). The humanoid in this paper, Phybot C1, is a 1.28 m tall, 30 kg robot with 21 Degrees of Freedom (DoF). DoF refers to the number of independent parameters that define the configuration or state of a mechanical system. For example, a single joint that can only rotate in one plane has 1 DoF. A humanoid with 21 DoF has 21 such controllable movements, distributed across its hips, knees, ankles, waist, shoulders, and elbows.
- Reinforcement Learning (RL): RL is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent observes the state of the environment, takes an action, and receives a reward and a new state. The goal is to learn a policy, a mapping from states to actions, that yields the largest long-term reward.
  - Agent: The learning entity (e.g., the humanoid robot).
  - Environment: The world the agent interacts with (e.g., the badminton court, shuttlecock dynamics).
  - State ($s_t$): A representation of the environment at a given time (e.g., robot's joint angles, velocities, shuttlecock position).
  - Action ($a_t$): A decision made by the agent that influences the environment (e.g., desired joint positions for the robot).
  - Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of an action (e.g., positive reward for hitting the shuttlecock accurately, negative reward for falling).
  - Policy ($\pi_\theta$): The strategy the agent uses to choose actions given states.
  - Value Function: Estimates the expected cumulative reward from a given state or state-action pair.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. The paper uses a partially observable MDP, formally a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, R)$, where:
  - $\mathcal{S}$ is the set of all possible states. In this paper, the authors mention an unobserved state $s_t$.
  - $\mathcal{O}$ is the observation space; this is what the agent actually perceives from the environment.
  - $\mathcal{A}$ is the set of possible actions.
  - $P$ is the transition probability function, specifying the probability of moving to a new state given the current state and action: $P(s_{t+1} \mid s_t, a_t)$.
  - $R$ is the reward function, defining the reward received for taking action $a_t$ in state $s_t$. The policy aims to maximize the expected discounted sum of future rewards $\sum_t \gamma^t r_t$, where $\gamma$ is the discount factor (a value between 0 and 1 that determines the importance of future rewards).
- Proximal Policy Optimization (PPO): A popular RL algorithm that belongs to the family of actor-critic methods. PPO aims to balance ease of implementation, sample efficiency, and good performance. It uses a clipped surrogate objective that constrains policy updates, preventing excessively large policy changes that could lead to instability.
  - Actor-Critic: Actor-critic methods use two neural networks: an actor network that learns the policy (how to act) and a critic network that learns the value function (how good the current state or action is).
  - Asymmetric Actor-Critic: A variant in which the critic has access to more information (e.g., privileged information such as noise-free states or future targets) than the actor during training. This helps stabilize value estimation while keeping the actor's observations restricted to what is available during deployment.
- Extended Kalman Filter (EKF): A non-linear extension of the Kalman Filter, used for state estimation in dynamic systems where the system model and/or measurement model are non-linear. The EKF linearizes the system and measurement models around the current state estimate using Taylor series expansions. It is commonly used for trajectory prediction and sensor fusion. In this paper, it is used to estimate and predict the shuttlecock's trajectory given noisy motion capture measurements.
- Motion Capture (Mocap): A technology for digitally recording the movement of objects or people. It typically uses cameras to track markers placed on the subject. In this paper, Mocap provides precise position and orientation data for the robot and shuttlecock in the real world.
- Domain Randomization: A technique used in RL to improve the sim2real transferability of policies trained in simulation. By randomizing various physical parameters of the simulated environment (e.g., friction, mass, sensor noise, latency, aerodynamic parameters), the policy becomes more robust to discrepancies between the simulation and the real world, allowing zero-shot transfer (deployment without further real-world training or system identification).
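To make the clipped-update idea behind PPO concrete, the following is a minimal, illustrative sketch of the clipped surrogate objective in Python/NumPy. It is not the paper's training code; the array names and sizes are hypothetical.

```python
import numpy as np

def ppo_clipped_surrogate(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    log_prob_new / log_prob_old: log-probabilities of the taken actions under the
    current and behavior policies; advantages: estimated advantages (e.g., via GAE).
    """
    ratio = np.exp(log_prob_new - log_prob_old)              # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()             # pessimistic bound

# Tiny usage example with made-up numbers.
rng = np.random.default_rng(0)
lp_old = rng.normal(size=64)
lp_new = lp_old + 0.05 * rng.normal(size=64)
adv = rng.normal(size=64)
print(ppo_clipped_surrogate(lp_new, lp_old, adv))
```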
3.2. Previous Works
The paper discusses several previous works in whole-body loco-manipulation and robotic racket sports, highlighting both progress and existing gaps.
3.2.1. Whole-Body Loco-manipulation
- Decoupled Control: Historically, locomotion and manipulation tasks were often solved separately to reduce complexity.
  - [13] proposes an MPC controller for end-effector tracking on a quadruped, coupled with a separate RL-based controller for locomotion.
  - [14] assigns base movement, arm swing, and hand catching to three distinct policy modules for object catching.
  - [15] advocates for decoupling upper-body and lower-body control into separate agents for more precise end-effector tracking during robot movement.
  - Critical Reflection: While decoupling simplifies control, it can lead to suboptimal performance in tasks requiring tight coordination, as the base movement might not be fully optimized to support manipulation tasks dynamically.
- Unified Control Approaches: More recent efforts aim to unify lower- and upper-body control.
  - [17, 18, 19] leverage multi-critic architectures to facilitate the optimization of coordinated behaviors. Multi-critic approaches use multiple critic networks to evaluate different aspects of the policy, potentially leading to more stable and efficient learning of complex loco-manipulation tasks.
  - [20] introduces a physical feasibility reward design to guide unified policy learning.
  - [12, 21, 22] demonstrate that unified whole-body loco-manipulation is effective for dynamic tasks like racket sports and tossing, where leg contributions are crucial for power transmission and timing, not just locomotion.
  - Critical Reflection: These works pave the way for true whole-body coordination, recognizing that in dynamic tasks, the entire robot kinematic chain contributes to the overall objective.
3.2.2. Robotic Racket Sports
Robotic racket sports are considered key benchmarks due to the tight coupling between perception, target interception, and precise end-effector control.
- Table Tennis:
  - [8] achieved human-level table tennis with a highly modular and hierarchical policy architecture combining learned and engineered components, but noted its complexity and reliance on manual tuning and real-world data.
  - [10] demonstrated humanoid table tennis using an RL-based whole-body controller coupled with a model-based planner for ball trajectory and racket target planning. However, it relied on expert demonstrations for reference motions, which can impose style constraints.
  - [11] jointly trained an RL policy with an auxiliary learnable predictor for ball trajectories and racket targets. A key distinction here is that base motion was executed by a separate command, implying lower-body footwork and upper-body striking were not jointly optimized by a single policy.
  - Critical Reflection: Humanoid table tennis is challenging, but badminton significantly escalates these difficulties due to aerodynamics, larger swing amplitudes, and the necessity for deep footwork-strike coordination. The reliance on reference motions or decoupled control in previous table tennis works limits their applicability to the more dynamic badminton scenario.
- Badminton:
  - [12] introduced a perception-informed whole-body policy for badminton on a quadrupedal platform, achieving impressive shuttlecock tracking and striking.
  - Critical Reflection: While notable, quadrupedal robots have a much larger and more stable support polygon compared to humanoids, simplifying balance control during large-amplitude swings. This makes quadrupedal badminton a less direct challenge for whole-body coordination on a humanoid. The paper notes that simulated humanoid badminton policies in [12] did not exhibit badminton-style footwork, suggesting missing training signals for lower-body coordination.
3.3. Technological Evolution
The evolution of robotics control for dynamic tasks has moved from classical control methods (e.g., PID controllers, Model Predictive Control - MPC), which often rely on precise models, to learning-based approaches such as Reinforcement Learning. Early RL applications often focused on simpler tasks or required extensive expert demonstrations and reference motions. The trend has been towards:
- Unified Control: Integrating locomotion and manipulation into a single, whole-body controller to leverage the full kinematic chain.
- Increased Autonomy: Moving away from explicit motion priors or expert demonstrations, allowing the RL agent to discover optimal behaviors through interaction with the environment.
- Robustness and Sim2Real Transfer: Employing techniques like domain randomization to bridge the simulation-to-real-world gap and enable zero-shot deployment.
- End-to-End Learning: Reducing reliance on complex, hand-tuned prediction models by integrating perception and control more tightly, as seen in the prediction-free variant.

This paper's work fits squarely within this evolution by pushing the boundaries of unified whole-body control for humanoids in a highly dynamic, contact-rich task (badminton) without motion priors. It builds on RL advancements in loco-manipulation but specifically addresses the unique challenges of humanoid balance and footwork-strike synergy that previous table tennis or quadrupedal badminton systems did not fully tackle.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several core differences and innovations:
- Unified Whole-Body Controller without Motion Priors: Unlike [10], which used expert demonstrations for reference motions in table tennis, this paper's RL policy discovers energy-efficient swings and whole-body coordination purely through reinforcement learning. This simplifies implementation and avoids imposing potentially suboptimal style constraints.
- Deep Footwork-Strike Synergy: Many table tennis systems ([10, 11]) use separate commands for base motion or decouple lower-body and upper-body control. This paper explicitly fosters footwork-strike synergy on humanoids through its multi-stage curriculum, where lower-body motions are intrinsically linked to hitting accuracy and balance in a unified policy. This is critical for badminton's large-amplitude strokes.
- 3D Orientation-Aware Striking: While some table tennis works used virtual hit planes to simplify striking to a 2D problem, badminton inherently demands orientation-aware contacts throughout a 3D space. The proposed system addresses this directly.
- Humanoid-Specific Challenges: It specifically tackles the greater balance challenges of humanoids (narrow support polygon, high center of mass) during large-amplitude strokes, differentiating it from quadrupedal badminton ([12]), which has inherently simpler balance control.
- Multi-Stage RL Curriculum: The novel three-stage curriculum provides a structured way to progressively learn complex whole-body skills, starting from footwork, then swing precision, and finally task-focused refinement. This curriculum design is key to managing the credit assignment problem in complex RL tasks.
- Prediction-Free Variant: The introduction of a prediction-free variant that infers hitting targets implicitly from shuttle history is a significant step towards end-to-end control and greater robustness to aerodynamic variability, potentially simplifying deployment by removing explicit reliance on model-based predictors.

In essence, the paper provides a more integrated, autonomous, and physically challenging solution for robotic racket sports on humanoid platforms, specifically tailored to the dynamic demands of badminton.
4. Methodology
The core methodology revolves around training a unified whole-body controller for a humanoid robot to play badminton using a multi-stage reinforcement learning (RL) curriculum. This approach allows the robot to learn coordinated footwork and striking behaviors without relying on motion priors or expert demonstrations.
4.1. Principles
The core idea is to treat the complex task of playing badminton as a partially observable Markov Decision Process (MDP) and learn a parametric policy using reinforcement learning. The theoretical basis is that by carefully designing the observation space, action space, reward function, and a multi-stage training curriculum, the RL agent (the humanoid robot) can discover optimal whole-body coordination for dynamic interaction tasks. The intuition is to gradually introduce complexity, first teaching basic locomotion and target approach, then refining striking precision and swing dynamics, and finally optimizing for the task-specific objective while maintaining robustness.
For deployment, the system uses an Extended Kalman Filter (EKF) to predict shuttlecock trajectories, providing the RL policy with the necessary target information. A prediction-free variant is also explored, where the policy directly infers hitting targets from historical shuttlecock observations, aiming for greater end-to-end autonomy and robustness to aerodynamic uncertainty.
4.2. Core Methodology In-depth (Layer by Layer)
The system consists of several key components: the RL-based Dynamic Whole-Body Controller (with its multi-stage RL and multi-stage reward design), Model-based Hitting Target Generation and Prediction, and a Prediction-free variant.
4.2.1. RL-based Dynamic Whole-Body Controller
The problem is formulated as a partially observable Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, R)$.
- At time $t$, the (unobserved) state is $s_t \in \mathcal{S}$.
- The agent receives an observation $o_t \in \mathcal{O}$.
- Applies an action $a_t \in \mathcal{A}$.
- Transitions with probability $P(s_{t+1} \mid s_t, a_t)$.
- Accrues reward $r_t = R(s_t, a_t)$.

The goal is to learn a parametric policy $\pi_\theta(a_t \mid o_t)$ that maximizes the expected cumulative discounted reward: $ J(\theta) = \mathbb{E} \left[ \sum_t \gamma^t r_t \right] $ where $\gamma$ is the discount factor.
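As a concrete illustration of this objective, the following is a minimal sketch of the discounted return for a single episode; the reward values in the example are made up and not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of sum_t gamma^t * r_t for one episode."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last step backwards
        g = r + gamma * g
    return g

# Hypothetical per-step rewards around a single hit instant.
print(discounted_return([0.0, 0.0, 1.0, 0.5]))
```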
4.2.1.1. Observation Space
The actor's basic observation is an 87-dimensional vector. This vector combines:
- Proprioception information: data from the robot's own body sensors (e.g., joint positions, joint velocities, IMU data for base orientation and angular velocity).
- External sensing from Mocap: data about the robot's base position, orientation, and linear velocity obtained from the motion capture system.

To mitigate partial observability and provide more context, a history of past observations is stacked:
- Short/long history: 5 frames plus 20 frames of all joint positions, velocities, and actions, adding another 1470 features. This historical data helps the policy infer trends and compensate for noisy or incomplete instantaneous observations.

The critic's observation space is 98-dimensional. It includes all of the actor's observations plus privileged information:
- Noise-free base and joint states: ideal sensor readings without measurement noise, available in simulation for training but not in deployment.
- Racket speed: may be difficult to acquire accurately in the real world.
- Preemptive knowledge: information used to make the MDP well-posed for accurate value learning, such as the two subsequent hitting times, the next target position and orientation, and the number of remaining targets.

All quantities are expressed in the world frame.
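The sketch below illustrates how such observations could be assembled; the exact layout, ordering, and history padding are not specified in the paper, so the structure here is a hypothetical example of the stated ingredients (proprioception, Mocap base state, stacked joint/action history, and privileged critic-only terms).

```python
import numpy as np
from collections import deque

NUM_DOF = 21

class ObservationBuilder:
    """Illustrative actor/critic observation assembly (layout is hypothetical)."""

    def __init__(self, short_len=5, long_len=20):
        # A single rolling buffer standing in for the short + long history stacks.
        self.history = deque(maxlen=short_len + long_len)

    def push(self, q, dq, action):
        # Each history frame stores joint positions, joint velocities, and actions.
        self.history.append(np.concatenate([q, dq, action]))

    def actor_obs(self, proprio, base_state_mocap):
        frames = list(self.history)
        hist = np.concatenate(frames) if frames else np.zeros(0)
        return np.concatenate([proprio, base_state_mocap, hist])

    def critic_obs(self, actor_obs, privileged):
        # privileged: noise-free states, racket speed, next target pose/time, etc.
        return np.concatenate([actor_obs, privileged])
```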
4.2.1.2. Action Space
The policy outputs 21 joint position targets, one per DoF (degree of freedom). These targets are scaled by a unified action scale of 0.25. A low-level PD controller operating at 500 Hz tracks the desired joint positions, while policy inference runs at 50 Hz.
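A minimal sketch of the action-to-torque mapping at the low level follows, assuming the common convention that the scaled action offsets a default joint pose (a detail the paper does not spell out); the policy would refresh `policy_action` every 0.02 s while the PD law runs every 0.002 s.

```python
import numpy as np

POLICY_DT = 0.02      # 50 Hz policy inference
CONTROL_DT = 0.002    # 500 Hz low-level PD loop
ACTION_SCALE = 0.25   # unified action scale reported in the paper

def pd_torque(q_des, q, dq, kp, kd):
    """tau = Kp * (q_des - q) - Kd * dq, the usual position-tracking PD law."""
    return kp * (q_des - q) - kd * dq

def control_step(policy_action, q_default, q, dq, kp, kd):
    """Map a raw policy action to desired joint positions and PD torques.

    The offset-around-default-pose convention is an assumption for illustration.
    kp/kd would be per-joint gains such as those listed in Table I.
    """
    q_des = q_default + ACTION_SCALE * policy_action
    return pd_torque(q_des, q, dq, kp, kd)
```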
4.2.1.3. Episode Settings
Each episode (a single training run until termination or completion) contains six swing targets. This design encourages follow-through behavior and recovery/preparation between hits.
- Asymmetric Actor-Critic (PPO variant): Because the number of remaining targets within an episode varies (which influences the value function), an asymmetric actor-critic architecture is adopted [26]. The critic receives additional preemptive information (e.g., noise-free states, future targets) to stabilize value estimation, while the actor (which needs to be deployable) only uses observations available in the real world.
- Specific critic info: the two subsequent hitting times, the next target position and orientation, and the number of remaining targets.
4.2.1.4. Training Settings
- Algorithm: PPO [27] is used.
- Hardware: a single Nvidia RTX 4090.
- Parallelism: 4096 parallel environments in IsaacGym [28]. IsaacGym is a physics simulation platform optimized for GPU-based parallel reinforcement learning.
- Network Architecture: Both the actor and the critic are Multi-Layer Perceptrons (MLPs) with hidden layers of sizes (512, 256, 128) and ELU (Exponential Linear Unit) activations. ELU is an activation function similar to ReLU but allows negative values, which can help mitigate the vanishing gradient problem.
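For illustration, a PyTorch sketch of actor and critic MLPs with the reported hidden sizes and ELU activations; the input dimensions shown simply combine the stated base observation sizes with the stacked history and are an assumption about the exact concatenation.

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=(512, 256, 128)):
    """MLP with ELU activations, matching the reported hidden layer sizes."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ELU()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# Assumed input sizes: base observation plus stacked history features.
actor = make_mlp(in_dim=87 + 1470, out_dim=21)   # 21 joint position targets
critic = make_mlp(in_dim=98 + 1470, out_dim=1)   # scalar value estimate
```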
4.2.1.5. Three-stage curriculum
The training process is structured into three stages, with identical observation and action spaces across all stages. Only reward terms and their weights change between stages. Training in each stage continues from the checkpoint of the previous stage once its primary objective has converged. All training is carried out on a single RTX 4090.
- S1 — Footwork acquisition toward sampled hit regions:
  - Objective: Learn to approach target regions with a reasonable lower-limb gait and maintain a stable, forward-facing posture while traversing between six sampled hit locations within an episode. This stage primarily focuses on locomotion and base repositioning.
- S2 — Precision-guided swing generation:
  - Objective: Building on S1, introduce a sparse hit reward (active only at the scheduled hit instant) to enforce timing. Initially, loose pose accuracy is allowed to facilitate the emergence of a full swing; as training progresses, position and orientation accuracy are gradually tightened. Light swing-style regularization is added for human-like kinematics, and energy, torque, and collision constraints are strengthened for efficient, stable hits.
- S3 — Task-focused refinement:
  - Objective: Remove numerous locomotion-shaping regularizers (e.g., foot distance, step/contact symmetry, step height) and the target approach reward from S1 to avoid gradient interference with the primary hitting objective. Safety, energy, and hardware-limit constraint terms are retained. Domain randomization and observation noise are enabled to consolidate robustness for sim2real transfer.

The ablation study shows the necessity of this multi-stage approach: skipping S1 causes divergence, jumping from S1 to S3 makes the curriculum gap too large and leads to failure, and S3 significantly improves performance and robustness over S1+S2.
4.2.1.6. Multi-Stage Reward Design
A modular reward function is designed, comprising a locomotion-style term, an arm hitting term, and a global regularization term.

- S1 — Footwork acquisition:
  - Main Shaping: Encourages reaching the target region without requiring a precise approach.
    - Target approach reward: $ r_{\mathrm{track}} = \exp \bigl( - 4 \max ( d - 0.3, 0 ) \bigr) $ where $d$ is the 2D Euclidean distance between the projected end-effector target position and the robot's base position. The reward decays exponentially once the distance exceeds 0.3 m, encouraging the robot to get close to the target area.
  - Standard Style Shaping:
    - Base height and orientation: encourages a stable, upright posture.
    - Acceleration regularization: penalizes jerky movements.
    - Contact-aware footstep terms: air time (penalizes feet staying off the ground too long), touchdown speed (penalizes hard landings), step height (encourages natural step height), foot posture (maintaining proper foot orientation), and no-double-flight (prevents both feet from being in the air simultaneously for too long).
    - Simple gait symmetry shaping: encourages balanced left/right leg movements.
    - Face alignment: orients the robot towards the incoming shuttle.
  - Safety Constraints: action rate, joint position/velocity/acceleration, torque, and energy penalties, following legged_gym conventions [29].
  - Convergence: S1 typically converges within 1k iterations, yielding reliable target-region tracking with a natural gait.

- S2 — Precision-guided swing generation:
  - Reward Changes: Lower the weight of regional tracking and introduce a hit-instant reward with a large weight (activated six times per episode).
  - Hit Reward: Comprises two terms, hitting precision and racket swinging speed. Unlike [12], which separates position and orientation, this paper combines them.
    - Define the target racket normal $n^*$ as the $z$-axis of the target end-effector orientation.
    - Let $v_{\mathrm{ee}}$ be the end-effector linear velocity; the effective speed component in the direction of the target is $v_{\mathrm{ee}} \cdot n^*$.
    - Position error $e_{\mathrm{ee,pos}}$: the Euclidean distance between the current and target end-effector positions.
    - Orientation error $e_{\mathrm{ee,ori}}$: the angle between the current and target racket normals.
    - The hit reward, active only at the scheduled hit instant, is (an illustrative implementation is sketched after this list): $ r_{\mathrm{swing}} = \exp \Big( -\frac{e_{\mathrm{ee,pos}}^2}{\sigma_p} \Big) \exp \Big( -\frac{e_{\mathrm{ee,ori}}}{\sigma_r} \Big) + 0.3\, r_v, \quad r_v = 1 - \exp \Big( -\frac{\max(0,\, v_{\mathrm{ee}} \cdot n^*)}{\sigma_v} \Big), $ where $\sigma_p$ is the position tolerance, $\sigma_r$ the orientation tolerance, and $\sigma_v$ the racket speed tolerance.
  - Scheduled Tightening: Initial tolerances are wide to allow a full swing to emerge, then gradually tightened as training progresses.
  - Racket Speed Sigma: Kept fixed, balancing swing speed against accuracy and deployment stability.
  - Swing-Style Regularization:
    - Racket y-axis alignment: $ r_{y\text{-}\mathrm{align}} = \big( \hat{\mathbf{y}}_{\mathrm{ee}}^\top \hat{\mathbf{y}}_{\mathrm{world}} \big)^2, \quad \hat{\mathbf{y}}_{\mathrm{ee}} = R(q_{\mathrm{ee}})\, \mathbf{e}_y, $ where $\hat{\mathbf{y}}_{\mathrm{ee}}$ is the racket's local y-axis expressed in the world frame and $\hat{\mathbf{y}}_{\mathrm{world}}$ is the world's y-axis. This encourages a human-like backswing and a forward swing along the reverse of the backswing for a complete kinetic chain.
    - Default holding pose: $ r_{\mathrm{hold}} = - \sum_{j \in \mathcal{A}_{\mathrm{arm}}} \big( q_j - q_j^{\mathrm{hold}} \big)^2, $ where $\mathcal{A}_{\mathrm{arm}}$ is the set of arm joint indices. This improves deploy-time stability and recoverability when no shuttle is launched.
  - Additional Penalties: collision penalties are added, and energy and torque costs are strengthened.
  - Progression: The policy first learns to bring the racket near the target, then develops an early backswing, and finally a smooth backswing-swing-recovery with peak velocity near the scheduled hit instant.

- S3 — Task-focused refinement:
  - Reward Changes: Removes the target approach reward and many gait-shaping terms (e.g., foot distance, contact and step symmetry, step height) to prevent gradient conflict with the hitting objective.
  - Retained Terms: Global regularization terms and the hit rewards are kept.
  - Robustness: Domain randomization and observation noise are enabled to consolidate robustness.
  - Outcome: Hitting metrics improve by 3-5%, while energy and torque costs decrease by approximately 20%, indicating more task-optimal and efficient behavior.

- Global Regularization (all stages): Includes action rate penalties, joint position/velocity/acceleration limits, and torque limits, following legged_gym practices.
4.2.2. Model-based Hitting Target Generation and Prediction
4.2.2.1. Generate Shuttlecock Trajectory for Training
A physics-based simulation approach is used to generate badminton flight trajectories, following the dynamics model from [30].
- The shuttlecock's flight state is updated using the equation: $ m \frac{d\mathbf{v}}{dt} = m\mathbf{g} - m \frac{\| \mathbf{v} \| \mathbf{v}}{L} $ where:
  - $m$: shuttlecock mass.
  - $\mathbf{v}$: shuttlecock velocity.
  - $\mathbf{g}$: gravitational acceleration.
  - $L$: aerodynamic characteristic length, defined in terms of the air density $\rho$, the cross-sectional area of the shuttlecock $S$, and the drag coefficient $C_D$.
  - The computed aerodynamic length used in this work is 3.4.
- Trajectory Filtering: Generated trajectories must meet specific conditions:
  - Hitting zone: interception points must lie within a predefined box in $x$, $y$, and $z$ (in meters); the $y$-direction range is asymmetric due to the right-hand racket setting.
  - Minimum traversal time: 0.8 seconds. This filtering ensures realistic and relevant training data.
- Training Dataset: Selected trajectories are combined with the corresponding interception-point data, including position, orientation (trajectory tangential line as racket normal), and timing features. These are stored as tensors for training.
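A minimal sketch of generating one trajectory with this drag model by explicit Euler integration; the step size and stopping criteria are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])
L_AERO = 3.4  # aerodynamic characteristic length reported in the paper

def simulate_shuttle(p0, v0, dt=0.002, t_max=3.0):
    """Forward-integrate m*dv/dt = m*g - m*|v|*v / L (mass cancels out)."""
    p, v = np.array(p0, float), np.array(v0, float)
    t, ts, ps = 0.0, [0.0], [p.copy()]
    while t < t_max and p[2] > 0.0:
        a = G - np.linalg.norm(v) * v / L_AERO
        v = v + a * dt
        p = p + v * dt
        t += dt
        ts.append(t)
        ps.append(p.copy())
    return np.array(ts), np.array(ps)

# Example serve with plausible initial conditions.
times, positions = simulate_shuttle(p0=[6.5, 0.0, 1.5], v0=[-18.0, 0.0, 12.0])
```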
4.2.2.2. Shuttlecock Trajectory Prediction for Deployment
In deployment, an EKF-based trajectory prediction algorithm provides real-time shuttlecock flight estimation and hitting point prediction.
- EKF Implementation: Adheres to the same badminton dynamics model as used for trajectory generation.
- Phases: Operates through a measurement update (incorporating new sensor data) and a prediction step (forecasting the state).
- Hitting Target Prediction: When the predicted height first falls into a predefined interception box, the spatial coordinates and time of traversal are designated as the predicted hitting target, i.e., the target end-effector position and the target hitting time.
- Racket Orientation: The corresponding predicted velocity vector at the interception point is converted into a quaternion representation to guide the robot's end-effector orientation.
- Activation and Update: Prediction is activated once sufficient trajectory history has been collected, and it continuously updates the hitting target in a rolling manner, feeding it to the controller.
4.2.3. Prediction-free variant
This variant modifies only the observation space of the policy; the reward function, action space, and three-stage training schedule remain identical.
- Actor Observation: The actor no longer receives the commanded hitting position, orientation, and time. Instead, it receives a sliding window of world-frame shuttle positions: the current frame and the previous five frames, sampled at 50 Hz (i.e., roughly a 100 ms window).
- Implicit Inference: From this short history, the actor must implicitly infer the intended interception pose and timing.
- Critic Access: The critic retains privileged access to the "actual" interception point, which helps stabilize value learning.
- Training Data Generation: For each commanded target, the entire shuttle trajectory (integrated by the forward badminton dynamics) is stored, not just the single interception point. This allows the reward and the critic to know the ground-truth hit specification.
- Advantages:
  - Deployment no longer relies on a hand-tuned predictor (e.g., an EKF plus a parametric aerodynamic model).
  - The controller conditions directly on measured shuttle positions, making the overall pipeline more end-to-end.
  - Explicit dependence on the predictor and on aerodynamic parameters is reduced.
- Domain Randomization for Robustness: During training, aerodynamic parameters are randomized per shot, exposing the policy to an "aerodynamic patch" whose exact coefficients are unknown. This mimics how humans infer landing tendencies from a brief flight segment.
- Current Status: Provides initial simulation evidence of comparable performance, but real-robot validation is left for future work.
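A small sketch of the sliding-window shuttle observation used by the prediction-free actor; the padding behavior before enough frames are available is an assumption for illustration.

```python
import numpy as np
from collections import deque

class ShuttleWindow:
    """Six most recent world-frame shuttle positions at 50 Hz (current + 5 past)."""

    def __init__(self, n_frames=6):
        self.buf = deque(maxlen=n_frames)

    def push(self, p_shuttle_world):
        self.buf.append(np.asarray(p_shuttle_world, float))

    def as_obs(self):
        frames = list(self.buf)
        while len(frames) < self.buf.maxlen:
            # Pad with the oldest available frame (or zeros) before enough history exists.
            frames.insert(0, frames[0] if frames else np.zeros(3))
        return np.concatenate(frames)   # 18-dimensional observation slice
```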
The following are the results from Table I of the original paper:
| Joint | Stiffness (N·m/rad) | Damping (N·m·s/rad) |
| --- | --- | --- |
| hip_pitch | 100 | 10 |
| hip_roll | 100 | 10 |
| hip_yaw | 100 | 10 |
| knee | 50 | 10 |
| ankle_pitch | 50 | 5 |
| ankle_roll | 5 | 5 |
| waist_yaw | 100 | 10 |
| shoulder_pitch | 50 | 10 |
| shoulder_roll | 50 | 10 |
| shoulder_yaw | 5 | 5 |
| elbow_pitch | 50 | 10 |
Table I: Per-joint PD gains. This table lists the proportional (stiffness, in N·m/rad) and derivative (damping, in N·m·s/rad) gains for each joint (DoF) of the humanoid robot. These gains are used by the low-level PD controller to track the joint position targets output by the RL policy. Different joints have different stiffness and damping values, reflecting their roles and dynamic requirements (e.g., hip and waist joints have higher stiffness for base stability, while ankle joints might have lower values to allow more compliance).
The following are the results from Table II of the original paper:
| Actuator model (PhyArc series) | 47 | 68 | 78 | 102 |
| Size | 47×68 | 68×75 | 78×79 | 102×54.8 |
| Reducer type | Cycloid | Cycloid | Cycloid | Cycloid |
| Reduction ratio | 25 | 25 | 25 | 25 |
| Rated speed (RPM) | 100 | 100 | 100 | 100 |
| Rated torque (N·m) | 3 | 28 | 40 | 60 |
| No-load speed (RPM) | 343.8 | 181.5 | 120 | 124.2 |
| Rated power (W) | 216 | 720 | 720 | 1080 |
| Peak power (W) | 864 | 4608 | 4608 | 5760 |
| Peak torque (N·m) | 12 | 96 | 136 | 244 |
| Rotor inertia | 0.00719 | 0.0339 | 0.0634 | 0.1734 |
Table II: Actuator constants. This table summarizes the specifications of the four PhyArc actuator modules used in the robot. It includes physical size, cycloidal reducer type, reduction ratio, rated and no-load speeds (RPM), rated and peak torque (N·m), rated and peak power (W), and rotor inertia. These constants are crucial for setting torque/velocity limits, defining DoF properties, estimating power consumption, and performing safety checks during experiments, especially in simulation for realistic dynamics. Cycloidal reducers are known for high precision, low backlash, and high torque density.
5. Experimental Setup
The framework is validated through five sets of experiments: two in simulation and three in the real world.
5.1. Datasets
The training data for the RL policy is generated through a physics-based simulation of badminton flight trajectories.
- Source: A physics-based simulation approach following the badminton dynamics model in [30].
- Scale: 2 million raw trajectories were generated, from which 196,940 met specific criteria and were selected for robot training.
- Characteristics and Domain:
  - Initial Conditions: The initial positions and velocities for trajectory generation are randomly sampled within specific ranges to ensure diversity in the training data, as detailed in [12]: $ \begin{array}{lll} p_{x,t_0} \sim U(5, 8), & p_{y,t_0} \sim U(-2, 2), & p_{z,t_0} \sim U(-0.5, 2.5), \\ v_{x,t_0} \sim U(-25, -13), & v_{y,t_0} \sim U(-3, 3), & v_{z,t_0} \sim U(9, 18), \end{array} $ where:
    - $p_{x,t_0}, p_{y,t_0}, p_{z,t_0}$: initial position coordinates (x, y, z) of the shuttlecock at time $t_0$.
    - $v_{x,t_0}, v_{y,t_0}, v_{z,t_0}$: initial velocity components (x, y, z) of the shuttlecock at time $t_0$.
    - $U(a, b)$: a uniform distribution over the interval $[a, b]$.
  - Filtering Criteria: Trajectories are filtered so that interception points fall within a specific hitting zone in $x$, $y$, and $z$; a minimum traversal time of 0.8 seconds is also required. The asymmetric $y$-range is due to the right-hand racket setting. (A small sampling-and-filtering sketch follows this list.)
  - Output Data: The selected trajectories are combined with the corresponding interception-point data (position, orientation as the trajectory tangential line, and timing features) to form the training dataset.
  - Distribution: The majority of selected trajectories reach the hitting zone within a time interval of [0.8, 1.4] seconds.
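As referenced above, the sampling-and-filtering loop can be sketched as follows; the hitting-zone bounds are passed in as placeholders (the exact box is not reproduced here), and the trajectory integrator is assumed to be the drag model of Section 4.2.2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_conditions():
    """Draw a serve from the uniform ranges reported for the training set."""
    p0 = np.array([rng.uniform(5, 8), rng.uniform(-2, 2), rng.uniform(-0.5, 2.5)])
    v0 = np.array([rng.uniform(-25, -13), rng.uniform(-3, 3), rng.uniform(9, 18)])
    return p0, v0

def keep_trajectory(times, positions, zone_x, zone_y, zone_z, min_time=0.8):
    """Keep a trajectory if it first enters the hitting zone no earlier than min_time."""
    for t, p in zip(times, positions):
        in_zone = (zone_x[0] <= p[0] <= zone_x[1] and
                   zone_y[0] <= p[1] <= zone_y[1] and
                   zone_z[0] <= p[2] <= zone_z[1])
        if in_zone:
            return t >= min_time
    return False
```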
The following are the results from Figure 8 of the original paper:
This image is a 3D illustration of generated shuttlecock flight trajectories. Lines of different colors represent different trajectories, and the bounding box marks the range of interception points used for robot training.
Figure 8: Trajectory generation. Shuttlecock trajectories are filtered to ensure interception points within a bounded region for robot training. This figure visualizes the range of generated shuttlecock trajectories in 3D space, highlighting the specific hitting zone (a bounding box) from which trajectories are selected for training. Different colored lines represent individual flight paths, demonstrating the diversity and coverage of the generated data.
-
The following are the results from Figure 9 of the original paper:

This image is a 3D plot showing an example shuttlecock flight trajectory (gold) and its interception point (red). The annotations give the interception position (0.188, -0.701, 1.541) m, the time of interception (0.963 s), and the orientation as a quaternion.
Figure 9: Individual trajectory analysis. An example of a sampled shuttle flight trajectory (gold) with the selected interception point (red). The corresponding target frame at the intercept is drawn, with one axis aligned with the incoming-flight direction. The annotation reports the intercept position, orientation, and time-to-intercept. This figure provides a detailed view of a single simulated shuttlecock trajectory, showing its path, the designated interception point (red sphere), and the coordinate frame at that point. It also lists the numerical values for intercept position, orientation (as a quaternion), and time-to-intercept, which are critical components of the target tuple used for training.
The following are the results from Figure 10 of the original paper:

This image is a chart showing the distribution of shuttlecock crossing times for the interception height band (z ∈ [1.5, 1.6]). Frequency is shown as blue bars and probability as an orange curve, indicating how many trajectories fall in each time interval.
Figure 10: Training trajectory statistics. Distribution of the shuttlecock interception time. This histogram shows the distribution of shuttlecock interception times for the 196,940 trajectories selected for training. The x-axis represents the interception time in seconds, and the y-axis shows the frequency (blue bars) and probability (orange curve). This illustrates that most trajectories fall within a range suitable for the robot's reaction capabilities, primarily between 0.8 and 1.4 seconds.
- Rationale for Dataset Choice: These synthetic datasets are crucial because real-world collection of diverse shuttlecock trajectories with precise ground-truth interception points for RL training is prohibitively difficult and time-consuming. The physics-based simulation allows for generating a vast and varied dataset covering the operational space of the robot, ensuring that the policy is exposed to a wide range of scenarios during training. The filtering process ensures the generated data is relevant and feasible for the robot's capabilities.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the performance of the system:
- Rally Length:
  - Conceptual Definition: Measures the number of consecutive successful returns between two robots in simulation. It quantifies the system's ability to maintain a continuous game-play sequence, demonstrating robustness, recovery, and high-quality returns.
  - Mathematical Formula: Not explicitly provided; it is a count of successful hits.
  - Symbol Explanation: the count of successfully returned shuttlecocks before an error (fall, out-of-bounds, net, missed hit).
- Position Error at Impact ($e_{\mathrm{ee,pos}}$):
  - Conceptual Definition: The Euclidean distance between the actual point of contact (racket center) and the designated target interception position at the moment of impact. It quantifies the accuracy of the robot's spatial positioning for hitting.
  - Mathematical Formula: $ e_{\mathrm{ee,pos}} = \| p_{\mathrm{ee}} - p_{\mathrm{ee}}^* \| $
  - Symbol Explanation:
    - $p_{\mathrm{ee}}$: the actual 3D position of the end-effector (racket center) at impact.
    - $p_{\mathrm{ee}}^*$: the target 3D position for the end-effector at impact.
    - $\| \cdot \|$: Euclidean norm (distance).
- Orientation Error at Impact ($e_{\mathrm{ee,ori}}$):
  - Conceptual Definition: The angular difference between the actual orientation of the racket face and the target orientation of the racket face at the moment of impact. It quantifies the accuracy of the robot's racket angle for hitting the shuttlecock in the desired direction.
  - Mathematical Formula: $ e_{\mathrm{ee,ori}} = \mathrm{ang\_err}_z (q_{\mathrm{ee}}, q_{\mathrm{ee}}^*) $, where $\mathrm{ang\_err}_z$ is the angle between the current and target racket normals.
  - Symbol Explanation:
    - $q_{\mathrm{ee}}$: the actual orientation (as a quaternion) of the end-effector (racket).
    - $q_{\mathrm{ee}}^*$: the target orientation (as a quaternion) of the end-effector (racket).
    - $\mathrm{ang\_err}_z$: a function that computes the angle between the z-axes of the two input quaternions, representing the normal direction of the racket face.
- Executed Swing Speed ($v_{\mathrm{swing}}$):
  - Conceptual Definition: The linear speed of the racket at the moment of impact with the shuttlecock. It indicates the power and dynamics of the robot's swing. The effective component in the target direction is used for the reward, but the total speed is often reported.
  - Mathematical Formula: $ v_{\mathrm{swing}} = \| v_{\mathrm{ee}} \| $
  - Symbol Explanation:
    - $v_{\mathrm{ee}}$: the linear velocity vector of the end-effector (racket center) at impact.
    - $\| \cdot \|$: Euclidean norm (magnitude of the velocity vector).
- EKF Prediction Error (Position & Time):
  - Conceptual Definition: Measures the accuracy of the Extended Kalman Filter in predicting the shuttlecock's future position and time of arrival at the interception point. This is crucial for the controller to plan its actions.
  - Mathematical Formula (Position Error): $ e_{\mathrm{pred,pos}} = \| p_{\mathrm{sh,pred}} - p_{\mathrm{sh,true}} \| $
  - Mathematical Formula (Time Error): $ e_{\mathrm{pred,time}} = | t_{\mathrm{sh,pred}} - t_{\mathrm{sh,true}} | $
  - Symbol Explanation:
    - $p_{\mathrm{sh,pred}}$: the predicted 3D position of the shuttlecock at interception.
    - $p_{\mathrm{sh,true}}$: the ground-truth 3D position of the shuttlecock at interception.
    - $t_{\mathrm{sh,pred}}$: the predicted time of shuttlecock interception.
    - $t_{\mathrm{sh,true}}$: the ground-truth time of shuttlecock interception.
    - $\| \cdot \|$: Euclidean norm; $| \cdot |$: absolute value.
- Outgoing Shuttle Speed:
  - Conceptual Definition: The speed of the shuttlecock immediately after being hit by the robot. This indicates the power and effectiveness of the robot's return shot.
  - Mathematical Formula: The paper uses an elastic interaction model to compute the outgoing velocity (a sketch of this computation follows this list): $ v_{\mathrm{racket},n} = (v_{\mathrm{racket}} \cdot n_{\mathrm{racket}})\, n_{\mathrm{racket}}, \quad v_{\mathrm{shuttle},n} = (v_{\mathrm{incoming}} \cdot n_{\mathrm{racket}})\, n_{\mathrm{racket}}, \quad v_{\mathrm{out}} = v_{\mathrm{incoming}} - 2 v_{\mathrm{shuttle},n} + 2 v_{\mathrm{racket},n}. $ The outgoing shuttle speed is then $\| v_{\mathrm{out}} \|$.
  - Symbol Explanation:
    - $v_{\mathrm{racket}}$: velocity of the racket at impact.
    - $n_{\mathrm{racket}}$: normal vector of the racket face at impact.
    - $v_{\mathrm{incoming}}$: incoming velocity of the shuttlecock.
    - $v_{\mathrm{racket},n}$: component of the racket velocity normal to the racket face.
    - $v_{\mathrm{shuttle},n}$: component of the incoming shuttle velocity normal to the racket face.
    - $v_{\mathrm{out}}$: outgoing shuttle velocity vector.
- Mean Return Landing Distance:
  - Conceptual Definition: The average distance from the interception area at which the robot's returned shots land. This metric assesses the quality and depth of the return shots.
  - Mathematical Formula: Not explicitly provided; it implies measuring the landing spot of returned shuttles relative to a reference point on the court.
  - Symbol Explanation: an average distance in meters.
- Hit Success Rate:
  - Conceptual Definition: The percentage of shuttlecocks successfully returned by the robot. A successful hit is defined by a position error below a fixed threshold and an orientation error below 0.2 rad.
  - Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Hits}}{\text{Total Number of Attempts}} \times 100\% $
  - Symbol Explanation: a percentage value.
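As referenced in the outgoing-speed metric, the elastic interaction model can be transcribed directly as follows; the numbers in the usage example are made up.

```python
import numpy as np

def outgoing_velocity(v_incoming, v_racket, n_racket):
    """Elastic reflection model: mirror the normal component of the incoming
    velocity and add twice the racket's normal velocity component."""
    n = n_racket / np.linalg.norm(n_racket)
    v_racket_n = np.dot(v_racket, n) * n
    v_shuttle_n = np.dot(v_incoming, n) * n
    return v_incoming - 2.0 * v_shuttle_n + 2.0 * v_racket_n

v_out = outgoing_velocity(v_incoming=np.array([-15.0, 0.0, -3.0]),
                          v_racket=np.array([8.0, 0.0, 4.0]),
                          n_racket=np.array([1.0, 0.0, 0.5]))
outgoing_speed = np.linalg.norm(v_out)   # the reported outgoing shuttle speed
```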
5.3. Baselines
The paper primarily focuses on presenting its novel multi-stage RL training pipeline and does not conduct direct comparisons against external baseline methods from other research groups for the overall badminton task. Instead, it uses internal comparisons and ablations:
- Target-Known Policy vs. Prediction-Free Policy (Internal Comparison): This is a key comparison within the paper's own framework.
  - Target-Known Policy: the main policy, which receives the explicit planned interception position, orientation, and time from the EKF.
  - Prediction-Free Policy: an alternative variant that learns to infer the impact pose and timing solely from short-history shuttle positions.
  - Representativeness: This comparison evaluates the effectiveness and robustness of the end-to-end learning approach versus a modular perception-control pipeline.
- Ablation Studies on Curriculum Stages (Internal Ablation):
  - The paper implicitly compares the full three-stage curriculum (S1+S2+S3) against ablated versions (e.g., S1+S2, skipping S1, skipping S2).
  - Representativeness: This ablation is critical for validating the design choices of the multi-stage curriculum and demonstrating the necessity of each stage for reliable convergence and optimal performance.

While there are references to other works such as humanoid table tennis ([10, 11]) or quadrupedal badminton ([12]), the paper highlights how its humanoid badminton task differs significantly in complexity and whole-body coordination requirements, making direct quantitative comparison difficult given the absence of other published real-world humanoid badminton systems. Therefore, the core baseline for this work is the performance of its own Target-Known policy and the incremental improvements achieved through its multi-stage curriculum and the Prediction-Free variant.
The following are the results from Table III of the original paper:
| Hyperparameter | Setting |
| discount factor | 0.99 |
| GAE lambda | 0.95 |
| learning rate | adaptive |
| KLD target | 0.01 |
| clip param | 0.2 |
| control dt (s) | 0.02 |
| num. envs | 4096 |
| actor MLP size | (512, 256, 128) |
| critic MLP size | (512, 256, 128) |
| network activation | elu |
| optimizer | AdamW |
Table III: Hyperparameter configuration. This table details the hyperparameters used for the PPO algorithm during training. Key parameters include:
- discount factor (0.99): determines the importance of future rewards; a value close to 1 emphasizes long-term rewards.
- GAE lambda (0.95): the Generalized Advantage Estimation (GAE) parameter, used to balance bias and variance in advantage estimation.
- learning rate (adaptive): adjusts the step size during optimization.
- KLD target (0.01): the Kullback-Leibler divergence target, a constraint used in PPO to limit policy updates.
- clip param (0.2): PPO's clipping parameter, which limits how far the new policy can deviate from the old policy during an update step.
- control dt (0.02 s): the time step of the policy's control loop (corresponding to 50 Hz).
- num. envs (4096): the number of parallel simulation environments used in IsaacGym for efficient data collection.
- actor/critic MLP size ((512, 256, 128)): the architecture of the Multi-Layer Perceptron (MLP) networks for the actor and critic, i.e., the number of neurons in each hidden layer.
- network activation (elu): the activation function used in the neural networks.
- optimizer (AdamW): the optimization algorithm used to update network weights.

The following are the results from Table IV of the original paper:

| Domain randomization | Range | Observation noise | Range |
| --- | --- | --- | --- |
| friction range | [0.5, 1.0] | dof_pos | 0.05 |
| push interval | 5 | dof_vel | 0.2 |
| max push vel_xy | 0.5 | lin_vel | 0.2 |
| max push ang_vel | 0.5 | ang_vel | 0.1 |
| added base_com range | [-0.08, 0.08] | gravity | 0.1 |
| join_friction range | [0.01, 1.0] | | |
| added inertia range | [0.01, 0.1] | | |

Table IV: Domain randomization and observation noise. This table details the parameters for domain randomization and observation noise applied during training. These techniques are crucial for improving sim2real transferability.
- Domain Randomization:
  - friction range ([0.5, 1.0]): randomizes friction coefficients of surfaces.
  - push interval (5): frequency of external pushes applied to the robot.
  - max push vel_xy (0.5): maximum horizontal velocity change from external pushes.
  - max push ang_vel (0.5): maximum angular velocity change from external pushes.
  - added base_com range ([-0.08, 0.08]): randomly shifts the center of mass of the robot's base.
  - join_friction range ([0.01, 1.0]): randomizes friction in the robot's joints.
  - added inertia range ([0.01, 0.1]): randomly adds inertia to robot links.
- Observation Noise:
  - dof_pos (0.05): noise added to joint position observations.
  - dof_vel (0.2): noise added to joint velocity observations.
  - lin_vel (0.2): noise added to linear velocity observations.
  - ang_vel (0.1): noise added to angular velocity observations.
  - gravity (0.1): perturbation of the gravity vector.

The following are the results from Table V of the original paper:
| Term | S1 weight | S2 weight | S3 weight |
| --- | --- | --- | --- |
| base_height | 5 | 5 | 5 |
| base_ang_vel_xy | -10 | -10 | -10 |
| base_orientation | -50 | -50 | -50 |
| contact_no_vel | -10 | -10 | -10 |
| feet_orientation | 10 | 10 | 10 |
| feet_no_fly | -2 | -2 | -2 |
| feet_height | 10 | 10 | — |
| feet_distance | 2 | 2 | — |
| air_time & land_time | -500 | -500 | -500 |
| sym_contact_forces | 1 | 1 | — |
| sym_step | -5 | -5 | — |
| face_the_net | 8 | 5 | 5 |
| target_approach | 30 | 15 | — |
| hit reward | — | 4000 | 4000 |
| other hit-related terms | — | 5, 10 | 5, 10 |
| action_rate | -0.8 | -0.8 | -0.8 |
| dof_pos_limit | -30 | -30 | -30 |
| dof_vel_limit | -0.1 | -0.1 | -0.1 |
| dof_torque_limit | -0.1 | -0.1 | -0.5 |
| dof_acc | -5×10⁻⁵ | -5×10⁻⁵ | -5×10⁻⁵ |
| dof_vel | -1×10⁻³ | -1×10⁻³ | -1×10⁻³ |
| dof_torque | -1×10⁻⁴ | -1×10⁻⁴ | -1×10⁻⁴ |
| momentum_positive | 5 | 5 | 5 |
| energy | — | -0.01 | -0.01 |
| collision | — | -10 | -10 |
-
Table V: Reward weights by stage. This table provides a detailed breakdown of the reward terms and their respective weights across the three training stages (S1, S2, S3). It explicitly shows how the curriculum gradually shifts focus (a minimal configuration sketch follows this list):
- S1 (Footwork acquisition): Heavy emphasis on locomotion terms, particularly base orientation, air time/land time (for stable stepping), and target_approach. Hitting rewards are not active.
- S2 (Precision-guided swing generation): The target_approach weight is reduced, and crucially, the hitting rewards are introduced with high weights, the primary one at 4000. Energy and collision penalties are also added.
- S3 (Task-focused refinement): Many gait-shaping locomotion terms are removed (e.g., feet_height, feet_distance, sym_contact_forces, sym_step), allowing the policy to optimize more directly for the hitting objective. The dof_torque_limit penalty is increased, further encouraging energy efficiency.
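The sketch below shows one way the per-stage weights could be organized so the curriculum simply swaps weight tables between stages. The dictionary layout and the `total_reward` helper are assumptions; the numeric weights are a subset copied from Table V, and the unnamed hitting-reward terms are marked as placeholders.

```python
# Per-stage reward weights (subset of Table V). "hit_primary" / "hit_aux_*" stand in
# for the hitting-reward terms whose exact names are not given in this summary.
REWARD_WEIGHTS = {
    "S1": {"base_height": 5, "base_orientation": -50, "air_time_land_time": -500,
           "face_the_net": 8, "target_approach": 30,
           "action_rate": -0.8, "dof_torque_limit": -0.1},
    "S2": {"base_height": 5, "base_orientation": -50, "air_time_land_time": -500,
           "face_the_net": 5, "target_approach": 15,
           "hit_primary": 4000, "hit_aux_1": 5, "hit_aux_2": 10,
           "energy": -0.01, "collision": -10,
           "action_rate": -0.8, "dof_torque_limit": -0.1},
    "S3": {"base_height": 5, "base_orientation": -50, "air_time_land_time": -500,
           "face_the_net": 5,
           "hit_primary": 4000, "hit_aux_1": 5, "hit_aux_2": 10,
           "energy": -0.01, "collision": -10,
           "action_rate": -0.8, "dof_torque_limit": -0.5},  # tightened from -0.1 in S1/S2
}

def total_reward(stage: str, reward_terms: dict) -> float:
    """Weighted sum over whichever reward terms are active in the given stage."""
    return sum(w * reward_terms.get(name, 0.0) for name, w in REWARD_WEIGHTS[stage].items())
```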
6. Results & Analysis
6.1. Core Results Analysis
The experiments are designed to test accuracy, agility, robustness, and deployability in both simulation and the real world. A hit is counted as successful when the racket's position error falls below a small threshold and its orientation error stays below 0.2 rad, tolerances motivated by the racket's geometry (a sketch of this check follows).
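A minimal sketch of such a success check is given below, comparing the racket-face position and normal against the commanded target. The 0.10 m position tolerance is an assumed placeholder, since the exact value is not stated in this summary; only the 0.2 rad orientation bound comes from the paper.

```python
import numpy as np

def hit_success(racket_pos, target_pos, racket_normal, target_normal,
                pos_tol: float = 0.10, ang_tol: float = 0.2) -> bool:
    """Return True if the racket is close enough to the target pose at impact.
    pos_tol (m) is an assumed placeholder; ang_tol (rad) matches the 0.2 rad bound."""
    pos_err = np.linalg.norm(np.asarray(racket_pos, float) - np.asarray(target_pos, float))
    n1 = np.asarray(racket_normal, float); n1 /= np.linalg.norm(n1)
    n2 = np.asarray(target_normal, float); n2 /= np.linalg.norm(n2)
    ang_err = np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))
    return pos_err < pos_tol and ang_err < ang_tol
```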
6.1.1. Simulation Results
The following are the results from Figure 3 of the original paper:

This illustration shows two humanoid robots playing badminton. Panel (a) depicts the positions and motion trajectories of the two robots across 21 hits, annotating the hitting points and the shuttlecock's flight paths. Panel (b) shows the prediction-free policy and panel (c) the target-known policy, with red spheres marking the commanded hitting positions and green spheres marking successful hits.
Figure 3: Simulation results. Figure (a) illustrates the Two-Robot Rally scenario, where two identical humanoid robots sustain a rally of 21 consecutive returns. Figure (b) demonstrates the Prediction-Free policy: the robot infers the optimal impact position and orientation solely from the first five recorded shuttlecock positions after serving. Figure (c) presents the Target-Known policy, where a predetermined hitting position is provided. The red sphere indicates the designated hitting location, while the green sphere confirms successful impact execution by the robot.
6.1.1.1. Two-Robot Rally
- Setup: Two identical humanoids, both running the Target-Known policy, face each other on a scaled badminton court. The outgoing shuttle velocity after each hit is computed assuming an elastic interaction (a minimal sketch of such a model appears after this list). The rally continues as long as the shuttle is returned successfully (clears the net, lands in bounds, no falls, no misses).
- Results: The robots sustained a rally of 21 consecutive returns.
- Analysis: This demonstrates that the controller is capable of repeatedly:
  - Repositioning: moving to intercept the shuttle.
  - Returning shuttles with high quality: the hits are accurate enough to continue the rally.
  - Recovering posture: maintaining balance and preparing for the next shot over extended exchanges.
  This is a strong validation of the whole-body coordination and robustness of the learned policy.
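The elastic interaction mentioned in the setup could be modeled as below: the shuttle's velocity relative to the racket is reflected about the racket-face normal. This is only a sketch under a fully elastic assumption; the paper does not spell out its contact model, so the function and the restitution parameter are illustrative.

```python
import numpy as np

def elastic_return_velocity(shuttle_vel, racket_vel, racket_normal, restitution: float = 1.0):
    """Reflect the shuttle's relative velocity about the racket-face normal.
    restitution = 1.0 corresponds to a fully elastic hit; this is an assumed model,
    not necessarily the one used in the paper's simulator."""
    n = np.asarray(racket_normal, float)
    n = n / np.linalg.norm(n)
    v_rel = np.asarray(shuttle_vel, float) - np.asarray(racket_vel, float)
    v_rel_reflected = v_rel - (1.0 + restitution) * np.dot(v_rel, n) * n
    return v_rel_reflected + np.asarray(racket_vel, float)
```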
6.1.1.2. Target-Known vs. Prediction-Free Comparison
- Setup: Compares the Target-Known policy (receiving the planned interception position, orientation, and time) with the Prediction-Free policy (observing the current and a five-frame history of shuttle positions) under identical simulation conditions. The Prediction-Free policy also experiences moderate variations in aerodynamic characteristic length to simulate different shuttles. Each policy executes twenty random hits.
- Metrics: Position error, orientation error at impact, and executed swing speed.

The following are the results from Figure 4 of the original paper:
This chart compares the target-known and prediction-free policies. The top panel compares position error, the middle panel compares orientation error, and the bottom panel compares swing speed; the two policies show clear differences across trials.
Figure 4: Comparison between the target-known and prediction-free policies. The top part of this figure shows the position error for both strategies. The middle section shows the orientation error comparison, where the orientation corresponds to the normal direction of the racket face. The bottom part compares swing velocity.
- Results (from Figure 4):
  - Position Error: Both policies achieve very low position error; the Target-Known policy is slightly more accurate, and the Prediction-Free policy remains within acceptable bounds.
  - Orientation Error: Both show good orientation accuracy, with the Prediction-Free variant exhibiting a modest increase in error.
  - Swing Velocity: Both policies achieve comparable swing velocities.
- Analysis: The results indicate that explicit target information is not indispensable. The Prediction-Free policy can infer both the hitting target and timing on its own, with only a modest drop in performance compared to the Target-Known policy. This provides initial evidence for the viability of a more end-to-end control strategy that is robust to aerodynamic variability and does not rely on a separate predictor (a sketch of the fixed-rate shuttle-position history such a policy consumes is shown below).
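As an illustration only, the buffer below keeps the most recent five shuttle positions sampled at the 50 Hz control rate and flattens them into an observation vector. The class name and the zero-padding convention are assumptions; the paper does not describe its exact observation layout.

```python
from collections import deque
import numpy as np

class ShuttleHistoryBuffer:
    """Hold the latest N shuttle positions sampled at the control rate (50 Hz here),
    flattened into a fixed-size observation vector and zero-padded until full."""
    def __init__(self, history_len: int = 5, dim: int = 3):
        self.history_len, self.dim = history_len, dim
        self._frames = deque(maxlen=history_len)

    def push(self, shuttle_pos) -> None:
        """Append the newest shuttle position measurement."""
        self._frames.append(np.asarray(shuttle_pos, dtype=np.float32))

    def observation(self) -> np.ndarray:
        """Oldest-to-newest positions, zero-padded at the front while the buffer fills."""
        pad = [np.zeros(self.dim, dtype=np.float32)] * (self.history_len - len(self._frames))
        return np.concatenate(pad + list(self._frames))
```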
6.1.2. Sim2Real Transfer
The controller is trained entirely in simulation and deployed to hardware zero-shot (without real-world training).
- Techniques for Transfer: Domain randomization and observation noise are applied during training (detailed in Table IV) to cover key dynamics and sensing variations. Constraint terms in the reward function discourage brittle, high-torque solutions. The staged curriculum encourages a complete kinetic chain for hitting.
- Outcome: The learned controller transfers successfully to the real robot in the Mocap arena without system identification or manual parameter adjustment, validating the effectiveness of the sim2real strategy.
6.1.3. Real Robot Deployment
6.1.3.1. EKF Prediction Accuracy Study
- Objective: Quantify the accuracy of the EKF-based shuttlecock trajectory prediction.
- Setup: 20 authentic badminton flight trajectories were collected using Mocap. Partial segments served as measurement input for the EKF, and predictions were compared against ground-truth contact positions and timing.

The following are the results from Figure 5 of the original paper:
This chart shows the total position error and the interception-time prediction error as a function of time relative to interception. The top plot shows the mean position error and its standard deviation (purple curve); the bottom plot shows the absolute interception-time error, with the orange curve representing the mean time error. The interception time is indicated by a dashed line.
Figure 5: EKF Prediction Accuracy. The predicted striking position error (top) and striking time error (bottom) were evaluated over 20 shuttlecock trajectories. The shaded regions represent the standard deviation. At 0.6 s before interception, the mean position error was already smaller than the radius of the racket.
- Results (from Figure 5):
  - Position Error: By 0.6 s before impact, the mean predicted position error was already smaller than the racket's radius, and it converged sharply in the final moments before contact.
  - Timing Error: The mean interception timing prediction error was small well before the hit and rapidly converged thereafter.
- Analysis: The EKF demonstrates high accuracy in predicting both shuttlecock position and timing well within the reaction window needed by the robot, confirming its reliability as a perception module (an illustrative prediction sketch follows Figure 11 below).

The following are the results from Figure 11 of the original paper:
This chart compares the total position error and the interception-time prediction error as a function of time relative to interception under different aerodynamic conditions. The upper part shows the variation of the total position error and the lower part the absolute time error, with error bands for each case; the interception time is also marked.
Figure 11: EKF prediction accuracy under varying aerodynamic characteristic lengths. A scaling factor perturbs the aerodynamic characteristic length to emulate variations in shuttle aerodynamics. The figure shows a sensitivity analysis of the EKF's prediction accuracy under this perturbation: even so, the mean position error (top) still converges to a small value well before interception, indicating reasonable robustness to uncertainties in the aerodynamic parameters.
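For intuition, the sketch below rolls a drag-affected point-mass model forward from a current position/velocity estimate and picks the first downward crossing of an assumed hitting height as the predicted interception. The quadratic-drag form a = g - |v|*v / L parameterized by a characteristic length L, the value L = 4.0 m, and the 1.2 m hitting height are assumptions for illustration; the paper's EKF and shuttle model are not reproduced here.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity (m/s^2)

def propagate_shuttle(pos, vel, char_length: float = 4.0, dt: float = 0.002, t_max: float = 2.0):
    """Integrate a point-mass shuttle model with quadratic drag, a = g - |v| * v / L,
    where L is the aerodynamic characteristic length. Returns (time, position) samples."""
    p, v = np.asarray(pos, float), np.asarray(vel, float)
    samples, t = [(0.0, p.copy())], 0.0
    while t < t_max:
        a = G - np.linalg.norm(v) * v / char_length
        v = v + a * dt
        p = p + v * dt
        t += dt
        samples.append((t, p.copy()))
    return samples

def predict_interception(samples, hit_height: float = 1.2):
    """Return (time, position) where the trajectory first descends through hit_height."""
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if p0[2] >= hit_height > p1[2]:
            alpha = (p0[2] - hit_height) / (p0[2] - p1[2])
            return t0 + alpha * (t1 - t0), p0 + alpha * (p1 - p0)
    return None
```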
6.1.3.2. Virtual-Target Swinging
- Objective: Quantify the robot's swing error and speed without a real shuttle, isolating the controller's motor performance.
- Setup: The robot was instructed to execute hitting motions towards 71 randomly sampled target positions. Mocap measured the racket face center, which was compared to the commanded target.

The following are the results from Figure 6 of the original paper:
This chart shows the total position error during striking and the racket speed at impact. The upper panel shows the total position error (in millimeters) with a reference line at the mean of 24.29 mm; the lower panel shows the racket speed at impact (in meters per second), with a mean of 5.32 m/s. Both panels include box plots visualizing the distributions of error and speed across hitting-point indices.
Figure 6: Virtual-Target Swinging. The upper portion of this figure depicts the Euclidean distance error between the racket center and the designated hitting position at the moment of impact, while the lower portion illustrates the corresponding racket speed at impact.
- Results (from Figure 6):
  - Mean Euclidean distance error: 24.29 mm.
  - Average racket speed at impact: 5.32 m/s.
- Analysis: These results confirm the robot's ability to precisely control the racket contact point and generate sufficient swing speed, demonstrating its practical feasibility for high-precision swinging actions.

The following are the results from Figure 12 of the original paper:
This chart shows the racket-center trajectories around the designated hitting position at (50, -250, 1540) mm. The blue lines represent the raw trajectories, and the green spheres mark the actual racket-center positions the robot passed through; 20 motion points are included, and the overall paths depict the robot's swinging motion.
Figure 12: Trajectory of the racket center. For a designated hitting position at (50, -250, 1540) mm, the robot executed 20 swinging motions. The green spheres represent the positions of the racket center as it passed through the plane during each swing. This figure visualizes the repeatability of the racket center's trajectory across the 20 swings aimed at a single fixed target; the tight clustering of green spheres indicates high precision.
The following are the results from Figure 13 of the original paper:

This chart shows the error distribution in the X-Y plane. The target position (black star), the closest hit point (blue circle), and the farthest hit point (red circle) are marked. The reported error statistics include the mean, standard deviation, maximum, and minimum; the red dashed circle represents the mean-square-error circle.
Figure 13: Swing Error Analysis. This figure presents a statistical analysis of the swing error over 20 repeated swings to a fixed virtual target, reporting the mean Euclidean distance error along with its standard deviation and the maximum and minimum errors. The mean error is close to the racket's center, and the low standard deviation indicates good repeatability.
6.1.3.3. Real-World Shuttle Hitting
- Objective: Integrate the prediction and control modules to perform autonomous shuttlecock hitting in the real world.
- Setup: A ball machine serves shuttles in a Mocap arena. The interception area is constrained for safety and field-of-view reasons.
- Results:
  - Success Rate: The robot successfully returned 42 out of 46 shuttles (91.3% success rate); the 4 misses struck the edge of the racket frame.
  - Interception Range: The robot intercepts shuttles over a sizable area above the ground, a relatively large range given the robot's height.
  - Outgoing Shuttle Speed: Up to 19.1 m/s.
  - Return Shot Quality: Produces steep, high-arcing return trajectories and a mean return landing distance of 4 m from the interception area.
- Analysis: This demonstrates the system's overall effectiveness in a complex real-world task, showcasing coordinated and agile motion, a high success rate, and quality returns.

The following are the results from Figure 7 of the original paper:
This illustration shows the robot in two postures during real-world shuttle hitting, located within the adapted interception area. The robot's hitting-height range is annotated, and the boundaries of the hitting area are highlighted.
Figure 7: Real-World Shuttle Hitting. This figure captures the robot's actual hitting postures at the two opposing boundaries of the interception area. It visually confirms the robot's ability to reach and strike shuttles across a significant interception area, illustrating its whole-body coordination and flexibility in adapting its posture to different target locations.
6.2. Ablation Studies / Parameter Analysis
The paper includes an ablation study on the multi-stage curriculum (as mentioned in Methodology Section 4.2.1.5).
- S1 (Footwork acquisition): Found to be essential. Removing it causes training to diverge immediately.
- S2 (Precision-guided swing generation): Also essential. Skipping it (jumping from S1 directly to S3) creates too large a curriculum gap, and training fails to converge reliably.
- S3 (Task-focused refinement): While S1+S2 already yields a hardware-deployable policy, S3 is crucial for breaking performance plateaus.
  - Benefits of S3: Increases the primary hitting reward by 3-5%, and leads to lower action rates, reduced joint velocity and acceleration penalties, lower energy consumption, cleaner foot-contact force profiles, and reduced joint-torque usage. Notably, energy and torque costs decrease by approximately 20%.
- Analysis: This detailed ablation demonstrates the necessity of each stage in the curriculum. The progressive introduction of complexity, followed by the removal of locomotion-shaping regularizers in S3, is key to achieving optimal, energy-efficient, and robust hitting performance. The reduction in energy and torque costs in S3 indicates that the policy learns more task-optimal and physically efficient solutions once extraneous regularizers are removed.
6.3. Striking Motion Analysis
The following are the results from Figure 14 of the original paper:

This image shows three snapshots of the robot performing a real-world strike. The left panel shows the robot stepping in place before the shuttle is launched; the middle panel shows the approach and backswing phase as the shuttle is launched (a yellow box highlights the shuttle, a green arrow shows the racket swing direction, and a white arrow indicates the stepping motion); the right panel shows the hit and follow-through: the robot steps toward the shuttle while swinging the racket, then completes the follow-through.
Figure 14: Real-world striking motion. Three snapshots of a successful return in the real world. Left: in-place stepping before the shuttle is launched. Middle: approach and backswing phase as the shuttle is launched (the yellow box highlights the shuttle, the green arrow indicates the racket swing direction, and the white arrow indicates the stepping motion). Right: hit and follow-through; the robot simultaneously takes a step and swings the racket toward the shuttle, then completes the motion with a follow-through. This figure breaks a successful real-world hit into three key phases, illustrating the whole-body coordination achieved by the RL policy: a preparatory phase (in-place stepping), the approach and backswing (coordinated leg movement and arm preparation), and the impact and follow-through (simultaneous stepping and swinging, with the non-racket arm aiding balance).
- Key Observations from Motion Analysis:
  - Resting State: When no shuttle is incoming, the robot maintains an active in-place stepping behavior, holding the racket in front of its body, indicating readiness to react.
  - Coordinated Approach: Upon shuttle launch, the controller initiates corrective steps (short or long strides) while simultaneously performing a backswing. This foot-racket co-timing is an emergent behavior, not hand-coded.
  - Whole-Body Impulse: At the hitting instant, both legs and the arm accelerate in a coordinated manner to generate whole-body impulse for high racket speed.
  - Balance Mechanism: The non-racket arm swings in the opposite direction during the fast stroke, acting as a counterweight that helps counteract angular momentum and maintain balance.
  - Recovery: After striking, the robot executes a recovery motion to prepare for the next potential shot.
- Emergent Behaviors:
  - Tiptoe reaching: Observed for higher interception heights.
  - Recentering: The robot recenters near the middle-right of the court between hits, reflecting the sampled target distribution and mimicking human re-centering behavior.
- Analysis: These observations highlight the sophisticated, human-like behaviors that emerge from the multi-stage RL training, particularly the synergy between lower-body footwork and upper-body striking, which is crucial for dynamic tasks like badminton. The emergent balance mechanisms and recovery behaviors further underscore the effectiveness of the whole-body control approach.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work successfully demonstrates the first real-world humanoid badminton system, powered by a unified whole-body reinforcement learning (RL) policy. The core innovation lies in its three-stage curriculum: starting with footwork acquisition, progressing to precision-guided swing generation, and culminating in task-focused refinement. This curriculum enables the humanoid to learn coordinated stepping and striking without relying on expert demonstrations or motion priors, producing complex, human-like badminton behaviors within tight sub-second time windows.
Key achievements include:
- In simulation, two identical humanoids sustained a rally of 21 consecutive returns, exhibiting high hitting accuracy (within tight position-error and 0.2 rad orientation-error tolerances).
- The prediction-free variant showed comparable performance to the target-known policy in simulation, demonstrating robustness to aerodynamic variations without an explicit shuttle predictor.
- In real-world tests, the zero-shot transferred policy achieved fast swing speeds and returned ball-machine serves with outgoing shuttle speeds up to 19.1 m/s. The framework represents a significant step towards enabling agile, interactive whole-body response tasks on humanoids in dynamic environments.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Mocap Arena Constraints: Current tests are limited by the Mocap arena's ceiling height, restricting shuttle trajectories to low arcs from a ball machine. This constrains the robot's potential and makes long multi-ball rallies or play with human partners difficult.
  - Future Work: Deploying in higher-ceiling venues and with broader interception bands would allow for more dynamic and varied shuttle trajectories, revealing the full potential of the controller.
- Limited Stroke Repertoire: The learned striking behavior is stereotyped, predominantly forehand-like. It lacks the diverse repertoire of human strokes (e.g., backhand hits, lunges, jumps, smashes). The feasible interception area is also currently restricted to a relatively narrow band.
  - Future Work: Expanding the training data distribution and reward shaping to encourage a wider variety of strokes and interception areas.
- Prediction-Free Variant Deployment: While promising in simulation, the prediction-free variant has not yet been validated on hardware. It relies on a strictly timed 50 Hz history of shuttle positions, requiring robust buffering and time-stamping for accurate velocity and timing inference by the actor.
  - Future Work: Deploying the prediction-free policy on hardware and performing a thorough analysis of its robustness in real-world conditions.
- Vision-based Perception: The current system relies on an external Mocap system for shuttlecock tracking.
  - Future Work: Moving towards pure vision-based operation would require reliable visual odometry and policies that actively keep the shuttle within the field of view during aggressive motion. A head-mounted camera with learning signals to align the view with the shuttle trajectory is suggested as a practical path.
- Higher-Level Strategy: The current controller acts as a motor primitive. It does not make strategic decisions about where to hit or when to swing. Current failures sometimes occur when consecutive targets are too far apart for the available maneuver time.
  - Future Work: Training a higher-level policy via multi-agent training or self-play to decide interception points, swing timing, and racket-face orientation. This would involve reward shaping that explicitly encourages high-speed legged maneuvers to cover larger distances.
- Adaptability to Other Sports:
  - Future Work: The framework could be adapted to other dynamics-critical domains like tennis or squash by replacing the shuttlecock dynamic model and adjusting interception styles and reward tunings to reflect sport-specific behaviors.
7.3. Personal Insights & Critique
This paper presents a truly impressive achievement in humanoid robotics, particularly in the domain of dynamic loco-manipulation. The successful zero-shot transfer from simulation to a real humanoid for such a complex task as badminton is a testament to the power of well-designed RL curricula and domain randomization.
Inspirations:
- Curriculum Learning for Complex Tasks: The three-stage curriculum is a masterclass in breaking down a formidable RL problem into manageable, progressive steps. This structured approach, particularly the refinement stage (S3) where locomotion regularizers are strategically removed, is a valuable blueprint for training other complex whole-body behaviors. It highlights that optimal task performance may require shedding generic regularizers once fundamental skills are acquired.
- Emergent Behaviors: The observation that foot-racket co-timing, tiptoe reaching, and recentering emerged without explicit programming is a powerful demonstration of RL's ability to discover intelligent and adaptive strategies. This reinforces the idea of empowering agents to learn from task-specific rewards rather than strict imitation learning.
- Sim2Real Robustness: The rigorous application of domain randomization and observation noise is a key takeaway. The fact that the policy transfers zero-shot with high efficacy suggests that these sim2real techniques are maturing to a point where they can reliably bridge the reality gap for highly dynamic, contact-rich tasks.
- Prediction-Free Control: The prediction-free variant is a forward-looking contribution. Humans do not compute explicit Kalman filter predictions; they react to sensory input and infer intentions. This end-to-end approach, even if currently only validated in simulation, offers a path to more robust systems that are less sensitive to model inaccuracies or parameter tuning.
Critique & Areas for Improvement:
- Generalization of Stroke Styles: While the paper successfully demonstrates a formidable forehand-like stroke, the lack of diverse stroke types (e.g., backhand, smashes, drop shots) is a significant limitation for actual badminton play. Real badminton involves anticipating and executing a wide array of shots. Future work could explore incorporating style-conditioning inputs or more diverse reward landscapes to encourage a broader repertoire.
- Dealing with Occlusions and Multiple Objects: The current Mocap-based perception and single-shuttle environment simplify perception. In a real game, occlusions (net, opponent, the robot's own body) and managing multiple shuttlecocks (in practice scenarios) would introduce substantial perception challenges. The prediction-free variant helps with aerodynamic uncertainty but not with perceptual uncertainty from occlusions.
- Strategy and Adversarial Play: The robot currently reacts to machine-served shuttles. Integrating a higher-level strategic policy (as suggested by the authors) would be the next frontier. Playing against an opponent (human or robot) introduces adversarial dynamics and the need for game theory and long-term planning, which are entirely different RL challenges.
- Energy Efficiency for Extended Play: While the paper notes a reduction in energy and torque costs in S3, the long-term energy budget for sustained humanoid operation is critical. Further optimization of energy consumption would be valuable for practical deployment over extended periods.
- Human-Robot Interaction Safety: For deployment in human environments, safety guarantees and human-robot interaction protocols (e.g., avoiding accidental collisions with humans or other objects) would need rigorous development beyond the collision penalties in the reward function.

Overall, this paper pushes the boundaries of humanoid capabilities in dynamic environments, setting a new benchmark for whole-body control in robotic sports. The insights gained from its multi-stage curriculum and sim2real transfer strategy are broadly applicable across the field of legged robotics and interactive AI.