
From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control

Published: 06/03/2025

TL;DR Summary

This paper introduces Generative Behavior Control (GBC), a framework that leverages LLMs for hierarchical behavior planning to guide Task and Motion Planning (TAMP) in synthesizing diverse, long-horizon human motions. Supported by the new GBC-100K dataset, the accompanying PHYLOMAN framework generates behaviors roughly ten times longer, more diverse, and more purposeful than prior methods.

Abstract

Jusheng Zhang¹, Jinzhou Tang¹, Sidi Liu¹, Mingyan Li¹, Sheng Zhang², Jian Wang³, Keze Wang¹* — ¹Sun Yat-sen University, ²University of Maryland, College Park, ³Snap Inc. (*Corresponding author: kezewang@gmail.com)

Figure 1. From motion to behavior. (a) Simple periodic motion patterns without complex behavioral semantic meaning; (b) complex, semantically meaningful human behaviors, demonstrating the framework's ability to generate goal-oriented, coherent behavior sequences; (c) the proposed Generative Behavior Control (GBC) framework bridges the gap between low-level motions and high-level behavioral understanding.

Human motion generative modeling, or synthesis, aims to characterize the complicated human motions of daily activities in diverse real-world environments. However, current research predominantly focuses on either low-level, short-period motions or high-level action planning, without taking into account the hierarchical, goal-oriented nature of human activities. In this work, we take a step forward from human motion generation to human behavior modeling…


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control
  • Authors: Jusheng Zhang, Jinzhou Tang, Sidi Liu, Mingyan Li, Sheng Zhang, Jian Wang, Keze Wang
  • Affiliations: Sun Yat-sen University, University of Maryland, College Park, Snap Inc.
  • Journal/Conference: The paper's formatting and content suggest it was submitted to a top-tier computer vision or robotics conference, such as CVPR, ICCV, ECCV, or ICRA. The publication year is inferred to be 2024 or later, based on the references.
  • Publication Year: 2024/2025
  • Abstract: The paper addresses the limitations of current human motion generation methods, which typically handle either short, low-level motions or high-level planning, but not both in a unified, hierarchical way. It introduces a shift from "motion generation" to "human behavior modeling." The authors propose a unified framework called Generative Behavior Control (GBC), which uses Large Language Models (LLMs) to generate hierarchical behavior plans that guide a Task and Motion Planning (TAMP) system to synthesize diverse, long-horizon human motions. To support this research, they also introduce GBC-100K, a new large-scale dataset with hierarchical semantic and motion annotations. Experiments show that their method, PHYLOMAN, can generate motions that are 10 times longer, more diverse, and more purposeful than existing methods.
  • Original Source Link: /files/papers/68e0aff59cc40dff7dd2bb36/paper.pdf (note: this is a local file path rather than a public URL). The paper is presented as a preprint.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Generating realistic human motion is a long-standing challenge. Existing methods are fragmented. Some are good at creating short, simple movements (e.g., walking for a few seconds), while others focus on abstract planning (e.g., "make coffee") without generating the actual physical motions. This creates a significant gap: there is no effective way to generate long, complex, and goal-oriented human behaviors that are both physically realistic and semantically coherent.
    • Key Challenges: Previous works struggle with three main issues:
      1. Temporal Coherence: Maintaining realistic motion over long durations (minutes instead of seconds) is very difficult; models often produce jerky or repetitive movements.
      2. Physical Plausibility: Generated motions often violate physical laws, resulting in characters floating, sliding on the ground, or moving through objects.
      3. Semantic Gap: High-level goals (e.g., "conduct an orchestra") are hard to translate into specific, low-level joint movements.
    • Innovation: This paper proposes to reframe the problem from simple motion generation to behavior modeling. The core insight is to treat human activity as a hierarchical, goal-driven process, inspired by cognitive science and robotics. They combine the reasoning power of LLMs for high-level planning with the precision of robotics-inspired motion control for physically grounded execution.
  • Main Contributions / Findings (What):

    • A New Task - Generative Behavior Control (GBC): They formally define a new task that requires generating long-horizon, physically consistent, and semantically meaningful human behaviors from high-level instructions.
    • A New Dataset - GBC-100K: A large-scale (over 100,000 video-motion pairs) multimodal dataset collected from internet videos. Crucially, it is annotated with a hierarchical structure of text descriptions: high-level BehaviorScripts, mid-level MotionScripts, and low-level PoseScripts. This bridges the semantic gap and facilitates hierarchical learning.
    • A New Framework - PHYLOMAN: A novel framework that integrates three key technologies:
      1. LLMs for High-Level Planning: An LLM decomposes a high-level goal into a sequence of actionable steps.
      2. Robotics Planning (TAMP): A Task and Motion Planning approach structures these steps into key poses and transitions.
      3. Physics-Informed Control: A physics simulator and controller ensure the final motion is physically realistic and stable.
    • Key Finding: PHYLOMAN can generate high-quality human motion sequences that are 10 times longer than those produced by state-of-the-art methods, while demonstrating superior physical and semantic fidelity.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Human Motion Synthesis: The process of algorithmically generating sequences of human poses to create animations. This is often driven by conditioning inputs like text or audio.
    • SMPL (Skinned Multi-Person Linear Model): A statistical 3D model of the human body that can represent a wide variety of human shapes and poses with a small set of parameters: per-joint axis-angle rotations for pose and a handful of shape coefficients. It is the standard body representation in this field (a parameter-layout sketch follows this list).
    • Large Language Models (LLMs): AI models like GPT-4 trained on vast amounts of text data. They excel at understanding context, reasoning, and planning, making them suitable for breaking down high-level behavioral goals into smaller steps.
    • Task and Motion Planning (TAMP): A field in robotics that combines high-level, symbolic planning (the "Task," e.g., pick up the cup) with low-level, continuous control (the "Motion," e.g., calculate the joint angles to move the arm). This paper adapts TAMP for generating human-like behavior.
    • Diffusion Models: A powerful class of generative models that create data by starting with random noise and gradually "denoising" it into a coherent sample, guided by a condition (like text). They are widely used in modern image and motion generation.
    • Physics Simulator: A software environment (like MuJoCo mentioned in the paper) that models physical laws like gravity, friction, and collisions. Using a simulator ensures that generated motions are physically possible.
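To make the SMPL representation mentioned above concrete, below is a minimal NumPy sketch of the standard SMPL parameter layout (24 joints with 3 axis-angle values each, plus 10 shape coefficients). The array names and the toy motion clip are illustrative conventions, not code from the paper.

```python
import numpy as np

# Minimal sketch of the standard SMPL parameterization (illustrative, not from the paper).
NUM_JOINTS = 24              # SMPL kinematic tree: pelvis root + 23 body joints
POSE_DIM = NUM_JOINTS * 3    # one axis-angle rotation (3 values) per joint -> 72 numbers
SHAPE_DIM = 10               # "beta" coefficients controlling body shape

rng = np.random.default_rng(0)
pose = rng.normal(scale=0.1, size=POSE_DIM)   # one static pose (e.g., decoded from a PoseScript)
shape = np.zeros(SHAPE_DIM)                   # mean body shape
trans = np.zeros(3)                           # global root translation in meters

# A T-frame motion clip is simply a stacked sequence of poses plus root translations.
T = 120  # e.g., 4 seconds at 30 fps
motion = {
    "poses": np.tile(pose, (T, 1)),   # (T, 72)
    "trans": np.tile(trans, (T, 1)),  # (T, 3)
    "shape": shape,                   # (10,) shared across the clip
}
print(motion["poses"].shape, motion["trans"].shape)
```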
  • Previous Works & Differentiation:

    • Behavior Decomposition: While psychology provides theories on how humans structure behavior, computational models like MotionLLM and Video-LLaVA have started to analyze this from video. However, these models primarily focus on understanding existing videos, not generating new, physically plausible behaviors.
    • Human Motion Synthesis: Methods like MDM and MotionLCM use diffusion models to generate realistic short-term motions from text. Physics-informed methods (PhysDiff, PHC) improve physical realism. The key limitation is their inability to handle long-horizon sequences and complex, multi-step goals. They lack a high-level planning component.
    • Motion Control for Robotics: TAMP and Reinforcement Learning (RL) are used to make robots perform complex tasks. However, traditional TAMP is often deterministic and struggles with the diversity and subtlety of human motion. RL is data-hungry and has trouble generalizing. These methods often lack the rich semantic understanding that LLMs provide.
    • Differentiation: PHYLOMAN is unique because it unifies these three disparate fields. It uses an LLM for semantic planning (what to do), a TAMP-like structure to organize the plan (how to sequence it), and a physics-based controller for realistic execution (how to move). This hierarchical approach is what enables the generation of long, coherent, and meaningful behaviors.

4. Methodology (Core Technology & Implementation)

The proposed framework, PHYLOMAN, solves the Generative Behavior Control (GBC) problem by breaking it down into two main stages: Task Planning and Motion Planning.

Framework Overview

Image 1 shows this two-stage pipeline. A text prompt is first fed into an LLM for Task Planning, which produces a Composed Script of PoseScripts and MotionScripts. This script then guides the Motion Planning stage, where a Generator creates an initial motion sequence, and a Controller refines it into a final, physically plausible behavior.

  • Problem Formulation: The goal is to generate a sequence of SMPL poses from a high-level instruction. This is framed as a hierarchical planning problem.

    1. Task Level: An LLM planner takes an instruction and generates an alternating sequence of static key poses (described by PoseScripts $\{p_i\}$) and transitions between them (described by MotionScripts $\{a_i\}$). The resulting plan must satisfy a high-level constraint $C_T$ that enforces both physical validity (e.g., joint limits) and semantic consistency: $C_T(x_i, a_i, x_{i+1}) = 0, \ \forall i \in \{0, \dots, n-1\}$.
    2. Motion Level: For each pose pair $(x_i, x_{i+1})$ and transition $a_i$, a motion planner computes a continuous trajectory $\tau_i$ that connects them while obeying low-level physical constraints $C_M$ (e.g., collision avoidance, balance): $C_M(\tau_i(\lambda)) \le 0, \ \forall \lambda \in [0, 1]$. A minimal code sketch of this two-level formulation follows this list.
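Below is a minimal Python sketch of how a task-level plan and the motion-level trajectories could be validated against $C_T$ and $C_M$. The data structures and constraint callables are illustrative placeholders under the formulation above, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

Pose = np.ndarray  # a single SMPL pose vector, e.g. shape (72,)

@dataclass
class BehaviorPlan:
    """Task-level output: alternating key poses x_i and transition labels a_i."""
    key_poses: List[Pose]    # x_0 ... x_n
    transitions: List[str]   # a_0 ... a_{n-1} (MotionScript text)

def plan_is_valid(plan: BehaviorPlan,
                  C_T: Callable[[Pose, str, Pose], float]) -> bool:
    # Task-level constraint: C_T(x_i, a_i, x_{i+1}) = 0 for every consecutive pair.
    return all(
        abs(C_T(plan.key_poses[i], plan.transitions[i], plan.key_poses[i + 1])) < 1e-6
        for i in range(len(plan.transitions))
    )

def trajectory_is_feasible(traj: Callable[[float], Pose],
                           C_M: Callable[[Pose], float],
                           num_samples: int = 100) -> bool:
    # Motion-level constraint: C_M(tau_i(lambda)) <= 0 along the whole trajectory,
    # checked here by sampling lambda in [0, 1].
    return all(C_M(traj(lam)) <= 0.0 for lam in np.linspace(0.0, 1.0, num_samples))
```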
  • PHYLOMAN Architecture:

    Stage 1: Task Planning (High-Level Reasoning)

    1. LLM as Behavior Planner: The process starts with a high-level text prompt (e.g., "A mechanic changes a tire on a bike"). An LLM, using a Chain-of-Thought prompting strategy, decomposes this complex behavior into a structured sequence of key poses and the motions that connect them.
      • PoseScript: A textual description of a static keyframe pose (e.g., "crouching with the left hand near the wheel nut").
      • MotionScript: A textual description of the movement between two keyframes (e.g., "unscrewing the nut with a fast circular motion").
    2. Parameterizing Text to Poses: A significant challenge is converting the textual PoseScripts into numerical SMPL pose parameters. This is handled by a pre-trained Variational Autoencoder (VAE) from the PoseScript work: the encoder maps the text $p_i$ to a latent representation $z_i$, and the decoder generates the corresponding SMPL pose $x_i$, i.e., $z_i \sim E_\phi(p_i), \ x_i \sim D_\theta(z_i)$. A sketch of this stage follows below.
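The following sketch shows what Stage 1 could look like in code: an LLM call that returns a composed script of alternating PoseScripts and MotionScripts, followed by text-to-pose decoding. The function names, injected dependencies (`call_llm`, `posescript_vae`), and JSON schema are assumptions for illustration, not the paper's actual interfaces.

```python
import json
from typing import Callable, List, Tuple
import numpy as np

def llm_decompose(instruction: str,
                  call_llm: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Hypothetical LLM planner call (Chain-of-Thought prompt omitted for brevity).

    Returns alternating PoseScripts p_0..p_n and MotionScripts a_0..a_{n-1}.
    `call_llm` is an assumed chat-completion client passed in by the caller."""
    prompt = (
        "Decompose the behavior into key poses and the motions between them.\n"
        f"Behavior: {instruction}\n"
        'Answer as JSON: {"pose_scripts": [...], "motion_scripts": [...]}'
    )
    plan = json.loads(call_llm(prompt))
    return plan["pose_scripts"], plan["motion_scripts"]

def texts_to_key_poses(pose_scripts: List[str], posescript_vae) -> List[np.ndarray]:
    """Map each textual PoseScript to an SMPL pose via a pre-trained VAE:
    z_i ~ E_phi(p_i), x_i ~ D_theta(z_i). The encode/decode method names are assumed."""
    key_poses = []
    for p_i in pose_scripts:
        z_i = posescript_vae.encode_text(p_i)   # assumed encoder interface
        x_i = posescript_vae.decode(z_i)        # assumed decoder interface -> (72,) pose
        key_poses.append(x_i)
    return key_poses
```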

    Stage 2: Motion Planning (Low-Level Execution)

    1. Motion Diffusion Prior (In-betweening): Given the sequence of key poses $\{x_0, x_1, \dots, x_n\}$, a conditional diffusion model fills in the gaps, generating an initial, discrete trajectory $\mathcal{M}$ that serves as a rough draft of the motion between keyframes. The model is conditioned on the key poses $\mathcal{X}$ and the textual MotionScripts $\mathcal{A}$: $\mathcal{M} = \varphi_\theta(\mathbf{z}; \mathcal{X}, \mathcal{A}, t)$.
    2. Continuous Motion Planning (Physics-based Refinement): The rough trajectory $\mathcal{M}$ from the diffusion model may still be physically unrealistic. The final step uses a physics-based humanoid controller (e.g., HOVER) in a simulator (MuJoCo): the controller takes $\mathcal{M}$ as a reference and computes the joint torques needed for a simulated avatar to follow it. This refinement ensures the motion respects physical laws such as balance, joint limits, and ground contact, yielding a smooth and physically plausible final behavior. A sketch of both steps follows this list.
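Here is a minimal sketch of Stage 2's two steps, assuming an abstract diffusion-based in-betweener and a physics tracker with hypothetical interfaces (`diffusion_inbetween`, `physics_tracker.track`). It illustrates the pipeline structure only and is not the paper's implementation.

```python
from typing import Callable, List, Sequence
import numpy as np

def plan_motion(key_poses: Sequence[np.ndarray],
                motion_scripts: Sequence[str],
                diffusion_inbetween: Callable[..., np.ndarray],
                physics_tracker) -> np.ndarray:
    """Stage 2 sketch: diffusion in-betweening followed by physics-based refinement."""
    segments: List[np.ndarray] = []
    for i, a_i in enumerate(motion_scripts):
        # Step 1: rough trajectory M between consecutive key poses, conditioned on
        # the keyframes (X) and the textual MotionScript (A).
        rough = diffusion_inbetween(start=key_poses[i],
                                    end=key_poses[i + 1],
                                    text=a_i)                 # (T_i, pose_dim)
        # Step 2: track the rough trajectory with a simulated humanoid controller
        # (e.g., HOVER in MuJoCo) so the result respects balance, joint limits,
        # and ground contact.
        refined = physics_tracker.track(rough)                # (T_i, pose_dim)
        segments.append(refined)
    # Concatenate per-transition segments into one long-horizon behavior.
    return np.concatenate(segments, axis=0)
```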
  • The GBC-100K Dataset:

    Dataset Creation Pipeline

    As shown in Image 2, the creation of this dataset was a major undertaking:

    1. Data Collection: Started with ~500k raw videos from various sources like YouTube, ActivityNet, Kinetics, etc.
    2. Filtering & Segmentation: Videos were filtered to keep only those with a clear single human subject (YOLOv8-Pose). They were then segmented into short clips without scene changes (TransNetv2).
    3. Motion Estimation: A 3D pose estimation model (TRAM) was used to extract SMPL pose sequences from each video clip. Low-quality sequences were filtered out.
    4. Motion Captioning: A multimodal LLM was used to generate hierarchical text annotations for each clip, creating the BehaviorScript, MotionScript, and PoseScript structure. This hierarchical labeling is the dataset's most critical feature. (A sketch of the full pipeline follows this list.)
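As a rough illustration of the four-stage pipeline above, here is a hedged Python sketch. The injected callables (`detect_single_person`, `split_on_scene_changes`, `estimate_smpl`, `caption_hierarchically`) are stand-ins for YOLOv8-Pose, TransNetv2, TRAM, and a multimodal LLM respectively; they and the quality threshold are assumptions, not the authors' code.

```python
from typing import Dict, Iterable, List

def build_gbc_style_dataset(raw_videos: Iterable[str],
                            detect_single_person,      # stand-in for YOLOv8-Pose filtering
                            split_on_scene_changes,    # stand-in for TransNetv2 segmentation
                            estimate_smpl,             # stand-in for TRAM pose estimation
                            caption_hierarchically,    # stand-in for a multimodal LLM
                            quality_threshold: float = 0.5) -> List[Dict]:
    """Sketch of a GBC-100K-style annotation pipeline as described above."""
    dataset: List[Dict] = []
    for video in raw_videos:
        # 1-2. Keep only clear single-person videos, then cut into scene-change-free clips.
        if not detect_single_person(video):
            continue
        for clip in split_on_scene_changes(video):
            # 3. Extract an SMPL pose sequence; drop low-quality estimates.
            motion, confidence = estimate_smpl(clip)
            if confidence < quality_threshold:
                continue
            # 4. Hierarchical captions: BehaviorScript -> MotionScripts -> PoseScripts.
            scripts = caption_hierarchically(clip)
            dataset.append({
                "motion": motion,
                "behavior_script": scripts["behavior"],
                "motion_scripts": scripts["motions"],
                "pose_scripts": scripts["poses"],
            })
    return dataset
```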

5. Experimental Setup

  • Datasets:
    • GBC-100K: The newly proposed dataset, used for training and evaluation. It contains ~124k sequences totaling 250 hours.
    • HumanML3D: A widely used existing benchmark for text-to-motion generation. It is used for comparison to show the added complexity and benefit of GBC-100K.
  • Evaluation Metrics:
    • Text-Motion Alignment:
      • R-Precision: Measures how often the correct motion is retrieved from a pool given a text description.
      • Multimodal Distance (MM Dist): The average cosine distance between text and motion embeddings. Lower is better.
    • Generation Quality & Diversity:
      • Fréchet Inception Distance (FID): Compares the feature distribution of generated motions to that of real motions. Lower is better. (A short computation sketch for FID and MM Dist appears after the baselines below.)
      • Diversity: Measures the variety of generated motions for different texts.
      • MultiModality: Measures the variety of generated motions for the same text.
    • Task Performance & Physical Realism:
      • Success Rate (SR): A human-evaluated score of how well the generated behavior accomplishes the instructed task.
      • Physical Error (Phys-Err): An aggregate measure of physical violations like ground penetration, floating, and foot sliding. Lower is better.
  • Baselines:
    • For standard motion generation on short sequences, they compare against state-of-the-art diffusion-based models: MotionCLR, MDM, MotionLCM.
    • For long-sequence generation, they adapt two models capable of longer generation: MoMask and T2M-GPT. They also perform extensive ablation studies by removing or replacing components of their own PHYLOMAN framework.
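For readers unfamiliar with FID and MM Dist, here is a minimal NumPy/SciPy sketch of how these metrics are typically computed on feature embeddings. MM Dist follows this analysis's description as an average cosine distance between paired text and motion embeddings; the paper's exact feature extractor and evaluation protocol are not reproduced here.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two feature sets, each (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))

def mm_dist(text_feats: np.ndarray, motion_feats: np.ndarray) -> float:
    """Average cosine distance between paired text and motion embeddings, each (N, D)."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    m = motion_feats / np.linalg.norm(motion_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * m, axis=1)))

# Example with random embeddings (illustration only):
rng = np.random.default_rng(0)
real, gen = rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
print(f"FID ~ {fid(real, gen):.3f}, MM Dist ~ {mm_dist(real, gen):.3f}")
```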

6. Results & Analysis

  • Core Results:

    1. Dataset Quality Validation (Table 2) The authors first validate their GBC-100K dataset by training existing SOTA models on it. The key finding is that models trained on GBC-100K achieve significantly higher MultiModality scores. This proves that the fine-grained hierarchical annotations in GBC-100K enable models to generate more diverse motions for the same instruction, confirming the dataset's richness and utility.

    2. Long-Sequence Behavior Generation (Table 3) This is the main experiment, demonstrating the superiority of the full PHYLOMAN framework.

    • The Optimal configuration of PHYLOMAN achieves a Success Rate of 0.821, a massive improvement over adapted baselines like MoMask (0.328) and T2M-GPT (0.179).
    • It also drastically reduces Physical Error to 0.093, compared to the baselines (0.224) and, most notably, the version without an LLM planner (1.031).
    • These results confirm that the proposed TAMP-based pipeline, combining high-level planning with low-level physical control, is essential for generating long, successful, and physically plausible behaviors.
  • Ablations / Component Analysis (Table 3):

    • LLM Planner: Discarding the LLM planner (Discard) or replacing it with a simple Heuristic rule-based system causes the Success Rate to plummet (from 0.821 to 0.067 and 0.118, respectively). This highlights that the common-sense reasoning of the LLM is critical for decomposing complex tasks correctly.
    • Text-to-Pose Model: Using a strong text-to-pose model is crucial. The proposed method significantly outperforms a Heuristic approach in Success Rate (0.821 vs. 0.452), showing that accurately generating key poses is fundamental to the overall behavior.
    • Physics Controller: Discarding the physics controller (Discard) causes the Physical Error to more than double (from 0.093 to 0.235). This unequivocally demonstrates the controller's role in ensuring the final motion is physically grounded and realistic.
  • Effect of Hierarchical Annotations (Table 4): This experiment shows the power of the GBC-100K dataset for training planners. When open-source LLMs (like Llama3.1 and Qwen-V2.5) are fine-tuned on the hierarchical scripts from GBC-100K, their performance as planners improves dramatically. The fine-tuned Qwen-V2.5 model even outperforms powerful, closed-source models like GPT-4o used in a zero-shot setting, achieving the highest Success Rate (0.821 vs. 0.807). This indicates that learning from the structured, real-world behavior patterns in the dataset is highly effective.

  • Qualitative Analysis:

    Qualitative Comparison

    Image 4 provides a stark visual comparison. For the prompt "A mechanic changes a tire on a bike," PHYLOMAN generates a long, 2100-frame sequence that shows the entire process coherently. In contrast, baseline methods (MotionCLR, MDM, MotionLCM) produce much shorter sequences (~200 frames) that capture only a small, isolated part of the action.

    Success and Failure Cases

    Image 5 showcases successful long-horizon generations like martial arts, gym training, and dancing. Image 6 shows a failure case for "swimmer practicing lap in a pool," where the model struggles with the highly dynamic and physically complex movements, leading to a loss of balance and temporal inconsistency. This highlights a current limitation in handling extremely dynamic scenarios.

7. Conclusion & Reflections

  • Conclusion Summary: This paper makes a compelling case for shifting the focus of human motion synthesis from generating short motions to modeling long-term, goal-oriented behaviors. The authors successfully demonstrate that by integrating LLMs for high-level planning with a robotics-inspired TAMP framework and physics-based control, it is possible to generate complex human behaviors that are 10 times longer and significantly more coherent and physically plausible than what was previously achievable. The introduction of the GBC-100K dataset with its unique hierarchical annotations provides a valuable resource that will likely spur further research in this direction.

  • Limitations & Future Work:

    • Authors' Stated Future Work: The authors plan to extend the framework for applications in embodied intelligence (e.g., controlling humanoid robots) and digital avatars.
    • Observed Limitations: The failure case (swimming) indicates that the model still struggles with highly dynamic or unusual physical interactions that are likely underrepresented in the training data. The reliance on a complex, multi-stage pipeline with several pre-trained components could make the system difficult to train end-to-end and sensitive to errors in any single component. The quality of the generated motion is also dependent on the quality of the initial plan from the LLM.
  • Personal Insights & Critique:

    • Strengths: This work represents a significant conceptual and practical leap forward. The unification of LLMs, TAMP, and physics is a powerful paradigm that feels like a natural evolution for the field. The creation of GBC-100K is a substantial contribution on its own and addresses a critical need for more structured, long-horizon motion data.
    • Critique: The framework is currently an "inference-only" pipeline where components are trained separately. This modularity is practical but may be less optimal than a fully end-to-end trained system. The dataset, while large, is sourced from internet videos, which can contain biases, noise, and occlusions that affect the quality of the extracted SMPL data. Finally, while the generated behaviors are long, they are still limited to a single agent and do not yet incorporate complex interactions with objects or other agents, which is a key aspect of real-world human behavior. Nonetheless, this paper lays a very strong foundation for future exploration in these areas.
