
DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

Published: 10/08/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DartControl (DART) is a diffusion-based autoregressive motion model enabling real-time text-driven motion control. It overcomes limitations of existing methods by generating complex, long-duration movements that effectively integrate motion history and text inputs.

Abstract

Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DartControl, in short DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control."

1.2. Authors

The authors are Kaifeng Zhao, Gen Li, and Siyu Tang, all affiliated with ETH Zürich. ETH Zürich is a prestigious public research university in Zürich, Switzerland, known for its leading research in science and technology, including computer science and robotics.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on October 7, 2024. While arXiv hosts preprints before formal peer review, many high-quality research papers are first shared there. The associated video results and code indicate a comprehensive research effort.

1.4. Publication Year

2024

1.5. Abstract

The paper introduces DartControl (DART), a novel Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion Control. It addresses the limitations of existing text-conditioned human motion generation methods, which typically produce short, isolated motions and struggle with continuous, long-duration, complex motions under real-time and online conditions, especially when incorporating spatial constraints. DART learns a compact motion primitive space, jointly conditioned on motion history and text inputs, using latent diffusion models. By autoregressively generating these motion primitives, DART enables real-time, sequential motion generation guided by natural language. Furthermore, its learned motion primitive space facilitates precise spatial motion control, which is framed either as a latent noise optimization problem or a Markov decision process solved via reinforcement learning (RL). The paper presents effective algorithms for both approaches, demonstrating superior performance in motion realism, efficiency, and controllability compared to existing baselines across various motion synthesis tasks.

Official arXiv page: https://arxiv.org/abs/2410.05260

PDF: https://arxiv.org/pdf/2410.05260v3.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the generation of long, continuous, and complex human motions that precisely respond to streams of natural language descriptions and spatial constraints, particularly in an online and real-time setting.

This problem is highly important in fields like computer graphics, virtual reality, robotics, and interactive character animation, where intuitive control over digital humans is crucial for user experience and content creation.

Specific challenges and gaps in prior research include:

  • Short, Isolated Motions: Existing text-conditioned models primarily generate short, standalone motions from a single sentence, failing to compose long, continuous sequences with varying semantics.

  • Offline Generation: State-of-the-art temporal motion composition methods, like FlowMDM, are offline, meaning they require prior knowledge of the entire action timeline and are too slow for real-time applications.

  • Difficulty with Spatial Constraints: Integrating spatial constraints (e.g., reaching a goal location, interacting with objects) with text-conditioned generation is complex, requiring careful alignment of semantic and geometric information, often leading to compromises in motion quality or semantic fidelity.

  • Lack of Text-Conditioned Control in Interactive Systems: While interactive character control has focused on realism and responsiveness, most systems lack natural language interfaces, relying on tedious spatial input.

    The paper's entry point is to leverage the power of diffusion models for high-quality generative tasks, combined with an autoregressive motion primitive representation, to tackle the challenge of continuous, real-time, and semantically/spatially controlled motion generation. The innovative idea is to learn a compact, text-conditioned motion primitive space that can be rapidly rolled out and manipulated.

2.2. Main Contributions / Findings

The primary contributions of DartControl (DART) are:

  • A Diffusion-Based Autoregressive Motion Primitive Model: DART effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. This enables the synthesis of motions of arbitrary length in an autoregressive, online, and real-time manner.

  • Real-time, Sequential Motion Generation: By autoregressively generating motion primitives based on preceding history and current text input, DART achieves a generation speed of over 300 frames per second with low latency, making it suitable for real-time applications, a significant improvement over offline methods.

  • Precise Spatial Motion Control: The learned motion primitive space provides a robust foundation for integrating precise spatial control. The paper formulates this control either as a latent noise optimization problem (e.g., for in-betweening, human-scene interaction) or as a Markov decision process addressed through reinforcement learning (e.g., for goal-reaching locomotion). Effective algorithms are presented for both approaches.

  • Versatile and Unified Model: DART is presented as a simple, unified, and highly effective model capable of addressing various complex motion synthesis tasks, including long, continuous sequence generation from sequential text prompts, in-between motion, scene-conditioned motion, and goal-reaching synthesis.

    Key conclusions and findings include:

  • DART significantly outperforms existing baselines in motion realism, efficiency, and controllability, as evidenced by quantitative metrics (FID, skate, jerk, speed, latency) and human preference studies.

  • The latent motion primitive representation and the use of a variational autoencoder (VAE) effectively mitigate artifacts (like jittering) commonly found in raw motion data, leading to higher-quality generations.

  • Scheduled training is crucial for improving the stability of long sequence generation and enhancing text prompt controllability, especially for unseen poses or novel combinations.

  • The proposed latent noise optimization framework effectively harmonizes spatial control with text semantic alignment, a challenge for previous methods.

  • Reinforcement learning can efficiently learn control policies using DART's latent space, enabling tasks like text-conditioned goal-reaching with different locomotion styles while maintaining high quality and speed.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the DART paper, a reader should be familiar with several fundamental concepts in machine learning, computer graphics, and motion synthesis:

  • Human Motion Representation:

    • SMPL-X Model (Skinned Multi-Person Linear model with eXpressions): A widely used 3D parametric model for human bodies that can represent a wide range of human shapes and poses, including hands and facial expressions. Given shape parameters $\pmb{\beta}$ and pose parameters $\pmb{\theta}$, it outputs a mesh representing the human body. The pose parameters typically describe joint rotations relative to a rest pose, while a global translation $\mathbf{t}$ and global orientation $\mathbf{R}$ determine the body's position and orientation in world space.
    • 6D Rotation Representation: A method to represent 3D rotations without the issues of gimbal lock or singularities encountered with Euler angles. It represents a rotation matrix using two orthogonal column vectors (e.g., the first two columns of the rotation matrix), from which the third can be derived. The paper mentions Zhou et al. (2019) for this.
    • Motion Primitives: Short, reusable, atomic motion segments that can be chained together to form longer, more complex motions. In DART, these are overlapping segments, facilitating smooth transitions and autoregressive generation. They offer a more manageable target for generative models than entire, long motion sequences.
  • Generative Models:

    • Variational Autoencoders (VAEs): A type of generative neural network that learns a compressed, continuous latent space representation of data. It consists of two main parts: an encoder that maps input data to a probability distribution (mean and variance) in the latent space, and a decoder that reconstructs the data from samples drawn from this latent distribution. VAEs are trained to minimize a reconstruction loss (how well the decoder reconstructs the input) and a Kullback-Leibler (KL) divergence loss (to ensure the latent distribution is close to a prior, often a standard Gaussian).
    • Denoising Diffusion Probabilistic Models (DDPMs) / Diffusion Models: A class of generative models that learn to reverse a diffusion process. The forward diffusion process gradually adds Gaussian noise to a data sample over several steps until it becomes pure noise. The reverse diffusion process (which the model learns) iteratively removes noise, starting from a pure noise sample, to generate a clean data sample.
      • Latent Diffusion Models (LDMs): Extend diffusion models by performing the diffusion process not on the raw data (e.g., high-resolution images or motion data) but on a compressed, lower-dimensional latent representation learned by an autoencoder (like a VAE). This makes the diffusion process more efficient and allows for higher-quality generation.
      • DDIM (Denoising Diffusion Implicit Models): A deterministic variant of diffusion models that allows for faster sampling (fewer steps) and more coherent latent space manipulation compared to DDPMs, which are stochastic. DART uses DDIM for its deterministic mapping from latent noise to motion during spatial control.
  • Conditional Generation Techniques:

    • Autoregressive Generation: A process where a model generates a sequence of data one element (or segment) at a time, conditioning each new element on all previously generated elements in the sequence. This is crucial for generating continuous, long motions in an online fashion.
    • Classifier-Free Guidance: A technique used in conditional diffusion models to enhance the influence of conditioning information (e.g., text prompts). During training, the conditioning input (e.g., the text embedding) is sometimes dropped out (masked). At inference, both a conditional prediction (with the text) and an unconditional prediction (without the text) are made; their difference is scaled by a guidance weight $w$ and added to the unconditional prediction to push the generation further in the direction of the text condition: $ \mathcal{G}_w(\mathbf{z}_t, t, \mathbf{H}, c) = \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, \emptyset) + w \cdot (\mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c) - \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, \emptyset)) $ Here, $\mathcal{G}$ is the denoiser, $\mathbf{z}_t$ is the noisy latent sample, $t$ is the diffusion step, $\mathbf{H}$ is the motion history, $c$ is the text condition, and $\emptyset$ represents the unconditional (masked) condition. A minimal code sketch of this combination appears after this list.
    • CLIP (Contrastive Language–Image Pre-training): A neural network trained on a vast dataset of images and their text captions. It learns to embed both images and text into a shared, high-dimensional space such that semantically similar image-text pairs are close together. This allows CLIP to be used as a powerful text encoder for text-conditioned generation tasks.
  • Optimization & Control:

    • Gradient Descent Methods (e.g., Adam): Iterative optimization algorithms used to minimize an objective function by moving in the direction opposite to the gradient of the function. Adam (Kingma & Ba, 2015) is an adaptive learning rate optimization algorithm.
    • Reinforcement Learning (RL): A paradigm where an agent learns to make sequential decisions in an environment to maximize a cumulative reward.
      • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It consists of states, actions, transition probabilities, and rewards.
      • Actor-Critic Methods: A class of RL algorithms that combine two components: an actor (which learns a policy to select actions) and a critic (which learns to estimate the value of states or state-action pairs).
      • PPO (Proximal Policy Optimization): A popular on-policy RL algorithm that aims to improve the stability and performance of policy gradient methods by clipping the objective function, preventing large policy updates that could lead to instability.
    • Signed Distance Field (SDF): A representation of a 3D shape where each point in space stores the shortest distance to the surface of the shape, with the sign indicating whether the point is inside (negative) or outside (positive). Useful for collision detection and surface interaction.
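The classifier-free guidance combination mentioned above reduces to a few lines of code. The following is a minimal sketch, assuming a generic `denoiser` callable and a `null_emb` placeholder for the masked text condition; these names and the argument order are illustrative, not DART's actual API.

```python
import torch

def guided_prediction(denoiser, z_t, t, history, text_emb, null_emb, w):
    """Classifier-free guidance: blend conditional and unconditional
    denoiser outputs, amplifying the text-conditioned direction."""
    # Unconditional prediction: text condition replaced by the null embedding.
    pred_uncond = denoiser(z_t, t, history, null_emb)
    # Conditional prediction: uses the actual text embedding.
    pred_cond = denoiser(z_t, t, history, text_emb)
    # Guidance: move from the unconditional output along the
    # (conditional - unconditional) direction, scaled by the weight w.
    return pred_uncond + w * (pred_cond - pred_uncond)
```

With $w = 0$ this reduces to the unconditional prediction, $w = 1$ recovers the plain conditional prediction, and $w > 1$ strengthens adherence to the text prompt.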

3.2. Previous Works

The paper contextualizes DART by discussing several categories of prior research:

  • Conditional Motion Generation:

    • Early Motion Synthesis: Focused on generating realistic and diverse human motions without explicit conditions (Kovar et al., 2008; Holden et al., 2020; Clavet et al., 2016; Zinno, 2019). These methods laid the groundwork for motion realism but lacked flexible user control.
    • Text-Conditioned Motion Generation: Became popular for intuitive user interaction. Tevet et al. (2023), Zhang et al. (2022a), Petrovich et al. (2022), Guo et al. (2022), Jiang et al. (2024a), and Zhang et al. (2023a) are mentioned. These works primarily focus on generating standalone, short motions from a single descriptive sentence, which is a key limitation DART aims to overcome. T2M-GPT (Zhang et al., 2023a) is a baseline in DART's experiments, modified for history conditioning.
    • Audio/Speech-Driven Motion Synthesis: Methods like Alexanderson et al. (2023), Tseng et al. (2023), Siyao et al. (2022), and Ao et al. (2022; 2023) show progress in generating motion from auditory inputs.
    • Spatially Aware Motion Generation: Addresses applications requiring motion within spatial constraints or specific goals.
      • Interactive Character Control (Ling et al., 2020; Peng et al., 2022; Starke et al., 2022; Luo et al., 2024; Starke et al., 2024): Focuses on realism and responsiveness to interactive signals, but often lacks text-conditioned semantic control and is limited to small, curated datasets. GAMMA (Zhang & Tang, 2022) is an RL-based controller using a learned motion action space but lacks text conditioning and is limited to walking, serving as a baseline for DART's RL experiments.
      • Human-Scene Interactions (Hassan et al., 2021; Starke et al., 2019; Zhao et al., 2022; 2023; Li et al., 2024a; Xu et al., 2023; Jiang et al., 2024b; Wang et al., 2024; Liu et al., 2024; Li et al., 2024b; Zhang et al., 2022b; 2025): Aims to synthesize motions that interact realistically with 3D environments.
      • Human-Human(noid) Interactions (Liang et al., 2024; Zhang et al., 2023c; Christen et al., 2023; Cheng et al., 2024; Shan et al., 2024).
  • Diffusion Generative Models:

    • Foundational Diffusion Models: Ho et al. (2020) introduced Denoising Diffusion Probabilistic Models (DDPMs), followed by improvements such as Song et al. (2021a; b). Rombach et al. (2022) proposed Latent Diffusion Models (LDMs) for high-resolution synthesis by operating in a latent space, which DART adopts.
    • Diffusion for Motion Generation: Many works apply diffusion models to 3D human motions (Tevet et al., 2023; Barquero et al., 2024; Chen et al., 2023b; Karunratanakul et al., 2024b; Cohan et al., 2024; Dai et al., 2025; Chen et al., 2023a). Most focus on offline generation of short, isolated motion sequences, which DART explicitly addresses.
    • Temporal Motion Composition: FlowMDM (Barquero et al., 2024) is identified as the state-of-the-art temporal motion composition method. It can generate complex, continuous motions by composing actions with precise durations. However, it is an offline method requiring full timeline knowledge and has slow generation speed, making it unsuitable for real-time. FlowMDM is a key baseline for DART.
    • Diffusion with History/Control: Some works incorporate history conditions (Xu et al., 2023; Jiang et al., 2024b; Van Wouwe et al., 2024; Han et al., 2024) or adapt diffusion for real-time character control (Shi et al., 2024; Chen et al., 2024). Shi et al. (2024) learns control policies using diffusion noises as the action space but focuses on single-frame autoregressive generation and lacks text conditions.
    • Optimization-Based Control with Diffusion: DNO (Karunratanakul et al., 2024b) is closely related to DART's optimization-based control. Both use diffusion noises as a latent space for editing and control. However, DNO's diffusion model is trained on full motion sequences in explicit representations, whereas DART uses latent motion primitives, which the authors claim gives superior performance in harmonizing spatial control with text semantic alignment. DNO is a baseline for DART.
    • Concurrent Work: CloSD (Tevet et al., 2025) trains autoregressive motion diffusion models conditioned on target joint locations. This approach relies on paired training data of control signals and human motions. DART, in contrast, learns a latent motion space from motion-only data and uses latent space control methods that do not require paired training data.

3.3. Technological Evolution

The field of human motion generation has evolved significantly:

  1. Early Data-Driven Approaches (Motion Graphs): Focused on stitching together pre-recorded motion clips to create new sequences, emphasizing realism but with limited flexibility and control (Kovar et al., 2008).

  2. Learning Motion Manifolds (VAEs/Autoencoders): Methods like Holden et al. (2015) and Ling et al. (2020) introduced VAEs to learn continuous, compact latent representations of motion, allowing for interpolation and generation of novel motions, overcoming the discrete nature of motion graphs. This improved flexibility but often lacked high-level semantic control.

  3. Conditional Generation:

    • Physics-Based Control: Focused on physically plausible motions, often using RL to train characters to perform skills (Holden et al., 2017; Peng et al., 2022). These offered high fidelity but were typically skill-specific and hard to generalize.
    • Semantic Control (Text-to-Motion): The rise of large language models and vision-language models led to text-conditioned motion generation, enabling intuitive control via natural language (Tevet et al., 2023; Guo et al., 2022). Initial methods focused on short, isolated motions.
  4. Diffusion Models for High-Quality Synthesis: Diffusion models revolutionized generative AI, demonstrating unprecedented quality in images and videos. Their application to human motion (Tevet et al., 2023) brought similar quality improvements, though still largely for offline, short sequences.

  5. Temporal Composition and Real-time Requirements: The need for longer, continuous, and complex motions led to methods for temporal composition (FlowMDM), but these were often offline and slow. The demand for interactive applications pushed for real-time and online generation capabilities.

  6. Integration of Spatial Control: Simultaneously, there was a drive to integrate precise spatial control (e.g., goal-reaching, scene interaction) with semantic control, which proved challenging to balance effectively (Shafir et al., 2024; Karunratanakul et al., 2024b; Xie et al., 2024).

    DART's work fits into the most recent stage of this evolution, pushing the boundaries by combining latent diffusion models with an autoregressive motion primitive representation to enable real-time, continuous, text-driven motion generation with precise spatial control. It builds upon the strengths of VAEs (compact latent space), diffusion models (high-quality generation), and autoregressive modeling (long-term continuity) while explicitly addressing the limitations of prior work regarding real-time performance, long sequence generation, and the harmonious integration of text and spatial controls.

3.4. Differentiation Analysis

Compared to the main methods in related work, DART introduces several core differences and innovations:

  • Autoregressive & Real-time Generation vs. Offline/Isolated:

    • Prior Text-to-Motion (e.g., T2M-GPT, TEACH): Primarily generates short, isolated motions from a single text prompt.
    • FlowMDM: State-of-the-art for temporal motion composition, generating continuous long motions from a sequence of text prompts. However, FlowMDM is an offline method, requiring the entire action timeline in advance and suffering from slow generation speed.
    • DART's Innovation: DART is designed for online and real-time generation. It uses an autoregressive motion primitive approach, meaning it generates one short segment at a time, conditioned on previous generated motion and the current text prompt. This enables efficient, continuous generation of arbitrary length at over 300 frames/second with low latency, making it suitable for interactive applications.
  • Motion Primitive Space vs. Raw Motion/Full Sequence Diffusion:

    • Most Diffusion Models for Motion: Often diffuse directly in the raw, high-dimensional motion space or encode full motion sequences into a latent space (e.g., DNO).
    • DART's Innovation: DART learns a compact latent motion primitive space using a Variational Autoencoder (VAE). This primitive-based representation decomposes complex long sequences into simpler, short, overlapping segments. This simplifies the data distribution for the diffusion model, mitigates artifacts (like jittering) from raw data, and makes the latent space more amenable to fast generation and control.
  • Harmonious Integration of Text and Spatial Control:

    • Prior Spatial Control Methods (e.g., OmniControl, DNO): While attempting to integrate spatial control, they often struggle to effectively balance spatial accuracy, motion quality, and semantic alignment with text, sometimes ignoring text prompts to reach a goal. They are also typically restricted to isolated short motions.
    • DART's Innovation: DART's latent motion primitive space, combined with its text-conditioned latent diffusion model, provides a powerful foundation. It offers two flexible mechanisms for precise spatial control:
      1. Latent Diffusion Noise Optimization: Directly optimizes the latent noise variables using gradient descent to meet spatial constraints, while the underlying diffusion model ensures motion realism and text alignment.
      2. Reinforcement Learning Policies: Models the control task as an MDP and trains RL policies that output latent noises as actions, leveraging the pretrained generative model for high-quality motion.
    • This dual approach allows DART to maintain high motion realism and strong text-semantic alignment even while satisfying precise spatial goals, demonstrating superior harmonization compared to baselines.
  • Robustness and Controllability via Scheduled Training:

    • General Autoregressive Challenges: Autoregressive models can suffer from distribution drift during long rollouts, leading to unstable or unrealistic generations when the model encounters out-of-distribution history.

    • DART's Innovation: Employs scheduled training to progressively introduce rollout history (i.e., previously generated frames instead of ground truth frames) during training. This exposes the model to test-time distributions, improving the stability of long-sequence generation and enhancing controllability for unseen poses or novel text-motion combinations.

      In essence, DART moves beyond generating static, short motions or slow, offline compositions to a dynamic, real-time paradigm, offering versatile control over long, semantically rich, and spatially constrained human movements by intelligently combining motion primitives, latent diffusion, and advanced control mechanisms.

4. Methodology

4.1. Principles

The core idea behind DART is to achieve real-time, text-driven, and spatially controllable continuous human motion generation by breaking down the complex problem into manageable, learnable components. The theoretical basis and intuition are:

  1. Decomposition into Motion Primitives: Long, continuous human motions are inherently complex and high-dimensional, making them difficult to model directly. By representing them as sequences of short, overlapping motion primitives, the problem of modeling a global sequence is decomposed into modeling local, simpler segments. These primitives are easier for generative models to learn and also align better with atomic action semantics. The overlap ensures smooth transitions between generated primitives.
  2. Learning a Compact Latent Space: Raw motion data often contains noise and artifacts (e.g., glitches, jitters). Training generative models directly on this raw data can lead to inheriting these imperfections. A Variational Autoencoder (VAE) is used to compress these motion primitives into a lower-dimensional, compact, and clean latent space. This latent space is more robust to data noise, computationally efficient, and provides a continuous manifold of plausible motions, which is ideal for subsequent generation and control.
  3. Text-Conditioned Autoregressive Generation with Latent Diffusion: A latent diffusion model is chosen for its proven ability to generate high-quality and diverse samples. By training this diffusion model in the compact latent primitive space, conditioned on motion history and text inputs, DART learns to predict the next motion primitive autoregressively. This autoregressive nature, combined with the efficiency of operating in the latent space and using a small number of diffusion steps, enables real-time, online generation of motions of arbitrary length.
  4. Flexible Spatial Control through Latent Space Manipulation: The learned latent primitive space is not just for generation but also for precise control. Since this space represents plausible motions, manipulating variables within it (specifically, the latent diffusion noises) can guide motion towards spatial goals while maintaining realism. This control can be achieved in two ways:
    • Optimization: Directly optimizing the latent noises using gradient descent to minimize a criterion function that quantifies the distance to spatial goals and penalizes physical/scene constraint violations.

    • Reinforcement Learning: Framing the control as a Markov Decision Process where latent noises are actions, and training an RL policy to predict these actions based on the current state (history, goal, text, scene) to maximize rewards (e.g., goal-reaching, physical plausibility). This provides an efficient, learned controller for dynamic goals.

      These principles combine to create a versatile and efficient framework that can understand natural language instructions, generate long and realistic human motions in real-time, and precisely control spatial aspects of those motions.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Preliminaries

4.2.1.1. Problem Definition

The paper defines the task as text-conditioned online motion generation with spatial control. Given:

  • An $H$-frame seed motion $\mathbf{H}_{seed} = [\mathbf{h}^1, ..., \mathbf{h}^H]$. This is the initial motion context.

  • A sequence of $N$ text prompts $\mathbf{C} = [c^1, ..., c^N]$. These prompts dictate the semantic actions.

  • Spatial goals $g$. These define geometric constraints (e.g., target locations, poses).

    The objective is to autoregressively generate continuous and realistic human motion sequences $\mathbf{M} = [\mathbf{\bar{H}}_{seed}, \mathbf{X}^1, ..., \mathbf{X}^N]$. Here, each motion segment $\mathbf{X}^i$ must align with the corresponding text prompt $c^i$ and satisfy the spatial goal $g$. This task presents challenges in high-level action semantic control, precise spatial control, and smooth temporal transitions.

4.2.1.2. Autoregressive Motion Primitive Representation

DART models long-term human motions as a sequential composition of motion primitives with overlaps, suitable for efficient generative learning and online inference.

  • Motion Primitive Structure: Each motion primitive $\mathbf{P}^i = [\mathbf{H}^i, \mathbf{X}^i]$ is a short motion clip.

    • $\mathbf{H}^i = [\mathbf{h}^{i,1}, ..., \mathbf{h}^{i,H}]$: Consists of $H$ frames of history motion.
    • $\mathbf{X}^i = [\mathbf{x}^{i,1}, ..., \mathbf{x}^{i,F}]$: Consists of $F$ frames of future motion.
    • Overlap: The history motion $\mathbf{H}^i$ of the $i$-th primitive is defined as the last $H$ frames of the previous future motion segment $\mathbf{X}^{i-1}$, i.e., $\mathbf{X}^{i-1,F-H+1:F}$. This overlap ensures temporal continuity between successive primitives.
  • Long Motion Construction: Infinitely long motions can be represented as the rollout of such overlapping motion primitives: $\mathbf{M} = [\mathbf{\bar{H}}_{seed}, \mathbf{X}^1, ..., \mathbf{X}^N]$.

  • Human Body Representation (per frame): Each motion frame is represented as a $D = 276$ dimensional vector, using an overparameterized representation based on the SMPL-X parametric human body model. The components are (a dimension-check sketch follows at the end of this subsection):

    • $\mathbf{t} \in \mathbb{R}^3$: Global body translation.
    • $\mathbf{R} \in \mathbb{R}^6$: 6D rotation representation (Zhou et al., 2019) of the global body orientation.
    • $\pmb{\theta} \in \mathbb{R}^{21 \times 6}$: 6D representation of the 21 SMPL-X body joint rotations.
    • $\mathbf{J} \in \mathbb{R}^{22 \times 3}$: Locations of the 22 SMPL-X joints.
    • $\mathbf{dt} \in \mathbb{R}^3$: Temporal difference of the global translation with respect to the previous frame.
    • $\mathbf{dR} \in \mathbb{R}^6$: Temporal difference of the global rotation (in the 6D representation).
    • $\mathbf{dJ} \in \mathbb{R}^{22 \times 3}$: Temporal differences of the joint locations. The paper specifies $H = 2$ (history length) and $F = 8$ (future length) for the experiments.
  • Canonicalization: Each motion primitive is canonicalized into a local coordinate frame. This means the origin is centered at the pelvis of the first frame of the primitive. This human-centric local coordinate system helps normalize primitive features and facilitates model learning. The details for this are in Appendix A:

    • The origin is at the pelvis joint of the first frame.

    • The X-axis is the horizontal projection of the vector from the left hip to the right hip.

    • The Z-axis points in the inverse gravity direction (upwards).

    • The Y-axis is derived from the cross product of the Z-axis and X-axis. The canonicalization algorithm (Algorithm 3) is as follows:

      Algorithm 3 Motion primitive rotation canonicalization
      Input: right_hip ∈ R^3, left_hip ∈ R^3
      x_axis = right_hip - left_hip
      x_axis[2] = 0                          ▷ project onto the horizontal plane
      normalize(x_axis)
      z_axis = [0, 0, 1]                     ▷ inverse gravity direction
      y_axis = cross_product(z_axis, x_axis)
      normalize(y_axis)
      return [x_axis, y_axis, z_axis]

    The inverse transformation is applied to recover global motions.

  • Advantages of Primitive Representation:

    • Tractability: Decomposes globally complex sequences into short, simple primitives, making the data distribution easier to learn for generative models.
    • Efficiency: Autoregressive and simple nature is inherently suitable for fast online generation.
    • Interpretability: Primitives convey more interpretable semantics than individual frames, aiding in learning a text-conditioned motion space.
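As a quick consistency check on the per-frame representation listed above, the component dimensions sum to $D = 276$. The snippet below is a minimal sketch with illustrative names; only the sizes come from the paper.

```python
# Hypothetical per-frame feature sizes (SMPL-X based, 22 joints, 6D rotations);
# the dictionary keys are illustrative names, only the sizes come from the paper.
FEATURE_DIMS = {
    "translation_t": 3,         # global body translation
    "orientation_R": 6,         # 6D global orientation
    "body_pose_theta": 21 * 6,  # 6D rotations of 21 body joints
    "joints_J": 22 * 3,         # 3D locations of 22 joints
    "delta_t": 3,               # translation difference to the previous frame
    "delta_R": 6,               # orientation difference (6D)
    "delta_J": 22 * 3,          # joint location differences
}

frame_dim = sum(FEATURE_DIMS.values())
assert frame_dim == 276, frame_dim  # matches D = 276
```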

4.2.2. DART: A Diffusion-Based Autoregressive Motion Primitive Model

DART uses a latent diffusion model (Rombach et al., 2022; Chen et al., 2023b) for autoregressive motion generation, conditioned on text and motion history. It comprises a Variational Autoencoder (VAE) and a latent denoising diffusion model.

4.2.2.1. Learning the Latent Motion Primitive Space (VAE)

The paper introduces a motion primitive VAE to compress motion primitives into a compact latent space. This is crucial because raw motion data from AMASS (Mahmood et al., 2019) can contain artifacts (glitches, jitters), and training diffusion models directly on raw data often propagates these. The VAE compression mitigates these outlier artifacts. The resulting latent representation is compact, computationally efficient, and enhances control capabilities.

  • Architecture: The motion primitive VAE consists of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$, using a transformer-based architecture (similar to MLD; Chen et al., 2023b), as shown in Figure 1.

    Figure 1: Architecture illustration of DART. The encoder network compresses the future frames $\mathbf{X} = [\mathbf{x}^1, ..., \mathbf{x}^F]$ into a latent variable, conditioned on the history frames $\mathbf{H} = [\mathbf{h}^1, ..., \mathbf{h}^H]$. The decoder network reconstructs the future frames conditioned on the history frames and the latent sample. The denoiser network predicts the clean latent sample $\hat{\mathbf{z}}_0$ conditioned on the noising step, text prompt, history frames, and noised latent sample $\mathbf{z}_t$. During denoiser training, the encoder and decoder network weights remain fixed.

    • Encoder ($\mathcal{E}$): Takes history motion frames $\mathbf{H}$ and future motion frames $\mathbf{X}$ as input and predicts the distribution parameters (mean $\pmb{\mu}$ and variance $\pmb{\sigma}$) of the latent space.
    • Latent Sample ($\mathbf{z}$): Obtained from the predicted distribution via the reparameterization trick (Kingma & Welling, 2014).
    • Decoder ($\mathcal{D}$): Predicts the future frames $\hat{\mathbf{X}}$ from zero tokens, conditioned on the latent sample $\mathbf{z}$ and the history frames $\mathbf{H}$.
  • Training Objectives (from Appendix C): The motion primitive VAE is trained with a combination of losses (see also the sketch after this list): $ \mathcal{L}_{VAE} = \mathcal{L}_{rec} + w_{KL} \times \mathcal{L}_{KL} + w_{aux} \times \mathcal{L}_{aux} + w_{SMPL} \times \mathcal{L}_{SMPL} $ where:

    • Reconstruction Loss ($\mathcal{L}_{rec}$): Minimizes the distance between the reconstructed future frames $\hat{\mathbf{X}}$ and the ground truth future frames $\mathbf{X}$: $ \mathcal{L}_{rec} = \mathcal{F}(\hat{\mathbf{X}}, \mathbf{X}) $ Here, $\mathcal{F}(\cdot, \cdot)$ denotes a distance function, implemented as the smoothed L1 loss (Girshick, 2015).
    • Kullback-Leibler (KL) Regularization Term ($\mathcal{L}_{KL}$): Penalizes the difference between the predicted latent distribution and a standard Gaussian prior: $ \mathcal{L}_{KL} = KL(q(\mathbf{z}|\mathbf{H}) \parallel \mathcal{N}(0, \mathbf{I})) $ where $KL(\cdot \parallel \cdot)$ is the Kullback-Leibler divergence and $q(\mathbf{z}|\mathbf{H})$ is the distribution predicted by the encoder $\mathcal{E}$. A small weight $w_{KL} = 10^{-6}$ is used to maintain expressiveness.
    • Auxiliary Losses ($\mathcal{L}_{aux}$): Regularize the predicted temporal difference features to be consistent with the differences recomputed from the predicted motion features: $ \mathcal{L}_{aux} = \mathcal{F}(\bar{\mathbf{dt}}, \hat{\mathbf{dt}}) + \mathcal{F}(\bar{\mathbf{dJ}}, \hat{\mathbf{dJ}}) + \mathcal{F}(\bar{\mathbf{dR}}, \hat{\mathbf{dR}}) $ For instance, $\bar{\mathbf{dt}}^0 := \hat{\mathbf{t}}^1 - \hat{\mathbf{t}}^0$ (the difference of the first two frames of the predicted root translation) should be consistent with $\hat{\mathbf{dt}}^0$ (the predicted first-frame root translation difference feature). The weight $w_{aux} = 100$ is used.
    • SMPL Losses ($\mathcal{L}_{SMPL}$): Ensure consistency with the SMPL body model: $ \mathcal{L}_{SMPL} = \mathcal{L}_{joint\_rec} + \mathcal{L}_{consistency} $ where:
      • SMPL-based Joint Reconstruction Loss ($\mathcal{L}_{joint\_rec}$): Penalizes the discrepancy between joints regressed from predicted body parameters and those regressed from ground truth body parameters: $ \mathcal{L}_{joint\_rec} = \mathcal{F}(\mathcal{I}(\hat{\mathbf{t}}, \hat{\mathbf{R}}, \hat{\pmb{\theta}}), \mathcal{I}(\mathbf{t}, \mathbf{R}, \pmb{\theta})) $ where $\mathcal{I}$ is the SMPL body joint regressor, which predicts joint locations given the body parameters $(\mathbf{t}, \mathbf{R}, \pmb{\theta})$. Body shape parameters $\pmb{\beta}$ are assumed to be zero.
      • Joint Consistency Loss ($\mathcal{L}_{consistency}$): Regularizes the predicted joint locations $\hat{\mathbf{J}}$ and the predicted SMPL body parameters $(\hat{\mathbf{t}}, \hat{\mathbf{R}}, \hat{\pmb{\theta}})$ to consistently represent the same body joints: $ \mathcal{L}_{consistency} = \mathcal{F}(\hat{\mathbf{J}}, \mathcal{I}(\hat{\mathbf{t}}, \hat{\mathbf{R}}, \hat{\pmb{\theta}})) $ The weight $w_{SMPL} = 100$ is used.
  • Training Details: The VAE is trained with the AdamW (Kingma & Ba, 2015) optimizer, using a learning rate of $10^{-4}$ with linear annealing.
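The weighted combination of VAE losses above can be sketched compactly. The snippet below is a minimal PyTorch-style sketch under assumed tensor shapes; the auxiliary term is abbreviated to a single pair of difference features, and the SMPL term is omitted because it requires a body-model joint regressor.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, d_pred, d_recomputed,
             w_kl=1e-6, w_aux=100.0):
    """Sketch of the motion primitive VAE objective described above.
    x_hat/x: predicted and ground-truth future frames [F, D];
    mu/logvar: encoder outputs of the diagonal Gaussian posterior;
    d_pred/d_recomputed: predicted temporal-difference features vs. the
    differences recomputed from the predicted motion."""
    # Reconstruction: smoothed L1 between predicted and GT future frames.
    rec = F.smooth_l1_loss(x_hat, x)
    # KL divergence of the diagonal Gaussian posterior against N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Auxiliary consistency of the temporal-difference features.
    aux = F.smooth_l1_loss(d_pred, d_recomputed)
    return rec + w_kl * kl + w_aux * aux
```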

4.2.2.2. Latent Motion Primitive Diffusion Model

After learning the compact latent motion primitive space with the VAE, DART formulates text-conditioned autoregressive motion generation as approximating the probabilistic distribution $q(\mathbf{z}|\mathbf{H}, c)$ using a latent diffusion model $\mathcal{G}$.

  • Forward Diffusion Process: Given a latent representation $\mathbf{z}_0$ (obtained from the VAE encoder), the forward process iteratively adds Gaussian noise to produce increasingly noisy samples $\mathbf{z}_1, ..., \mathbf{z}_T$: $ q(\mathbf{z}_t | \mathbf{z}_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t}\, \mathbf{z}_{t-1}, \beta_t \mathbf{I}) $ Here, $\beta_t$ are noise schedule hyperparameters, and $\mathcal{N}$ denotes a Gaussian distribution.
  • Reverse Diffusion Process (Denoising): The denoiser model $\mathcal{G}$ learns to reverse this process, predicting a clean latent variable from noise. The reverse process is modeled as $ p_\theta(\mathbf{z}_{t-1} | \mathbf{z}_t, t, \mathbf{H}, c) = \mathcal{N}(\pmb{\mu}_t, \Sigma_t) $ and aims to generate motion primitives conditioned on the motion history $\mathbf{H}$ and text label $c$. The variance $\Sigma_t$ follows a fixed schedule, while the mean $\pmb{\mu}_t$ is modeled using the denoiser neural network. The denoiser $\mathcal{G}$ is designed to predict the clean latent variable $\hat{\mathbf{z}}_0$: $ \hat{\mathbf{z}}_0 = \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c) $ From this prediction, the mean $\pmb{\mu}_t$ of the reverse process is derived as $ \pmb{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c) + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{z}_t $ where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. During inference, a pure noise sample $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is progressively denoised by $\mathcal{G}$ until a clean sample $\mathbf{z}_0$ is obtained.
  • Architecture (Figure 1):
    • The diffusion step $t$ is embedded using a small MLP (multi-layer perceptron).
    • The text prompt $c$ is encoded using the CLIP (Radford et al., 2021) text encoder.
    • The text prompt is randomly masked out with a probability of 0.1 during training to enable classifier-free guidance during generation.
  • Decoding Latent to Motion: The clean latent variable $\hat{\mathbf{z}}_0$ is converted back to future frames using the frozen decoder $\mathcal{D}$ of the VAE: $ \hat{\mathbf{X}} = \mathcal{D}(\mathbf{H}, \hat{\mathbf{z}}_0) $
  • Training Objectives (from Appendix D.1): The latent denoiser $\mathcal{G}$ is trained using the following losses: $ \mathcal{L}_{denoiser} = \mathcal{L}_{simple} + w_{rec} \times \mathcal{L}_{rec} + w_{aux} \times \mathcal{L}_{aux} $ where:
    • Simple Objective ($\mathcal{L}_{simple}$): The primary diffusion objective (Ho et al., 2020), which trains the denoiser to predict the clean latent variable $\mathbf{z}_0$: $ \mathcal{L}_{simple} = \mathbb{E}_{(\mathbf{z}_0, c) \sim q(\mathbf{z}_0, c), t \sim [1, T], \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \mathcal{F}(\mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c), \mathbf{z}_0) $ Here, $\mathcal{F}(\cdot, \cdot)$ is again the smooth L1 loss.
    • Reconstruction Loss ($\mathcal{L}_{rec}$): Applied on the decoded future frames $\hat{\mathbf{X}}$ to ensure their validity, with $w_{rec} = 1$.
    • Auxiliary Losses ($\mathcal{L}_{aux}$): Also applied on the decoded future frames to ensure naturalness, with $w_{aux} = 10000$. The SMPL losses ($\mathcal{L}_{SMPL}$) are not used in denoiser training to avoid slowing it down.
  • Diffusion Steps: Notably, DART uses only 10 diffusion steps for both training and inference. This small number is sufficient due to the simplicity of the motion primitive representation, enabling highly efficient online generation.
  • Scheduled Training (from Appendix D.2): This technique improves the stability of long sequence generation and text prompt controllability. Instead of always using the ground truth history motion $\mathbf{H}$ during training, the model progressively learns to use rollout history (i.e., history extracted from its own previous predictions). This exposes the model to test-time distribution shifts, preventing out-of-distribution problems during long autoregressive rollouts.
    • The scheduled training has three stages:

      1. Fully Supervised Stage: Only ground truth history is used.
      2. Scheduled Learning Stage: Randomly replaces ground truth history with rollout history with a probability linearly increasing from 0 to 1.
      3. Rollout Training Stage: Always uses rollout history.
    • The algorithm for calculating rollout probability (Algorithm 4) is:

      Algorithm 4 Calculate rollout probability
      1: Input: current iteration number iter, number of train iterations in the first supervised stage I_1,
         number of train iterations in the second scheduled stage I_2
      2: Output: rollout probability p
      3: function ROLLOUT_PROBABILITY(iter, I_1, I_2)
      4:     if iter ≤ I_1 then                 ▷ no rollout in the first supervised stage
      5:         p ← 0
      6:     else if iter > I_1 + I_2 then      ▷ the third rollout stage always uses rollout
      7:         p ← 1
      8:     else                               ▷ linearly scheduled rollout probability in the second stage
      9:         p ← (iter - I_1) / I_2
      10:    end if
      11:    return p
      12: end function

      The scheduled training for the latent denoising model (Algorithm 5) is provided in the Appendix, showing how rollout_probability is used to decide whether to use ground truth or rollout history. The latent denoiser is trained with AdamW at a learning rate of $10^{-4}$ for 100K iterations per stage.

      • Autoregressive Rollout Generation (Algorithm 1): With the trained motion primitive decoder $\mathcal{D}$, latent denoiser $\mathcal{G}$, and a diffusion sampler $S$ (e.g., DDPM, DDIM), motion sequences are generated autoregressively (a code sketch of the rollout loop follows at the end of this subsection).

        Algorithm 1 Autoregressive rollout generation using the latent motion primitive model
        Input: primitive decoder D, latent denoiser G, history motion seed H_seed, text prompts C = [c^1, ..., c^N], total diffusion steps T, classifier-free guidance scale w, diffusion sampler S.
        Optional Input: latent noises Z_T = [z_T^1, ..., z_T^N]
        Output: motion sequence M
        H ← H_seed, M ← H_seed
        for i ← 1 to N do                              ▷ number of rollouts
            sample noise z_T^i from N(0, I) if not provided as input
            z_0 ← S(G, z_T^i, T, H, c^i, w)            ▷ diffusion sampling loop with classifier-free guidance
            X ← D(H, z_0)
            M ← CONCAT(M, X)                           ▷ append future frames to the generated sequence
            H ← CANONICALIZE(X^{F-H+1:F})              ▷ update the rollout history with the last H generated frames
        end for
        return M

        Explanation of Algorithm 1:

        1. Initialization: The history $\mathbf{H}$ is set to the seed motion $\mathbf{H}_{seed}$. The motion sequence $\mathbf{M}$ starts with $\mathbf{H}_{seed}$.
        2. Loop for each text prompt: For each of the $N$ text prompts:
          • Latent Noise Sampling: A noise vector $\mathbf{z}_T$ is sampled from a standard Gaussian distribution. This is the starting point of the diffusion process for the current primitive. If latent noises $\mathbf{Z}_T$ are provided as input (e.g., for optimization), the $i$-th noise $\mathbf{z}_T^i$ is used.
          • Denoising (Classifier-Free Guidance): The diffusion sampler $S$ (e.g., DDIM) uses the denoiser $\mathcal{G}$ to iteratively remove noise from $\mathbf{z}_T$, conditioned on the current history $\mathbf{H}$, text prompt $c^i$, and diffusion steps $T$. Classifier-free guidance (with scale $w$) is applied during denoising to steer the generation towards the text semantics. The output is the clean latent variable $\mathbf{z}_0$.
          • Decoding the Motion Primitive: The motion primitive decoder $\mathcal{D}$ transforms $\mathbf{z}_0$ (together with $\mathbf{H}$) into the future motion frames $\mathbf{X}$.
          • Concatenation: The newly generated future frames $\mathbf{X}$ are appended to the overall motion sequence $\mathbf{M}$.
          • History Update: The history $\mathbf{H}$ for the next iteration is updated by taking the last $H$ frames of the just-generated $\mathbf{X}$ and canonicalizing them. This maintains the autoregressive flow.
        3. Return: The complete generated motion sequence $\mathbf{M}$ is returned.
      • Real-time Performance: DART generates over 300 frames per second on a single RTX 4090 GPU, enabling real-time applications.
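To make the rollout concrete, the following is a minimal sketch of the loop in Algorithm 1, assuming placeholder callables `sampler`, `denoiser`, `decoder`, and `canonicalize` that stand in for DART's trained components; the names, shapes, and guidance scale are illustrative, not the paper's actual API.

```python
import torch

@torch.no_grad()
def autoregressive_rollout(sampler, denoiser, decoder, canonicalize,
                           history_seed, text_prompts, latent_dim,
                           num_steps=10, guidance_w=5.0):
    """Sketch of Algorithm 1: generate one motion primitive per text prompt,
    feeding the last H generated frames back as the next history."""
    num_history = history_seed.shape[0]   # H history frames, each of dim D
    history = history_seed
    motion = [history_seed]               # generated sequence starts with the seed
    for text in text_prompts:
        z_T = torch.randn(latent_dim)     # start each primitive from pure noise
        # Denoise for num_steps steps with classifier-free guidance (guidance_w).
        z_0 = sampler(denoiser, z_T, num_steps, history, text, guidance_w)
        future = decoder(history, z_0)    # decode F future frames [F, D]
        motion.append(future)
        # Re-canonicalize the last H generated frames as the next history.
        history = canonicalize(future[-num_history:])
    return torch.cat(motion, dim=0)       # [H + N*F, D]
```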

      4.2.3. Spatially Controllable Motion Synthesis via DART

      DART provides a framework for integrating precise spatial control, addressing the limitation of relying solely on text for complex spatial goals.

      • Motion Control Task Formulation: The task is defined as generating a motion sequence $\mathbf{M}$ that minimizes its distance to a given spatial goal $g$, under a criterion function $\mathcal{F}(\cdot, \cdot)$ and regularization from scene and physical constraints $cons(\cdot)$: $ \mathbf{M}^* = \operatorname*{argmin}_{\mathbf{M}} \mathcal{F}(\Pi(\mathbf{M}), g) + cons(\mathbf{M}) $ where:
        • $g$: Task-dependent spatial goal (e.g., a keyframe body pose or target location).
        • $\Pi(\cdot)$: A projection function that extracts goal-relevant features from $\mathbf{M}$ and maps them into the task-aligned observation space.
        • $cons(\cdot)$: Physical and scene constraints (e.g., collision avoidance, foot-floor contact).
      • Latent Space Control Formulation: Directly solving this in raw motion space often yields unrealistic motions, so DART reformulates it in its latent motion space. Using the deterministic DDIM sampler (Song et al., 2021a), the ROLLOUT function (Algorithm 1) becomes a deterministic mapping from a list of motion primitive latent noises $\mathbf{Z}_T = [\mathbf{z}_T^1, ..., \mathbf{z}_T^N]$ to a motion sequence $\mathbf{M}$, conditioned on $\mathbf{H}_{seed}$ and $C$: $ \mathbf{M} = \mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C) $ The minimization objective is then converted to optimizing the latent noises: $ \mathbf{Z}_T^* = \operatorname*{argmin}_{\mathbf{Z}_T} \mathcal{F}(\Pi(\mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C)), g) + cons(\mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C)) $ The paper notes that DDIM is not used to skip diffusion steps, as this can cause artifacts.

      4.2.3.1. Motion Control via Latent Diffusion Noise Optimization

      One solution is to directly optimize the latent noises $\mathbf{Z}_T$ using gradient descent methods (Kingma & Ba, 2015).

      • Optimization Algorithm (Algorithm 2):

        Input: latent noises Z_T = [z_T^1, ..., z_T^N], optimizer O, learning rate η, and goal g. (For brevity, the inputs of the ROLLOUT function defined in Alg. 1 and the criterion terms in Eq. 2 are not reiterated.)
        Output: a motion sequence M
        for i ← 1 to optimization steps do
            M ← ROLLOUT(Z_T, H_seed, C)
            grad ← ∇_{Z_T} (F(Π(M), g) + cons(M))
            Z_T ← O(Z_T, grad / ‖grad‖, η)             ▷ update using the normalized gradient
        end for
        return M ← ROLLOUT(Z_T, H_seed, C)

        Explanation of Algorithm 2:

        1. Input: Initial latent noises $\mathbf{Z}_T$, an optimizer $O$, a learning rate $\eta$, and a spatial goal $g$.
        2. Loop for Optimization Steps:
          • Generate Motion: The ROLLOUT function (Algorithm 1) is called with the current $\mathbf{Z}_T$, history seed $\mathbf{H}_{seed}$, and text prompts $C$ to generate a motion sequence $\mathbf{M}$.
          • Calculate Loss: The total loss is the criterion $\mathcal{F}(\Pi(\mathbf{M}), g)$ (distance to goal) plus the constraint term $cons(\mathbf{M})$. The gradient of this loss with respect to $\mathbf{Z}_T$ is computed.
          • Update Latent Noises: The optimizer $O$ updates $\mathbf{Z}_T$ using the computed gradient and learning rate $\eta$.
        3. Return: After optimization, the final motion sequence $\mathbf{M}$ is generated from the optimized $\mathbf{Z}_T$.
      • Example Control Scenarios (a code sketch combining the optimization loop with these constraints follows after this list):

        • Text-conditioned motion in-between: The goal is to generate a transition between a history and a goal keyframe ($f$ frames away), conditioned on a text prompt $c$. The optimization objective is the distance between the $f$-th frame of the generated motion and the goal keyframe.
        • Human-scene interaction generation: Incorporates physical and scene constraints $cons(\cdot)$. Given 3D scenes, text prompts, and goal joint locations (e.g., the pelvis for sitting), motions are generated that perform the desired interaction and reach the goal positions while adhering to scene constraints. 3D scenes are represented as signed distance fields (SDF) to compute body-scene distances for foot-floor contact and collision avoidance.
          • Scene Collision Constraint ($cons\_coll(\cdot)$ from Appendix F): Penalizes joints that collide with the scene (i.e., have negative SDF values) in each frame: $ cons\_coll(\mathbf{J}) = - \sum_{k=1}^{22} (\Psi(\mathbf{J}^k) - \tau^k)_- $ Here, $\mathbf{J}^k$ is the $k$-th joint's location in scene coordinates, $\Psi(\cdot)$ is the signed distance function (returning the distance to the scene surface), $\tau^k$ are joint-dependent contact threshold values (based on the joint-skin distance in the rest pose), and $(\cdot)_-$ clips positive values to zero, so only negative distances, indicating penetration, contribute to the penalty.
          • Foot-Floating Constraint ($cons\_cont(\cdot)$ from Appendix F): Encourages the feet to stay in contact with the scene: $ cons\_cont(\mathbf{J}) = (\operatorname*{min}_{k \in Foot} \Psi(\mathbf{J}^k) - \tau)_+ $ Here, $Foot$ is the set of foot joint indices, $\tau$ is the foot-floor contact threshold distance, and $(\cdot)_+$ clips negative values to zero, so only positive distances, indicating floating, contribute to the penalty.
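Putting Algorithm 2 together with the SDF-based constraints above, the following is a minimal PyTorch-style sketch. `rollout_fn`, `project_fn`, and `sdf` are placeholders for DART's deterministic DDIM rollout, the goal projection $\Pi$, and the scene SDF; the single scalar thresholds and the plain (non-normalized) Adam gradient step are simplifications of the paper's joint-dependent thresholds and normalized-gradient update.

```python
import torch

def collision_penalty(joints, sdf, tau=0.0):
    """Sum of penetration depths: joints whose SDF value falls below the
    contact threshold contribute (tau - sdf)."""
    d = sdf(joints) - tau                      # [..., 22] signed distances
    return (-d).clamp(min=0.0).sum()

def foot_floating_penalty(joints, sdf, foot_idx, tau=0.03):
    """Penalize the lowest foot joint floating above the contact threshold."""
    foot_d = sdf(joints[..., foot_idx, :])     # distances of the foot joints
    return (foot_d.min(dim=-1).values - tau).clamp(min=0.0).sum()

def optimize_latent_noises(rollout_fn, project_fn, goal, z_noises, sdf,
                           foot_idx, steps=100, lr=0.05):
    """Sketch of Algorithm 2: optimize the primitive latent noises so that the
    rolled-out motion reaches the goal while respecting scene constraints."""
    z_noises = z_noises.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z_noises], lr=lr)
    for _ in range(steps):
        motion, joints = rollout_fn(z_noises)  # deterministic DDIM rollout
        loss = (torch.norm(project_fn(motion) - goal)
                + collision_penalty(joints, sdf)
                + foot_floating_penalty(joints, sdf, foot_idx))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        motion, _ = rollout_fn(z_noises)
    return motion
```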

      4.2.3.2. Motion Control via Reinforcement Learning

      To address the computational expense of optimization, DART leverages its autoregressive primitive-based motion representation for reinforcement learning (RL). The latent motion control is modeled as a Markov Decision Process (MDP) with a latent action space.

      • MDP Formulation: Digital humans are agents interacting with an environment to maximize expected discounted return.

        • Time step $i$: The agent observes state $\mathbf{s}^i$, samples action $\mathbf{a}^i$ from the policy, transitions to the next state $\mathbf{s}^{i+1}$, and receives a reward $r^i$.
        • Latent Action Space: The latent noises $\mathbf{z}_T$ serve as the latent action space.
      • Policy Architecture (Figure 2): An actor-critic (Sutton & Barto, 1998) architecture trained with the PPO (Schulman et al., 2017) algorithm.

        Figure 2: Architecture of the reinforcement learning-based control policy. The pretrained DART diffusion denoiser and decoder models transform the latent actions into motion frames. The last predicted frames are canonicalized and provided to the policy model as the next-step history condition.

        • State Observation ($\mathbf{s}^i$): Includes the history motion observation $\mathbf{H}^i$, goal observation $g^i$, scene observation $s^i$, and the CLIP embedding of the text prompt $c^i$.
        • Policy Input: The policy model takes $[\mathbf{H}^i, g^i, s^i, c^i]$ and predicts the latent noise $\mathbf{z}_T^i$ as the action.
        • Action Execution: The predicted $\mathbf{z}_T^i$ is mapped to future motion frames $\mathbf{X}^i$ using the frozen latent denoiser $\mathcal{G}$ and motion primitive decoder $\mathcal{D}$.
        • History Update: The new history motion is extracted from the last $H$ predicted frames of $\mathbf{X}^i$ and fed to the policy network for the next step.
      • Reward Maximization: The minimization problem in Equation 3 is reformulated as reward maximization for policy training.

      • Example Task: Text-Conditioned Goal-Reaching:

        • Goal: Control the human to reach a 2D goal location $g$ with an action style specified by text $c$.
        • Goal Observation: Transformed into a local observation comprising the distance to the pelvis (at the last history frame) and the local goal direction (in the human-centric frame).
        • Scene Observation: For a flat scene, this is the floor height relative to the body pelvis (at the first history frame).
        • Rewards (from Appendix G): Policies are trained with multiple rewards; a reward-computation sketch appears after this list.
          • Distance Reward ($r_{dist}$): Encourages the human pelvis to move closer to the goal location. $ r_{dist} = d^{i-1} - d^i $ where $d^i$ is the 2D distance between the human pelvis and the goal location at step $i$.
          • Success Reward ($r_{succ}$): Provides a sparse but strong reward when the human reaches the goal. $ r_{succ} = \begin{cases} 1 & \text{if } d^i < 0.3 \\ 0 & \text{otherwise} \end{cases} $ where $0.3\,\mathrm{m}$ is the success threshold.
          • Moving Orientation Reward ($r_{ori}$): Encourages the human's moving direction to align with the goal direction. $ r_{ori} = \frac{\langle \mathbf{p}^i - \mathbf{p}^{i-1},\, g - \mathbf{p}^{i-1} \rangle + 1}{2} $ where $\mathbf{p}^i$ is the human pelvis location at step $i$ and $g$ is the goal location. The inner product $\langle \cdot, \cdot \rangle$ gives a measure of alignment, normalized to $[0, 1]$.
          • Skate Reward ($r_{skate}$): Penalizes foot displacement while the foot is in contact with the floor. $ r_{skate} = - disp \cdot (2 - 2^{h/0.03}) $ Here, $disp$ is the foot displacement over two consecutive frames, $h$ is the higher foot height in those frames, and $0.03\,\mathrm{m}$ is the contact threshold.
          • Floor Contact Reward ($r_{floor}$): Penalizes the lower foot when its distance to the floor exceeds a threshold. $ r_{floor} = - (|lf| - 0.03)_+ $ Here, $lf$ denotes the height of the lower foot, $|\cdot|$ is the absolute value, and $(\cdot)_+$ clips negative values to zero.
      • Performance: The text-conditioned goal-reaching policy achieves 240 frames per second.
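
A minimal rollout sketch of one control step in this latent-action MDP. The module names and signatures (`policy`, `denoiser`, `decoder`), tensor shapes, and the `canonicalize` placeholder are assumptions for illustration, not DART's actual API; the policy is shown as deterministic, whereas PPO in practice samples the latent action from a learned distribution.

```python
import torch

def canonicalize(frames):
    # Placeholder for the human-centric canonicalization described above
    # (re-expressing frames in the local frame defined by the new history).
    return frames

@torch.no_grad()
def rollout_step(policy, denoiser, decoder, history, goal_obs, scene_obs, text_emb):
    """One environment step: observe state, predict latent action, decode future motion."""
    # State observation s^i = [H^i, g^i, s^i, c^i]
    state = torch.cat([history.flatten(1), goal_obs, scene_obs, text_emb], dim=-1)

    # The policy outputs the latent noise z_T^i, which is the action.
    z_T = policy(state)

    # Frozen DART modules map the latent action to future frames X^i: the latent
    # denoiser produces a clean primitive latent, the decoder produces F frames.
    z_0 = denoiser(z_T, history, text_emb)
    future_frames = decoder(z_0, history)        # assumed shape: (B, F, frame_dim)

    # The last H predicted frames become the canonicalized history for the next step.
    H = history.shape[1]
    next_history = canonicalize(future_frames[:, -H:])
    return future_frames, next_history
```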
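
The goal-reaching rewards above can be sketched as follows, transcribing the Appendix G formulas quoted in this analysis. Variable names, the explicit normalization in the orientation reward, and the contact gating of the skate penalty are illustrative assumptions.

```python
import numpy as np

def goal_reaching_rewards(pelvis_prev, pelvis_cur, goal_2d,
                          foot_heights, foot_disp, success_thresh=0.3):
    """pelvis_* and goal_2d: 2D (x, y) positions; foot_heights: the two foot heights
    over consecutive frames; foot_disp: foot displacement between those frames."""
    d_prev = np.linalg.norm(pelvis_prev - goal_2d)
    d_cur = np.linalg.norm(pelvis_cur - goal_2d)

    r_dist = d_prev - d_cur                               # progress toward the goal
    r_succ = 1.0 if d_cur < success_thresh else 0.0       # sparse success bonus

    move_dir = pelvis_cur - pelvis_prev
    goal_dir = goal_2d - pelvis_prev
    move_dir = move_dir / (np.linalg.norm(move_dir) + 1e-8)   # normalization is an assumption
    goal_dir = goal_dir / (np.linalg.norm(goal_dir) + 1e-8)
    r_ori = (np.dot(move_dir, goal_dir) + 1.0) / 2.0          # alignment in [0, 1]

    h = max(foot_heights)                                 # higher foot over the two frames
    in_contact = h < 0.03                                 # contact gating (assumption)
    r_skate = -foot_disp * (2.0 - 2.0 ** (h / 0.03)) if in_contact else 0.0

    lf = min(foot_heights)                                # lower foot height
    r_floor = -max(abs(lf) - 0.03, 0.0)                   # penalize floating feet

    return r_dist, r_succ, r_ori, r_skate, r_floor
```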

      4.2.4. Physics-Based Motion Tracking (from Appendix H)

      DART, being a kinematic-based approach, can produce physically inaccurate motions (e.g., skating, floating, penetration). To address this, DART can be integrated with physically simulated motion tracking methods. The paper demonstrates this with PHC (Luo et al., 2023). Raw DART output (e.g., crawling with hand-floor penetration) can be refined by physics-based tracking to produce more plausible results, improving joint-floor contact and eliminating penetration artifacts (as shown in Figure 5). The real-time capabilities of both DART and PHC allow for on-the-fly correction.


      Figure 5: We demonstrate an example of integrating DART with the physics-based motion tracking method PHC (Luo et al., 2023) to achieve more physically plausible motions. The left image illustrates a crawling sequence generated by DART, exhibiting artifacts such as hand-floor penetration. The right image displays the physics-based motion tracking outcome applied to the raw generated sequence, which enhances joint-floor contact and resolves the hand-floor penetration issue.

      5. Experimental Setup

      5.1. Datasets

      DART models are trained on two motion-text datasets, both derived from the AMASS (Mahmood et al., 2019) dataset of motion capture.

      • BABEL (Bodies, Action and Behavior with English Labels) (Punnakkal et al., 2021):

        • Characteristics: Contains motion capture sequences (AMASS source) with frame-aligned text labels that annotate fine-grained action semantics. This allows models to learn precise human action controls and natural transitions.
        • Scale: Large-scale, providing diverse actions.
        • Sampling: Motion primitives are randomly sampled. Text labels are randomly sampled from all action segments overlapping with the primitive.
        • Action Imbalance: To alleviate the imbalance, importance sampling is applied based on the provided action labels, giving each action category a roughly equal sampling chance (see the sketch after this dataset list).
        • Framerate: 30 frames per second (same as prior works).
        • Compatibility: Human gender fixed to male, body shape parameter set to zero.
        • Usage: Primary dataset for text-conditioned temporal motion composition experiments.
      • HML3D (HumanML3D) (Guo et al., 2022):

        • Characteristics: Contains short motions with coarse sequence-level sentence descriptions.
        • Scale: Also uses AMASS motion sources.
        • Subset Usage: Only a subset where SMPL body motion sequences are available is used, as the HumanAct12 subset and mirrored sequences only provide joint locations (not SMPL bodies required by DART's representation). Original HML3D joint rotations calculated by naive inverse kinematics are not directly usable for animation.
        • Sampling: Primitives are randomly sampled with uniform probability. Text labels are randomly chosen from multiple sentence captions of the overlapping sequence.
        • Framerate: 20 frames per second.
        • Compatibility: Human gender fixed to male, body shape parameter set to zero.
        • Usage: Used for optimization-based motion in-between experiments for fair comparison with baselines trained on it.
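
A minimal sketch of the category-balanced importance sampling mentioned for BABEL, weighting candidate primitives inversely to their action-category frequency. The data structures and weighting scheme are illustrative assumptions, not the authors' data loader.

```python
import random
from collections import Counter

def build_sampling_weights(primitive_action_labels):
    """primitive_action_labels: one action-category string per candidate primitive.
    Returns per-primitive weights so each category is drawn roughly equally often."""
    counts = Counter(primitive_action_labels)
    return [1.0 / counts[label] for label in primitive_action_labels]

def sample_primitive(primitives, weights):
    # Rare categories (e.g., 'kick') are drawn about as often as frequent ones
    # (e.g., 'walk'), mitigating the action imbalance noted above.
    return random.choices(primitives, weights=weights, k=1)[0]
```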

      5.2. Evaluation Metrics

      The paper uses a comprehensive set of metrics to evaluate different aspects of motion generation:

      5.2.1. Motion Realism & Quality

      • Fréchet Inception Distance (FID) (Heusel et al., 2017 for images, adapted for motion):
        • Conceptual Definition: Measures the similarity between the distribution of generated motions and the distribution of real motions from the dataset. It calculates the Fréchet distance between two multivariate Gaussian distributions (one for real data features, one for generated data features) in a feature space (e.g., embeddings from a pre-trained motion encoder). Lower FID indicates higher similarity and typically better realism.
        • Mathematical Formula: $ \text{FID} = ||\mu_x - \mu_g||^2 + \text{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
        • Symbol Explanation:
          • $\mu_x$: Mean of feature vectors for real motions.
          • $\mu_g$: Mean of feature vectors for generated motions.
          • $\Sigma_x$: Covariance matrix of feature vectors for real motions.
          • $\Sigma_g$: Covariance matrix of feature vectors for generated motions.
          • $\text{Tr}(\cdot)$: Trace of a matrix.
          • $||\cdot||^2$: Squared L2 norm.
      • Jerk:
        • Conceptual Definition: Jerk is the rate of change of acceleration, or the third derivative of position with respect to time. It quantifies the smoothness of motion. High jerk values indicate abrupt, unnatural, or "jerky" movements.
        • Mathematical Formula: For a discrete sequence of positions $\mathbf{p}_t$, the velocity is $\mathbf{v}_t = \mathbf{p}_t - \mathbf{p}_{t-1}$, the acceleration is $\mathbf{a}_t = \mathbf{v}_t - \mathbf{v}_{t-1}$, and the jerk is $\mathbf{j}_t = \mathbf{a}_t - \mathbf{a}_{t-1}$. The magnitude of jerk is $||\mathbf{j}_t||$.
        • Symbol Explanation:
          • $\mathbf{p}_t$: Position at time step $t$.
          • $\mathbf{v}_t$: Velocity at time step $t$.
          • $\mathbf{a}_t$: Acceleration at time step $t$.
          • $\mathbf{j}_t$: Jerk at time step $t$.
          • $||\cdot||$: L2 norm (magnitude of the vector).
        • Metrics used in paper:
          • Peak Jerk (PJ): The maximum jerk value observed over a motion segment, indicating the most abrupt movement. Lower is better.
          • Area Under the Jerk (AUJ): The integral (sum in discrete case) of jerk magnitudes over a motion segment, giving an overall measure of motion jerkiness. Lower is better.
      • Skate Metric (Ling et al., 2020; Zhang et al., 2018):
        • Conceptual Definition: Quantifies unwanted foot sliding (skating) when a foot is supposed to be in contact with the ground. It measures foot displacement when the foot's height is below a contact threshold.
        • Mathematical Formula: $ \bar{s} = disp \cdot (2 - 2^{\bar{h}/0.03}) $
        • Symbol Explanation:
          • $\bar{s}$: Scaled foot skating.
          • $disp$: Foot displacement over two consecutive frames.
          • $\bar{h}$: The higher foot height in the two consecutive frames.
          • $0.03\,\mathrm{m}$: Threshold for considering the foot in contact with the floor. Lower values indicate less skating. A combined computation sketch of FID, jerk, and the skate metric follows this list.
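
A minimal sketch of these realism metrics, assuming precomputed motion features for FID and joint trajectories/foot statistics for jerk and skating; the feature extractors, array shapes, and aggregation choices are assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """real_feats, gen_feats: (N, D) motion feature arrays; standard FID formula."""
    mu_x, mu_g = real_feats.mean(0), gen_feats.mean(0)
    sigma_x = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_x @ sigma_g).real
    return np.sum((mu_x - mu_g) ** 2) + np.trace(sigma_x + sigma_g - 2.0 * covmean)

def jerk_metrics(positions):
    """positions: (T, J, 3) joint trajectories. Returns (peak jerk, area under jerk)."""
    jerk = np.diff(positions, n=3, axis=0)       # third finite difference over time
    jerk_mag = np.linalg.norm(jerk, axis=-1)     # (T-3, J)
    return jerk_mag.max(), jerk_mag.sum()

def skate_metric(higher_foot_heights, foot_disp, thresh=0.03):
    """higher_foot_heights, foot_disp: (T,) per frame-pair statistics. Averaging over
    contact frames is an assumed aggregation."""
    in_contact = higher_foot_heights < thresh
    skate = foot_disp * (2.0 - 2.0 ** (higher_foot_heights / thresh))
    return skate[in_contact].mean() if in_contact.any() else 0.0
```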

      5.2.2. Motion-Text Semantic Alignment

      • R-prec (R-Precision):
        • Conceptual Definition: Measures how often the ground-truth text description is retrieved among the top-ranked candidates for a given motion in a motion-text retrieval task. A higher R-prec indicates better alignment between the generated motion and its text prompt.
      • MM-Dist (Motion-to-Text Distance):
        • Conceptual Definition: Measures the distance between motion features and text embeddings, quantifying semantic alignment. Typically, motions and texts are encoded into a shared embedding space (e.g., with CLIP or a dedicated motion-text evaluator) and a distance (e.g., Euclidean or cosine) is computed between them. Lower values indicate better alignment. A retrieval-style sketch of both metrics follows this list.
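
A minimal sketch of both metrics in a shared embedding space. The batch-of-32 retrieval protocol and top-3 cutoff follow a common evaluation convention for these metrics and are assumptions here, since the exact protocol is not spelled out above.

```python
import numpy as np

def r_precision(motion_emb, text_emb, top_k=3, batch_size=32):
    """motion_emb, text_emb: (N, D) paired embeddings. Returns top-k retrieval accuracy."""
    hits, trials = 0, 0
    for start in range(0, len(motion_emb) - batch_size + 1, batch_size):
        m = motion_emb[start:start + batch_size]
        t = text_emb[start:start + batch_size]
        dist = np.linalg.norm(m[:, None, :] - t[None, :, :], axis=-1)  # (B, B) pairwise distances
        ranks = np.argsort(dist, axis=1)                               # texts ranked per motion
        hits += sum(i in ranks[i, :top_k] for i in range(batch_size))
        trials += batch_size
    return hits / max(trials, 1)

def mm_dist(motion_emb, text_emb):
    """Mean Euclidean distance between each motion and its paired text embedding."""
    return np.linalg.norm(motion_emb - text_emb, axis=-1).mean()
```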

      5.2.3. Generation Efficiency & Controllability

      • DIV (Diversity):
        • Conceptual Definition: Measures the variety or diversity of generated motions for the same input condition (e.g., text prompt). A higher DIV indicates the model can produce a wider range of plausible motions, which is desirable for creative applications.
      • Speed (frame/s): Frames generated per second, indicating real-time capability. Higher is better.
      • Latency (s): Time to get the first generated frames, crucial for online interaction. Lower is better.
      • Memory (MiB): Memory usage during generation. Lower is better.
      • History Error (cm) & Goal Error (cm): For motion in-betweening, these measure the L2 norm error (Euclidean distance) between the generated motion's start/end frame and the given history/goal keyframes. Lower is better.
      • Reach Time (s) & Success Rate: For goal-reaching tasks, reach time measures how long it takes to reach the goal, and success rate is the percentage of successful attempts. Lower reach time and higher success rate are better.
      • Floor Distance (cm): For RL control, measures the average distance of the feet from the floor, indicating floating artifacts. Lower is better.

      5.2.4. Human Preference Studies

      • Realism (%): Participants choose which motion looks more realistic between two options. Higher percentage for DART indicates better realism.
      • Semantic Alignment (%): Participants choose which motion better aligns with the given action text stream. Higher percentage for DART indicates better semantic alignment.

      5.3. Baselines

      The paper compares DART against several representative baseline models for different tasks:

      5.3.1. Text-Conditioned Temporal Motion Composition

      • TEACH (Temporal Action Composition for 3D Humans) (Athanasiou et al., 2022): An early method for composing 3D human actions temporally.
      • DoubleTake (Shafir et al., 2024): A more recent approach that composes independently generated motion segments by blending them through "handshake" transitions. The paper mentions adjusting its handshake size and blending length.
      • T2M-GPT* (Text-to-Motion GPT) (Zhang et al., 2023a): A history-conditioned modification of the original T2M-GPT. The original T2M-GPT uses discrete representations and is designed for generating motions from text. The * indicates a modification to include history conditioning for sequential generation, making it a more relevant baseline for temporal composition.
      • FlowMDM (Seamless Human Motion Composition with Blended Positional Encodings) (Barquero et al., 2024): The state-of-the-art offline temporal motion composition method. It can generate complex, continuous motions by composing desired actions with precise adherence to specified durations. However, it's an offline method requiring full timeline knowledge and is slow. This is a crucial baseline for performance and efficiency comparison.

      5.3.2. Latent Diffusion Noise Optimization-Based Control (Motion In-betweening)

      • DNO (Optimizing Diffusion Noise Can Serve as Universal Motion Priors) (Karunratanakul et al., 2024b): A method that uses diffusion noises as a latent space for editing and control, similar to DART's optimization approach. However, its diffusion model is trained with full motion sequences in explicit motion representations.
      • OmniControl (Control Any Joint at Any Time for Human Motion Generation) (Xie et al., 2024): An approach that provides flexible spatial control over any joint at any time in diffusion-based human motion generation by applying guidance signals during sampling.

      5.3.3. Reinforcement Learning-Based Control (Goal-Reaching)

      • GAMMA (The Wanderings of Odysseus in 3D Scenes) (Zhang & Tang, 2022): A baseline that also trains goal-reaching policies with a learned motion action space. Unlike DART, GAMMA lacks text conditioning and is limited to generating walking motions, making it a good comparison for the RL framework without text control.

      6. Results & Analysis

      6.1. Core Results Analysis

      The experiments extensively validate DART's effectiveness across text-conditioned temporal motion composition, latent diffusion noise optimization-based spatial control (motion in-betweening, human-scene interaction), and reinforcement learning-based control (goal-reaching).

      6.1.1. Text-Conditioned Temporal Motion Composition

      This task evaluates DART's ability to generate realistic, continuous motion sequences that align with a stream of text prompts and specified durations. The evaluations were conducted on the BABEL validation set.

      The following are the results from Table 1 of the original paper:

| Method | Seg. FID↓ | Seg. R-prec↑ | Seg. DIV→ | Seg. MM-Dist↓ | Trans. FID↓ | Trans. DIV→ | Trans. PJ→ | Trans. AUJ↓ | Speed (frame/s)↑ | Latency (s)↓ | Mem. (MiB)↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | 0.00±0.00 | 0.72±0.00 | 8.42±0.15 | 3.36±0.00 | 0.00±0.00 | 6.20±0.06 | 0.02±0.00 | 0.00±0.00 | – | – | – |
| TEACH | 17.58±0.04 | 0.66±0.00 | 10.02±0.06 | 5.86±0.00 | 3.89±0.05 | 5.44±0.07 | 1.39±0.01 | 5.86±0.02 | 3880±144 | 0.05±0.00 | 2251 |
| DoubleTake | 7.92±0.13 | 0.60±0.01 | 8.29±0.16 | 5.59±0.01 | 3.56±0.05 | 6.08±0.06 | 0.32±0.00 | 1.23±0.01 | 85±1 | 0.76±0.11 | 1474 |
| T2M-GPT* | 7.71±0.55 | 0.49±0.01 | 8.89±0.21 | 6.69±0.08 | 2.53±0.04 | 6.61±0.02 | 1.44±0.03 | 4.10±0.09 | 885±12 | 0.23±0.00 | 2172 |
| FlowMDM | 5.81±0.10 | 0.67±0.00 | 8.90±0.06 | 5.08±0.02 | 2.39±0.01 | 6.63±0.08 | 0.04±0.00 | 0.11±0.00 | 31±0 | 161.29±0.24 | 11892 |
| Ours | 3.79±0.06 | 0.62±0.01 | 8.05±0.10 | 5.27±0.01 | 1.86±0.05 | 6.70±0.03 | 0.06±0.00 | 0.21±0.00 | 334±2 | 0.02±0.00 | 2394 |

      Analysis of Table 1:

      • Motion Realism (FID): DART achieves the best FID for both segment (3.79) and transition (1.86) evaluations. This indicates DART's generated motions are the most realistic and closest to the dataset's distribution, outperforming all baselines, including FlowMDM.
      • Motion Smoothness (PJ, AUJ): DART shows second-best jerk metrics (PJ 0.06, AUJ 0.21), with FlowMDM being slightly better. This suggests DART produces smooth action transitions, close to the best offline method. The small values are close to the dataset reference, indicating high motion quality.
      • Motion-Text Semantic Alignment (R-prec, MM-Dist): FlowMDM achieves the best R-prec (0.67) and MM-Dist (5.08). DART has slightly lower R-prec (0.62) and MM-Dist (5.27). The authors explain this as an inherent characteristic of online generation: natural action transitions require a brief delay to adapt to new semantic prompts, which shifts motion embeddings and impacts R-prec. Offline methods like FlowMDM, having full timeline knowledge, can perfectly "plan" transitions.
      • Diversity (DIV): DART has a DIV of 8.05 for segments and 6.70 for transitions, which is comparable to FlowMDM and DoubleTake, indicating good diversity in generated motions.
      • Efficiency (Speed, Latency, Memory): This is where DART shines significantly in practical terms:
        • Speed: DART is remarkably fast at 334 frames/s, approximately 10x faster than FlowMDM (31 frames/s). TEACH is faster, but its motion quality (FID) is substantially worse.

        • Latency: DART has an extremely low latency of 0.02s, enabling true real-time responsiveness. This is orders of magnitude better than FlowMDM (161.29s).

        • Memory: DART uses 2394 MiB, which is significantly less than FlowMDM (11892 MiB), making it more resource-friendly.

          The following are the results from Table 2 of the original paper:

| Comparison | Realism (%) | Semantic (%) |
|---|---|---|
| Ours vs. TEACH | 66.7 vs. 33.3 | 66.0 vs. 34.0 |
| Ours vs. DoubleTake | 66.4 vs. 33.6 | 66.1 vs. 33.9 |
| Ours vs. T2M-GPT* | 61.3 vs. 38.7 | 66.7 vs. 33.3 |
| Ours vs. FlowMDM | 53.3 vs. 46.7 | 51.3 vs. 48.7 |

      Analysis of Table 2 (Human Preference Study):

      • Despite FlowMDM showing slightly better R-prec and MM-Dist in quantitative metrics, human evaluation reveals a different story. DART is preferred over FlowMDM for both motion realism (53.3% vs. 46.7%) and motion-text semantic alignment (51.3% vs. 48.7%). This suggests that the slight "delay" in semantic emergence due to natural transitions in DART is perceived as more natural by humans than the potentially abrupt, oracle-informed transitions of FlowMDM.
      • DART is clearly preferred over TEACH, DoubleTake, and T2M-GPT* in both realism and semantic alignment by a significant margin (e.g., ~66% preference over TEACH), highlighting its overall superior quality.

      6.1.2. Latent Diffusion Noise Optimization-Based Control

      6.1.2.1. Text-Conditioned Motion In-betweening

      This task evaluates generating motions that smoothly transition between a history and a goal keyframe, conditioned on a text prompt specifying the action semantics. DART was trained on the HML3D dataset for fair comparison.

      The following are the results from Table 3 of the original paper:

| Method | History error (cm)↓ | Goal error (cm)↓ | Skate (cm/s)↓ | Jerk↓ |
|---|---|---|---|---|
| Dataset | 0.00 ± 0.00 | 0.00 ± 0.00 | 2.27 ± 0.00 | 0.74 ± 0.00 |
| OmniControl | 21.22 ± 2.86 | 7.79 ± 1.91 | 4.97 ± 1.31 | 1.41 ± 0.08 |
| DNO | 1.20 ± 0.20 | 4.24 ± 1.34 | 5.38 ± 0.70 | 0.65 ± 0.06 |
| Ours | 0.00 ± 0.00 | 0.59 ± 0.01 | 2.98 ± 0.32 | 0.61 ± 0.01 |

      Analysis of Table 3:

      • Keyframe Adherence (History Error, Goal Error): DART achieves perfect history error (0.00 cm) and the lowest goal error (0.59 cm). This demonstrates its superior capability in generating motions that precisely connect specified keyframes, significantly outperforming OmniControl and DNO.
      • Motion Quality (Skate, Jerk): DART exhibits the lowest skate (2.98 cm/s) and jerk (0.61) metrics among baselines, indicating fewer foot sliding artifacts and smoother motions. These values are also much closer to the dataset's natural motion quality.
      • Semantic Alignment with Spatial Control: The paper highlights that DART effectively preserves the semantics specified by the text prompts, unlike DNO which occasionally ignores text to reach the goal. This is a crucial advantage, showcasing DART's superior harmony between spatial control and text semantic alignment due to its latent motion primitive-based diffusion model.

      6.1.2.2. Human-Scene Interaction

      Qualitative results show that the latent noise optimization control can be applied to human-scene interaction synthesis. By using 3D scenes represented as signed distance fields (SDF) and incorporating scene contact/collision constraints, DART can generate motions like climbing stairs or sitting on a chair, adhering to both text prompts and goal joint locations. Figure 3 illustrates this.


      Figure 3: Illustrations of human-scene interaction generation given text prompts and goal pelvis joint location (visualized as a red sphere). Best viewed in the supplementary video.

      6.1.3. Reinforcement Learning-Based Control

      DART's integration with reinforcement learning enables training text-conditioned goal-reaching policies for various locomotion styles (walk, run, hop on the left leg).

      The following are the results from Table 4 of the original paper:

| Method | Time (s)↓ | Success rate↑ | Skate (cm/s)↓ | Floor distance (cm)↓ |
|---|---|---|---|---|
| GAMMA walk | 31.44 ± 2.58 | 0.95 ± 0.03 | 5.14 ± 1.58 | 5.55 ± 0.84 |
| Ours 'walk' | 17.08 ± 0.05 | 1.0 ± 0.0 | 2.67 ± 0.12 | 2.24 ± 0.02 |
| Ours 'run' | 10.55 ± 0.06 | 1.0 ± 0.0 | 3.23 ± 0.24 | 3.86 ± 0.05 |
| Ours 'hop on left leg' | 20.50 ± 0.24 | 1.0 ± 0.0 | 2.22 ± 0.12 | 4.11 ± 0.07 |

      Analysis of Table 4:

      • Goal-Reaching Performance (Time, Success Rate): DART's policies consistently achieve a 100% success rate for all locomotion styles, significantly outperforming GAMMA walk (95%). DART also demonstrates much faster reach times (17.08s for walk vs. GAMMA's 31.44s).
      • Motion Quality (Skate, Floor Distance): DART's policies show lower skate (2.67 cm/s for walk vs. GAMMA's 5.14 cm/s) and lower floor distance (2.24 cm for walk vs. GAMMA's 5.55 cm), indicating more physically plausible motions with less sliding and floating.
      • Versatility: DART supports multiple text-conditioned locomotion styles, unlike GAMMA which is limited to walking.
      • Efficiency: The text-conditioned goal-reaching policy achieves a generation speed of 240 frames per second, demonstrating its real-time applicability for learned controllers.

      6.2. Ablation Studies / Parameter Analysis

      Ablation studies were conducted on architecture designs, diffusion steps, primitive representation, and scheduled training, specifically for the text-conditioned motion composition task (Table 5).

      The following are the results from Table 5 of the original paper:

| Method | Seg. FID↓ | Seg. R-prec↑ | Seg. DIV→ | Seg. MM-Dist↓ | Trans. FID↓ | Trans. DIV→ | Trans. PJ→ | Trans. AUJ↓ |
|---|---|---|---|---|---|---|---|---|
| Dataset | 0.00±0.00 | 0.72±0.00 | 8.42±0.15 | 3.36±0.00 | 0.00±0.00 | 6.20±0.06 | 0.02±0.00 | 0.00±0.00 |
| Ours | 3.79±0.06 | 0.62±0.01 | 8.05±0.10 | 5.27±0.01 | 1.86±0.05 | 6.70±0.03 | 0.06±0.00 | 0.21±0.00 |
| DART-VAE | 4.23±0.02 | 0.62±0.01 | 8.33±0.12 | 5.29±0.01 | 1.79±0.02 | 6.73±0.23 | 0.20±0.00 | 0.96±0.00 |
| DART-schedule | 8.08±0.09 | 0.39±0.01 | 8.05±0.12 | 6.96±0.03 | 7.41±0.10 | 6.58±0.06 | 0.03±0.00 | 0.18±0.00 |
| per frame (H=1, F=1) | 10.31±0.09 | 0.29±0.01 | 6.82±0.13 | 7.41±0.01 | 7.82±0.09 | 6.03±0.08 | 0.02±0.00 | 0.08±0.00 |
| H=2, F=16 | 4.04±0.10 | 0.66±0.00 | 8.20±0.06 | 4.96±0.01 | 2.22±0.10 | 6.60±0.20 | 0.06±0.00 | 0.18±0.00 |
| steps 2 | 4.44±0.04 | 0.60±0.00 | 8.20±0.15 | 5.38±0.01 | 2.24±0.02 | 6.77±0.07 | 0.05±0.00 | 0.20±0.00 |
| steps 5 | 3.49±0.09 | 0.63±0.00 | 8.25±0.15 | 5.18±0.01 | 2.11±0.07 | 6.74±0.11 | 0.05±0.00 | 0.20±0.00 |
| steps 8 | 3.70±0.03 | 0.62±0.01 | 8.04±0.13 | 5.25±0.03 | 2.15±0.08 | 6.72±0.15 | 0.06±0.00 | 0.20±0.00 |
| steps 10 (Ours) | 3.79±0.06 | 0.62±0.01 | 8.05±0.10 | 5.27±0.01 | 1.86±0.05 | 6.70±0.03 | 0.06±0.00 | 0.21±0.00 |
| steps 50 | 3.82±0.05 | 0.60±0.00 | 7.74±0.07 | 5.30±0.01 | 2.11±0.10 | 6.58±0.10 | 0.06±0.00 | 0.22±0.00 |
| steps 100 | 4.16±0.06 | 0.61±0.00 | 7.82±0.15 | 5.32±0.02 | 2.20±0.05 | 6.43±0.10 | 0.06±0.00 | 0.21±0.00 |

      Analysis of Table 5:

      • Impact of VAE (DART-VAE):
        • Removing the VAE (DART-VAE row) results in significantly higher jerk metrics (PJ 0.20 vs. Ours 0.06, AUJ 0.96 vs. Ours 0.21). This indicates considerably more jittering and less smooth motions when the diffusion model is trained directly in the raw motion space. This validates the importance of the VAE for compressing high-frequency noise and improving motion quality.
      • Impact of Scheduled Training (DART-schedule):
        • Without scheduled training (DART-schedule row), the model's R-prec (0.39 vs. Ours 0.62) and FID (8.08 vs. Ours 3.79) metrics are significantly worse. This confirms that scheduled training is critical for enabling the model to effectively respond to text control and handle out-of-distribution history-text combinations encountered during long autoregressive generations.
      • Impact of Primitive Length (per frame $(H=1, F=1)$ and $H=2, F=16$):
        • Using per-frame prediction ($H=1, F=1$) leads to significantly worse R-prec (0.29), FID (10.31), and MM-Dist (7.41). This demonstrates that very short, single-frame primitives are less effective for learning a text-conditioned motion space and responding to text prompts than primitives with a reasonable horizon (like $H=2, F=8$ in Ours).
        • Increasing the future length from $F=8$ (Ours) to $F=16$ ($H=2, F=16$) results in slightly worse FID for segments (4.04 vs. 3.79) and transitions (2.22 vs. 1.86), while slightly improving R-prec (0.66 vs. 0.62). This suggests that $F=8$ provides a good balance between learning complex semantics and maintaining motion quality.
      • Impact of Diffusion Steps:
        • The ablation shows that DART can achieve high-quality generation with a surprisingly small number of diffusion steps. Steps 5 and Steps 8 perform very similarly to Steps 10 (Ours).
        • An extremely low diffusion step number of 2 results in a higher FID (4.44), indicating poorer motion quality. However, reducing steps from 100 to 10 or even 5 does not significantly degrade performance, validating the efficiency of DART's design (due to the simplicity of the primitive representation). This is a key finding for real-time applications, as fewer steps directly translate to faster inference.

      6.3. Experimental Details and Human Preference Study Interface

      • Experiment Details: DART uses Algorithm 1 for online generation with a default rest-standing seed motion and a classifier-free guidance weight of 5 (a minimal guidance sketch appears at the end of this section). Baselines FlowMDM, TEACH, and DoubleTake used their released checkpoints, while T2M-GPT* was re-trained with history conditioning. Action segments for evaluation were extracted from the BABEL validation set, clamping the minimum duration to 15 frames and skipping 'transition' labels.

      • Human Preference Studies: Conducted on Amazon Mechanical Turk (AMT) to evaluate generation realism and text-motion semantic alignment. Participants viewed pairs of generated motions (from DART vs. a baseline) and selected the better one. Each comparison was voted by 3 independent participants, with random shuffling to prevent bias. This setup is illustrated in Figure 4.


      Figure 4: Illustration of the human preference study interface for evaluating motion-text semantic alignment (top) and perceptual realism(bottom). Participants are requested to select the generation that is perceptually more realistic or better aligns with the action descriptions in subtitles (only visible in semantic preference study).
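
The classifier-free guidance weight of 5 mentioned in the experiment details combines conditional and unconditional denoiser outputs in the standard way; the denoiser signature below is an assumption for illustration, not DART's actual interface.

```python
def guided_prediction(denoiser, z_t, t, history, text_emb, w=5.0):
    """Classifier-free guidance: blend text-conditioned and unconditional predictions
    with guidance weight w (w = 5 in the experiments described above)."""
    cond = denoiser(z_t, t, history, text_emb)   # text-conditioned prediction
    uncond = denoiser(z_t, t, history, None)     # null / dropped text condition
    return uncond + w * (cond - uncond)
```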

      7. Conclusion & Reflections

      7.1. Conclusion Summary

      The paper introduces DartControl (DART), a pioneering diffusion-based autoregressive motion primitive model for real-time text-driven motion control. DART successfully addresses critical limitations in existing human motion generation methods by offering a unified framework for creating long, continuous, and complex motions that are precisely guided by natural language and spatial constraints. Its core contributions include learning a compact, artifact-mitigating latent motion primitive space via a VAE, enabling efficient and high-quality autoregressive generation through a latent diffusion model (with just 10 steps and scheduled training), and providing versatile spatial control capabilities through both latent noise optimization and reinforcement learning policies. Experimental results consistently demonstrate DART's superior performance in motion realism, efficiency (over 300 frames/s, low latency), and controllability across diverse tasks, outperforming state-of-the-art baselines.

      7.2. Limitations & Future Work

      The authors acknowledge several limitations and propose future research directions:

      • Reliance on Frame-Level Annotations: DART heavily relies on motion sequences with fine-grained, frame-level aligned text annotations (like BABEL) to achieve precise text-motion alignment and natural transitions.
      • Degeneration with Coarse Labels: When trained on coarse sentence-level motion labels (like HML3D), the text-motion alignment degenerates for texts describing multiple actions. This leads to motions randomly switching between described actions, as each short motion primitive inherently matches only a portion of the sequence's semantics, causing semantic misalignment and ambiguity with coarse labels.
      • Future Work - Hierarchical Latent Spaces: The authors propose exploring hierarchical latent spaces (Stofl et al., 2024) to effectively tackle both fine-grained (primitive-level) and global (sequence-level) semantics.
      • Future Work - Open-Vocabulary Generation (General Limitation): Acknowledged as a critical challenge for all existing text-conditioned motion generation methods. The scarcity of 3D human motion data with text annotations (compared to image/video data) limits generalization to open-vocabulary text prompts.
      • Future Work - Dataset Expansion: Promising directions include extracting motion data from in-the-wild internet videos and generative image/video models (Kapon et al., 2024; Goel et al., 2023; Lin et al., 2023; Shan et al., 2024).
      • Future Work - Automated Labeling: Leveraging advancements in vision-language models (VLMs) (Shan et al., 2024; Dai et al., 2023) to automatically provide detailed, frame-aligned motion text labels to facilitate text-to-motion generation.

      7.3. Personal Insights & Critique

      DART represents a significant step forward in human motion generation, particularly in its emphasis on real-time and online capabilities, which are crucial for interactive applications.

      • Strength - Versatility of Latent Space: A key strength is the versatility of DART's learned latent motion primitive space. It's not just an efficient representation for generation, but also a powerful foundation for two distinct control paradigms (optimization and reinforcement learning), demonstrating its robustness and broad applicability. The fact that the same latent space can be manipulated for precise spatial control via gradient-based optimization or by a learned RL policy is highly compelling.

      • Strength - Efficiency through Primitives and Few Steps: The use of motion primitives combined with a surprisingly small number of diffusion steps (10 steps) for high-quality generation is a critical innovation for real-time performance. This challenges the common assumption that diffusion models require many steps, suggesting that simpler data distributions (like those of primitives) can be effectively modeled with fewer iterations.

      • Critical Reflection - Semantic Alignment Trade-off: The slightly lower R-prec and MM-Dist compared to FlowMDM in quantitative metrics, despite better human preference, highlights an interesting trade-off. DART prioritizes natural, online transitions, which might introduce a brief delay in fully manifesting new action semantics. While humans perceive this as natural, it can penalize metrics that expect immediate, perfect semantic shifts. This suggests that metrics for online, autoregressive generation might need to account for natural transition delays.

      • Applicability & Transferability: The methods presented, especially the concept of learning a robust, controllable latent primitive space and combining it with various control techniques, could be highly transferable. For instance, similar frameworks could be applied to generating other continuous, complex sequences, such as robotic arm movements for manufacturing, character behaviors in games, or even long-term event sequences in other domains, provided a suitable primitive definition exists.

      • Unverified Assumption / Area for Improvement - Primitive Length Optimization: While the ablation study shows F=8F=8 is effective, a deeper theoretical analysis or adaptive mechanism for determining optimal primitive length (H, F) could be beneficial. Different actions might naturally lend themselves to different primitive durations.

      • Impact on Physics-Based Simulation: The successful integration with PHC for physics-based refinement is very promising. This hybrid approach (kinematic generation + physics correction) could become a standard for achieving both creative flexibility and physical plausibility in real-time. The real-time nature of DART could enable online physics-based correction, which is a powerful paradigm.

        Overall, DART offers an elegant and practical solution to a complex problem, pushing the boundaries of real-time, controllable human motion generation with a thoughtfully designed architecture and comprehensive experimental validation.
