DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control
TL;DR Summary
DartControl (DART) is a diffusion-based autoregressive motion model enabling real-time text-driven motion control. It overcomes limitations of existing methods by generating complex, long-duration movements that effectively integrate motion history and text inputs.
Abstract
Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DartControl, in short DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.
1. Bibliographic Information
1.1. Title
The central topic of the paper is "DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control."
1.2. Authors
The authors are Kaifeng Zhao, Gen Li, and Siyu Tang, all affiliated with ETH Zürich. ETH Zürich is a prestigious public research university in Zürich, Switzerland, known for its leading research in science and technology, including computer science and robotics.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, on 2024-10-07 (17:58:22 UTC). While arXiv hosts preprints before formal peer review, many high-quality research papers are first shared there. The associated video results and code indicate a comprehensive research effort.
1.4. Publication Year
2024
1.5. Abstract
The paper introduces DartControl (DART), a novel Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion Control. It addresses the limitations of existing text-conditioned human motion generation methods, which typically produce short, isolated motions and struggle with continuous, long-duration, complex motions under real-time and online conditions, especially when incorporating spatial constraints. DART learns a compact motion primitive space, jointly conditioned on motion history and text inputs, using latent diffusion models. By autoregressively generating these motion primitives, DART enables real-time, sequential motion generation guided by natural language. Furthermore, its learned motion primitive space facilitates precise spatial motion control, which is framed either as a latent noise optimization problem or a Markov decision process solved via reinforcement learning (RL). The paper presents effective algorithms for both approaches, demonstrating superior performance in motion realism, efficiency, and controllability compared to existing baselines across various motion synthesis tasks.
1.6. Original Source Link
https://arxiv.org/abs/2410.05260 This is the official arXiv link for the preprint.
1.7. PDF Link
https://arxiv.org/pdf/2410.05260v3.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the generation of long, continuous, and complex human motions that precisely respond to streams of natural language descriptions and spatial constraints, particularly in an online and real-time setting.
This problem is highly important in fields like computer graphics, virtual reality, robotics, and interactive character animation, where intuitive control over digital humans is crucial for user experience and content creation.
Specific challenges and gaps in prior research include:
- Short, Isolated Motions: Existing text-conditioned models primarily generate short, standalone motions from a single sentence, failing to compose long, continuous sequences with varying semantics.
- Offline Generation: State-of-the-art temporal motion composition methods, like FlowMDM, are offline, meaning they require prior knowledge of the entire action timeline and are too slow for real-time applications.
- Difficulty with Spatial Constraints: Integrating spatial constraints (e.g., reaching a goal location, interacting with objects) with text-conditioned generation is complex, requiring careful alignment of semantic and geometric information, often leading to compromises in motion quality or semantic fidelity.
- Lack of Text-Conditioned Control in Interactive Systems: While interactive character control has focused on realism and responsiveness, most systems lack natural language interfaces, relying on tedious spatial input.

The paper's entry point is to leverage the power of diffusion models for high-quality generative tasks, combined with an autoregressive motion primitive representation, to tackle the challenge of continuous, real-time, and semantically/spatially controlled motion generation. The innovative idea is to learn a compact, text-conditioned motion primitive space that can be rapidly rolled out and manipulated.
2.2. Main Contributions / Findings
The primary contributions of DartControl (DART) are:
- A Diffusion-Based Autoregressive Motion Primitive Model: DART effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. This enables the synthesis of motions of arbitrary length in an autoregressive, online, and real-time manner.
- Real-time, Sequential Motion Generation: By autoregressively generating motion primitives based on preceding history and current text input, DART achieves a generation speed of over 300 frames per second with low latency, making it suitable for real-time applications, a significant improvement over offline methods.
- Precise Spatial Motion Control: The learned motion primitive space provides a robust foundation for integrating precise spatial control. The paper formulates this control either as a latent noise optimization problem (e.g., for in-betweening, human-scene interaction) or as a Markov decision process addressed through reinforcement learning (e.g., for goal-reaching locomotion). Effective algorithms are presented for both approaches.
- Versatile and Unified Model: DART is presented as a simple, unified, and highly effective model capable of addressing various complex motion synthesis tasks, including long, continuous sequence generation from sequential text prompts, in-between motion, scene-conditioned motion, and goal-reaching synthesis.
Key conclusions and findings include:
- DART significantly outperforms existing baselines in motion realism, efficiency, and controllability, as evidenced by quantitative metrics (FID, skate, jerk, speed, latency) and human preference studies.
- The latent motion primitive representation and the use of a variational autoencoder (VAE) effectively mitigate artifacts (like jittering) commonly found in raw motion data, leading to higher-quality generations.
- Scheduled training is crucial for improving the stability of long sequence generation and enhancing text prompt controllability, especially for unseen poses or novel combinations.
- The proposed latent noise optimization framework effectively harmonizes spatial control with text semantic alignment, a challenge for previous methods.
- Reinforcement learning can efficiently learn control policies using DART's latent space, enabling tasks like text-conditioned goal-reaching with different locomotion styles while maintaining high quality and speed.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the DART paper, a reader should be familiar with several fundamental concepts in machine learning, computer graphics, and motion synthesis:
- Human Motion Representation:
  - SMPL-X Model (Skinned Multi-Person Linear model, eXpressive): A widely used 3D parametric model for human bodies that can represent a wide range of human shapes and poses, including hands and facial expressions. It provides a statistical body model that, given shape parameters ($\boldsymbol{\beta}$) and pose parameters ($\boldsymbol{\theta}$), outputs a mesh representing the human body. The pose parameters typically describe joint rotations relative to a rest pose, and a global translation ($\mathbf{t}$) and global orientation ($\mathbf{R}$) determine the body's position and orientation in world space.
  - 6D Rotation Representation: A continuous way to represent 3D rotations that avoids the gimbal lock and singularities encountered with Euler angles. It stores two 3D vectors corresponding to the first two columns of the rotation matrix; after orthonormalization, the third column is recovered as their cross product. The paper cites Zhou et al. (2019) for this.
  - Motion Primitives: Short, reusable, atomic motion segments that can be chained together to form longer, more complex motions. In DART, these are overlapping segments, facilitating smooth transitions and autoregressive generation. They offer a more manageable target for generative models than entire, long motion sequences.
- Generative Models:
  - Variational Autoencoders (VAEs): A type of generative neural network that learns a compressed, continuous latent space representation of data. It consists of two main parts: an encoder that maps input data to a probability distribution (mean and variance) in the latent space, and a decoder that reconstructs the data from samples drawn from this latent distribution. VAEs are trained to minimize a reconstruction loss (how well the decoder reconstructs the input) and a Kullback-Leibler (KL) divergence loss (to ensure the latent distribution is close to a prior, often a standard Gaussian).
  - Denoising Diffusion Probabilistic Models (DDPMs) / Diffusion Models: A class of generative models that learn to reverse a diffusion process. The forward diffusion process gradually adds Gaussian noise to a data sample over several steps until it becomes pure noise. The reverse diffusion process (which the model learns) iteratively removes noise, starting from a pure noise sample, to generate a clean data sample.
    - Latent Diffusion Models (LDMs): Extend diffusion models by performing the diffusion process not on the raw data (e.g., high-resolution images or motion data) but on a compressed, lower-dimensional latent representation learned by an autoencoder (like a VAE). This makes the diffusion process more efficient and allows for higher-quality generation.
    - DDIM (Denoising Diffusion Implicit Models): A deterministic variant of diffusion models that allows for faster sampling (fewer steps) and more coherent latent space manipulation compared to DDPMs, which are stochastic. DART uses DDIM for its deterministic mapping from latent noise to motion during spatial control.
- Conditional Generation Techniques:
  - Autoregressive Generation: A process where a model generates a sequence of data one element (or segment) at a time, conditioning each new element on all previously generated elements in the sequence. This is crucial for generating continuous, long motions in an online fashion.
  - Classifier-Free Guidance: A technique used in conditional diffusion models to enhance the influence of conditioning information (e.g., text prompts). During training, the conditioning input (e.g., text embedding) is sometimes dropped out (masked). At inference, both a conditional prediction (with the text) and an unconditional prediction (without the text) are made. The difference between the conditional and the unconditional prediction is scaled by a guidance weight ($w$) and added to the unconditional prediction, pushing the generation further in the direction of the text condition. The formula is: $ \mathcal{G}_w(\mathbf{z}_t, t, \mathbf{H}, c) = \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, \emptyset) + w \cdot (\mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c) - \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, \emptyset)) $ Here, $\mathcal{G}$ is the denoiser, $\mathbf{z}_t$ is the noisy latent sample, $t$ is the diffusion step, $\mathbf{H}$ is the history, $c$ is the text condition, and $\emptyset$ represents the unconditional (masked) condition.
  - CLIP (Contrastive Language–Image Pre-training): A neural network trained on a vast dataset of images and their text captions. It learns to embed both images and text into a shared, high-dimensional space such that semantically similar image-text pairs are close together. This allows CLIP to be used as a powerful text encoder for text-conditioned generation tasks.
- Optimization & Control:
  - Gradient Descent Methods (e.g., Adam): Iterative optimization algorithms used to minimize an objective function by moving in the direction opposite to the gradient of the function. Adam (Kingma & Ba, 2015) is an adaptive learning rate optimization algorithm.
  - Reinforcement Learning (RL): A paradigm where an agent learns to make sequential decisions in an environment to maximize a cumulative reward.
    - Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It consists of states, actions, transition probabilities, and rewards.
    - Actor-Critic Methods: A class of RL algorithms that combine two components: an actor (which learns a policy to select actions) and a critic (which learns to estimate the value of states or state-action pairs).
    - PPO (Proximal Policy Optimization): A popular on-policy RL algorithm that aims to improve the stability and performance of policy gradient methods by clipping the objective function, preventing large policy updates that could lead to instability.
  - Signed Distance Field (SDF): A representation of a 3D shape where each point in space stores the shortest distance to the surface of the shape, with the sign indicating whether the point is inside (negative) or outside (positive). Useful for collision detection and surface interaction.
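The classifier-free guidance formula above can be illustrated with a minimal sketch. It assumes a hypothetical denoiser callable with the signature `(z_t, t, history, text)`; `toy_denoiser` is a stand-in invented for this example and is not DART's network.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical denoiser signature: (z_t, t, history, text_or_None) -> prediction
Denoiser = Callable[[Sequence[float], int, Sequence[float], Optional[str]], List[float]]


def guided_prediction(denoiser: Denoiser, z_t, t, history, text, w: float) -> List[float]:
    """Blend unconditional and conditional predictions with guidance weight w."""
    uncond = denoiser(z_t, t, history, None)  # text condition masked out
    cond = denoiser(z_t, t, history, text)    # text condition present
    # G_w = G(uncond) + w * (G(cond) - G(uncond)), applied element-wise
    return [u + w * (c - u) for u, c in zip(uncond, cond)]


def toy_denoiser(z_t, t, history, text) -> List[float]:
    """Stand-in for a trained network: shifts the input by 1 when text is given."""
    shift = 1.0 if text is not None else 0.0
    return [z + shift for z in z_t]


z = [0.0, 2.0]
print(guided_prediction(toy_denoiser, z, t=5, history=[], text="walk", w=1.0))  # [1.0, 3.0]
print(guided_prediction(toy_denoiser, z, t=5, history=[], text="walk", w=3.0))  # [3.0, 5.0]
```

With $w = 0$ the result equals the unconditional prediction, with $w = 1$ it equals the conditional prediction, and larger weights extrapolate further along the text direction.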
3.2. Previous Works
The paper contextualizes DART by discussing several categories of prior research:
- Conditional Motion Generation:
  - Early Motion Synthesis: Focused on generating realistic and diverse human motions without explicit conditions (Kovar et al., 2008; Holden et al., 2020; Clavet et al., 2016; Zinno, 2019). These methods laid the groundwork for motion realism but lacked flexible user control.
  - Text-Conditioned Motion Generation: Became popular for intuitive user interaction. Tevet et al. (2023), Zhang et al. (2022a), Petrovich et al. (2022), Jiang et al. (2024a), and Zhang et al. (2023a), among others, are mentioned. These works primarily focus on generating standalone, short motions from a single descriptive sentence, which is a key limitation DART aims to overcome. T2M-GPT (Zhang et al., 2023a) is a baseline in DART's experiments, modified for history conditioning.
  - Audio/Speech-Driven Motion Synthesis: Methods like Alexanderson et al. (2023), Tseng et al. (2023), and Siyao et al. (2022) show progress in generating motion from auditory inputs.
  - Spatially Aware Motion Generation: Addresses applications requiring motion within spatial constraints or specific goals.
    - Interactive Character Control (Ling et al., 2020; Peng et al., 2022; Starke et al., 2022; Luo et al., 2024; Starke et al., 2024): Focuses on realism and responsiveness to interactive signals, but often lacks text-conditioned semantic control and is limited to small, curated datasets. GAMMA (Zhang & Tang, 2022) is an RL-based controller using a learned motion action space but lacks text conditioning and is limited to walking, serving as a baseline for DART's RL experiments.
    - Human-Scene Interactions (Hassan et al., 2021; Starke et al., 2019; Zhao et al., 2022; 2023; Li et al., 2024a; Xu et al., 2023; Jiang et al., 2024b; Wang et al., 2024; Liu et al., 2024; Li et al., 2024b; Zhang et al., 2022b; 2025): Aims to synthesize motions that interact realistically with 3D environments.
    - Human-Human(noid) Interactions (Liang et al., 2024; Zhang et al., 2023c; Christen et al., 2023; Cheng et al., 2024; Shan et al., 2024).
- Diffusion Generative Models:
  - Foundational Diffusion Models: Denoising Diffusion Probabilistic Models (DDPMs) were introduced first, followed by improvements like Song et al. (2021a; b). Rombach et al. (2022) proposed Latent Diffusion Models (LDMs) for high-resolution synthesis by operating in a latent space, which DART adopts.
  - Diffusion for Motion Generation: Many works apply diffusion models to 3D human motions (Tevet et al., 2023; Barquero et al., 2024; Chen et al., 2023b; Karunratanakul et al., 2024b; Cohan et al., 2024; Dai et al., 2025; Chen et al., 2023a). Most focus on offline generation of short, isolated motion sequences, a limitation DART explicitly addresses.
  - Temporal Motion Composition: FlowMDM (Barquero et al., 2024) is identified as the state-of-the-art temporal motion composition method. It can generate complex, continuous motions by composing actions with precise durations. However, it is an offline method requiring full timeline knowledge and has slow generation speed, making it unsuitable for real-time use. FlowMDM is a key baseline for DART.
  - Diffusion with History/Control: Some works incorporate history conditions (Xu et al., 2023; Jiang et al., 2024b; Van Wouwe et al., 2024; Han et al., 2024) or adapt diffusion for real-time character control (Shi et al., 2024; Chen et al., 2024). A related approach learns control policies using diffusion noises as the action space but focuses on single-frame autoregressive generation and lacks text conditions.
  - Optimization-Based Control with Diffusion: DNO (Karunratanakul et al., 2024b) is closely related to DART's optimization-based control. Both use diffusion noises as a latent space for editing and control. However, DNO's diffusion model is trained with full motion sequences in explicit representations, whereas DART uses latent motion primitives, which the authors claim gives superior performance in harmonizing spatial control with text semantic alignment. DNO is a baseline for DART.
  - Concurrent Work: CloSD (Tevet et al., 2025) trains autoregressive motion diffusion models conditioned on target joint locations. This approach relies on paired training data of control signals and human motions. DART, in contrast, learns a latent motion space from motion-only data and uses latent space control methods that do not require paired training data.
3.3. Technological Evolution
The field of human motion generation has evolved significantly:
- Early Data-Driven Approaches (Motion Graphs): Focused on stitching together pre-recorded motion clips to create new sequences, emphasizing realism but with limited flexibility and control (Kovar et al., 2008).
- Learning Motion Manifolds (VAEs/Autoencoders): Methods like Holden et al. (2015) and Ling et al. (2020) introduced VAEs to learn continuous, compact latent representations of motion, allowing for interpolation and generation of novel motions, overcoming the discrete nature of motion graphs. This improved flexibility but often lacked high-level semantic control.
- Conditional Generation:
  - Physics-Based Control: Focused on physically plausible motions, often using RL to train characters to perform skills (Holden et al., 2017; Peng et al., 2022). These offered high fidelity but were typically skill-specific and hard to generalize.
  - Semantic Control (Text-to-Motion): The rise of large language models and vision-language models led to text-conditioned motion generation, enabling intuitive control via natural language (Tevet et al., 2023; Guo et al., 2022). Initial methods focused on short, isolated motions.
- Diffusion Models for High-Quality Synthesis: Diffusion models revolutionized generative AI, demonstrating unprecedented quality in images and videos. Their application to human motion (Tevet et al., 2023) brought similar quality improvements, though still largely for offline, short sequences.
- Temporal Composition and Real-time Requirements: The need for longer, continuous, and complex motions led to methods for temporal composition (FlowMDM), but these were often offline and slow. The demand for interactive applications pushed for real-time and online generation capabilities.
- Integration of Spatial Control: Simultaneously, there was a drive to integrate precise spatial control (e.g., goal-reaching, scene interaction) with semantic control, which proved challenging to balance effectively (Shafir et al., 2024; Karunratanakul et al., 2024b; Xie et al., 2024).

DART's work fits into the most recent stage of this evolution, pushing the boundaries by combining latent diffusion models with an autoregressive motion primitive representation to enable real-time, continuous, text-driven motion generation with precise spatial control. It builds upon the strengths of VAEs (compact latent space), diffusion models (high-quality generation), and autoregressive modeling (long-term continuity) while explicitly addressing the limitations of prior work regarding real-time performance, long sequence generation, and the harmonious integration of text and spatial controls.
3.4. Differentiation Analysis
Compared to the main methods in related work, DART introduces several core differences and innovations:
- Autoregressive & Real-time Generation vs. Offline/Isolated:
  - Prior Text-to-Motion (e.g., T2M-GPT, TEACH): Primarily generates short, isolated motions from a single text prompt.
  - FlowMDM: State-of-the-art for temporal motion composition, generating continuous long motions from a sequence of text prompts. However, FlowMDM is an offline method, requiring the entire action timeline in advance and suffering from slow generation speed.
  - DART's Innovation: DART is designed for online and real-time generation. It uses an autoregressive motion primitive approach, meaning it generates one short segment at a time, conditioned on previously generated motion and the current text prompt. This enables efficient, continuous generation of arbitrary length at over 300 frames/second with low latency, making it suitable for interactive applications.
- Motion Primitive Space vs. Raw Motion/Full Sequence Diffusion:
  - Most Diffusion Models for Motion: Often diffuse directly in the raw, high-dimensional motion space or encode full motion sequences into a latent space (e.g., DNO).
  - DART's Innovation: DART learns a compact latent motion primitive space using a Variational Autoencoder (VAE). This primitive-based representation decomposes complex long sequences into simpler, short, overlapping segments. This simplifies the data distribution for the diffusion model, mitigates artifacts (like jittering) from raw data, and makes the latent space more amenable to fast generation and control.
Harmonious Integration of Text and Spatial Control:
- Prior Spatial Control Methods (e.g., OmniControl, DNO): While attempting to integrate spatial control, they often struggle to effectively balance spatial accuracy, motion quality, and semantic alignment with text, sometimes ignoring text prompts to reach a goal. They are also typically restricted to isolated short motions.
- DART's Innovation: DART's
latent motion primitivespace, combined with itstext-conditioned latent diffusion model, provides a powerful foundation. It offers two flexible mechanisms for precise spatial control:- Latent Diffusion Noise Optimization: Directly optimizes the latent noise variables using gradient descent to meet spatial constraints, while the underlying diffusion model ensures motion realism and text alignment.
- Reinforcement Learning Policies: Models the control task as an
MDPand trainsRL policiesthat outputlatent noisesas actions, leveraging the pretrained generative model for high-quality motion.
- This dual approach allows DART to maintain high motion realism and strong text-semantic alignment even while satisfying precise spatial goals, demonstrating superior harmonization compared to baselines.
- Robustness and Controllability via Scheduled Training:
  - General Autoregressive Challenges: Autoregressive models can suffer from distribution drift during long rollouts, leading to unstable or unrealistic generations when the model encounters out-of-distribution history.
  - DART's Innovation: Employs scheduled training to progressively introduce rollout history (i.e., previously generated frames instead of ground truth frames) during training. This exposes the model to test-time distributions, improving the stability of long-sequence generation and enhancing controllability for unseen poses or novel text-motion combinations.

In essence, DART moves beyond generating static, short motions or slow, offline compositions to a dynamic, real-time paradigm, offering versatile control over long, semantically rich, and spatially constrained human movements by intelligently combining motion primitives, latent diffusion, and advanced control mechanisms.
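The idea of progressively introducing rollout history can be sketched as a probability schedule. This summary does not give the exact schedule, so the three-phase shape (ground truth only, a linear ramp, then rollout only) and the names `warmup_steps` and `ramp_steps` are assumptions made for illustration.

```python
import random


def rollout_probability(step: int, warmup_steps: int, ramp_steps: int) -> float:
    """Probability of feeding the model its own rollout as history.

    Assumed schedule: 0 during warm-up, a linear ramp afterwards, then 1.
    """
    if step < warmup_steps:
        return 0.0
    if step >= warmup_steps + ramp_steps:
        return 1.0
    return (step - warmup_steps) / ramp_steps


def choose_history(step, ground_truth_history, rollout_history,
                   warmup_steps, ramp_steps, rng):
    """Pick which history to condition on for this training step."""
    p = rollout_probability(step, warmup_steps, ramp_steps)
    # rng.random() lies in [0, 1), so p = 0 never selects the rollout
    # and p = 1 always selects it.
    return rollout_history if rng.random() < p else ground_truth_history


rng = random.Random(0)
for step in (0, 200, 1000):
    picked = choose_history(step, "ground truth", "rollout", 100, 200, rng)
    print(step, rollout_probability(step, 100, 200), picked)
```

Early in training the model sees only clean ground-truth history; later it is increasingly exposed to its own, possibly imperfect, outputs, which is the distribution it faces at test time.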
4. Methodology
4.1. Principles
The core idea behind DART is to achieve real-time, text-driven, and spatially controllable continuous human motion generation by breaking down the complex problem into manageable, learnable components. The theoretical basis and intuition are:
- Decomposition into Motion Primitives: Long, continuous human motions are inherently complex and high-dimensional, making them difficult to model directly. By representing them as sequences of short, overlapping motion primitives, the problem of modeling a global sequence is decomposed into modeling local, simpler segments. These primitives are easier for generative models to learn and also align better with atomic action semantics. The overlap ensures smooth transitions between generated primitives.
- Learning a Compact Latent Space: Raw motion data often contains noise and artifacts (e.g., glitches, jitters). Training generative models directly on this raw data can lead to inheriting these imperfections. A Variational Autoencoder (VAE) is used to compress these motion primitives into a lower-dimensional, compact, and clean latent space. This latent space is more robust to data noise, computationally efficient, and provides a continuous manifold of plausible motions, which is ideal for subsequent generation and control.
- Text-Conditioned Autoregressive Generation with Latent Diffusion: A latent diffusion model is chosen for its proven ability to generate high-quality and diverse samples. By training this diffusion model in the compact latent primitive space, conditioned on motion history and text inputs, DART learns to predict the next motion primitive autoregressively. This autoregressive nature, combined with the efficiency of operating in the latent space and using a small number of diffusion steps, enables real-time, online generation of motions of arbitrary length.
- Flexible Spatial Control through Latent Space Manipulation: The learned latent primitive space is not just for generation but also for precise control. Since this space represents plausible motions, manipulating variables within it (specifically, the latent diffusion noises) can guide motion towards spatial goals while maintaining realism. This control can be achieved in two ways:
  - Optimization: Directly optimizing the latent noises using gradient descent to minimize a criterion function that quantifies the distance to spatial goals and penalizes physical/scene constraint violations.
  - Reinforcement Learning: Framing the control as a Markov Decision Process where latent noises are actions, and training an RL policy to predict these actions based on the current state (history, goal, text, scene) to maximize rewards (e.g., goal-reaching, physical plausibility). This provides an efficient, learned controller for dynamic goals.

These principles combine to create a versatile and efficient framework that can understand natural language instructions, generate long and realistic human motions in real-time, and precisely control spatial aspects of those motions.
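The optimization route described above can be illustrated with a deliberately small toy. The sketch assumes a hypothetical linear function `generate` standing in for the deterministic noise-to-motion mapping (DDIM sampling plus decoding), and uses plain gradient descent with a hand-derived gradient rather than Adam or automatic differentiation. None of the names or constants come from the paper.

```python
def generate(noise):
    """Toy deterministic map from latent noise to a 2D end position.

    Stands in for DDIM sampling plus decoding; NOT the real model.
    """
    x = 0.5 * noise[0] + 0.1 * noise[1]
    y = -0.2 * noise[0] + 0.8 * noise[1]
    return (x, y)


def criterion(noise, goal):
    """Squared distance between the generated end position and the goal."""
    x, y = generate(noise)
    return (x - goal[0]) ** 2 + (y - goal[1]) ** 2


def criterion_grad(noise, goal):
    """Gradient of the criterion w.r.t. the noise (chain rule through generate)."""
    x, y = generate(noise)
    dx, dy = 2.0 * (x - goal[0]), 2.0 * (y - goal[1])
    return (0.5 * dx - 0.2 * dy, 0.1 * dx + 0.8 * dy)


def optimize_noise(noise, goal, lr=0.5, steps=200):
    """Plain gradient descent on the latent noise."""
    noise = list(noise)
    for _ in range(steps):
        grad = criterion_grad(noise, goal)
        noise = [n - lr * g for n, g in zip(noise, grad)]
    return noise


goal = (1.0, 2.0)
start = [0.0, 0.0]
best = optimize_noise(start, goal)
print(criterion(start, goal), criterion(best, goal))
```

The key design point carried over from the text is that only the noise is updated; the generator itself stays fixed, so in the real system the pretrained model keeps constraining the result to plausible motions.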
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Preliminaries
4.2.1.1. Problem Definition
The paper defines the task as text-conditioned online motion generation with spatial control.
Given:
- An $H$-frame seed motion $\mathbf{H}_{\text{seed}}$. This is the initial motion context.
- A sequence of $N$ text prompts $C = [c^1, \ldots, c^N]$. These prompts dictate the semantic actions.
- Spatial goals $g^i$. These define geometric constraints (e.g., target locations, poses).

The objective is to autoregressively generate continuous and realistic human motion sequences $\mathbf{M} = [\mathbf{H}_{\text{seed}}, \mathbf{X}^1, \ldots, \mathbf{X}^N]$. Here, each motion segment $\mathbf{X}^i$ must align with the corresponding text prompt $c^i$ and satisfy the spatial goal $g^i$. This task presents challenges in high-level action semantic control, precise spatial control, and smooth temporal transitions.
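The autoregressive objective above can be sketched as a simple rollout loop. The sketch assumes each new segment is conditioned on the last `history_len` frames generated so far and omits spatial goals for brevity. `generate_future` is a hypothetical placeholder for the learned model, and `toy_generator` exists only to make the example runnable.

```python
from typing import Callable, List

Frame = List[float]


def rollout(seed: List[Frame], prompts: List[str],
            generate_future: Callable[[List[Frame], str], List[Frame]],
            history_len: int) -> List[Frame]:
    """Autoregressively extend `seed` with one future segment per text prompt."""
    motion = list(seed)
    for prompt in prompts:
        history = motion[-history_len:]            # most recent frames as context
        future = generate_future(history, prompt)  # next segment for this prompt
        motion.extend(future)
    return motion


def toy_generator(history: List[Frame], prompt: str) -> List[Frame]:
    """Placeholder model: continues from the last frame with a per-prompt step."""
    step = {"walk": 1.0, "run": 2.0}.get(prompt, 0.0)
    last = history[-1][0]
    return [[last + step * (k + 1)] for k in range(3)]


seed = [[0.0], [0.0]]
motion = rollout(seed, ["walk", "run"], toy_generator, history_len=2)
print(motion)  # [[0.0], [0.0], [1.0], [2.0], [3.0], [5.0], [7.0], [9.0]]
```

Because each step needs only the recent history and the current prompt, new prompts can be supplied while the motion is being generated, which is what makes the formulation online.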
4.2.1.2. Autoregressive Motion Primitive Representation
DART models long-term human motions as a sequential composition of motion primitives with overlaps, suitable for efficient generative learning and online inference.
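- Motion Primitive Structure: Each motion primitive $\mathbf{P}^i = [\mathbf{H}^i, \mathbf{X}^i]$ is a short motion clip.
  - $\mathbf{H}^i$: Consists of $H$ frames of history motion.
  - $\mathbf{X}^i$: Consists of $F$ frames of future motion.
  - Overlap: The history motion $\mathbf{H}^i$ for the $i$-th primitive is explicitly defined as the last $H$ frames of the previous future motion segment $\mathbf{X}^{i-1}$, specifically $\mathbf{H}^i = \mathbf{X}^{i-1}_{F-H+1:F}$. This overlap ensures temporal continuity between successive primitives.
- Long Motion Construction: Arbitrarily long motions can be represented as the rollout of such overlapping motion primitives: $\mathbf{M} = [\mathbf{H}_{\text{seed}}, \mathbf{X}^1, \mathbf{X}^2, \ldots]$, where $\mathbf{H}_{\text{seed}}$ is the seed motion.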
Human Body Representation (per frame): Each motion frame is represented as a dimensional vector, using an overparameterized representation based on the
SMPL-Xparametric human body model. The components are:- : Global body translation.
- : 6D rotation representation (Zhou et al., 2019) of the global body orientation.
- : 6D representation of 21 joint rotations (for the
SMPL-Xbody joints). - : Locations of the 22
SMPL-Xjoints. - : Temporal difference features of the global translation with the previous frame.
- (should be based on typical 6D rotation difference): Temporal difference features of the global rotation.
- : Temporal difference features of the joint locations. The paper specifies (history length) and (future length) for their experiments.
- Canonicalization: Each motion primitive is canonicalized into a local coordinate frame centered at the pelvis of the first frame of the primitive. This human-centric local coordinate system helps normalize primitive features and facilitates model learning. The details are in Appendix A:
  - The origin is at the pelvis joint of the first frame.
  - The X-axis is the horizontal projection of the vector from the left hip to the right hip.
  - The Z-axis points in the inverse gravity direction (upwards).
  - The Y-axis is derived from the cross product of the Z-axis and X-axis.

  The canonicalization algorithm (Algorithm 3) is as follows:

  Algorithm 3: Motion primitive rotation canonicalization
  Input: right_hip $\in \mathbb{R}^3$, left_hip $\in \mathbb{R}^3$
  1: x_axis = right_hip - left_hip
  2: x_axis[2] = 0 // project to the xy plane
  3: normalize(x_axis)
  4: z_axis = [0, 0, 1] // inverse gravity direction
  5: y_axis = cross_product(z_axis, x_axis)
  6: normalize(y_axis)
  7: return [x_axis, y_axis, z_axis]

  The inverse transformation is applied to recover global motions.
Advantages of Primitive Representation:
- Tractability: Decomposes globally complex sequences into short, simple primitives, making the data distribution easier to learn for generative models.
- Efficiency: Autoregressive and simple nature is inherently suitable for fast online generation.
- Interpretability: Primitives convey more interpretable semantics than individual frames, aiding in learning a text-conditioned motion space.
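The rotation canonicalization in Algorithm 3 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names `canonical_axes` and `to_local` are hypothetical, and points are assumed to be plain 3-element lists with the z-coordinate pointing up.

```python
import math

def canonical_axes(right_hip, left_hip):
    """Return [x_axis, y_axis, z_axis] of the human-centric frame (Algorithm 3)."""
    x = [r - l for r, l in zip(right_hip, left_hip)]
    x[2] = 0.0  # project the hip vector onto the horizontal xy-plane
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("hip vector is vertical; x-axis is undefined")
    x = [v / norm for v in x]
    z = [0.0, 0.0, 1.0]  # inverse gravity direction
    # y = z cross x; already unit length because x is a horizontal unit vector
    y = [z[1] * x[2] - z[2] * x[1],
         z[2] * x[0] - z[0] * x[2],
         z[0] * x[1] - z[1] * x[0]]
    return [x, y, z]

def to_local(point, origin, axes):
    """Express a global point in the local frame (hypothetical helper)."""
    d = [p - o for p, o in zip(point, origin)]
    return [sum(a * b for a, b in zip(axis, d)) for axis in axes]

# Hips aligned with the global y-axis: the local x-axis becomes global y.
axes = canonical_axes([0.0, 1.0, 0.9], [0.0, -1.0, 0.9])
local_point = to_local([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], axes)
```

Because the three axes are orthonormal, the inverse transformation mentioned above amounts to summing the axes weighted by the local coordinates and adding the origin back.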
4.2.2. DART: A Diffusion-Based Autoregressive Motion Primitive Model
DART uses a latent diffusion model (Rombach et al., 2022; Chen et al., 2023b) for autoregressive motion generation, conditioned on text and motion history. It comprises a Variational Autoencoder (VAE) and a latent denoising diffusion model.
4.2.2.1. Learning the Latent Motion Primitive Space (VAE)
The paper introduces a motion primitive VAE to compress motion primitives into a compact latent space. This is crucial because raw motion data from AMASS (Mahmood et al., 2019) can contain artifacts (glitches, jitters), and training diffusion models directly on raw data often propagates these. The VAE compression mitigates these outlier artifacts. The resulting latent representation is compact, computationally efficient, and enhances control capabilities.
- Architecture: The motion primitive VAE consists of an encoder and a decoder, based on a transformer architecture (similar to MLD, Chen et al., 2023b), as shown in Figure 1.

  Figure 1: Architecture illustration of DART. The encoder network compresses the future frames $\mathbf{X} = [\mathbf{x}^1, \ldots, \mathbf{x}^F]$ into a latent variable, conditioned on the history frames $\mathbf{H} = [\mathbf{h}^1, \ldots, \mathbf{h}^H]$. The decoder network reconstructs the future frames conditioned on the history frames and the latent sample. The denoiser network predicts the clean latent sample $\hat{\mathbf{z}}_0$ conditioned on the noising step $t$, text prompt $c$, history frames, and noised latent sample $\mathbf{z}_t$. During the denoiser training, the encoder and decoder network weights remain fixed.

  - Encoder ($\mathcal{E}$): Takes history motion frames $\mathbf{H}$ and future motion frames $\mathbf{X}$ as input. It predicts the distribution parameters (mean and variance) for the latent space.
  - Latent Sample ($\mathbf{z}$): Obtained from the predicted distribution via the reparameterization trick (Kingma & Welling, 2014).
  - Decoder ($\mathcal{D}$): Predicts the future frames $\hat{\mathbf{X}}$ from zero tokens, conditioned on the latent sample $\mathbf{z}$ and the history frames $\mathbf{H}$.
- Training Objectives (from Appendix C): The motion primitive VAE is trained with a combination of losses:
  $ \mathcal{L}_{VAE} = \mathcal{L}_{rec} + w_{KL} \times \mathcal{L}_{KL} + w_{aux} \times \mathcal{L}_{aux} + w_{SMPL} \times \mathcal{L}_{SMPL} $
  where:
  - Reconstruction Loss ($\mathcal{L}_{rec}$): Minimizes the distance between the reconstructed future frames $\hat{\mathbf{X}}$ and the ground truth future frames $\mathbf{X}$.
    $ \mathcal{L}_{rec} = \mathcal{F}(\hat{\mathbf{X}}, \mathbf{X}) $
    Here, $\mathcal{F}(\cdot, \cdot)$ denotes a distance function, implemented as the smoothed L1 loss (Girshick, 2015).
  - Kullback-Leibler (KL) Regularization Term ($\mathcal{L}_{KL}$): Penalizes the difference between the predicted latent distribution and a standard Gaussian prior.
    $ \mathcal{L}_{KL} = KL(q(\mathbf{z}|\mathbf{H}, \mathbf{X}) \parallel \mathcal{N}(0, \mathbf{I})) $
    $KL$ is the Kullback-Leibler divergence and $q(\mathbf{z}|\mathbf{H}, \mathbf{X})$ is the distribution predicted by the encoder $\mathcal{E}$. A small weight $w_{KL}$ is used to maintain expressiveness.
  - Auxiliary Losses ($\mathcal{L}_{aux}$): Regularize the predicted temporal difference features to be consistent with the actual differences calculated from the predicted motion features.
    $ \mathcal{L}_{aux} = \mathcal{F}(\bar{\mathbf{dt}}, \hat{\mathbf{dt}}) + \mathcal{F}(\bar{\mathbf{dJ}}, \hat{\mathbf{dJ}}) + \mathcal{F}(\bar{\mathbf{dR}}, \hat{\mathbf{dR}}) $
    Here, barred quantities are differences computed from the predicted features, and hatted quantities are the predicted difference features themselves. For instance, the difference between the first two frames of the predicted root translation should be consistent with the predicted root translation difference feature of the first frame. This term is weighted by $w_{aux}$.
  - SMPL Losses ($\mathcal{L}_{SMPL}$): Ensure consistency with the SMPL body model.
    $ \mathcal{L}_{SMPL} = \mathcal{L}_{joint\_rec} + \mathcal{L}_{consistency} $
    where:
    - SMPL-based Joint Reconstruction Loss ($\mathcal{L}_{joint\_rec}$): Penalizes the discrepancy between joints regressed from the predicted body parameters and those regressed from the ground truth body parameters.
      $ \mathcal{L}_{joint\_rec} = \mathcal{F}(\mathcal{I}(\hat{\mathbf{t}}, \hat{\mathbf{R}}, \hat{\pmb{\theta}}), \mathcal{I}(\mathbf{t}, \mathbf{R}, \pmb{\theta})) $
      $\mathcal{I}$ is the SMPL body joint regressor, which predicts joint locations given the body parameters ($\mathbf{t}, \mathbf{R}, \pmb{\theta}$). The body shape parameters are assumed to be zero.
    - Joint Consistency Loss ($\mathcal{L}_{consistency}$): Regularizes the predicted joint locations and the predicted SMPL body parameters to consistently represent the same body joints.
      $ \mathcal{L}_{consistency} = \mathcal{F}(\hat{\mathbf{J}}, \mathcal{I}(\hat{\mathbf{t}}, \hat{\mathbf{R}}, \hat{\pmb{\theta}})) $
      These terms are weighted by $w_{SMPL}$.
- Training Details: The VAE is trained with the AdamW (Kingma & Ba, 2015) optimizer and a linearly annealed learning rate.
4.2.2.2. Latent Motion Primitive Diffusion Model
After learning the compact latent motion primitive space with the VAE, DART formulates text-conditioned autoregressive motion generation as approximating a probabilistic distribution over latent motion primitives, conditioned on motion history and text, using a latent diffusion model.
- Forward Diffusion Process: Given a latent representation $\mathbf{z}_0$ (obtained from the VAE encoder), the forward process iteratively adds Gaussian noise to produce increasingly noisy samples $\mathbf{z}_1, \ldots, \mathbf{z}_T$:
  $ q(\mathbf{z}_t | \mathbf{z}_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t} \mathbf{z}_{t-1}, \beta_t \mathbf{I}) $
  Here, $\beta_t$ are noise schedule hyperparameters, and $\mathcal{N}$ denotes a Gaussian distribution.
- Reverse Diffusion Process (Denoising): The denoiser model learns to reverse this process, predicting a clean latent variable from noise. The reverse process is modeled as:
  $ p_\theta(\mathbf{z}_{t-1} | \mathbf{z}_t, t, \mathbf{H}, c) = \mathcal{N}(\pmb{\mu}_t, \Sigma_t) $
  This process aims to generate motion primitives conditioned on the motion history $\mathbf{H}$ and the text label $c$. The variance $\Sigma_t$ is scheduled using hyperparameters, while the mean $\pmb{\mu}_t$ is modeled using the denoiser neural network $\mathcal{G}$. The denoiser is designed to predict the clean latent variable $\hat{\mathbf{z}}_0$:
  $ \hat{\mathbf{z}}_0 = \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c) $
  From this predicted $\hat{\mathbf{z}}_0$, the mean of the reverse process can be derived as:
  $ \pmb{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} \mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c) + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{z}_t $
  where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. During inference, a pure noise sample $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is progressively denoised by $p_\theta$ until a clean sample $\hat{\mathbf{z}}_0$ is obtained.
- Architecture (Figure 1):
  - The diffusion step $t$ is embedded using a small MLP (Multi-Layer Perceptron).
  - The text prompt $c$ is encoded using the CLIP (Radford et al., 2021) text encoder.
  - The text prompt is randomly masked out with a probability of 0.1 during training to enable classifier-free guidance during generation.
- Decoding Latent to Motion: The cleaned latent variable $\hat{\mathbf{z}}_0$ can be converted back to future frames using the frozen decoder of the VAE:
  $ \hat{\mathbf{X}} = \mathcal{D}(\mathbf{H}, \hat{\mathbf{z}}_0) $
- Training Objectives (from Appendix D.1): The latent denoiser is trained using the following losses:
  $ L_{denoiser} = L_{simple} + w_{rec} \times L_{rec} + w_{aux} \times L_{aux} $
  where:
  - Simple Objective ($L_{simple}$): The primary objective for diffusion models (Ho et al., 2020), which trains the denoiser to predict the clean latent variable $\mathbf{z}_0$.
    $ L_{simple} = \mathbb{E}_{(\mathbf{z}_0, c) \sim q(\mathbf{z}_0, c), t \sim [1, T], \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \mathcal{F}(\mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c), \mathbf{z}_0) $
    Here, $\mathcal{F}$ is again the smooth L1 loss.
  - Reconstruction Loss ($L_{rec}$): Applied on the decoded future frames $\hat{\mathbf{X}} = \mathcal{D}(\mathbf{H}, \hat{\mathbf{z}}_0)$ to ensure their validity.
  - Auxiliary Losses ($L_{aux}$): Also applied on the decoded future frames to ensure naturalness.

  The SMPL losses ($L_{SMPL}$) are not used in denoiser training to avoid slowing it down.
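The reverse-process mean given earlier in this subsection can be evaluated numerically once a noise schedule is fixed. The sketch below is illustrative only: the linear schedule and its endpoints are assumptions (the source does not state the schedule), `linear_betas` and `posterior_mean` are hypothetical names, and `z0_hat` stands in for the denoiser output $\mathcal{G}(\mathbf{z}_t, t, \mathbf{H}, c)$.

```python
import math

def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule; endpoints are assumptions."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def posterior_mean(z_t, z0_hat, t, betas):
    """Mean of p(z_{t-1} | z_t) from a clean-latent prediction; t is 1-indexed."""
    alphas = [1.0 - b for b in betas]
    alpha_bar_t = math.prod(alphas[:t])
    alpha_bar_prev = math.prod(alphas[:t - 1])  # empty product is 1 at t = 1
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    coef_z0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_zt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return [coef_z0 * a + coef_zt * b for a, b in zip(z0_hat, z_t)]

betas = linear_betas(10)  # 10 steps, matching the step count reported for DART
mean_at_last_step = posterior_mean([5.0, -3.0], [1.0, 2.0], 1, betas)
```

At $t = 1$ the coefficient on $\mathbf{z}_t$ vanishes and the coefficient on the prediction becomes 1, so the mean reduces to the predicted clean latent, which is a quick sanity check on the formula.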
- Diffusion Steps: Notably, DART uses only 10 diffusion steps for both training and inference. This small number is sufficient due to the simplicity of the motion primitive representation, enabling highly efficient online generation.
- Scheduled Training (from Appendix D.2): This technique improves the stability of long sequence generation and text prompt controllability. Instead of always using ground truth history motion during training, the model progressively learns to use rollout history (i.e., history extracted from its own previous predictions). This exposes the model to test-time distribution shifts, preventing out-of-distribution problems during long autoregressive rollouts.
  - The scheduled training has three stages:
    - Fully Supervised Stage: Only ground truth history is used.
    - Scheduled Learning Stage: Randomly replaces ground truth history with rollout history with a probability linearly increasing from 0 to 1.
    - Rollout Training Stage: Always uses rollout history.
  - The algorithm for calculating the rollout probability (Algorithm 4) is:

    Algorithm 4: Calculate rollout probability
    1: Input: current iteration number iter, number of train iterations in the first supervised stage $I_1$, number of train iterations in the second scheduled stage $I_2$
    2: Output: rollout probability $p$
    3: function ROLLOUT_PROBABILITY(iter, $I_1$, $I_2$)
    4: if iter $\le I_1$ then // no rollout in the first supervised stage
    5: $p \leftarrow 0$
    6: else if iter $> I_1 + I_2$ then // the third rollout stage always uses rollout
    7: $p \leftarrow 1$
    8: else // linearly scheduled rollout probability in the second scheduled stage
    9: $p \leftarrow (\text{iter} - I_1) / I_2$
    10: end if
    11: return $p$
    12: end function

    The scheduled training for the latent denoising model (Algorithm 5) is provided in the Appendix, showing how rollout_probability is used to decide whether to use ground truth or rollout history. The latent denoiser is trained with AdamW for 100K iterations per stage.
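The three-stage schedule of Algorithm 4 translates directly into code. The sketch below is a minimal version under stated assumptions: `choose_history` is a hypothetical helper showing how the probability might gate the history source, and the boundary comparisons are one reasonable reading (the linear ramp is continuous at both boundaries, so the choice does not change the values).

```python
import random

def rollout_probability(iteration, stage1_iters, stage2_iters):
    """Probability of using rollout history at a given training iteration."""
    if iteration <= stage1_iters:
        return 0.0  # fully supervised stage: ground truth history only
    if iteration > stage1_iters + stage2_iters:
        return 1.0  # rollout stage: always use rollout history
    # scheduled stage: probability grows linearly from 0 to 1
    return (iteration - stage1_iters) / stage2_iters

def choose_history(gt_history, rollout_history, iteration,
                   stage1_iters, stage2_iters, rng=random):
    """Hypothetical helper: pick the history source for one training sample."""
    p = rollout_probability(iteration, stage1_iters, stage2_iters)
    return rollout_history if rng.random() < p else gt_history

# With 100K iterations per stage, the midpoint of stage two gives p = 0.5.
p_mid = rollout_probability(150_000, 100_000, 100_000)
```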
- Autoregressive Rollout Generation (Algorithm 1): With the trained motion primitive decoder $\mathcal{D}$, latent denoiser $\mathcal{G}$, and a diffusion sampler $\mathcal{S}$ (e.g., DDPM, DDIM), motion sequences are generated autoregressively.

  Algorithm 1: Autoregressive rollout generation using latent motion primitive model
  Input: primitive decoder $\mathcal{D}$, latent variable denoiser $\mathcal{G}$, history motion seed $\mathbf{H}_{seed}$, text prompts $C = [c^1, \ldots, c^N]$, total diffusion steps $T$, classifier-free guidance scale $w$, diffusion sampler $\mathcal{S}$.
  Optional Input: latent noises $\mathbf{Z}_T = [\mathbf{z}_T^1, \ldots, \mathbf{z}_T^N]$
  Output: motion sequence $\mathbf{M}$
  1: $\mathbf{H} \leftarrow \mathbf{H}_{seed}$
  2: $\mathbf{M} \leftarrow \mathbf{H}_{seed}$
  3: for $i \leftarrow 1$ to $N$ do // number of rollouts
  4: sample noise $\mathbf{z}_T^i$ from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ if not inputted
  5: $\hat{\mathbf{z}}_0^i \leftarrow \mathcal{S}(\mathcal{G}, \mathbf{z}_T^i, T, \mathbf{H}, c^i, w)$ // diffusion sample loop with classifier-free guidance
  6: $\mathbf{X} \leftarrow \mathcal{D}(\mathbf{H}, \hat{\mathbf{z}}_0^i)$
  7: $\mathbf{M} \leftarrow \text{CONCAT}(\mathbf{M}, \mathbf{X})$ // concatenate future frames to generated sequence
  8: $\mathbf{H} \leftarrow \text{CANONICALIZE}(\mathbf{X}_{F-H+1:F})$ // update the rollout history with the last $H$ generated frames
  9: end for
  10: return $\mathbf{M}$

  Explanation of Algorithm 1:
  - Initialization: The history $\mathbf{H}$ is set to the seed motion $\mathbf{H}_{seed}$. The motion sequence $\mathbf{M}$ starts with $\mathbf{H}_{seed}$.
  - Loop for each text prompt: For each of the $N$ text prompts:
    - Latent Noise Sampling: A noise vector $\mathbf{z}_T^i$ is sampled from a standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. This is the starting point of the diffusion process for the current primitive. If latent noises $\mathbf{Z}_T$ are provided as input (for optimization), the $i$-th noise is used.
    - Denoising (Classifier-Free Guidance): The diffusion sampler $\mathcal{S}$ (e.g., DDIM) uses the denoiser $\mathcal{G}$ to iteratively remove noise from $\mathbf{z}_T^i$, conditioned on the current history $\mathbf{H}$, text prompt $c^i$, and diffusion steps $T$. Classifier-free guidance (with scale $w$) is applied during this denoising to steer the generation towards the text semantics. The output is the clean latent variable $\hat{\mathbf{z}}_0^i$.
    - Decoding Motion Primitive: The motion primitive decoder $\mathcal{D}$ transforms $\hat{\mathbf{z}}_0^i$ (and $\mathbf{H}$) into the future motion frames $\mathbf{X}$.
    - Concatenation: The newly generated future frames $\mathbf{X}$ are appended to the overall motion sequence $\mathbf{M}$.
    - History Update: The history $\mathbf{H}$ for the next iteration is updated by taking the last $H$ frames of the just-generated $\mathbf{X}$ and canonicalizing them. This ensures the autoregressive flow.
  - Return: The complete generated motion sequence $\mathbf{M}$ is returned.
- Real-time Performance: DART generates over 300 frames per second on a single RTX 4090 GPU, enabling real-time applications.
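The control flow of Algorithm 1 can be sketched with placeholder components. Everything model-related here is a toy stand-in and an assumption: `toy_sampler`, `toy_decoder`, and `identity` replace the diffusion sampler, the VAE decoder, and canonicalization, frames are scalars rather than full pose vectors, and the toy history and future lengths are arbitrary. Only the loop structure (sample, denoise, decode, append, update history) mirrors the algorithm.

```python
import random

def rollout(decoder, sampler, canonicalize, seed_history, prompts,
            latent_dim, noises=None, rng=None):
    """Autoregressive rollout: one motion primitive per text prompt."""
    rng = rng or random.Random(0)
    history = list(seed_history)
    motion = list(seed_history)
    history_len = len(seed_history)
    for i, prompt in enumerate(prompts):
        if noises is not None:
            z_T = noises[i]  # externally supplied noise (e.g. for optimization)
        else:
            z_T = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        z_0 = sampler(z_T, history, prompt)   # stands in for S(G, z_T, T, H, c, w)
        future = decoder(history, z_0)        # stands in for D(H, z_0)
        motion.extend(future)
        history = canonicalize(future[-history_len:])  # last H generated frames
    return motion

# Toy stand-ins (assumptions, not the DART networks).
def toy_sampler(z_T, history, prompt):
    return z_T

def toy_decoder(history, z_0):
    step = 1.0 + 0.1 * z_0[0]
    return [history[-1] + (k + 1) * step for k in range(4)]

def identity(frames):
    return list(frames)

motion = rollout(toy_decoder, toy_sampler, identity, [0.0, 1.0],
                 ["walk", "run", "sit"], latent_dim=1,
                 noises=[[0.0], [0.0], [0.0]])
```

Passing the noises explicitly makes the rollout a deterministic function of them, provided the sampler itself is deterministic; this is the property the latent-noise control methods rely on.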
4.2.3. Spatially Controllable Motion Synthesis via DART
DART provides a framework for integrating precise spatial control, addressing the limitation of relying solely on text for complex spatial goals.
- Motion Control Task Formulation: The task is defined as generating a motion sequence $\mathbf{M}$ that minimizes its distance to a given spatial goal $g$, under a criterion function $\mathcal{F}$ and regularization $cons(\cdot)$ from scene and physical constraints:
  $ \mathbf{M}^* = \mathrm{argmin}_\mathbf{M} \mathcal{F}(\Pi(\mathbf{M}), g) + cons(\mathbf{M}) $
  where:
  - $g$: Task-dependent spatial goal (e.g., a keyframe body pose, target location).
  - $\Pi(\cdot)$: A projection function that extracts goal-relevant features from $\mathbf{M}$ and maps them into the task-aligned observation space.
  - $cons(\cdot)$: Physical and scene constraints (e.g., collision avoidance, foot-floor contact).
- Latent Space Control Formulation: Directly solving for $\mathbf{M}$ in raw motion space often yields unrealistic motions. DART reformulates the problem in its latent motion space. Using the deterministic DDIM sampler (Song et al., 2021a), the ROLLOUT function (Algorithm 1) becomes a deterministic mapping from a list of motion primitive latent noises $\mathbf{Z}_T$ to a motion sequence $\mathbf{M}$, conditioned on $\mathbf{H}_{seed}$ and $C$:
  $ \mathbf{M} = \mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C) $
  The minimization objective is then converted to optimizing the latent noises $\mathbf{Z}_T$:
  $ \mathbf{Z}_T^* = \operatorname*{argmin}_{\mathbf{Z}_T} \mathcal{F}(\Pi(\mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C)), g) + cons(\mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C)) $
  The paper notes that they do not use DDIM for skipping diffusion steps, as this can cause artifacts.
4.2.3.1. Motion Control via Latent Diffusion Noise Optimization
One solution is to directly optimize the latent noises $\mathbf{Z}_T$ using gradient descent methods (Kingma & Ba, 2015).
- Optimization Algorithm (Algorithm 2):

  Input: latent noises $\mathbf{Z}_T = [\mathbf{z}_T^1, \ldots, \mathbf{z}_T^N]$, optimizer $\mathcal{O}$, learning rate $\eta$, and goal $g$. (For brevity, we do not reiterate the inputs of the rollout function defined in Alg. 1 and criterion terms in Eq. 2)
  Output: a motion sequence $\mathbf{M}$.
  1: for $i \leftarrow 1$ to optimization steps do
  2: $\mathbf{M} \leftarrow \mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C)$
  3: $\nabla \leftarrow \nabla_{\mathbf{Z}_T} (\mathcal{F}(\Pi(\mathbf{M}), g) + cons(\mathbf{M}))$
  4: $\mathbf{Z}_T \leftarrow \mathcal{O}(\mathbf{Z}_T, \nabla / \|\nabla\|, \eta)$ // update using normalized gradient
  5: end for
  6: return $\mathbf{M} \leftarrow \mathrm{ROLLOUT}(\mathbf{Z}_T, \mathbf{H}_{seed}, C)$

  Explanation of Algorithm 2:
  - Input: Initial latent noises $\mathbf{Z}_T$, an optimizer $\mathcal{O}$, learning rate $\eta$, and spatial goal $g$.
  - Loop for Optimization Steps:
    - Generate Motion: The ROLLOUT function (Algorithm 1) is called with the current $\mathbf{Z}_T$, history seed $\mathbf{H}_{seed}$, and text prompts $C$ to generate a motion sequence $\mathbf{M}$.
    - Calculate Loss: The total loss is calculated using the criterion function $\mathcal{F}$ (distance to goal) and the constraint function $cons$. The gradient of this loss with respect to $\mathbf{Z}_T$ is computed.
    - Update Latent Noises: The optimizer $\mathcal{O}$ updates $\mathbf{Z}_T$ using the normalized gradient and learning rate $\eta$.
  - Return: After optimization, the final motion sequence $\mathbf{M}$ is generated using the optimized $\mathbf{Z}_T$.
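The loop in Algorithm 2 can be illustrated on a toy problem. This sketch makes several substitutions that should be read as assumptions: central finite differences replace automatic differentiation through the diffusion rollout, a plain normalized-gradient step replaces the optimizer $\mathcal{O}$, and `toy_rollout` (a cumulative sum of scalar noises) replaces the real ROLLOUT function. It only demonstrates the normalized-gradient update pattern.

```python
import math

def numeric_grad(loss_fn, z, eps=1e-5):
    """Central finite-difference gradient (stand-in for backpropagation)."""
    grad = []
    for i in range(len(z)):
        hi = z[:i] + [z[i] + eps] + z[i + 1:]
        lo = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((loss_fn(hi) - loss_fn(lo)) / (2 * eps))
    return grad

def optimize_noise(rollout_fn, criterion, z_init, steps=200, lr=0.05):
    """Normalized-gradient descent on the latent noises."""
    z = list(z_init)
    loss_fn = lambda noises: criterion(rollout_fn(noises))
    for _ in range(steps):
        grad = numeric_grad(loss_fn, z)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm < 1e-12:
            break  # already at a stationary point
        z = [zi - lr * gi / norm for zi, gi in zip(z, grad)]
    return z, rollout_fn(z)

def toy_rollout(noises):
    """Toy deterministic 'motion': running sum of the noises."""
    positions, total = [], 0.0
    for z in noises:
        total += z
        positions.append(total)
    return positions

goal = 2.0
z_opt, motion = optimize_noise(toy_rollout,
                               lambda m: (m[-1] - goal) ** 2,
                               [0.0, 0.0, 0.0])
```

Because the step length is fixed by the learning rate rather than by the gradient magnitude, the final frame ends up within roughly one step length of the goal; a decaying learning rate or an adaptive optimizer would tighten this.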
- Example Control Scenarios:
  - Text-conditioned motion in-between: The goal is to generate a transition between a history motion and a goal keyframe a given number of frames away, conditioned on a text prompt $c$. The optimization objective is the distance between the generated frame at the keyframe's time step and the goal keyframe.
  - Human-scene interaction generation: Incorporates physical and scene constraints. Given 3D scenes, text prompts, and goal joint locations (e.g., pelvis for sitting), motions are generated that perform the desired interaction and achieve goal positions while adhering to scene constraints. 3D scenes are represented as signed distance fields (SDF) to compute body-scene distances for foot-floor contact and collision avoidance.
    - Scene Collision Constraint ($cons_{coll}$, from Appendix F): Penalizes joints that collide with the scene (i.e., have negative SDF values) in each frame.
      $ cons_{coll}(\mathbf{J}) = - \sum_{k=1}^{22} (\Psi(\mathbf{J}^k) - \tau^k)_- $
      Here, $\mathbf{J}^k$ is the $k$-th joint's location in scene coordinates, $\Psi$ is the signed distance function (returning the distance to the scene), $\tau^k$ are joint-dependent contact threshold values (based on the joint-skin distance in the rest pose), and $(\cdot)_-$ clips positive values to zero (so only negative distances, indicating penetration, contribute to the penalty).
    - Foot-Floating Constraint ($cons_{cont}$, from Appendix F): Encourages the feet to be in contact with the scene.
      $ cons_{cont}(\mathbf{J}) = (\min_{k \in Foot} \Psi(\mathbf{J}^k) - \tau)_+ $
      Here, $Foot$ is the set of foot joint indices, $\tau$ is the foot-floor contact threshold distance, and $(\cdot)_+$ clips negative values to zero (so only positive distances, indicating floating, contribute to the penalty).
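The two scene constraints can be written as small functions over a signed distance callable. This is a sketch under assumptions: `flat_floor_sdf` models the simplest possible scene (a floor at height zero), and the joint positions and thresholds in the usage lines are made-up illustrative numbers, not values from the paper.

```python
def collision_penalty(joints, sdf, thresholds):
    """cons_coll: negated sum of the negative parts of (sdf(J^k) - tau^k)."""
    return -sum(min(sdf(j) - tau, 0.0) for j, tau in zip(joints, thresholds))

def contact_penalty(joints, sdf, foot_indices, tau):
    """cons_cont: positive part of (lowest foot SDF value - tau)."""
    lowest = min(sdf(joints[k]) for k in foot_indices)
    return max(lowest - tau, 0.0)

def flat_floor_sdf(point):
    """Hypothetical scene: signed distance to a floor at z = 0."""
    return point[2]

# Illustrative joints: pelvis, a penetrating foot, a slightly raised foot.
joints = [[0.0, 0.0, 0.9], [0.1, 0.0, -0.05], [-0.1, 0.0, 0.10]]
thresholds = [0.0, 0.02, 0.02]
coll = collision_penalty(joints, flat_floor_sdf, thresholds)
cont = contact_penalty(joints, flat_floor_sdf, foot_indices=[1, 2], tau=0.03)
```

In this example only the penetrating foot contributes to the collision term, and the contact term is zero because the lowest foot is already below the contact threshold.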
4.2.3.2. Motion Control via Reinforcement Learning
To address the computational expense of optimization, DART leverages its autoregressive primitive-based motion representation for reinforcement learning (RL). Latent motion control is modeled as a Markov Decision Process (MDP) with a latent action space.
- MDP Formulation: Digital humans are agents interacting with an environment to maximize the expected discounted return.
  - At each time step $i$, the agent observes the state $\mathbf{s}^i$, samples an action $\mathbf{a}^i$ from the policy $\pi(\mathbf{a}^i | \mathbf{s}^i)$, transitions to the next state $\mathbf{s}^{i+1}$, and receives a reward $r^i$.
  - Latent Action Space: The latent noises $\mathbf{z}_T$ serve as the latent action space.
- Policy Architecture (Figure 2): An actor-critic (Sutton & Barto, 1998) architecture trained with the PPO (Schulman et al., 2017) algorithm.

  Figure 2: Architecture of the reinforcement learning-based control policy. The pretrained DART diffusion denoiser and decoder models transform the latent actions into motion frames. The last predicted frames are canonicalized and provided to the policy model as the next step history condition.

  - State Observation: Includes the history motion observation, goal observation, scene observation, and the CLIP embedding of the text prompt.
  - Policy Input: The policy model takes the state observation to predict the latent noise $\mathbf{z}_T$ as the action.
  - Action Execution: The predicted $\mathbf{z}_T$ is mapped to future motion frames $\mathbf{X}$ using the frozen latent denoiser and motion primitive decoder.
  - History Update: The new history motion is extracted from the last $H$ predicted frames of $\mathbf{X}$ and fed to the policy network for the next step.
-
Reward Maximization: The minimization problem in Equation 3 is reformulated as reward maximization for policy training.
- Example Task: Text-Conditioned Goal-Reaching:
  - Goal: Control the human to reach a 2D goal location $g$ using an action specified by text $c$.
  - Goal Observation: Transformed into a local observation including the distance to the pelvis (at the last history frame) and the local direction (within the human-centric frame).
  - Scene Observation: For a flat scene, it is the relative floor height to the body pelvis (at the first history frame).
  - Rewards (from Appendix G): Policies are trained with multiple rewards:
    - Distance Reward ($r_{dist}$): Encourages the human pelvis to get closer to the goal location.
      $ r_{dist} = d^{i-1} - d^i $
      where $d^i$ is the 2D distance between the human pelvis and the goal location at step $i$.
    - Success Reward ($r_{succ}$): Provides a sparse but strong reward when the human reaches the goal.
      $ r_{succ} = \begin{cases} 1 & \text{if } d^i < 0.3 \\ 0 & \text{otherwise} \end{cases} $
      where $0.3$ is the success threshold.
    - Moving Orientation Reward ($r_{ori}$): Encourages the human's moving direction to align with the goal direction.
      $ r_{ori} = \frac{\langle \mathbf{p}^i - \mathbf{p}^{i-1}, g - \mathbf{p}^{i-1} \rangle + 1}{2} $
      where $\mathbf{p}^i$ is the human pelvis location at step $i$, and $g$ is the goal location. The dot product measures alignment; the result lies in $[0, 1]$ when both vectors are normalized to unit length.
    - Skate Reward ($r_{skate}$): Penalizes foot displacements when in contact with the floor.
      $ r_{skate} = - disp \cdot (2 - 2^{h/0.03}) $
      Here, $disp$ is the foot displacement in two consecutive frames, $h$ is the higher foot height in consecutive frames, and $0.03$ is the threshold for contact.
    - Floor Contact Reward ($r_{floor}$): Penalizes the case where the distance of the lower foot to the floor is above a threshold.
      $ r_{floor} = - (|lf| - 0.03)_+ $
      Here, $lf$ denotes the height of the lower foot, $|\cdot|$ is the absolute value, and $(\cdot)_+$ is the clipping operator that sets negative values to zero.
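The reward terms above are simple enough to write down directly. The sketch below follows the formulas as given, with two labeled assumptions: `orientation_reward` normalizes both vectors to unit length so that the result stays in $[0, 1]$ (and returns 0.5 when either vector has zero length), and all function names are hypothetical. The skate term is implemented without any extra clipping, since the formula as stated has none.

```python
import math

def distance_reward(prev_dist, dist):
    """r_dist: progress toward the goal between two steps."""
    return prev_dist - dist

def success_reward(dist, threshold=0.3):
    """r_succ: sparse reward once the pelvis is within the success threshold."""
    return 1.0 if dist < threshold else 0.0

def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0.0 else None

def orientation_reward(p_prev, p_cur, goal):
    """r_ori with unit-normalized vectors (normalization is an assumption)."""
    move = _unit([a - b for a, b in zip(p_cur, p_prev)])
    to_goal = _unit([a - b for a, b in zip(goal, p_prev)])
    if move is None or to_goal is None:
        return 0.5  # assumed neutral value when a direction is undefined
    dot = sum(a * b for a, b in zip(move, to_goal))
    return (dot + 1.0) / 2.0

def skate_reward(disp, h, contact=0.03):
    """r_skate: foot displacement scaled by a height-dependent factor."""
    return -disp * (2.0 - 2.0 ** (h / contact))

def floor_reward(lower_foot_height, contact=0.03):
    """r_floor: penalty when the lower foot is farther than the threshold."""
    return -max(abs(lower_foot_height) - contact, 0.0)

aligned = orientation_reward([0.0, 0.0], [1.0, 0.0], [5.0, 0.0])
```

How the individual terms are weighted and summed into the total reward is not specified in this section, so the sketch leaves them separate.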
-
Performance: The text-conditioned goal-reaching policy achieves 240 frames per second.
4.2.4. Physics-Based Motion Tracking (from Appendix H)
DART, being a kinematic-based approach, can produce physically inaccurate motions (e.g., skating, floating, penetration). To address this, DART can be integrated with physically simulated motion tracking methods. The paper demonstrates this with PHC (Luo et al., 2023). Raw DART output (e.g., crawling with hand-floor penetration) can be refined by physics-based tracking to produce more plausible results, improving joint-floor contact and eliminating penetration artifacts (as shown in Figure 5). The real-time capabilities of both DART and PHC allow for on-the-fly correction.

Figure 5: We demonstrate an example of integrating DART with the physics-based motion tracking method PHC (Luo et al., 2023) to achieve more physically plausible motions. The left image illustrates a crawling sequence generated by DART, exhibiting artifacts such as hand-floor penetration. The right image displays the physics-based motion tracking outcome applied to the raw generated sequence, which enhances joint-floor contact and resolves the hand-floor penetration issue.
5. Experimental Setup
5.1. Datasets
DART models are trained on two motion-text datasets, both derived from the AMASS (Mahmood et al., 2019) motion capture dataset.
- BABEL (Bodies, Action and Behavior with English Labels) (Punnakkal et al., 2021):
  - Characteristics: Contains motion capture sequences (sourced from AMASS) with frame-aligned text labels that annotate fine-grained action semantics. This allows models to learn precise human action controls and natural transitions.
  - Scale: Large-scale, providing diverse actions.
  - Sampling: Motion primitives are randomly sampled. Text labels are randomly sampled from all action segments overlapping with the primitive.
  - Action Imbalance: To alleviate imbalance, importance sampling is used based on the provided action labels, giving each action category roughly equal sampling chances.
  - Framerate: 30 frames per second (same as prior works).
  - Compatibility: Human gender fixed to male, body shape parameter set to zero.
  - Usage: Primary dataset for the text-conditioned temporal motion composition experiments.
- HumanML3D (HML3D) (Guo et al., 2022):
  - Characteristics: Contains short motions with coarse sequence-level sentence descriptions.
  - Scale: Also uses AMASS motion sources.
  - Subset Usage: Only the subset for which SMPL body motion sequences are available is used, as the HumanAct12 subset and the mirrored sequences only provide joint locations (not the SMPL bodies required by DART's representation). The original HML3D joint rotations, calculated by naive inverse kinematics, are not directly usable for animation.
  - Sampling: Primitives are randomly sampled with uniform probability. Text labels are randomly chosen from the multiple sentence captions of the overlapping sequence.
  - Framerate: 20 frames per second.
  - Compatibility: Human gender fixed to male, body shape parameter set to zero.
  - Usage: Used for the optimization-based motion in-between experiments, for fair comparison with baselines trained on it.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate different aspects of motion generation:
5.2.1. Motion Realism & Quality
- Fréchet Inception Distance (FID) (Heusel et al., 2017 for images, adapted for motion):
  - Conceptual Definition: Measures the similarity between the distribution of generated motions and the distribution of real motions from the dataset. It calculates the Fréchet distance between two multivariate Gaussian distributions (one for real data features, one for generated data features) in a feature space (e.g., embeddings from a pre-trained motion encoder). Lower FID indicates higher similarity and typically better realism.
  - Mathematical Formula: $ \text{FID} = ||\mu_x - \mu_g||^2 + \text{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
  - Symbol Explanation:
    - $\mu_x$: Mean of feature vectors for real motions.
    - $\mu_g$: Mean of feature vectors for generated motions.
    - $\Sigma_x$: Covariance matrix of feature vectors for real motions.
    - $\Sigma_g$: Covariance matrix of feature vectors for generated motions.
    - $\text{Tr}(\cdot)$: Trace of a matrix.
    - $||\cdot||^2$: Squared L2 norm.
- Jerk:
  - Conceptual Definition: Jerk is the rate of change of acceleration, or the third derivative of position with respect to time. It quantifies the smoothness of motion. High jerk values indicate abrupt, unnatural, or "jerky" movements.
  - Mathematical Formula: For a discrete sequence of positions $\mathbf{p}_t$, the velocity is $\mathbf{v}_t = \mathbf{p}_{t+1} - \mathbf{p}_t$, the acceleration is $\mathbf{a}_t = \mathbf{v}_{t+1} - \mathbf{v}_t$, and the jerk is $\mathbf{j}_t = \mathbf{a}_{t+1} - \mathbf{a}_t$. The magnitude of jerk is $||\mathbf{j}_t||_2$.
  - Symbol Explanation:
    - $\mathbf{p}_t$: Position at time step $t$.
    - $\mathbf{v}_t$: Velocity at time step $t$.
    - $\mathbf{a}_t$: Acceleration at time step $t$.
    - $\mathbf{j}_t$: Jerk at time step $t$.
    - $||\cdot||_2$: L2 norm (magnitude of the vector).
  - Metrics used in paper:
    - Peak Jerk (PJ): The maximum jerk value observed over a motion segment, indicating the most abrupt movement. Lower is better.
    - Area Under the Jerk (AUJ): The integral (sum in discrete case) of jerk magnitudes over a motion segment, giving an overall measure of motion jerkiness. Lower is better.
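The jerk-based metrics can be illustrated with repeated finite differences over a single trajectory. This is a simplified sketch, not the evaluation protocol of the paper: it assumes unit time steps, tracks one point rather than all body joints, and the function names are hypothetical.

```python
import math

def finite_diff(seq):
    """Frame-to-frame differences of a list of equally sized vectors."""
    return [[b - a for a, b in zip(p, q)] for p, q in zip(seq, seq[1:])]

def jerk_magnitudes(positions):
    """L2 norm of the third finite difference of the positions."""
    velocity = finite_diff(positions)
    acceleration = finite_diff(velocity)
    jerk = finite_diff(acceleration)
    return [math.sqrt(sum(c * c for c in j)) for j in jerk]

def peak_jerk(positions):
    """PJ: largest jerk magnitude in the segment (0.0 if too short)."""
    return max(jerk_magnitudes(positions), default=0.0)

def area_under_jerk(positions):
    """AUJ: sum of jerk magnitudes over the segment."""
    return sum(jerk_magnitudes(positions))

# A cubic trajectory x(t) = t^3 has a constant third difference of 6.
cubic = [[float(t ** 3), 0.0, 0.0] for t in range(5)]
pj = peak_jerk(cubic)
auj = area_under_jerk(cubic)
```

A trajectory with constant acceleration, such as $x(t) = t^2$, yields zero for both metrics, which matches the intuition that jerk only captures changes in acceleration.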
- Skate Metric (Ling et al., 2020; Zhang et al., 2018):
  - Conceptual Definition: Quantifies unwanted foot sliding or skating when a foot is supposed to be in contact with the ground. It measures foot displacement when the foot's height is below a contact threshold.
  - Mathematical Formula: $ s = disp \cdot (2 - 2^{h/0.03}) $
  - Symbol Explanation:
    - $s$: Scaled foot skating.
    - $disp$: Foot displacement in two consecutive frames.
    - $h$: The higher foot height in consecutive frames.
    - $0.03$: A threshold value for considering the foot in contact with the floor. Lower values of $s$ indicate less skating.
5.2.2. Motion-Text Semantic Alignment
- R-prec (R-Precision):
  - Conceptual Definition: Based on a motion-text retrieval task, it measures how often the matching text description is ranked among the top retrieved candidates for a given motion. A higher R-prec indicates better alignment between the generated motion and its corresponding text prompt.
- MM-Dist (Multimodal Distance):
  - Conceptual Definition: Measures the distance between motion features and text embeddings, quantifying the semantic alignment. Typically, this involves encoding motions and texts into a shared embedding space and then calculating a distance between them. Lower values indicate better alignment.
5.2.3. Generation Efficiency & Controllability
- DIV (Diversity):
  - Conceptual Definition: Measures the variety or diversity of generated motions for the same input condition (e.g., text prompt). A higher DIV indicates the model can produce a wider range of plausible motions, which is desirable for creative applications.
- Speed (frame/s): Frames generated per second, indicating real-time capability. Higher is better.
- Latency (s): Time to get the first generated frames, crucial for online interaction. Lower is better.
- Memory (MiB): Memory usage during generation. Lower is better.
- History Error (cm) & Goal Error (cm): For motion in-betweening, these measure the L2 norm error (Euclidean distance) between the generated motion's start/end frame and the given history/goal keyframes. Lower is better.
- Reach Time (s) & Success Rate: For goal-reaching tasks, reach time measures how long it takes to reach the goal, and success rate is the percentage of successful attempts. Lower reach time and higher success rate are better.
- Floor Distance (cm): For RL control, measures the average distance of the feet from the floor, indicating floating artifacts. Lower is better.
5.2.4. Human Preference Studies
- Realism (%): Participants choose which motion looks more realistic between two options. A higher percentage for DART indicates better realism.
- Semantic Alignment (%): Participants choose which motion better aligns with the given action text stream. A higher percentage for DART indicates better semantic alignment.
5.3. Baselines
The paper compares DART against several representative baseline models for different tasks:
5.3.1. Text-Conditioned Temporal Motion Composition
- TEACH (Temporal Action Composition for 3D Humans) (Athanasiou et al., 2022): An early method for composing 3D human actions temporally.
- DoubleTake (Shafir et al., 2024): A more recent approach that composes long motions from a diffusion prior by blending the transition ("handshake") regions between consecutive segments. The paper mentions adjusting its handshake size and blending length.
- T2M-GPT* (Text-to-Motion GPT) (Zhang et al., 2023a): A history-conditioned modification of the original T2M-GPT. The original T2M-GPT uses discrete representations and is designed for generating motions from text. The * indicates a modification to include history conditioning for sequential generation, making it a more relevant baseline for temporal composition.
- FlowMDM (Seamless Human Motion Composition with Blended Positional Encodings) (Barquero et al., 2024): The state-of-the-art offline temporal motion composition method. It can generate complex, continuous motions by composing desired actions with precise adherence to specified durations. However, it is an offline method that requires knowledge of the full timeline and is slow. This is a crucial baseline for performance and efficiency comparison.
5.3.2. Latent Diffusion Noise Optimization-Based Control (Motion In-betweening)
- DNO (Optimizing Diffusion Noise Can Serve as Universal Motion Priors) (Karunratanakul et al., 2024b): A method that uses diffusion noises as a latent space for editing and control, similar to DART's optimization approach. However, its diffusion model is trained with full motion sequences in explicit motion representations.
- OmniControl (Control Any Joint at Any Time for Human Motion Generation) (Xie et al., 2024): An approach that aims to provide flexible spatial control over human motion generation, potentially through inverse kinematics or optimization.
5.3.3. Reinforcement Learning-Based Control (Goal-Reaching)
- GAMMA (The Wanderings of Odysseus in 3D Scenes) (Zhang & Tang, 2022): A baseline that also trains goal-reaching policies with a learned motion action space. Unlike DART, GAMMA lacks text conditioning and is limited to generating walking motions, making it a good comparison for the RL framework without text control.
6. Results & Analysis
6.1. Core Results Analysis
The experiments extensively validate DART's effectiveness across text-conditioned temporal motion composition, latent diffusion noise optimization-based spatial control (motion in-betweening, human-scene interaction), and reinforcement learning-based control (goal-reaching).
6.1.1. Text-Conditioned Temporal Motion Composition
This task evaluates DART's ability to generate realistic, continuous motion sequences that align with a stream of text prompts and specified durations. The evaluations were conducted on the BABEL validation set.
The following are the results from Table 1 of the original paper:
| Method | Segment FID↓ | Segment R-prec↑ | Segment DIV→ | Segment MM-Dist↓ | Transition FID↓ | Transition DIV→ | Transition PJ→ | Transition AUJ↓ | Speed (frame/s)↑ | Latency (s)↓ | Mem. (MiB)↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | 0.00±0.00 | 0.72±0.00 | 8.42±0.15 | 3.36±0.00 | 0.00±0.00 | 6.20±0.06 | 0.02±0.00 | 0.00±0.00 | - | - | - |
| TEACH | 17.58±0.04 | 0.66±0.00 | 10.02±0.06 | 5.86±0.00 | 3.89±0.05 | 5.44±0.07 | 1.39±0.01 | 5.86±0.02 | 3880±144 | 0.05±0.00 | 2251 |
| DoubleTake | 7.92±0.13 | 0.60±0.01 | 8.29±0.16 | 5.59±0.01 | 3.56±0.05 | 6.08±0.06 | 0.32±0.00 | 1.23±0.01 | 85±1 | 0.76±0.11 | 1474 |
| T2M-GPT* | 7.71±0.55 | 0.49±0.01 | 8.89±0.21 | 6.69±0.08 | 2.53±0.04 | 6.61±0.02 | 1.44±0.03 | 4.10±0.09 | 885±12 | 0.23±0.00 | 2172 |
| FlowMDM | 5.81±0.10 | 0.67±0.00 | 8.90±0.06 | 5.08±0.02 | 2.39±0.01 | 6.63±0.08 | 0.04±0.00 | 0.11±0.00 | 31±0 | 161.29±0.24 | 11892 |
| Ours | 3.79±0.06 | 0.62±0.01 | 8.05±0.10 | 5.27±0.01 | 1.86±0.05 | 6.70±0.03 | 0.06±0.00 | 0.21±0.00 | 334±2 | 0.02±0.00 | 2394 |

Analysis of Table 1:
- Motion Realism (FID): DART achieves the best FID for both segment (3.79) and transition (1.86) evaluations. This indicates that DART's generated motions are the most realistic and closest to the dataset's distribution, outperforming all baselines, including FlowMDM.
- Motion Smoothness (PJ, AUJ): DART shows the second-best jerk metrics (PJ 0.06, AUJ 0.21), with FlowMDM being slightly better (PJ 0.04, AUJ 0.11). This suggests DART produces smooth action transitions, close to the best offline method. The small values are close to the dataset reference, indicating high motion quality.
- Motion-Text Semantic Alignment (R-prec, MM-Dist): FlowMDM achieves the best R-prec (0.67) and MM-Dist (5.08) among the methods. DART has a slightly lower R-prec (0.62) and a slightly higher MM-Dist (5.27). The authors explain this as an inherent characteristic of online generation: natural action transitions require a brief delay to adapt to new semantic prompts, which shifts motion embeddings and impacts R-prec. Offline methods like FlowMDM have full timeline knowledge and can plan transitions in advance.
- Diversity (DIV): DART has a DIV of 8.05 for segments and 6.70 for transitions, which is comparable to FlowMDM and DoubleTake, indicating good diversity in generated motions.
- Efficiency (Speed, Latency, Memory): This is where DART stands out in practical terms:
  - Speed: DART generates 334 frames/s, approximately 10x faster than FlowMDM (31 frames/s). TEACH (3880 frames/s) and T2M-GPT* (885 frames/s) are faster, but their motion quality (FID) is substantially worse.
  - Latency: DART has an extremely low latency of 0.02 s, enabling real-time responsiveness. This is orders of magnitude lower than FlowMDM (161.29 s).
  - Memory: DART uses 2394 MiB, roughly one fifth of FlowMDM's 11892 MiB and in the same range as the other baselines (1474 to 2251 MiB).

The following are the results from Table 2 of the original paper:
| Comparison | Realism (%) | Semantic (%) |
|---|---|---|
| Ours vs. TEACH | 66.7 vs. 33.3 | 66.0 vs. 34.0 |
| Ours vs. DoubleTake | 66.4 vs. 33.6 | 66.1 vs. 33.9 |
| Ours vs. T2M-GPT* | 61.3 vs. 38.7 | 66.7 vs. 33.3 |
| Ours vs. FlowMDM | 53.3 vs. 46.7 | 51.3 vs. 48.7 |

Analysis of Table 2 (Human Preference Study):
- Although FlowMDM shows slightly better R-prec and MM-Dist in the quantitative metrics, the human evaluation tells a different story. DART is preferred over FlowMDM for motion realism (53.3% vs. 46.7%) and narrowly preferred for motion-text semantic alignment (51.3% vs. 48.7%). This suggests that the slight delay in semantic emergence caused by DART's natural transitions is perceived as more natural by humans than the potentially abrupt, oracle-informed transitions of FlowMDM.
- DART is clearly preferred over TEACH, DoubleTake, and T2M-GPT* in both realism and semantic alignment (e.g., about 66% preference over TEACH), highlighting its overall superior quality.
6.1.2. Latent Diffusion Noise Optimization-Based Control
6.1.2.1. Text-Conditioned Motion In-betweening
This task evaluates generating motions that smoothly transition between a history keyframe and a goal keyframe, conditioned on a text prompt specifying the action semantics. DART was trained on the HML3D dataset for fair comparison.
The following are the results from Table 3 of the original paper:
| Method | History error (cm)↓ | Goal error (cm)↓ | Skate (cm/s)↓ | Jerk↓ |
|---|---|---|---|---|
| Dataset | 0.00 ± 0.00 | 0.00 ± 0.00 | 2.27 ± 0.00 | 0.74 ± 0.00 |
| OmniControl | 21.22 ± 2.86 | 7.79 ± 1.91 | 4.97 ± 1.31 | 1.41 ± 0.08 |
| DNO | 1.20 ± 0.20 | 4.24 ± 1.34 | 5.38 ± 0.70 | 0.65 ± 0.06 |
| Ours | 0.00 ± 0.00 | 0.59 ± 0.01 | 2.98 ± 0.32 | 0.61 ± 0.01 |

Analysis of Table 3:
- Keyframe Adherence (History Error, Goal Error): DART achieves zero history error (0.00 cm) and the lowest goal error (0.59 cm). This demonstrates its superior capability in generating motions that precisely connect specified keyframes, significantly outperforming OmniControl and DNO.
- Motion Quality (Skate, Jerk): DART exhibits the lowest skate (2.98 cm/s) and jerk (0.61) among the compared methods, indicating fewer foot-sliding artifacts and smoother motions. Its skate value is also the closest to the dataset reference (2.27 cm/s).
- Semantic Alignment with Spatial Control: The paper highlights that DART effectively preserves the semantics specified by the text prompts, unlike DNO, which occasionally ignores the text to reach the goal. This is a crucial advantage, showing that DART's latent motion primitive-based diffusion model balances spatial control with text semantic alignment.
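The idea behind latent noise optimization, adjusting the latent inputs of a frozen generator by gradient descent until the output meets a spatial goal, can be sketched on a toy problem. Everything below is an illustrative assumption: `rollout` is a linear stand-in for the frozen generative model, each 2D latent simply displaces a pelvis position, and the gradient is written by hand where a real system would rely on automatic differentiation.

```python
import math

def rollout(latents, start=(0.0, 0.0), step_scale=0.5):
    """Toy frozen generator: each 2D latent moves the pelvis by step_scale * z."""
    x, y = start
    for zx, zy in latents:
        x += step_scale * zx
        y += step_scale * zy
    return (x, y)

def goal_error(latents, goal, start=(0.0, 0.0)):
    """Euclidean distance between the final rollout position and the goal."""
    return math.dist(rollout(latents, start), goal)

def optimize_latents(latents, goal, start=(0.0, 0.0), lr=0.1, iters=100, step_scale=0.5):
    """Gradient descent on the latents to minimize squared goal distance."""
    latents = [list(z) for z in latents]
    for _ in range(iters):
        fx, fy = rollout(latents, start, step_scale)
        # d/dz of ||final - goal||^2; identical for every latent because the
        # toy rollout is linear in each latent.
        gx = 2.0 * (fx - goal[0]) * step_scale
        gy = 2.0 * (fy - goal[1]) * step_scale
        for z in latents:
            z[0] -= lr * gx
            z[1] -= lr * gy
    return [tuple(z) for z in latents]

initial = [(0.1, -0.2), (0.0, 0.3), (0.5, 0.5), (-0.4, 0.1)]
optimized = optimize_latents(initial, goal=(2.0, 1.0))
```

Only the latents change during optimization; the generator itself stays fixed, which is what lets the same pretrained model serve many different spatial goals. The step size here is tuned to this toy setup and would need adjusting for a different number of latents.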
6.1.2.2. Human-Scene Interaction
Qualitative results show that the latent noise optimization control can be applied to human-scene interaction synthesis. By using 3D scenes represented as signed distance fields (SDF) and incorporating scene contact/collision constraints, DART can generate motions like climbing stairs or sitting on a chair, adhering to both text prompts and goal joint locations. Figure 3 illustrates this.
The image shows generated human-scene interactions for three behaviors: (a) walking, turning left, and sitting on a chair; (b) walking up stairs; (c) walking down stairs.
Figure 3: Illustrations of human-scene interaction generation given text prompts and goal pelvis joint location (visualized as a red sphere). Best viewed in the supplementary video.
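Scene constraints built on a signed distance field can be expressed as penalties on the SDF values at body joints. The sketch below is an assumption about how such terms might look, not the paper's loss: the sphere scene, the function names, and the exact penalty forms are hypothetical.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius

def collision_penalty(joints, sdf):
    """Sum of penetration depths; only joints inside the scene (SDF < 0) count."""
    return sum(max(0.0, -sdf(p)) for p in joints)

def contact_penalty(contact_joints, sdf, tolerance=0.0):
    """Penalize contact joints (e.g. feet) that are not on the scene surface."""
    return sum(max(0.0, abs(sdf(p)) - tolerance) for p in contact_joints)

# Hypothetical scene: a unit sphere at the origin.
scene = lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)
penetration = collision_penalty([(0.0, 0.0, 0.5), (0.0, 0.0, 2.0)], scene)
```

Terms like these can be added to a goal-reaching objective so that the optimized motion reaches the target joint location without passing through scene geometry.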
6.1.3. Reinforcement Learning-Based Control
DART's integration with reinforcement learning enables training text-conditioned goal-reaching policies for various locomotion styles (walk, run, hop on the left leg).
The following are the results from Table 4 of the original paper:
| Method | Time (s)↓ | Success rate↑ | Skate (cm/s)↓ | Floor distance (cm)↓ |
|---|---|---|---|---|
| GAMMA walk | 31.44 ± 2.58 | 0.95 ± 0.03 | 5.14 ± 1.58 | 5.55 ± 0.84 |
| Ours 'walk' | 17.08 ± 0.05 | 1.0 ± 0.0 | 2.67 ± 0.12 | 2.24 ± 0.02 |
| Ours 'run' | 10.55 ± 0.06 | 1.0 ± 0.0 | 3.23 ± 0.24 | 3.86 ± 0.05 |
| Ours 'hop on left leg' | 20.50 ± 0.24 | 1.0 ± 0.0 | 2.22 ± 0.12 | 4.11 ± 0.07 |

Analysis of Table 4:
- Goal-Reaching Performance (Time, Success Rate): DART's policies achieve a 100% success rate for all locomotion styles, compared with 95% for GAMMA walk. DART also reaches the goal much faster (17.08 s for walk vs. GAMMA's 31.44 s).
- Motion Quality (Skate, Floor Distance): DART's policies show lower skate (2.67 cm/s for walk vs. GAMMA's 5.14 cm/s) and lower floor distance (2.24 cm for walk vs. GAMMA's 5.55 cm), indicating more physically plausible motions with less sliding and floating.
- Versatility: DART supports multiple text-conditioned locomotion styles, unlike GAMMA, which is limited to walking.
- Efficiency: The text-conditioned goal-reaching policy achieves a generation speed of 240 frames per second, demonstrating its real-time applicability for learned controllers.
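The throughput figures reported in Tables 1 and 4 can be turned into relative speedups and real-time factors with simple arithmetic. In this sketch the playback rate of 30 frames/s is an assumption, since the motion frame rate is not stated in this section, and the function names are illustrative.

```python
def speedup(fps_a, fps_b):
    """How many times faster method A generates frames than method B."""
    return fps_a / fps_b

def real_time_factor(generation_fps, playback_fps=30.0):
    """Seconds of motion generated per second of compute.

    playback_fps=30.0 is an assumed frame rate, not a value from the source.
    """
    return generation_fps / playback_fps

dart_vs_flowmdm = speedup(334, 31)      # Table 1 speeds (frames/s)
policy_factor = real_time_factor(240)   # RL policy speed from Table 4 analysis
```

A real-time factor above 1 means frames are produced faster than they are consumed, leaving headroom for rendering and control logic.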
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted on architecture designs, diffusion steps, primitive representation, and scheduled training, specifically for the text-conditioned motion composition task (Table 5).
The following are the results from Table 5 of the original paper:
| Method | Segment FID↓ | Segment R-prec↑ | Segment DIV→ | Segment MM-Dist↓ | Transition FID↓ | Transition DIV→ | Transition PJ→ | Transition AUJ↓ |
|---|---|---|---|---|---|---|---|---|
| Dataset | 0.00±0.00 | 0.72±0.00 | 8.42±0.15 | 3.36±0.00 | 0.00±0.00 | 6.20±0.06 | 0.02±0.00 | 0.00±0.00 |
| Ours | 3.79±0.06 | 0.62±0.01 | 8.05±0.10 | 5.27±0.01 | 1.86±0.05 | 6.70±0.03 | 0.06±0.00 | 0.21±0.00 |
| DART-VAE | 4.23±0.02 | 0.62±0.01 | 8.33±0.12 | 5.29±0.01 | 1.79±0.02 | 6.73±0.23 | 0.20±0.00 | 0.96±0.00 |
| DART-schedule | 8.08±0.09 | 0.39±0.01 | 8.05±0.12 | 6.96±0.03 | 7.41±0.10 | 6.58±0.06 | 0.03±0.00 | 0.18±0.00 |
| per frame (H=1, F=1) | 10.31±0.09 | 0.29±0.01 | 6.82±0.13 | 7.41±0.01 | 7.82±0.09 | 6.03±0.08 | 0.02±0.00 | 0.08±0.00 |
| H=2, F=16 | 4.04±0.10 | 0.66±0.00 | 8.20±0.06 | 4.96±0.01 | 2.22±0.10 | 6.60±0.20 | 0.06±0.00 | 0.18±0.00 |
| steps 2 | 4.44±0.04 | 0.60±0.00 | 8.20±0.15 | 5.38±0.01 | 2.24±0.02 | 6.77±0.07 | 0.05±0.00 | 0.20±0.00 |
| steps 5 | 3.49±0.09 | 0.63±0.00 | 8.25±0.15 | 5.18±0.01 | 2.11±0.07 | 6.74±0.11 | 0.05±0.00 | 0.20±0.00 |
| steps 8 | 3.70±0.03 | 0.62±0.01 | 8.04±0.13 | 5.25±0.03 | 2.15±0.08 | 6.72±0.15 | 0.06±0.00 | 0.20±0.00 |
| steps 10 (Ours) | 3.79±0.06 | 0.62±0.01 | 8.05±0.10 | 5.27±0.01 | 1.86±0.05 | 6.70±0.03 | 0.06±0.00 | 0.21±0.00 |
| steps 50 | 3.82±0.05 | 0.60±0.00 | 7.74±0.07 | 5.30±0.01 | 2.11±0.10 | 6.58±0.10 | 0.06±0.00 | 0.22±0.00 |
| steps 100 | 4.16±0.06 | 0.61±0.00 | 7.82±0.15 | 5.32±0.02 | 2.20±0.05 | 6.43±0.10 | 0.06±0.00 | 0.21±0.00 |

Analysis of Table 5:
- Impact of VAE (DART-VAE):
  - Removing the VAE (DART-VAE row) results in significantly higher jerk metrics (PJ 0.20 vs. 0.06 for Ours, AUJ 0.96 vs. 0.21 for Ours). This indicates considerably more jittering and less smooth motions when the diffusion model is trained directly in the raw motion space. This validates the importance of the VAE for compressing high-frequency noise and improving motion quality.
- Impact of Scheduled Training (DART-schedule):
  - Without scheduled training (DART-schedule row), the model's R-prec (0.39 vs. 0.62 for Ours) and FID (8.08 vs. 3.79 for Ours) are significantly worse. This confirms that scheduled training is critical for enabling the model to respond effectively to text control and to handle out-of-distribution history-text combinations encountered during long autoregressive generation.
- Impact of Primitive Length (H and F):
  - Using per-frame prediction (H=1, F=1) leads to significantly worse R-prec (0.29), FID (10.31), and MM-Dist (7.41). This demonstrates that very short, single-frame primitives are less effective for learning a text-conditioned motion space and responding to text prompts than primitives with a reasonable horizon, as in Ours.
  - Increasing the future length to F=16 (H=2, F=16 row) from the default setting (Ours) results in slightly worse FID for segments (4.04 vs. 3.79) and transitions (2.22 vs. 1.86), while improving R-prec slightly (0.66 vs. 0.62). This suggests that the default primitive length provides a good balance between learning complex semantics and maintaining motion quality.
- Impact of Diffusion Steps:
  - The ablation shows that DART can achieve high-quality generation with a surprisingly small number of diffusion steps. Steps 5 and steps 8 perform very similarly to steps 10 (Ours).
  - An extremely low diffusion step number of 2 results in a higher FID (4.44), indicating poorer motion quality. However, reducing steps from 100 to 10 or even 5 does not significantly degrade performance, validating the efficiency of DART's design (due to the simplicity of the primitive representation). This is a key finding for real-time applications, as fewer steps directly translate to faster inference.
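The jerk-based smoothness metrics discussed in the ablation rest on the third time derivative of joint positions. The sketch below approximates jerk with finite differences for a single joint trajectory; `peak_jerk` and `area_under_jerk` are simplified stand-ins and may differ from the exact PJ and AUJ definitions used in the paper (for instance, in how joints are aggregated or how a dataset reference is subtracted).

```python
import math

def jerk_magnitudes(positions, fps):
    """Per-frame jerk magnitude via third-order finite differences.

    positions: list of (x, y, z) for one joint, one entry per frame.
    """
    scale = fps ** 3  # each finite difference divides by dt = 1 / fps
    mags = []
    for t in range(len(positions) - 3):
        p0, p1, p2, p3 = positions[t:t + 4]
        jerk = [(d - 3.0 * c + 3.0 * b - a) * scale
                for a, b, c, d in zip(p0, p1, p2, p3)]
        mags.append(math.sqrt(sum(j * j for j in jerk)))
    return mags

def peak_jerk(positions, fps):
    """Largest jerk magnitude along the trajectory."""
    return max(jerk_magnitudes(positions, fps))

def area_under_jerk(positions, fps):
    """Rectangle-rule integral of jerk magnitude over time."""
    return sum(jerk_magnitudes(positions, fps)) / fps
```

A constant-velocity trajectory yields zero jerk, while frame-to-frame jitter inflates both values, which is the behavior the DART-VAE row of Table 5 reflects.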
6.3. Experimental Details and Human Preference Study Interface
- Experiment Details: DART uses Algorithm 1 for online generation with a default rest standing seed motion and a classifier-free guidance weight of 5. The baselines FlowMDM, TEACH, and DoubleTake used their released checkpoints. T2M-GPT* was re-trained with history conditioning. Action segments for evaluation were extracted from the BABEL validation set, clamping the minimum duration to 15 frames and skipping 'transition' labels.
- Human Preference Studies: Conducted on Amazon Mechanical Turk (AMT) to evaluate generation realism and text-motion semantic alignment. Participants viewed pairs of generated motions (from DART vs. a baseline) and selected the better one. Each comparison was voted on by 3 independent participants, with random shuffling to prevent bias. This setup is illustrated in Figure 4.
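The image shows the human preference study interface in two parts. The top part evaluates semantic alignment between the motion and the text description: participants choose which animation better matches the instruction. The bottom part asks which animation looks more natural and realistic.
Figure 4: Illustration of the human preference study interface for evaluating motion-text semantic alignment (top) and perceptual realism (bottom). Participants are requested to select the generation that is perceptually more realistic or better aligns with the action descriptions in subtitles (only visible in semantic preference study).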
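Two details of the experimental setup lend themselves to a short sketch: the classifier-free guidance weight and the extraction of evaluation segments. The guidance function uses the commonly used combination of conditional and unconditional predictions; whether DART applies it in exactly this form is an assumption. The segment format `(label, num_frames)` and both function names are hypothetical.

```python
def classifier_free_guidance(cond_pred, uncond_pred, weight=5.0):
    """Combine predictions as uncond + weight * (cond - uncond), elementwise."""
    return [u + weight * (c - u) for c, u in zip(cond_pred, uncond_pred)]

def extract_eval_segments(labeled_segments, min_frames=15):
    """Skip 'transition' segments and clamp durations to at least min_frames.

    labeled_segments: list of (label, num_frames) pairs (assumed format).
    """
    return [(label, max(num_frames, min_frames))
            for label, num_frames in labeled_segments
            if label != "transition"]

guided = classifier_free_guidance([1.0, 0.0], [0.5, 0.0])
segments = extract_eval_segments([("walk", 40), ("transition", 10), ("sit", 8)])
```

With a weight of 1 the guided prediction equals the conditional one; larger weights push the output further toward the text condition at some cost in diversity.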
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces DartControl (DART), a pioneering diffusion-based autoregressive motion primitive model for real-time text-driven motion control. DART successfully addresses critical limitations in existing human motion generation methods by offering a unified framework for creating long, continuous, and complex motions that are precisely guided by natural language and spatial constraints. Its core contributions include learning a compact, artifact-mitigating latent motion primitive space via a VAE, enabling efficient and high-quality autoregressive generation through a latent diffusion model (with just 10 steps and scheduled training), and providing versatile spatial control capabilities through both latent noise optimization and reinforcement learning policies. Experimental results consistently demonstrate DART's superior performance in motion realism, efficiency (over 300 frames/s, low latency), and controllability across diverse tasks, outperforming state-of-the-art baselines.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Reliance on Frame-Level Annotations: DART heavily relies on motion sequences with fine-grained, frame-level aligned text annotations (like BABEL) to achieve precise text-motion alignment and natural transitions.
- Degeneration with Coarse Labels: When trained on coarse sentence-level motion labels (like HML3D), the text-motion alignment degenerates for texts describing multiple actions. This leads to motions randomly switching between described actions, as each short motion primitive inherently matches only a portion of the sequence's semantics, causing semantic misalignment and ambiguity with coarse labels.
- Future Work - Hierarchical Latent Spaces: The authors propose exploring hierarchical latent spaces (Stofl et al., 2024) to effectively tackle both fine-grained (primitive-level) and global (sequence-level) semantics.
- Future Work - Open-Vocabulary Generation (General Limitation): Acknowledged as a critical challenge for all existing text-conditioned motion generation methods. The scarcity of 3D human motion data with text annotations (compared to image/video data) limits generalization to open-vocabulary text prompts.
- Future Work - Dataset Expansion: Promising directions include extracting motion data from in-the-wild internet videos and generative image/video models (Kapon et al., 2024; Goel et al., 2023; Lin et al., 2023; Shan et al., 2024).
- Future Work - Automated Labeling: Leveraging advancements in vision-language models (VLMs) (Shan et al., 2024; Dai et al., 2023) to automatically provide detailed, frame-aligned motion text labels to facilitate text-to-motion generation.
7.3. Personal Insights & Critique
DART represents a significant step forward in human motion generation, particularly in its emphasis on real-time and online capabilities, which are crucial for interactive applications.
- Strength - Versatility of Latent Space: A key strength is the versatility of DART's learned latent motion primitive space. It is not just an efficient representation for generation, but also a foundation for two distinct control paradigms (optimization and reinforcement learning), demonstrating its robustness and broad applicability. The fact that the same latent space can be manipulated for precise spatial control via gradient-based optimization or by a learned RL policy is highly compelling.
- Strength - Efficiency through Primitives and Few Steps: The use of motion primitives combined with a surprisingly small number of diffusion steps (10 steps) for high-quality generation is a critical innovation for real-time performance. This challenges the common assumption that diffusion models require many steps, suggesting that simpler data distributions (like those of primitives) can be effectively modeled with fewer iterations.
- Critical Reflection - Semantic Alignment Trade-off: The slightly worse R-prec and MM-Dist compared to FlowMDM in quantitative metrics, despite better human preference, highlights an interesting trade-off. DART prioritizes natural, online transitions, which might introduce a brief delay in fully manifesting new action semantics. While humans perceive this as natural, it can penalize metrics that expect immediate, perfect semantic shifts. This suggests that metrics for online, autoregressive generation might need to account for natural transition delays.
- Applicability & Transferability: The methods presented, especially the concept of learning a robust, controllable latent primitive space and combining it with various control techniques, could be highly transferable. For instance, similar frameworks could be applied to generating other continuous, complex sequences, such as robotic arm movements for manufacturing, character behaviors in games, or even long-term event sequences in other domains, provided a suitable primitive definition exists.
- Unverified Assumption / Area for Improvement - Primitive Length Optimization: While the ablation study shows that the default primitive length is effective, a deeper theoretical analysis or an adaptive mechanism for determining the optimal primitive length (H, F) could be beneficial. Different actions might naturally lend themselves to different primitive durations.
- Impact on Physics-Based Simulation: The successful integration with PHC for physics-based refinement is very promising. This hybrid approach (kinematic generation + physics correction) could become a standard for achieving both creative flexibility and physical plausibility in real time. The real-time nature of DART could enable online physics-based correction, which is a powerful paradigm.

Overall, DART offers an elegant and practical solution to a complex problem, pushing the boundaries of real-time, controllable human motion generation with a thoughtfully designed architecture and comprehensive experimental validation.