
Motion Prompting: Controlling Video Generation with Motion Trajectories

Published: 12/04/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces motion prompting, which controls video generation via motion trajectories to address the limitations of text-based prompts. It shows how high-level user requests can be converted into detailed motion prompts, and demonstrates versatile camera and object motion control, motion transfer, and drag-based image editing, with emergent behaviors such as realistic physics.

Abstract

Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: https://motion-prompting.github.io/

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Motion Prompting: Controlling Video Generation with Motion Trajectories

1.2. Authors

Daniel Geng^{1,2*}, Charles Herrmann^{1}, Junhwa Hur^{1}, Forrester Cole^{1}, Serena Zhang^{1}, Tobias Pfaff^{1}, Tatiana Lopez-Guevara^{1}, Carl Doersch^{1}, Yusuf Aytar^{1}, Michael Rubinstein^{1}, Chen Sun^{1,3}, Oliver Wang^{1}, Andrew Owens^{2}, Deqing Sun^{1}

  • ^{1}Google DeepMind
  • ^{2}University of Michigan
  • ^{3}Brown University
  • ^{*}Denotes equal contribution.
  • ^{\dagger}Denotes lead.

1.3. Journal/Conference

The paper is available as a preprint on arXiv. While the specific journal or conference for its final publication is not explicitly stated, Google DeepMind, University of Michigan, and Brown University are highly reputable institutions in AI/ML research. Given the authors' affiliations and the publication venue (arXiv), it is likely intended for a top-tier computer vision or machine learning conference (e.g., CVPR, ICCV, NeurIPS, ICML) or a highly respected journal.

1.4. Publication Year

2024 (published on arXiv on 2024-12-03)

1.5. Abstract

This paper introduces a novel approach to control video generation using motion trajectories, referred to as motion prompts. The core problem addressed is that existing video generation models primarily rely on text prompts, which are insufficient for capturing the nuanced dynamics and temporal compositions required for expressive video content. To overcome this, the authors train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. This flexible representation allows for encoding any number of trajectories, object-specific or global scene motion, and temporally sparse motion. A key innovation is motion prompt expansion, a process that translates high-level user requests into detailed, semi-dense motion prompts. The versatility of this approach is demonstrated through various applications, including camera and object motion control, "interacting" with images, motion transfer, and image editing. The results exhibit emergent behaviors such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. The method is quantitatively evaluated, a human study is conducted, and strong performance is demonstrated against baselines.

2. Executive Summary

2.1. Background & Motivation

Core Problem

The core problem the paper addresses is the limitation of current video generation models, which predominantly rely on text prompts for control. While text is effective for describing static scenes or high-level actions, it struggles to convey the subtleties of motion—such as exact trajectories, acceleration profiles, ease-in-ease-out timings, or synchronized movements. This makes it difficult to generate truly expressive and compelling video content with granular control over dynamic actions and temporal compositions. For instance, a text prompt like "a bear quickly turns its head" leaves too much room for interpretation regarding the precise motion.

Importance and Challenges

Motion control is paramount in video generation. It can elevate a video's realism, guide viewer attention, enhance storytelling, and define visual style. Achieving realistic and expressive motion with precise control is essential for creating professional-quality video. The challenge lies in finding a control signal that is both expressive enough to capture motion nuances and practical for users to specify. Prior research has explored various forms of motion conditioning, but often these approaches are either too sparse (lacking fine-grained control), too dense (impractical to specify manually), or require complex engineering and specialized training for different motion types.

Paper's Entry Point or Innovative Idea

The paper's innovative idea is to use spatio-temporally sparse or dense motion trajectories as the primary control signal, termed motion prompts, for video generation. The authors argue that motion trajectories (also known as particle video or point tracks) are an ideal candidate because they offer a highly expressive encoding of motion. This representation can:

  1. Capture the trajectories of any number of points.

  2. Represent object-specific or global scene motion.

  3. Handle temporally sparse motion constraints (i.e., motion specified only at certain time points).

  4. Account for occlusions through visibility flags.

    Furthermore, the paper introduces motion prompt expansion, a method to bridge the gap between simple user inputs (e.g., mouse drags) and the detailed motion trajectories required for fine-grained control, often by leveraging computer vision signals. This approach aims to provide both flexibility and ease of use, making motion a powerful, complementary control scheme to text.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of video generation:

  • Flexible Motion Representation and Conditioning: The authors identify spatio-temporally sparse or dense motion tracks as a highly flexible and comprehensive motion representation. They train a ControlNet on top of a pre-trained video diffusion model (Lumiere) to accept these motion prompts as conditioning, enabling granular control over various aspects of video motion within a unified model. This representation is versatile enough to control cameras, objects, or entire scenes.

  • Motion Prompt Expansion: They propose motion prompt expansion, a novel process that translates simple user inputs (like mouse drags or high-level requests) into more complex and detailed motion tracks. This addresses the practical challenge of generating precise motion prompts and allows for fine-grained control without requiring users to manually design dense trajectories.

  • Versatile Applications: The approach is demonstrated across a wide range of applications, showcasing its versatility:

    • Object control: Manipulating specific objects.
    • Camera control: Guiding camera movements (e.g., orbits, arcs) without explicit camera pose training.
    • Motion transfer: Applying motion from a source video to a different first frame.
    • Drag-based image editing: Interactively modifying images based on user drags.
    • "Interacting" with an image: Manipulating elements like hair or sand.
    • Composing motions: Combining object and camera control simultaneously.
    • Motion magnification: Enhancing subtle motions in videos.
  • Emergent Behaviors and Probing Capabilities: The model exhibits emergent behaviors, such as generating videos with realistic physics (e.g., hair tossing, sand sweeping) in response to motion prompts. This suggests that motion prompts can serve as a powerful tool to probe the learned video prior of generative models, helping to understand their implicit knowledge of 3D structures and physical laws, and potentially interacting with future world models.

  • Strong Performance and Evaluation: The method is quantitatively evaluated against baselines (Image Conductor, DragAnything) using metrics like PSNR, SSIM, LPIPS, FVD, and EPE. It demonstrates superior performance in most cases. A human study further validates the approach, showing a strong preference for the proposed method in terms of motion adherence, motion quality, and visual quality.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following core concepts:

  • Video Generation: The process of creating video content, typically from various inputs like text, images, or other conditional signals. Early methods might involve simple interpolations, while modern approaches leverage deep learning.

  • Diffusion Models: A class of generative models that have shown remarkable success in generating high-quality images and videos. They work by iteratively denoising a random input (noise) to produce a coherent output. The process involves two main stages:

    1. Forward Diffusion: Gradually adding Gaussian noise to data until it becomes pure noise.
    2. Reverse Diffusion: Learning to reverse this process, starting from pure noise and gradually removing noise to recover the original data distribution.
    • Latent Diffusion Models: Many modern diffusion models operate in a compressed latent space rather than directly on pixel space, which makes training and inference more efficient. Lumiere, the base model this paper builds upon, instead operates in pixel space at a low base resolution and relies on a spatial super-resolution cascade (see Section 4.2.4) to reach higher resolutions.
  • ControlNet: A neural network architecture designed to add conditional control to pre-trained large diffusion models. It works by creating a trainable copy of the encoder layers of a pre-trained diffusion model. The original diffusion model's weights are kept frozen, while the ControlNet learns to incorporate new conditional inputs (like edge maps, pose estimates, or in this paper, motion trajectories). The outputs of the ControlNet are then added to the corresponding layers of the original diffusion model's encoder. This allows for fine-tuning without disturbing the pre-trained model's general knowledge. A key feature is the use of zero convolutions at the beginning of each ControlNet block, which initialize the ControlNet to a "zero state" that does not affect the original model, allowing training to start smoothly without disrupting its existing capabilities.

  • Motion Trajectories (Point Tracks / Particle Video): This is a fundamental concept in the paper. A motion trajectory is a sequence of 2D coordinates that traces the path of a specific point (e.g., a pixel, a feature point, or a conceptual particle) across multiple frames in a video.

    • Spatio-temporally sparse: Means that trajectories might be available for only a few points, or only at certain time intervals.
    • Spatio-temporally dense: Means trajectories are available for a large number of points across all frames.
    • Visibility: An important aspect of motion trajectories is whether a tracked point is visible in a given frame (i.e., not occluded or out of frame). This is often represented by a binary flag (1 for visible, 0 for not visible).
    • Particle Video: A concept where a video is represented not by pixels, but by a collection of moving particles (points), each with a trajectory.
  • Optical Flow: A traditional computer vision concept representing the apparent motion of objects, surfaces, and edges in a visual scene between two consecutive frames. It's a vector field where each vector indicates the displacement of a pixel from one frame to the next. While related to motion trajectories, optical flow typically describes dense, frame-to-frame motion, and errors can accumulate over long sequences. It also struggles with occlusions. The paper contrasts its chosen motion trajectory representation with optical flow, noting the latter's limitations.

  • Monocular Depth Estimation: The process of predicting the depth (distance from the camera) of each pixel in an image using only a single 2D image as input. This is typically done using deep learning models trained on large datasets with ground truth depth information. The paper uses an off-the-shelf monocular depth estimator to create 3D point clouds from a 2D image, which is then used for camera control.

3.2. Previous Works

The paper contextualizes its work within several areas of video generation and motion control.

Video Diffusion Models

Recent years have seen diffusion models achieve state-of-the-art results in video generation.

  • Text-to-Video Generation: Models like Imagen Video [26], Video Diffusion Models [27], Make-A-Video [64], Emu Video [19], and Lumiere [3] generate videos from text prompts. Lumiere is the base model that this paper builds upon, known for generating 5-second, 16fps videos from text and a first frame.
  • Image-to-Video Generation: Other models like Stable Video Diffusion [6] and DynamiCrafter [81] animate static images into videos.
  • World Simulators: Diffusion models are also seen as a path towards creating world simulators [7], with preliminary success in visual planning for embodied agents [15, 16, 83]. The debate around whether video priors capture sufficient understanding of the physical world [35] is mentioned, with some arguing for explicit physics rule integration [41, 85, 86].

Motion-conditioned Video Generation

This line of work aims to adapt pre-trained text-to-video models to follow new motion patterns or additional motion signals.

  • Fine-tuning/Adapters:
    • LoRA (Low-rank adaptation) [29] for few-shot motion customization [55, 90].
    • DreamBooth [57] for personalized video generation with motion control [78].
  • Sparse Motion Control: Early work proposed control through sparse motions [2, 21]. More recent works explore similar ideas with more powerful models:
    • Tora [89], MotionCtrl [75], DragNUWA [84], Image Conductor [39], MCDiff [9] adopt complex two-stage training, specialized losses [39, 45], or multi-stage fine-tuning.
    • MOFA-Video [46] requires separate adapters for different motion types.
    • TrackGo [92] uses custom losses and layers.
    • Many of these works [39, 46, 75, 84, 89] engineer specific data filtering pipelines.
  • Entity-Centric Control: Uses signals tied to specific entities in the video:
    • Bounding boxes [72, 78]
    • Segmentation masks [10, 79]
    • Human pose [30, 82]
    • Camera pose [23, 76]
  • Zero-shot Motion Adaptation: Approaches like SG-I2V [45], Trailblazer [43], FreeTraj [54], and Peekaboo [32] guide video generation based on entity-centric mask motion without training or fine-tuning the video models (i.e., at test time). These often explicitly control diffusion feature maps.

Motion Representations

The choice of motion representation is crucial.

  • Optical Flow [8, 14, 28, 42, 67, 68]: Represents pixel-level motion between frames.
    • Limitation: Errors can accumulate when chained over time, and it lacks robust occlusion handling, which the authors find unsuitable for good camera control.
  • Long-range feature matching [5, 31, 33, 63] or Point Trajectories [12, 13, 22, 36, 37, 91]: Tracks specific points across longer temporal durations.
    • Advantage: Can handle occlusions and allows for both sparse and dense tracking over arbitrary durations. The paper specifically cites BootsTAP [13] as an efficient algorithm used for extracting tracks.

Example of Crucial Prior Formula (Not directly from this paper, but foundational for Diffusion Models)

Since this paper builds on diffusion models, understanding the core diffusion loss is helpful. A common objective in Denoising Diffusion Probabilistic Models (DDPMs) [25] is to train a neural network \epsilon_\theta to predict the noise added to a noisy input. The simplified objective (or diffusion loss) for training \epsilon_\theta is:

L_t = \mathbb{E}_{\mathbf{x}_0, \epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \| \epsilon - \epsilon_\theta(\mathbf{x}_t, t) \|^2 \right]

Where:

  • L_t: The loss at a specific timestep t.

  • \mathbb{E}: Expectation over the data distribution of \mathbf{x}_0 and the noise \epsilon.

  • \mathbf{x}_0: An original data sample (e.g., an image or video frame).

  • \epsilon \sim \mathcal{N}(0, \mathbf{I}): Pure Gaussian noise sampled from a standard normal distribution.

  • \mathbf{x}_t: A noisy version of \mathbf{x}_0 at timestep t, obtained by the forward diffusion process: \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon.

  • \epsilon_\theta(\mathbf{x}_t, t): The neural network (often a U-Net) parameterized by \theta, which predicts the noise \epsilon given the noisy input \mathbf{x}_t and the current timestep t.

  • \| \cdot \|^2: The squared L2 norm, measuring the difference between the true noise \epsilon and the predicted noise \epsilon_\theta.

  • \bar{\alpha}_t: A scalar determined by the noise schedule at timestep t.

    This loss encourages the model to accurately predict the noise component, which is then used in the reverse diffusion process to iteratively reconstruct clean data from noise. The paper states that its ControlNet is trained using the standard diffusion loss, implying a variant of this objective.
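
To make the objective concrete, the following is a minimal PyTorch-style sketch of one DDPM training step under this loss; the model, noise schedule, and shapes are illustrative placeholders rather than the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod, num_timesteps=1000):
    """One denoising-diffusion training step: regress the added noise.

    model:          a noise-prediction network eps_theta(x_t, t) (placeholder)
    x0:             a batch of clean samples, shape (B, ...)
    alphas_cumprod: tensor of bar-alpha_t values, shape (num_timesteps,)
    """
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                   # pure Gaussian noise

    # Forward diffusion: x_t = sqrt(bar-alpha_t) * x0 + sqrt(1 - bar-alpha_t) * eps
    abar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

    # Predict the noise and apply the L2 diffusion loss.
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, eps)
```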

3.3. Technological Evolution

The evolution of video generation has moved from early, often rule-based or simple interpolation methods, to sophisticated deep learning models. Initially, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) were explored, but diffusion models have recently taken the lead due to their stability, sample quality, and ability to capture complex data distributions.

  • Text-only control: The first wave of advanced models primarily relied on text prompts, effectively generating content but lacking fine-grained motion control.
  • Image-conditioning: The next step incorporated image conditioning (e.g., first frame) to provide visual consistency, helping anchor the generation.
  • Explicit motion control (early forms): Researchers started adding various forms of motion conditioning, ranging from sparse keypoints, segmentation masks, bounding boxes, to human poses or camera parameters. However, these often required specific training for each type of control or lacked the flexibility to combine different granularities of motion.
  • Flexible motion primitives: This paper represents a significant step by proposing a unified and flexible motion representation (motion trajectories) that can encode various types and granularities of motion (sparse, dense, object, global, temporal) and by introducing motion prompt expansion to make this expressive control accessible to users. This moves the field towards more intuitive and powerful interfaces for directing video generation, potentially paving the way for sophisticated interaction with generative world models.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Unified and Flexible Motion Representation:

    • Prior Work: Many prior methods for motion conditioning are often constrained to specific types of motion (e.g., bounding boxes, segmentation masks, human poses, or simple sparse trajectories). MOFA-Video [46] even requires separate adapters for different motion types. Optical flow, while dense, has limitations with occlusions and long-range coherence.
    • This Paper: Uses spatio-temporally sparse or dense motion trajectories as a unified representation. This single representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion (including occlusions via visibility flags). This flexibility is a key differentiator, allowing a single model to handle a broad spectrum of motion control tasks.
  • Simpler Training Recipe:

    • Prior Work: Many competitive motion-conditioned video generation methods like Tora [89], MotionCtrl [75], DragNUWA [84], Image Conductor [39], and MCDiff [9] often require complex engineering, such as two-stage training, specialized losses, custom architectures, multi-stage fine-tuning, or engineered data filtering pipelines.
    • This Paper: Achieves high-quality results with a simpler training recipe. The model is trained in a single stage, using uniformly sampled dense trajectories extracted by BootsTAP [13], and without any specialized engineering efforts or filtering of training videos. Despite this simplicity, it generalizes well to both sparse and dense trajectories during inference.
  • Motion Prompt Expansion:

    • Prior Work: While some works allow users to specify sparse trajectories (e.g., mouse drags), the leap to dense, fine-grained control is often manual or limited.
    • This Paper: Introduces motion prompt expansion, a method to translate high-level user requests (e.g., mouse drags, conceptual object manipulations, camera paths) into the detailed, semi-dense motion prompts required for robust control. This bridges the gap between user intent and the model's precise input requirements, making the powerful motion trajectory representation more accessible.
  • Generalization Capabilities:

    • Prior Work: Models often specialize in the specific motions or controls they are trained on (e.g., camera pose models [23, 76] are trained explicitly on camera data).
    • This Paper: Shows strong generalization to spatially localized track conditioning (despite training on uniform distribution), to varying numbers of tracks (more or fewer than trained on), and to temporally sparse tracks (not necessarily starting from the first frame). Crucially, it achieves compelling camera control and motion composition without being explicitly trained on camera poses or specific compositions. This emergent capability from a general motion training objective is a notable innovation.
  • Probing and World Models:

    • Prior Work: Less emphasis on using motion conditioning as a tool for understanding the underlying generative model's physical priors.
    • This Paper: Highlights the potential of motion prompts for probing video models to understand their 3D or physics understanding and suggests their role in interacting with future generative world models. The observation of emergent behaviors like realistic physics is a finding that differentiates this work beyond mere generation quality.

4. Methodology

4.1. Principles

The core idea behind Motion Prompting is to leverage explicit motion trajectories as a primary conditioning signal for video generation, moving beyond the limitations of text-only prompts. This is achieved by building on a pre-trained video diffusion model (Lumiere) and augmenting it with a ControlNet architecture specifically designed to accept motion prompts in the form of point tracks. The underlying intuition is that motion is inherently spatiotemporal and continuous, making trajectories a natural and expressive way to encode it. The ControlNet allows the model to learn to adhere to these specified motions while retaining the generative capabilities and consistency of the base diffusion model. To make this expressive control practical, the concept of motion prompt expansion is introduced, enabling users to translate high-level intentions into detailed trajectories.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology can be broken down into several key components: the choice of motion prompt representation, the ControlNet architecture for conditioning, the data preparation, and the training process, followed by various motion prompt expansion strategies for different applications.

4.2.1. Motion Prompt Representation

The paper identifies spatio-temporally sparse or dense motion trajectories as an ideal representation for motion prompts. This representation is flexible enough to capture various types of motion.

Formally, a set of N point trajectories of length T is denoted by \mathbf{p} \in \mathbb{R}^{N \times T \times 2}.

  • \mathbf{p}[n, t] = (x_t^n, y_t^n) represents the 2D coordinate of the n-th track at the t-th timestep.

  • The visibility of tracks is denoted by \mathbf{v} \in \mathbb{R}^{N \times T}, an array of 1s and 0s, where 0 indicates an off-screen or occluded track, and 1 indicates a visible track.

    This representation allows for:

  • Any number of tracks (N).

  • Object-specific motion (tracking points on a single object).

  • Global scene motion (tracking points across the entire scene).

  • Temporally sparse motion (tracks specified only for certain frames).

  • Handling occlusions through the visibility flag.
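
As a concrete, hypothetical example of this representation, the sketch below builds \mathbf{p} and \mathbf{v} for two tracks: a point dragged horizontally and a static background point. The shapes follow the definition above; the coordinate values are made up.

```python
import numpy as np

T = 80   # number of frames (Lumiere generates 80 frames)
N = 2    # two example tracks

p = np.zeros((N, T, 2), dtype=np.float32)  # (x, y) coordinate per track and frame
v = np.ones((N, T), dtype=np.float32)      # 1 = visible, 0 = occluded / off-screen

# Track 0: a point dragged horizontally from x=30 to x=90 at a fixed y.
p[0, :, 0] = np.linspace(30.0, 90.0, T)
p[0, :, 1] = 64.0

# Track 1: a static background point that stays put, anchoring the scene.
p[1, :, 0] = 10.0
p[1, :, 1] = 100.0

# Mark track 0 as not visible for the last 10 frames (temporal sparsity / occlusion).
v[0, -10:] = 0.0
```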

4.2.2. Architecture

The model is built on Lumiere [3], a pre-trained video diffusion model that generates 5-second videos (80 frames at 16 fps) conditioned on text and a first frame. To incorporate track conditioning, a ControlNet [87] is used.

ControlNet Integration:

  1. Base Model: The Lumiere model's weights are kept frozen.
  2. ControlNet Copy: A trainable copy of Lumiere's encoder stack is created.
  3. Zero Convolutions: Zero convolutions are added as in ControlNet [87] at the beginning of each block of the ControlNet, ensuring that the ControlNet initially produces zero output, thus not disturbing the pre-trained Lumiere model at the start of training.
  4. Input Layer Modification: The first convolutional layer of the ControlNet is replaced with a new layer designed to accept the motion prompt conditioning signal.
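
To make the integration pattern concrete, here is a minimal PyTorch-style sketch of a ControlNet-style block: a frozen base block, a trainable copy, a new input layer for the conditioning signal, and a zero-initialized convolution so the branch starts as a no-op. All module names and shapes are illustrative assumptions, not the paper's implementation (which operates on Lumiere's space-time U-Net).

```python
import copy
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, so the control branch starts as a no-op."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Frozen base encoder block plus a trainable copy, merged through a zero conv."""

    def __init__(self, base_block: nn.Module, channels: int, cond_channels: int):
        super().__init__()
        self.base = base_block
        for p in self.base.parameters():
            p.requires_grad = False               # keep the pre-trained weights frozen

        self.control = copy.deepcopy(base_block)  # trainable copy of the encoder block
        self.cond_in = nn.Conv2d(cond_channels, channels, kernel_size=1)  # new input layer for the motion conditioning
        self.zero_out = zero_conv(channels)

    def forward(self, x, cond):
        base_out = self.base(x)
        control_out = self.control(x + self.cond_in(cond))
        # At initialization zero_out returns zeros, so the base model is undisturbed.
        return base_out + self.zero_out(control_out)
```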

Encoding Motion Tracks into a Spatio-temporal Volume: The ControlNet requires the motion prompt to be encoded into a spatio-temporal volume, \mathbf{c} \in \mathbb{R}^{T \times H \times W \times C}.

  • T: Number of frames (80 for Lumiere).

  • H, W: Height and width of the generated video (128×128 for Lumiere's base model).

  • C: Channel dimension of the conditioning signal. The paper sets C = 64.

    The encoding process is as follows:

  1. Unique Embeddings: Each track \mathbf{p}[n, :] (for each n from 1 to N) is assigned a unique, random embedding vector \phi^n \in \mathbb{R}^C, which acts as a simple identifier for that track. The embeddings \phi^n are randomly drawn from a fixed pool and use a 64-dimensional sinusoidal positional encoding [70].

  2. Rasterization: The spatio-temporal volume \mathbf{c} is zero-initialized. Then, for each space-time location (t, x_t^n, y_t^n) that a track visits and is visible at, the corresponding embedding \phi^n is written into \mathbf{c}.

  3. Formal Equation for Conditioning Signal Generation:

     \mathbf{c}[t, x_t^n, y_t^n] = \mathbf{v}[n, t] \, \phi^n

     Where:

    • \mathbf{c}[t, x_t^n, y_t^n]: The value of the conditioning volume at spatial location (x_t^n, y_t^n) and timestep t.
    • t: Timestep index, ranging from 0 to T-1.
    • x_t^n, y_t^n: The 2D coordinates (quantized to the nearest integer) of the n-th track at timestep t.
    • \mathbf{v}[n, t]: The visibility flag for the n-th track at timestep t. If \mathbf{v}[n, t] = 0, the embedding is zeroed out, so the track does not contribute to the conditioning at that location and time; if \mathbf{v}[n, t] = 1, the embedding \phi^n is placed.
    • \phi^n: The unique C-dimensional random embedding vector assigned to the n-th track.

    Important Notes on Encoding:

    • The coordinates x_t^n and y_t^n are quantized to the nearest integer for simplicity, matching the discrete grid of the volume.

    • If multiple tracks pass through the same space-time location, their respective embeddings are added together.

    • For completely dense tracks, this representation is analogous to starting with a dense grid of embeddings and forward warping them, similar to [59].

The following figure (Figure 2 from the original paper) illustrates the conditioning-track encoding process:

Figure 2. Conditioning Tracks. During training, we take estimated tracks from a video (left) and encode them into a T \times H \times W \times C dimensional space-time volume (middle). Each track has a unique embedding (right), written to every location the track visits and is visible at. All other locations are set to zeros. This strategy can encode any number and configuration of tracks.
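
A minimal sketch of the rasterization step described above, assuming tracks p of shape (N, T, 2), visibility v of shape (N, T), and per-track embeddings phi of shape (N, C); it quantizes coordinates, masks by visibility, and sums embeddings that collide, following \mathbf{c}[t, x_t^n, y_t^n] = \mathbf{v}[n, t] \, \phi^n.

```python
import numpy as np

def rasterize_tracks(p, v, phi, T, H, W):
    """Encode point tracks into a (T, H, W, C) conditioning volume.

    p:   (N, T, 2) float array of (x, y) coordinates per track and frame
    v:   (N, T)    visibility flags (1 = visible, 0 = occluded / off-screen)
    phi: (N, C)    one unique embedding vector per track
    """
    N, C = phi.shape
    c = np.zeros((T, H, W, C), dtype=np.float32)
    for n in range(N):
        for t in range(T):
            if v[n, t] == 0:
                continue                              # invisible points contribute nothing
            x = int(round(float(p[n, t, 0])))         # quantize to the nearest integer
            y = int(round(float(p[n, t, 1])))
            if 0 <= x < W and 0 <= y < H:
                c[t, y, x] += phi[n]                  # colliding tracks: embeddings are summed
    return c
```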

4.2.3. Data

To train the model, a video dataset paired with tracks is required.

  • Dataset: An internal dataset consisting of 2.2 million videos is used. These videos are resized to 128×128, the output size of the base Lumiere model.
  • Track Extraction: BootsTAP [13], an off-the-shelf point tracking method, is run on this dataset.
    • Videos are center cropped to a square and resized to 256×256 before running BootsTAP.
    • Tracks are extracted densely, resulting in 16,384 tracks per video.
    • Predicted occlusions are also extracted by BootsTAP, which are crucial for the visibility flag \mathbf{v}.
  • Data Philosophy: The authors explicitly state they do not filter the videos in any way, hypothesizing that training on diverse motions will lead to a more powerful and flexible model.

4.2.4. Training

The training procedure largely follows the ControlNet paradigm [87].

  • Loss Function: The standard diffusion loss (as introduced in Section 3.2) is optimized.
  • Conditioning Signal Sampling: For every video during training, a random number of tracks is sampled from a uniform distribution (specifically, from 1000 to 2000 inclusive, as per Appendix A.1). These sampled tracks are then used to construct the conditioning signal \mathbf{c} as described above.
  • Optimizer and Learning Rate: The model is trained for 70,000 steps using Adafactor [60] with a learning rate of 1 \times 10^{-4}, without any learning rate decay.
  • Spatial Super Resolution (SSR): After the base 128×128 video is generated, it is passed through Lumiere's spatial super resolution (SSR) model to produce a 1024×1024 video. The SSR model is used as-is, without finetuning it for motion conditioning, indicating that the motion-conditioned dynamics are primarily learned by the base 128×128 model.

Training Observations (Generalization and Convergence Phenomena): The authors observe several interesting phenomena during training:

  1. Loss-Performance Discrepancy: The loss function is found not to be correlated with the model's actual performance in following tracks. This suggests the loss might not directly reflect the nuanced objective of motion adherence.
  2. Sudden Convergence: Similar to ControlNet [87], the model exhibits a "sudden convergence" phenomenon, transitioning from completely ignoring the conditioning signal to closely following it within a small number of training steps. This is hypothesized to be related to the zero initialization of ControlNet and is also observed in ControlNext [50].
  3. Strong Generalization: The model generalizes surprisingly well in multiple directions:
    • Spatially localized tracks: Despite training on randomly sampled tracks (leading to spatially uniform distribution), the model can generalize to spatially localized track conditioning.
    • Number of tracks: It generalizes well to both more and fewer tracks than the range it was trained on (1000-2000).
    • Temporally sparse tracks: The model generalizes to tracks that don't necessarily start from the first frame, even though training data primarily consists of tracks starting from the first frame. This implies an ability to perform motion prediction or completion. The authors hypothesize that this generalization stems from a combination of inductive biases from convolutions in the network and the large variety of trajectories in the training data.

4.2.5. Motion Prompt Expansion Strategies and Applications

The paper highlights how users can generate motion prompts through motion prompt expansion, translating high-level requests into detailed trajectories.

"Interacting" with an Image (Section 4.1)

This application allows users to manipulate parts of a static image to animate them.

  • User Input: A GUI (Graphical User Interface) records mouse drags on a displayed still image.

  • Expansion: Mouse drags are translated into a grid of point tracks centered on the mouse when it's dragged.

    • Users can choose the density and size of this grid, similar to Gaussian blurring for spatial extent in prior work [39, 75, 79, 84], but this is done at inference time.
    • Users can specify static tracks to keep background areas still or have tracks persist after a drag.
  • Emergent Phenomena: This often results in complex, realistic dynamics, like tossing hair or sweeping sand, demonstrating the model's learned physics and world understanding.

  • Prediction: Because the model supports temporally sparse track conditioning, users can specify motion for a short duration and let the model predict the future behavior.

    The following figure (Figure 3 from the original paper) shows examples of "interacting" with an image:

The figure is divided into four parts, highlighting how objects in different scenes transform under motion guidance: panels a) and c) show motion control of birds and cattle, while panels b) and d) show dynamic interactions with a person and with sand.
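
A hedged sketch of this kind of motion prompt expansion: turning a recorded mouse drag into a small grid of parallel tracks around the drag's starting point. The grid size and spacing are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def drag_to_track_grid(drag_xy, grid=5, spacing=4.0):
    """Expand a mouse drag into a (grid x grid) patch of parallel point tracks.

    drag_xy: (T, 2) array of recorded mouse positions over time
    Returns p of shape (grid*grid, T, 2): every grid point follows the drag's
    per-frame displacement, so the whole patch moves rigidly with the mouse.
    """
    drag_xy = np.asarray(drag_xy, dtype=np.float32)
    deltas = drag_xy - drag_xy[0]                       # (T, 2) displacement of the mouse
    offsets = np.stack(np.meshgrid(
        (np.arange(grid) - grid // 2) * spacing,
        (np.arange(grid) - grid // 2) * spacing,
    ), axis=-1).reshape(-1, 2)                          # (grid*grid, 2) offsets around the start
    starts = drag_xy[0] + offsets                       # (grid*grid, 2) starting positions
    return starts[:, None, :] + deltas[None, :, :]      # (grid*grid, T, 2)
```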

Drag-Based Image Editing (Section 4.1): A natural extension of "interacting" with an image, where user drags are used to edit an image by making objects follow these drags. This is similar to prior work [1, 18, 44, 48, 56, 62].

The following figure (Figure 4 from the original paper) shows examples of drag-based image editing:

Figure 4. Drag-Based Image Editing. We show the input images in the first row, and resulting drag-based edits in the bottom row, with the drag visualized in both rows. In addition, in the final example we show how we can keep areas of the images static.

Object Control with Primitives (Section 4.2)

This strategy allows for more fine-grained control over objects.

  • User Input: Mouse motions are reinterpreted as manipulating a proxy geometric primitive (e.g., a sphere).

  • Expansion:

    1. The user places a sphere of a given radius and location over an object.
    2. Mouse motions are converted into sphere spins, where the sphere rotates around a single axis such that the initial mouse location matches the current mouse location at each frame, uniquely defining a rotation.
    3. Points on the surface of the sphere are tracked as it spins, generating 3D point trajectories.
    4. These 3D points are then orthographically projected to obtain 2D tracks.
  • Benefit: Enables control of complex motions like rotations, which are hard to express with a single mouse drag.

    The following figure (Figure 6 from the original paper) shows examples of object control with primitives:

Figure 6. Object Control with Primitives. By defining geometric primitives (e.g., a sphere) manipulated by a user with a mouse, we can obtain tracks exerting more fine-grained control over objects (e.g., rotations), which cannot be specified with a single track.
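
The sketch below illustrates the primitive idea under simplified assumptions: sample points on a sphere, spin them about a fixed axis by a per-frame angle (which in the paper is derived from the mouse motion), and orthographically project the result to 2D tracks. The mouse-to-rotation mapping and back-face occlusion handling are omitted for brevity.

```python
import numpy as np

def sphere_spin_tracks(center, radius, angles, n_points=64, axis=(0.0, 1.0, 0.0)):
    """Rotate points on a sphere and orthographically project them to 2D tracks.

    center: (cx, cy) image-space center of the sphere
    radius: sphere radius in pixels
    angles: (T,) rotation angle per frame about `axis`
    Returns tracks of shape (n_points, T, 2).
    """
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(n_points, 3))                       # roughly uniform directions
    pts = radius * pts / np.linalg.norm(pts, axis=1, keepdims=True)

    k = np.asarray(axis, dtype=np.float64)
    k = k / np.linalg.norm(k)
    frames = []
    for theta in angles:
        # Rodrigues' rotation formula: rotate every surface point by theta about k.
        rot = (pts * np.cos(theta)
               + np.cross(k, pts) * np.sin(theta)
               + k * (pts @ k)[:, None] * (1.0 - np.cos(theta)))
        frames.append(rot[:, :2] + np.asarray(center))         # orthographic projection (drop z)
    return np.stack(frames, axis=1)                            # (n_points, T, 2)
```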

Camera Control with Depth (Section 4.3)

This technique enables camera movement around a scene.

  • Expansion:

    1. An off-the-shelf monocular depth estimator [51] is run on the input frame to obtain a point cloud of the scene (i.e., 3D points in space).
    2. Given a desired trajectory of camera poses, the point cloud is re-projected onto each camera pose in the sequence. This results in 2D tracks for input.
    3. Z-buffering is optionally applied to determine occlusion flags, improving quality by only showing the closest points to the camera.
  • Combined with Mouse Control: A camera trajectory can be constructed such that a single point in the point cloud follows a mouse trajectory, with the camera constrained to a vertical plane for ease of use.

  • Key Insight: The model achieves compelling camera control without being explicitly trained on or conditioned by camera poses, demonstrating its ability to learn general motion dynamics.

    The following figure (Figure 5 from the original paper) shows examples of camera control:

Figure 5 shows the process of generating a point cloud with a depth estimator and its use for camera control in video generation. Panel a) shows the motion trajectories, while panels b) and c) show frames generated from different camera trajectories together with their corresponding optical flow.
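
A minimal sketch of the depth-based camera-control expansion under standard pinhole assumptions: unproject a grid of first-frame pixels using the predicted depth, transform the resulting point cloud by each desired camera pose, and re-project to obtain 2D tracks. The intrinsics and pose convention are assumptions, and the optional z-buffering for occlusion flags is omitted.

```python
import numpy as np

def camera_tracks_from_depth(depth, K, poses, stride=8):
    """Turn a depth map plus a camera path into 2D point tracks.

    depth: (H, W) predicted depth for the first frame
    K:     (3, 3) pinhole camera intrinsics
    poses: list of (4, 4) world-to-camera matrices, one per generated frame
           (the first frame's camera is assumed to be the identity / world frame)
    Returns tracks of shape (N, T, 2) for a subsampled pixel grid.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=0)   # (3, N) homogeneous pixels

    z = depth[ys.ravel(), xs.ravel()]
    pts3d = np.linalg.inv(K) @ pix * z                                   # unproject to 3D, (3, N)
    pts3d_h = np.vstack([pts3d, np.ones((1, pts3d.shape[1]))])           # (4, N)

    tracks = []
    for T_wc in poses:
        cam = T_wc @ pts3d_h                                             # move points into the new camera frame
        proj = K @ cam[:3]
        proj = proj[:2] / np.clip(proj[2], 1e-6, None)                   # perspective divide
        tracks.append(proj.T)                                            # (N, 2)
    return np.stack(tracks, axis=1)                                      # (N, T, 2)
```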

Composing Motions (Section 4.4)

Different types of motion prompts can be combined.

  • Strategy: For example, object control tracks are converted to displacements (deltas) and added to camera control tracks.

  • Effectiveness: While this 2D composition is an approximation that might fail for extreme camera motion, it works well for small to moderate movements.

  • Key Insight: This capability emerges without specific training for motion composition, further highlighting the model's generalized understanding of motion.

    The following figure (Figure 7 from the original paper) shows examples of composing motion prompts:

Figure 7. Compositions of Motion Prompts. By composing motion prompts together, we can attain simultaneous object and camera control. For example, here we move the dog and horse's head while orbiting the camera from left to right.
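
A small sketch of the 2D composition described above, assuming both prompts are defined for the same set of points: the object-control tracks are reduced to per-frame displacements and added on top of the camera-control tracks.

```python
import numpy as np

def compose_tracks(camera_tracks, object_tracks):
    """Compose object motion on top of camera motion in 2D.

    camera_tracks: (N, T, 2) tracks induced by the desired camera path
    object_tracks: (N, T, 2) tracks describing the object motion for the same points
    Returns (N, T, 2) composed tracks; a 2D approximation that works for
    small to moderate camera motion.
    """
    deltas = object_tracks - object_tracks[:, :1, :]   # object motion as displacements from frame 0
    return camera_tracks + deltas                      # add displacements onto the camera-induced paths
```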

Motion Transfer (Section 4.5)

This application allows applying motion from an existing video to a new first frame.

  • Strategy: Motion tracks are extracted from a source video (e.g., a person turning their head) and then applied to a different first frame (e.g., an image of a macaque).

  • Robustness: The model is surprisingly robust, even applying motions to out-of-domain images (e.g., monkey chewing motion to a photo of trees), leading to interesting Gestalt common-fate effects [34] where the motion reveals the hidden object.

    The following figure (Figure 8 from the original paper) shows examples of motion transfer:

Figure 8. Motion Transfer. By conditioning our model on extracted motion from a source video we can puppeteer a macaque, or even transfer the motion of a monkey chewing to a photo of trees. Best viewed as videos on our webpage.

Motion Magnification (Appendix E)

An additional application is to magnify subtle motions in a video.

  • Strategy:

    1. A tracking algorithm (TAPIR [12]) is run on an input video to estimate tracks.
    2. Tracks are smoothed (e.g., with a Gaussian blur over space and time) to reduce noise.
    3. The smoothed tracks are magnified (scaled).
    4. The first frame of the input video and the magnified tracks are fed to the model.
  • Result: Generates a new video where the subtle motions are exaggerated, making them easier to perceive.

    The following figure (Figure A4 from the original paper) shows examples of motion magnification:

Figure A4. Motion Magnification. We show the result of using our model to perform motion magnification: the first frame of two videos, and space-time slices through the blue line at different magnification factors (1×, 8×, 16×, and 32×).
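
A hedged sketch of the magnification recipe: smooth the estimated tracks (here only over time, for simplicity) with a Gaussian filter, then scale their displacements relative to the first frame by a magnification factor. The smoothing and magnification parameters are placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def magnify_tracks(tracks, factor=8.0, sigma_t=2.0):
    """Smooth and magnify subtle motion in point tracks.

    tracks:  (N, T, 2) tracks estimated from the input video (e.g., by TAPIR)
    factor:  magnification applied to displacements from the first frame
    sigma_t: temporal Gaussian smoothing (in frames) to suppress tracking noise
    Returns (N, T, 2) magnified tracks to condition the model on.
    """
    smoothed = gaussian_filter1d(tracks.astype(np.float64), sigma=sigma_t, axis=1)
    deltas = smoothed - smoothed[:, :1, :]             # motion relative to the first frame
    return smoothed[:, :1, :] + factor * deltas        # exaggerate the motion
```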

Human Pose Control (Appendix D)

The method can also be used to control human figures.

  • Strategy:
    1. An off-the-shelf pose estimation model is used to estimate human pose keypoints from an image.

    2. Desired motions are applied to these keypoints.

    3. These animated keypoints are then translated into motion tracks.

    4. The tracks are fed to the model.

      The following figure (Figure A3 from the original paper) shows examples of human pose control:

Figure A3. Pose Conditioning. We estimate human pose, animate it, translate it to tracks, and then feed it to our model. In each row, we show frames from generated videos with input tracks overlaid on top.

Failures, Limitations, and Probing Models (Section 4.6)

The paper categorizes failures into two types:

  1. Motion Conditioning/Prompting Failures: E.g., a cow's head is unnaturally stretched because its horns were mistakenly "locked" to the background by incorrectly generated tracks.
  2. Underlying Video Model Failures: E.g., a new chess piece spontaneously forms when dragging an existing one. This suggests a limitation in the model's understanding of object permanence or physics.
  • Probing Potential: These failures highlight how motion prompts can be used to probe the limitations of the underlying model's learned representations, revealing gaps in its physics or world knowledge.

    The following figure (Figure 9 from the original paper) shows examples of failures used for probing:

Figure 9. Probing by Failures. We can use motion prompts to probe limitations of the underlying model. For example, dragging the chess piece results in the creation of a new piece.

5. Experimental Setup

5.1. Datasets

Training Dataset

  • Source: An internal dataset of 2.2 million videos.
  • Characteristics: The videos are diverse and are not filtered.
  • Preprocessing: Each video is center-cropped to a square and resized to 256×256. BootsTAP [13] is run to extract 16,384 dense tracks and predicted occlusions per video. During Lumiere fine-tuning, videos are resized to 128×128 to match the base model's input/output size.

Evaluation Dataset

  • Dataset: DAVIS video dataset [53], specifically its validation split.
  • Characteristics: Contains 30 videos covering a wide range of scenes, including subjects like sports, humans, animals, and cars.
  • Preprocessing:
    • First frames and tracks are extracted from the dataset using BootsTAP.
    • First frame inputs are square crops of the original first frames.
    • Text prompts are automatically generated from the video titles (typically one or two words).
    • For each evaluation, a random number of tracks (varying from 1 to 2048) is sampled for conditioning.

Example of Data Sample

For the DAVIS dataset, an example first frame could be an image of a running dog, and the text prompt might be "dog". The motion prompt would consist of point tracks specifying the dog's movement or the camera's motion. For instance, Figure 7 from the paper shows a golden retriever and a horse, which could be part of such a video. The first frame of the golden retriever laying in the grass, combined with motion tracks causing it to move its head, would be an example.

The purpose of choosing DAVIS is that it is a standard benchmark for video object segmentation and tracking, providing diverse real-world video content suitable for evaluating video generation quality and motion adherence.

5.2. Evaluation Metrics

The evaluation uses a combination of metrics to assess both the visual quality (appearance) and the motion adherence of the generated videos.

PSNR (Peak Signal-to-Noise Ratio)

  • Conceptual Definition: PSNR measures the quality of reconstruction of a lossy compressed image compared to the original. It quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher PSNR generally indicates better image quality. It's often used as a proxy for image quality, though it doesn't always correlate well with human perception.
  • Mathematical Formula: \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{(\mathrm{MAX}_I)^2}{\mathrm{MSE}} \right)
  • Symbol Explanation:
    • \mathrm{PSNR}: Peak Signal-to-Noise Ratio, measured in decibels (dB).
    • \mathrm{MAX}_I: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
    • \mathrm{MSE}: Mean Squared Error between the original and the generated image: \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 Where:
      • I(i,j): The pixel value of the original image at row i and column j.
      • K(i,j): The pixel value of the generated image at row i and column j.
      • M, N: The dimensions (height and width) of the image.
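
A direct implementation of the PSNR definition above, assuming same-sized 8-bit images stored as NumPy arrays.

```python
import numpy as np

def psnr(original, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two same-sized images."""
    original = original.astype(np.float64)
    generated = generated.astype(np.float64)
    mse = np.mean((original - generated) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```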

SSIM (Structural Similarity Index Measure)

  • Conceptual Definition: SSIM is a perceptual metric that quantifies the degradation of structural information in an image compared to a reference image. Unlike PSNR which focuses on absolute errors, SSIM attempts to model human visual perception by considering changes in structural information (luminance, contrast, and structure). A value closer to 1 indicates higher similarity.
  • Mathematical Formula: \mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma} Where:
    • l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}: Luminance comparison function.
    • c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}: Contrast comparison function.
    • s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}: Structure comparison function.
    • In practice, \alpha = \beta = \gamma = 1 is often used for simplicity.
  • Symbol Explanation:
    • x, y: Two image patches being compared.
    • \mu_x, \mu_y: The mean pixel values of x and y, respectively.
    • \sigma_x, \sigma_y: The standard deviations of x and y, respectively (a measure of contrast).
    • \sigma_{xy}: The covariance of x and y (a measure of structural correlation).
    • C_1 = (K_1 L)^2, C_2 = (K_2 L)^2, C_3 = C_2/2: Small constants to prevent division by zero, where L is the dynamic range of pixel values (e.g., 255 for 8-bit images), and K_1 \ll 1, K_2 \ll 1 are small default constants.

LPIPS (Learned Perceptual Image Patch Similarity)

  • Conceptual Definition: LPIPS (also known as "perceptual distance") is a metric that measures the perceptual similarity between two images. Unlike traditional metrics like PSNR or SSIM, LPIPS is learned using a deep neural network (often a pre-trained VGG or AlexNet). It computes the distance between deep features extracted from two images, which tends to correlate much better with human judgments of image similarity. A lower LPIPS score indicates higher perceptual similarity.
  • Mathematical Formula: There isn't a single simple mathematical formula for LPIPS in the same way as PSNR or SSIM, as it relies on a learned model. Conceptually, it is defined as: \mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\phi_l(x)_{h,w} - \phi_l(y)_{h,w}) \|_2^2
  • Symbol Explanation:
    • x, y: The two input images.
    • l: Index over different layers of the pre-trained deep network (e.g., VGG, AlexNet).
    • \phi_l(\cdot): The feature extractor (output of layer l) of the pre-trained network.
    • w_l: A learned scaling vector for each channel in layer l.
    • \odot: Element-wise multiplication.
    • H_l, W_l: Height and width of the feature maps at layer l.
    • \| \cdot \|_2^2: Squared L2 norm, measuring the Euclidean distance between feature vectors. Essentially, LPIPS measures a weighted L2 distance between normalized feature stacks produced by a pre-trained network.

FVD (Frechet Video Distance)

  • Conceptual Definition: FVD is a metric used to evaluate the quality and realism of generated videos, analogous to FID (Frechet Inception Distance) for images. It measures the Frechet distance between feature representations of real videos and generated videos. The underlying assumption is that good generative models produce samples whose feature distributions are close to those of real data. Lower FVD scores indicate higher quality and realism in the generated videos.
  • Mathematical Formula: Like LPIPS, FVD does not have a simple single formula, as it involves computations using a pre-trained feature extractor (typically a video classification network like the Inflated 3D ConvNet, I3D). Conceptually, it is the Frechet distance between two multivariate Gaussian distributions fitted to the feature embeddings of real and generated videos: \mathrm{FVD} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})
  • Symbol Explanation:
    • \mu_r: Mean of the feature embeddings for real videos.
    • \mu_g: Mean of the feature embeddings for generated videos.
    • \Sigma_r: Covariance matrix of the feature embeddings for real videos.
    • \Sigma_g: Covariance matrix of the feature embeddings for generated videos.
    • \| \cdot \|_2^2: Squared L2 norm.
    • \mathrm{Tr}(\cdot): Trace of a matrix.
    • (\cdot)^{1/2}: Matrix square root. The features are extracted from intermediate layers of a video classification model.
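
The sketch below computes the Frechet-distance component of FVD from two sets of video feature embeddings (e.g., from an I3D network); extracting those embeddings is out of scope, and the small epsilon regularization is a common practical choice rather than part of the formal definition.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen, eps=1e-6):
    """Frechet distance between Gaussians fit to real and generated video features.

    feats_real, feats_gen: (num_videos, feature_dim) embedding matrices
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of covariances, regularized for stability.
    covmean = sqrtm((sigma_r + eps * np.eye(len(mu_r))) @
                    (sigma_g + eps * np.eye(len(mu_g))))
    covmean = covmean.real                      # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```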

EPE (End-Point Error)

  • Conceptual Definition: EPE measures how well the generated video adheres to the specified motion conditioning. It quantifies the average L2 distance (Euclidean distance) between the ground truth (or specified) motion tracks and the motion tracks extracted from the generated videos. A lower EPE indicates that the generated video's motion more accurately matches the desired motion.
  • Mathematical Formula: For a single track n at a single timestep t: \mathrm{EPE}_{n,t} = \sqrt{(x_t^{n, \mathrm{gen}} - x_t^{n, \mathrm{cond}})^2 + (y_t^{n, \mathrm{gen}} - y_t^{n, \mathrm{cond}})^2} The overall EPE is typically the average over all tracks and all visible timesteps: \mathrm{EPE} = \frac{1}{\sum_{n,t} \mathbf{v}[n,t]} \sum_{n=1}^{N} \sum_{t=1}^{T} \mathbf{v}[n,t] \cdot \mathrm{EPE}_{n,t}
  • Symbol Explanation:
    • \mathrm{EPE}_{n,t}: End-Point Error for the n-th track at timestep t.
    • (x_t^{n, \mathrm{gen}}, y_t^{n, \mathrm{gen}}): The 2D coordinates of the n-th track extracted from the generated video at timestep t.
    • (x_t^{n, \mathrm{cond}}, y_t^{n, \mathrm{cond}}): The 2D coordinates of the n-th track from the conditioning signal (ground truth) at timestep t.
    • \mathbf{v}[n,t]: Visibility flag for the n-th track at timestep t, ensuring that only visible tracks are included in the average.
    • N: Total number of tracks.
    • T: Total number of timesteps (frames).
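
A minimal implementation of the visibility-weighted EPE defined above, assuming the conditioning tracks and the tracks re-estimated from the generated video are aligned point-for-point.

```python
import numpy as np

def end_point_error(tracks_gen, tracks_cond, visibility):
    """Average end-point error over all visible track points.

    tracks_gen:  (N, T, 2) tracks extracted from the generated video
    tracks_cond: (N, T, 2) conditioning (ground-truth) tracks
    visibility:  (N, T)    visibility flags; only visible points are counted
    """
    per_point = np.linalg.norm(tracks_gen - tracks_cond, axis=-1)   # (N, T) L2 distances
    mask = visibility.astype(np.float64)
    return float((per_point * mask).sum() / np.clip(mask.sum(), 1.0, None))
```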

5.3. Baselines

The paper compares its method against two recent works in motion-conditioned video generation:

  • Image Conductor [39]: This model finetunes AnimateDiff [20] specifically for camera and object motion.
    • Accommodation for Fair Comparison: Image Conductor is trained on videos of resolution 384×256. For evaluation, the input frame to Image Conductor was reflection-padded to 384×256 and the output was cropped, as this yielded slightly better results than directly using 256×256 inputs.
  • DragAnything [79]: This model is designed to move "entities" along tracks by finetuning Stable Video Diffusion [6].
    • Accommodation for Fair Comparison: DragAnything requires segmentation masks for the objects being moved. For the DAVIS evaluation, these masks were obtained from the ground truth segmentations provided in the DAVIS dataset. For human studies, SAM [38] was used with initial track locations as query points to generate masks.

      These baselines are representative as they are recent, competitive methods in the domain of controllable video generation via motion.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents both quantitative and qualitative results, along with a human study and ablation studies, demonstrating the effectiveness and versatility of motion prompts.

Quantitative Evaluation

The quantitative evaluation focuses on comparing the proposed method against Image Conductor and DragAnything on the DAVIS validation dataset. The evaluation metrics assess both appearance (PSNR, SSIM, LPIPS, FVD) and motion adherence (EPE). The number of conditioning tracks (N) is varied to evaluate performance across different densities.

The results, presented in Table 1, show that the proposed Motion Prompting model generally outperforms the baselines in almost all cases across various track densities.

  • Appearance Metrics (PSNR, SSIM, LPIPS, FVD): Our method consistently achieves higher PSNR and SSIM (better image quality/similarity), and lower LPIPS and FVD (better perceptual quality and video realism). This indicates that the generated videos are visually more appealing, more consistent with the first frame, and more realistic overall.
  • Motion Adherence (EPE): Our method generally shows lower EPE (better adherence to motion tracks), especially at higher track densities (N=512, N=2048).
    • At the sparsest setting (N=1), DragAnything achieves a lower EPE than our method. The authors attribute this to DragAnything's latent warping module, which can force close adherence to the conditioning tracks. However, this comes at a cost: DragAnything consistently underperforms on all appearance metrics (PSNR, SSIM, LPIPS, FVD), suggesting that while its motion may be accurate, it introduces visual artifacts and lower overall video quality. Our method strikes a better balance, achieving strong motion adherence together with superior visual fidelity.
  • Impact of Track Density: The performance gap for our method is particularly pronounced at higher track densities (N=512, N=2048), where it achieves significantly lower FVD (e.g., 688.7 vs 1379.8 for DragAnything at N=512) and EPE (e.g., 4.055 vs 10.948 at N=512), demonstrating its ability to leverage more detailed motion information effectively.

Human Study

A human study was conducted using a two-alternative forced choice (2AFC) protocol. Participants were shown two videos (one from our method, one from a baseline) and asked three questions:

  1. Which video follows the motion conditioning better?

  2. Which video has more realistic motion?

  3. Which video has higher visual quality?

    The results, presented in Table 2, show strong preference for our method across all categories.

  • Motion Adherence: Our method was preferred 74.3% of the time against Image Conductor and 74.5% against DragAnything for following motion conditioning better.

  • Motion Quality: Our method was preferred 80.5% of the time against Image Conductor and 75.7% against DragAnything for more realistic motion.

  • Visual Quality: Our method was preferred 77.3% of the time against Image Conductor and 73.7% against DragAnything for higher visual quality.

    These human study results corroborate the quantitative findings, confirming that not only does our model perform well on objective metrics, but it also aligns better with human perception of motion adherence, realism, and visual quality.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| # Tracks | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FVD ↓ | EPE ↓ |
|---|---|---|---|---|---|---|
| N = 1 | Image Conductor | 11.468 | 0.145 | 0.529 | 1919.8 | 19.224 |
| N = 1 | DragAnything | 14.589 | 0.241 | 0.420 | 1544.9 | 9.135 |
| N = 1 | Ours | 15.431 | 0.266 | 0.368 | 1445.2 | 14.619 |
| N = 16 | Image Conductor | 12.184 | 0.175 | 0.502 | 1838.9 | 24.263 |
| N = 16 | DragAnything | 15.119 | 0.305 | 0.378 | 1282.8 | 9.800 |
| N = 16 | Ours | 16.618 | 0.405 | 0.319 | 1322.0 | 8.319 |
| N = 512 | Image Conductor | 11.902 | 0.132 | 0.524 | 1966.3 | 30.734 |
| N = 512 | DragAnything | 15.055 | 0.289 | 0.381 | 1379.8 | 10.948 |
| N = 512 | Ours | 18.968 | 0.583 | 0.229 | 688.7 | 4.055 |
| N = 2048 | Image Conductor | 11.609 | 0.120 | 0.538 | 1890.7 | 33.561 |
| N = 2048 | DragAnything | 14.845 | 0.286 | 0.397 | 1468.4 | 12.485 |
| N = 2048 | Ours | 19.327 | 0.608 | 0.227 | 655.9 | 3.887 |

The following are the results from Table 2 of the original paper:

| Ours vs. | Motion Adherence | Motion Quality | Visual Quality |
|---|---|---|---|
| Image Conductor | 74.3 (±1.1) | 80.5 (±1.0) | 77.3 (±1.0) |
| DragAnything | 74.5 (±1.1) | 75.7 (±1.1) | 73.7 (±1.0) |

Values are the percentage of 2AFC comparisons in which our method was preferred over the listed baseline.
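The ± values in Table 2 are consistent with the standard error of a binomial preference rate, $\sqrt{p(1-p)/n}$. The sketch below illustrates this under that assumption; the rating count used here is hypothetical, and the paper may report a different type of interval.

```python
import math

def preference_rate(num_prefer_ours: int, num_comparisons: int):
    """Preference percentage and its standard error under a binomial model."""
    p = num_prefer_ours / num_comparisons
    se = math.sqrt(p * (1.0 - p) / num_comparisons)
    return 100.0 * p, 100.0 * se

# Hypothetical example: 1189 of 1600 ratings prefer ours -> about 74.3 (±1.1).
rate, se = preference_rate(1189, 1600)
```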

6.3. Ablation Studies / Parameter Analysis

An ablation study was conducted to understand the impact of track density during training on model performance. Three training strategies were compared:

  1. Sparse: Training only with sparse point trajectories (1-8 tracks).

  2. Dense + Sparse: Training with the number of tracks sampled logarithmically from $2^0$ to $2^{13}$ (see the sketch after this list).

  3. Dense: Training only with dense tracks (as described in the main methodology, 1000-2000 tracks uniformly).

    The ablation uses a subset of the DAVIS evaluation with fewer training steps. The results are shown in Table 3.
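Before turning to the table, the sketch below illustrates one plausible reading of the three training strategies as per-example track-count samplers. The ranges follow the list above, but the use of integer powers of two for the logarithmic sampling (rather than, say, a continuous log-uniform count) and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_num_tracks(strategy: str) -> int:
    """Sample the number of conditioning tracks for one training example."""
    if strategy == "sparse":
        # Sparse: 1-8 point trajectories.
        return int(rng.integers(1, 9))
    if strategy == "dense_plus_sparse":
        # Dense + Sparse: counts spread logarithmically from 2^0 to 2^13,
        # realized here by sampling the exponent uniformly.
        return int(2 ** rng.integers(0, 14))
    if strategy == "dense":
        # Dense: 1000-2000 tracks, sampled uniformly.
        return int(rng.integers(1000, 2001))
    raise ValueError(f"unknown strategy: {strategy}")
```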

The following are the results from Table 3 of the original paper:

| # Tracks | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FVD ↓ | EPE ↓ |
|---|---|---|---|---|---|---|
| N = 4 | Sparse | 15.075 | 0.241 | 0.384 | 1209.2 | 30.712 |
| N = 4 | Dense + Sparse | 15.162 | 0.252 | 0.379 | 1230.6 | 29.466 |
| N = 4 | Dense | 15.638 | 0.296 | 0.349 | 1254.9 | 24.553 |
| N = 2048 | Sparse | 15.697 | 0.284 | 0.355 | 1322.0 | 26.724 |
| N = 2048 | Dense + Sparse | 15.294 | 0.246 | 0.375 | 1267.8 | 27.931 |
| N = 2048 | Dense | 19.197 | 0.582 | 0.230 | 729.0 | 4.806 |

Analysis of Ablation Results:

  • Dense Training is Most Effective: For both low (N=4) and high (N=2048) inference track densities, training with Dense tracks (1000-2000 tracks) yields the best performance on nearly every metric (higher PSNR and SSIM; lower LPIPS and EPE); the one exception in Table 3 is FVD at N=4, where the Sparse variant is slightly lower.

  • Surprising Generalization to Sparse Tracks: Counter-intuitively, Dense training also performs better for sparse tracks at inference (N=4) compared to Sparse or Dense + Sparse training. The authors hypothesize that the sparse training signal is insufficient for effective learning, making it more efficient to train on dense tracks which then generalizes to sparser ones. This might be influenced by the ControlNet architecture and zero convolutions which facilitate such generalization.

  • Impact on Motion Adherence (EPE): The difference in EPE is particularly striking at high track densities (N=2048). Dense training achieves an EPE of 4.806, which is significantly lower than Sparse (26.724) or Dense + Sparse (27.931) training. This highlights that dense training is crucial for the model to effectively utilize and adhere to a large number of conditioning tracks.

  • Visual Quality: Similar trends are observed in visual quality metrics. For N=2048, Dense training results in considerably better PSNR, SSIM, LPIPS, and FVD, indicating more realistic and perceptually pleasing videos.

    In summary, the ablation study strongly supports the design choice of training the model on dense point trajectories to achieve the best overall performance and generalization capabilities, even for generating videos conditioned on sparse motions.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Motion Prompting, a robust and flexible framework for motion-conditioned video generation. By employing spatio-temporally sparse or dense motion trajectories as motion prompts, the method overcomes the limitations of text-only control, offering granular and expressive control over video dynamics. The ControlNet-based architecture, built on Lumiere, effectively learns to adhere to these trajectories. A key innovation is motion prompt expansion, which translates high-level user intentions into detailed motion inputs. The versatility of the approach is demonstrated across a wide array of applications, including object and camera control, motion transfer, and interactive image editing. The emergence of realistic physics in generated videos suggests the model's intrinsic understanding of the world. Quantitative evaluations and human studies confirm the superior performance of Motion Prompting compared to existing baselines in terms of both motion adherence and visual quality.

7.2. Limitations & Future Work

The authors acknowledge several limitations and implicitly suggest avenues for future work:

  • Real-time and Causal Generation: The current method is not real-time and not causal (i.e., it cannot generate video frame-by-frame based on real-time input or predict future frames in a truly causal manner). Generating an output video takes approximately 12 minutes. This is a significant practical limitation for interactive applications or truly dynamic world models. Future work could focus on accelerating inference and achieving causal generation.
  • Approximations in Motion Composition: The 2D composition of object and camera control tracks by adding displacements is an approximation and can fail for extreme camera motion (a minimal sketch of this additive composition appears after this list). More sophisticated, physically-grounded composition methods could be explored.
  • Track Quality and Smoothing: For applications like motion magnification, smoothing was necessary to reduce noise in estimated tracks from the tracking algorithm. The authors suggest that more accurate point tracking algorithms would remove this need, implying an ongoing reliance on the quality of external tracking tools. Improvements in tracking could directly enhance the model's capabilities.
  • Failures and Probing Models: The observed failures, such as objects stretching unnaturally or new objects spontaneously forming (e.g., the chess piece example), highlight areas where the underlying video model's learned physics or object permanence is limited. This suggests future research could use motion prompts as a systematic probing tool to identify and address these limitations, potentially leading to more robust generative world models. The paper itself encourages interacting with future generative world models.
  • Training Loss vs. Performance: The observation that training loss does not correlate with performance and the sudden convergence phenomena suggest that the training objective or architecture could potentially be refined for more stable and predictable learning, perhaps by adopting solutions like cross normalization as explored in ControlNext [50].
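As context for the composition limitation above, camera and object control tracks are composed in 2D by adding displacements. The following minimal sketch shows that additive approximation; the array names and shapes are assumptions, and, as noted, the approximation can break down under extreme camera motion.

```python
import numpy as np

def compose_tracks(initial_points: np.ndarray,
                   camera_disp: np.ndarray,
                   object_disp: np.ndarray) -> np.ndarray:
    """Compose camera and object motion by summing 2D displacements.

    initial_points: (N, 2) track locations in the first frame.
    camera_disp, object_disp: (N, T, 2) per-frame displacements relative to
    the first frame, induced by the camera path and by the object control.
    Returns composed tracks of shape (N, T, 2).
    """
    return initial_points[:, None, :] + camera_disp + object_disp
```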

7.3. Personal Insights & Critique

This paper presents a highly intuitive and powerful approach to controlling video generation. The core idea of using motion trajectories as motion prompts is very elegant, as motion is inherently a sequence of positions over time.

Inspirations Drawn:

  • Unified Control Paradigm: The ability to use a single, flexible representation (point tracks) for such a diverse range of motion control tasks (object, camera, scene, transfer, editing) is truly inspiring. It suggests that a fundamental understanding of motion can lead to highly generalizable generative models.
  • Emergent Physics: The observation of emergent realistic physics (e.g., hair dynamics, sand movement) from general motion training is profound. It implies that these large generative models implicitly learn complex physical rules, and motion prompts provide a direct way to interrogate and harness this knowledge. This is a crucial step towards building generative world models that can simulate realistic interactions.
  • Accessibility through Expansion: Motion prompt expansion is a brilliant practical solution. Without it, the expressive power of dense trajectories would be inaccessible to most users. This concept could be applied to other complex conditional inputs in generative AI, translating simple user intent into rich, structured control signals.
  • Diagnostic Tool: Using failures as a probing mechanism is a valuable insight for research. It provides a systematic way to identify the limitations and biases within large generative models, guiding future development towards more physically accurate and coherent generations.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost & Speed: The 12-minute generation time per video is a significant barrier for true interactivity and iteration. While impressive for quality, scaling this to real-time applications is crucial for widespread adoption and deeper interactive research into world models. Research into faster diffusion samplers or more efficient ControlNet integrations could address this.

  • User Interface for Motion Prompt Expansion: While motion prompt expansion is conceptually powerful, the actual user experience for defining complex motion primitives, combining motions, or performing precise depth-based camera control still requires some technical understanding. Developing more intuitive, possibly AI-assisted GUIs (e.g., natural language commands for motion, drag-and-drop motion libraries, or even sketching motion paths directly) would enhance accessibility further.

  • Fidelity of Extracted Tracks: The quality of the generated video is inherently tied to the quality of the input motion tracks, whether generated by BootsTAP or through motion prompt expansion. Errors or noise in these tracks can lead to artifacts or unnatural motions. Improving the robustness of point tracking algorithms or incorporating uncertainty estimation in the motion prompts could be beneficial.

  • Generalization of Physics: While the emergent physics is exciting, its consistency and accuracy across all scenarios need to be further investigated. Do the models learn true physics, or merely plausible visual correlations? Motion prompts can help probe this, but explicit integration of physics engines or constraints might be necessary for specific, high-fidelity physical simulations.

  • Causality and Long-Term Coherence: The non-causal nature implies the model generates the entire video at once. For world models that need to react to dynamic inputs or simulate extended sequences, true causality and long-term temporal coherence (beyond 5 seconds) are critical. This would require advancements in video diffusion model architectures themselves.

  • Ambiguity in Sparse Prompts: While the model generalizes well to sparse prompts, there might still be inherent ambiguity. How the model "fills in the blanks" for sparsely defined motions could be influenced by biases in the training data, leading to unexpected or undesirable results in certain contexts.

    Overall, Motion Prompting represents a significant advancement in controllable video generation, offering a highly flexible and interpretable interface for directing video content. Its insights into emergent physics and its potential as a probing tool for generative world models open exciting new avenues for research beyond just content creation.
