MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
TL;DR Summary
MoTrans introduces a customized motion transfer method using a multimodal large language model recaptioner and an appearance injection module, effectively transferring specific human-centric motions from reference videos to new contexts, outperforming existing techniques.
Abstract
Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and the appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new context. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt appearance prior from video frames to the motion modeling process. These complementary multimodal representations from recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion pattern from singular or multiple reference videos, performing favorably against existing methods in customized video generation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
1.2. Authors
The authors of this paper are Xiaomin Li, Xu Jia, Qinghe Wang, Haiwen Diao, Mengmeng Ge, Pengxiang Li, You He, and Huchuan Lu. The authors are primarily affiliated with Dalian University of Technology and Tsinghua University, which are prominent research institutions in China with strong programs in computer science and artificial intelligence.
1.3. Journal/Conference
This paper was published in the Proceedings of the 32nd ACM International Conference on Multimedia (MM '24). ACM Multimedia is a premier international conference in the field of multimedia. It is highly competitive and well-regarded, covering a wide range of topics from multimedia content analysis to generation and interaction. Publication at this venue signifies a high level of research quality and impact.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses a key limitation in existing text-to-video (T2V) diffusion models: their inability to generate complex, human-centric motions. Current fine-tuning approaches often fail to separate the motion from the appearance in the reference videos, a problem known as appearance-motion coupling. To solve this, the authors propose MoTrans, a method for customized motion transfer. MoTrans employs a two-pronged strategy to decouple appearance and motion. First, it uses a Multimodal Large Language Model (MLLM)-based recaptioner to generate detailed text prompts focusing on appearance. Second, it uses an appearance injection module to feed visual information from the reference frames directly into the model. These complementary inputs help the model focus on motion dynamics. Additionally, a motion-specific embedding is introduced to enhance the representation of the target motion. The paper's experiments show that MoTrans can effectively learn a specific motion from one or more reference videos and apply it to new subjects in new contexts, outperforming existing methods.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2412.01343
- PDF Link: https://arxiv.org/pdf/2412.01343v1.pdf
- Publication Status: This is a preprint version of a paper accepted at the MM '24 conference, available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper tackles is the difficulty of generating specific and complex human motions using general-purpose Text-to-Video (T2V) models. While foundational models like Sora or VideoCrafter can generate videos with basic movements, they often fall short when a user wants to specify a particular intricate action, such as a "golf swing" or "skateboarding trick," and apply it to a new character (e.g., a panda).
The primary challenge in achieving this "motion customization" is appearance-motion coupling. When a T2V model is fine-tuned on a few reference videos showing a specific motion, it tends to memorize not just the motion but also the appearance of the subject and background in those videos. As a result, when asked to generate a "panda skateboarding," it might produce a video of the original human skateboarder from the reference video, or a strange hybrid.
This paper's innovative entry point is to tackle this decoupling problem explicitly using complementary multimodal information. Instead of just trying to separate motion and appearance within the model architecture, MoTrans actively feeds the model rich descriptions of the appearance from two different sources:
- Textual: Using an MLLM to generate highly detailed text captions about the appearance.
- Visual: Directly injecting image features from the reference video frames. The intuition is that by providing the model with overwhelming information about appearance, the remaining parts of the model are forced to focus solely on learning the motion dynamics.
2.2. Main Contributions / Findings
The paper presents the following primary contributions and findings:
- A Novel Motion Transfer Method (MoTrans): The authors propose MoTrans, a full framework for learning a specific motion pattern from a single or multiple reference videos and transferring it to new subjects and scenes.
- Multimodal Decoupling Strategy: The core innovation is a two-part strategy to decouple appearance and motion:
  - MLLM-based Recaptioner: It uses a powerful MLLM to automatically generate detailed captions that describe the appearance in the reference video. This enriches the textual conditioning to better model appearance.
  - Appearance Prior Injector: It injects visual embeddings from the reference video frames into the model's temporal layers, providing a direct visual cue for appearance that encourages the temporal modules to focus on motion.
- Motion-Specific Embedding: To further improve motion modeling, the paper introduces a learnable residual embedding that is added to the text embedding of the motion-related verb (e.g., "swinging"). This helps the model capture the nuances of the specific motion from the reference video, rather than just a generic version of the action.
- State-of-the-Art Performance: The key finding is that MoTrans successfully generates videos that are faithful to the target motion while correctly rendering the new subject and context specified in the prompt. Experimental results show it outperforms previous methods by achieving a better balance between motion fidelity and avoiding appearance overfitting.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Diffusion Models
Diffusion models are a class of generative models that have become state-of-the-art in generating high-quality images and videos. They work in two steps:
- Forward (Noise) Process: This is a fixed process where a small amount of Gaussian noise is gradually added to a real data sample (e.g., an image) over a series of timesteps $t = 1, \dots, T$. By the end, the image becomes pure, indistinguishable noise.
- Reverse (Denoising) Process: This is the learned part. A neural network, typically a U-Net, is trained to reverse the noising process. At each timestep $t$, the model takes the noisy image and predicts the noise that was added. By subtracting this predicted noise, it can gradually denoise the image, starting from pure noise and ending with a clean, generated image. When conditioned on text (as in T2V models), the text prompt guides this denoising process.
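To make these two processes concrete, the standard DDPM-style formulation (general background knowledge, not a formula taken from this paper) writes the forward process in closed form and trains the network with a noise-prediction objective:
$ q(\mathbf{z}_t \mid \mathbf{z}_0) = \mathcal{N}(\mathbf{z}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0, (1 - \bar{\alpha}_t) I), \quad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s), $
$ \mathcal{L} = \mathbb{E}_{\mathbf{z}_0, \epsilon \sim \mathcal{N}(0, I), t} \left[ \| \epsilon - \epsilon_\theta(\mathbf{z}_t, t) \|_2^2 \right], \quad \mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, $
where $\beta_s$ is the noise schedule. The loss functions in Section 4 of this analysis are text-conditioned variants of this same objective.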
3.1.2. Text-to-Video (T2V) Generation
T2V generation extends the principles of text-to-image (T2I) diffusion models to create videos. The key challenge is modeling temporal consistency and motion across frames. This is typically achieved by modifying the U-Net architecture used in T2I models. In addition to the standard spatial attention layers (which operate within a single frame), temporal attention layers are added. These layers allow different pixels to "attend" to corresponding pixels in other frames, enabling the model to learn motion and maintain object identity over time.
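As a rough illustration of how such a factorized spatial/temporal attention block operates on a video latent (a generic sketch, not ZeroScope's actual implementation; module names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class FactorizedAttentionBlock(nn.Module):
    """Minimal sketch of spatial + temporal attention over a video latent.

    Input shape: (batch, channels, frames, height, width). Spatial attention
    mixes tokens within each frame; temporal attention mixes tokens at the
    same spatial position across frames.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, h, w = x.shape

        # Spatial attention: sequence = h*w tokens, batched over (batch, frame).
        xs = x.permute(0, 2, 3, 4, 1).reshape(b * f, h * w, c)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]

        # Temporal attention: sequence = f tokens, batched over (batch, spatial position).
        xt = xs.reshape(b, f, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, f, c)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]

        # Restore the original (b, c, f, h, w) layout.
        return xt.reshape(b, h * w, f, c).permute(0, 3, 2, 1).reshape(b, c, f, h, w)


# Usage: a 2-video batch of 8-frame, 16x16 latents with 64 channels.
video_latent = torch.randn(2, 64, 8, 16, 16)
out = FactorizedAttentionBlock(64)(video_latent)
assert out.shape == video_latent.shape
```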
3.1.3. LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning (PEFT) technique. Fine-tuning a large diffusion model (with billions of parameters) for a new task is computationally expensive and memory-intensive. LoRA addresses this by freezing the original pretrained model weights and injecting small, trainable "adapter" modules into certain layers (often the attention layers). These adapters consist of two low-rank matrices ($A$ and $B$) whose product approximates the full weight update. Because the rank $r$ is small, the number of trainable parameters in $A$ and $B$ is a tiny fraction of the original model's size. This makes it feasible to train custom models for specific subjects or motions on consumer hardware.
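A minimal sketch of the LoRA idea (generic background, not tied to this paper's codebase):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x)).

    The pretrained weight W is frozen; only the low-rank matrices A and B are
    trained, so the number of new parameters is r * (d_in + d_out) instead of
    d_in * d_out.
    """

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained projection
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


# Usage: wrap a frozen 768->768 projection, as one might do inside an attention layer.
lora_layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4 * 768 = 6144 trainable parameters vs. ~590k in the base layer
```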
3.1.4. Multimodal Large Language Models (MLLMs)
MLLMs are advanced AI models that can process and understand information from multiple modalities, most commonly text and images. Models like LLaVA and GPT-4 with Vision are examples. They can perform tasks that require a joint understanding of both, such as describing an image in detail, answering questions about it, or, as used in this paper, generating a new, more descriptive text caption for an image based on an initial prompt and instructions.
3.2. Previous Works
The paper positions itself relative to two main lines of research:
- General T2V Models: These are foundational models trained on massive datasets.
  - ModelScopeT2V and ZeroScope: These are open-source T2V diffusion models that build upon the Stable Diffusion architecture by adding temporal layers. MoTrans uses ZeroScope as its base model.
  - VideoCrafter2: An improved T2V model that uses a mix of low-quality videos (for motion) and high-quality images (for appearance) to achieve better results.
  - The paper notes that these models struggle with specific, complex motions as they are trained for general-purpose generation.
- Customized Motion Generation: These methods focus on the same problem as MoTrans.
  - Tune-a-Video: A one-shot method that fine-tunes a T2I model by treating a video as a sequence of images. It often suffers from poor temporal consistency and appearance overfitting.
  - MotionDirector: A strong baseline that uses a dual-path framework to learn appearance and motion separately. It freezes the spatial layers (trained for appearance) during motion training to encourage decoupling. The authors of MoTrans argue this is not sufficient to prevent appearance leakage.
  - DreamVideo: This method also aims to decouple appearance and motion by using separate "identity" and "motion" adapters. It injects appearance information into the motion adapter to force it to learn motion. MoTrans builds on this idea but uses a more comprehensive, multimodal approach for providing appearance priors.
  - LAMP: A few-shot method that learns a motion pattern. The authors note it has poor temporal consistency and relies on a high-quality initial frame.
3.3. Technological Evolution
The field has evolved from generating images to generating general videos, and now to controlling specific aspects of video generation.
- T2I Generation (e.g., Stable Diffusion): High-quality image generation from text.
- General T2V Generation (e.g., ModelScopeT2V, Gen-2, Sora): Extending T2I to video by adding temporal modeling. The focus is on generating realistic and diverse videos.
- Customized Generation: The focus shifts from general generation to user-specific control. This started with subject customization in images (e.g., DreamBooth) and has now moved to video.
- Motion Customization: A sub-field of video customization focused specifically on learning and transferring motion. Early methods struggled with appearance-motion coupling. MoTrans represents the latest effort to solve this decoupling problem more effectively by using multimodal priors.
3.4. Differentiation Analysis
Compared to its predecessors, MoTrans's core innovations are:
- Multimodal vs. Unimodal Decoupling: While methods like MotionDirector and DreamVideo attempt to decouple motion and appearance, they do so using primarily architectural constraints (freezing layers) or unimodal priors (injecting only visual features). MoTrans is distinct because it uses complementary multimodal priors:
  - Textual Prior: The MLLM-based recaptioner provides a rich, explicit textual description of the appearance.
  - Visual Prior: The appearance injector provides a direct visual feature representation of the appearance. This dual-pronged approach provides a much stronger signal to the model about what constitutes "appearance," thereby making it easier to isolate and learn the "motion."
- Explicit Motion Enhancement: The motion-specific embedding is a targeted mechanism that is not present in the other methods. By identifying the motion verb in the prompt and augmenting its embedding with information from the reference video, it directly enhances the model's ability to represent the nuances of the specific desired motion.
- Two-Stage Training: The explicit two-stage training process (first appearance, then motion) provides a clean separation of concerns, which is more structured than the joint or alternating training strategies used in some other methods.
4. Methodology
4.1. Principles
The core idea behind MoTrans is to force the model to decouple appearance and motion by tackling each component in a separate, dedicated stage. The overall strategy is a two-stage training process:
- Appearance Learning Stage: Focus exclusively on learning the appearance of the subject/scene from the reference video. This is done by training only the spatial components of the model on single frames, guided by a highly detailed appearance-focused text prompt.
- Motion Learning Stage: With the appearance components frozen, focus on learning the motion dynamics. To prevent any remaining appearance leakage, the model is explicitly fed appearance information (from the reference video frames) as a prior, forcing the temporal components to learn what is left: the motion. This motion signal is further amplified using a dedicated motion enhancement module.
The overall training pipeline is visualized in the paper's Figure 2.
This figure illustrates the two parts of the MoTrans framework: Appearance Learning and Motion Learning. The appearance learning branch involves a random frame and the MLLM-based recaptioner, while the motion learning branch includes an image encoder and the T2V model; the two branches share weights and are optimized with their corresponding loss functions.
4.2. Core Methodology In-depth
4.2.1. Stage 1: Appearance Learning
The goal of this stage is to train the spatial attention modules of the U-Net to capture the appearance of the reference video. This is achieved using single frames and an enhanced prompt.
1. MLLM-based Recaptioner:
To ensure the model has rich information about the appearance, a standard, simple prompt is insufficient. MoTrans uses an MLLM (e.g., LLaVA) as a "recaptioner."
- Input: A random frame from the reference video and a base prompt (e.g., "a person is skateboarding").
- Process: The MLLM is given a task instruction to expand the base prompt with detailed descriptions of the appearance in the given frame. For example, it might turn "a person is skateboarding" into "a person with a red helmet, black t-shirt, and blue jeans is skateboarding on a grey concrete pavement."
- Output: A recaptioned, appearance-rich prompt $\mathbf{c}_r$. This process is illustrated in the paper's Figure 3.

2. Training Spatial LoRAs:
- The model is trained on individual frames from the reference video.
- Only the spatial LoRAs are trained. These LoRAs are injected into the self-attention and feed-forward network (FFN) layers of the spatial transformers in the U-Net (as shown in Figure 4a).
- Crucially, the cross-attention layers, which are responsible for aligning the generation with the text prompt, are kept frozen to preserve the powerful text-to-image capabilities of the original pretrained model.
- The weights of these trained spatial LoRAs are saved and will be used (but frozen) in the next stage.
3. Appearance Learning Loss Function: The training objective is the standard denoising loss from diffusion models, applied to single frames. The model tries to predict the noise $\epsilon$ added to the latent representation $\mathbf{z}_0^i$ of a single frame.
$ \mathcal{L}_{s} = \mathbb{E}_{\mathbf{z}_{0}^{i}, \mathbf{c}_{r}, \epsilon \sim \mathcal{N}(0, I), t} \left[ \| \epsilon - \epsilon_{\theta}(\mathbf{z}_{t}^{i}, \tau_{\theta}(\mathbf{c}_{r}), t) \|_{2}^{2} \right]. $
- $\mathbf{z}_0^i$: The latent code of a single frame, obtained by an encoder such as a VQ-VAE.
- $\mathbf{c}_r$: The detailed, recaptioned prompt from the MLLM.
- $\epsilon$: A random noise sample from a standard normal distribution.
- $t$: A random timestep from 1 to $T$.
- $\mathbf{z}_t^i$: The noisy latent at timestep $t$.
- $\epsilon_\theta$: The U-Net model being trained (specifically, its spatial LoRAs).
- $\tau_\theta$: The frozen text encoder (e.g., OpenCLIP) that converts the text prompt into an embedding.
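A minimal sketch of one Stage-1 training step under these definitions; `unet`, `vae_encode`, and `text_encode` are placeholder callables standing in for the frozen base-model components (only the spatial LoRAs inside the U-Net would receive gradients), not the authors' actual code:

```python
import torch
import torch.nn.functional as F

def appearance_loss(unet, vae_encode, text_encode, alphas_cumprod,
                    frame, recaption_ids):
    """Compute L_s on a single frame; all callables are placeholders."""
    z0 = vae_encode(frame)                                   # latent z_0^i of the frame
    t = torch.randint(0, alphas_cumprod.numel(), (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)                             # epsilon ~ N(0, I)

    # Closed-form forward process: z_t = sqrt(a_bar_t) z_0 + sqrt(1 - a_bar_t) eps
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

    text_emb = text_encode(recaption_ids)                    # tau_theta(c_r)
    noise_pred = unet(zt, t, encoder_hidden_states=text_emb)
    return F.mse_loss(noise_pred, noise)                     # ||eps - eps_theta(...)||^2
```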
4.2.2. Stage 2: Motion Learning
In this stage, the goal is to learn the motion pattern from the full reference video, while actively preventing the model from learning the appearance. The spatial LoRAs trained in Stage 1 are loaded and frozen. Only the temporal LoRAs and a motion enhancer MLP are trained.
1. Appearance Injector: To force the temporal modules to focus on motion, they are preemptively given the appearance information.
- Process: An image encoder (OpenCLIP ViT-H/14) extracts an embedding for a randomly selected frame from the reference video. This single-frame embedding serves as the appearance prior.
- This embedding is projected by a linear layer and then added to the hidden states that come out of the spatial transformer for every frame in the video clip. This "injects" the appearance information before the data enters the temporal transformer.
- The architecture is shown in Figure 4b.
The following figure (Figure 4 from the original paper) shows the details of the LoRA injection and the appearance injector.
This figure is a schematic of the trainable spatial and temporal LoRA modules and the appearance injector in MoTrans. Panel (a) illustrates the design of the self-attention and cross-attention layers, with the spatial LoRA on the left and the temporal LoRA on the right; panel (b) shows how the appearance injector relates to the convolutional layer and the spatial and temporal transformers. The overall design aims to strengthen the learning of motion patterns in video generation.
This operation is defined by the formula: $ h_{t}^{l} = h_{s}^{l} \odot ( W_{\mathcal{P}} \cdot \psi(\mathbf{f}^{\mathbf{i}}) ), $
- $h_s^l$: The hidden state output from the spatial transformer in block $l$ of the U-Net. Its shape is (batch*height*width) x frames x channels.
- $\psi(\mathbf{f}^{\mathbf{i}})$: The image embedding of a single reference frame, representing the appearance prior.
- $W_{\mathcal{P}}$: A trainable linear projection layer.
- $\odot$: The broadcast operator. The paper text clarifies this is a broadcast summation: the projected appearance embedding is broadcast across all spatial positions and all frames and then added to $h_s^l$.
- $h_t^l$: The resulting hidden state that is fed into the temporal transformer.
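A minimal sketch of this injection, assuming the hidden-state layout described above and implementing the broadcast summation; shapes and the module name are illustrative:

```python
import torch
import torch.nn as nn

class AppearanceInjector(nn.Module):
    """Add a projected single-frame CLIP embedding to every spatial position and frame."""

    def __init__(self, clip_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(clip_dim, channels)  # trainable projection W_P

    def forward(self, h_s: torch.Tensor, frame_embedding: torch.Tensor) -> torch.Tensor:
        # frame_embedding: (batch, clip_dim) -> (batch, 1, channels)
        appearance = self.proj(frame_embedding).unsqueeze(1)
        # h_s: (batch*h*w, frames, channels); repeat the prior over spatial positions,
        # then rely on broadcasting over the frame dimension.
        spatial_tokens = h_s.shape[0] // frame_embedding.shape[0]
        appearance = appearance.repeat_interleave(spatial_tokens, dim=0)
        return h_s + appearance                    # h_t, fed into the temporal transformer


# Usage with toy shapes: batch 1, 16x16 latent, 8 frames, 320 channels, CLIP dim 1024.
injector = AppearanceInjector(clip_dim=1024, channels=320)
h_s = torch.randn(1 * 16 * 16, 8, 320)
clip_emb = torch.randn(1, 1024)
h_t = injector(h_s, clip_emb)
assert h_t.shape == h_s.shape
```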
2. Motion Enhancement: This module enhances the representation of the specific motion.
- Process: The intuition is that the motion is primarily linked to the verb in the prompt (e.g., "swinging," "kicking").
- A part-of-speech tagger (spaCy) is used to find the verb token $\mathbf{s_i}$ in the prompt.
- The base text embedding $E_b = \tau_\theta(\mathbf{s_i})$ for this verb is obtained from the text encoder.
- A video-level embedding is created by taking all frames of the reference video $\mathcal{V}$, encoding them with the image encoder $\psi$, and then taking the mean of all frame embeddings: $MeanPool(\psi(\mathcal{V}))$.
- The video embedding and the base verb embedding are concatenated and fed into a small MLP (two linear layers with a GELU activation) to produce a residual embedding $E_r$.
- This residual is added to the base verb embedding to create the final, enhanced motion-specific embedding $E_{cond}$.
The calculation for the residual embedding is: $ E_{r} = W_{2} \cdot ( \sigma_{GELU} ( W_{1} \cdot ( [ MeanPool(\psi(\mathcal{V})), \tau_{\theta}(\mathbf{s_{i}}) ] ) ) ), $
- $W_1, W_2$: Weights of the two linear layers in the MLP.
- $\sigma_{GELU}$: The GELU activation function.
- $[\cdot, \cdot]$: Concatenation operation.
- $MeanPool(\psi(\mathcal{V}))$: The average embedding of all frames in the video.
- $\tau_\theta(\mathbf{s_i})$: The base text embedding of the motion verb.
The final enhanced embedding is: $ E_{cond} = E_{b} + E_{r} . $
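A minimal sketch of this enhancer (module and dimension names are illustrative); the verb token itself would first be located with a POS tagger such as spaCy (e.g., checking `token.pos_ == "VERB"`):

```python
import torch
import torch.nn as nn

class MotionEnhancer(nn.Module):
    """Predict a residual E_r from video and verb embeddings; E_cond = E_b + E_r."""

    def __init__(self, clip_dim: int, text_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + text_dim, hidden_dim),  # W_1
            nn.GELU(),                                   # sigma_GELU
            nn.Linear(hidden_dim, text_dim),             # W_2
        )

    def forward(self, frame_embeddings: torch.Tensor, verb_embedding: torch.Tensor):
        video_embedding = frame_embeddings.mean(dim=1)                # MeanPool over frames
        residual = self.mlp(torch.cat([video_embedding, verb_embedding], dim=-1))  # E_r
        return verb_embedding + residual, residual                    # E_cond, E_r


# Usage: 8 frames encoded by an image encoder (dim 1024), verb embedding of dim 1024.
enhancer = MotionEnhancer(clip_dim=1024, text_dim=1024)
e_cond, e_r = enhancer(torch.randn(1, 8, 1024), torch.randn(1, 1024))
```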
3. Motion Learning Loss Function: The training now uses the full video clip. The loss function has two parts: the standard denoising loss and a regularization term for the motion enhancer.
The denoising loss is similar to before, but now operates on the full sequence of video latents and uses the original base prompt $\mathbf{c}_b$ (with the enhanced verb embedding swapped in). $ \mathcal{L}_{t} = \mathbb{E}_{\mathbf{z}_{0}^{1:N}, \mathbf{c}_{b}, \epsilon \sim \mathcal{N}(0, I), t} \left[ \| \epsilon - \epsilon_{\theta}(\mathbf{z}_{t}^{1:N}, \tau_{\theta}(\mathbf{c}_{b}), t) \|_{2}^{2} \right]. $
To prevent the residual embedding from becoming too large and overpowering the original text embedding, an L2 regularization term is added: $ \mathcal{L}_{reg} = \| E_{r} \|_{2}^{2}. $
The final loss for the motion learning stage is a weighted sum of these two losses: $ \mathcal{L}_{motion} = \mathcal{L}_{t} + \lambda \mathcal{L}_{reg}, $
- $\lambda$: A hyperparameter that controls the strength of the regularization.
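Putting the two terms together, a minimal sketch of the Stage-2 objective (the value of $\lambda$ shown is an arbitrary example, not the paper's setting):

```python
import torch
import torch.nn.functional as F

def motion_loss(noise_pred: torch.Tensor, noise: torch.Tensor,
                residual_embedding: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """L_motion = L_t + lambda * ||E_r||_2^2, averaged over the batch."""
    denoising = F.mse_loss(noise_pred, noise)                        # L_t on the full video latent
    regularization = residual_embedding.pow(2).sum(dim=-1).mean()    # ||E_r||_2^2
    return denoising + lam * regularization
```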
4.2.3. Inference Stage
During inference, a user provides a new prompt (e.g., "A panda is skateboarding in Times Square"). The pretrained T2V model (ZeroScope) is loaded, along with the trained temporal LoRAs and the trained motion enhancer MLP. The spatial LoRAs from Stage 1 are discarded. The model then generates a video guided by the new prompt, where the motion enhancer injects the learned specific motion pattern.
5. Experimental Setup
5.1. Datasets
- Source: The authors curated a custom dataset of 12 distinct motion patterns. These were sourced from standard action recognition datasets such as UCF101 and UCF Sports Action, as well as from the internet.
- Characteristics: The dataset includes a mix of sports motions (e.g., golf swing, weightlifting, skateboarding) and large-scale daily movements (e.g., waving hands, drinking water, wiping face). Each motion category contains 4-10 short training videos. The authors specifically chose motions with larger movement amplitudes to create a challenging benchmark.
- Example Data (Prompt Template): For evaluation, the authors used six prompt templates to generate videos with varied subjects and contexts. An example template is: "A {cat} is {motion} {in the living room}". Here, {cat} could be replaced with "panda," "alien," etc., {motion} is the learned motion, and {in the living room} could be replaced with other scenes.
5.2. Evaluation Metrics
The paper uses a combination of standard and newly proposed metrics to evaluate performance.
5.2.1. CLIP Textual Alignment (CLIP-T)
- Conceptual Definition: This metric measures how well the generated video content aligns with the given text prompt. It evaluates the overall semantic correspondence between text and video. A higher score means the video is a better representation of the entire prompt (subject, action, and scene).
- Mathematical Formula: It is typically calculated as the average cosine similarity between the CLIP text embedding of the prompt and the CLIP image embeddings of the video frames. $ \text{CLIP-T} = \frac{1}{N} \sum_{i=1}^{N} \cos(\text{CLIP}_{img}(f_i), \text{CLIP}_{txt}(P)) $
- Symbol Explanation:
- $N$: Number of frames in the video.
- $f_i$: The $i$-th frame of the generated video.
- $P$: The input text prompt.
- $\text{CLIP}_{img}$: The CLIP image encoder.
- $\text{CLIP}_{txt}$: The CLIP text encoder.
- $\cos$: The cosine similarity function.
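A minimal sketch of this metric using a public CLIP checkpoint from Hugging Face; the paper does not state which CLIP variant its reported numbers use, so the checkpoint here is an assumption:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(frames, prompt: str) -> float:
    """Average cosine similarity between each frame embedding and the prompt embedding."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).mean().item()
```

CLIP-E (Section 5.2.3) follows the same computation with the full prompt replaced by the entity-only prompt.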
5.2.2. Temporal Consistency (TempCons)
- Conceptual Definition: This metric assesses the smoothness and visual consistency of a video from one frame to the next. It helps quantify issues like flickering or sudden object changes. A higher score indicates a smoother, more stable video.
- Mathematical Formula: It is calculated as the average cosine similarity between the CLIP embeddings of adjacent frames. $ \text{TempCons} = \frac{1}{N-1} \sum_{i=1}^{N-1} \cos(\text{CLIP}_{img}(f_i), \text{CLIP}_{img}(f_{i+1})) $
- Symbol Explanation:
- $N$: Number of frames in the video.
- $f_i$: The $i$-th frame of the generated video.
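Reusing the CLIP model and processor from the CLIP-T sketch above, temporal consistency only changes what is compared: adjacent frame embeddings instead of frame-text pairs.

```python
@torch.no_grad()
def temp_cons(frames) -> float:
    """Average cosine similarity between CLIP embeddings of adjacent frames."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[:-1] * feats[1:]).sum(dim=-1).mean().item()
```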
5.2.3. CLIP Entity Alignment (CLIP-E)
- Conceptual Definition: This is a metric introduced by the authors to specifically diagnose appearance overfitting. It measures whether the generated video correctly depicts the subject (entity) from the new prompt, ignoring the motion and scene. For example, if the prompt is "a panda skateboarding" and the model generates the original human skateboarder, CLIP-E will be low.
- Mathematical Formula: It is calculated similarly to CLIP-T, but using a simplified prompt that contains only the entity name (e.g., "a panda"). $ \text{CLIP-E} = \frac{1}{N} \sum_{i=1}^{N} \cos(\text{CLIP}_{img}(f_i), \text{CLIP}_{txt}(P_{entity})) $
- Symbol Explanation:
- $P_{entity}$: The simplified prompt containing only the subject/entity (e.g., "a panda").
5.2.4. Motion Fidelity (MoFid)
- Conceptual Definition: This novel metric, proposed by the authors, aims to quantify how closely the motion in the generated video matches the motion in the reference videos. It uses a powerful pretrained video understanding model (VideoMAE) to extract motion-aware features from both the reference and generated videos and compares them. A higher score indicates that the motion was transferred more faithfully.
- Mathematical Formula: $ \mathcal{E}_{m} = \frac{1}{|\mathcal{M}| \, |\bar{v}_{m}|} \sum_{m \in \mathcal{M}} \sum_{k=1}^{|\bar{v}_{m}|} \cos( f(v_{m}^{i}), f(\bar{v}_{k}) ) $
- Symbol Explanation:
- $\mathcal{M}$: The set of all motion types being evaluated.
- $\bar{v}_m$: The set of generated videos for a specific motion $m$.
- $v_m^i$: A randomly selected reference (training) video for motion $m$.
- $\bar{v}_k$: The $k$-th generated video for motion $m$.
- $f$: The VideoMAE model used as a feature extractor.
- $\cos$: The cosine similarity function.
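A minimal sketch of this idea, assuming a public VideoMAE checkpoint and mean-pooled token features as the video descriptor; the exact checkpoint, clip sampling, and pooling the authors use are not specified in this summary, so those choices are assumptions:

```python
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

vm_processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
vm_model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

@torch.no_grad()
def video_feature(frames) -> torch.Tensor:
    """frames: list of 16 PIL images (VideoMAE's default clip length)."""
    inputs = vm_processor(frames, return_tensors="pt")
    tokens = vm_model(pixel_values=inputs["pixel_values"]).last_hidden_state
    feat = tokens.mean(dim=1).squeeze(0)          # pool patch tokens into one vector
    return feat / feat.norm()

@torch.no_grad()
def motion_fidelity(reference_frames, generated_clips) -> float:
    """Mean cosine similarity between one reference clip and each generated clip."""
    ref = video_feature(reference_frames)
    gens = torch.stack([video_feature(clip) for clip in generated_clips])
    return (gens @ ref).mean().item()
```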
5.3. Baselines
The paper compares MoTrans against several representative models:
- Inference-only Baselines:
  - ZeroScope: The base T2V model used without any fine-tuning.
  - VideoCrafter2: Another strong, open-source T2V model.
- Fine-tuning Baselines:
  - ZeroScope (fine-tune): Directly fine-tuning the temporal LoRAs of ZeroScope on the reference videos without the proposed decoupling mechanisms.
- One-shot Customization Baseline:
  - Tune-a-Video: A well-known method for single-video customization.
- Few-shot Customization Baselines:
  - LAMP: A method designed for few-shot motion pattern learning.
  - MotionDirector: A strong baseline and state-of-the-art method for motion customization. This is the most direct competitor.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Qualitative Results
The qualitative results in Figure 5 visually demonstrate the effectiveness of MoTrans.
The following figure (Figure 5 from the original paper) shows qualitative comparisons.
This figure shows motion videos generated from reference videos by different methods. The left example shows animations of "drinking water" generated from different reference videos, while the right shows "skateboarding." The figure compares several models, including ZeroScope, VideoCrafter, Tune-A-Video, and others, highlighting MoTrans's performance in customized video generation.
- Pretrained Models (ZeroScope, VideoCrafter2): As expected, these models fail to generate the specific, complex motions (like drinking or skateboarding), often producing static or minimally moving subjects.
- Simple Fine-tuning (ZeroScope (fine-tune)) and Tune-A-Video: These methods exhibit severe appearance overfitting. For the "skateboarding" task, they generate the original human skateboarder from the reference video instead of the prompted "panda."
- MotionDirector: This baseline shows some improvement but still suffers from appearance leakage. In the "skateboarding" example, it generates a panda, but the panda is wearing human-like clothes, an artifact inherited from the reference video. The motion magnitude is also diminished.
- MoTrans (Ours): The videos generated by MoTrans successfully transfer the motion (drinking water, skateboarding) to the new subjects (a tiger, a panda) in the new context, without noticeable appearance artifacts from the reference video. The motion is clear and faithful to the original.
6.1.2. Quantitative Results
The following are the results from Table 1 of the original paper:
| Setting | Method | CLIP-T (↑) | CLIP-E (↑) | TempCons (↑) | MoFid (↑) |
| --- | --- | --- | --- | --- | --- |
| Inference | ZeroScope [5] | 0.2017 | 0.2116 | 0.9785 | 0.4419 |
| Inference | VideoCrafter [7] | 0.2090 | 0.2228 | 0.9691 | 0.4497 |
| One-shot | Tune-a-video [46] | 0.1911 | 0.2031 | 0.9401 | 0.5627 |
| One-shot | ZeroScope (fine-tune) | 0.2088 | 0.2092 | 0.9878 | 0.6011 |
| One-shot | MotionDirector [55] | 0.2178 | 0.2130 | 0.9889 | 0.5423 |
| One-shot | MoTrans (ours) | 0.2192 | 0.2173 | 0.9872 | 0.5679 |
| Few-shot | LAMP [47] | 0.1773 | 0.1934 | 0.9587 | 0.4522 |
| Few-shot | ZeroScope (fine-tune) | 0.2191 | 0.2132 | 0.9789 | 0.5409 |
| Few-shot | MotionDirector | 0.2079 | 0.2137 | 0.9801 | 0.5417 |
| Few-shot | MoTrans (ours) | 0.2275 | 0.2192 | 0.9895 | 0.5695 |
- Analysis of the Trade-off: There is a clear trade-off between motion fidelity (MoFid) and entity alignment (CLIP-E). Methods like Tune-a-Video and fine-tuned ZeroScope achieve high MoFid but very low CLIP-E. This quantitatively confirms the qualitative observation of appearance overfitting: they are good at replicating the motion because they are also replicating the original subject.
- One-shot Setting: MoTrans achieves the highest CLIP-T and CLIP-E scores, demonstrating its superior ability to generate the correct new subject and context. Its MoFid score is strong and competitive, indicating it strikes the best balance between faithful motion transfer and avoiding overfitting.
- Few-shot Setting: In the few-shot setting (using multiple reference videos), MoTrans outperforms all other methods on all four metrics. This shows that with more data, its ability to learn a common motion pattern while generalizing to new appearances becomes even more pronounced.
6.1.3. User Study
The user study (Figure 6) reinforces the quantitative results. Participants were asked to choose the best video based on text alignment, smoothness (temporal consistency), and motion similarity without appearance leakage.
The following figure (Figure 6 from the original paper) shows the user study results.
This figure is a bar chart showing how MoTrans and the other models were rated on text alignment, temporal consistency, and motion fidelity. MoTrans performs best on all three dimensions, with a score of 0.65 on text alignment, 0.57 on temporal consistency, and 0.63 on motion fidelity.
MoTrans was preferred by a large majority of users across all three criteria, indicating that its generated videos align better with human perception of quality and correctness for this task.
6.2. Ablation Studies / Parameter Analysis
The ablation studies in Table 2 and Figure 7 systematically remove each key component of MoTrans to validate its contribution.
The following are the results from Table 2 of the original paper:
| Setting | Method | CLIP-T (↑) | CLIP-E (↑) | TempCons (↑) | MoFid (↑) |
| --- | --- | --- | --- | --- | --- |
| One-shot | w/o MLLM-based recaptioner | 0.2138 | 0.2101 | 0.9865 | 0.6129 |
| One-shot | w/o appearance injector | 0.2114 | 0.2034 | 0.9862 | 0.6150 |
| One-shot | w/o motion enhancer | 0.2164 | 0.2135 | 0.9871 | 0.5643 |
| One-shot | MoTrans | 0.2192 | 0.2173 | 0.9872 | 0.5679 |
| Few-shot | w/o MLLM-based recaptioner | 0.2179 | 0.2138 | 0.9792 | 0.5997 |
| Few-shot | w/o appearance injector | 0.2143 | 0.2132 | 0.9807 | 0.6030 |
| Few-shot | w/o motion enhancer | 0.2211 | 0.2171 | 0.9801 | 0.5541 |
| Few-shot | MoTrans | 0.2275 | 0.2192 | 0.9895 | 0.5695 |
- w/o MLLM-based recaptioner and w/o appearance injector: Removing either of the two decoupling components results in a drop in CLIP-T and CLIP-E, while MoFid increases. This is the classic sign of appearance overfitting. The model is less able to generate the prompted subject (lower CLIP-E), but because it is copying the appearance from the reference, the motion also appears more similar (higher MoFid). This confirms that both multimodal priors are essential for decoupling.
- w/o motion enhancer: Removing the motion-specific embedding leads to a drop in MoFid. This shows that this module is effective in helping the model capture the specific motion pattern more accurately.
- Conclusion: Each proposed module (the recaptioner, the injector, and the enhancer) provides a significant and distinct contribution to the final performance of MoTrans.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents MoTrans, a novel and effective method for customized motion transfer in text-to-video generation. The central problem of appearance-motion coupling, which plagues existing methods, is skillfully addressed through an innovative two-stage training strategy. The key contributions are the use of complementary multimodal priors—an MLLM-based recaptioner for rich textual appearance descriptions and an appearance injector for direct visual cues—to robustly decouple appearance from motion. Furthermore, the introduction of a motion-specific embedding effectively enhances the model's ability to learn the nuances of the target motion. The comprehensive experimental results, including both quantitative metrics and user studies, convincingly demonstrate that MoTrans surpasses prior art by generating high-quality videos that faithfully transfer motion to new subjects and contexts without suffering from appearance overfitting.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Semantic Constraints on Motion Transfer: The current method can transfer human-centric motions to non-human subjects (e.g., a cat playing a guitar), but it does not account for morphological constraints. For example, it might struggle to apply a "running" motion to an object without limbs, like a fish. Future work could explore incorporating semantic or physical constraints.
- Limited Video Length: The method is currently optimized for short video clips (2-3 seconds). Generating longer, coherent videos with complex, customized motions remains a significant challenge for the entire field.
- The authors plan to address these limitations to expand the applicability of MoTrans to more complex and practical scenarios.
7.3. Personal Insights & Critique
MoTrans is a well-designed and thoughtful piece of research that makes a tangible contribution to the field of controllable video generation.
- Strengths and Inspirations:
  - The core idea of using complementary multimodal information for decoupling is particularly insightful. It is a powerful principle that could be applied to other disentanglement problems in generative AI. Instead of relying on a single modality or purely architectural tricks, it leverages the strengths of both language and vision to guide the model.
  - The motion-specific embedding targeting verbs is an elegant and intuitive solution. It demonstrates how linguistic priors can be effectively integrated into the diffusion model's conditioning space for fine-grained control.
  - The introduction of the CLIP-E metric is a valuable contribution to the evaluation methodology for this specific task, as it provides a direct way to quantify appearance overfitting.
- Potential Issues and Areas for Improvement:
  - Brittleness of NLP Tools: The reliance on an external tool like spaCy for part-of-speech tagging could be a weak link. Its performance may vary with complex or creatively phrased prompts, potentially causing the motion enhancer to fail if it cannot correctly identify the verb. A more end-to-end approach for identifying the motion concept could be more robust.
  - Dependency on Pretrained Models: The MoFid metric's reliability is contingent on the feature space of the VideoMAE model. If VideoMAE was not trained on a sufficiently diverse set of motions, its ability to distinguish between similar but distinct actions might be limited, potentially affecting the accuracy of the motion fidelity score.
  - Complexity: The two-stage training process, while effective, adds a layer of complexity compared to end-to-end methods. Simplifying the pipeline while retaining the decoupling benefits would be a valuable future step.
  - Trade-off in One-Shot Setting: While the paper argues that lower MoFid in MoTrans compared to overfitting methods is a good thing, it does highlight a trade-off. In the one-shot scenario, there might be a limit to how much motion fidelity can be achieved without some degree of appearance leakage. Exploring this trade-off more deeply could be an interesting research direction.