
Video-As-Prompt: Unified Semantic Control for Video Generation

Published: 10/24/2025

TL;DR Summary

The Video-As-Prompt (VAP) paradigm redefines semantic control in video generation by using reference videos as prompts, combined with a Mixture-of-Transformers architecture. VAP builds the largest dataset for this task, VAP-Data, and achieves a 38.7% user preference rate while showing strong zero-shot generalization to unseen semantic conditions.

Abstract

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is Video-As-Prompt: Unified Semantic Control for Video Generation. This title highlights a novel paradigm (Video-As-Prompt) for achieving consistent and generalizable semantic control in video generation.

1.2. Authors

The authors of this paper are:

  • Yuxuan Bian

  • Xin Chen

  • Zenan Li

  • Tiancheng Zhi

  • Shen Sang

  • Linjie Luo

  • Qiang Xu

    Their affiliations include Intelligent Creation Lab, ByteDance and The Chinese University of Hong Kong. Xin Chen is noted as the Project lead, and Xin Chen and Qiang Xu are Corresponding Authors.

1.3. Journal/Conference

The paper is published as an arXiv preprint with the identifier arXiv:2510.20888. While arXiv is a preprint server and not a peer-reviewed journal or conference, papers often appear there before or during the review process for major conferences (e.g., CVPR, ICCV, NeurIPS, ICLR) or journals. Given the technical depth and the mentions of a project page, it is likely intended for a top-tier computer vision or machine learning conference or journal, which are highly reputable and influential in the field of generative AI.

1.4. Publication Year

The paper was published on arXiv on October 23, 2025 (UTC).

1.5. Abstract

The paper addresses the critical challenge of achieving unified and generalizable semantic control in video generation. Existing methods suffer from limitations such as introducing artifacts due to inappropriate pixel-wise priors from structure-based controls or relying on costly, non-generalizable, condition-specific finetuning or task-specific architectures.

The authors introduce Video-As-Prompt (VAP), a new paradigm that reframes semantic control as an in-context generation problem. VAP uses a reference video as a direct semantic prompt to guide a frozen Video Diffusion Transformer (DiT). This guidance is facilitated by a plug-and-play Mixture-of-Transformers (MoT) expert architecture, which prevents catastrophic forgetting. A temporally biased position embedding is employed to eliminate spurious mapping priors and ensure robust context retrieval.

To support this research, the authors created VAP-Data, the largest dataset for semantic-controlled video generation to date, comprising over 100K paired videos across 100 semantic conditions.

VAP, as a single unified model, achieves state-of-the-art results among open-source methods, demonstrating a 38.7% user preference rate that competes with leading condition-specific commercial models. Its strong zero-shot generalization capabilities and support for various downstream applications mark a significant advancement toward general-purpose, controllable video generation.

Official Source Link: https://arxiv.org/abs/2510.20888 PDF Link: https://arxiv.org/pdf/2510.20888v1.pdf This paper is currently published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of a unified and generalizable framework for semantic-controlled video generation. While structure-controlled video generation (where control signals like depth maps or poses are pixel-aligned with the target video) is well-studied and unified frameworks exist, semantic control presents a different challenge. Semantic control involves guiding video generation based on abstract concepts such as style, motion, or camera movement, where there is no direct pixel-wise correspondence between the control signal and the output video.

This problem is important because it limits applications in crucial areas like visual effects, video stylization, motion imitation, and camera control. Existing methods for semantic-controlled video generation face several challenges:

  • Artifacts from Structure-Based Methods: Directly migrating unified structure-controlled methods (which rely on pixel-aligned conditions) often leads to artifacts. These methods enforce inappropriate pixel-wise mapping priors (assumptions that each pixel in the input condition corresponds to a specific pixel in the output), which are not valid for semantic conditions.

  • Condition-Specific Overfitting: Many approaches finetune large backbones or LoRAs (Low-Rank Adaptations) for each specific semantic condition (e.g., "Ghibli style" or "Hitchcock camera zoom"). This approach is extremely costly in terms of computational resources and time, and it makes the models non-generalizable to new, unseen semantic conditions.

  • Task-Specific Design: Other methods involve crafting task-specific modules or inference strategies for each type of semantic control (e.g., style, motion, camera). While effective for their specific task, these designs hinder the development of a single, unified model that can handle diverse semantic controls and lack zero-shot generalizability to novel tasks or conditions.

    The paper's entry point and innovative idea revolve around reframing semantic control as an in-context generation problem. Instead of pixel-aligned conditions, per-condition training, or task-specific designs, Video-As-Prompt (VAP) treats a reference video containing the desired semantics as a direct "video prompt" to guide the generation process. This approach is inspired by the success of Diffusion Transformers (DiTs) in supporting strong in-context control abilities in image and structure-controlled video generation.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of controllable video generation:

  • VAP: A Unified Semantic-Controlled Video Generation Paradigm: The authors introduce Video-As-Prompt (VAP), the first unified framework for semantic-controlled video generation under non-pixel-aligned conditions. VAP innovatively treats a reference video with the desired semantics as a video prompt, enabling generalizable in-context control across diverse semantic conditions. This removes the inappropriate pixel-wise mapping prior found in structure-based controls and avoids the need for per-condition training or per-task model designs.

  • Plug-and-Play In-Context Video Generation Framework: VAP proposes a novel plug-and-play framework built on a Mixture-of-Transformers (MoT) architecture. This design augments any frozen Video Diffusion Transformer (DiT) with a trainable parallel expert transformer that interprets the video prompt and guides generation. This architecture is crucial for preventing catastrophic forgetting of the pre-trained DiT's generative abilities, supports various downstream tasks (e.g., concept, style, motion, camera guidance), and delivers strong zero-shot generalizability to unseen semantic conditions. A temporally biased position embedding is also introduced to eliminate spurious mapping priors for robust context retrieval.

  • VAP-Data: Largest Dataset for Semantic-Controlled Video Generation: The authors construct and release VAP-Data, a large-scale dataset specifically designed for semantic-controlled video generation. With over 100K curated paired samples across 100 semantic conditions, VAP-Data provides a robust foundation for training and evaluating unified semantic-controlled video generation models, addressing a significant data gap in the field.

    The key conclusions and findings are:

  • VAP, as a single unified model, achieves state-of-the-art performance among open-source methods, with a 38.7% user preference rate that is competitive with leading condition-specific commercial models. This demonstrates that a unified approach can rival highly specialized models.

  • The MoT architecture effectively preserves the base DiT's generative ability while enabling stable in-context control, overcoming issues like catastrophic forgetting.

  • The temporally biased Rotary Position Embedding (RoPE) is essential for removing unrealistic pixel-wise alignment priors, leading to improved performance for semantic control.

  • VAP exhibits strong zero-shot generalization to unseen semantic conditions and supports various downstream applications, indicating a significant step toward truly general-purpose controllable video generation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the Video-As-Prompt (VAP) paper, a foundational understanding of several key concepts in deep learning and generative AI is essential:

  • Diffusion Models: These are a class of generative models that learn to reverse a gradual noisy process. In essence, they start with random noise and gradually transform it into a meaningful data sample (e.g., an image or video) by learning to denoise at each step.

    • Forward Process (Diffusion): Data (e.g., a clear image) is progressively corrupted by adding Gaussian noise over several timesteps until it becomes pure noise.
    • Reverse Process (Denoising/Generation): A neural network is trained to predict and remove the noise added at each step, effectively learning to transform noise back into coherent data.
    • Latent Diffusion Models (LDMs): Instead of working directly in the pixel space, LDMs operate in a compressed latent space. A Variational Autoencoder (VAE) is typically used to encode high-resolution images/videos into lower-dimensional latents, and the diffusion process occurs in this latent space. This makes training more efficient and stable.
  • Transformers: Originally developed for natural language processing, Transformers are neural network architectures characterized by their self-attention mechanisms.

    • Self-Attention: This mechanism allows the model to weigh the importance of different parts of the input sequence (e.g., words in a sentence, patches in an image/video) when processing each part. It captures long-range dependencies effectively.
    • Query (Q), Key (K), Value (V): In self-attention, the input is linearly transformed into three matrices: Query, Key, and Value. The Query interacts with Keys to determine attention weights, which are then applied to Values to produce the output: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $, where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, $d_k$ is the dimension of the key vectors, and $\mathrm{softmax}$ normalizes the attention scores. (A minimal code sketch of this operation is given after this list.)
    • Multi-Head Attention: Multiple attention heads are run in parallel, each learning different representations, and their outputs are concatenated and linearly transformed.
  • Diffusion Transformers (DiTs): These models combine the power of Diffusion Models with the Transformer architecture. Instead of U-Net architectures often used in diffusion models, DiTs use Transformers to process the latent representations of noisy images or videos.

    • Video DiTs extend this to the temporal dimension, processing video latents as sequences of spatial patches over time. They are known for their scalability and ability to handle large input contexts.
  • Rotary Position Embedding (RoPE): A type of position embedding used in Transformers to inject positional information into the input sequence. Unlike absolute or relative position embeddings that add values to token embeddings, RoPE rotates Query and Key vectors based on their absolute position. This allows self-attention to implicitly incorporate relative positional information, which is beneficial for in-context learning.

  • Variational Autoencoders (VAEs): As mentioned with LDMs, VAEs are neural networks used for dimensionality reduction and data generation. They consist of an encoder that maps input data (e.g., a video) to a lower-dimensional latent space and a decoder that reconstructs the data from the latent space. For DiTs, VAEs are crucial for converting high-resolution videos into computationally manageable latent representations.

  • Flow Matching: A generative modeling technique that trains a neural network to predict a vector field that transports samples from a simple base distribution (e.g., Gaussian noise) to a complex data distribution (e.g., real videos). Instead of learning to reverse a stochastic diffusion process (as in traditional diffusion models), Flow Matching learns a continuous deterministic flow that maps noise to data. This can lead to more stable training and efficient sampling.

  • In-Context Learning: A paradigm where a model learns a new task by observing examples provided directly in the input prompt, without requiring explicit finetuning of the model's weights. The model uses its learned patterns and internal representations to generalize from the provided examples to perform the new task. For video generation, this means providing a reference video that implicitly specifies the desired style, motion, or concept, and the model generates a new video following that context.

  • Mixture-of-Experts (MoE) / Mixture-of-Transformers (MoT): An architecture where a network comprises several "expert" sub-networks. A gating network learns to select or combine the outputs of these experts based on the input. MoT specifically applies this concept to Transformers, where different Transformer experts can specialize in different tasks or input modalities. In this paper, MoT is used as a plug-and-play mechanism where a parallel expert Transformer handles the reference prompt while the main DiT backbone handles the target generation, preventing catastrophic forgetting.

  • Catastrophic Forgetting: A common problem in machine learning where a model, when trained on a new task, tends to forget previously learned information or tasks. This is a significant challenge when trying to extend a pre-trained model to new functionalities without losing its original capabilities.
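To make the attention equation above concrete (see the Query/Key/Value bullet), here is a minimal PyTorch sketch of scaled dot-product attention. The tensor names and shapes are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal scaled dot-product attention.

    q, k, v: tensors of shape (batch, seq_len, d_k); illustrative only.
    """
    d_k = q.size(-1)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # normalize over keys
    return weights @ v                              # (batch, seq_q, d_k)

# Toy usage: 2 sequences of 8 tokens with dimension 16
q = torch.randn(2, 8, 16)
k = torch.randn(2, 8, 16)
v = torch.randn(2, 8, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16])
```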

3.2. Previous Works

The paper contextualizes VAP by discussing existing approaches to controllable video generation, categorizing them broadly into Structure-Controlled and Semantic-Controlled.

3.2.1. Structure-Controlled Video Generation

  • Definition: These methods are driven by pixel-aligned conditions, meaning there's a direct geometric or structural correspondence between the input control signal and the desired output video.
  • Examples of Conditions: depth maps [14, 16], human pose skeletons [28, 36], segmentation masks [6, 75], optical flow (representing motion vectors) [29, 35].
  • Common Mechanism: Typically, an additional adapter or branch is added to a pre-trained DiT (or similar backbone) and its features are injected via residual addition. This mechanism effectively exploits the pixel-wise mapping prior (the assumption that each pixel in the condition corresponds to a specific pixel in the output).
  • Unified Frameworks: Recent works like VACE [34] and FullDiT [37] have enabled all-in-one structure-controlled generation by treating these diverse pixel-aligned inputs as a unified spatial condition.

3.2.2. Semantic-Controlled Video Generation

  • Definition: These methods deal with conditions that lack pixel-wise correspondence with the target videos. The control is based on abstract semantics.
  • Examples of Conditions: concept transformation (e.g., turning a person into a Ladudu doll) [26, 40, 47], stylization (e.g., Ghibli style) [31, 78], camera movement (e.g., Hitchcock dolly zoom) [2, 3], motion transfer (sharing motion but differing in layout or skeleton) [19, 71, 82].
  • Prior Approaches and Their Limitations (as shown in Figure 2):
    • Condition-Specific Overfit (Figure 2(b)): Methods like VX Creator [47] or using Civitai [11] LoRAs finetune the DiT backbones or LoRAs for each specific semantic condition.
      • Drawbacks: This is computationally expensive, requires a separate model for every new semantic, and struggles to generalize to unseen conditions.
    • Task-Specific Design (Figure 2(c)): Methods like RecamMaster [2] (camera control) or StyleMaster [78] (stylization) craft dedicated task-specific modules or inference strategies for a particular condition type. They often encode reference videos with the same semantics into a specially designed space to guide generation.
      • Drawbacks: While effective for their narrow task, these approaches are not unified, hinder the creation of general-purpose models, and lack zero-shot generalizability.
    • Concurrent Work [49]: A LoRA mixture-of-experts approach for unified generation, but it still learns each condition by overfitting subsets of parameters and fails to generalize to unseen ones.

3.2.3. Technological Evolution

Video generation has rapidly evolved:

  • Early Stages: Dominated by Generative Adversarial Networks (GANs) [63, 69], which produced impressive but often less stable or controllable results.
  • Diffusion Era: Diffusion models [8, 17] emerged as state-of-the-art, offering higher quality, stability, and control.
  • Transformer-Based Architectures: Transition from convolutional architectures [7, 10] to transformer-based ones (DiTs) [8, 17, 70] dramatically increased scalability and performance, particularly for long-range dependencies.
  • Controllable Generation: Initial DiTs primarily supported text prompts or first/last frame control. This led to the development of task-specific modules [6, 34] or inference strategies [71, 82] to enable finer, user-defined control.
  • In-Context Learning Influence: Recent successes in unified image generation [66] and structure-controlled video generation [37] using DiTs for in-context control motivated the authors to apply this paradigm to the more challenging semantic-controlled video generation.

3.3. Differentiation Analysis

VAP critically differentiates itself from prior methods in the following ways:

  • Unified vs. Fragmented: Unlike the fragmented landscape of semantic-controlled video generation (which relies on condition-specific overfitting or task-specific designs), VAP proposes a single, unified model that can handle diverse semantic conditions (concept, style, motion, camera) without requiring per-condition retraining or per-task model modifications.

  • In-Context Control via Video Prompts: VAP introduces the novel concept of using a reference video itself as a direct semantic prompt. This is a significant departure from:

    • Structure-controlled methods that use pixel-aligned conditions (e.g., depth, pose) and residual addition, which VAP shows cause artifacts when applied to semantic tasks due to inappropriate pixel-wise priors (Figure 5(a)).
    • Semantic-controlled methods that either finetune for each condition or design task-specific modules, both of which lack generality.
  • Plug-and-Play Architecture (MoT): VAP leverages a Mixture-of-Transformers (MoT) design to augment a frozen Video Diffusion Transformer (DiT). This plug-and-play approach is superior to finetuning the entire DiT (which can lead to catastrophic forgetting as shown in ablation studies) or using LoRAs (which have limited capacity for complex in-context generation). MoT enables the expert to learn from the reference prompt while preserving the backbone's pre-trained generative abilities.

  • Temporally Biased Position Embedding: VAP addresses a subtle but crucial issue with Rotary Position Embedding (RoPE) in in-context generation. It identifies that sharing identical position embeddings between the reference and target can impose false pixel-level spatiotemporal mapping priors. By introducing a temporal bias, VAP eliminates this spurious prior, leading to improved performance and robust context retrieval. This is a unique contribution to the in-context learning paradigm for video.

  • Zero-Shot Generalization: A key advantage of VAP's unified, in-context learning approach is its strong zero-shot generalizability to unseen semantic conditions, a capability largely absent in condition-specific or task-specific prior works.

  • Dedicated Dataset (VAP-Data): The creation of VAP-Data, the largest dataset for semantic-controlled video generation with paired reference/target videos, is a crucial enabler for VAP and future research, addressing a major data scarcity issue in this specific area.

    In essence, VAP moves beyond the limitations of pixel-alignment and specialized architectures by formulating semantic control as a flexible, in-context prompting problem, enabled by a robust MoT and a carefully designed position embedding, trained on a purpose-built large-scale dataset.

4. Methodology

4.1. Principles

The core idea behind Video-As-Prompt (VAP) is to unify semantic-controlled video generation by treating a reference video with the desired semantics as a direct, task-agnostic "video prompt." This approach aims to overcome the limitations of prior methods, which either rely on pixel-aligned conditions (leading to artifacts for semantic tasks) or employ condition-specific finetuning or task-specific designs (hindering generalizability and unification).

VAP reframes the problem as in-context generation, where the model learns to infer and apply abstract semantic patterns from the reference video to a new target generation. The theoretical basis and intuition are rooted in the power of Diffusion Transformers (DiTs) for in-context learning, where the model can adapt its behavior based on contextual examples provided within the input.

To achieve this, VAP incorporates several key principles:

  1. Reference Video as Unified Prompt: Instead of relying on explicit, task-specific control signals, a video that demonstrates the desired semantic (e.g., a video in Ghibli style, a video showing a specific motion) serves as the universal input prompt.
  2. Plug-and-Play Architecture: To integrate this video prompt into a pre-trained Video Diffusion Transformer (DiT) without catastrophic forgetting (i.e., losing its original generative capabilities), VAP uses a Mixture-of-Transformers (MoT) design. This allows a specialized "expert" module to process the reference prompt, while the frozen DiT backbone handles the main generation task, with controlled information flow between them.
  3. Temporal Context Preservation: Recognizing that position embeddings are crucial for transformers, VAP introduces a temporally biased Rotary Position Embedding (RoPE). This design ensures that the model correctly interprets the temporal relationship between the reference prompt and the target video, preventing the imposition of spurious pixel-wise mapping priors that are irrelevant for semantic control.
  4. Task-Agnostic Learning: By using a diverse dataset of reference video-target video pairs across various semantic conditions, VAP learns a general underlying principle for semantic control, allowing it to generalize zero-shot to unseen semantics.

4.2. Core Methodology In-depth (Layer by Layer)

The VAP framework operates by augmenting a frozen Video Diffusion Transformer (DiT) with a trainable Mixture-of-Transformers (MoT) expert, guided by a temporally biased position embedding.

4.2.1. Preliminary: Video Diffusion Models

Video diffusion models learn to reverse a noise process to generate videos. The paper illustrates this using Flow Matching [46]. In Flow Matching, a neural network is trained to predict a velocity field that transports samples from a simple noise distribution to a complex data distribution.

Formally, a noise sample $\mathbf{x}_0 \sim \mathcal{N}(0, 1)$ is denoised to a target video $\mathbf{x}_1$ along a path $\mathbf{x}_t$ defined as: $ \mathbf{x}_t = t \mathbf{x}_1 + (1 - (1 - \sigma_{min})t) \mathbf{x}_0 $ The velocity field $V_t$ along this path is the derivative of $\mathbf{x}_t$ with respect to $t$: $ V_t = \frac{d\mathbf{x}_t}{dt} = \mathbf{x}_1 - (1 - \sigma_{min}) \mathbf{x}_0 $ Here, $\sigma_{min} = 10^{-5}$ is a small constant that keeps the endpoint slightly noisy for numerical stability, and $t \in [0, 1]$ is the timestep along the path.

The model, parameterized by $\boldsymbol{\Theta}$ and denoted $u_{\boldsymbol{\Theta}}$, is trained to predict this velocity field $V_t$ given the noisy sample $\mathbf{x}_t$, the current timestep $t$, and any additional conditions $C$. The training objective $\mathcal{L}$ is an L2 regression loss: $ \mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_1, C} \left\lVert u_{\boldsymbol{\Theta}}(\mathbf{x}_t, t, C) - (\mathbf{x}_1 - (1 - \sigma_{min}) \mathbf{x}_0) \right\rVert^2 $ (A minimal training-step sketch in code is given after the symbol list below.)

  • $u_{\boldsymbol{\Theta}}(\cdot)$: The Video Diffusion Transformer model with parameters $\boldsymbol{\Theta}$ that predicts the velocity field.

  • $\mathbf{x}_t$: The noisy video sample at timestep $t$.

  • $t$: The current timestep, indicating the progression from noise to data.

  • $C$: Additional conditions provided to guide the generation (e.g., text prompts, reference videos).

  • $\mathbf{x}_0$: An initial sample of Gaussian noise.

  • $\mathbf{x}_1$: The target (ground-truth) video.

  • $\sigma_{min}$: A small constant ($10^{-5}$) used in the path definition.

  • $\mathbb{E}$: Expectation over the distribution of $t$, $\mathbf{x}_0$, $\mathbf{x}_1$, and $C$.

  • $\lVert \cdot \rVert^2$: Squared L2 norm, measuring the squared difference between the model's prediction and the true velocity.

    During inference, the model starts by sampling Gaussian noise $\mathbf{x}_0 \sim \mathcal{N}(0, 1)$. An ODE solver (e.g., Euler, DDIM) then uses the trained model $u_{\boldsymbol{\Theta}}$ to iteratively predict the velocity field and traverse the path from noise to a clean video $\mathbf{x}_1$ over a discrete set of $N$ denoising timesteps.
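To make the training objective concrete, here is a minimal PyTorch sketch of one Flow Matching training step as described above. The `model(x_t, t, cond)` interface and names are illustrative assumptions; the actual VAP implementation may differ.

```python
import torch

SIGMA_MIN = 1e-5

def flow_matching_loss(model, x1, cond):
    """One Flow Matching training step (illustrative).

    x1:   clean target video latents, shape (B, ...)
    cond: conditioning inputs (e.g., text / reference tokens) passed to the model
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                              # Gaussian noise sample
    t = torch.rand(b, device=x1.device)                    # timestep in [0, 1]
    t_ = t.view(b, *([1] * (x1.dim() - 1)))                # broadcast over latent dims
    x_t = t_ * x1 + (1.0 - (1.0 - SIGMA_MIN) * t_) * x0    # interpolation path x_t
    v_target = x1 - (1.0 - SIGMA_MIN) * x0                 # ground-truth velocity field
    v_pred = model(x_t, t, cond)                           # model predicts the velocity
    return torch.mean((v_pred - v_target) ** 2)            # L2 regression loss
```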

4.2.2. Reference Videos as Task-Agnostic Prompts

VAP unifies heterogeneous semantic conditions by treating reference videos (which embody the desired semantics) as universal video prompts.

  • Problem: Structure-controlled methods enforce a pixel-wise alignment (e.g., depth map to video). When a semantic-same but pixel-misaligned video is injected via residual addition (as in ControlNet-like methods), it often leads to copy-and-paste artifacts (Figure 5(a)), where unwanted appearance or layout from the reference is copied. Prior semantic methods were condition-specific or task-specific.

  • VAP's Solution: A single unified model $u_{\boldsymbol{\Theta}}$ is trained to learn the conditional distribution $p(\mathbf{x} \mid c)$ for any semantic condition $c$. The model uses the reference video as a prompt, which shares the same semantics as the target but lacks pixel-wise alignment.

  • Semantic Categories: VAP specifically evaluates four representative categories for semantic control:

    • Concept-Guided Generation: Involves entity transformation (e.g., person becomes a doll) or interaction (e.g., AI lover approaches).
    • Style-Guided Generation: Generating videos in a specific artistic style (e.g., Ghibli, Minecraft).
    • Motion-Guided Generation: Following a reference motion, either human (e.g., specific dance moves) or non-human (e.g., object expansion).
    • Camera-Guided Generation: Replicating reference camera movements (e.g., zoom, pan, Hitchcock dolly zoom).
  • Auxiliary Captions: To further aid semantic control, VAP also takes captions for both the reference video ($P_{ref}$) and the target video ($P_{tar}$) as input. These captions help the model identify and transfer shared semantic control signals. The model thus learns $p(\mathbf{x} \mid C_{co}, C_s, C_m, C_{ca}, P_{ref}, P_{tar})$.

    The architecture overview of Video-As-Prompt is shown in Figure 4.

    Figure 4: Overview of Video-As-Prompt. The reference video (with the wanted semantics), the target video, and their captions form the in-context token sequence $[Ref_{text}, Ref_{video}, Tar_{text}, Tar_{video}]$ (see middle; the term "tokens" is omitted for simplicity). First-frame tokens are concatenated with video tokens. A temporal bias $\varDelta$ is added to RoPE to avoid the nonexistent pixel-aligned prior imposed by the original shared RoPE (see bottom right). The reference video and captions act as the prompts and guide the generation, exchanging information with the pre-trained DiT via full attention at each layer, enabling plug-and-play in-context generation.

4.2.3. Plug-and-Play In-Context Control

The VAP model takes four primary inputs to guide generation:

  1. Reference Video (semantic prompt): Provides the wanted semantics.

  2. Reference Image (initial appearance): The first frame of the generated video, providing initial appearance and subject.

  3. Captions: Reference caption (PrefP_{ref}) and target caption (PtarP_{tar}) to aid in semantic transfer.

  4. Noise/Noisy Target Video: Gaussian noise for inference, or a noisy version of the target video during training.

    The processing flow is as follows:

  1. VAE Encoding: The reference video $c \in \mathbb{R}^{n \times h \times w \times c}$ and the target video $X \in \mathbb{R}^{n \times h \times w \times c}$ are first encoded into latent representations $\hat{c} \in \mathbb{R}^{n' \times h' \times w' \times d}$ and $\mathbf{x} \in \mathbb{R}^{n' \times h' \times w' \times d}$, respectively, using a Variational Autoencoder (VAE). Here, $n, h, w$ are the original temporal/spatial sizes and $n', h', w'$ are the latent sizes.

  2. Text Tokenization: The reference caption $P_{ref}$ and target caption $P_{tar}$ are processed into text tokens $t_{\hat{c}}, t_x \in \mathbb{R}^{n_t \times d}$, where $n_t$ is the number of text tokens and $d$ is the embedding dimension.

  3. In-Context Sequence: A naive approach would be to finetune a DiT on the concatenated sequence of all inputs, $[t_{\hat{c}}, \hat{c}, t_x, \mathbf{x}]$. However, the authors found this leads to catastrophic forgetting and poor performance with limited data, especially because semantic control lacks pixel-aligned priors.

  4. Mixture-of-Transformers (MoT) Architecture: To overcome these issues, VAP adopts a Mixture-of-Transformers (MoT) design (Figure 4, middle):

    • Frozen Backbone: A pre-trained Video Diffusion Transformer (DiT) is kept frozen. This backbone processes the target-generation inputs $[t_x, \mathbf{x}]$ and retains its original query (Q), key (K), and value (V) projections, feed-forward layers, and layer norms.
    • Trainable Expert: A parallel expert Transformer, initialized from the backbone, is introduced. This expert is trainable and consumes the reference-prompt inputs $[t_{\hat{c}}, \hat{c}]$, with its own Q/K/V projections, feed-forward layers, and norms.
    • Two-Way Information Fusion: At each layer of the MoT, the Q/K/V from the expert (reference path) and the frozen backbone (target path) are concatenated, and full attention is run over the combined sequences. This provides synchronous layer-wise reference guidance: the expert's interpretation of the reference prompt can dynamically influence the frozen DiT's generation of the target video, and vice versa (a minimal sketch follows this list).
    • Benefits: This MoT design preserves the backbone's original generative ability (preventing catastrophic forgetting), boosts training stability, and enables plug-and-play in-context control that is independent of the specific DiT architecture.
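The two-way fusion can be pictured with a short sketch: at each MoT layer, queries/keys/values from the frozen backbone (target tokens) and the trainable expert (reference tokens) are concatenated and attended over jointly. This is an illustrative simplification under assumed module and tensor names, not the released implementation (projections back to the residual stream, multiple heads, and feed-forward layers are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTJointAttention(nn.Module):
    """Illustrative MoT-style joint attention: a frozen backbone path for target
    tokens and a trainable expert path for reference tokens, fused by full
    attention over the concatenated sequence (single head, for brevity)."""

    def __init__(self, dim):
        super().__init__()
        self.backbone_qkv = nn.Linear(dim, 3 * dim)   # pre-trained, kept frozen
        self.expert_qkv = nn.Linear(dim, 3 * dim)     # trainable expert (init from backbone)
        for p in self.backbone_qkv.parameters():
            p.requires_grad = False

    def forward(self, target_tokens, ref_tokens):
        # Each stream uses its own Q/K/V projection.
        q_t, k_t, v_t = self.backbone_qkv(target_tokens).chunk(3, dim=-1)
        q_r, k_r, v_r = self.expert_qkv(ref_tokens).chunk(3, dim=-1)
        # Concatenate along the sequence axis: [reference, target].
        q = torch.cat([q_r, q_t], dim=1)
        k = torch.cat([k_r, k_t], dim=1)
        v = torch.cat([v_r, v_t], dim=1)
        # Full attention over the joint sequence lets reference tokens guide
        # target tokens (and vice versa) at every layer.
        fused = F.scaled_dot_product_attention(q, k, v)
        n_ref = ref_tokens.shape[1]
        return fused[:, n_ref:], fused[:, :n_ref]      # updated target / reference tokens

# Toy usage: 77 reference tokens and 226 target tokens of width 64
layer = MoTJointAttention(dim=64)
tar, ref = layer(torch.randn(1, 226, 64), torch.randn(1, 77, 64))
```

Because only the expert path is trainable, the backbone's pre-trained weights are untouched, which is what allows the expert to be attached or removed ("plug-and-play") without degrading the base model.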

4.2.4. Temporally Biased Rotary Position Embedding

  • Problem with Shared RoPE: Similar to observations in in-context image generation [66], the authors found that sharing identical position embeddings (e.g., Rotary Position Embedding - RoPE [65]) between the reference condition and the target video is suboptimal. It implicitly imposes a false pixel-level spatiotemporal mapping prior. This means the model assumes a direct, pixel-to-pixel correspondence in space and time between the reference and target, which does not exist for semantic control and leads to unsatisfactory performance (e.g., artifacts in Figure 5(c)).

  • VAP's Solution: To remove this spurious prior, VAP modifies the position embedding for the reference prompt (Figure 4, bottom right).

    • Temporal Bias ($\varDelta$): The temporal indices of the reference-prompt tokens are shifted by a fixed offset $\varDelta$, which effectively places the reference tokens before all the noisy target-video tokens along the temporal axis. This matches the temporal order expected by in-context generation (context first) and keeps the model from trying to align reference and target frames directly in time (an index-construction sketch is given below, after Figure 5).
    • Spatial Consistency: Crucially, the spatial indices for the reference prompt tokens are kept unchanged. This allows the model to still exploit meaningful spatial semantic changes within the reference video prompt.
  • Benefits: This temporally biased RoPE effectively removes the nonexistent pixel-mapping prior, leading to improved performance (Table 2) by allowing the model to focus on abstract semantic patterns rather than spurious pixel-level correspondences.

    Figure 5: Motivation. Ablation visualizations (Semantic: Spin $360^\circ$) on structure designs of VAP.
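A minimal sketch of how the temporally biased position indices could be constructed: the reference stream's temporal indices are shifted by a fixed offset $\varDelta$ so they precede all target-token indices, while spatial indices are left unchanged. The exact index convention (sign and magnitude of the shift) is an assumption for illustration.

```python
import torch

def biased_rope_indices(n_ref_frames, n_tar_frames, h, w, delta):
    """Build (t, y, x) RoPE indices for reference and target video tokens.

    The reference stream's temporal index is shifted so it sits `delta` steps
    before the target stream; spatial indices are shared, so meaningful spatial
    semantics in the reference remain usable.
    """
    def grid(n_frames, t_offset):
        t = torch.arange(n_frames) + t_offset
        y = torch.arange(h)
        x = torch.arange(w)
        tt, yy, xx = torch.meshgrid(t, y, x, indexing="ij")
        return torch.stack([tt, yy, xx], dim=-1).reshape(-1, 3)  # (n*h*w, 3)

    ref_idx = grid(n_ref_frames, t_offset=-delta)   # reference placed earlier in time
    tar_idx = grid(n_tar_frames, t_offset=0)        # target keeps its original indices
    return ref_idx, tar_idx

# Example: 13 latent frames each, a 30x45 latent grid, and a temporal bias of 13
ref_idx, tar_idx = biased_rope_indices(13, 13, 30, 45, delta=13)
```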

5. Experimental Setup

5.1. Datasets

Semantic-controlled video generation specifically requires paired reference and target videos that share the same non-pixel-aligned semantic controls (e.g., concept, style, motion, camera). Such pairs are difficult to obtain or label automatically, unlike structure-controlled data where vision perception models (e.g., SAM [39] for masks, Depth-Anything [74] for depth) can directly provide conditions. Prior work often relied on small, manually collected datasets tailored for specific conditions [47], hindering unified model development.

To address this data scarcity, the authors introduce VAP-Data, the largest dataset for semantic-controlled video generation to date.

  • Creation Process:

    1. High-Quality Reference Images: They curated 2K high-quality real images from the internet, covering diverse subjects like men, women, children, animals, objects, landscapes, and multi-subject scenarios.
    2. Bootstrapping with Existing Models: They leveraged a "zoo of specialist models" to generate paired videos:
      • Commercial models' Image-to-Video (I2V) visual effects templates (e.g., VIDU [68], Kling [40]).
      • Community LoRAs [11] for various effects.
    3. Pairing: Each curated image was matched with all compatible templates (some templates have subject category restrictions) to create the paired data.
  • Scale and Characteristics: VAP-Data comprises over 100K curated paired samples across 100 semantic conditions.

    • Categories: These conditions are organized into 4 primary categories, as shown in Figure 3 and detailed in Table 6 (in the Appendix):
      • Concept: Entity Transformation (e.g., person becomes Captain America, hair color change) and Entity Interaction (e.g., aliens coming, covered in liquid metal).
      • Style: Stylization (e.g., Ghibli, Minecraft, Jojo).
      • Motion: Human Motion Transfer (e.g., shake it dance, hip twist) and Non-human Motion Transfer (e.g., balloon flyaway, spin $360^\circ$).
      • Camera: Camera Movement Control (e.g., dolly effect, Hitchcock zoom, move up/down/left/right).
  • Purpose: VAP-Data serves as a robust data foundation for training a single generalist model (VAP) to learn the unified underlying principle of semantic control from disparate specialist models.

  • Evaluation Subset: For evaluation, a test subset was created by evenly sampling 24 semantic conditions (6 from each of the 4 categories), with 2 samples each, totaling 48 test samples.

    Figure 3: Overview of Our Proposed VAP-Data. (a) 100 semantic conditions across 4 categories: concept, style, camera, and motion; (b) diverse reference images, including animals, humans, objects, and scenes, with multiple variants; and (c) a word cloud of the semantic conditions.

The following are the results from Table 6 of the original paper:

| Primary Category | Subcategory | Subset (alphabetical) | Total Videos |
| --- | --- | --- | --- |
| Concept (n=45) | Entity Transformation (n=24) | captain america, cartoon doll, eat mushrooms, fairy me, fishermen, fuzzyfuzzy, gender swap, get thinner, hair color change, hair swap, ladudu me, mecha x, minecraft, monalisa, muscling, pet to human, sexy me, squid game, style me, super saiyan, toy me, venom, vip, zen | 56k |
| | Entity Interaction (n=21) | aliens coming, child memory, christmas, cloning, couple arrival, couple drop, couple walk, covered liquid metal, drive yacht, emoji figure, gun shooting, jump to pool, love drop, nap me, punch hit, selfie with younger self, slice therapy, soul depart, watermelon hit, zongzi drop, zongzi wrap | |
| Style (n=11) | Stylization (n=11) | american comic, bjd, bloom magic, bloombloom, clayshot, ghibli, irasutoya, jojo, painting, sakura season, simpsons comic | 15k |
| Motion (n=32) | Human Motion Transfer (n=16) | break glass, crying, cute bangs, emotionlab, flying, hip twist, laughing, live memory, live photo, pet belly dance, pet finger, shake it dance, shake it down, split stance human, split stance pet, walk forward auto spin | 10k |
| | Non-human Motion Transfer (n=16) | balloon flyaway, crush, decapitate, dizzydizzy, expansion, explode, grow wings, head to balloon, paperman, paper fall, petal scattered, pinch, rotate, spin360, squish | 19k |
| Camera (n=12) | Camera Movement Control (n=12) | dolly effect, earth zoom out, hitchcock zoom, move down, move left, move right, move up, orbit, orbit dolly, orbit zoom in/out, reverse dolly, track in/out | 19k |

Overall subsets across all primary categories: 100. Overall videos across all categories: >100K.

Table 6: Dataset Statistics. We organize the dataset into 4 primary categories: Concept (merging entity transformation and interaction), Style, Motion (covering human and non-human motion transfer), and Camera Movement. For each primary category, we report its subcategory (if any), the alphabetical list of semantic-condition subsets (names come from commercial model API definitions [40, 68] and community visual-effects LoRA definitions [11]; see Sec. 4.3), and the total number of videos.

5.2. Evaluation Metrics

The paper evaluates models using a comprehensive set of metrics across three key aspects: text alignment, video quality, and semantic alignment.

5.2.1. Text Alignment

  • Conceptual Definition: CLIP similarity (or CLIP Score) measures how well the generated video aligns with the input text prompt semantically. It leverages the CLIP (Contrastive Language-Image Pre-training) model, which learns a joint embedding space for text and images. A higher CLIP similarity indicates that the visual content of the video is more relevant and consistent with its accompanying text description.
  • Mathematical Formula: $ \text{CLIP Score} = \frac{1}{N} \sum_{i=1}^N \cos(\mathbf{e}_{\text{text},i}, \mathbf{e}_{\text{video},i}) $
  • Symbol Explanation:
    • $N$: The total number of generated video-text pairs.
    • $\mathbf{e}_{\text{text},i}$: The CLIP embedding vector for the text prompt of the $i$-th video.
    • $\mathbf{e}_{\text{video},i}$: The CLIP embedding vector for the $i$-th generated video.
    • $\cos(\cdot, \cdot)$: The cosine similarity function, which measures the cosine of the angle between two vectors. It ranges from -1 (completely dissimilar) to 1 (identical direction); a higher value indicates better alignment. A minimal scoring sketch is given after this list.
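The paper does not specify the exact CLIP checkpoint or frame-aggregation scheme, so the following is a minimal sketch assuming the Hugging Face `openai/clip-vit-base-patch32` checkpoint and mean-pooled frame embeddings; the metric used in the paper may differ in these details.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(frames, prompt):
    """Cosine similarity between a text prompt and the mean frame embedding.

    frames: list of PIL.Image frames sampled from the generated video.
    """
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=[prompt], return_tensors="pt",
                            padding=True, truncation=True)
    img_emb = model.get_image_features(**image_inputs).mean(dim=0, keepdim=True)
    txt_emb = model.get_text_features(**text_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # unit-normalize
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()                  # cosine similarity in [-1, 1]
```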

5.2.2. Overall Video Quality

The paper uses three metrics to assess video quality:

  • Motion Smoothness

    • Conceptual Definition: Motion smoothness quantifies the temporal consistency and fluidity of motion within a generated video. It assesses whether movements are natural and free from abrupt changes, jumps, or jitters between frames. Higher scores indicate smoother, more visually pleasing motion.
    • Mathematical Formula: The paper refers to a common formulation [59] for motion smoothness, which typically measures feature-level consistency between consecutive frames (e.g., via optical flow or embedding similarity). A common approach averages the cosine similarity of CLIP embeddings between adjacent frames: $ \text{Motion Smoothness} = \frac{1}{T-1} \sum_{t=1}^{T-1} \cos(\mathbf{e}_{\text{frame},t}, \mathbf{e}_{\text{frame},t+1}) $
    • Symbol Explanation:
      • $T$: The total number of frames in the video.
      • $\mathbf{e}_{\text{frame},t}$: The CLIP embedding vector for frame $t$.
      • $\cos(\cdot, \cdot)$: Cosine similarity, measuring similarity between consecutive frames' visual representations.
  • Dynamic Degree

    • Conceptual Definition: Dynamic degree measures the amount and intensity of motion or change occurring within a video. A video with high dynamic degree exhibits significant movement or transformation, while a low dynamic degree suggests a relatively static scene. This is important for ensuring that generated videos are not stagnant. It often relies on optical flow analysis.
    • Mathematical Formula: The paper refers to Dynamic Degree [67], which typically computes the magnitude of optical-flow vectors between frames and averages it: $ \text{Dynamic Degree} = \frac{1}{T-1} \sum_{t=1}^{T-1} \left( \frac{1}{H \times W} \sum_{x=1}^{W} \sum_{y=1}^{H} \lVert \mathbf{f}_{t}(x,y) \rVert_2 \right) $ (A minimal optical-flow sketch is given after the video-quality metrics list.)
    • Symbol Explanation:
      • $T$: The total number of frames in the video.
      • $H, W$: Height and width of the video frames.
      • $\mathbf{f}_{t}(x,y)$: The optical-flow vector at pixel $(x, y)$ between frame $t$ and frame $t+1$.
      • $\lVert \cdot \rVert_2$: The Euclidean (L2) norm, representing the magnitude of motion at that pixel.
  • Aesthetic Quality

    • Conceptual Definition: Aesthetic quality assesses the overall visual appeal, beauty, and artistic merit of a generated video. This metric often relies on pre-trained aesthetic predictor models that have been trained on human-annotated datasets of image aesthetics. A higher score indicates a more aesthetically pleasing video.
    • Mathematical Formula: The paper refers to aesthetic quality [61]. While no universal formula is given, it typically involves an aesthetic predictor model $A(\cdot)$ that scores each frame, averaged over the video: $ \text{Aesthetic Quality} = \frac{1}{T} \sum_{t=1}^T A(\text{frame}_t) $
    • Symbol Explanation:
      • $T$: The total number of frames in the video.
      • $A(\text{frame}_t)$: The aesthetic score predicted by a pre-trained model for frame $t$.
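As an illustration of the dynamic-degree formula above (the paper cites existing benchmark tooling rather than giving an implementation), here is a minimal sketch using OpenCV's Farneback optical flow. The actual metric may rely on a learned flow estimator (e.g., RAFT) instead; this is only a stand-in under that assumption.

```python
import cv2
import numpy as np

def dynamic_degree(frames):
    """Average optical-flow magnitude between consecutive frames.

    frames: list of HxWx3 uint8 RGB arrays sampled from a video.
    """
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    mags = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Dense Farneback flow between consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())  # mean per-pixel magnitude
    return float(np.mean(mags))
```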

5.2.3. Semantic Alignment Score

  • Conceptual Definition: This custom metric measures the consistency between the reference semantic condition (provided by the reference video and captions) and the generated video. Standard video quality metrics don't reliably capture this adherence to specific semantic conditions. To achieve this, the authors leverage advanced Vision Language Models (VLMs) like Gemini-2.5-pro [12] and GPT-5 [53] for automatic scoring. The VLM acts as an expert judge, comparing the reference video-generation video pair against detailed evaluation rules.

  • Methodology:

    1. Input to VLM: For each (reference, generation) video pair, the VLM receives:
      • A general template describing the judging role and overall evaluation regime (ID-TRANSFORM vs. NON-ID-TRANSFORM).
      • Specific criteria tailored to the current semantic condition (e.g., "Ghibli style stylization," "dolly zoom").
      • The reference video condition.
      • The generated video.
    2. Scoring by VLM: The VLM then scores the generated video based on how well it satisfies these rules, outputting a single integer score (0-100).
  • Validation: To ensure stability and validity, the same evaluation was performed with two different state-of-the-art VLMs (Gemini-2.5-Pro and GPT-5). The scores from both VLMs showed close agreement and followed the trend of the user preference rate, confirming the metric's effectiveness.

    The following are the results from Table 4 of the original paper:

    Category Content
    General Template You are an expert judge for reference-based semantic video generation. INPUTS REFERENCE video: the target semantic to imitate. TEST video: a new output conditioned on a NEW reference image. Human criteria (treat as ground truth success checklist; overrides defaults if conflict): {criteria}
    REGIME DECISION Classify the semantics into one of: A) ID-TRANSFORM (identity-changing): the main subject/object changes semantic class or material/state. Layout and identity may legitimately change as a consequence of the transformation. B) NON-ID-TRANSFORM (identity-preserving): stylization, camera motion (pan/zoom), mild geometry exaggeration, lighting changes, human motion, etc. The main subject class/identity should remain the same. If the REFERENCE clearly shows a class/state change, choose A. Otherwise, choose B. When uncertain, choose B. EVALUATION 1) SEMANTIC MATCH (0-60) Regime A (ID-TRANSFORM): How strongly and accurately does TEST reproduce the REFERENCE's target state/look/behavior on the correct regions? Is the source→target mapping consistent (same parts transform to corresponding target parts)? Does the transformed state resemble the REFERENCE target, not a generic filter? Regime B (NON-ID-TRANSFORM): Does TEST replicate the specific semantic (style, camera motion, geometric exaggeration) while keeping the subject recognizable and aligned to the intended scope? 2) IDENTITY / LAYOUT CORRESPONDENCE (0-20) Regime A: Reward semantic correspondence rather than identical identity; coarse scene continuity is preserved unless the REFERENCE implies re-layout. Regime B: Main subject identity stays intact (face/body/clothes/features), and coarse spatial layout remains consistent (no unintended subject swaps/teleports). 3) TEMPORAL QUALITY and TRANSFORMATION CONTINUITY (0-20)
    Failures: - Incomplete semantic transformation (e.g., Ghibli style applied to <70% of frames), or failure to fully transition to target class/material, or completes <70% of the transformation timeline. - Severe identity loss in Regime B (unrecognizable face/body, unintended person/object swap). Gross broken anatomy (detached/missing limbs, implausible face mash) is not required by the semantics. - Extreme temporal instability or unreadable corruption (heavy strobe, tearing, tiling). - Hallucinated intrusive objects that block the subject or derail the semantics. OUTPUT (exactly ONE line of JSON; integer only) {"score": 0-100}
    Semantic Criteria Regime: NON-ID-TRANSFORM (identity-preserving stylization). Semantic: Ghibli-style stylization — the overall look gradually transitions to a hand-drawn, soft, film-like Ghibli aesthetic across the whole frame. Identity preservation: The main subject remains recognizable; appearance/proportions/base colors are largely maintained (stylistic simplification and brush-like textures allowed).

Table 4: Example Prompt components for the semantic-alignment metric. We provide the general template and the specific criteria of "Ghibli Style" as an example.
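To show how the Table 4 template could be turned into an automatic scoring harness, here is a minimal sketch. The `call_vlm` function is a placeholder for whatever multimodal API is used (the paper uses Gemini-2.5-Pro and GPT-5); its signature, and the abridged template string, are illustrative assumptions.

```python
import re

# Abridged from the Table 4 general template; the full rule text would go here.
GENERAL_TEMPLATE = (
    "You are an expert judge for reference-based semantic video generation. ... "
    "Human criteria (treat as ground truth success checklist): {criteria} ... "
    'OUTPUT (exactly ONE line of JSON; integer only) {{"score": 0-100}}'
)

def semantic_alignment_score(call_vlm, criteria, reference_video, test_video):
    """Score one (reference, generation) pair with a VLM judge.

    call_vlm(prompt, videos) -> str is a user-supplied placeholder that sends the
    prompt plus both videos to a multimodal model and returns its raw text reply.
    """
    prompt = GENERAL_TEMPLATE.format(criteria=criteria)
    raw = call_vlm(prompt, [reference_video, test_video])
    # The template asks for exactly one line of JSON: {"score": 0-100}
    match = re.search(r'\{"score":\s*(\d+)\}', raw)
    return int(match.group(1)) if match else None
```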

5.2.4. User Study

  • Conceptual Definition: A user study is a qualitative evaluation where human participants assess the quality of generated content. It directly captures human preferences regarding video quality and semantic alignment, which can sometimes be hard to quantify with automated metrics alone.
  • Methodology:
    1. Participants: 20 randomly selected video-generation researchers.
    2. Task: In each trial, raters compared outputs from different methods, shown alongside a semantic-control reference video.
    3. Criteria: Raters chose the better result based on two aspects: (i) semantic alignment and (ii) overall quality.
    4. Reporting: The preference rate is reported, which is the normalized share of selections across all comparisons, summing to 100%.

5.3. Baselines

The VAP method is evaluated against a comprehensive set of baselines, including state-of-the-art (SOTA) structure-controlled methods, condition-specific semantic methods, and leading closed-source commercial models.

  • State-of-the-Art Structure-Controlled Video Generation Method:

    • VACE [34]: This model conditions on a video and a same-size mask, indicating editable (1) vs. fixed (0) regions. For comparison with VAP, VACE was provided with:
      • Reference video itself.
      • Depth map of the reference video.
      • Optical flow of the reference video. In all cases, the mask was set to 1s, meaning the model was instructed to follow (rather than copy) the condition. This highlights how structure-controlled methods perform when pixel-wise mapping priors are inappropriate for semantic control.
  • DiT Backbone and Condition-Specific Methods:

    • CogVideoX-I2V [76]: The base Video Diffusion Transformer (DiT) model used as the backbone. This represents a strong generative model without any specific semantic control mechanism beyond standard text prompts.
    • CogVideoX-I2V (LoRA) [27]: This represents a common community practice for condition-specific semantic control. For each unique semantic condition, a LoRA (Low-Rank Adaptation) is finetuned on the CogVideoX-I2V backbone. This approach is often reported to match or surpass task-specific models [2, 78]. The performance reported is an average across all finetuned LoRAs.
  • State-of-the-Art Closed-Source Commercial Models:

    • Kling [40]: A commercial image-to-video special effects platform.
    • Vidu [68]: Another commercial AI video templates platform. These models are generally highly optimized and are typically condition-specific or task-specific, representing the highest practical performance benchmarks in the field.

5.4. Implementation Details

The authors trained VAP on two different DiT architectures to demonstrate its effectiveness and transferability:

  • CogVideoX-I2V-5B [76]

  • Wan2.1-I2V-14B [70]

    Key hyperparameters and training specifics:

  • Parameter Counts: For fairness, the in-context DiT expert in VAP was sized to a comparable parameter count on both backbones.

    • On CogVideoX-I2V-5B: The expert was a full copy of the original DiT (approx. 5B parameters).
    • On Wan2.1-I2V-14B: The expert was a distributed copy spanning 1/4 of the layers (also approx. 5B parameters).
  • Video Processing:

    • Videos were resized to 480p (480 x 720 or 480 x 832).
    • 49 frames were sampled at 16 frames per second (fps).
  • Optimizer: AdamW was used.

  • Learning Rate: $1 \times 10^{-5}$.

  • Learning Rate Schedule: Constant with warmup (1000 warmup steps).

  • Training Steps: Approximately 20,000 steps.

  • Hardware: Trained on 48 NVIDIA A100 GPUs.

  • Inference:

    • 50 denoising steps.
    • Classifier-free guidance scale of 6 (for CogVideoX-based) or 5 (for Wan2.1-based).
  • Prediction Type: Velocity (for CogVideoX) or Flow Matching (for Wan2.1).

  • MoT Layers: For the CogVideoX-based VAP, the MoT expert was applied across all 42 layers. For the Wan2.1-based VAP, it was applied to layers [0, 4, 8, 12, 16, 20, 24, 28, 32, 36] (10 layers, approximately 1/4 of total layers).

    The following are the results from Table 3 of the original paper:

| Hyperparameter | CogVideoX-I2V-based | Wan2.1-I2V-based |
| --- | --- | --- |
| Batch Size / GPU | 1/1 | 1/2 |
| Accumulate Step | 1 | 1 |
| Optimizer | AdamW | AdamW |
| Weight Decay | 0.0001 | 0.0001 |
| Learning Rate | 0.00001 | 0.00001 |
| Learning Rate Schedule | constant with warmup | constant with warmup |
| Warmup Steps | 1000 | 1000 |
| Training Steps | 20,000 | 20,000 |
| Resolution | 480p | 480p |
| Prediction Type | Velocity | Flow Matching |
| Num Layers | 42 | 40 |
| MoT Layers | all 42 layers (0-41) | [0, 4, 8, 12, 16, 20, 24, 28, 32, 36] |
| Pre-trained Model | CogVideoX-I2V-5B | Wan2.1-I2V-14B |
| Sampler | CogVideoX DDIM | Flow Euler |
| Sample Steps | 50 | 30 |
| Guidance Scale | 6.0 | 5.0 |
| Generation Speed (1 A100) | ~540 s | ~420 s |
| Device | A100 × 48 | A100 × 48 |
| Training Strategy | FSDP / DDP / BFloat16 | FSDP / DDP Parallel / BFloat16 |

Table 3: Hyperparameter selection for CogVideoX-I2V-5B-based and Wan2.1-I2V-14B-based VAP.
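Table 3 reports 50 denoising steps and a guidance scale of 6.0 for the CogVideoX-based model. The following is a minimal sketch of what a classifier-free-guidance Euler sampling loop could look like under those settings; the actual samplers are CogVideoX DDIM and Flow Euler, and the `model(x, t, c)` velocity-predictor interface here is an illustrative assumption.

```python
import torch

@torch.no_grad()
def cfg_flow_euler_sample(model, cond, uncond, shape,
                          steps=50, guide_scale=6.0, device="cuda"):
    """Euler ODE sampler with classifier-free guidance (illustrative).

    model(x, t, c) is assumed to predict the Flow Matching velocity field.
    """
    x = torch.randn(shape, device=device)                 # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t = torch.full((shape[0],), ts[i].item(), device=device)
        v_c = model(x, t, cond)                           # conditional velocity
        v_u = model(x, t, uncond)                         # unconditional velocity
        v = v_u + guide_scale * (v_c - v_u)               # classifier-free guidance
        x = x + (ts[i + 1] - ts[i]) * v                   # Euler step along the flow
    return x                                              # clean latents; decode with the VAE
```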

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that VAP significantly advances semantic-controlled video generation, outperforming open-source baselines and achieving competitive performance with commercial models.

The following are the results from Table 1 of the original paper:

| Model | Text CLIP Score ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Semantic Alignment Score ↑ | User Study Preference Rate (%) |
| --- | --- | --- | --- | --- | --- | --- |
| *Structure-Controlled Methods* | | | | | | |
| VACE (Original) | 5.88 | 97.60 | 68.75 | 53.90 | 35.38 | 0.6% |
| VACE (Depth) | 22.64 | 97.65 | 75.00 | 56.03 | 43.35 | 0.7% |
| VACE (Optical Flow) | 22.65 | 97.56 | 79.17 | 57.34 | 46.71 | 1.8% |
| *DiT Backbone and Condition-Specific Methods* | | | | | | |
| CogVideoX-I2V | 22.82 | 98.48 | 72.92 | 56.75 | 26.04 | 6.9% |
| CogVideoX-I2V (LoRA) | 23.59 | 98.34 | 70.83 | 54.23 | 68.60 | 13.1% |
| Kling / Vidu | 24.05 | 98.12 | 79.17 | 59.16 | 74.02 | 38.2% |
| *Ours* | | | | | | |
| Video-As-Prompt (VAP) | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 | 38.7% |

Table 1: Quantitative Comparison. We compare against the SOTA structure-controlled generation method VACE [34], the base video DiT model CogVideoX-I2V [76], the condition-specific variant CogVideoX-I2V (LoRA) [27], and the closed-source commercial models Kling/Vidu [40, 68]. Overall, VAP delivers performance comparable to the closed-source models and outperforms other open-source baselines. Red stands for the best, Blue stands for the second best.

6.1.1. Quantitative Comparison (Table 1)

  • Structure-Controlled Methods (VACE): As expected, VACE (Original, Depth, Optical Flow) performs the worst across all metrics for semantic-controlled generation. This is because VACE assumes a pixel-wise mapping between condition and output, which is not suitable for abstract semantic controls. When using the raw reference video as input, VACE shows very low Text CLIP Score (5.88) and Semantic Alignment Score (35.38), along with a minimal User Preference Rate (0.6%). As the control shifts from raw video to depth and then optical flow, the pixel-wise appearance details decrease, and metrics slightly improve (e.g., Semantic Alignment Score up to 46.71 for Optical Flow). This confirms that the strong pixel-wise prior of these methods is ill-suited for semantic control.
  • DiT Backbone (CogVideoX-I2V): Driving the pre-trained DiT backbone with only captions yields decent video quality metrics (e.g., Motion Smoothness 98.48, Aesthetic Quality 56.75), but a very weak Semantic Alignment Score (26.04) and User Preference Rate (6.9%). This indicates that many semantic controls are difficult to express effectively with coarse text alone.
  • Condition-Specific Methods (CogVideoX-I2V (LoRA)): LoRA fine-tuning achieves a strong Semantic Alignment Score (68.60), significantly better than the base DiT. However, it slightly harms base quality (e.g., Motion Smoothness drops to 98.34 from 98.48 for CogVideoX-I2V, Aesthetic Quality to 54.23 from 56.75), suggesting overfitting to specific conditions. Moreover, it still lags in User Preference Rate (13.1%), and its inherent limitation is the need for a separate model per condition and lack of zero-shot generalization.
  • Commercial Models (Kling / Vidu): These closed-source commercial models achieve excellent performance, with a Semantic Alignment Score of 74.02 and a User Preference Rate of 38.2%. They represent the state-of-the-art for condition-specific or task-specific approaches.
  • Video-As-Prompt (VAP): VAP significantly outperforms all open-source baselines on most metrics. It achieves the highest Text CLIP Score (24.13), best Motion Smoothness (98.59), and a Semantic Alignment Score (70.44) that is substantially higher than LoRA and close to commercial models. Critically, VAP achieves the highest User Preference Rate (38.7%), surpassing even the commercial models, establishing itself as the new state-of-the-art for open-source methods. This is particularly impressive given that VAP is a single unified model, unlike the condition-specific commercial models.
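
To make the metrics above concrete, the following is a minimal sketch of how a Text CLIP Score of this kind is typically computed: sample frames from the generated video, embed them and the caption with an off-the-shelf CLIP model, and average the frame-caption cosine similarities. The specific checkpoint, frame-sampling strategy, and 100× scaling are assumptions here, not the paper's exact evaluation protocol.

```python
# A minimal sketch of a frame-averaged Text CLIP Score (assumption: the paper's
# protocol may use a different CLIP variant, frame sampling, or scaling).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_clip_score(frames, caption):
    """frames: list of PIL.Image frames sampled from the generated video."""
    inputs = processor(text=[caption], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # cosine similarity of each frame to the caption, averaged and scaled by 100
    return (100.0 * (img @ txt.T)).mean().item()
```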

6.1.2. User Study (Table 1)

The user study corroborates the quantitative results. VAP achieved an overall 38.7% preference rate, which is the highest among all compared methods, narrowly beating the commercial Kling / Vidu models (38.2%). This strong human preference underscores VAP's ability to produce coherent, semantically aligned videos with high visual quality in a unified manner.

6.1.3. Qualitative Comparison (Figure 6)

Figure 6: Qualitative comparison with VACE [34], CogVideoX (I2V) [76], CogVideoX-LoRA (I2V), and commercial models [40, 68]; VACE (*) uses a *-form condition (top left). More visualizations are on the project page.

Figure 6 provides qualitative examples that visually confirm the quantitative findings:

  • VACE: Exhibits pixel-mapping bias. For instance, when given a reference video of a Statue of Liberty imitating a sheep, VACE copies unwanted appearance details or layout directly from the reference. This leads to artifacts or unintended visual elements being "copied" into the target, demonstrating its ill-suitability for semantic control. This artifact weakens when depth or optical flow are used as conditions, as they progressively remove appearance details.
  • CogVideoX-I2V: While capable of generating plausible videos, it often fails to capture or align with the abstract semantics, leading to visually good but semantically inaccurate outputs.
  • CogVideoX-I2V (LoRA): Shows improved semantic alignment compared to the base DiT, but it can still suffer from overfitting and lacks the fine-grained control or generalization of VAP. It requires a separate model for each condition.
  • VAP: Produces videos with superior temporal coherence, visual quality, and semantic consistency. It effectively transfers the desired semantic (e.g., specific concept, style, motion) to the target without copy artifacts, while maintaining the target's identity. VAP's outputs are visually competitive with condition-specific commercial models like Kling and Vidu.

6.1.4. Zero-Shot Generalization (Figure 7)

Figure 7: Zero-Shot Performance. Given semantic conditions unseen in VAP-Data (left column), VAP still transfers the abstract semantic pattern to the reference image in a zero-shot manner.

A significant finding is VAP's strong zero-shot generalization capability. By treating all semantic conditions as unified video prompts, VAP can handle diverse tasks. Furthermore, when presented with an entirely unseen semantic reference that was not part of the VAP-Data training set (e.g., "crumble," "dissolve," "levitate," "melt"), VAP successfully transfers the abstract semantic pattern to a new reference image. This demonstrates the power of its in-context learning ability and its potential for truly general-purpose controllable video generation.

6.2. Ablation Studies / Parameter Analysis

The paper conducts several ablation studies to validate the effectiveness of VAP's key architectural components and design choices.

The following are the results from Table 7 of the original paper:

| Variant | Text CLIP Score ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Reference Alignment Score ↑ |
| --- | --- | --- | --- | --- | --- |
| **In-Context Generation Structure** | | | | | |
| $u_{\Theta}^{s}$ (Single-Branch Finetuning) | 23.03 | 97.97 | 70.83 | 56.93 | 68.74 |
| $u_{\Theta}^{sl}$ (Single-Branch LoRA) | 23.12 | 98.25 | 72.92 | 57.19 | 69.28 |
| $u_{\Theta}^{uc}$ (Unidir-Cross-Attn) | 22.96 | 97.94 | 66.67 | 56.88 | 67.16 |
| $u_{\Theta}^{ua}$ (Unidir-Addition) | 22.37 | 97.63 | 62.50 | 56.91 | 55.99 |
| **Position Embedding Design** | | | | | |
| $u_{\Theta}^{i}$ (Identical PE) | 23.17 | 98.49 | 70.83 | 57.09 | 68.98 |
| $u_{\Theta}^{n}$ (Neg. shift in T, W) | 23.45 | 98.53 | 72.92 | 57.31 | 69.05 |
| **Scalability** | | | | | |
| $u_{\Theta}(\text{1K})$ | 22.84 | 92.12 | 60.42 | 56.77 | 63.91 |
| $u_{\Theta}(\text{10K})$ | 22.87 | 94.89 | 64.58 | 56.79 | 66.28 |
| $u_{\Theta}(\text{50K})$ | 23.29 | 96.72 | 70.83 | 56.82 | 68.23 |
| $u_{\Theta}(\text{100K})$ | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 |
| **DiT Structure** | | | | | |
| $u_{\mathrm{Wan}}$ (Wan2.1-I2V-14B) | 23.93 | 97.87 | 79.17 | 58.09 | 70.23 |
| **In-Context Expert Transformer Layer Distribution‡** | | | | | |
| $u_{\Theta}(\mathcal{L}_{\mathrm{odd}})$ | 24.05 | 98.52 | 75.00 | 57.58 | 70.22 |
| $u_{\Theta}(\mathcal{L}_{\mathrm{odd}, \le \lfloor 0.5 N_l \rfloor})$ | 23.72 | 98.19 | 70.83 | 56.71 | 69.61 |
| $u_{\Theta}(\mathcal{L}_{\mathrm{first\text{-}half}})$ | 23.90 | 98.41 | 75.00 | 57.18 | 69.94 |
| $u_{\Theta}(\mathcal{L}_{\mathrm{first\text{-}last}})$ | 23.96 | 98.33 | 72.92 | 57.06 | 70.02 |
| **Video Prompt Representation** | | | | | |
| $u_{\Theta}$ (noisy reference) | 23.98 | 98.41 | 75.00 | 57.42 | 70.18 |
| **Ours** | | | | | |
| $u_{\Theta}$ (VAP) | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 |

(Motion Smoothness, Dynamic Degree, and Aesthetic Quality are grouped under "Overall Quality" in the original table.)

Table 7: Ablation Study. We verify the effectiveness of our MoT structure, temporal-biased RoPE, the full dataset, and the transferability in different DiTs. The bottom row reports our full model. Notation: $u_{\Theta}$ (our VAP parameterized by $\Theta$), $u_{\Theta}^{s}$ (single-branch finetuning), $u_{\Theta}^{sl}$ (single-branch LoRA), $u_{\Theta}^{uc}$ (unidirectional cross-attention), $u_{\Theta}^{ua}$ (unidirectional addition), $u_{\Theta}^{i}$ (identical position embedding), $u_{\Theta}^{n}$ (negative temporal/width shifts of position embedding), $u_{\mathrm{Wan}}$ (Wan2.1 as DiT backbone). Scale: $u_{\Theta}(M)$ with $M \in \{1\text{K}, 10\text{K}, 50\text{K}, 100\text{K}\}$. MoT layers: $u_{\Theta}(\mathcal{L})$ for selected layers $\mathcal{L} \subseteq [N_l] = \{1, \dots, N_l\}$ of the backbone with $N_l$ Transformer layers. We define $\mathcal{L}_{\mathrm{first\text{-}half}} = \{1, 2, \dotsc, \lfloor 0.5 N_l \rfloor\}$, $\mathcal{L}_{\mathrm{first\text{-}last}} = \{1, N_l\}$, $\mathcal{L}_{\mathrm{odd}, \le \lfloor 0.5 N_l \rfloor} = \{1, 3, \dots, \lfloor 0.5 N_l \rfloor\}$ and $\mathcal{L}_{\mathrm{odd}} = \{1, 3, \dots, N_l\}$. Our full model uses all 100K data and all MoT layers (for CogVideoX-I2V-5B-based VAP).

6.2.1. In-Context Generation Structure

The study compares four variants of in-context generation against the full VAP (MoT with Frozen DiT + Trainable Expert + Full Attention):

  • A1. $u_{\Theta}^{s}$ (Single-Branch Finetuning): Expanding the pre-trained DiT input sequence to include reference text, reference video, target text, and target video, then finetuning the full model. This approach yielded decent Semantic Alignment (68.74), but VAP (70.44) still outperforms it. This indicates that while finetuning can learn in-context patterns, MoT is superior at preserving the base DiT's generative ability while enabling plug-and-play control, addressing catastrophic forgetting.

  • A2. $u_{\Theta}^{sl}$ (Single-Branch LoRA Finetuning): Similar to A1 but freezing the backbone and training only LoRA layers. LoRA helps retain the backbone's ability and achieves a slightly better Semantic Alignment (69.28) than full finetuning. However, its limited capacity struggles with the complexity of in-context semantic generation, resulting in suboptimal overall performance compared to VAP (70.44).

  • A3. $u_{\Theta}^{uc}$ (Unidirectional Cross-Attention): Freezing the pre-trained DiT, adding a new branch (initialized with the same weights as the backbone), and injecting its features via layer-wise cross-attention unidirectionally (reference to target). This leads to worse Semantic Alignment (67.16) than VAP, highlighting that the bidirectional information exchange through full attention in MoT (where reference and target representations adapt synchronously) is crucial for improving semantic alignment.

  • A4. $u_{\Theta}^{ua}$ (Unidirectional Addition): Similar to A3 but injecting features via residual addition. This performs the worst in terms of Semantic Alignment (55.99). Even with retraining, residual-addition methods rely on a rigid pixel-to-pixel mapping, which mismatches semantic-controlled generation and severely degrades performance, as shown in Figure 5(a) and quantitatively in Table 1 (VACE results).

    Conclusion: The MoT architecture used in VAP (with a frozen DiT and a trainable expert communicating via full attention) is the most effective structure for in-context semantic-controlled video generation, balancing generative ability preservation with robust contextual guidance.
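
To illustrate the structural difference these variants probe, below is a minimal single-head sketch of MoT-style joint attention: the frozen backbone and the trainable expert keep separate projection weights, but attention runs over the concatenated target + reference sequence, so both streams exchange information bidirectionally (unlike the one-way cross-attention of A3 or the residual addition of A4). Module names, the single-head simplification, and the omission of RoPE and text tokens are assumptions for brevity, not the paper's implementation.

```python
# A minimal sketch of MoT-style joint attention with a frozen backbone branch
# and a trainable expert branch (single head, no RoPE, no text tokens).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointMoTAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.frozen_qkv = nn.Linear(dim, 3 * dim)   # pre-trained DiT projections (kept frozen)
        self.expert_qkv = nn.Linear(dim, 3 * dim)   # trainable expert projections
        for p in self.frozen_qkv.parameters():
            p.requires_grad = False

    def forward(self, target_tokens, reference_tokens):
        # Each branch projects its own tokens with its own weights...
        qt, kt, vt = self.frozen_qkv(target_tokens).chunk(3, dim=-1)
        qr, kr, vr = self.expert_qkv(reference_tokens).chunk(3, dim=-1)
        # ...but attention runs over the concatenated sequence, so reference and
        # target representations adapt to each other in both directions.
        q = torch.cat([qt, qr], dim=1)
        k = torch.cat([kt, kr], dim=1)
        v = torch.cat([vt, vr], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        n_t = target_tokens.shape[1]
        return out[:, :n_t], out[:, n_t:]   # split back into target / reference streams
```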

6.2.2. Position Embedding Designs

  • $u_{\Theta}^{i}$ (Identical PE): Applying identical RoPE to both the reference and target videos enforces an unrealistic pixel-wise alignment prior. This results in a lower Semantic Alignment Score (68.98) compared to VAP (70.44). This validates the motivation for a specialized position embedding for semantic control.

  • uΘnu_{\Theta}^{n} (Neg. shift in T, W): In this variant, in addition to a temporal bias Δ\varDelta, a width bias is also introduced by placing the reference tokens before the target along the width axis (following in-context image generation [66]). This yielded slightly lower Semantic Alignment (69.05) than VAP. This suggests that while temporal bias is crucial, a width bias (negative shift in W) may not be beneficial for video contexts, possibly interfering with spatial consistency or being less necessary than for static images.

    Conclusion: The specific temporally biased RoPE (temporal shift only) used in VAP is crucial for eliminating spurious pixel-mapping priors and achieving optimal semantic alignment.
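
A minimal sketch of the position-index layout this ablation supports is shown below: reference and target tokens share identical spatial (height, width) indices, and only the temporal index is offset so the reference sits strictly before the target in time. The latent grid sizes and the exact value of the temporal bias are illustrative assumptions.

```python
# A minimal sketch of temporally biased RoPE indices (assumption: positions are
# (t, h, w) triplets per token; grid sizes and the bias value are illustrative).
import torch

def rope_positions(n_frames, height, width, t_offset=0):
    t = torch.arange(n_frames) + t_offset
    h = torch.arange(height)
    w = torch.arange(width)
    grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)
    return grid.reshape(-1, 3)           # one (t, h, w) index per token

F_ref, F_tgt, H, W = 13, 13, 30, 45      # latent sizes, illustrative only
delta = F_ref                            # temporal bias: reference occupies earlier time slots
ref_pos = rope_positions(F_ref, H, W, t_offset=0)
tgt_pos = rope_positions(F_tgt, H, W, t_offset=delta)
# Identical spatial indices avoid any width/height bias; the temporal shift alone
# removes the spurious frame-to-frame (pixel-wise) correspondence prior.
```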

6.2.3. Scalability

The ablation study on dataset size (1K, 10K, 50K, 100K samples from VAP-Data) shows a clear trend: VAP's performance improves across all metrics as the training data grows.

  • From 1K samples, Semantic Alignment Score is 63.91, Motion Smoothness is 92.12.

  • With the full 100K samples, Semantic Alignment Score reaches 70.44, and Motion Smoothness reaches 98.59.

    Conclusion: This demonstrates the strong scalability of VAP's unified design, which treats reference videos as prompts without task-specific modifications, combined with the MoT framework that preserves the backbone's generative capacity.

6.2.4. DiT Structure

To test transferability, VAP was implemented on Wan2.1-I2V-14B, a stronger DiT backbone, with an expert of comparable parameter count (~5B) to the CogVideoX-I2V-5B version. The Wan2.1-based VAP ($u_{\mathrm{Wan}}$) shows:

  • Improved Dynamic Degree (79.17 vs. 77.08 for CogVideoX-based VAP) and Aesthetic Quality (58.09 vs. 57.71). This benefit likely comes from Wan2.1's stronger base generative capabilities.

  • Slightly worse Reference Alignment Score (70.23 vs. 70.44 for CogVideoX-based VAP). This could be attributed to the Wan2.1 variant having MoT expert layers inserted across only 1/4 of the backbone's layers, potentially leading to less comprehensive in-context interaction compared to the CogVideoX variant, where MoT was applied to all layers.

    Conclusion: VAP is transferable across different DiT architectures, with performance generally benefiting from a stronger backbone. The distribution of MoT layers (how many and which layers) can influence the balance between generation quality and semantic alignment.

6.2.5. In-Context Expert Transformer Layer Distribution

The study analyzes how different layer distributions for the MoT expert affect in-context control:

  • $u_{\Theta}(\mathcal{L}_{\mathrm{odd}})$ (all odd layers): Achieves a Semantic Alignment Score of 70.22, close to the full model (70.44), with good Overall Quality.

  • $u_{\Theta}(\mathcal{L}_{\mathrm{first\text{-}half}})$ (first-half layers): Semantic Alignment of 69.94.

  • $u_{\Theta}(\mathcal{L}_{\mathrm{first\text{-}last}})$ (first and last layers only): Semantic Alignment of 70.02. This outperforms $u_{\Theta}(\mathcal{L}_{\mathrm{first\text{-}half}})$, suggesting that balanced feature interaction across both early and late layers (for capturing low-level and high-level semantics) is important.

  • $u_{\Theta}(\mathcal{L}_{\mathrm{odd}, \le \lfloor 0.5 N_l \rfloor})$ (odd layers in the first half): Semantic Alignment of 69.61. This is lower than $u_{\Theta}(\mathcal{L}_{\mathrm{odd}})$, further supporting that a wider distribution of expert layers generally improves performance.

    Conclusion: A more balanced distribution of MoT expert layers throughout the DiT backbone generally leads to better generation quality and semantic alignment. While reducing layers can improve efficiency, it may compromise performance.
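
Written out explicitly, the layer subsets compared above look roughly as follows; the sketch uses 0-based layer indices over the 42-layer CogVideoX backbone (the paper's set notation is 1-based), which is an assumption about how the sets map onto concrete layer ids.

```python
# A minimal sketch of the ablated MoT layer subsets (assumption: 0-based indexing
# over the 42-layer CogVideoX-I2V-5B backbone).
N_l = 42
L_all        = list(range(N_l))
L_odd        = list(range(0, N_l, 2))       # every other layer
L_odd_half   = list(range(0, N_l // 2, 2))  # every other layer, first half only
L_first_half = list(range(N_l // 2))
L_first_last = [0, N_l - 1]

def has_expert(layer_idx, selected):
    # During the forward pass, the trainable MoT expert branch is only active in
    # the selected layers; elsewhere the frozen DiT layer runs alone.
    return layer_idx in selected
```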

6.2.6. Video Prompt Representation

Inspired by Diffusion Forcing [9, 23, 64], the authors experimented with injecting noise into the video prompt representation.

  • Result: This approach often led to severe artifacts and degraded generation quality.

  • Reasoning: Unlike long-video generation in Diffusion Forcing (where the goal is often to avoid copy-paste or overly static results, and noise can help introduce diversity), VAP's reference videos already differ significantly in appearance and layout from the target. Adding noise to the video prompt in this context corrupts the crucial contextual information needed for semantic transfer, thus hindering rather than helping.

    Conclusion: For VAP's semantic control paradigm, a clean and informative video prompt representation is essential; adding noise to it is detrimental.
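
For clarity, the "noisy reference" variant corresponds roughly to the corruption sketched below, where the clean reference latents are mixed with Gaussian noise before being used as the video prompt; the exact noise schedule and injection point are assumptions.

```python
# A minimal sketch of the "noisy reference" variant (assumption: Diffusion-Forcing-
# style corruption approximated as a simple variance-preserving noise mix).
import torch

def noisy_video_prompt(ref_latents: torch.Tensor, alpha_bar: float) -> torch.Tensor:
    """ref_latents: clean reference-video latents; alpha_bar in (0, 1]."""
    noise = torch.randn_like(ref_latents)
    return (alpha_bar ** 0.5) * ref_latents + ((1 - alpha_bar) ** 0.5) * noise

# VAP keeps the reference latents clean (alpha_bar effectively 1): corrupting the
# prompt degrades the contextual signal needed for semantic transfer.
```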

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Video-As-Prompt (VAP), a novel and unified framework that addresses the longstanding challenge of generalizable semantic control in video generation. VAP reframes this problem as in-context generation, using a reference video as a direct semantic prompt. Its core technical contributions include a plug-and-play Mixture-of-Transformers (MoT) expert that guides a frozen Video Diffusion Transformer (DiT) while preventing catastrophic forgetting, and a temporally biased position embedding that eliminates spurious pixel-wise mapping priors for robust context retrieval.

To support this innovation, the authors built VAP-Data, the largest dataset for semantic-controlled video generation, comprising over 100K paired videos across 100 semantic conditions. Extensive experiments demonstrate that VAP, as a single unified model, achieves state-of-the-art performance among open-source methods, rivaling leading condition-specific commercial models with a 38.7% user preference rate. Its strong zero-shot generalization and versatility across various downstream applications mark a significant advance toward general-purpose, controllable video generation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • VAP-Data Limitations:

    • Synthetic Nature: VAP-Data is generated using visual effects templates from commercial models and community LoRAs. This means the dataset is synthetic and may inherit specific stylistic biases, artifacts, and conceptual limitations of its source templates. For instance, if the source models struggle with generating realistic hands, VAP trained on this data might also perform poorly in that aspect.
    • Limited Real-World Diversity: While large, the semantic conditions in VAP-Data are still relatively limited in terms of real-world complexity and diversity.
    • Future Work: The construction of a truly large-scale, real-world, semantic-controlled video dataset is a crucial next step, albeit beyond the scope of this paper.
  • Influence of Caption and Reference Quality:

    • Caption Accuracy: VAP relies on standard video-description captions for both reference and target videos to aid semantic transfer. Inaccurate semantic descriptions in captions can degrade generation quality.
    • Subject Mismatch: Large structural mismatches between the main subjects of the reference and target images/videos can also negatively impact performance. VAP performs best when captions align and subjects are structurally similar (Figure 14).
    • Future Work: Exploring instruction-style captions (e.g., "Please follow the Ghibli style of the reference video") may more effectively capture intended semantics and improve control.
  • Multiple Reference Videos:

    • The paper experimented with supplying multiple semantically matched reference videos but found similar results to using a single reference.
    • A significant issue observed with multiple references is the potential for the model to blend unwanted visual details across videos (Figure 15). This is hypothesized to stem from the general-purpose captions lacking explicit semantic referents. When multiple references introduce conflicting or divergent visual elements (e.g., different types of falling objects), the model might create a mixed, undesirable output.
    • Future Work: Developing a more effective multi-reference control mechanism (e.g., a tailored RoPE for multi-reference conditions) or using instruction-style captions that explicitly specify the intended referent is needed to mitigate this issue. A full study of model and caption design for multi-reference training is left for future research.
  • Efficiency:

    • Increased Inference Cost: The plug-and-play MoT approach, while avoiding backbone retraining, introduces additional parameters and computation. This leads to higher memory usage and longer runtime (inference time roughly doubles).
    • Future Work: Performance optimizations like sparse attention [13, 80] or pruning [15, 73] are orthogonal to the core contributions and could be explored in future work to improve efficiency.

7.3. Personal Insights & Critique

VAP represents a truly compelling and elegant paradigm shift for controllable video generation. The core innovation of treating reference videos as in-context prompts is brilliant, moving beyond the limitations of pixel-aligned conditions and task-specific architectures. This approach feels much more aligned with how humans learn and abstract concepts – by example rather than by explicit, granular instructions.

Inspirations and Applications:

  • Unified Control: The success of VAP as a single, unified model for diverse semantic controls is highly inspiring. It suggests a path towards foundational generative video models that can adapt to a vast range of user intentions without constant retraining. This could revolutionize creative industries, making complex visual effects and stylization accessible to a broader audience.
  • Plug-and-Play Extensibility: The MoT architecture is a key enabler. Its plug-and-play nature means that as new, more powerful Video Diffusion Transformers are developed, VAP's control capabilities can be readily integrated, leveraging the advancements of the backbone without catastrophic forgetting. This modularity is a robust design principle.
  • Data Bootstrapping: The VAP-Data creation strategy, leveraging existing commercial and community models to bootstrap a large dataset for a novel task, is a clever workaround for data scarcity. This methodology could be transferable to other nascent fields of generative AI where large, high-quality paired datasets are difficult to acquire.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • "Synthetic Trap" of VAP-Data: While necessary, the synthetic nature of VAP-Data is a double-edged sword. As the authors admit, VAP might inherit stylistic biases or artifacts from the generative models used for data creation. This raises questions about the ultimate realism and diversity of outputs for semantics not perfectly captured by the source models. Future work will need to carefully consider how to augment this with real-world data or domain adaptation techniques.

  • Interpretation of "Semantic Alignment": The Semantic Alignment Score relies on VLMs (Gemini-2.5-Pro, GPT-5). While validated by user studies, the interpretability and potential biases of these VLMs themselves could influence the score. A deeper dive into how these VLMs make their judgments for complex video semantics would be valuable.

  • General-Purpose Captions vs. Instruction-Style: The paper notes that standard video descriptions were used to stay close to the DiT's pre-training data. However, the limitation analysis suggests that instruction-style captions might be more effective. This implies an unexplored design space for prompt engineering (textual) in conjunction with video prompts. Integrating instruction tuning principles could unlock even finer control.

  • Multi-Reference Challenges: The issue of blending unwanted visual details with multiple references is significant. It highlights that while VAP is great at transferring a semantic, controlling which part of which reference's semantic to apply, especially when references diverge, remains an open problem. This likely requires more explicit attention mechanisms or gating networks to disentangle and select specific semantic components from multiple contexts.

  • Efficiency for Production: While acceptable for research, the doubled inference time might be a hurdle for real-time applications or large-scale commercial deployments without further optimization. This is a common trade-off with increased model complexity for control.

    Overall, VAP makes a profound contribution by offering a unified and highly generalizable approach to semantic-controlled video generation. It sets a new benchmark for open-source models and provides a robust framework that can evolve with stronger DiT backbones and more sophisticated in-context learning strategies.
