Video-As-Prompt: Unified Semantic Control for Video Generation
TL;DR Summary
The Video-As-Prompt (VAP) paradigm redefines semantic control in video generation by using reference videos as prompts, combined with a Mixture-of-Transformers architecture. VAP builds the largest dataset of its kind, VAP-Data, and achieves a 38.7% user preference rate while showcasing strong zero-shot generalization.
Abstract
Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is Video-As-Prompt: Unified Semantic Control for Video Generation. This title highlights a novel paradigm (Video-As-Prompt) for achieving consistent and generalizable semantic control in video generation.
1.2. Authors
The authors of this paper are:
- Yuxuan Bian
- Xin Chen
- Zenan Li
- Tiancheng Zhi
- Shen Sang
- Linjie Luo
- Qiang Xu

Their affiliations include the Intelligent Creation Lab, ByteDance and The Chinese University of Hong Kong. Xin Chen is noted as the Project Lead, and Xin Chen and Qiang Xu are the Corresponding Authors.
1.3. Journal/Conference
The paper is published as an arXiv preprint with the identifier arXiv:2510.20888. While arXiv is a preprint server and not a peer-reviewed journal or conference, papers often appear there before or during the review process for major conferences (e.g., CVPR, ICCV, NeurIPS, ICLR) or journals. Given the technical depth and the mentions of a project page, it is likely intended for a top-tier computer vision or machine learning conference or journal, which are highly reputable and influential in the field of generative AI.
1.4. Publication Year
The paper was published on 2025-10-23 (UTC).
1.5. Abstract
The paper addresses the critical challenge of achieving unified and generalizable semantic control in video generation. Existing methods suffer from limitations such as introducing artifacts due to inappropriate pixel-wise priors from structure-based controls or relying on costly, non-generalizable, condition-specific finetuning or task-specific architectures.
The authors introduce Video-As-Prompt (VAP), a new paradigm that reframes semantic control as an in-context generation problem. VAP uses a reference video as a direct semantic prompt to guide a frozen Video Diffusion Transformer (DiT). This guidance is facilitated by a plug-and-play Mixture-of-Transformers (MoT) expert architecture, which prevents catastrophic forgetting. A temporally biased position embedding is employed to eliminate spurious mapping priors and ensure robust context retrieval.
To support this research, the authors created VAP-Data, the largest dataset for semantic-controlled video generation to date, comprising over 100K paired videos across 100 semantic conditions.
VAP, as a single unified model, achieves state-of-the-art results among open-source methods, demonstrating a 38.7% user preference rate that competes with leading condition-specific commercial models. Its strong zero-shot generalization capabilities and support for various downstream applications mark a significant advancement toward general-purpose, controllable video generation.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2510.20888
PDF Link: https://arxiv.org/pdf/2510.20888v1.pdf
This paper is currently published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the lack of a unified and generalizable framework for semantic-controlled video generation. While structure-controlled video generation (where control signals like depth maps or poses are pixel-aligned with the target video) is well-studied and unified frameworks exist, semantic control presents a different challenge. Semantic control involves guiding video generation based on abstract concepts such as style, motion, or camera movement, where there is no direct pixel-wise correspondence between the control signal and the output video.
This problem is important because it limits applications in crucial areas like visual effects, video stylization, motion imitation, and camera control. Existing methods for semantic-controlled video generation face several challenges:
- Artifacts from Structure-Based Methods: Directly migrating unified structure-controlled methods (which rely on pixel-aligned conditions) often leads to artifacts. These methods enforce inappropriate pixel-wise mapping priors (assumptions that each pixel in the input condition corresponds to a specific pixel in the output), which are not valid for semantic conditions.
- Condition-Specific Overfitting: Many approaches finetune large backbones or LoRAs (Low-Rank Adaptations) for each specific semantic condition (e.g., "Ghibli style" or "Hitchcock camera zoom"). This is extremely costly in terms of computational resources and time, and it makes the models non-generalizable to new, unseen semantic conditions.
- Task-Specific Design: Other methods craft task-specific modules or inference strategies for each type of semantic control (e.g., style, motion, camera). While effective for their specific task, these designs hinder the development of a single, unified model that can handle diverse semantic controls and lack zero-shot generalizability to novel tasks or conditions.

The paper's entry point and innovative idea revolve around reframing semantic control as an in-context generation problem. Instead of pixel-aligned conditions, per-condition training, or task-specific designs, Video-As-Prompt (VAP) treats a reference video containing the desired semantics as a direct "video prompt" to guide the generation process. This approach is inspired by the success of Diffusion Transformers (DiTs) in supporting strong in-context control abilities in image and structure-controlled video generation.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of controllable video generation:
- VAP: A Unified Semantic-Controlled Video Generation Paradigm: The authors introduce Video-As-Prompt (VAP), the first unified framework for semantic-controlled video generation under non-pixel-aligned conditions. VAP treats a reference video with the desired semantics as a video prompt, enabling generalizable in-context control across diverse semantic conditions. This removes the inappropriate pixel-wise mapping prior found in structure-based controls and avoids per-condition training or per-task model designs.
- Plug-and-Play In-Context Video Generation Framework: VAP proposes a plug-and-play framework built on a Mixture-of-Transformers (MoT) architecture. This design augments any frozen Video Diffusion Transformer (DiT) with a trainable parallel expert transformer that interprets the video prompt and guides generation. The architecture prevents catastrophic forgetting of the pre-trained DiT's generative abilities, supports various downstream tasks (e.g., concept, style, motion, and camera guidance), and delivers strong zero-shot generalizability to unseen semantic conditions. A temporally biased position embedding is also introduced to eliminate spurious mapping priors for robust context retrieval.
- VAP-Data: Largest Dataset for Semantic-Controlled Video Generation: The authors construct and release VAP-Data, a large-scale dataset specifically designed for semantic-controlled video generation. With over 100K curated paired samples across 100 semantic conditions, VAP-Data provides a robust foundation for training and evaluating unified semantic-controlled video generation models, addressing a significant data gap in the field.

The key conclusions and findings are:
- VAP, as a single unified model, achieves state-of-the-art performance among open-source methods, with a 38.7% user preference rate that is competitive with leading condition-specific commercial models. This demonstrates that a unified approach can rival highly specialized models.
- The MoT architecture effectively preserves the base DiT's generative ability while enabling stable in-context control, overcoming issues like catastrophic forgetting.
- The temporally biased Rotary Position Embedding (RoPE) is essential for removing unrealistic pixel-wise alignment priors, leading to improved performance for semantic control.
- VAP exhibits strong zero-shot generalization to unseen semantic conditions and supports various downstream applications, indicating a significant step toward truly general-purpose controllable video generation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Video-As-Prompt (VAP) paper, a foundational understanding of several key concepts in deep learning and generative AI is essential:
- Diffusion Models: These are a class of generative models that learn to reverse a gradual noising process. In essence, they start with random noise and gradually transform it into a meaningful data sample (e.g., an image or video) by learning to denoise at each step.
  - Forward Process (Diffusion): Data (e.g., a clear image) is progressively corrupted by adding Gaussian noise over several timesteps until it becomes pure noise.
  - Reverse Process (Denoising/Generation): A neural network is trained to predict and remove the noise added at each step, effectively learning to transform noise back into coherent data.
  - Latent Diffusion Models (LDMs): Instead of working directly in pixel space, LDMs operate in a compressed latent space. A Variational Autoencoder (VAE) is typically used to encode high-resolution images/videos into lower-dimensional latents, and the diffusion process occurs in this latent space. This makes training more efficient and stable.
- Transformers: Originally developed for natural language processing, Transformers are neural network architectures characterized by their self-attention mechanisms.
  - Self-Attention: This mechanism allows the model to weigh the importance of different parts of the input sequence (e.g., words in a sentence, patches in an image/video) when processing each part. It captures long-range dependencies effectively (see the sketch after this list).
  - Query (Q), Key (K), Value (V): In self-attention, the input is linearly transformed into three matrices: Query, Key, and Value. The Query interacts with Keys to determine attention weights, which are then applied to Values to produce the output. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, $d_k$ is the dimension of the key vectors, and the softmax normalizes the attention scores.
  - Multi-Head Attention: Multiple attention heads are run in parallel, each learning different representations, and their outputs are concatenated and linearly transformed.
- Diffusion Transformers (DiTs): These models combine the power of Diffusion Models with the Transformer architecture. Instead of the U-Net architectures often used in diffusion models, DiTs use Transformers to process the latent representations of noisy images or videos.
  - Video DiTs extend this to the temporal dimension, processing video latents as sequences of spatial patches over time. They are known for their scalability and ability to handle large input contexts.
- Rotary Position Embedding (RoPE): A type of position embedding used in Transformers to inject positional information into the input sequence. Unlike absolute or relative position embeddings that add values to token embeddings, RoPE rotates Query and Key vectors based on their absolute position. This allows self-attention to implicitly incorporate relative positional information, which is beneficial for in-context learning.
- Variational Autoencoders (VAEs): As mentioned with LDMs, VAEs are neural networks used for dimensionality reduction and data generation. They consist of an encoder that maps input data (e.g., a video) to a lower-dimensional latent space and a decoder that reconstructs the data from the latent space. For DiTs, VAEs are crucial for converting high-resolution videos into computationally manageable latent representations.
- Flow Matching: A generative modeling technique that trains a neural network to predict a vector field that transports samples from a simple base distribution (e.g., Gaussian noise) to a complex data distribution (e.g., real videos). Instead of learning to reverse a stochastic diffusion process (as in traditional diffusion models), Flow Matching learns a continuous deterministic flow that maps noise to data. This can lead to more stable training and efficient sampling.
- In-Context Learning: A paradigm where a model learns a new task by observing examples provided directly in the input prompt, without explicit finetuning of the model's weights. The model uses its learned patterns and internal representations to generalize from the provided examples to the new task. For video generation, this means providing a reference video that implicitly specifies the desired style, motion, or concept, and the model generates a new video following that context.
- Mixture-of-Experts (MoE) / Mixture-of-Transformers (MoT): An architecture where a network comprises several "expert" sub-networks. A gating network learns to select or combine the outputs of these experts based on the input. MoT applies this concept to Transformers, where different Transformer experts can specialize in different tasks or input modalities. In this paper, MoT is used as a plug-and-play mechanism where a parallel expert Transformer handles the reference prompt while the main DiT backbone handles the target generation, preventing catastrophic forgetting.
- Catastrophic Forgetting: A common problem in machine learning where a model, when trained on a new task, tends to forget previously learned information or tasks. This is a significant challenge when trying to extend a pre-trained model to new functionalities without losing its original capabilities.
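For readers new to attention, the following is a minimal PyTorch sketch of the scaled dot-product attention formula quoted above; it is illustrative only and not code from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(QK^T / sqrt(d_k)) V for a batch of sequences.

    q, k, v: tensors of shape (batch, seq_len, d_k).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ v                              # weighted sum of values

# Toy usage: 2 sequences of 8 tokens with 64-dim heads.
q = torch.randn(2, 8, 64); k = torch.randn(2, 8, 64); v = torch.randn(2, 8, 64)
out = scaled_dot_product_attention(q, k, v)         # (2, 8, 64)
```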
3.2. Previous Works
The paper contextualizes VAP by discussing existing approaches to controllable video generation, categorizing them broadly into Structure-Controlled and Semantic-Controlled.
3.2.1. Structure-Controlled Video Generation
- Definition: These methods are driven by pixel-aligned conditions, meaning there is a direct geometric or structural correspondence between the input control signal and the desired output video.
- Examples of Conditions: depth maps [14, 16], human pose skeletons [28, 36], segmentation masks [6, 75], and optical flow (representing motion vectors) [29, 35].
- Common Mechanism: Typically, an additional adapter or branch is added to a pre-trained DiT (or similar backbone) and its features are injected via residual addition. This mechanism effectively exploits the pixel-wise mapping prior (the assumption that each pixel in the condition corresponds to a specific pixel in the output).
- Unified Frameworks: Recent works like VACE [34] and FullDiT [37] have enabled all-in-one structure-controlled generation by treating these diverse pixel-aligned inputs as a unified spatial condition.
3.2.2. Semantic-Controlled Video Generation
- Definition: These methods deal with conditions that lack pixel-wise correspondence with the target videos. The control is based on abstract semantics.
- Examples of Conditions: concept transformation (e.g., turning a person into a Ladudu doll) [26, 40, 47], stylization (e.g., Ghibli style) [31, 78], camera movement (e.g., Hitchcock dolly zoom) [2, 3], and motion transfer (sharing motion but differing in layout or skeleton) [19, 71, 82].
- Prior Approaches and Their Limitations (as shown in Figure 2):
  - Condition-Specific Overfit (Figure 2(b)): Methods like VFX Creator [47] or community LoRAs from Civitai [11] finetune DiT backbones or LoRAs for each specific semantic condition.
    - Drawbacks: This is computationally expensive, requires a separate model for every new semantic, and struggles to generalize to unseen conditions.
  - Task-Specific Design (Figure 2(c)): Methods like ReCamMaster [2] (camera control) or StyleMaster [78] (stylization) craft dedicated task-specific modules or inference strategies for a particular condition type. They often encode reference videos with the same semantics into a specially designed space to guide generation.
    - Drawbacks: While effective for their narrow task, these approaches are not unified, hinder the creation of general-purpose models, and lack zero-shot generalizability.
  - Concurrent Work [49]: A LoRA mixture-of-experts approach for unified generation, but it still learns each condition by overfitting subsets of parameters and fails to generalize to unseen ones.
3.2.3. Technological Evolution
Video generation has rapidly evolved:
- Early Stages: Dominated by Generative Adversarial Networks (GANs) [63, 69], which produced impressive but often less stable or controllable results.
- Diffusion Era: Diffusion models [8, 17] emerged as the state of the art, offering higher quality, stability, and control.
- Transformer-Based Architectures: The transition from convolutional architectures [7, 10] to transformer-based ones (DiTs) [8, 17, 70] dramatically increased scalability and performance, particularly for long-range dependencies.
- Controllable Generation: Initial DiTs primarily supported text prompts or first/last-frame control. This led to the development of task-specific modules [6, 34] or inference strategies [71, 82] to enable finer, user-defined control.
- In-Context Learning Influence: Recent successes in unified image generation [66] and structure-controlled video generation [37] using DiTs for in-context control motivated the authors to apply this paradigm to the more challenging semantic-controlled video generation.
3.3. Differentiation Analysis
VAP critically differentiates itself from prior methods in the following ways:
- Unified vs. Fragmented: Unlike the fragmented landscape of semantic-controlled video generation (which relies on condition-specific overfitting or task-specific designs), VAP proposes a single, unified model that can handle diverse semantic conditions (concept, style, motion, camera) without per-condition retraining or per-task model modifications.
- In-Context Control via Video Prompts: VAP introduces the novel concept of using a reference video itself as a direct semantic prompt. This is a significant departure from:
  - Structure-controlled methods that use pixel-aligned conditions (e.g., depth, pose) and residual addition, which VAP shows cause artifacts when applied to semantic tasks due to inappropriate pixel-wise priors (Figure 5(a)).
  - Semantic-controlled methods that either finetune for each condition or design task-specific modules, both of which lack generality.
- Plug-and-Play Architecture (MoT): VAP leverages a Mixture-of-Transformers (MoT) design to augment a frozen Video Diffusion Transformer (DiT). This plug-and-play approach is superior to finetuning the entire DiT (which can lead to catastrophic forgetting, as shown in the ablation studies) or using LoRAs (which have limited capacity for complex in-context generation). MoT enables the expert to learn from the reference prompt while preserving the backbone's pre-trained generative abilities.
- Temporally Biased Position Embedding: VAP addresses a subtle but crucial issue with Rotary Position Embedding (RoPE) in in-context generation. It identifies that sharing identical position embeddings between the reference and target can impose false pixel-level spatiotemporal mapping priors. By introducing a temporal bias, VAP eliminates this spurious prior, leading to improved performance and robust context retrieval. This is a unique contribution to the in-context learning paradigm for video.
- Zero-Shot Generalization: A key advantage of VAP's unified, in-context learning approach is its strong zero-shot generalizability to unseen semantic conditions, a capability largely absent in condition-specific or task-specific prior works.
- Dedicated Dataset (VAP-Data): The creation of VAP-Data, the largest dataset for semantic-controlled video generation with paired reference/target videos, is a crucial enabler for VAP and future research, addressing a major data scarcity issue in this specific area.

In essence, VAP moves beyond the limitations of pixel alignment and specialized architectures by formulating semantic control as a flexible, in-context prompting problem, enabled by a robust MoT and a carefully designed position embedding, trained on a purpose-built large-scale dataset.
4. Methodology
4.1. Principles
The core idea behind Video-As-Prompt (VAP) is to unify semantic-controlled video generation by treating a reference video with the desired semantics as a direct, task-agnostic "video prompt." This approach aims to overcome the limitations of prior methods, which either rely on pixel-aligned conditions (leading to artifacts for semantic tasks) or employ condition-specific finetuning or task-specific designs (hindering generalizability and unification).
VAP reframes the problem as in-context generation, where the model learns to infer and apply abstract semantic patterns from the reference video to a new target generation. The theoretical basis and intuition are rooted in the power of Diffusion Transformers (DiTs) for in-context learning, where the model can adapt its behavior based on contextual examples provided within the input.
To achieve this, VAP incorporates several key principles:
- Reference Video as Unified Prompt: Instead of relying on explicit, task-specific control signals, a video that demonstrates the desired semantic (e.g., a video in Ghibli style, or one showing a specific motion) serves as the universal input prompt.
- Plug-and-Play Architecture: To integrate this video prompt into a pre-trained Video Diffusion Transformer (DiT) without catastrophic forgetting (i.e., losing its original generative capabilities), VAP uses a Mixture-of-Transformers (MoT) design. A specialized "expert" module processes the reference prompt, while the frozen DiT backbone handles the main generation task, with controlled information flow between them.
- Temporal Context Preservation: Recognizing that position embeddings are crucial for transformers, VAP introduces a temporally biased Rotary Position Embedding (RoPE). This design ensures that the model correctly interprets the temporal relationship between the reference prompt and the target video, preventing the imposition of spurious pixel-wise mapping priors that are irrelevant for semantic control.
- Task-Agnostic Learning: By using a diverse dataset of reference video-target video pairs across various semantic conditions, VAP learns a general underlying principle for semantic control, allowing it to generalize zero-shot to unseen semantics.
4.2. Core Methodology In-depth (Layer by Layer)
The VAP framework operates by augmenting a frozen Video Diffusion Transformer (DiT) with a trainable Mixture-of-Transformers (MoT) expert, guided by a temporally biased position embedding.
4.2.1. Preliminary: Video Diffusion Models
Video diffusion models learn to reverse a noise process to generate videos. The paper illustrates this using Flow Matching [46]. In Flow Matching, a neural network is trained to predict a velocity field that transports samples from a simple noise distribution to a complex data distribution.
Formally, a noise sample $\mathbf{x_0}$ is denoised to a target video $\mathbf{x_1}$ along a path $\mathbf{x_t}$ defined as:
$
\mathbf{x_t} = t \mathbf{x_1} + (1 - (1 - \sigma_{min})t) \mathbf{x_0}
$
The velocity field along this path is given by the derivative of $\mathbf{x_t}$ with respect to $t$:
$
V_t = \frac{d\mathbf{x_t}}{dt} = \mathbf{x_1} - (1 - \sigma_{min}) \mathbf{x_0}
$
Here, $\sigma_{min}$ is a small constant that keeps a minimal amount of noise along the path, and $t \in [0, 1]$ is the timestep along the path.
The model, parameterized by $\boldsymbol{\Theta}$ and denoted as $u_{\boldsymbol{\Theta}}$, is trained to predict this velocity field given the noisy sample $\mathbf{x_t}$, the current timestep $t$, and any additional conditions $C$. The training objective is typically an L2 loss:
$
\mathcal{L} = \mathbb{E}_{t, \mathbf{x_0}, \mathbf{x_1}, C} \left\lVert u_{\boldsymbol{\Theta}}(\mathbf{x_t}, t, C) - (\mathbf{x_1} - (1 - \sigma_{min}) \mathbf{x_0}) \right\rVert^2
$
- $u_{\boldsymbol{\Theta}}$: The Video Diffusion Transformer model with parameters $\boldsymbol{\Theta}$ that predicts the velocity field.
- $\mathbf{x_t}$: The noisy video sample at timestep $t$.
- $t$: The current timestep, indicating the progression from noise to data.
- $C$: Additional conditions provided to guide the generation (e.g., text prompts, reference videos).
- $\mathbf{x_0}$: An initial sample of Gaussian noise.
- $\mathbf{x_1}$: The target (ground-truth) video.
- $\sigma_{min}$: A small constant used in the path definition.
- $\mathbb{E}$: Expected value over the distribution of $t$, $\mathbf{x_0}$, $\mathbf{x_1}$, and $C$.
- $\lVert \cdot \rVert^2$: Squared L2 norm, measuring the squared difference between the model's prediction and the true velocity.
During inference, the model starts by sampling Gaussian noise $\mathbf{x_0}$. An ODE solver (e.g., Euler, DDIM) then uses the trained model to iteratively predict the velocity field and traverse the path from noise to a clean video over a discrete set of denoising timesteps.
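To make the training objective concrete, here is a minimal PyTorch-style sketch of one Flow Matching training step, assuming a generic velocity-prediction network `model(x_t, t, cond)`; the names and shapes are illustrative, not the authors' implementation.

```python
import torch

def flow_matching_loss(model, x1, cond, sigma_min=1e-5):
    """One flow-matching training step: build the noisy sample x_t on the
    straight path between noise x0 and data x1, then regress the velocity."""
    x0 = torch.randn_like(x1)                      # Gaussian noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # one timestep per batch item
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    x_t = t_ * x1 + (1 - (1 - sigma_min) * t_) * x0
    target_v = x1 - (1 - sigma_min) * x0           # ground-truth velocity field
    pred_v = model(x_t, t, cond)                   # e.g., a video DiT
    return ((pred_v - target_v) ** 2).mean()       # L2 regression loss
```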
4.2.2. Reference Videos as Task-Agnostic Prompts
VAP unifies heterogeneous semantic conditions by treating reference videos (which embody the desired semantics) as universal video prompts.
- Problem: Structure-controlled methods enforce a pixel-wise alignment (e.g., depth map to video). When a semantically matched but pixel-misaligned video is injected via residual addition (as in ControlNet-like methods), it often leads to copy-and-paste artifacts (Figure 5(a)), where unwanted appearance or layout from the reference is copied. Prior semantic methods were condition-specific or task-specific.
- VAP's Solution: A single unified model is trained to learn the conditional distribution of the target video given a reference video for any semantic condition. The reference video serves as a prompt that shares the same semantics as the target but lacks pixel-wise alignment.
- Semantic Categories: VAP specifically evaluates four representative categories for semantic control:
  - Concept-Guided Generation: Involves entity transformation (e.g., a person becomes a doll) or interaction (e.g., an AI lover approaches).
  - Style-Guided Generation: Generating videos in a specific artistic style (e.g., Ghibli, Minecraft).
  - Motion-Guided Generation: Following a reference motion, either human (e.g., specific dance moves) or non-human (e.g., object expansion).
  - Camera-Guided Generation: Replicating reference camera movements (e.g., zoom, pan, Hitchcock dolly zoom).
- Auxiliary Captions: To further aid semantic control, VAP also takes captions for both the reference video and the target video as input. These captions help the model identify and transfer the shared semantic control signal, so the model learns the distribution of the target video conditioned on the reference video and both captions.

The architecture overview of Video-As-Prompt is shown in Figure 4.
Figure 4: Overview of Video-As-Prompt. The reference video (with the wanted semantics), the target video, and their corresponding in-context token sequence [Ref_text, Ref_video, Tar_text, Tar_video] are shown in the middle ("tokens" is omitted for simplicity). First-frame tokens are concatenated with video tokens. A temporal bias is added to RoPE to avoid the nonexistent pixel-aligned prior imposed by the original shared RoPE (bottom right). The reference video and captions act as the prompt and guide the generation, exchanging information with the pre-trained DiT via full attention at each layer, enabling plug-and-play in-context generation.
4.2.3. Plug-and-Play In-Context Control
The VAP model takes four primary inputs to guide generation:
- Reference Video (semantic prompt): Provides the wanted semantics.
- Reference Image (initial appearance): The first frame of the generated video, providing the initial appearance and subject.
- Captions: A reference caption and a target caption to aid semantic transfer.
- Noise / Noisy Target Video: Gaussian noise at inference time, or a noisy version of the target video during training.

The processing flow is as follows:
- VAE Encoding: The reference video and the target video are first encoded into latent representations using a Variational Autoencoder (VAE), mapping the original temporal/spatial sizes (n, h, w) to the latent sizes (n', h', w').
- Text Tokenization: The reference caption and target caption are processed into text token embeddings.
- In-Context Sequence: A naive approach would be to finetune a DiT on the concatenated sequence of all inputs, [Ref_text, Ref_video, Tar_text, Tar_video]. However, the authors found this leads to catastrophic forgetting and poor performance with limited data, especially because semantic control lacks pixel-aligned priors.
- Mixture-of-Transformers (MoT) Architecture: To overcome these issues, VAP adopts a Mixture-of-Transformers (MoT) design (Figure 4, middle); a schematic sketch follows this list.
  - Frozen Backbone: A pre-trained Video Diffusion Transformer (DiT) is kept frozen. This backbone processes the target-generation inputs (target caption tokens and noisy target video tokens) and retains its original query (Q), key (K), value (V) projections, feed-forward layers, and layer norms.
  - Trainable Expert: A parallel expert Transformer, initialized from the backbone, is introduced. This expert is trainable and specifically consumes the reference-prompt inputs (reference caption tokens and reference video tokens), with its own projections, feed-forward layers, and norms.
  - Two-Way Information Fusion: At each layer of the MoT, the token sequences from the expert (reference path) and the frozen backbone (target path) are concatenated, and full attention is run over the combined sequence. This provides synchronous layer-wise reference guidance: the expert's interpretation of the reference prompt can dynamically influence the frozen DiT's generation of the target video, and vice versa.
  - Benefits: This MoT design preserves the backbone's original generative ability (preventing catastrophic forgetting), boosts training stability, and enables plug-and-play in-context control that is independent of the specific DiT architecture.
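To make the two-branch design concrete, the following is a simplified PyTorch sketch of a single MoT layer as described above; layer norms and the modulation details of a real DiT block are omitted, and module names and dimensions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTLayer(nn.Module):
    """One Mixture-of-Transformers layer: a frozen backbone branch for target
    tokens and a trainable expert branch for reference tokens, fused by full
    attention over the concatenated sequence (illustrative sketch)."""

    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        # Backbone projections: frozen here; in VAP these come from the pre-trained DiT.
        self.backbone_qkv = nn.Linear(dim, 3 * dim)
        self.backbone_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in list(self.backbone_qkv.parameters()) + list(self.backbone_ffn.parameters()):
            p.requires_grad_(False)
        # Expert projections: trainable; in VAP they are initialized from the backbone.
        self.expert_qkv = nn.Linear(dim, 3 * dim)
        self.expert_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, ref_tokens, tar_tokens):
        n_ref = ref_tokens.shape[1]
        # Each branch projects its own tokens.
        q_r, k_r, v_r = self.expert_qkv(ref_tokens).chunk(3, dim=-1)
        q_t, k_t, v_t = self.backbone_qkv(tar_tokens).chunk(3, dim=-1)
        # Full attention over the concatenated (reference + target) sequence.
        q = torch.cat([q_r, q_t], dim=1)
        k = torch.cat([k_r, k_t], dim=1)
        v = torch.cat([v_r, v_t], dim=1)
        b, n, d = q.shape
        q, k, v = (x.view(b, n, self.heads, d // self.heads).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(b, n, d)
        # Split back and apply each branch's own feed-forward network.
        ref_out, tar_out = out[:, :n_ref], out[:, n_ref:]
        return ref_tokens + self.expert_ffn(ref_out), tar_tokens + self.backbone_ffn(tar_out)

# Toy usage: 10 reference tokens and 20 target tokens with a 64-dim embedding.
layer = MoTLayer(dim=64, heads=4)
ref, tar = torch.randn(1, 10, 64), torch.randn(1, 20, 64)
ref_out, tar_out = layer(ref, tar)
```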
4.2.4. Temporally Biased Rotary Position Embedding
- Problem with Shared RoPE: Similar to observations in in-context image generation [66], the authors found that sharing identical position embeddings (e.g., Rotary Position Embedding, RoPE [65]) between the reference condition and the target video is suboptimal. It implicitly imposes a false pixel-level spatiotemporal mapping prior: the model assumes a direct, pixel-to-pixel correspondence in space and time between the reference and target, which does not exist for semantic control and leads to unsatisfactory results (e.g., artifacts in Figure 5(c)).
- VAP's Solution: To remove this spurious prior, VAP modifies the position embedding for the reference prompt (Figure 4, bottom right); a small sketch follows below.
  - Temporal Bias: The temporal indices of the reference prompt tokens are shifted by a fixed offset, effectively placing the reference tokens before all the noisy target video tokens along the temporal axis. This matches the temporal order expected by in-context generation (the context comes first) and ensures the model does not try to align reference and target frames directly in time.
  - Spatial Consistency: Crucially, the spatial indices of the reference prompt tokens are kept unchanged, so the model can still exploit meaningful spatial semantic changes within the reference video prompt.
- Benefits: This temporally biased RoPE removes the nonexistent pixel-mapping prior, leading to improved performance (Table 2) by allowing the model to focus on abstract semantic patterns rather than spurious pixel-level correspondences.
Figure 5: Motivation. Ablation visualizations (Semantic: Spin 360°) on structure designs of VAP.
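The temporally biased RoPE indexing can be sketched as follows. This is an illustrative reconstruction in PyTorch; the size of the temporal offset, the negative-shift convention, and the latent grid sizes are assumptions for illustration, not values from the paper.

```python
import torch

def build_3d_positions(n_frames, height, width, t_offset=0):
    """Return flattened (t, h, w) integer position indices for a latent video
    of shape (n_frames, height, width), in frame-major token order."""
    t = torch.arange(n_frames).view(-1, 1, 1).expand(n_frames, height, width)
    h = torch.arange(height).view(1, -1, 1).expand(n_frames, height, width)
    w = torch.arange(width).view(1, 1, -1).expand(n_frames, height, width)
    pos = torch.stack([t + t_offset, h, w], dim=-1)   # (n, h, w, 3)
    return pos.reshape(-1, 3)

# Target tokens occupy temporal slots [0, n); the reference prompt is shifted to
# earlier temporal slots so it comes "before" the target, while its spatial
# indices stay unchanged (hypothetical offset choice for illustration).
n, h, w = 13, 30, 45                                  # illustrative latent grid sizes
tar_pos = build_3d_positions(n, h, w, t_offset=0)
ref_pos = build_3d_positions(n, h, w, t_offset=-n)    # temporal bias: ref precedes target
rope_positions = torch.cat([ref_pos, tar_pos], dim=0)  # fed to the DiT's 3D RoPE
```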
5. Experimental Setup
5.1. Datasets
Semantic-controlled video generation specifically requires paired reference and target videos that share the same non-pixel-aligned semantic controls (e.g., concept, style, motion, camera). Such pairs are difficult to obtain or label automatically, unlike structure-controlled data where vision perception models (e.g., SAM [39] for masks, Depth-Anything [74] for depth) can directly provide conditions. Prior work often relied on small, manually collected datasets tailored for specific conditions [47], hindering unified model development.
To address this data scarcity, the authors introduce VAP-Data, the largest dataset for semantic-controlled video generation to date.
- Creation Process:
  - High-Quality Reference Images: They curated 2K high-quality real images from the internet, covering diverse subjects such as men, women, children, animals, objects, landscapes, and multi-subject scenarios.
  - Bootstrapping with Existing Models: They leveraged a "zoo of specialist models" to generate paired videos, namely commercial Image-to-Video (I2V) visual-effects templates (e.g., VIDU [68], Kling [40]) and community LoRAs [11] for various effects.
  - Pairing: Each curated image was matched with all compatible templates (some templates have subject-category restrictions) to create the paired data.
- Scale and Characteristics: VAP-Data comprises over 100K curated paired samples across 100 semantic conditions, organized into 4 primary categories, as shown in Figure 3 and detailed in Table 6 (in the Appendix):
  - Concept: Entity Transformation (e.g., a person becomes Captain America, hair color change) and Entity Interaction (e.g., aliens coming, covered in liquid metal).
  - Style: Stylization (e.g., Ghibli, Minecraft, Jojo).
  - Motion: Human Motion Transfer (e.g., shake-it dance, hip twist) and Non-human Motion Transfer (e.g., balloon flyaway, spin 360°).
  - Camera: Camera Movement Control (e.g., dolly effect, Hitchcock zoom, move up/down/left/right).
- Purpose: VAP-Data serves as a robust data foundation for training a single generalist model (VAP) to learn the unified underlying principle of semantic control from disparate specialist models.
- Evaluation Subset: For evaluation, a test subset was created by evenly sampling 24 semantic conditions (6 from each of the 4 categories), with 2 samples each, totaling 48 test samples.
Figure 3: Overview of Our Proposed VAP-Data. (a) 100 semantic conditions across 4 categories: concept, style, camera, and motion; (b) diverse reference images, including animals, humans, objects, and scenes, with multiple variants; and (c) a word cloud of the semantic conditions.
The following are the results from Table 6 of the original paper:
| Primary Category | Subcategory | Subset (alphabetical) | Total Videos |
|---|---|---|---|
| Concept(n=56) | Entity Transformation (n=24) | captain america, cartoon doll, eat mushrooms, fairy me, fishermen, fuzzyfuzzy, gender swap, get thinner, hair color change, hair swap, ladudu me, mecha x, minecraft, monalisa, muscling, pet to human, sexy me, squid game, style me, super saiyan, toy me, venom, vip, zen | 56k |
| Entity Interaction (n=21) | aliens coming, child memory, christmas, cloning, couple arrival, couple drop, couple walk, covered liquid metal, drive yacht, emoji figure, gun shooting, jump to pool, love drop, nap me, punch hit, selfie with younger self, slice therapy, soul depart, watermelon hit, zongzi drop, zongzi wrap | ||
| Style(n=11) | Stylization (n=11) | american comic, bjd, bloom magic, bloombloom, clayshot, ghibli, irasutoya, jojo, painting, sakura season, simpsons comic | 15k |
| Motion(n=41) | Human Motion Transfer (n=16) | break glass, crying, cute bangs, emotionlab, flying, hip twist, laughing, live memory, live photo, pet belly dance, pet finger, shake it dance, shake it down, split stance human, split stance pet, walk forward auto spin | 10k |
| Non-human Motion Transfer (n=16) | balloon flyaway, crush, decapitate, dizzydizzy, expansion, explode, grow wings, head to balloon, paper- man, paper fall, petal scattered, pinch, rotate, spin360, squish | 19k | |
| Camera(n=12) | Camera Movement Control (n=12) | dolly effect, earth zoom out, hitchcock zoom, move down, move left, move right, move up, orbit, orbit dolly, orbit zoom in/out, reverse dolly, track in/out | 19k |
| Overall subsets across all primary categories: 100. Overall videos across all categories: over 100K. | | | |
5.2. Evaluation Metrics
The paper evaluates models using a comprehensive set of metrics across three key aspects: text alignment, video quality, and semantic alignment.
5.2.1. Text Alignment
- Conceptual Definition: CLIP similarity (or CLIP Score) measures how well the generated video aligns semantically with the input text prompt. It leverages the CLIP (Contrastive Language-Image Pre-training) model, which learns a joint embedding space for text and images. A higher CLIP similarity indicates that the visual content of the video is more relevant and consistent with its accompanying text description.
- Mathematical Formula: $ \text{CLIP Score} = \frac{1}{N} \sum_{i=1}^N \cos(\mathbf{e}_{\text{text},i}, \mathbf{e}_{\text{video},i}) $
- Symbol Explanation:
  - $N$: The total number of generated video-text pairs.
  - $\mathbf{e}_{\text{text},i}$: The CLIP embedding vector for the text prompt of the $i$-th video.
  - $\mathbf{e}_{\text{video},i}$: The CLIP embedding vector for the $i$-th generated video.
  - $\cos(\cdot, \cdot)$: The cosine similarity function, which measures the cosine of the angle between two vectors. It ranges from -1 (completely dissimilar) to 1 (identical direction); higher values indicate better alignment.
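For concreteness, here is a minimal sketch of how such a CLIP-based text-alignment score can be computed from precomputed embeddings; the exact encoder and frame-pooling choices used by the paper are not specified here, so treat the inputs as assumptions.

```python
import torch
import torch.nn.functional as F

def clip_score(text_emb, video_emb):
    """Average cosine similarity between paired text and video CLIP embeddings.

    text_emb, video_emb: (N, D) tensors, e.g. precomputed with a CLIP model
    (video embeddings are typically mean-pooled frame features)."""
    return F.cosine_similarity(text_emb, video_emb, dim=-1).mean().item()

# Toy usage with random stand-in embeddings (a real pipeline would encode
# prompts and frames with CLIP first).
text_emb = torch.randn(48, 512)
video_emb = torch.randn(48, 512)
print(clip_score(text_emb, video_emb))
```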
5.2.2. Overall Video Quality
The paper uses three metrics to assess video quality:
- Motion Smoothness
  - Conceptual Definition: Motion smoothness quantifies the temporal consistency and fluidity of motion within a generated video. It assesses whether movements are natural and free from abrupt changes, jumps, or jitters between frames. Higher scores indicate smoother, more visually pleasing motion.
  - Mathematical Formula: The paper refers to a common formulation [59] for motion smoothness, which typically measures frame-to-frame feature consistency (e.g., the cosine similarity of CLIP embeddings between adjacent frames) and averages it over the video: $ \text{Motion Smoothness} = \frac{1}{T-1} \sum_{t=1}^{T-1} \cos(\mathbf{e}_{\text{frame},t}, \mathbf{e}_{\text{frame},t+1}) $
  - Symbol Explanation:
    - $T$: The total number of frames in the video.
    - $\mathbf{e}_{\text{frame},t}$: The CLIP embedding vector for frame $t$.
    - $\cos(\cdot, \cdot)$: Cosine similarity between consecutive frames' visual representations.
- Dynamic Degree
  - Conceptual Definition: Dynamic degree measures the amount and intensity of motion or change occurring within a video. A video with high dynamic degree exhibits significant movement or transformation, while a low dynamic degree suggests a relatively static scene. This is important for ensuring that generated videos are not stagnant. It often relies on optical flow analysis.
  - Mathematical Formula: The paper refers to Dynamic Degree [67], which often computes the magnitude of optical flow vectors between frames and averages it: $ \text{Dynamic Degree} = \frac{1}{T-1} \sum_{t=1}^{T-1} \left( \frac{1}{H \times W} \sum_{x=1}^H \sum_{y=1}^W \lVert \mathbf{f}_{t}(x,y) \rVert_2 \right) $
  - Symbol Explanation:
    - $T$: The total number of frames in the video.
    - $H, W$: Height and width of the video frames.
    - $\mathbf{f}_{t}(x,y)$: The optical flow vector at pixel $(x,y)$ between frame $t$ and frame $t+1$.
    - $\lVert \cdot \rVert_2$: The Euclidean (L2) norm of the vector, representing the magnitude of the motion at that pixel.
- Aesthetic Quality
  - Conceptual Definition: Aesthetic quality assesses the overall visual appeal, beauty, and artistic merit of a generated video. This metric often relies on pre-trained aesthetic predictor models trained on human-annotated image-aesthetics datasets. A higher score indicates a more aesthetically pleasing video.
  - Mathematical Formula: The paper refers to aesthetic quality [61]. While a universal formula is not provided in the paper, it typically involves an aesthetic predictor that outputs a score for each frame, averaged over the video: $ \text{Aesthetic Quality} = \frac{1}{T} \sum_{t=1}^T A(\text{frame}_t) $
  - Symbol Explanation:
    - $T$: The total number of frames in the video.
    - $A(\text{frame}_t)$: The aesthetic score predicted by a pre-trained model for frame $t$.
5.2.3. Semantic Alignment Score
- Conceptual Definition: This custom metric measures the consistency between the reference semantic condition (provided by the reference video and captions) and the generated video. Standard video-quality metrics do not reliably capture adherence to specific semantic conditions, so the authors leverage advanced Vision Language Models (VLMs) such as Gemini-2.5-Pro [12] and GPT-5 [53] for automatic scoring. The VLM acts as an expert judge, comparing each reference video-generated video pair against detailed evaluation rules.
- Methodology:
  - Input to VLM: For each (reference, generation) video pair, the VLM receives:
    - A general template describing the judging role and overall evaluation regime (ID-TRANSFORM vs. NON-ID-TRANSFORM).
    - Specific criteria tailored to the current semantic condition (e.g., "Ghibli style stylization," "dolly zoom").
    - The reference video condition.
    - The generated video.
  - Scoring by VLM: The VLM then scores the generated video based on how well it satisfies these rules, outputting a single integer score (0-100); a prompt-assembly and score-parsing sketch follows Table 4.
- Validation: To ensure stability and validity, the same evaluation was performed with two different state-of-the-art VLMs (Gemini-2.5-Pro and GPT-5). The scores from both VLMs showed close agreement and followed the trend of the user preference rate, confirming the metric's effectiveness.

The following are the results from Table 4 of the original paper:
Category Content General Template You are an expert judge for reference-based semantic video generation. INPUTS REFERENCE video: the target semantic to imitate. TEST video: a new output conditioned on a NEW reference image. Human criteria (treat as ground truth success checklist; overrides defaults if conflict): {criteria} REGIME DECISION Classify the semantics into one of: A) ID-TRANSFORM (identity-changing): the main subject/object changes semantic class or material/state. Layout and identity may legitimately change as a consequence of the transformation. B) NON-ID-TRANSFORM (identity-preserving): stylization, camera motion (pan/zoom), mild geometry exaggeration, lighting changes, human motion, etc. The main subject class/identity should remain the same. If the REFERENCE clearly shows a class/state change, choose A. Otherwise, choose B. When uncertain, choose B. EVALUATION 1) SEMANTIC MATCH (0-60) Regime A (ID-TRANSFORM): How strongly and accurately does TEST reproduce the REFERENCE's target state/look/behavior on the correct regions? Is the source→target mapping consistent (same parts transform to corresponding target parts)? Does the transformed state resemble the REFERENCE target, not a generic filter? Regime B (NON-ID-TRANSFORM): Does TEST replicate the specific semantic (style, camera motion, geometric exaggeration) while keeping the subject recognizable and aligned to the intended scope? 2) IDENTITY / LAYOUT CORRESPONDENCE (0-20) Regime A: Reward semantic correspondence rather than identical identity; coarse scene continuity is preserved unless the REFERENCE implies re-layout. Regime B: Main subject identity stays intact (face/body/clothes/features), and coarse spatial layout remains consistent (no unintended subject swaps/teleports). 3) TEMPORAL QUALITY and TRANSFORMATION CONTINUITY (0-20) Failures: - Incomplete semantic transformation (e.g., Ghibli style applied to <70% of frames), or failure to fully transition to target class/material, or completes <70% of the transformation timeline. - Severe identity loss in Regime B (unrecognizable face/body, unintended person/object swap). Gross broken anatomy (detached/missing limbs, implausible face mash) is not required by the semantics. - Extreme temporal instability or unreadable corruption (heavy strobe, tearing, tiling). - Hallucinated intrusive objects that block the subject or derail the semantics. OUTPUT (exactly ONE line of JSON; integer only) {"score": 0-100} Semantic Criteria Regime: NON-ID-TRANSFORM (identity-preserving stylization). Semantic: Ghibli-style stylization — the overall look gradually transitions to a hand-drawn, soft, film-like Ghibli aesthetic across the whole frame. Identity preservation: The main subject remains recognizable; appearance/proportions/base colors are largely maintained (stylistic simplification and brush-like textures allowed).
Table 4: Example Prompt components for the semantic-alignment metric. We provide the general template and the specific criteria of "Ghibli Style" as an example.
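As an illustration of the judging pipeline, the sketch below assembles a prompt from a general template and per-condition criteria and parses the required one-line JSON score. The VLM call itself is omitted, and the abridged template string is a placeholder, not the full text of Table 4.

```python
import json
import re

# Abridged placeholder for the general template in Table 4 (not the full text).
GENERAL_TEMPLATE = (
    "You are an expert judge for reference-based semantic video generation.\n"
    "Human criteria (treat as ground truth success checklist): {criteria}\n"
    'OUTPUT (exactly ONE line of JSON; integer only) {"score": 0-100}'
)

def build_judge_prompt(criteria: str) -> str:
    """Insert the per-condition criteria into the general judging template.
    (Videos would be attached separately through the VLM API.)"""
    return GENERAL_TEMPLATE.replace("{criteria}", criteria)

def parse_score(vlm_reply: str) -> int:
    """Extract the integer score from the one-line JSON output, e.g. '{"score": 85}'."""
    match = re.search(r"\{.*\}", vlm_reply, flags=re.DOTALL)
    return int(json.loads(match.group(0))["score"])

prompt = build_judge_prompt("Ghibli-style stylization; identity-preserving.")
print(parse_score('{"score": 85}'))  # -> 85
```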
5.2.4. User Study
- Conceptual Definition: A user study is a qualitative evaluation in which human participants assess the quality of generated content. It directly captures human preferences regarding video quality and semantic alignment, which can be hard to quantify with automated metrics alone.
- Methodology:
  - Participants: 20 randomly selected video-generation researchers.
  - Task: In each trial, raters compared outputs from different methods, shown alongside a semantic-control reference video.
  - Criteria: Raters chose the better result based on two aspects: (i) semantic alignment and (ii) overall quality.
  - Reporting: The preference rate is reported, i.e., the normalized share of selections across all comparisons, summing to 100%.
5.3. Baselines
The VAP method is evaluated against a comprehensive set of baselines, including state-of-the-art (SOTA) structure-controlled methods, condition-specific semantic methods, and leading closed-source commercial models.
- State-of-the-Art Structure-Controlled Video Generation Method:
  - VACE [34]: This model conditions on a video and a same-size mask indicating editable (1) vs. fixed (0) regions. For comparison with VAP, VACE was provided with (a) the reference video itself, (b) the depth map of the reference video, or (c) the optical flow of the reference video. In all cases the mask was set to all 1s, meaning the model was instructed to follow (rather than copy) the condition. This highlights how structure-controlled methods perform when pixel-wise mapping priors are inappropriate for semantic control.
- DiT Backbone and Condition-Specific Methods:
  - CogVideoX-I2V [76]: The base Video Diffusion Transformer (DiT) model used as the backbone. This represents a strong generative model without any specific semantic-control mechanism beyond standard text prompts.
  - CogVideoX-I2V (LoRA) [27]: This represents the common community practice for condition-specific semantic control. For each unique semantic condition, a LoRA (Low-Rank Adaptation) is finetuned on the CogVideoX-I2V backbone; this approach is often reported to match or surpass task-specific models [2, 78]. The reported performance is averaged across all finetuned LoRAs.
- State-of-the-Art Closed-Source Commercial Models:
  - Kling [40]: A commercial image-to-video special-effects platform.
  - Vidu [68]: Another commercial platform offering AI video templates. These models are highly optimized and typically condition-specific or task-specific, representing the highest practical performance benchmarks in the field.
5.4. Implementation Details
The authors trained VAP on two different DiT architectures to demonstrate its effectiveness and transferability:
- CogVideoX-I2V-5B [76]
- Wan2.1-I2V-14B [70]

Key hyperparameters and training specifics:
- Parameter Counts: For fairness, the in-context DiT expert in VAP was designed to match parameter counts. On CogVideoX-I2V-5B, the expert is a full copy of the original DiT (approx. 5B parameters); on Wan2.1-I2V-14B, it is a distributed copy spanning a quarter of the layers (also approx. 5B parameters).
- Video Processing: Videos were resized to 480p (480 x 720 or 480 x 832), and 49 frames were sampled at 16 frames per second (fps).
- Optimizer: AdamW.
- Learning Rate: 1e-5, with a constant-with-warmup schedule (1,000 warmup steps).
- Training Steps: Approximately 20,000 steps.
- Hardware: Trained on 48 NVIDIA A100 GPUs.
- Inference: 50 denoising steps for the CogVideoX-based model (30 for the Wan2.1-based model, per Table 3), with a classifier-free guidance scale of 6 (CogVideoX-based) or 5 (Wan2.1-based).
- Prediction Type: Velocity (for CogVideoX) or Flow Matching (for Wan2.1).
- MoT Layers: For the CogVideoX-based VAP, the MoT expert is applied across all 42 layers. For the Wan2.1-based VAP, it is applied to layers [0, 4, 8, 12, 16, 20, 24, 28, 32, 36] (10 layers, approximately 1/4 of the total).

The following are the results from Table 3 of the original paper:
| Hyperparameter | CogVideoX-I2V-based | Wan2.1-I2V-based |
|---|---|---|
| Batch Size / GPU | 1/1 | 1/2 |
| Accumulate Step | 1 | 1 |
| Optimizer | AdamW | AdamW |
| Weight Decay | 0.0001 | 0.0001 |
| Learning Rate | 0.00001 | 0.00001 |
| Learning Rate Schedule | constant with warmup | constant with warmup |
| WarmUp Steps | 1000 | 1000 |
| Training Steps | 20,000 | 20,000 |
| Resolution | 480p | 480p |
| Prediction Type | Velocity | Flow Matching |
| Num Layers | 42 | 40 |
| MoT Layers | all 42 layers (0-41) | [0, 4, 8, 12, 16, 20, 24, 28, 32, 36] |
| Pre-trained Model | CogVideoX-I2V-5B | Wan2.1-I2V-14B |
| Sampler | CogVideoX DDIM | Flow Euler |
| Sample Steps | 50 | 30 |
| Guide Scale | 6.0 | 5.0 |
| Generation Speed (1 A100) | ~540s | ~420s |
| Device | A100×48 | A100×48 |
| Training Strategy | FSDP / DDP / BFloat16 | FSDP DDP Parallel / BFloat16 |
Table 3: Hyperparameter selection for CogVideoX-I2V-5B-based and Wan2.1-I2V-14B-based VAP.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that VAP significantly advances semantic-controlled video generation, outperforming open-source baselines and achieving competitive performance with commercial models.
The following are the results from Table 1 of the original paper:
| Model | Text CLIP Score ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Semantic Alignment Score ↑ | User Study Preference Rate (%)* |
|---|---|---|---|---|---|---|
| Structure-Controlled Methods | ||||||
| VACE (Original) | 5.88 | 97.60 | 68.75 | 53.90 | 35.38 | 0.6% |
| VACE (Depth) | 22.64 | 97.65 | 75.00 | 56.03 | 43.35 | 0.7% |
| VACE (Optical Flow) | 22.65 | 97.56 | 79.17 | 57.34 | 46.71 | 1.8% |
| DiT Backbone and Condition-Specific Methods | ||||||
| CogVideoX-I2V | 22.82 | 98.48 | 72.92 | 56.75 | 26.04 | 6.9% |
| CogVideoX-I2V (LoRA)+ | 23.59 | 98.34 | 70.83 | 54.23 | 68.60 | 13.1% |
| Kling / Vidu‡ | 24.05 | 98.12 | 79.17 | 59.16 | 74.02 | 38.2% |
| Ours | ||||||
| Video-As-Prompt (VAP) | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 | 38.7% |
Table 1: Quantitative Comparison. We compare against the SOTA structure-controlled generation method VACE [34], the base video DiT model CogVideoX-I2V [76], the condition-specific variant CogVideoX-I2V (LoRA) [27], and the closed-source commercial models Kling/Vidu [40, 68]. Overall, VAP delivers performance comparable to the closed-source models and outperforms other open-source baselines. Red stands for the best, Blue stands for the second best.
6.1.1. Quantitative Comparison (Table 1)
- Structure-Controlled Methods (VACE): As expected, VACE (Original, Depth, Optical Flow) performs the worst across all metrics for semantic-controlled generation, because it assumes a pixel-wise mapping between condition and output that is unsuitable for abstract semantic controls. With the raw reference video as input, VACE shows a very low Text CLIP Score (5.88) and Semantic Alignment Score (35.38), along with a minimal User Preference Rate (0.6%). As the control shifts from raw video to depth and then optical flow, the pixel-wise appearance details decrease and the metrics improve slightly (e.g., Semantic Alignment Score up to 46.71 for Optical Flow). This confirms that the strong pixel-wise prior of these methods is ill-suited for semantic control.
- DiT Backbone (CogVideoX-I2V): Driving the pre-trained DiT backbone with only captions yields decent video-quality metrics (e.g., Motion Smoothness 98.48, Aesthetic Quality 56.75), but a very weak Semantic Alignment Score (26.04) and User Preference Rate (6.9%). This indicates that many semantic controls are difficult to express effectively with coarse text alone.
- Condition-Specific Methods (CogVideoX-I2V (LoRA)): LoRA finetuning achieves a strong Semantic Alignment Score (68.60), significantly better than the base DiT. However, it slightly harms base quality (Motion Smoothness drops from 98.48 to 98.34 and Aesthetic Quality from 56.75 to 54.23), suggesting overfitting to specific conditions. It still lags in User Preference Rate (13.1%), and it inherently requires a separate model per condition with no zero-shot generalization.
- Commercial Models (Kling / Vidu): These closed-source commercial models achieve excellent performance, with a Semantic Alignment Score of 74.02 and a User Preference Rate of 38.2%. They represent the state of the art for condition-specific or task-specific approaches.
- Video-As-Prompt (VAP): VAP significantly outperforms all open-source baselines on most metrics. It achieves the highest Text CLIP Score (24.13), the best Motion Smoothness (98.59), and a Semantic Alignment Score (70.44) substantially higher than the LoRA variant and close to the commercial models. Critically, VAP achieves the highest User Preference Rate (38.7%), surpassing even the commercial models and establishing a new state of the art for open-source methods. This is particularly impressive given that VAP is a single unified model, unlike the condition-specific commercial models.
6.1.2. User Study (Table 1)
The user study corroborates the quantitative results. VAP achieved an overall 38.7% preference rate, which is the highest among all compared methods, narrowly beating the commercial Kling / Vidu models (38.2%). This strong human preference underscores VAP's ability to produce coherent, semantically aligned videos with high visual quality in a unified manner.
6.1.3. Qualitative Comparison (Figure 6)
Figure 6: Qualitative comparison with VACE [34], CogVideoX (I2V) [76], CogVideoX-LoRA (I2V), and commercial models [40, 68]; VACE uses a form condition (top left). More visualizations are on the project page.
Figure 6 provides qualitative examples that visually confirm the quantitative findings:
- VACE: Exhibits pixel-mapping bias. For instance, when given a reference video of a Statue of Liberty imitating a sheep, VACE copies unwanted appearance details or layout directly from the reference, so artifacts or unintended visual elements are "copied" into the target, demonstrating its ill-suitability for semantic control. This artifact weakens when depth or optical flow is used as the condition, since they progressively remove appearance details.
- CogVideoX-I2V: While capable of generating plausible videos, it often fails to capture or align with the abstract semantics, leading to visually good but semantically inaccurate outputs.
- CogVideoX-I2V (LoRA): Shows improved semantic alignment compared to the base DiT, but it can still suffer from overfitting, lacks the fine-grained control and generalization of VAP, and requires a separate model for each condition.
- VAP: Produces videos with superior temporal coherence, visual quality, and semantic consistency. It effectively transfers the desired semantic (e.g., a specific concept, style, or motion) to the target without copy artifacts while maintaining the target's identity. VAP's outputs are visually competitive with condition-specific commercial models like Kling and Vidu.
6.1.4. Zero-Shot Generalization (Figure 7)
Figure 7: Zero-Shot Performance. Given semantic conditions unseen in VAP-Data (left column), VAP still transfers the abstract semantic pattern to the reference image in a zero-shot manner.
A significant finding is VAP's strong zero-shot generalization capability. By treating all semantic conditions as unified video prompts, VAP can handle diverse tasks. Furthermore, when presented with an entirely unseen semantic reference that was not part of the VAP-Data training set (e.g., "crumble," "dissolve," "levitate," "melt"), VAP successfully transfers the abstract semantic pattern to a new reference image. This demonstrates the power of its in-context learning ability and its potential for truly general-purpose controllable video generation.
6.2. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies to validate the effectiveness of VAP's key architectural components and design choices.
The following are the results from Table 7 of the original paper:
(In the original table, Motion Smoothness, Dynamic Degree, and Aesthetic Quality are grouped under "Overall Quality.")

| Variant | Text CLIP Score ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Reference Alignment Score ↑ |
|---|---|---|---|---|---|
| In-Context Generation Structure | | | | | |
| Single-Branch Finetuning | 23.03 | 97.97 | 70.83 | 56.93 | 68.74 |
| Single-Branch LoRA | 23.12 | 98.25 | 72.92 | 57.19 | 69.28 |
| Unidir-Cross-Attn | 22.96 | 97.94 | 66.67 | 56.88 | 67.16 |
| Unidir-Addition | 22.37 | 97.63 | 62.50 | 56.91 | 55.99 |
| Position Embedding Design | | | | | |
| Identical PE | 23.17 | 98.49 | 70.83 | 57.09 | 68.98 |
| Neg. shift in T, W | 23.45 | 98.53 | 72.92 | 57.31 | 69.05 |
| Scalability | | | | | |
| 1K samples | 22.84 | 92.12 | 60.42 | 56.77 | 63.91 |
| 10K samples | 22.87 | 94.89 | 64.58 | 56.79 | 66.28 |
| 50K samples | 23.29 | 96.72 | 70.83 | 56.82 | 68.23 |
| 100K samples | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 |
| DiT Structure | | | | | |
| Wan2.1-I2V-14B backbone | 23.93 | 97.87 | 79.17 | 58.09 | 70.23 |
| In-Context Expert Transformer Layer Distribution | | | | | |
| L_odd (all odd layers) | 24.05 | 98.52 | 75.00 | 57.58 | 70.22 |
| L_odd, ≤ ⌊0.5N⌋ (odd layers in first half) | 23.72 | 98.19 | 70.83 | 56.71 | 69.61 |
| L_first-half | 23.90 | 98.41 | 75.00 | 57.18 | 69.94 |
| L_first-last | 23.96 | 98.33 | 72.92 | 57.06 | 70.02 |
| Video Prompt Representation | | | | | |
| Noisy reference | 23.98 | 98.41 | 75.00 | 57.42 | 70.18 |
| VAP (Ours, full model) | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 |
Table 7: Ablation Study. We verify the effectiveness of our MoT structure, the temporally biased RoPE, the full dataset, and transferability across different DiTs; the bottom row reports our full model. The variants cover the in-context generation structure (single-branch finetuning, single-branch LoRA, unidirectional cross-attention, unidirectional addition), the position embedding design (identical position embedding, negative temporal/width shifts), data scale (1K, 10K, 50K, and 100K training samples), the DiT backbone (Wan2.1-I2V-14B), and the placement of MoT expert layers within a backbone of N Transformer layers (L_odd: all odd layers; L_odd, ≤ ⌊0.5N⌋: odd layers in the first half; L_first-half: the first half of the layers; L_first-last: the first and last layers). Our full model uses all 100K samples and attaches MoT experts to all layers (for the CogVideoX-I2V-5B based VAP).
6.2.1. In-Context Generation Structure
The study compares four variants of in-context generation against the full VAP (MoT with Frozen DiT + Trainable Expert + Full Attention):
- A1 (Single-Branch Finetuning): Expanding the pre-trained DiT input sequence to include reference text, reference video, target text, and target video, then finetuning the full model. This approach yields a decent Semantic Alignment Score (68.74), but VAP (70.44) still outperforms it. This indicates that while finetuning can learn in-context patterns, MoT is superior at preserving the base DiT's generative ability while enabling plug-and-play control, addressing catastrophic forgetting.
- A2 (Single-Branch LoRA Finetuning): Similar to A1, but freezing the backbone and training only LoRA layers. LoRA helps retain the backbone's ability and achieves a slightly better Semantic Alignment Score (69.28) than full finetuning. However, its limited capacity struggles with the complexity of in-context semantic generation, resulting in suboptimal overall performance compared to VAP (70.44).
- A3 (Unidirectional Cross-Attention): Freezing the pre-trained DiT, adding a new branch (initialized with the backbone's weights), and injecting its features via layer-wise cross-attention unidirectionally (reference to target). This leads to worse Semantic Alignment (67.16) than VAP, highlighting that the bidirectional information exchange through full attention in MoT (where reference and target representations adapt synchronously) is crucial for improving semantic alignment.
- A4 (Unidirectional Addition): Similar to A3, but injecting features via residual addition. This performs the worst in terms of Semantic Alignment (55.99). Even with retraining, residual-addition methods rely on a rigid pixel-to-pixel mapping that mismatches semantic-controlled generation and severely degrades performance, as shown in Figure 5(a) and quantitatively in Table 1 (VACE results).

Conclusion: The MoT architecture used in VAP (a frozen DiT and a trainable expert communicating via full attention) is the most effective structure for in-context semantic-controlled video generation, balancing preservation of generative ability with robust contextual guidance. A minimal sketch of this joint-attention structure follows.
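The sketch below is a hypothetical PyTorch illustration of the MoT-style block compared above: target tokens are projected by frozen backbone weights, reference tokens by a trainable expert, and a single full-attention pass lets both streams attend to each other. All class and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTJointAttention(nn.Module):
    """Sketch of one MoT attention block: a frozen backbone path for target
    tokens, a trainable expert path for reference tokens, and full attention
    over the concatenated sequence so information flows in both directions."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.backbone_qkv = nn.Linear(dim, dim * 3)  # frozen, pre-trained DiT weights
        self.backbone_out = nn.Linear(dim, dim)      # frozen
        self.expert_qkv = nn.Linear(dim, dim * 3)    # trainable, typically initialized from the backbone
        self.expert_out = nn.Linear(dim, dim)        # trainable
        for p in list(self.backbone_qkv.parameters()) + list(self.backbone_out.parameters()):
            p.requires_grad = False                  # the backbone stays untouched (no forgetting)

    def _split_heads(self, x):
        b, n, d = x.shape
        return x.view(b, n, self.heads, d // self.heads).transpose(1, 2)

    def forward(self, target_tokens, reference_tokens):
        # Per-branch projections: each token set is processed by its own expert.
        q_t, k_t, v_t = self.backbone_qkv(target_tokens).chunk(3, dim=-1)
        q_r, k_r, v_r = self.expert_qkv(reference_tokens).chunk(3, dim=-1)

        # Concatenate along the sequence axis so attention is fully bidirectional:
        # target tokens read the reference prompt, and the reference adapts to the target.
        q = torch.cat([q_t, q_r], dim=1)
        k = torch.cat([k_t, k_r], dim=1)
        v = torch.cat([v_t, v_r], dim=1)

        attn = F.scaled_dot_product_attention(
            self._split_heads(q), self._split_heads(k), self._split_heads(v))
        attn = attn.transpose(1, 2).reshape(q.shape)

        n_t = target_tokens.shape[1]
        out_target = self.backbone_out(attn[:, :n_t])   # target stream stays on frozen weights
        out_reference = self.expert_out(attn[:, n_t:])  # reference stream uses the expert
        return out_target, out_reference
```

In this framing, the A3 variant would let only the target queries attend to the reference keys/values (so the reference never adapts), and A4 would replace the attention exchange with a residual addition of reference features, which the ablation shows degrades semantic alignment.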
6.2.2. Position Embedding Designs
- Identical PE: Applying identical RoPE to both the reference and target videos enforces an unrealistic pixel-wise alignment prior. This results in a lower Semantic Alignment Score (68.98) compared to VAP (70.44), validating the motivation for a specialized position embedding for semantic control.
- Neg. shift in T, W: In this variant, in addition to the temporal bias, a width bias is introduced by placing the reference tokens before the target along the width axis (following in-context image generation [66]). This yields a slightly lower Semantic Alignment Score (69.05) than VAP, suggesting that while the temporal bias is crucial, a width bias (negative shift in W) may not be beneficial for video contexts, possibly interfering with spatial consistency or simply being less necessary than for static images.

Conclusion: The temporally biased RoPE (temporal shift only) used in VAP is crucial for eliminating spurious pixel-mapping priors and achieving optimal semantic alignment. A sketch of the idea appears below.
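The snippet below is a minimal sketch of the intuition: reference tokens keep their spatial (h, w) coordinates but receive temporal indices that lie strictly before the target clip, so no reference token shares a full (t, h, w) coordinate with a target token. The coordinate convention and the size of the shift are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def video_rope_coords(num_frames: int, height: int, width: int, t_offset: int = 0):
    """Return (num_frames*height*width, 3) integer (t, h, w) coordinates for a
    latent video grid, optionally shifting the temporal axis by t_offset."""
    t = torch.arange(num_frames) + t_offset
    h = torch.arange(height)
    w = torch.arange(width)
    grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)
    return grid.reshape(-1, 3)

T, H, W = 13, 30, 45                        # illustrative latent grid size

# Target clip occupies temporal positions [0, T).
target_pos = video_rope_coords(T, H, W)

# Temporally biased prompt (VAP-style sketch): the reference is placed entirely
# before the target in time while sharing the spatial layout, so it acts as
# preceding context rather than a pixel-aligned overlay.
reference_pos = video_rope_coords(T, H, W, t_offset=-T)

# "Identical PE" ablation: the reference reuses the target's coordinates, which
# re-introduces the spurious pixel-wise alignment prior the ablation penalizes.
identical_reference_pos = video_rope_coords(T, H, W)
```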
6.2.3. Scalability
The ablation study on dataset size (1K, 10K, 50K, 100K samples from VAP-Data) shows a clear trend: VAP's performance improves across all metrics as the training data grows.
- With 1K samples, the Semantic Alignment Score is 63.91 and Motion Smoothness is 92.12.
- With the full 100K samples, the Semantic Alignment Score reaches 70.44 and Motion Smoothness reaches 98.59.

Conclusion: This demonstrates the strong scalability of VAP's unified design, which treats reference videos as prompts without task-specific modifications, combined with the MoT framework that preserves the backbone's generative capacity.
6.2.4. DiT Structure
To test transferability, VAP was implemented on Wan2.1-I2V-14B, a stronger DiT backbone, with an expert of comparable parameter count (~5B) to the CogVideoX-I2V-5B version. The Wan2.1-based VAP shows:

- Improved Dynamic Degree (79.17 vs. 77.08 for the CogVideoX-based VAP) and Aesthetic Quality (58.09 vs. 57.71). This benefit likely comes from Wan2.1's stronger base generative capabilities.
- A slightly worse Reference Alignment Score (70.23 vs. 70.44 for the CogVideoX-based VAP). This could be attributed to the Wan2.1 variant having MoT expert layers inserted across only a subset of the backbone's layers, leading to less comprehensive in-context interaction than in the CogVideoX variant, where MoT is applied to all layers.

Conclusion: VAP is transferable across different DiT architectures, with performance generally benefiting from a stronger backbone. The distribution of MoT layers (how many and which layers) can influence the balance between generation quality and semantic alignment.
6.2.5. In-Context Expert Transformer Layer Distribution
The study analyzes how different layer distributions for the MoT expert affect in-context control:
- L_odd (all odd layers): Achieves a Semantic Alignment Score of 70.22, close to the full model (70.44), with good Overall Quality.
- L_first-half (first half of the layers): Semantic Alignment of 69.94.
- L_first-last (first and last layers only): Semantic Alignment of 70.02. This outperforms L_first-half, suggesting that balanced feature interaction across both early and late layers (capturing low-level and high-level semantics) is important.
- L_odd, ≤ ⌊0.5N⌋ (odd layers in the first half): Semantic Alignment of 69.61. This is lower than L_odd, further supporting that a wider distribution of expert layers generally improves performance.

Conclusion: A more balanced distribution of MoT expert layers throughout the DiT backbone generally leads to better generation quality and semantic alignment. Reducing the number of expert layers can improve efficiency, but it may compromise performance. The helper below illustrates how such layer index sets might be constructed.
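A hypothetical helper to make the compared placements concrete; the 0-based indexing and the example layer count are assumptions for illustration, not values from the paper.

```python
def expert_layer_indices(num_layers: int, scheme: str) -> list[int]:
    """Illustrative index sets describing where MoT expert layers are attached."""
    if scheme == "all":                # full model: an expert alongside every backbone layer
        return list(range(num_layers))
    if scheme == "odd":                # L_odd: every other layer
        return list(range(1, num_layers, 2))
    if scheme == "odd_first_half":     # L_odd restricted to the first half of the network
        return list(range(1, num_layers // 2, 2))
    if scheme == "first_half":         # L_first-half
        return list(range(num_layers // 2))
    if scheme == "first_last":         # L_first-last: only the two ends
        return [0, num_layers - 1]
    raise ValueError(f"unknown scheme: {scheme}")

# Example with a hypothetical 40-layer backbone:
print(expert_layer_indices(40, "odd")[:5])       # [1, 3, 5, 7, 9]
print(expert_layer_indices(40, "first_last"))    # [0, 39]
```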
6.2.6. Video Prompt Representation
Inspired by Diffusion Forcing [9, 23, 64], the authors experimented with injecting noise into the video prompt representation.
- Result: This approach often led to severe artifacts and degraded generation quality.
- Reasoning: Unlike long-video generation with Diffusion Forcing (where the goal is often to avoid copy-paste or overly static results, and noise can help introduce diversity), VAP's reference videos already differ significantly in appearance and layout from the target. Adding noise to the video prompt in this context corrupts the crucial contextual information needed for semantic transfer, hindering rather than helping.

Conclusion: For VAP's semantic-control paradigm, a clean and informative video prompt representation is essential; adding noise to it is detrimental. The sketch below contrasts the two options.
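A rough illustration of the design choice, under assumed names and a simplified linear noising schedule (neither is from the paper): the VAP-style setup only noises the target latent, while the rejected variant also perturbs the reference, in the spirit of Diffusion Forcing.

```python
import torch

def build_training_inputs(target_latent, reference_latent, t, noisy_reference=False):
    """Sketch of one training step's inputs for in-context video generation.

    target_latent, reference_latent: (B, C, T, H, W) VAE latents
    t: (B,) diffusion "time" in [0, 1]; a simple linear schedule is assumed here
    """
    alpha = (1.0 - t).view(-1, 1, 1, 1, 1)
    noisy_target = alpha * target_latent + (1.0 - alpha) * torch.randn_like(target_latent)

    if noisy_reference:
        # Rejected variant: perturbing the prompt corrupts the very context the
        # model must read the semantics from.
        reference = alpha * reference_latent + (1.0 - alpha) * torch.randn_like(reference_latent)
    else:
        # VAP-style: the video prompt stays clean and fully informative.
        reference = reference_latent

    return noisy_target, reference
```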
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Video-As-Prompt (VAP), a novel and unified framework that addresses the longstanding challenge of generalizable semantic control in video generation. VAP reframes this problem as in-context generation, using a reference video as a direct semantic prompt. Its core technical contributions include a plug-and-play Mixture-of-Transformers (MoT) expert that guides a frozen Video Diffusion Transformer (DiT) while preventing catastrophic forgetting, and a temporally biased position embedding that eliminates spurious pixel-wise mapping priors for robust context retrieval.
To support this innovation, the authors built VAP-Data, the largest dataset for semantic-controlled video generation, comprising over 100K paired videos across 100 semantic conditions. Extensive experiments demonstrate that VAP, as a single unified model, achieves state-of-the-art performance among open-source methods, rivaling leading condition-specific commercial models with a 38.7% user preference rate. Its strong zero-shot generalization and versatility across various downstream applications mark a significant advance toward general-purpose, controllable video generation.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- VAP-Data Limitations:
  - Synthetic Nature: VAP-Data is generated using visual-effect templates from commercial models and community LoRAs. The dataset is therefore synthetic and may inherit specific stylistic biases, artifacts, and conceptual limitations of its source templates. For instance, if the source models struggle with generating realistic hands, VAP trained on this data might also perform poorly in that respect.
  - Limited Real-World Diversity: While large, the semantic conditions in VAP-Data are still relatively limited in terms of real-world complexity and diversity.
  - Future Work: The construction of a truly large-scale, real-world, semantic-controlled video dataset is a crucial next step, albeit beyond the scope of this paper.
- Influence of Caption and Reference Quality:
  - Caption Accuracy: VAP relies on standard video-description captions for both reference and target videos to aid semantic transfer. Inaccurate semantic descriptions in the captions can degrade generation quality.
  - Subject Mismatch: Large structural mismatches between the main subjects of the reference and target images/videos can also hurt performance. VAP performs best when captions align and subjects are structurally similar (Figure 14).
  - Future Work: Exploring instruction-style captions (e.g., "Please follow the Ghibli style of the reference video") may more effectively capture the intended semantics and improve control.
- Multiple Reference Videos:
  - The paper experimented with supplying multiple semantically matched reference videos but found results similar to using a single reference.
  - A significant issue observed with multiple references is that the model may blend unwanted visual details across videos (Figure 15). This is hypothesized to stem from the general-purpose captions lacking explicit semantic referents: when multiple references introduce conflicting or divergent visual elements (e.g., different types of falling objects), the model may produce a mixed, undesirable output.
  - Future Work: A more effective multi-reference control mechanism (e.g., a RoPE tailored to multi-reference conditions) or instruction-style captions that explicitly specify the intended referent is needed to mitigate this issue. A full study of model and caption design for multi-reference training is left for future research.
- Efficiency:
  - Increased Inference Cost: The plug-and-play MoT approach, while avoiding backbone retraining, introduces additional parameters and computation, leading to higher memory usage and longer runtime (inference time roughly doubles).
  - Future Work: Performance optimizations such as sparse attention [13, 80] or pruning [15, 73] are orthogonal to the core contributions and could be explored to improve efficiency.
7.3. Personal Insights & Critique
VAP represents a truly compelling and elegant paradigm shift for controllable video generation. The core innovation of treating reference videos as in-context prompts is brilliant, moving beyond the limitations of pixel-aligned conditions and task-specific architectures. This approach feels much more aligned with how humans learn and abstract concepts – by example rather than by explicit, granular instructions.
Inspirations and Applications:
- Unified Control: The success of VAP as a single, unified model for diverse semantic controls is highly inspiring. It suggests a path towards foundational generative video models that can adapt to a vast range of user intentions without constant retraining. This could revolutionize creative industries, making complex visual effects and stylization accessible to a broader audience.
- Plug-and-Play Extensibility: The MoT architecture is a key enabler. Its plug-and-play nature means that as new, more powerful Video Diffusion Transformers are developed, VAP's control capabilities can be readily integrated, leveraging the advancements of the backbone without catastrophic forgetting. This modularity is a robust design principle.
- Data Bootstrapping: The VAP-Data creation strategy, leveraging existing commercial and community models to bootstrap a large dataset for a novel task, is a clever workaround for data scarcity. This methodology could transfer to other nascent areas of generative AI where large, high-quality paired datasets are difficult to acquire.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- "Synthetic Trap" of VAP-Data: While necessary, the synthetic nature of VAP-Data is a double-edged sword. As the authors admit, VAP might inherit stylistic biases or artifacts from the generative models used for data creation. This raises questions about the ultimate realism and diversity of outputs for semantics not perfectly captured by the source models. Future work will need to consider carefully how to augment this with real-world data or domain adaptation techniques.
- Interpretation of "Semantic Alignment": The Semantic Alignment Score relies on VLMs (Gemini-2.5-Pro, GPT-5). While validated by user studies, the interpretability and potential biases of these VLMs could themselves influence the score. A deeper dive into how these VLMs make their judgments for complex video semantics would be valuable.
- General-Purpose Captions vs. Instruction-Style: The paper notes that standard video descriptions were used to stay close to the DiT's pre-training data. However, the limitation analysis suggests that instruction-style captions might be more effective. This implies an unexplored design space for textual prompt engineering in conjunction with video prompts; integrating instruction-tuning principles could unlock even finer control.
- Multi-Reference Challenges: The issue of blending unwanted visual details with multiple references is significant. It highlights that while VAP is excellent at transferring a semantic, controlling which part of which reference's semantic to apply, especially when references diverge, remains an open problem. This likely requires more explicit attention mechanisms or gating networks to disentangle and select specific semantic components from multiple contexts.
- Efficiency for Production: While acceptable for research, the doubled inference time might be a hurdle for real-time applications or large-scale commercial deployments without further optimization. This is a common trade-off with increased model complexity for control.
Overall, VAP makes a profound contribution by offering a unified and highly generalizable approach to semantic-controlled video generation. It sets a new benchmark for open-source models and provides a robust framework that can evolve with stronger DiT backbones and more sophisticated in-context learning strategies.