MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
TL;DR Summary
The MultiShotMaster framework addresses current limitations in multi-shot narrative video generation by integrating two novel RoPE variants, enabling flexible shot arrangement and coherent storytelling, while an automated data annotation pipeline enhances controllability and output quality.
Abstract
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "MultiShotMaster: A Controllable Multi-Shot Video Generation Framework." It focuses on developing a system capable of generating video sequences composed of multiple distinct shots, each with flexible arrangements, narrative coherence, and enhanced controllability beyond simple text prompts.
1.2. Authors
The authors and their affiliations are:
- Qinghe Wang, Baolu Li, Huchuan Lu, Xu Jia (Dalian University of Technology)
- Xiaoyu Shi, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai (Kling Team, Kuaishou Technology)
- Weikang Bian (The Chinese University of Hong Kong)
This indicates a collaboration between academic institutions and an industry research team, suggesting a blend of theoretical rigor and practical application focus.
1.3. Journal/Conference
The paper is published as an arXiv preprint, indicating it has not yet undergone formal peer review and publication in a journal or conference proceedings at the time of its posting. While arXiv is a highly respected platform for disseminating research quickly in fields like AI, it is not a peer-reviewed venue itself.
1.4. Publication Year
The paper was posted to arXiv on 2025-12-02 (December 2, 2025).
1.5. Abstract
The paper addresses the challenge that current video generation models, while proficient at single-shot clips, struggle with narrative multi-shot videos due to requirements for flexible shot arrangement, coherent storytelling, and advanced controllability. To overcome this, the authors propose MultiShotMaster, a framework for highly controllable multi-shot video generation.
The core methodology involves extending a pretrained single-shot model by integrating two novel variants of Rotary Position Embedding (RoPE):
- Multi-Shot Narrative RoPE: This variant applies explicit phase shifts at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order.
- Spatiotemporal Position-Aware RoPE: Designed to incorporate reference tokens and grounding signals, facilitating spatiotemporal-grounded reference injection.

To tackle data scarcity, the authors establish an automated data annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. The framework leverages intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven customized scenes. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the framework's superior performance and outstanding controllability.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2512.03041, and the PDF is available at https://arxiv.org/pdf/2512.03041v1.pdf. This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant gap between current video generation techniques and the practical requirements of real-world video content creation, especially for narrative multi-shot videos.
While recent advancements in single-shot video generation, primarily powered by diffusion transformers (DiTs), have achieved high-quality results from text prompts and incorporated various control signals (e.g., reference images, object motion), they fall short in producing videos with coherent narratives composed of multiple shots. Real-world films and television series rely on multi-shot video clips to convey stories, involving holistic scenes, character interactions, and micro-expressions across different camera angles and durations.
The specific challenges and gaps in prior research include:
- Narrative Coherence: Maintaining a consistent storyline, character identities, and scene layouts across multiple shots.
- Flexible Shot Arrangement: Current methods often struggle with variable shot counts and durations, or fixed transition points.
- Advanced Controllability: Beyond basic text prompts, there is a need for director-level control over aspects like character appearance, motion, and scene customization.
- Computational Cost: Adapting single-shot control methods to multi-shot settings can lead to larger networks and higher costs.
- Data Scarcity: A lack of suitable large-scale multi-shot, multi-reference datasets for training.
The paper's entry point and innovative idea stem from leveraging the inherent properties of Rotary Position Embedding (RoPE) within the DiT architecture. The authors observe that standard continuous RoPE might confuse intra-shot frames with inter-shot frames across boundaries. This insight leads to the idea of manipulating RoPE to explicitly mark shot transitions and integrate spatiotemporal control signals for reference injection.
2.2. Main Contributions / Findings
The primary contributions of MultiShotMaster are:
- Novel RoPE Variants for Controllability:
  - Multi-Shot Narrative RoPE: Introduces explicit angular phase shifts at shot transitions, enabling the model to recognize shot boundaries and maintain narrative order, allowing flexible shot counts and durations without additional trainable parameters.
  - Spatiotemporal Position-Aware RoPE: Integrates reference tokens (for subjects and backgrounds) with spatiotemporal grounding signals (e.g., bounding boxes) directly into the RoPE mechanism. This establishes strong correlations, allowing precise control over where and when references appear, and enabling subject motion control across frames.
- Multi-Shot & Multi-Reference Attention Mask: A novel attention masking strategy that constrains information flow, ensuring each shot primarily accesses its relevant reference tokens while maintaining global consistency across video tokens.
- Automated Data Curation Pipeline: An automatic Multi-Shot & Multi-Reference Data Curation pipeline to address data scarcity. It extracts multi-shot videos, hierarchical captions (global and per-shot), cross-shot grounding signals, and reference images from long internet videos.
- Comprehensive Controllability Framework: The proposed framework supports a wide range of control signals including text prompts, reference subjects with motion control, and background images for scene customization, all within a flexible multi-shot generation setting.
- Superior Performance: Extensive experiments demonstrate that MultiShotMaster achieves superior performance and outstanding controllability compared to existing methods across various quantitative metrics (e.g., Text Alignment, Inter-Shot Consistency, Transition Deviation, Narrative Coherence, Reference Consistency, Grounding) and qualitative evaluations.

The key conclusions are that by strategically modifying RoPE and developing an effective data curation pipeline, MultiShotMaster successfully enables highly controllable and narratively coherent multi-shot video generation, bridging a critical gap towards director-level AI-powered content creation. The framework's ability to handle flexible shot arrangements and integrate diverse condition signals makes it a versatile tool for customized video narratives.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the MultiShotMaster paper, a reader should be familiar with the following foundational concepts:
3.1.1. Diffusion Models
Diffusion models are a class of generative models that learn to reverse a gradual diffusion process. They work by progressively adding noise to data (e.g., images, videos) until it becomes pure noise, and then learning to reverse this process, step-by-step, to generate new data from noise.
- Forward Diffusion Process: Gradually adds Gaussian noise to an input data point $x_0$ over $T$ timesteps, producing $x_T$, where $x_T$ is (approximately) pure noise.
- Reverse Diffusion Process: A neural network (often a U-Net or Transformer) is trained to predict the noise added at each step, or to directly predict the "velocity" that returns to the clean data. By iteratively removing predicted noise from a noisy sample, it generates a clean data sample.
3.1.2. Latent Diffusion Models (LDMs)
Latent Diffusion Models improve efficiency by performing the diffusion process in a compressed latent space rather than the high-dimensional pixel space.
- Variational Auto-Encoder (VAE): An Auto-Encoder is a neural network trained to encode input data into a lower-dimensional representation (latent code) and then decode it back to reconstruct the original data. A Variational Auto-Encoder (VAE) introduces a probabilistic twist, learning a distribution over the latent space, which allows for sampling new, coherent latent codes for generation. In LDMs, a VAE's encoder compresses images/videos into a compact latent representation, and its decoder converts generated latent codes back into pixels.
- Diffusion in Latent Space: The diffusion process (noise addition and removal) operates on these smaller latent codes, significantly reducing computational cost and memory requirements while maintaining high-quality generation.
3.1.3. Diffusion Transformers (DiTs)
Diffusion Transformers replace the traditional U-Net architecture, commonly used in diffusion models, with a Transformer architecture.
- Transformer Architecture: Originally designed for sequence-to-sequence tasks (like natural language processing), Transformers rely heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence.
- Applying Transformers to Diffusion: In DiTs, images or video frames are first broken down into patches, which are then linearly projected into token embeddings. These tokens, along with positional encodings and timestep embeddings, are fed into a series of Transformer blocks. Each Transformer block typically contains self-attention, cross-attention (e.g., for text conditioning), and feed-forward networks (FFN). DiTs have shown excellent scalability and performance in generative tasks.
- 3D DiT: For video generation, DiTs are extended to 3D DiTs, which incorporate spatiotemporal self-attention to process information across both spatial (height, width) and temporal (frames) dimensions.
3.1.4. Rotary Position Embedding (RoPE)
Rotary Position Embedding (RoPE) is a type of positional encoding used in Transformers. Unlike absolute positional embeddings (which directly add fixed vectors to tokens) or relative positional embeddings (which compute pairwise distances), RoPE encodes absolute position information with rotation matrices and naturally incorporates relative position information in self-attention calculations.
- Mechanism: RoPE applies a rotation to the query and key vectors based on their position. When computing attention scores (dot products of query and key), the relative position between two tokens is naturally encoded by the angle between their rotated vectors.
- Key Property: RoPE has a crucial property: tokens with closer spatiotemporal distances receive higher attention weights. This makes it effective for capturing local correlations.
- Formula Intuition: For a token at position $m$, its embedding is transformed by a rotation matrix $R_m$. The attention score between a query at position $m$ and a key at position $n$ then becomes $\langle R_m q, R_n k \rangle$, which can be shown to depend only on $m - n$, thus encoding relative position.
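To make this relative-position property concrete, here is a minimal NumPy sketch of 1-D rotary embedding (illustrative only, not the paper's implementation): rotating query and key vectors by position-dependent angles makes their dot product depend only on the offset between positions.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate pairs of features in x by angles pos * theta_k (1-D RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # decreasing base frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset (m - n = 3) at different absolute positions gives the same score.
s1 = rope_rotate(q, 10) @ rope_rotate(k, 7)
s2 = rope_rotate(q, 110) @ rope_rotate(k, 107)
print(np.isclose(s1, s2))   # True: the attention score depends only on m - n
```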
3.1.5. Rectified Flow
Rectified Flow (RF) is a method used in generative modeling that provides a direct, straight path between a data distribution and a simple noise distribution.
- Connection to Diffusion: Unlike standard diffusion models that follow a complex, non-linear path, Rectified Flow aims to define a straight line (flow) in the latent space from noisy data ($z_1 = \epsilon$) back to clean data ($z_0$).
- Training Objective: The model is trained to predict the "velocity" vector along this straight path, which is simpler to learn than the noise in traditional diffusion. The path is defined as $z_\tau = (1 - \tau) z_0 + \tau \epsilon$, where $\tau$ is the timestep and $\epsilon$ is standard Gaussian noise. The objective is to regress this velocity.
- Advantages: Can lead to more stable training and faster sampling.
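A minimal sketch of a rectified-flow training step, assuming a generic velocity-prediction network with the hypothetical signature `model(z_tau, tau, text_cond)`:

```python
import torch

def rectified_flow_loss(model, z0, text_cond):
    """One training step's loss: regress the constant velocity along the straight path."""
    eps = torch.randn_like(z0)                                 # z1: pure Gaussian noise
    tau = torch.rand(z0.shape[0], *([1] * (z0.dim() - 1)))     # per-sample timestep in [0, 1]
    z_tau = (1 - tau) * z0 + tau * eps                         # straight interpolation path
    target_v = eps - z0                                        # true velocity z1 - z0
    pred_v = model(z_tau, tau.flatten(), text_cond)            # network predicts the velocity
    return torch.mean((pred_v - target_v) ** 2)
```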
3.1.6. Classifier-Free Guidance (CFG)
Classifier-Free Guidance is a technique used to improve the quality and adherence to conditioning signals (like text prompts) in diffusion models.
- Mechanism: During inference, the model runs two denoising passes: one conditioned on the input (e.g., the text prompt) and one unconditioned (e.g., using an empty text prompt). The difference between the conditioned and unconditioned predictions is scaled by a guidance factor (7.5 in this paper) and added to the unconditioned prediction to form the final guided prediction.
- Effect: This process steers the generation more strongly towards the provided condition, often producing higher quality and more relevant outputs, though it can sometimes reduce diversity.
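A minimal sketch of the standard CFG combination at a single denoising step (the `model` signature is a placeholder; 7.5 matches the guidance scale reported in this paper):

```python
def cfg_prediction(model, z_t, tau, text_cond, null_cond, guidance_scale=7.5):
    """Combine conditional and unconditional predictions (standard CFG formulation)."""
    pred_cond = model(z_t, tau, text_cond)      # pass conditioned on the text prompt
    pred_uncond = model(z_t, tau, null_cond)    # pass with an empty / null prompt
    # Scale the difference and add it to the unconditional prediction,
    # steering generation toward the text condition.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```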
3.1.7. DDIM (Denoising Diffusion Implicit Models)
DDIM is a faster sampling method for diffusion models.
- Implicit Models: Unlike Denoising Diffusion Probabilistic Models (DDPMs), which require many steps to sample (due to their stochastic nature), DDIMs formulate the reverse process as a deterministic implicit model.
- Faster Sampling: This determinism allows DDIM to generate high-quality samples with significantly fewer denoising steps (e.g., 50 steps in this paper, compared to hundreds or thousands for DDPMs), making inference much quicker.
3.2. Previous Works
The paper discusses previous works in Text-to-Video Generation and Controllable Video Generation, primarily categorizing multi-shot video generation into two paradigms.
3.2.1. Text-to-Video Generation
- Early Methods: Initially, methods inflated pretrained text-to-image (T2I) generation models (e.g., Latent Diffusion Models like Stable Diffusion [39]) by adding temporal layers [14, 56]. These could generate short video animations.
- Recent Advances (DiT-based): More recent approaches, including the base model MultiShotMaster builds upon, employ Diffusion Transformer (DiT) architectures [36, 45, 60, 69]. These can generate longer, higher-quality videos with detailed text descriptions [6, 12, 27].
- Scaling Challenges: Scaling video generation beyond short, single clips is still an open problem.
  - Single-Shot Long Video Generation: Focuses on extending clip length but faces issues like error accumulation and memory loss (e.g., maintaining consistent object appearance or scene details over very long durations) [8, 19, 20, 63, 64].
  - Multi-Shot Long Video Generation: The focus of this paper, which aims for narrative coherence and inter-shot consistency [7, 15, 37, 58].
3.2.2. Multi-Shot Video Generation Paradigms
The paper identifies two main paradigms:
- Text-to-Keyframe Generation + Image-to-Video (I2V) Generation [9, 58, 62, 70]:
  - Process: First generates a set of visually consistent keyframes (static images) based on text prompts. Then, an Image-to-Video (I2V) model is used to generate each shot, animating from its corresponding keyframe.
  - Limitations:
    - Relies heavily on the quality of keyframe generation.
    - Limited conditional capability: sparse keyframes cannot fully cover briefly-appearing characters or scene consistency outside the keyframes.
    - Struggles to maintain consistency for characters or objects that appear in multiple non-keyframe segments.
- Direct End-to-End Generation [7, 15, 22, 37, 51, 57]:
  - Process: Generates multi-shot videos directly from text, using full attention along the temporal dimension to maintain consistency.
  - Limitations:
    - Often constrained by fixed shot duration or limited shot counts.
    - May contain "boring" frames if shot durations are fixed but the narrative requires a short, sharp cut (e.g., a quick insert shot of an action).
    - Existing methods in this paradigm (like the keyframe-based one) are primarily driven only by text prompts, lacking comprehensive controllability.
  - Notable Works:
    - ShotAdapter [22]: Incorporates learnable transition tokens that interact only with shot-boundary frames to indicate transitions.
    - CineTrans [57]: Constructs a specific attention mask to weaken inter-shot correlations, enabling transitions at predefined positions.
  - Differentiation from MultiShotMaster: MultiShotMaster contrasts with ShotAdapter and CineTrans by manipulating RoPE embeddings directly to convey transition signals. This approach aims to prevent interference with the token interactions in pretrained attention and explicitly achieve shot transitions, rather than using separate learnable tokens or attention masks.
3.2.3. Controllable Video Generation
This field focuses on providing explicit and precise user control for content creation [32, 47, 48, 53, 55].
- Diverse Control Signals: Supports various types of control, such as camera motion [13, 16], object motion [31, 33, 41, 59], and reference video/image [4, 5, 26, 30].
- Reference-to-Video:
  - VACE [21] and Phantom [29]: Support multi-reference video generation and aim for realistic composition. They typically focus on single-shot generation.
  - Differentiation from MultiShotMaster: These methods often rely on separate adapters for reference injection and motion control. In a multi-shot setting, this could lead to larger networks and higher computational costs. MultiShotMaster proposes a unified framework that supports reference injection and motion control jointly without additional adapters, by integrating these controls directly into the RoPE mechanism.
- Motion Control:
  - Tora [68] and Motion Prompting [13]: Control object motion through point trajectories.
  - Differentiation from MultiShotMaster: These are typically for single-shot settings. MultiShotMaster extends motion control to multi-shot scenarios by creating multiple copies of subject tokens with different spatiotemporal RoPE assignments.
3.3. Technological Evolution
The evolution in video generation has moved from:
1. Early T2I-based inflation: Adapting Text-to-Image models by adding simple temporal layers for basic video animation (e.g., Tune-A-Video [56], AnimateDiff [14]).
2. Dedicated Video Diffusion Models: Developing architectures specifically for video, often U-Net based, leading to better quality and longer single clips.
3. Diffusion Transformers (DiTs): Employing Transformer architectures (like Sora [6], Open-Sora [27, 69], WAN [45]) for improved scalability, quality, and contextual understanding in single-shot video generation.
4. Controllability Enhancements: Integrating various control signals (e.g., ControlNet for images extended to video, reference image adapters, motion prompting) into these DiT-based models for single shots.
5. Addressing Multi-Shot Narratives: The current frontier, where MultiShotMaster sits. Previous attempts involved keyframe-based or end-to-end approaches that still lacked comprehensive control, flexible shot arrangements, and narrative coherence over longer, multi-shot sequences.

MultiShotMaster fits into this timeline by taking the powerful DiT-based single-shot video generation (stages 3-4) and extending it to address the complex challenges of multi-shot narrative video generation (stage 5) by innovating on positional encoding and data conditioning.
3.4. Differentiation Analysis
Compared to the main methods in related work, MultiShotMaster offers several core differences and innovations:
- Novel RoPE-based Control:
  - Vs. CineTrans [57] (attention mask) & ShotAdapter [22] (learnable tokens): MultiShotMaster uses Multi-Shot Narrative RoPE to encode shot transitions via explicit angular phase shifts directly into the positional embeddings. This is a more intrinsic modification to the Transformer's attention mechanism compared to CineTrans's external attention mask manipulation (which might impede original token interactions) or ShotAdapter's use of separate learnable transition tokens. RoPE manipulation integrates the transition signal more seamlessly and without adding trainable parameters for this specific function.
  - Vs. VACE [21] & Phantom [29] (separate adapters for reference/motion): MultiShotMaster integrates reference injection and motion control directly into the RoPE mechanism via Spatiotemporal Position-Aware RoPE. This allows for spatiotemporal grounding of subjects and backgrounds without the need for additional adapters, which often lead to larger networks and higher computational costs in multi-shot settings.
- Flexible Shot Arrangement: By directly manipulating RoPE for shot boundaries, MultiShotMaster enables users to flexibly configure both the number of shots and their respective durations, a capability often constrained in existing end-to-end multi-shot generation paradigms that rely on fixed shot durations or limited shot counts.
- Comprehensive Controllability: The framework provides director-level controllability encompassing variable shot counts and durations, dedicated text descriptions for each shot (hierarchical prompts), character appearance and motion control (via reference images and grounding signals), and background-driven scene definition. This goes beyond the text-only or single-aspect control common in many prior works.
- Automated Data Curation: The development of a novel automated pipeline for Multi-Shot & Multi-Reference Data Curation directly addresses the common problem of data scarcity for complex video generation tasks, providing essential training data that is hard to collect manually.
- Unified Architecture: By leveraging intrinsic architectural properties (specifically RoPE and attention masks) within a pretrained DiT, the framework achieves multi-shot generation and diverse control signals without significantly altering the core model architecture or adding large numbers of new parameters for each control type.

In essence, MultiShotMaster distinguishes itself by offering a more integrated, efficient, and flexible approach to multi-shot video generation, moving beyond single-shot limitations and rigid multi-shot paradigms through innovative RoPE modifications and a robust data pipeline.
4. Methodology
4.1. Principles
The core idea of MultiShotMaster is to extend a pretrained single-shot Text-to-Video (T2V) model into a highly controllable multi-shot generation framework. The theoretical basis and intuition behind it lie in strategically manipulating Rotary Position Embedding (RoPE)—a crucial component in Transformer architectures—to encode both shot boundary information and spatiotemporal grounding signals for reference injection.
The insights driving this approach are:

- RoPE's Spatiotemporal Correlation: RoPE naturally emphasizes tokens with closer spatiotemporal distances. However, in a multi-shot video, applying continuous RoPE across all frames would incorrectly imply strong temporal correlations across shot boundaries, confusing intra-shot continuity with inter-shot transitions. The principle is to break this continuity at shot transitions.
- RoPE for Grounding: The same property of RoPE (weighting closer tokens more) can be leveraged to establish strong connections between reference tokens (e.g., an image of a subject) and specific video tokens (e.g., a bounding box region in a frame). By applying RoPE from a specified region to a reference token, the model is encouraged to "ground" that reference within that exact spatiotemporal location.

By modifying RoPE, the framework aims to:

- Explicitly define shot transitions without introducing new learnable parameters or complex attention mask heuristics that might interfere with the pretrained model's capabilities.
- Enable precise control over the spatiotemporal appearance and motion of subjects and backgrounds using reference images and grounding signals.
- Maintain narrative coherence and inter-shot consistency through a global understanding of the video and targeted control.
4.2. Core Methodology In-depth (Layer by Layer)
MultiShotMaster builds upon a pretrained single-shot Text-to-Video (T2V) model and introduces several key innovations, primarily centered around RoPE modifications and a data curation pipeline.
4.2.1. Evolving from Single-Shot to Multi-Shot T2V
4.2.1.1. Preliminary: The Base T2V Model
The foundation of MultiShotMaster is a pretrained single-shot Text-to-Video (T2V) model with approximately 1 billion parameters. This base model consists of three main components:

- 3D Variational Auto-Encoder (VAE) [24]: Responsible for encoding input video frames into a lower-dimensional latent space and decoding generated latent representations back into pixel space. For multi-shot videos, each shot is encoded separately through the 3D VAE, and the resulting video latents are then concatenated.
- T5 Text Encoder [38]: Processes text prompts, converting them into rich text embeddings that capture semantic information.
- Latent Diffusion Transformer (DiT) model [36]: The core generative network. Each basic DiT block typically contains:
  - a 2D spatial self-attention module, which processes information within individual frames;
  - a 3D spatiotemporal self-attention module, which processes information across both spatial dimensions and the temporal dimension (frames);
  - a text cross-attention module, which allows the DiT to condition its generation on the text embeddings from the T5 encoder;
  - a feed-forward network (FFN).

The denoising process in this DiT follows Rectified Flow [11]. A straight path is defined from clean data $z_0$ to noised data $z_\tau$ at a given timestep $\tau$:

$ z_\tau = (1 - \tau) z_0 + \tau \epsilon $

where:

- $z_\tau$ is the noised data at timestep $\tau$.
- $z_0$ is the clean, original data.
- $\tau$ is the timestep, typically ranging from 0 to 1.
- $\epsilon$ is standard Gaussian noise.

The denoising network is trained to predict the velocity along this path. The training objective, referred to as the LCM (Latent Consistency Model) objective [28], aims to regress this velocity:

$ \mathcal{L}_{LCM} = \mathbb{E}_{\tau, \epsilon, z_0} \left[ \| (z_1 - z_0) - v_{\Theta} (z_{\tau}, \tau, c_{text}) \|_2^2 \right] $

where:

- $\mathcal{L}_{LCM}$ is the training loss.
- $\mathbb{E}_{\tau, \epsilon, z_0}$ denotes the expectation over timestep $\tau$, noise $\epsilon$, and clean data $z_0$.
- $\| \cdot \|_2^2$ is the squared L2 norm, measuring the difference between the true velocity and the predicted velocity.
- $z_1 - z_0$ is the true velocity, the vector from the clean data $z_0$ to the fully noisy data $z_1 = \epsilon$.
- $v_{\Theta}(z_{\tau}, \tau, c_{text})$ is the velocity predicted by the denoising network $v_{\Theta}$, given the noised latent $z_{\tau}$, timestep $\tau$, and text conditioning $c_{text}$.
4.2.1.2. Multi-Shot Narrative RoPE
A critical challenge in extending a single-shot model to multi-shot videos is that the original 3D-RoPE assigns sequential indices along the temporal dimension. This causes the model to incorrectly perceive intra-shot consecutive frames and inter-shot frames across shot boundaries as equally related, leading to confusion and hindering explicit shot transitions.
To explicitly help the model recognize shot boundaries and enable controllable transitions, MultiShotMaster introduces Multi-Shot Narrative RoPE. This mechanism breaks the continuity of RoPE at shot transitions by introducing an angular phase shift for each transition. The query $Q_i$ of the $i$-th shot is computed as follows (and similarly for the key $K_i$):

$ Q_i = \mathrm{RoPE}((t + i \phi) \cdot f, \ h \cdot f, \ w \cdot f) \odot \tilde{Q}_i $

where:

- $Q_i$ is the query embedding for the $i$-th shot after applying Multi-Shot Narrative RoPE.
- $\mathrm{RoPE}(\cdot)$ denotes the Rotary Position Embedding function.
- $(t, h, w)$ are the spatiotemporal position indices of a token within a frame and shot: $t$ is the temporal index, and $h, w$ are the height and width indices.
- $i$ is the shot index (e.g., 0 for the first shot, 1 for the second, etc.).
- $\phi$ is the angular phase shift factor. This factor introduces a fixed, explicit rotation at each shot boundary, effectively shifting the temporal context for the subsequent shot without disrupting the relative positioning within a shot.
- $f$ is a decreasing base frequency vector, a characteristic component of RoPE that controls the rotation frequency.
- $\odot$ denotes the element-wise rotary transformation of the original query embeddings $\tilde{Q}_i$ via complex rotations.

This design has several advantages:

- It maintains the narrative shooting order of inter-shot frames.
- It leverages RoPE's inherent rotational properties to mark shot boundaries through fixed phase shifts.
- It requires no additional trainable parameters, making it efficient.
- It allows users to flexibly configure both the number of shots and their respective durations.
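As a minimal sketch of the idea (assuming the temporal index is simply offset by $i\phi$ per shot; the paper's exact index construction may differ in detail), the phase-shifted temporal positions could be built as follows:

```python
import numpy as np

def narrative_temporal_positions(shot_lengths, phi=0.5):
    """Temporal RoPE indices with an explicit phase shift of i * phi added for shot i."""
    positions, t = [], 0
    for i, length in enumerate(shot_lengths):
        positions.append(np.arange(t, t + length, dtype=np.float64) + i * phi)
        t += length
    return np.concatenate(positions)

print(narrative_temporal_positions([4, 3, 2]))
# [0.  1.  2.  3.  4.5 5.5 6.5 8.  9. ]  -> +phi jumps mark the shot boundaries
```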
4.2.1.3. Hierarchical Prompt Structure and Shot-Level Cross-Attention
To enable fine-grained control for each shot, a hierarchical prompt structure is designed:

- Global Caption: Describes overarching subject appearances and environmental settings for the entire multi-shot video.
- Per-Shot Captions: Provide detailed information for each individual shot, such as subject actions, specific backgrounds, and camera movements.

For each shot, the global caption is combined with its corresponding per-shot caption. In the vanilla T2V model, text embeddings from the T5 encoder are replicated along the temporal dimension to align with the video frame sequence for text-frame cross-attention. Accordingly, MultiShotMaster replicates each shot's combined text embeddings to match the corresponding shot's frame count. This enables shot-level cross-attention, ensuring that each shot is conditioned by its specific textual description. A minimal sketch of this replication is shown below.
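The sketch below assumes toy tensor shapes and a hypothetical helper name; it only illustrates the per-shot replication of combined caption embeddings along the frame dimension.

```python
import torch

def shot_level_text_embeddings(shot_text_embs, shot_frame_counts):
    """Replicate each shot's combined (global + per-shot) caption embedding per frame.

    shot_text_embs: list of [num_tokens, dim] tensors, one per shot.
    shot_frame_counts: list of ints, number of frames in each shot.
    """
    per_frame = [
        emb.unsqueeze(0).expand(n_frames, -1, -1)      # [n_frames, num_tokens, dim]
        for emb, n_frames in zip(shot_text_embs, shot_frame_counts)
    ]
    return torch.cat(per_frame, dim=0)                 # [total_frames, num_tokens, dim]

cond = shot_level_text_embeddings([torch.randn(77, 1024), torch.randn(77, 1024)], [45, 30])
print(cond.shape)  # torch.Size([75, 77, 1024])  (45 + 30 frames)
```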
4.2.2. Spatiotemporal-Grounded Reference Injection
Users often need to customize video content with specific reference images (e.g., for subjects or backgrounds) and control their motion. MultiShotMaster addresses this with Spatiotemporal Position-Aware RoPE.
4.2.2.1. Mechanism for Reference Injection
- Encoding References: Each reference image (subject or background) is individually encoded into the latent space using the 3D VAE. These reference latents (referred to as reference tokens) are then concatenated with the noised video latents.
- Information Flow: During temporal attention, these clean reference tokens propagate visual information to the noisy video tokens, facilitating the injection of reference visual characteristics.
- Leveraging 3D-RoPE: Inspired by 3D-RoPE's property of assigning higher attention weights to tokens with closer spatiotemporal distances, MultiShotMaster applies 3D-RoPE from specified regions to the corresponding reference tokens. This establishes strong correlations between the region-specified video tokens and the reference tokens during attention computation.

Specifically, for a reference query associated with a subject's bounding box region at the $t$-th frame, the 3D-RoPE is sampled as:

$ Q^{ref} = \mathrm{RoPE}((t + i \phi) \cdot f, \ h^{ref} \cdot f, \ w^{ref} \cdot f) \odot \tilde{Q}^{ref}, \quad h^{ref} = y_1 + \frac{y_2 - y_1}{H} \cdot j, \ j \in [0, H-1], \quad w^{ref} = x_1 + \frac{x_2 - x_1}{W} \cdot k, \ k \in [0, W-1] $

where:

- $Q^{ref}$ is the query embedding for a reference token after applying Spatiotemporal Position-Aware RoPE.
- $\tilde{Q}^{ref}$ is the original reference query embedding.
- $(t + i \phi) \cdot f$ is the modified temporal positional encoding incorporating the Multi-Shot Narrative RoPE (the $i\phi$ term for shot transitions).
- $h^{ref} \cdot f$ and $w^{ref} \cdot f$ are the spatial positional encodings.
- $(x_1, y_1, x_2, y_2)$ define the bounding box coordinates (top-left $(x_1, y_1)$ and bottom-right $(x_2, y_2)$) of the specified region for the reference.
- $H, W$ are the overall spatial dimensions (height, width) of the video frame or latent representation.
- $j$ and $k$ are indices used to sample the RoPE within the specified bounding box region, effectively mapping the reference token to that region.

This method allows precise control over where a subject appears in space and time within a video. A minimal sketch of the coordinate sampling appears below.
4.2.2.2. Motion Control
To control the motion trajectory of a subject, MultiShotMaster creates multiple copies of the subject tokens. Each copy is assigned a different spatiotemporal RoPE corresponding to a specific position at a specific frame along the desired trajectory. The temporal attention mechanism then transfers the subject motion embedded in these copies to the video tokens at the corresponding spatiotemporal positions. After the attention computation, the copied tokens of each subject are averaged to consolidate their influence.
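A minimal sketch of the copy-and-average idea (shapes, the trajectory format, and the attention step are illustrative placeholders):

```python
import torch

# One subject reference: 16 tokens of dimension 64 (illustrative shapes).
ref_tokens = torch.randn(1, 16, 64)

# Desired trajectory: (frame index, bounding box) pairs the subject should follow.
trajectory = [(0, (10, 20, 50, 60)), (8, (30, 20, 70, 60)), (16, (50, 20, 90, 60))]

# One copy of the subject tokens per trajectory point; each copy would receive the
# Spatiotemporal Position-Aware RoPE of its own (frame, box) before attention.
copies = ref_tokens.repeat(len(trajectory), 1, 1)          # [3, 16, 64]

# ... joint attention with the video tokens happens here ...
attended_copies = copies                                   # placeholder for the attention output

# After attention, the copies of the subject are averaged back into a single token set.
consolidated = attended_copies.mean(dim=0, keepdim=True)   # [1, 16, 64]
print(consolidated.shape)
```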
4.2.2.3. Multi-Shot Scene Customization
For multi-shot scene customization using background images, the 3D-RoPE from the first frame of each shot is copied and applied to the corresponding background tokens. This ensures consistent background injection across the relevant shots.
By incorporating spatiotemporal-controllable multi-reference injection, the framework significantly expands its capabilities, allowing users to customize characters, control their positions and movements, and achieve customized multi-shot scene consistency using multiple background images.
4.2.3. Multi-Shot & Multi-Reference Attention Mask
The introduction of multiple reference images and subject copies can lead to very long contexts, increasing computational costs and potentially causing unnecessary interactions (content leakage) between unrelated tokens (e.g., Subject 0 only appears in Shot 2, but other shots might still attend to its reference token).
To constrain information flow and optimize attention allocation, a multi-shot & multi-reference attention mask is designed:

- Video Token Interactions: Full attention is maintained across all multi-shot video tokens to ensure global consistency across the entire video narrative.
- Reference Token Interactions: Each shot is limited to accessing only the reference tokens relevant to itself. Similarly, the reference tokens of each shot can only attend to each other and to the video tokens within the same shot. This prevents irrelevant content from leaking between shots and optimizes computational resources.

This strategy ensures that each shot focuses on its intra-shot reference injection while preserving global consistency through inter-shot full attention among video tokens. A minimal sketch of such a mask is given below.
4.2.4. Training and Inference Paradigm
4.2.4.1. Training
The training process is typically conducted in stages based on a foundation generation model:
- Stage 1: Single-Shot Reference Injection: Spatiotemporal-specified reference injection is first trained on 300k single-shot data. This helps the model learn to handle diverse subjects and reference signals. Bounding boxes are sampled with random starting points and 1-second intervals, with a 0.5 drop probability for each bounding box to encourage robustness and ease of control.
- Stage 2: Multi-Shot & Multi-Reference Training: The model is then trained on the automatically constructed multi-shot & multi-reference data. To enable controllable multi-shot video generation with various modes (text-driven, subject-driven, background-driven, joint-driven), subject and background conditions are randomly dropped during training, each with a 0.5 probability (enabling classifier-free guidance for these conditions).
- Stage 3: Cross-Shot Subject-Focused Post-training: To improve subject consistency and help the model better understand how subjects change across different shots, a cross-shot subject-focused post-training step is introduced. This assigns a higher loss weight to subject regions than to backgrounds.
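As a sketch of the subject-focused weighting in Stage 3 (the weight values `w_subject` and `w_background` are hypothetical placeholders; the paper's specific values are not reproduced here):

```python
import torch

def region_weighted_loss(pred_v, target_v, subject_mask, w_subject=2.0, w_background=1.0):
    """Velocity-regression loss with a larger weight on subject regions.

    subject_mask: 1 inside subject regions, 0 elsewhere (same shape as the latents).
    """
    weights = w_background + (w_subject - w_background) * subject_mask.float()
    return (weights * (pred_v - target_v) ** 2).mean()
```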
4.2.4.2. Inference
During inference, MultiShotMaster supports:

- Text-driven inter-shot consistency.
- Customized subjects with motion control (via reference images and bounding boxes).
- Background-driven customized scene consistency (via reference images).
- Flexible configuration of shot count and duration.

The framework provides diverse multi-shot video content creation capabilities, allowing users to craft highly customized video narratives.
4.2.5. Multi-Shot & Multi-Reference Data Curation
To address the data scarcity for complex multi-shot and multi-reference video generation, an automated data annotation pipeline is established.
The following figure (Figure 3 from the original paper) illustrates the multi-step data curation pipeline of the MultiShotMaster framework, including long video data collection, shot transition detection, scene segmentation, and the construction of 235k multi-shot samples. It also depicts the subject annotation process using tools such as Gemini and YOLOv11, and shows the per-shot caption generation.
4.2.5.1. Multi-Shot Videos Collection
- Crawl Long Videos: Long videos are first crawled from the internet, encompassing diverse types (movies, TV series, documentaries, cooking, sports, fitness).
- Shot Transition Detection: TransNet V2 [43] is used to detect shot transitions and crop out individual single-shot videos.
- Scene Segmentation: A scene segmentation method [54] is employed to understand the storyline and group single-shot videos captured within the same scene. This can cluster videos spanning tens of minutes.
- Multi-Shot Sampling: Multi-shot videos are then sampled from these segmented scenes with specific criteria: shot count ranging from 1 to 5, frame count from 77 to 308 (i.e., 5-20 seconds at 15 fps), with priority given to samples having higher frame counts and more shots. This process yields 235k multi-shot samples.
4.2.5.2. Caption Definition
A hierarchical caption structure (global and per-shot captions) is used:

- Global Caption: Gemini-2.5 [10] is used to understand the entire multi-shot video and generate a comprehensive global caption. Each subject is denoted by indexed nouns like "Subject X" (X in [1, 2, 3, ...]).
- Per-Shot Captions: Based on the global caption and each individual shot video, Gemini-2.5 is again used to generate per-shot captions. These captions consistently use the predefined "Subject X" identifiers from the global caption to ensure cross-shot subject consistency.

The following figure (Figure 6 from the original paper's appendix) shows the prompt template used for labeling global captions.

The following figure (Figure 7 from the original paper's appendix) shows the prompt template used for labeling per-shot captions, demonstrating how labeled subjects can be consistent across global and per-shot captions.

4.2.5.3. Reference Images Collection
- Subject Images:
  - YOLOv11 [23]: Detects objects (subjects).
  - ByteTrack [66]: Tracks detected subjects within each shot.
  - SAM [25]: Segments subjects to obtain clean images.
  - Cross-Shot Merging: Since tracking is done shot-by-shot, Gemini-2.5 [10] is used to group tracking results across all shots. This involves selecting the largest subject image for each shot-level track ID from each shot and then grouping them based on appearance similarity using a carefully designed prompt template. This yields complete multi-shot tracking results and corresponding subject images.
- Background Images: The first frames of each shot and their corresponding foreground masks are fed into OmniEraser [52] to obtain clean backgrounds.

The following figure (Figure 8 from the original paper's appendix) shows the prompt for applying Gemini-2.5 to the subject images to merge cross-shot tracking annotations.

4.2.6. Algorithm 1: Temporal Attention with Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE
To provide a clearer understanding of how Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE are integrated into the temporal attention mechanism, Algorithm 1 details the process.
Algorithm 1 Temporal Attention with Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE.
Input:
- In-context latents $z$ containing:
  - Multi-shot video latents $[z_i]_{i=1}^{N_{shot}}$, where $N_{shot}$ is the number of shots.
  - Reference latents $[z^{ref}_m]_{m=1}^{N_{ref}}$, where $N_{ref}$ is the number of input reference images (subjects and backgrounds).
- Bounding box sequences of the references (subjects and backgrounds) $[boxes]_{b=1}^{N_{box}}$, where $N_{box}$ is the total number of bounding boxes. Each tuple indicates the bounding box for a given reference at a given frame. For background references, the bounding box is fixed to the full frame at the first frame of the corresponding shot.

Output:
- In-context latents after temporal attention.

1. $\tilde{Q}, \tilde{K}, \tilde{V} \leftarrow$ linear projections of the in-context latents to queries, keys, and values.
2. Apply Multi-Shot Narrative RoPE:
$ Q = [Q_i]_{i=1}^{N_{shot}} = \text{Eq. 2}([\tilde{Q}_i]_{i=1}^{N_{shot}}), \quad K = [K_i]_{i=1}^{N_{shot}} = \text{Eq. 2}([\tilde{K}_i]_{i=1}^{N_{shot}}), \quad V = [\tilde{V}_i]_{i=1}^{N_{shot}} $
Explanation: The original query, key, and value embeddings of the video latents are processed. Multi-Shot Narrative RoPE (Equation 2 in the main paper) is applied to the query ($Q$) and key ($K$) embeddings of each shot. This introduces angular phase shifts at shot boundaries, preserving narrative order while signaling shot transitions. The value ($V$) embeddings are not modified by RoPE.
3. Apply Spatiotemporal Position-Aware RoPE:
$ Q^{ref} = [Q_b^{ref}]_{b=0}^{N_{box}} = \text{Eq. 3}(\text{Copy}(\tilde{Q}^{ref}), [boxes]_b^{N_{box}}), \quad K^{ref} = [K_b^{ref}]_{b=0}^{N_{box}} = \text{Eq. 3}(\text{Copy}(\tilde{K}^{ref}), [boxes]_b^{N_{box}}), \quad V^{ref} = [\tilde{V}_b^{ref}]_{b=0}^{N_{box}} = \text{Copy}(\tilde{V}^{ref}, [boxes]_b^{N_{box}}) $
Explanation: For the reference latents, copies are created based on the number of bounding boxes each reference is associated with. Spatiotemporal Position-Aware RoPE (Equation 3 in the main paper) is then applied to these copied query ($Q^{ref}$) and key ($K^{ref}$) embeddings, using the specific spatiotemporal coordinates from $[boxes]$. This grounds the reference tokens to precise regions in the video. The value ($V^{ref}$) embeddings are simply copied without RoPE modification.
4. Attention Computation:
$ \hat{Z} = \mathrm{Attention}([Q, Q^{ref}], [K, K^{ref}], [V, V^{ref}], \text{Mask}) $
Explanation: The combined video and reference queries, keys, and values are fed into the attention mechanism. The Multi-Shot & Multi-Reference Attention Mask is applied here to constrain information flow, ensuring appropriate interactions between video tokens and relevant reference tokens while maintaining global consistency. The output $\hat{Z}$ contains the attention-weighted representations.
5. Reference Aggregation:
$ \bar{z}^{ref} = [\bar{z}^m]_{m=1}^{N_{ref}} = [\text{mean}([\hat{z}^m_b]_{b}^{N_{box}^m})]_{m=1}^{N_{ref}} $
Explanation: For each original reference $m$, all of its copies processed during attention are averaged (mean) to produce a single aggregated representation $\bar{z}^m$. This step consolidates the influence of the multiple positional copies used for motion control.
6. Final linear projection: the processed video latents and the aggregated reference latents are concatenated and passed through a final linear projection layer (to_out) to produce the output in-context latents, which have the same dimension as the input.
7. Return the in-context latents.

This algorithm outlines how Multi-Shot Narrative RoPE (for shot transitions) and Spatiotemporal Position-Aware RoPE (for reference grounding and motion control) are integrated into the temporal attention layer, leveraging the attention mask to manage information flow, ultimately enabling highly controllable multi-shot video generation.
5. Experimental Setup
5.1. Datasets
The paper focuses on building a dedicated dataset for training MultiShotMaster due to the scarcity of existing multi-shot, multi-reference data.
- Source: Long videos crawled from the internet, including diverse categories like movies, television series, documentaries, cooking demonstrations, sports, and fitness. This wide variety ensures generalizability across different content types and cinematic styles.
- Scale: The automated data annotation pipeline generated 235k multi-shot samples.
- Characteristics:
  - Each video comprises 1-5 shots.
  - Frame count ranges from 77 to 308 (i.e., 5-20 seconds at 15 fps).
  - Includes hierarchical captions: a global caption for overall context and per-shot captions for detailed, shot-specific descriptions.
  - Contains cross-shot grounding signals (bounding boxes) for subjects.
  - Includes reference images for subjects and backgrounds.
- Data Curation Process (as described in 4.2.5):
  - Long videos are collected.
  - TransNet V2 [43] detects shot transitions to extract single-shot videos.
  - A scene segmentation method [54] groups single shots belonging to the same scene.
  - Multi-shot videos are sampled from these scenes based on shot count (1-5) and frame count (77-308), prioritizing longer, multi-shot samples.
  - Gemini-2.5 [10] generates global and per-shot captions, ensuring cross-shot subject consistency by using indexed nouns for subjects.
  - YOLOv11 [23], ByteTrack [66], and SAM [25] detect, track, and segment subjects within each shot.
  - Gemini-2.5 merges these cross-shot tracking annotations to obtain consistent subject IDs.
  - OmniEraser [52] extracts clean backgrounds from the first frame of each shot using foreground masks.
- Why these datasets were chosen: The creation of this specialized dataset was necessitated by the lack of public datasets with the required complexity (multi-shot, hierarchical captions, multi-reference, grounding signals) to train a model with such comprehensive controllability. The automated pipeline makes it feasible to scale this unique data generation.
5.2. Evaluation Metrics
The paper employs a comprehensive suite of metrics to evaluate MultiShotMaster from various angles, covering both multi-shot narrative video generation and reference injection.
5.2.1. Text Alignment (TA)
- Conceptual Definition: This metric quantifies how well the generated video content matches its accompanying text prompt. A higher value indicates better adherence to the textual description.
- Mathematical Formula: The paper states that it calculates the similarity between text features and shot features extracted by ViCLIP [50]. This typically involves computing the cosine similarity between the embedding vectors:
$ \text{Text Alignment} = \frac{1}{N_{shots}} \sum_{i=1}^{N_{shots}} \cos\!\left( E_{text}(C_i), E_{video}(V_i) \right) $
- Symbol Explanation:
  - $N_{shots}$: Total number of shots in the video.
  - $C_i$: Text caption for the $i$-th shot.
  - $V_i$: Generated video segment for the $i$-th shot.
  - $\cos(a, b)$: Cosine similarity, the cosine of the angle between two non-zero vectors $a$ and $b$, indicating their directional similarity; calculated as $\frac{a \cdot b}{\|a\|\,\|b\|}$.
  - $E_{text}(\cdot)$: Function to extract text embedding features using the ViCLIP model.
  - $E_{video}(\cdot)$: Function to extract video embedding features for a shot using the ViCLIP model.
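A minimal sketch of this computation, with ViCLIP feature extraction abstracted into precomputed tensors:

```python
import torch
import torch.nn.functional as F

def text_alignment(text_feats, video_feats):
    """Average cosine similarity between per-shot text and video features.

    text_feats, video_feats: [num_shots, dim] tensors (e.g., extracted with ViCLIP).
    """
    sims = F.cosine_similarity(text_feats, video_feats, dim=-1)   # [num_shots]
    return sims.mean().item()

print(text_alignment(torch.randn(4, 512), torch.randn(4, 512)))
```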
5.2.2. Inter-Shot Consistency
This metric assesses the coherence and consistency of elements (semantic, subject, scene) across different shots within a multi-shot video.
- Conceptual Definition: Measures how consistent the visual elements (overall semantics, specific subjects, and background scenes) remain when transitioning between different shots in the same multi-shot video. High consistency is crucial for narrative coherence.
- Mathematical Formula:
  - Semantic Consistency: $ \text{Semantic Consistency} = \cos\!\left( E_{video}(V_{global}), \ \frac{1}{N_{shots}} \sum_{i=1}^{N_{shots}} E_{video}(V_i) \right) $
    Note: the paper only states that it calculates "holistic semantic similarity between ViCLIP shot features"; the formula above (comparing a holistic video feature against the average of per-shot features) is one plausible interpretation, and a pairwise average over shot-feature similarities would be an equally valid reading.
  - Subject Consistency: $ \text{Subject Consistency} = \frac{1}{N_{subjects} \cdot N_{pairs}} \sum_{s=1}^{N_{subjects}} \sum_{(f_a, f_b) \in \text{pairs}_s} \cos\!\left( E_{dino}(\text{Subject}_{s,f_a}), E_{dino}(\text{Subject}_{s,f_b}) \right) $
  - Scene Consistency: $ \text{Scene Consistency} = \frac{1}{N_{scenes} \cdot N_{pairs}} \sum_{s=1}^{N_{scenes}} \sum_{(f_a, f_b) \in \text{pairs}_s} \cos\!\left( E_{dino}(\text{Background}_{s,f_a}), E_{dino}(\text{Background}_{s,f_b}) \right) $
- Symbol Explanation:
  - $V_{global}$: Holistic representation of the entire multi-shot video (e.g., mean of all shot features).
  - $V_i$: The $i$-th shot of the generated video.
  - YOLOv11 [23] and SAM [25]: Used to detect and crop subjects and backgrounds from keyframes (first, middle, and last frames) for subject and scene consistency.
  - DINOv2 [35]: A powerful self-supervised vision transformer used to extract robust visual features for comparing cropped subjects and backgrounds.
  - $N_{subjects}$: Total number of unique subjects in the video.
  - $N_{scenes}$: Total number of unique scenes in the video.
  - $N_{pairs}$: Number of unique pairs of keyframes used for comparison (e.g., across shots for a specific subject/scene).
  - $\text{Subject}_{s,f}$: Cropped image of subject $s$ in keyframe $f$.
  - $\text{Background}_{s,f}$: Cropped image of the background in keyframe $f$ for scene $s$.
  - $E_{dino}(\cdot)$: Function to extract visual embedding features using the DINOv2 model.
5.2.3. Transition Deviation
- Conceptual Definition: Measures the accuracy of shot transition detection in the generated videos compared to the desired (ground-truth) transition timestamps. A lower deviation indicates more precise and controllable transitions.
- Mathematical Formula: $ \text{Transition Deviation} = \frac{1}{N_{transitions}} \sum_{k=1}^{N_{transitions}} |\text{DetectedFrame}_k - \text{GroundTruthFrame}_k| $
- Symbol Explanation:
  - TransNet V2 [43]: Used to detect transitions in the generated videos.
  - $N_{transitions}$: The number of shot transitions.
  - $\text{DetectedFrame}_k$: The frame number where TransNet V2 detects the $k$-th transition.
  - $\text{GroundTruthFrame}_k$: The intended (ground-truth) frame number for the $k$-th transition as specified by the user or dataset.
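A minimal sketch of this metric (transition detection with TransNet V2 is abstracted away; the frame lists are illustrative):

```python
def transition_deviation(detected_frames, target_frames):
    """Mean absolute difference (in frames) between detected and intended transitions."""
    assert len(detected_frames) == len(target_frames)
    return sum(abs(d - t) for d, t in zip(detected_frames, target_frames)) / len(target_frames)

# Three transitions detected near the frames 45, 120, 200 requested by the user.
print(transition_deviation([46, 121, 198], [45, 120, 200]))   # 1.333...
```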
5.2.4. Narrative Coherence
- Conceptual Definition: Evaluates the overall logical flow and storytelling quality of the multi-shot videos. This is a high-level, subjective assessment of whether the sequence of shots makes sense narratively.
- Measurement Method: Gemini-2.5 [10] is employed as an automated evaluation metric. It analyzes 20 proportionally sampled frames (at least one per shot) from the multi-shot video, along with the hierarchical captions. Gemini-2.5 is instructed to evaluate four core dimensions based on cinematic narrative logic:
  - Scene Consistency: Verifies background, lighting, and atmosphere stability across transitions.
  - Subject Consistency: Scrutinizes identity features and appearance attributes of core objects across different viewpoints.
  - Action Coherence: Evaluates the temporal logic of dynamic behaviors, ensuring actions are reasonable continuations.
  - Spatial Consistency: Examines whether the topological structure of relative positional relationships between subjects remains constant.
  The model outputs a "True" or "False" verdict for each dimension; these verdicts are then aggregated (e.g., averaged) to quantify narrative coherence. No direct mathematical formula is provided by the paper, as it relies on a large language model's reasoning.
The following figure (Figure 9 from the original paper's appendix) shows the prompt template used for the Narrative Coherence Metric based on Gemini-2.5.

5.2.5. Reference Injection Consistency
This metric evaluates how accurately the generated subjects and backgrounds match the provided reference images.
- Conceptual Definition: Measures the visual similarity between the subjects/backgrounds generated in the video and the specific reference images provided by the user.
- Measurement Method:
  - Subjects and backgrounds are detected and cropped from the generated video frames (e.g., keyframes).
  - DINOv2 [35] is used to extract features from these cropped regions and from the original reference images.
  - The cosine similarity between the generated features and the reference features is calculated.
- Mathematical Formula:
$ \text{Reference Subject Consistency} = \frac{1}{N_{subjects}} \sum_{s=1}^{N_{subjects}} \cos\!\left( E_{dino}(\text{GeneratedSubject}_s), E_{dino}(\text{ReferenceSubject}_s) \right) $
$ \text{Reference Background Consistency} = \frac{1}{N_{backgrounds}} \sum_{s=1}^{N_{backgrounds}} \cos\!\left( E_{dino}(\text{GeneratedBackground}_s), E_{dino}(\text{ReferenceBackground}_s) \right) $
- Symbol Explanation:
  - $N_{subjects}$: Number of subjects for which reference images were provided.
  - $N_{backgrounds}$: Number of backgrounds for which reference images were provided.
  - $\text{GeneratedSubject}_s$: Cropped image of the generated subject $s$ in the video.
  - $\text{ReferenceSubject}_s$: The original reference image for subject $s$.
  - $\text{GeneratedBackground}_s$: Cropped image of the generated background $s$ in the video.
  - $\text{ReferenceBackground}_s$: The original reference image for background $s$.
5.2.6. Grounding
- Conceptual Definition: Measures the spatiotemporal-grounded accuracy of reference injection by evaluating how precisely generated objects appear within their specified bounding box regions.
- Mathematical Formula: It calculates the mean Intersection over Union (mIoU) across keyframes:
$ \text{IoU}(A, B) = \frac{\text{Area}(A \cap B)}{\text{Area}(A \cup B)} $
$ \text{Grounding (mIoU)} = \frac{1}{N_{objects} \cdot N_{keyframes}} \sum_{o=1}^{N_{objects}} \sum_{k=1}^{N_{keyframes}} \text{IoU}(\text{PredictedBBox}_{o,k}, \text{GroundTruthBBox}_{o,k}) $
- Symbol Explanation:
  - $\text{IoU}(A, B)$: Intersection over Union, a measure of overlap between two bounding boxes $A$ and $B$.
  - $\text{Area}(A \cap B)$: The intersection area of bounding boxes $A$ and $B$.
  - $\text{Area}(A \cup B)$: The union area of bounding boxes $A$ and $B$.
  - $N_{objects}$: Number of objects (subjects or backgrounds) with specified bounding boxes.
  - $N_{keyframes}$: Number of keyframes evaluated.
  - $\text{PredictedBBox}_{o,k}$: The bounding box of generated object $o$ detected in keyframe $k$ (e.g., by YOLOv11).
  - $\text{GroundTruthBBox}_{o,k}$: The specified (ground-truth) bounding box for object $o$ in keyframe $k$.
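A minimal sketch of the IoU and mIoU computation for boxes given as (x1, y1, x2, y2) (object detection on generated frames is abstracted away):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """mIoU over matched (predicted, ground-truth) box pairs across keyframes."""
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 0.142857...
```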
5.2.7. Aesthetic Score
- Conceptual Definition: Measures the overall aesthetic quality and visual appeal of the generated multi-shot videos. This is typically an average human-perceived score or a score predicted by a model trained on human aesthetic judgments.
- Measurement Method: The paper mentions Aesthetic Score [40]. This typically refers to a score produced by models trained on large datasets like LAION-5B [40] that contain aesthetic ratings, predicting how aesthetically pleasing an image or video is. The paper does not provide a specific formula; it is implicitly a model output.
- Symbol Explanation: No specific symbols beyond the score itself.
5.3. Baselines
MultiShotMaster is compared against both multi-shot and single-shot reference-to-video methods. All competing baselines are based on Wan2.1-T2V1.3B [45], while MultiShotMaster is based on an approximately 1B-parameter model running at a lower resolution (noted as a limitation).
-
Multi-Shot Video Generation Methods:
CineTrans [57]: The latest open-source multi-shot narrative method. It manipulatesattention scoresusing a mask matrix to weaken correlations across different shots, aiming for cinematic transitions. It uses a global caption for scene/camera transitions and per-shot captions.EchoShot [46]: Focuses onidentity-consistent multi-shot portrait videosand also designs aRoPE-based shot transitionmechanism, but its primary goal is portrait consistency rather than general narrative coherence.
- Single-Shot Reference-to-Video Methods (adapted for multi-shot comparison): These methods are designed for single shots and are used to generate multiple independent single-shot videos from a story, each with individual text prompts. This highlights the challenge of inter-shot consistency when shots are generated in isolation.
  - Phantom [29]: A subject-consistent video generation method via cross-modal alignment.
  - VACE [21]: An all-in-one video creation and editing framework that supports multi-reference video generation.
5.4. Implementation Details
- Base Model: Pretrained single-shot T2V model with a parameter count in the billions.
- Resolution: Lower than the baselines' resolution (noted as a limitation).
- Video Characteristics:
- Narrative videos containing 77-308 frames (5-20 seconds) at 15 fps.
- Each video comprises 1-5 shots.
- VAE Encoding/Decoding: Each shot is encoded separately via a 3D VAE [24]. A sliding window strategy is used for shots longer than 77 frames to maintain alignment between pixel space and latent space.
- Training Hardware: 32 GPUs.
- Hyperparameters:
- Learning Rate: .
- Batch Size: 1.
- Angular Phase Shift Factor for Multi-Shot Narrative RoPE: 0.5 (default); a minimal sketch of applying such a phase shift to temporal RoPE is given after this list.
- Inference Settings:
- Classifier-Free Guidance Scale [18]: 7.5.
- DDIM [42] Steps: 50.
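The exact formulation of Multi-Shot Narrative RoPE is not reproduced in this summary, so the sketch below only illustrates the general idea as described: temporal rotary angles advance continuously within a shot, and an extra angular phase shift (the default factor 0.5 from the hyperparameters above) is added at every shot boundary so transitions are explicit while narrative order is preserved. The 1D simplification and function names are assumptions.

```python
# Illustrative 1D sketch of a shot-aware temporal RoPE with a per-shot phase shift.
import torch

def narrative_rope_angles(shot_lengths, dim=64, base=10000.0, shift_factor=0.5):
    """Return per-frame rotary angles (num_frames, dim/2) with a phase shift per shot."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = []
    for shot_idx, length in enumerate(shot_lengths):
        # Continuous temporal positions, plus a cumulative offset at each shot transition.
        start = sum(shot_lengths[:shot_idx])
        positions = torch.arange(start, start + length).float() + shift_factor * shot_idx
        angles.append(positions[:, None] * inv_freq[None, :])
    return torch.cat(angles, dim=0)

# Example: three shots of 77, 77, and 154 frames.
theta = narrative_rope_angles([77, 77, 154])
cos, sin = theta.cos(), theta.sin()  # applied to query/key pairs as in standard RoPE
```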
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the superior performance and outstanding controllability of MultiShotMaster compared to baseline methods in multi-shot video generation.
The following figure (Figure 4 from the original paper) presents two sets of comparisons: multi-shot text-to-video generation and multi-shot reference-to-video generation. The simplified prompts for each shot are shown in the subtitles.

Qualitative Comparison (Figure 4 Analysis):
- Upper Part (Multi-Shot Text-to-Video):
  - CineTrans [57] (Row 1) shows limited variation in camera positioning and fails to preserve character identity consistency. For example, the car in Shot 2 and Shot 3 (if present) might not match. This is attributed to its attention score manipulation, which impedes the original token interactions.
  - EchoShot [46] (Row 2), while designed for identity consistency in portrait videos, shows limitations in broader narrative details, such as inconsistent clothing colors for the same character across shots.
  - Ours (w/o Ref) (Row 3) demonstrates text-driven cross-shot subject consistency and scene consistency. For example, the vehicle roof maintains a consistent color between Shot 2 and Shot 3, even though it occupies a small area in Shot 3.
- Lower Part (Multi-Shot Reference-to-Video):
  - VACE [21] (Row 4) and Phantom [29] (Row 5) are single-shot methods, so when used for multi-shot generation by multiple independent inferences, they fail to maintain inter-shot subject consistency (e.g., the woman wears different clothing in Shot 1 and Shot 3). They also struggle to fully preserve user-provided background reference images.
  - Ours (w/ Ref) (Row 6) achieves satisfactory reference-driven subject consistency and scene consistency. It further supports grounding signals to control subject injection into specified regions and background injection into specified shots, demonstrating superior control.
- Hierarchical Captioning: MultiShotMaster's user-friendly hierarchical caption structure (global caption for subject appearance, indexed nouns in per-shot captions) is more convenient than baselines that require repeating character descriptions in every shot caption.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Method | Text Align.↑ | Inter-Shot Consistency↑ (Semantic) | Inter-Shot Consistency↑ (Subject) | Inter-Shot Consistency↑ (Scene) | Transition Deviation↓ | Narrative Coherence↑ | Ref. Consistency↑ (Subject) | Ref. Consistency↑ (Background) | Grounding↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CineTrans [57] | 0.174 | 0.683 | 0.437 | 0.389 | 5.27 | 0.496 | × | × | × |
| EchoShot [46] | 0.183 | 0.617 | 0.425 | 0.346 | 3.54 | 0.213 | × | × | × |
| Ours (w/o Ref) | 0.196 | 0.697 | 0.491 | 0.447 | 1.72 | 0.695 | × | × | × |
| VACE [21] | 0.201 | 0.599 | 0.468 | 0.273 | × | 0.325 | 0.475 | 0.361 | × |
| Phantom [29] | 0.224 | 0.585 | 0.462 | 0.279 | × | 0.362 | 0.490 | 0.328 | × |
| Ours (w/ Ref) | 0.227 | 0.702 | 0.495 | 0.472 | 1.41 | 0.825 | 0.493 | 0.456 | 0.594 |
Quantitative Comparison (Table 1 Analysis):
- Multi-Shot Text-to-Video Generation:
  - CineTrans [57] scores low in Transition Deviation (5.27; higher is worse) and Text Alignment (0.174). This is attributed to its attention mask strategy weakening inter-shot correlations, which leads to unsatisfactory transitions and text adherence. Its Narrative Coherence (0.496) is also only moderate.
  - EchoShot [46] performs poorly in Narrative Coherence (0.213) because it is specialized for portrait videos rather than narrative content. Its Inter-Shot Consistency (semantic 0.617, subject 0.425, scene 0.346) is also lower than MultiShotMaster's.
  - Ours (w/o Ref) demonstrates superior Inter-Shot Consistency (semantic 0.697, subject 0.491, scene 0.447), Transition Deviation (1.72; lower is better), and Narrative Coherence (0.695). This validates the effectiveness of the proposed Multi-Shot Narrative RoPE and other framework components for text-driven multi-shot generation.
- Multi-Shot Reference-to-Video Generation:
  - VACE [21] and Phantom [29] (single-shot methods run independently per shot) naturally have no Transition Deviation scores (marked ×). Their Inter-Shot Consistency (semantic, subject, scene) and Narrative Coherence are significantly lower than MultiShotMaster's due to independent generation. They also struggle with Reference Consistency for backgrounds (0.361 and 0.328, respectively) and Narrative Coherence (0.325 and 0.362).
  - Ours (w/ Ref) achieves the best performance across almost all metrics: highest Text Alignment (0.227), Inter-Shot Consistency (semantic 0.702, subject 0.495, scene 0.472), Transition Deviation (1.41), and Narrative Coherence (0.825). Crucially, it also shows excellent Reference Consistency for subjects (0.493) and backgrounds (0.456), and introduces strong Grounding capabilities (0.594) that the baselines do not support (marked ×).

In summary, MultiShotMaster consistently outperforms baselines, especially in inter-shot consistency, narrative coherence, and transition control, while simultaneously enabling comprehensive spatiotemporal-grounded reference injection.
6.3. Ablation Studies / Parameter Analysis
The paper presents ablation studies to validate the effectiveness of key components and the training strategy.
6.3.1. Ablation Study for Network Design
The following are the results from Table 2 of the original paper:
| Method | Inter-Shot Consistency↑ (Semantic) | Inter-Shot Consistency↑ (Subject) | Inter-Shot Consistency↑ (Scene) | Transition Deviation↓ | Narrative Coherence↑ |
| --- | --- | --- | --- | --- | --- |
| w/o MS RoPE | 0.702 | 0.486 | 0.455 | 4.68 | 0.645 |
| Ours (w/o Ref) | 0.697 | 0.491 | 0.447 | 1.72 | 0.695 |
Analysis of Table 2 (Multi-Shot RoPE):
- w/o MS RoPE (without Multi-Shot Narrative RoPE): This setting relies solely on per-shot captions for shot transitions and uses continuous RoPE. It shows a significantly higher Transition Deviation (4.68 vs. 1.72 for Ours (w/o Ref)), indicating its inability to perform precise shot transitions by text prompts alone. The lack of explicit shot transitions also leads to less change between shots, which artificially inflates semantic consistency (0.702) and scene consistency (0.455) because the "shots" blend into one long continuous clip. However, Narrative Coherence (0.645) is lower due to imprecise transitions and a potentially confused narrative flow.
- Ours (w/o Ref): With Multi-Shot Narrative RoPE, the framework achieves a much lower Transition Deviation (1.72), demonstrating its effectiveness in precisely controlling shot transitions at user-specified timestamps. Its Narrative Coherence (0.695) is also superior.

The following are the results from Table 3 of the original paper:
| Method | Aesthetic Score↑ | Narrative Coherence↑ | Ref. Consistency↑ (Subject) | Ref. Consistency↑ (Scene) | Grounding↑ |
| --- | --- | --- | --- | --- | --- |
| w/o Mean | 3.84 | 0.796 | 0.482 | 0.452 | 0.557 |
| w/o Attn Mask | 3.72 | 0.787 | 0.468 | 0.414 | 0.561 |
| w/o STPA RoPE | 3.79 | 0.761 | 0.425 | 0.363 | × |
| Ours (w/ Ref) | 3.86 | 0.825 | 0.493 | 0.456 | 0.594 |
Analysis of Table 3 (Reference Injection):
- w/o Mean (without averaging multiple subject token copies): This setting can cause information loss from the subject copies, resulting in a slightly suboptimal Aesthetic Score (3.84), lower Narrative Coherence (0.796), and lower Subject Reference Consistency (0.482).
- w/o Attn Mask (without the Multi-Shot & Multi-Reference Attention Mask): The lack of an attention mask leads to excessively long contexts and unnecessary interactions between irrelevant in-context tokens. This results in a weaker Aesthetic Score (3.72) and lower Reference Consistency (subject 0.468, scene 0.414), as content leakage degrades overall quality and control. A hedged sketch of such a mask is given after this list.
- w/o STPA RoPE (without Spatiotemporal Position-Aware RoPE): This setting directly concatenates reference tokens along the temporal dimension and applies a generic RoPE (t=0, h, w) to each reference, without spatiotemporal grounding, relying only on text prompts for positioning. Consequently, it shows significantly poorer Reference Consistency (subject 0.425, scene 0.363) and lacks Grounding capability (marked ×). Narrative Coherence (0.761) is also notably lower.
- Ours (w/ Ref): The full model achieves the best performance across all metrics, including Aesthetic Score (3.86), Narrative Coherence (0.825), Reference Consistency (subject 0.493, scene 0.456), and Grounding (0.594), validating the effectiveness of all proposed designs for spatiotemporal-grounded reference injection.
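The paper's exact masking rules are not reproduced in this summary; the sketch below shows one plausible construction of a multi-shot & multi-reference attention mask under the stated goal (video tokens interact freely, while each reference token only interacts with the video tokens of the shot it is injected into). The function name and token layout are assumptions.

```python
# Hedged sketch of a block-structured boolean attention mask for multi-shot video
# tokens followed by reference tokens, each reference assigned to one shot.
import torch

def build_attention_mask(shot_token_counts, ref_assignments):
    """shot_token_counts: tokens per shot; ref_assignments: list of (num_ref_tokens, shot_idx)."""
    num_video = sum(shot_token_counts)
    num_ref = sum(n for n, _ in ref_assignments)
    mask = torch.zeros(num_video + num_ref, num_video + num_ref, dtype=torch.bool)

    mask[:num_video, :num_video] = True  # video tokens attend across all shots

    # Precompute the token range of each shot.
    shot_ranges, start = [], 0
    for count in shot_token_counts:
        shot_ranges.append((start, start + count))
        start += count

    ref_start = num_video
    for num_tokens, shot_idx in ref_assignments:
        ref_end = ref_start + num_tokens
        s, e = shot_ranges[shot_idx]
        mask[ref_start:ref_end, s:e] = True                    # reference sees its target shot
        mask[s:e, ref_start:ref_end] = True                    # and vice versa
        mask[ref_start:ref_end, ref_start:ref_end] = True      # a reference attends to itself
        ref_start = ref_end
    return mask

# Example: two shots of 8 tokens each, one reference (4 tokens) injected into shot 1.
m = build_attention_mask([8, 8], [(4, 1)])
```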
6.3.2. Ablation Study for Training Paradigm
The following are the results from Table 4 of the original paper:
| Training Paradigm | Text Align.↑ | Inter-Shot Consistency↑ (Semantic) | Inter-Shot Consistency↑ (Subject) | Inter-Shot Consistency↑ (Scene) | Ref. Consistency↑ (Subject) | Ref. Consistency↑ (Background) | Grounding↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I: Multi-Shot + Ref. Injection | 0.211 | 0.671 | 0.464 | 0.415 | 0.454 | 0.426 | 0.477 |
| I: Multi-Shot; II: Multi-Shot + Ref. Injection | 0.219 | 0.695 | 0.481 | 0.433 | 0.472 | 0.451 | 0.578 |
| I: Ref. Injection; II: Multi-Shot + Ref. Injection | 0.222 | 0.692 | 0.484 | 0.437 | 0.485 | 0.454 | 0.583 |
| I: Ref. Injection; II: Multi-Shot + Ref. Injection; III: Multi-Shot + Subject-Focused Ref. Injection | 0.227 | 0.702 | 0.495 | 0.472 | 0.493 | 0.456 | 0.594 |
Analysis of Table 4 (Training Paradigm):
- I: Multi-Shot + Ref. Injection (Unified Training): Training both the multi-shot and reference-to-video tasks simultaneously. This approach shows inadequate performance across metrics (e.g., Text Alignment 0.211, Subject Consistency 0.464, Grounding 0.477), because the diffusion loss, focused on global consistency, struggles to optimize both tasks concurrently, especially learning diverse subjects.
- I: Multi-Shot; II: Multi-Shot + Ref. Injection (Two-stage, multi-shot first): First learns multi-shot text-to-video generation, then multi-shot reference-to-video generation, both using the multi-shot & multi-reference data. This improves performance over unified training (e.g., Text Alignment 0.219, Grounding 0.578). However, Subject Consistency (0.481) is slightly lower due to insufficient exposure to diverse subjects when the multi-shot data itself is limited in subject variety.
- I: Ref. Injection; II: Multi-Shot + Ref. Injection (Two-stage, reference first): This is the adopted strategy. It first trains spatiotemporal-grounded reference injection on 300k single-shot data, then learns both tasks on the multi-shot & multi-reference data. This yields better results on most metrics (e.g., Text Alignment 0.222, Subject Consistency 0.484, Grounding 0.583); learning reference injection on a large single-shot dataset first provides a strong foundation for handling diverse subjects.
- I: Ref. Injection; II: Multi-Shot + Ref. Injection; III: Multi-Shot + Subject-Focused Ref. Injection (Three-stage, with post-training): Adding the subject-focused post-training (assigning higher loss weight to subject regions) further boosts performance across all metrics. This final stage guides the model to prioritize subjects requiring higher consistency and to better comprehend cross-shot subject variations, leading to the best overall results (e.g., Text Alignment 0.227, Subject Consistency 0.495, Grounding 0.594).

The ablation studies clearly validate the individual contributions of Multi-Shot Narrative RoPE, Spatiotemporal Position-Aware RoPE, the attention mask, and the carefully designed three-stage training paradigm, all of which contribute to the overall superior performance and controllability of MultiShotMaster.
7. Conclusion & Reflections
7.1. Conclusion Summary
MultiShotMaster introduces a novel, highly controllable framework for generating narrative multi-shot videos, directly addressing the limitations of existing single-shot and less-controllable multi-shot techniques. The paper's core innovation lies in extending a pretrained single-shot Text-to-Video model through two critical enhancements to Rotary Position Embedding (RoPE):
- Multi-Shot Narrative RoPE: Enables the explicit recognition of shot boundaries and controllable transitions by applying angular phase shifts, preserving narrative temporal order while allowing flexible shot arrangements.
- Spatiotemporal Position-Aware RoPE: Facilitates precise spatiotemporal-grounded reference injection by incorporating reference tokens and grounding signals directly into RoPE, enabling customized subjects with motion control and background-driven scenes.

In addition to these architectural innovations, MultiShotMaster establishes an automated multi-shot & multi-reference data curation pipeline. This pipeline effectively overcomes data scarcity by extracting complex video, caption, grounding, and reference-image annotations from long internet videos. The framework leverages these intrinsic architectural modifications and a carefully designed three-stage training paradigm to integrate text prompts, subjects, grounding signals, and backgrounds into a versatile system for flexible multi-shot video generation. Extensive experiments qualitatively and quantitatively demonstrate its superior performance and outstanding controllability compared to state-of-the-art baselines.
7.2. Limitations & Future Work
The authors acknowledge several key limitations that require future research:
- Generation Quality and Resolution: The current implementation is based on a pretrained single-shot T2V model with a parameter count in the billions, at a resolution lower than more recent open-source models such as the WAN family [45], which operate at higher resolutions and likely higher quality. Future work will involve implementing MultiShotMaster on these larger, higher-resolution models (e.g., WAN 2.1/2.2) to improve overall generation quality.
- Motion Coupling Issue: The current framework explicitly controls subject motion, while camera position is primarily controlled by text prompts. This can lead to a motion coupling issue, where the generated video aligns with the grounding signals, but the alignment is achieved through a coupled movement of both the camera and the object.

The following figure (Figure 5 from the original paper) visualizes this limitation, where the subject's motion and camera motion can be inadvertently intertwined.

Figure 5 caption (translated): The image is a schematic comparing two groups of generation results, with poor generations on top and good generations on the bottom. The background image and subject image are shown on the left, illustrating the differences in the generated frames and highlighting the model's behavior under different control conditions.

Future work aims to address this coupling, likely by introducing more explicit and disentangled control over camera motion independent of subject motion.
7.3. Personal Insights & Critique
MultiShotMaster represents a significant step forward in the complex domain of narrative video generation. The paper's strength lies in its elegant approach to a challenging problem: instead of introducing complex new modules or vastly larger models, it cleverly leverages and extends existing mechanisms like RoPE to achieve sophisticated control.
Inspirations and Applications:
- Intrinsic Architectural Control: The idea of embedding control signals directly into positional encodings (RoPE) is highly inspiring. It suggests that existing architectural components might hold untapped potential for fine-grained control if rethought for new tasks, and could be applied to other generative tasks where sequential or spatial context is crucial, such as long-form audio generation or complex 3D scene synthesis.
- Narrative Storytelling Tools: The framework moves AI video generation closer to becoming a practical tool for filmmakers and content creators. The hierarchical captioning, subject control, and background customization are features directly addressing real-world production needs. This approach could be adapted for interactive storytelling platforms or personalized content-creation engines.
- Automated Data Curation: The sophisticated automated data pipeline is a crucial contribution. Data scarcity is a perpetual bottleneck in AI research, especially for complex, multi-modal tasks. The methodology for extracting multi-shot videos, consistent captions, and grounding signals is a valuable blueprint for other researchers facing similar data challenges.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Generalizability of RoPE Phase Shift: While the fixed angular phase shift factor (0.5 by default) for Multi-Shot Narrative RoPE is stated as requiring no additional trainable parameters, its optimality across all possible narrative structures, shot lengths, and content types might be an implicit assumption. Exploring whether this factor could be dynamically determined or even softly learned (e.g., via a small hypernetwork) based on narrative complexity could be an interesting avenue.
- Robustness of Automated Data Curation: While effective, automated data pipelines, especially those relying on large language models like Gemini-2.5 for complex reasoning (such as cross-shot subject grouping or narrative coherence evaluation), can be prone to hallucination or subtle errors. The quality of the curated 235k dataset is paramount, and a manual auditing process, even for a subset, would strengthen confidence.
- Disentangled Control Limitations: The acknowledged motion coupling issue is a significant practical limitation. Achieving truly independent control over camera movement, subject movement, and potentially other dynamic elements (e.g., lighting changes, object interactions) is the next frontier. This might require more explicit disentanglement mechanisms beyond positional embeddings, perhaps involving dedicated latent spaces or control signals for each aspect.
- Computational Cost at Scale: While the RoPE modifications themselves are efficient, the overall framework still operates on a Diffusion Transformer. Scaling to even higher resolutions, longer videos, or greater shot counts will inevitably push computational boundaries. Research into more efficient attention mechanisms or distillation techniques could be valuable.
- Subjectivity in Narrative Coherence Evaluation: While using Gemini-2.5 for narrative coherence evaluation is innovative, it relies on the LLM's understanding of "cinematic narrative logic." This is inherently subjective, and the fidelity of the LLM's judgment to human perception might warrant further validation (e.g., with human evaluations on a small benchmark).

Overall, MultiShotMaster is an ambitious and impactful work, pushing the boundaries of what is possible in controllable video generation by thoughtfully adapting core architectural components. Its innovations provide a strong foundation for future research in creating sophisticated, AI-driven cinematic content.