
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework


TL;DR Summary

The MultiShotMaster framework addresses current limitations in multi-shot narrative video generation by integrating two novel RoPE variants, enabling flexible shot arrangement and coherent storytelling, while an automated data annotation pipeline enhances controllability and output quality.

Abstract

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.


1. Bibliographic Information

1.1. Title

The central topic of this paper is "MultiShotMaster: A Controllable Multi-Shot Video Generation Framework." It focuses on developing a system capable of generating video sequences composed of multiple distinct shots, each with flexible arrangements, narrative coherence, and enhanced controllability beyond simple text prompts.

1.2. Authors

The authors and their affiliations are:

  • Qinghe Wang, Baolu Li, Huchuan Lu, Xu Jia (Dalian University of Technology)

  • Xiaoyu Shi, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai (Kling Team, Kuaishou Technology)

  • Weikang Bian (The Chinese University of Hong Kong)

    This indicates a collaboration between academic institutions and an industry research team, suggesting a blend of theoretical rigor and practical application focus.

1.3. Journal/Conference

The paper is published as an arXiv preprint, indicating it has not yet undergone formal peer review and publication in a journal or conference proceedings at the time of its posting. While arXiv is a highly respected platform for disseminating research quickly in fields like AI, it is not a peer-reviewed venue itself.

1.4. Publication Year

The paper was published on arXiv on December 2, 2025 (2025-12-02, UTC).

1.5. Abstract

The paper addresses the challenge that current video generation models, while proficient at single-shot clips, struggle with narrative multi-shot videos due to requirements for flexible shot arrangement, coherent storytelling, and advanced controllability. To overcome this, the authors propose MultiShotMaster, a framework for highly controllable multi-shot video generation.

The core methodology involves extending a pretrained single-shot model by integrating two novel variants of Rotary Position Embedding (RoPE):

  1. Multi-Shot Narrative RoPE: This variant applies explicit phase shifts at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order.

  2. Spatiotemporal Position-Aware RoPE: Designed to incorporate reference tokens and grounding signals, facilitating spatiotemporal-grounded reference injection.

    To tackle data scarcity, the authors establish an automated data annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. The framework leverages intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven customized scenes. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the framework's superior performance and outstanding controllability.

The official source link is https://arxiv.org/abs/2512.03041, and the PDF is available at https://arxiv.org/pdf/2512.03041v1.pdf. This paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the significant gap between current video generation techniques and the practical requirements of real-world video content creation, especially for narrative multi-shot videos.

While recent advancements in single-shot video generation, primarily powered by diffusion transformers (DiTs), have achieved high-quality results from text prompts and incorporated various control signals (e.g., reference images, object motion), they fall short in producing videos with coherent narratives composed of multiple shots. Real-world films and television series rely on multi-shot video clips to convey stories, involving holistic scenes, character interactions, and micro-expressions across different camera angles and durations.

The specific challenges and gaps in prior research include:

  • Narrative Coherence: Maintaining a consistent storyline, character identities, and scene layouts across multiple shots.

  • Flexible Shot Arrangement: Current methods often struggle with variable shot counts and durations, or fixed transition points.

  • Advanced Controllability: Beyond basic text prompts, there's a need for director-level control over aspects like character appearance, motion, and scene customization.

  • Computational Cost: Adapting single-shot control methods to multi-shot settings can lead to larger networks and higher costs.

  • Data Scarcity: A lack of suitable large-scale multi-shot, multi-reference datasets for training.

    The paper's entry point and innovative idea stem from leveraging the inherent properties of Rotary Position Embedding (RoPE) within the DiT architecture. The authors observe that standard continuous RoPE might confuse intra-shot frames with inter-shot frames across boundaries. This insight leads to the idea of manipulating RoPE to explicitly mark shot transitions and integrate spatiotemporal control signals for reference injection.

2.2. Main Contributions / Findings

The primary contributions of MultiShotMaster are:

  1. Novel RoPE Variants for Controllability:

    • Multi-Shot Narrative RoPE: Introduces explicit angular phase shifts at shot transitions, enabling the model to recognize shot boundaries and maintain narrative order, allowing flexible shot counts and durations without additional trainable parameters.
    • Spatiotemporal Position-Aware RoPE: Integrates reference tokens (for subjects and backgrounds) with spatiotemporal grounding signals (e.g., bounding boxes) directly into the RoPE mechanism. This establishes strong correlations, allowing precise control over where and when references appear, and enabling subject motion control across frames.
  2. Multi-Shot & Multi-Reference Attention Mask: A novel attention masking strategy that constrains information flow, ensuring each shot primarily accesses its relevant reference tokens while maintaining global consistency across video tokens.

  3. Automated Data Curation Pipeline: An automatic MultiShot & Multi-Reference Data Curation pipeline to address data scarcity. It extracts multi-shot videos, hierarchical captions (global and per-shot), cross-shot grounding signals, and reference images from long internet videos.

  4. Comprehensive Controllability Framework: The proposed framework supports a wide range of control signals including text prompts, reference subjects with motion control, and background images for scene customization, all within a flexible multi-shot generation setting.

  5. Superior Performance: Extensive experiments demonstrate that MultiShotMaster achieves superior performance and outstanding controllability compared to existing methods across various quantitative metrics (e.g., Text Alignment, Inter-Shot Consistency, Transition Deviation, Narrative Coherence, Reference Consistency, Grounding) and qualitative evaluations.

    The key conclusions are that by strategically modifying RoPE and developing an effective data curation pipeline, MultiShotMaster successfully enables highly controllable and narratively coherent multi-shot video generation, bridging a critical gap towards director-level AI-powered content creation. The framework's ability to handle flexible shot arrangements and integrate diverse condition signals makes it a versatile tool for customized video narratives.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the MultiShotMaster paper, a reader should be familiar with the following foundational concepts:

3.1.1. Diffusion Models

Diffusion models are a class of generative models that learn to reverse a gradual diffusion process. They work by progressively adding noise to data (e.g., images, videos) until it becomes pure noise, and then learning to reverse this process, step-by-step, to generate new data from noise.

  • Forward Diffusion Process: Gradually adds Gaussian noise to an input data point $x_0$ over $T$ timesteps, producing $x_1, x_2, \dots, x_T$, where $x_T$ is pure noise.
  • Reverse Diffusion Process: A neural network (often a U-Net or Transformer) is trained to predict the noise added at each step, or directly predict the "velocity" to return to the clean data. By iteratively removing predicted noise from a noisy sample, it generates a clean data sample.

3.1.2. Latent Diffusion Models (LDMs)

Latent Diffusion Models improve efficiency by performing the diffusion process in a compressed latent space rather than the high-dimensional pixel space.

  • Variational Auto-Encoder (VAE): An Auto-Encoder is a neural network trained to encode input data into a lower-dimensional representation (latent code) and then decode it back to reconstruct the original data. A Variational Auto-Encoder (VAE) introduces a probabilistic twist, learning a distribution over the latent space, which allows for sampling new, coherent latent codes for generation. In LDMs, a VAE's encoder compresses images/videos into a compact latent representation, and its decoder converts generated latent codes back into pixels.
  • Diffusion in Latent Space: The diffusion process (noise addition and removal) operates on these smaller latent codes, significantly reducing computational cost and memory requirements while maintaining high-quality generation.

3.1.3. Diffusion Transformers (DiTs)

Diffusion Transformers replace the traditional U-Net architecture, commonly used in diffusion models, with a Transformer architecture.

  • Transformer Architecture: Originally designed for sequence-to-sequence tasks (like natural language processing), Transformers rely heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence.
  • Applying Transformers to Diffusion: In DiTs, images or video frames are first broken down into patches, which are then linearly projected into token embeddings. These tokens, along with positional encodings and timestep embeddings, are fed into a series of Transformer blocks. Each Transformer block typically contains self-attention, cross-attention (e.g., for text conditioning), and feed-forward networks (FFN). DiTs have shown excellent scalability and performance in generative tasks.
  • 3D DiT: For video generation, DiTs are extended to 3D DiTs, which incorporate spatiotemporal self-attention to process information across both spatial (height, width) and temporal (frames) dimensions.

3.1.4. Rotary Position Embedding (RoPE)

Rotary Position Embedding (RoPE) is a type of positional encoding used in Transformers. Unlike absolute positional embeddings (which directly add fixed vectors to tokens) or relative positional embeddings (which compute pairwise distances), RoPE encodes absolute position information with rotation matrices and naturally incorporates relative position information in self-attention calculations.

  • Mechanism: RoPE applies a rotation to the query and key vectors based on their position. When computing attention scores (dot products of query and key), the relative position between two tokens is naturally encoded by the angle between their rotated vectors.
  • Key Property: RoPE has a crucial property: tokens with closer spatiotemporal distances receive higher attention weights. This makes it effective for capturing local correlations.
  • Formula Intuition: For a token at position $m$, its embedding $x_m$ is transformed by a rotation matrix $R_m$. The attention score between query $q_m$ and key $k_n$ then becomes $(R_m q_m)^T (R_n k_n)$, which can be shown to depend only on $m - n$, thus encoding relative position.
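
As a concrete illustration of this relative-position property, here is a minimal NumPy sketch (not taken from the paper): paired feature dimensions are rotated by position-dependent angles, and the resulting query-key dot product depends only on the offset between the two positions.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate paired feature dimensions of x by position-dependent angles (1D RoPE sketch)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # decreasing base frequencies f
    angles = pos * freqs                          # one rotation angle per feature pair
    x_complex = x[..., 0::2] + 1j * x[..., 1::2]  # view each pair as a complex number
    rotated = x_complex * np.exp(1j * angles)     # apply the rotation
    out = np.empty_like(x)
    out[..., 0::2], out[..., 1::2] = rotated.real, rotated.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The attention score depends only on the relative offset (m - n): shifting both
# positions by the same amount leaves the dot product unchanged.
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)       # positions (5, 2), offset 3
s2 = rope_rotate(q, 105) @ rope_rotate(k, 102)   # positions shifted by 100, offset still 3
print(np.isclose(s1, s2))  # True
```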

3.1.5. Rectified Flow

Rectified Flow (RF) is a method used in generative modeling that provides a direct, straight path between a data distribution and a simple noise distribution.

  • Connection to Diffusion: Unlike standard diffusion models that follow a complex, non-linear path, Rectified Flow aims to define a straight line (flow) in the latent space from noisy data ($z_1$) back to clean data ($z_0$).
  • Training Objective: The model is trained to predict the velocity vector $v_{\Theta}$ along this straight path, which is simpler to learn than the noise target in traditional diffusion. The path is defined as $z_{\tau} = (1 - \tau) z_0 + \tau \epsilon$, where $\tau \in [0, 1]$ is the timestep and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is standard Gaussian noise. The objective is to regress this velocity.
  • Advantages: Can lead to more stable training and faster sampling.
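
A hedged sketch of the velocity-regression objective described above, using the interpolation $z_{\tau} = (1 - \tau) z_0 + \tau \epsilon$; `model` is a stand-in for any denoising network and is not the paper's architecture.

```python
import torch

def rectified_flow_loss(model, z0, text_cond):
    """One velocity-regression step along the straight path z_tau = (1 - tau) * z0 + tau * eps."""
    tau = torch.rand(z0.shape[0], *([1] * (z0.dim() - 1)))  # timestep in [0, 1], broadcastable
    eps = torch.randn_like(z0)                              # z1: pure Gaussian noise
    z_tau = (1 - tau) * z0 + tau * eps                      # noised latent
    target_velocity = eps - z0                              # true velocity (z1 - z0)
    pred_velocity = model(z_tau, tau, text_cond)            # v_theta(z_tau, tau, c_text)
    return torch.mean((target_velocity - pred_velocity) ** 2)
```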

3.1.6. Classifier-Free Guidance (CFG)

Classifier-Free Guidance is a technique used to improve the quality and adherence to conditioning signals (like text prompts) in diffusion models.

  • Mechanism: During inference, the model runs two denoising passes: one conditioned on the input (e.g., text prompt) and one unconditioned (e.g., using an empty text prompt). The prediction from the unconditioned model is then subtracted from the conditioned prediction, and the difference is scaled by a guidance factor (e.g., 7.5 in this paper) and added back to the conditioned prediction.
  • Effect: This process steers the generation more strongly towards the provided condition, often producing higher quality and more relevant outputs, though it can sometimes reduce diversity.
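
A minimal sketch of the guidance combination described above; `model`, `text_cond`, and `null_cond` are placeholders, and the parameterization follows the wording in this section (extrapolating from the conditioned prediction), while other implementations extrapolate from the unconditional one.

```python
def cfg_velocity(model, z_tau, tau, text_cond, null_cond, guidance_scale=7.5):
    """Combine conditional and unconditional predictions for classifier-free guidance."""
    v_cond = model(z_tau, tau, text_cond)    # pass conditioned on the text prompt
    v_uncond = model(z_tau, tau, null_cond)  # pass conditioned on an empty prompt
    # Scale the (conditional - unconditional) difference and add it to the
    # conditional prediction, as described above.
    return v_cond + guidance_scale * (v_cond - v_uncond)
```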

3.1.7. DDIM (Denoising Diffusion Implicit Models)

DDIM is a faster sampling method for diffusion models.

  • Implicit Models: Unlike Denoising Diffusion Probabilistic Models (DDPMs) which require many steps to sample (due to their stochastic nature), DDIMs formulate the reverse process as a deterministic implicit model.
  • Faster Sampling: This determinism allows DDIM to generate high-quality samples with significantly fewer denoising steps (e.g., 50 steps in this paper, compared to hundreds or thousands for DDPMs), making inference much quicker.

3.2. Previous Works

The paper discusses previous works in Text-to-Video Generation and Controllable Video Generation, primarily categorizing multi-shot video generation into two paradigms.

3.2.1. Text-to-Video Generation

  • Early Methods: Initially, methods inflated pretrained text-to-image (T2I) generation models (e.g., Latent Diffusion Models like Stable Diffusion [39]) by adding temporal layers [14, 56]. These could generate short video animations.
  • Recent Advances (DiT-based): More recent approaches, including the base model MultiShotMaster builds upon, employ Diffusion Transformer (DiT) architectures [36, 45, 60, 69]. These can generate longer, higher-quality videos with detailed text descriptions [6, 12, 27].
  • Scaling Challenges: Scaling video generation beyond short, single clips is still an open problem.
    • Single-Shot Long Video Generation: Focuses on extending clip length but faces issues like error accumulation and memory loss (e.g., maintaining consistent object appearance or scene details over very long durations) [8, 19, 20, 63, 64].
    • Multi-Shot Long Video Generation: The focus of this paper, which aims for narrative coherence and inter-shot consistency [7, 15, 37, 58].

3.2.2. Multi-Shot Video Generation Paradigms

The paper identifies two main paradigms:

  1. Text-to-Keyframe Generation + Image-to-Video (I2V) Generation [9, 58, 62, 70]:
    • Process: First generates a set of visually consistent keyframes (static images) based on text prompts. Then, an Image-to-Video (I2V) model is used to generate each shot, animating from its corresponding keyframe.
    • Limitations:
      • Relies heavily on the quality of keyframe generation.
      • Limited conditional capability: Sparse keyframes cannot fully cover briefly-appearing characters or scene consistency outside the keyframes.
      • Struggles to maintain consistency for characters or objects that appear in multiple non-keyframe segments.
  2. Direct End-to-End Generation [7, 15, 22, 37, 51, 57]:
    • Process: Generates multi-shot videos directly from text, using full attention along the temporal dimension to maintain consistency.
    • Limitations:
      • Often constrained by fixed shot duration or limited shot counts.
      • May contain "boring" frames if shot durations are fixed but the narrative requires a short, sharp cut (e.g., a quick insert shot of an action).
      • Existing methods in this paradigm (like the keyframe-based one) are primarily driven only by text prompts, lacking comprehensive controllability.
    • Notable Works:
      • ShotAdapter [22]: Incorporates learnable transition tokens that interact only with shot-boundary frames to indicate transitions.
      • CineTrans [57]: Constructs a specific attention mask to weaken inter-shot correlations, enabling transitions at predefined positions.
      • Differentiation from MultiShotMaster: MultiShotMaster contrasts with ShotAdapter and CineTrans by manipulating RoPE embeddings directly to convey transition signals. This approach aims to prevent interference with the token interactions in pretrained attention and explicitly achieve shot transitions, rather than using separate learnable tokens or attention masks.

3.2.3. Controllable Video Generation

This field focuses on providing explicit and precise user control for content creation [32, 47, 48, 53, 55].

  • Diverse Control Signals: Supports various types of control, such as camera motion [13, 16], object motion [31, 33, 41, 59], and reference video/image [4, 5, 26, 30].
  • Reference-to-Video:
    • VACE [21] and Phantom [29]: Support multi-reference video generation and aim for realistic composition. They typically focus on single-shot generation.
    • Differentiation from MultiShotMaster: These methods often rely on separate adapters for reference injection and motion control. In a multi-shot setting, this could lead to larger networks and higher computational costs. MultiShotMaster proposes a unified framework that supports reference injection and motion control jointly without additional adapters, by integrating these controls directly into the RoPE mechanism.
  • Motion Control:
    • Tora [68] and Motion Prompting [13]: Control object motion through point trajectories.
    • Differentiation from MultiShotMaster: These are typically for single-shot settings. MultiShotMaster extends motion control to multi-shot scenarios by creating multiple copies of subject tokens with different spatiotemporal RoPE assignments.

3.3. Technological Evolution

The evolution in video generation has moved from:

  1. Early T2I-based inflation: Adapting Text-to-Image models by adding simple temporal layers for basic video animation. (e.g., Tune-A-Video [56], AnimateDiff [14]).

  2. Dedicated Video Diffusion Models: Developing architectures specifically for video, often U-Net based, leading to better quality and longer single clips.

  3. Diffusion Transformers (DiTs): Employing Transformer architectures (like Sora [6], Open-Sora [27, 69], WAN [45]) for improved scalability, quality, and contextual understanding in single-shot video generation.

  4. Controllability Enhancements: Integrating various control signals (e.g., ControlNet for images extended to video, reference image adapters, motion prompting) into these DiT-based models for single shots.

  5. Addressing Multi-Shot Narratives: Current frontier, where MultiShotMaster sits. Previous attempts involved keyframe-based or end-to-end approaches that still lacked comprehensive control, flexible shot arrangements, and narrative coherence over longer, multi-shot sequences.

    MultiShotMaster fits into this timeline by taking the powerful DiT-based single-shot video generation (stage 3-4) and extending it to address the complex challenges of multi-shot narrative video generation (stage 5) by innovating on positional encoding and data conditioning.

3.4. Differentiation Analysis

Compared to the main methods in related work, MultiShotMaster offers several core differences and innovations:

  • Novel RoPE-based Control:

    • Vs. CineTrans [57] (attention mask) & ShotAdapter [22] (learnable tokens): MultiShotMaster uses Multi-Shot Narrative RoPE to encode shot transitions via explicit angular phase shifts directly into the positional embeddings. This is a more intrinsic modification to the Transformer's attention mechanism compared to CineTrans's external attention mask manipulation (which might impede original token interactions) or ShotAdapter's use of separate learnable transition tokens. RoPE manipulation integrates the transition signal more seamlessly and without adding trainable parameters for this specific function.
    • Vs. VACE [21] & Phantom [29] (separate adapters for reference/motion): MultiShotMaster integrates reference injection and motion control directly into the RoPE mechanism via Spatiotemporal Position-Aware RoPE. This allows for spatiotemporal grounding of subjects and backgrounds without the need for additional adapters, which often leads to larger networks and higher computational costs in multi-shot settings.
  • Flexible Shot Arrangement: By directly manipulating RoPE for shot boundaries, MultiShotMaster enables users to flexibly configure both the number of shots and their respective durations, a capability often constrained in existing end-to-end multi-shot generation paradigms that rely on fixed shot durations or limited shot counts.

  • Comprehensive Controllability: The framework provides director-level controllability encompassing variable shot counts and durations, dedicated text descriptions for each shot (hierarchical prompts), character appearance and motion control (via reference images and grounding signals), and background-driven scene definition. This goes beyond text-only control or single-aspect control common in many prior works.

  • Automated Data Curation: The development of a novel automated pipeline for Multi-Shot & Multi-Reference Data Curation directly addresses the common problem of data scarcity for complex video generation tasks, providing essential training data that is hard to collect manually.

  • Unified Architecture: By leveraging intrinsic architectural properties (specifically RoPE and attention masks) within a pretrained DiT, the framework achieves multi-shot generation and diverse control signals without significantly altering the core model architecture or adding large numbers of new parameters for each control type.

    In essence, MultiShotMaster distinguishes itself by offering a more integrated, efficient, and flexible approach to multi-shot video generation, moving beyond single-shot limitations and rigid multi-shot paradigms through innovative RoPE modifications and a robust data pipeline.

4. Methodology

4.1. Principles

The core idea of MultiShotMaster is to extend a pretrained single-shot Text-to-Video (T2V) model into a highly controllable multi-shot generation framework. The theoretical basis and intuition behind it lie in strategically manipulating Rotary Position Embedding (RoPE)—a crucial component in Transformer architectures—to encode both shot boundary information and spatiotemporal grounding signals for reference injection.

The insights driving this approach are:

  1. RoPE's Spatiotemporal Correlation: RoPE naturally emphasizes tokens with closer spatiotemporal distances. However, in a multi-shot video, applying continuous RoPE across all frames would incorrectly imply strong temporal correlations across shot boundaries, confusing intra-shot continuity with inter-shot transitions. The principle is to break this continuity at shot transitions.

  2. RoPE for Grounding: The same property of RoPE (weighting closer tokens more) can be leveraged to establish strong connections between reference tokens (e.g., an image of a subject) and specific video tokens (e.g., a bounding box region in a frame). By applying RoPE from a specified region to a reference token, the model is encouraged to "ground" that reference within that exact spatiotemporal location.

    By modifying RoPE, the framework aims to:

  • Explicitly define shot transitions without introducing new learnable parameters or complex attention mask heuristics that might interfere with the pretrained model's capabilities.
  • Enable precise control over the spatiotemporal appearance and motion of subjects and backgrounds using reference images and grounding signals.
  • Maintain narrative coherence and inter-shot consistency through a global understanding of the video and targeted control.

4.2. Core Methodology In-depth (Layer by Layer)

MultiShotMaster builds upon a pretrained single-shot Text-to-Video (T2V) model and introduces several key innovations, primarily centered around RoPE modifications and a data curation pipeline.

4.2.1. Evolving from Single-Shot to Multi-Shot T2V

4.2.1.1. Preliminary: The Base T2V Model

The foundation of MultiShotMaster is a pretrained single-shot Text-to-Video (T2V) model with approximately 1 Billion parameters. This base model consists of three main components:

  • 3D Variational Auto-Encoder (VAE) [24]: This component is responsible for encoding input video frames into a lower-dimensional latent space and decoding generated latent representations back into pixel space. For multi-shot videos, each shot is encoded separately through the 3D VAE. The resulting video latents are then concatenated.

  • T5 Text Encoder [38]: This module processes text prompts, converting them into rich text embeddings that capture semantic information.

  • Latent Diffusion Transformer (DiT) model [36]: This is the core generative network. Each basic DiT block typically contains:

    • A 2D spatial self-attention module: Processes information within individual frames.

    • A 3D spatiotemporal self-attention module: Processes information across both spatial dimensions and the temporal dimension (frames).

    • A text cross-attention module: Allows the DiT to condition its generation on the text embeddings from the T5 encoder.

    • A feed-forward network (FFN).

      The denoising process in this DiT follows Rectified Flow [11]. A straight path is defined from clean data $z_0$ to noised data $z_{\tau}$ at a given timestep $\tau$. This path is expressed as: $ z_{\tau} = (1 - \tau) z_0 + \tau \epsilon $ where:

  • $z_{\tau}$ represents the noised data at timestep $\tau$.

  • $z_0$ represents the clean, original data.

  • $\tau$ is the timestep, ranging from 0 to 1.

  • $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is standard Gaussian noise.

    The denoising network $v_{\Theta}$ is trained to predict the velocity along this path. The training objective, often referred to as the LCM (Latent Consistency Model) objective [28], aims to regress this velocity: $ \mathcal{L}_{LCM} = \mathbb{E}_{\tau, \epsilon, z_0} \left[ || (z_1 - z_0) - v_{\Theta} (z_{\tau}, \tau, c_{text}) ||_2^2 \right] $ where:

  • $\mathcal{L}_{LCM}$ is the training loss.

  • $\mathbb{E}$ denotes the expectation over timestep $\tau$, noise $\epsilon$, and clean data $z_0$.

  • $|| \cdot ||_2^2$ is the squared L2 norm, measuring the difference between the true velocity and the predicted velocity.

  • $(z_1 - z_0)$ represents the true velocity, i.e., the vector from the clean data $z_0$ to the fully noisy data $z_1$.

  • $v_{\Theta}(z_{\tau}, \tau, c_{text})$ is the velocity predicted by the denoising network $\Theta$, given the noised latent $z_{\tau}$, timestep $\tau$, and text conditioning $c_{text}$.

4.2.1.2. Multi-Shot Narrative RoPE

A critical challenge in extending a single-shot model to multi-shot videos is that the original 3D-RoPE assigns sequential indices along the temporal dimension. This causes the model to incorrectly perceive intra-shot consecutive frames and inter-shot frames across shot boundaries as equally related, leading to confusion and hindering explicit shot transitions.

To explicitly help the model recognize shot boundaries and enable controllable transitions, MultiShotMaster introduces Multi-Shot Narrative RoPE. This mechanism breaks the continuity of RoPE at shot transitions by introducing an angular phase shift for each transition. The query (Q) of the $i$-th shot is computed as follows (and similarly for the key (K)): $ Q_i = \mathrm{RoPE}((t + i \phi) \cdot f, h \cdot f, w \cdot f) \odot \tilde{Q}_i $ where:

  • $Q_i$ is the query embedding for the $i$-th shot after applying Multi-Shot Narrative RoPE.

  • $\mathrm{RoPE}(\cdot)$ denotes the Rotary Position Embedding function.

  • $(t, h, w)$ are the spatiotemporal position indices of a token within a frame and shot; $t$ is the temporal index, and $h$, $w$ are the height and width indices, respectively.

  • $i$ is the shot index (e.g., 0 for the first shot, 1 for the second, etc.).

  • $\phi$ is the angular phase shift factor. This factor introduces a fixed, explicit rotation at each shot boundary, effectively shifting the temporal context for the subsequent shot without disrupting the relative positioning within a shot.

  • $f$ is a decreasing base-frequency vector, the characteristic RoPE component that controls the rotation frequency.

  • $\odot$ denotes the element-wise rotary transformation of the original query embeddings $\tilde{Q}_i$ via complex rotations.

    This design has several advantages:

  • It maintains the narrative shooting order of inter-shot frames.

  • It leverages RoPE's inherent rotational properties to mark shot boundaries through fixed phase shifts.

  • It requires no additional trainable parameters, making it efficient.

  • It allows users to flexibly configure both the number of shots and their respective durations.
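
To make the phase-shift scheme above concrete, here is a minimal sketch (assuming global frame counting and a scalar phase-shift factor `phi`, both illustrative choices) of how temporal RoPE indices could be assigned: indices stay contiguous within a shot and jump by the phase shift at every transition.

```python
import numpy as np

def narrative_temporal_indices(shot_lengths, phi):
    """Temporal RoPE indices with an angular phase shift added at each shot boundary.

    shot_lengths: latent frame count per shot, e.g. [16, 8, 24]
    phi: phase-shift factor (illustrative scalar); shot i is offset by i * phi
    """
    indices, t = [], 0
    for i, n_frames in enumerate(shot_lengths):
        for _ in range(n_frames):
            indices.append(t + i * phi)  # contiguous within a shot, shifted across shots
            t += 1
    return np.array(indices)

print(narrative_temporal_indices([3, 2, 2], phi=10.0))
# [ 0.  1.  2. 13. 14. 25. 26.]  -> narrative order preserved, boundaries explicitly marked
```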

4.2.1.3. Hierarchical Prompt Structure and Shot-Level Cross-Attention

To enable fine-grained control for each shot, a hierarchical prompt structure is designed:

  • Global Caption: Describes overarching subject appearances and environmental settings for the entire multi-shot video.

  • Per-Shot Captions: Provide detailed information for each individual shot, such as subject actions, specific backgrounds, and camera movements.

    For each shot, the global caption is combined with its corresponding per-shot caption. In the vanilla T2V model, text embeddings from the T5 encoder are replicated along the temporal dimension to align with the video frame sequence for text-frame cross-attention. Accordingly, MultiShotMaster replicates each shot's combined text embeddings to match the corresponding shot's frame count. This enables shot-level cross-attention, ensuring that each shot is conditioned by its specific textual description.
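
A hedged sketch of the shot-level text conditioning described above: each shot's combined (global + per-shot) text embedding is repeated along the temporal axis to match that shot's frame count. Tensor shapes and equal token counts per shot (e.g., padded T5 sequences) are assumptions for illustration.

```python
import torch

def replicate_shot_text_embeddings(shot_text_embeds, shot_lengths):
    """Frame-aligned text conditioning for shot-level cross-attention.

    shot_text_embeds: list of [n_tokens, d] tensors, one per shot
                      (global caption concatenated with that shot's caption),
                      assumed padded to the same token count.
    shot_lengths: number of latent frames in each shot.
    Returns a tensor of shape [total_frames, n_tokens, d].
    """
    per_frame = [emb.unsqueeze(0).expand(n_frames, -1, -1)
                 for emb, n_frames in zip(shot_text_embeds, shot_lengths)]
    return torch.cat(per_frame, dim=0)
```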

4.2.2. Spatiotemporal-Grounded Reference Injection

Users often need to customize video content with specific reference images (e.g., for subjects or backgrounds) and control their motion. MultiShotMaster addresses this with Spatiotemporal Position-Aware RoPE.

4.2.2.1. Mechanism for Reference Injection

  • Encoding References: Each reference image (subject or background) is individually encoded into the latent space using the 3D VAE. These reference latents (referred to as reference tokens) are then concatenated with the noised video latents.

  • Information Flow: During temporal attention, these clean reference tokens propagate visual information to the noisy video tokens, facilitating the injection of reference visual characteristics.

  • Leveraging 3D-RoPE: Inspired by 3D-RoPE's property of assigning higher attention weights to tokens with closer spatiotemporal distances, MultiShotMaster applies 3D-RoPE from specified regions to the corresponding reference tokens. This establishes strong correlations between the region-specified video tokens and the reference tokens during attention computation.

    Specifically, for a reference query $Q^{ref}$ associated with a subject's bounding box region $(x_1, y_1, x_2, y_2)$ at the $t$-th frame, the 3D-RoPE is sampled as: $ Q^{ref} = \mathrm{RoPE}((t + i \phi) \cdot f, \ h^{ref} \cdot f, \ w^{ref} \cdot f) \odot \tilde{Q}^{ref}, \quad h^{ref} = y_1 + \frac{(y_2 - y_1)}{H} \cdot j, \ j \in [0, H-1], \quad w^{ref} = x_1 + \frac{(x_2 - x_1)}{W} \cdot k, \ k \in [0, W-1] $ where:

  • $Q^{ref}$ is the query embedding for a reference token after applying Spatiotemporal Position-Aware RoPE.

  • $\tilde{Q}^{ref}$ is the original reference query embedding.

  • $(t + i \phi) \cdot f$ is the modified temporal positional encoding incorporating the Multi-Shot Narrative RoPE (the term $i \phi$ marks shot transitions).

  • $h^{ref} \cdot f$ and $w^{ref} \cdot f$ are the spatial positional encodings.

  • $x_1, y_1, x_2, y_2$ define the bounding box coordinates (top-left $(x_1, y_1)$ and bottom-right $(x_2, y_2)$) of the specified region for the reference.

  • $H$, $W$ are the overall spatial dimensions (height, width) of the video frame or latent representation.

  • $j \in [0, H-1]$ and $k \in [0, W-1]$ are indices used to sample the RoPE within the specified bounding box region, effectively mapping the reference tokens to that region.

    This method allows precise control over where a subject appears in space and time within a video.
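
A minimal sketch of the bounding-box index sampling above, assuming the reference latent is an H×W token grid in the same latent resolution as the video; the helper name and exact index layout are illustrative, not the paper's implementation.

```python
import numpy as np

def reference_rope_indices(box, frame_t, shot_idx, phi, H, W):
    """Spatiotemporal RoPE indices that ground an H x W reference token grid to a box.

    box: (x1, y1, x2, y2) region in latent coordinates at frame `frame_t`.
    Returns (t_idx, h_idx, w_idx), each of shape [H, W], to feed into the RoPE rotation.
    """
    x1, y1, x2, y2 = box
    j = np.arange(H)[:, None]                          # j in [0, H-1]
    k = np.arange(W)[None, :]                          # k in [0, W-1]
    h_ref = y1 + (y2 - y1) / H * j                     # rows remapped into the box
    w_ref = x1 + (x2 - x1) / W * k                     # columns remapped into the box
    t_idx = np.full((H, W), frame_t + shot_idx * phi)  # Multi-Shot Narrative phase shift
    return t_idx, np.broadcast_to(h_ref, (H, W)), np.broadcast_to(w_ref, (H, W))
```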

4.2.2.2. Motion Control

To control the motion trajectory of a subject, MultiShotMaster creates multiple copies of the subject tokens. Each copy is assigned a different spatiotemporal RoPE corresponding to a specific position at a specific frame along the desired trajectory. The temporal attention mechanism then transfers the subject motion embedded in these copies to the video tokens at the corresponding spatiotemporal positions. After the attention computation, the copied tokens of each subject are averaged to consolidate their influence.

4.2.2.3. Multi-Shot Scene Customization

For multi-shot scene customization using background images, the 3D-RoPE from the first frame of each shot is copied and applied to the corresponding background tokens. This ensures consistent background injection across the relevant shots.

By incorporating spatiotemporal-controllable multi-reference injection, the framework significantly expands its capabilities, allowing users to customize characters, control their positions and movements, and achieve customized multi-shot scene consistency using multiple background images.

4.2.3. Multi-Shot & Multi-Reference Attention Mask

The introduction of multiple reference images and subject copies can lead to very long contexts, increasing computational costs and potentially causing unnecessary interactions (content leakage) between unrelated tokens (e.g., Subject 0 only appears in Shot 2, but other shots might still attend to its reference token).

To constrain information flow and optimize attention allocation, a multi-shot & multi-reference attention mask is designed:

  • Video Token Interactions: Full attention is maintained across all multi-shot video tokens to ensure global consistency across the entire video narrative.

  • Reference Token Interactions: Each shot is limited to access only the reference tokens that are relevant to itself. Similarly, the reference tokens of each shot can only attend to each other and the video tokens within the same shot. This prevents irrelevant content from leaking between shots and optimizes computational resources.

    This strategy ensures that each shot focuses on its intra-shot reference injection while preserving global consistency through inter-shot full attention among video tokens.
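
A hedged sketch of the attention-mask layout described above, assuming reference tokens are grouped per shot; token counts are illustrative and `True` marks an allowed query-key interaction.

```python
import numpy as np

def build_multishot_mask(video_tokens_per_shot, ref_tokens_per_shot):
    """Boolean attention mask: video<->video is full, references stay shot-local.

    video_tokens_per_shot: e.g. [512, 256] video tokens for each shot
    ref_tokens_per_shot:   e.g. [64, 128]  reference tokens relevant to each shot
    """
    n_vid = sum(video_tokens_per_shot)
    n_ref = sum(ref_tokens_per_shot)
    mask = np.zeros((n_vid + n_ref, n_vid + n_ref), dtype=bool)

    mask[:n_vid, :n_vid] = True                     # full attention across all video tokens

    vid_start, ref_start = 0, n_vid
    for n_v, n_r in zip(video_tokens_per_shot, ref_tokens_per_shot):
        vid_slice = slice(vid_start, vid_start + n_v)
        ref_slice = slice(ref_start, ref_start + n_r)
        mask[vid_slice, ref_slice] = True           # shot attends to its own references
        mask[ref_slice, vid_slice] = True           # references attend to their shot's video
        mask[ref_slice, ref_slice] = True           # references of a shot attend to each other
        vid_start += n_v
        ref_start += n_r
    return mask
```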

4.2.4. Training and Inference Paradigm

4.2.4.1. Training

The training process is typically conducted in stages based on a foundation generation model:

  1. Stage 1: Single-Shot Reference Injection: Spatiotemporal-specified reference injection is first trained on 300k single-shot data. This helps the model learn to handle diverse subjects and reference signals. Bounding boxes are sampled with random starting points and 1-second intervals, with a 0.5 drop probability for each bounding box to encourage robustness and ease of control.
  2. Stage 2: Multi-Shot & Multi-Reference Training: The model is then trained on the automatically constructed multi-shot & multi-reference data. To enable controllable multi-shot video generation with various modes (text-driven, subject-driven, background-driven, joint-driven), subject and background conditions are randomly dropped during training, each with a 0.5 probability (enabling classifier-free guidance for these conditions).
  3. Stage 3: Cross-Shot Subject-Focused Post-training: To improve subject consistency and help the model better understand how subjects change across different shots, a cross-shot subject-focused post-training step is introduced. This assigns a higher loss weight (e.g., $2\times$) to subject regions compared to backgrounds (e.g., $1\times$).
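
A minimal sketch of the subject-focused loss reweighting in Stage 3, assuming a binary `subject_mask` derived from the annotated subject regions (a hypothetical input); the 2x/1x weights follow the description above.

```python
import torch

def subject_weighted_loss(pred_velocity, target_velocity, subject_mask,
                          subject_weight=2.0, background_weight=1.0):
    """Velocity-regression loss with higher weight on subject regions.

    subject_mask: tensor with 1 inside subject regions and 0 elsewhere,
                  broadcastable to the latent shape.
    """
    weights = background_weight + (subject_weight - background_weight) * subject_mask
    return torch.mean(weights * (pred_velocity - target_velocity) ** 2)
```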

4.2.4.2. Inference

During inference, MultiShotMaster supports:

  • Text-driven inter-shot consistency.

  • Customized subjects with motion control (via reference images and bounding boxes).

  • Background-driven customized scene consistency (via reference images).

  • Flexible configuration of shot count and duration.

    The framework provides diverse multi-shot video content creation capabilities, allowing users to craft highly customized video narratives.

4.2.5. Multi-Shot & Multi-Reference Data Curation

To address the data scarcity for complex multi-shot and multi-reference video generation, an automated data annotation pipeline is established.

The following figure (Figure 3 from the original paper) illustrates the automated multi-shot & multi-reference data curation pipeline, covering long-video collection, shot transition detection, scene segmentation, and the construction of 235k multi-shot samples. It also depicts the subject annotation process using tools such as Gemini and YOLOv11, and the per-shot caption generation.


4.2.5.1. Multi-Shot Videos Collection

  • Crawl Long Videos: Long videos are first crawled from the internet, encompassing diverse types (movies, TV series, documentaries, cooking, sports, fitness).
  • Shot Transition Detection: TransNet V2 [43] is used to detect shot transitions and crop out individual single-shot videos.
  • Scene Segmentation: A scene segmentation method [54] is employed to understand the storyline and group single-shot videos captured within the same scene. This can cluster videos spanning tens of minutes.
  • Multi-Shot Sampling: Multi-shot videos are then sampled from these segmented scenes with specific criteria: shot count ranging from 1 to 5, frame count from 77 to 308 (i.e., 5-20 seconds at 15 fps), with priority given to samples having higher frame counts and more shots. This process yields 235k multi-shot samples.
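
A hedged sketch of the sampling criteria listed above (1-5 shots, 77-308 frames, preference for more shots and higher frame counts); the per-scene data structure and the enumeration strategy are assumptions for illustration, not the paper's exact procedure.

```python
def sample_multishot_clips(scene_shot_frames, min_frames=77, max_frames=308, max_shots=5):
    """Enumerate windows of consecutive shots in a scene that meet the criteria.

    scene_shot_frames: per-shot frame counts for one segmented scene, e.g. [90, 60, 120, 45].
    Returns (start_shot, n_shots, total_frames) tuples, preferring more shots,
    then higher frame counts.
    """
    candidates = []
    for start in range(len(scene_shot_frames)):
        total = 0
        for n_shots in range(1, max_shots + 1):
            if start + n_shots > len(scene_shot_frames):
                break
            total += scene_shot_frames[start + n_shots - 1]
            if total > max_frames:
                break
            if total >= min_frames:
                candidates.append((start, n_shots, total))
    return sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)

print(sample_multishot_clips([90, 60, 120, 45]))
# [(0, 3, 270), (1, 3, 225), (1, 2, 180), (2, 2, 165), (0, 2, 150), (2, 1, 120), (0, 1, 90)]
```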

4.2.5.2. Caption Definition

A hierarchical caption structure (global and per-shot captions) is used:

  • Global Caption: Gemini-2.5 [10] is used to understand the entire multi-shot video and generate a comprehensive global caption. Each subject is denoted by an indexed noun such as "Subject X" ($X \in \{1, 2, 3, \dots\}$).

  • Per-Shot Captions: Based on the global caption and each individual shot video, Gemini-2.5 is again used to generate per-shot captions. These captions consistently use the predefined "Subject X" identifiers from the global caption to ensure cross-shot subject consistency.

    The following figure (Figure 6 from the original paper's appendix) shows the prompt template used for labeling global captions.

The following figure (Figure 7 from the original paper's appendix) shows the prompt template used for labeling per-shot captions, demonstrating how labeled subjects can be consistent across global and per-shot captions.

4.2.5.3. Reference Images Collection

  • Subject Images:

    • YOLOv11 [23]: Detects objects (subjects).
    • ByteTrack [66]: Tracks detected subjects within each shot.
    • SAM [25]: Segments subjects to obtain clean images.
    • Cross-Shot Merging: Since tracking is done shot-by-shot, Gemini-2.5 [10] is used to group tracking results across all shots. This involves selecting the largest subject image for each shot-level track ID from each shot and then grouping them based on appearance similarity using a carefully designed prompt template. This yields complete multi-shot tracking results and corresponding subject images.
  • Background Images: The first frames of each shot and their corresponding foreground masks are fed into OmniEraser [52] to obtain clean backgrounds.

    The following figure (Figure 8 from the original paper's appendix) shows the prompt for applying Gemini-2.5 to the subject images to merge cross-shot tracking annotations.

4.2.6. Algorithm 1: Temporal Attention with Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE

To provide a clearer understanding of how Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE are integrated into the temporal attention mechanism, Algorithm 1 details the process.

Algorithm 1 Temporal Attention with Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE.

Input:

  • In-context latents ZZ containing:
    • Multi-shot video latents $\hat{z} = [z_i]_{i=1}^{N_{shot}}$, where $N_{shot}$ is the number of shots.
    • Reference latents $z^{ref} = [z^m]_{m=1}^{N_{ref}}$, where $N_{ref}$ is the number of input reference images (subjects and backgrounds).
    • Bounding box sequences of references (subjects and backgrounds) $[boxes]_b^{N_{box}} = [(m, t, x_1, y_1, x_2, y_2)]_b^{N_{box}}$, where $N_{box}$ is the total number of bounding boxes. Each tuple indicates the bounding box for the $m$-th reference at the $t$-th frame. For background references, bounding boxes are fixed as $(m, t, 0, 0, H, W)$, where $t$ is the first frame of the corresponding shot.

Output:

  • In-context latents $Z^{\ast}$ after temporal attention.
  1. $\tilde{Q} = \mathrm{to\_q}(Z)$, $\tilde{K} = \mathrm{to\_k}(Z)$, $\tilde{V} = \mathrm{to\_v}(Z)$ // Linear projections of the in-context latents to queries, keys, and values.

  2. Apply Multi-Shot Narrative RoPE: $ Q = [Q_i]_{i=1}^{N_{shot}} = \text{Eq. 2}([\tilde{Q}_i]_{i=1}^{N_{shot}}) $, $ K = [K_i]_{i=1}^{N_{shot}} = \text{Eq. 2}([\tilde{K}_i]_{i=1}^{N_{shot}}) $, $ V = [\tilde{V}_i]_{i=1}^{N_{shot}} $ Explanation: The original query, key, and value embeddings ($\tilde{Q}, \tilde{K}, \tilde{V}$) for the video latents are processed. Multi-Shot Narrative RoPE (as defined in Equation 2 in the main paper) is applied to the query ($Q$) and key ($K$) embeddings of each shot. This introduces angular phase shifts at shot boundaries, preserving narrative order while signaling shot transitions. The value ($V$) embeddings are not modified by RoPE.

  3. Apply Spatiotemporal Position-Aware RoPE: $ Q^{ref} = [Q_b^{ref}]_{b=0}^{N_{box}} = \text{Eq. 3}(\text{Copy}(\tilde{Q}^{ref}), [boxes]_b^{N_{box}}) $, $ K^{ref} = [K_b^{ref}]_{b=0}^{N_{box}} = \text{Eq. 3}(\text{Copy}(\tilde{K}^{ref}), [boxes]_b^{N_{box}}) $, $ V^{ref} = [\tilde{V}_b^{ref}]_{b=0}^{N_{box}} = \text{Copy}(\tilde{V}^{ref}, [boxes]_b^{N_{box}}) $ Explanation: For the reference latents ($\tilde{Q}^{ref}, \tilde{K}^{ref}, \tilde{V}^{ref}$), copies are created based on the number of bounding boxes they are associated with ($N_{box}^m$ for the $m$-th reference). Spatiotemporal Position-Aware RoPE (as defined in Equation 3 in the main paper) is then applied to these copied query ($Q^{ref}$) and key ($K^{ref}$) embeddings, using the spatiotemporal coordinates from $[boxes]$. This grounds the reference tokens to precise regions in the video. The value ($V^{ref}$) embeddings are simply copied without RoPE modification.

  4. Attention Computation: $ \hat{Z} = \mathrm{Attention}([Q, Q^{ref}], [K, K^{ref}], [V, V^{ref}], \text{Mask}) $ Explanation: The video queries ($Q$), reference queries ($Q^{ref}$), video keys ($K$), reference keys ($K^{ref}$), video values ($V$), and reference values ($V^{ref}$) are fed into the attention mechanism. The Multi-Shot & Multi-Reference Attention Mask is applied here to constrain information flow, ensuring appropriate interactions between video tokens and relevant reference tokens while maintaining global consistency. The output $\hat{Z}$ contains the attention-weighted representations.

  5. Reference Aggregation: $ \bar{z}^{ref} = [\bar{z}^m]_{m=1}^{N_{ref}} = [\text{mean}([\hat{z}^m]_b^{N_{box}^m})]_{m=1}^{N_{ref}} $ Explanation: For each original reference $m$, all of its copies processed during attention (denoted $[\hat{z}^m]_b^{N_{box}^m}$) are averaged to produce a single aggregated representation $\bar{z}^m$. This step consolidates the influence of the multiple positional copies used for motion control.

  6. $Z^{\ast} = \mathrm{to\_out}([\hat{z}, \bar{z}^{ref}])$ // Final linear projection. Explanation: The processed video latents $\hat{z}$ and the aggregated reference latents $\bar{z}^{ref}$ are concatenated and passed through a final linear projection layer ($\mathrm{to\_out}$) to produce the output in-context latents $Z^{\ast}$, which have the same dimension as the input $Z$.

  7. Return $Z^{\ast}$

    This algorithm clearly outlines how Multi-Shot Narrative RoPE (for shot transitions) and Spatiotemporal Position-Aware RoPE (for reference grounding and motion control) are integrated into the temporal attention layer, leveraging the attention mask to manage information flow, ultimately enabling highly controllable multi-shot video generation.

5. Experimental Setup

5.1. Datasets

The paper focuses on building a dedicated dataset for training MultiShotMaster due to the scarcity of existing multi-shot, multi-reference data.

  • Source: Long videos crawled from the internet, including diverse categories like movies, television series, documentaries, cooking demonstrations, sports, and fitness. This wide variety ensures generalizability across different content types and cinematic styles.
  • Scale: The automated data annotation pipeline generated 235k multi-shot samples.
  • Characteristics:
    • Each video comprises 1-5 shots.
    • Frame count ranges from 77 to 308 (i.e., 5-20 seconds at 15 fps).
    • Includes hierarchical captions: a global caption for overall context and per-shot captions for detailed, shot-specific descriptions.
    • Contains cross-shot grounding signals (bounding boxes) for subjects.
    • Includes reference images for subjects and backgrounds.
  • Data Curation Process (as described in 4.2.5):
    1. Long videos are collected.
    2. TransNet V2 [43] detects shot transitions to extract single-shot videos.
    3. A scene segmentation method [54] groups single shots belonging to the same scene.
    4. Multi-shot videos are sampled from these scenes based on shot count (1-5) and frame count (77-308), prioritizing longer, multi-shot samples.
    5. Gemini-2.5 [10] generates global and per-shot captions, ensuring cross-shot subject consistency by using indexed nouns for subjects.
    6. YOLOv11 [23], ByteTrack [66], and SAM [25] detect, track, and segment subjects within each shot. Gemini-2.5 merges these cross-shot tracking annotations to obtain consistent subject IDs.
    7. OmniEraser [52] extracts clean backgrounds from the first frame of each shot using foreground masks.
  • Why these datasets were chosen: The creation of this specialized dataset was necessitated by the lack of public datasets with the required complexity (multi-shot, hierarchical captions, multi-reference, grounding signals) to train a model with such comprehensive controllability. The automated pipeline makes it feasible to scale this unique data generation.

5.2. Evaluation Metrics

The paper employs a comprehensive suite of metrics to evaluate MultiShotMaster from various angles, covering both multi-shot narrative video generation and reference injection.

5.2.1. Text Alignment (TA)

  • Conceptual Definition: This metric quantifies how well the generated video content matches its accompanying text prompt. A higher value indicates better adherence to the textual description.
  • Mathematical Formula: The paper states that it calculates the similarity between text features and shot features extracted by ViCLIP [50]. This typically involves computing the cosine similarity between the embedding vectors. $ \text{Text Alignment} = \frac{1}{N_{shots}} \sum_{i=1}^{N_{shots}} \mathrm{cosine\_similarity}(\text{ViCLIP\_text\_features}(C_i), \text{ViCLIP\_video\_features}(V_i)) $
  • Symbol Explanation:
    • $N_{shots}$: Total number of shots in the video.
    • $C_i$: Text caption for the $i$-th shot.
    • $V_i$: Generated video segment for the $i$-th shot.
    • $\mathrm{cosine\_similarity}(A, B)$: Measures the cosine of the angle between two non-zero vectors $A$ and $B$, indicating their directional similarity. Calculated as $\frac{A \cdot B}{||A|| \cdot ||B||}$.
    • $\text{ViCLIP\_text\_features}(\cdot)$: Function to extract text embedding features using the ViCLIP model.
    • $\text{ViCLIP\_video\_features}(\cdot)$: Function to extract video embedding features for a shot using the ViCLIP model.
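
A minimal sketch of the per-shot similarity averaging behind this metric; the `text_encoder` and `video_encoder` callables stand in for ViCLIP's text and video towers.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def text_alignment(shot_captions, shot_videos, text_encoder, video_encoder):
    """Average per-shot cosine similarity between caption and video features.

    text_encoder / video_encoder: callables returning 1D feature vectors
    (stand-ins for the ViCLIP text and video towers).
    """
    scores = [cosine_similarity(text_encoder(c), video_encoder(v))
              for c, v in zip(shot_captions, shot_videos)]
    return float(np.mean(scores))
```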

5.2.2. Inter-Shot Consistency

This metric assesses the coherence and consistency of elements (semantic, subject, scene) across different shots within a multi-shot video.

  • Conceptual Definition: Measures how consistent the visual elements (overall semantics, specific subjects, and background scenes) remain when transitioning between different shots in the same multi-shot video. High consistency is crucial for narrative coherence.
  • Mathematical Formula:
    1. Semantic Consistency: $ \text{Semantic Consistency} = \mathrm{cosine\_similarity}(\text{ViCLIP\_video\_features}(V_{global}), \text{mean}(\text{ViCLIP\_video\_features}(V_i))) $ Note: the paper only states that it calculates the "holistic semantic similarity between ViCLIP shot features" without giving an exact formula. The expression above is one plausible interpretation (comparing a holistic video embedding with the averaged shot embeddings); a pairwise interpretation would instead average $\mathrm{cosine\_similarity}(\text{ViCLIP\_video\_features}(V_i), \text{ViCLIP\_video\_features}(V_{i+1}))$ over adjacent shots.
    2. Subject Consistency: $ \text{Subject Consistency} = \frac{1}{N_{subjects} \cdot N_{pairs}} \sum_{s=1}^{N_{subjects}} \sum_{(f_a, f_b) \in \text{pairs}_s} \mathrm{cosine\_similarity}(\text{DINOv2\_features}(\text{Subject}_{s,f_a}), \text{DINOv2\_features}(\text{Subject}_{s,f_b})) $
    3. Scene Consistency: $ \text{Scene Consistency} = \frac{1}{N_{scenes} \cdot N_{pairs}} \sum_{s=1}^{N_{scenes}} \sum_{(f_a, f_b) \in \text{pairs}_s} \mathrm{cosine\_similarity}(\text{DINOv2\_features}(\text{Background}_{s,f_a}), \text{DINOv2\_features}(\text{Background}_{s,f_b})) $
  • Symbol Explanation:
    • $V_{global}$: Holistic features of the entire multi-shot video (e.g., the mean of all shot features).
    • $\text{mean}(\text{ViCLIP\_video\_features}(V_i))$: Average of video features across all shots.
    • YOLOv11 [23] and SAM [25]: Used to detect and crop subjects and backgrounds from keyframes (first, middle, and last frames) for subject and scene consistency.
    • DINOv2 [35]: A powerful self-supervised vision transformer used to extract robust visual features for comparing cropped subjects and backgrounds.
    • $N_{subjects}$: Total number of unique subjects in the video.
    • $N_{scenes}$: Total number of unique scenes in the video.
    • $N_{pairs}$: Number of unique keyframe pairs compared (e.g., across shots for a specific subject/scene).
    • $\text{Subject}_{s,f_a}$: Cropped image of subject $s$ in keyframe $f_a$.
    • $\text{Background}_{s,f_a}$: Cropped image of the background in keyframe $f_a$ for scene $s$.
    • $\text{DINOv2\_features}(\cdot)$: Function to extract visual embedding features using the DINOv2 model.

5.2.3. Transition Deviation

  • Conceptual Definition: Measures the accuracy of shot transition detection in the generated videos compared to the desired (ground-truth) transition timestamps. A lower deviation indicates more precise and controllable transitions.
  • Mathematical Formula: $ \text{Transition Deviation} = \frac{1}{N_{transitions}} \sum_{k=1}^{N_{transitions}} |\text{DetectedFrame}_k - \text{GroundTruthFrame}_k| $
  • Symbol Explanation:
    • TransNet V2 [43]: Used to detect transitions in the generated videos.
    • $N_{transitions}$: The number of shot transitions.
    • $\text{DetectedFrame}_k$: The frame index at which TransNet V2 detects the $k$-th transition.
    • $\text{GroundTruthFrame}_k$: The intended (ground-truth) frame index for the $k$-th transition, as specified by the user or dataset.
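
A minimal sketch of this metric, assuming detected and intended transitions are already matched one-to-one in temporal order.

```python
import numpy as np

def transition_deviation(detected_frames, ground_truth_frames):
    """Mean absolute frame offset between detected and intended shot transitions."""
    detected = np.asarray(detected_frames, dtype=np.float64)
    intended = np.asarray(ground_truth_frames, dtype=np.float64)
    return float(np.mean(np.abs(detected - intended)))

print(transition_deviation([76, 154], [77, 150]))  # -> 2.5
```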

5.2.4. Narrative Coherence

  • Conceptual Definition: Evaluates the overall logical flow and storytelling quality of the multi-shot videos. This is a high-level, subjective assessment of whether the sequence of shots makes sense narratively.
  • Measurement Method: Gemini-2.5 [10] is employed as an automated evaluator. It analyzes 20 proportionally sampled frames (at least one per shot) from the multi-shot video, along with the hierarchical captions. Gemini-2.5 is instructed to evaluate four core dimensions based on cinematic narrative logic:
    1. Scene Consistency: Verifies background, lighting, and atmosphere stability across transitions.
    2. Subject Consistency: Scrutinizes identity features and appearance attributes of core objects across different viewpoints.
    3. Action Coherence: Evaluates the temporal logic of dynamic behaviors, ensuring actions are reasonable continuations.
    4. Spatial Consistency: Examines whether the topological structure of relative positional relationships between subjects remains constant. The model outputs a "True" or "False" verdict for each dimension, which are then aggregated (e.g., averaged) to quantify narrative coherence. No direct mathematical formula is provided by the paper, as it relies on a large language model's reasoning.

The following figure (Figure 9 from the original paper's appendix) shows the prompt template used for the Narrative Coherence Metric based on Gemini-2.5.

5.2.5. Reference Injection Consistency

This metric evaluates how accurately the generated subjects and backgrounds match the provided reference images.

  • Conceptual Definition: Measures the visual similarity between the subjects/backgrounds generated in the video and the specific reference images provided by the user.
  • Measurement Method:
    1. Subjects and backgrounds are detected and cropped from the generated video frames (e.g., keyframes).
    2. DINOv2 [35] is used to extract features from these cropped regions and from the original reference images.
    3. The cosine similarity between the generated features and the reference features is calculated.
  • Mathematical Formula:
    $ \text{Reference Subject Consistency} = \frac{1}{N_{subjects}} \sum_{s=1}^{N_{subjects}} \mathrm{cosine\_similarity}\big(\text{DINOv2\_features}(\text{GeneratedSubject}_s), \text{DINOv2\_features}(\text{ReferenceSubject}_s)\big) $
    $ \text{Reference Background Consistency} = \frac{1}{N_{backgrounds}} \sum_{s=1}^{N_{backgrounds}} \mathrm{cosine\_similarity}\big(\text{DINOv2\_features}(\text{GeneratedBackground}_s), \text{DINOv2\_features}(\text{ReferenceBackground}_s)\big) $
  • Symbol Explanation:
    • $N_{subjects}$: Number of subjects for which reference images were provided.
    • $N_{backgrounds}$: Number of backgrounds for which reference images were provided.
    • $\text{GeneratedSubject}_s$: Cropped image of the generated subject $s$ in the video.
    • $\text{ReferenceSubject}_s$: The original reference image for subject $s$.
    • $\text{GeneratedBackground}_s$: Cropped image of the generated background $s$ in the video.
    • $\text{ReferenceBackground}_s$: The original reference image for background $s$.

5.2.6. Grounding

  • Conceptual Definition: Measures the spatiotemporal-grounded accuracy of reference injection by evaluating how precisely generated objects appear within their specified bounding box regions.
  • Mathematical Formula: It calculates the mean Intersection over Union (mIoU) across keyframes.
    $ \text{IoU}(A, B) = \frac{\text{Area}(A \cap B)}{\text{Area}(A \cup B)} $
    $ \text{Grounding (mIoU)} = \frac{1}{N_{objects} \cdot N_{keyframes}} \sum_{o=1}^{N_{objects}} \sum_{k=1}^{N_{keyframes}} \text{IoU}(\text{PredictedBBox}_{o,k}, \text{GroundTruthBBox}_{o,k}) $
  • Symbol Explanation:
    • $\text{IoU}(A, B)$: Intersection over Union, a measure of overlap between two bounding boxes $A$ and $B$.
    • $A \cap B$: The intersection area of bounding boxes $A$ and $B$.
    • $A \cup B$: The union area of bounding boxes $A$ and $B$.
    • $N_{objects}$: Number of objects (subjects or backgrounds) with specified bounding boxes.
    • $N_{keyframes}$: Number of keyframes evaluated.
    • $\text{PredictedBBox}_{o,k}$: The bounding box of the generated object $o$ detected in keyframe $k$ (e.g., by YOLOv11).
    • $\text{GroundTruthBBox}_{o,k}$: The specified (ground-truth) bounding box for object $o$ in keyframe $k$.
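
A minimal sketch of the IoU and mIoU computation defined above; the box representation and the convention that a missing detection contributes an IoU of 0 are assumptions of this sketch.

```python
def box_area(box) -> float:
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(a, b) -> float:
    """Intersection over Union of two boxes in (x_min, y_min, x_max, y_max) form."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_miou(predicted: dict, ground_truth: dict) -> float:
    """
    predicted / ground_truth: dicts keyed by (object_id, keyframe_id) -> box.
    Averages IoU over every specified ground-truth box; objects that are not
    detected in a keyframe contribute 0 to the mean.
    """
    if not ground_truth:
        return 0.0
    total = sum(iou(predicted[k], gt) for k, gt in ground_truth.items() if k in predicted)
    return total / len(ground_truth)

# Example: one object tracked over two keyframes
print(grounding_miou(
    predicted={("car", 0): (10, 10, 50, 50), ("car", 1): (12, 10, 52, 50)},
    ground_truth={("car", 0): (10, 10, 50, 50), ("car", 1): (10, 10, 50, 50)},
))
```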

5.2.7. Aesthetic Score

  • Conceptual Definition: Measures the overall aesthetic quality and visual appeal of the generated multi-shot videos. This is typically an average human-perceived score or a score predicted by a model trained on human aesthetic judgments.
  • Measurement Method: The paper cites Aesthetic Score [40], which typically refers to a predictor trained on large datasets such as LAION-5B [40] that include human aesthetic ratings; the predictor estimates how aesthetically pleasing an image or video frame is. The paper does not provide a specific formula, as the score is simply the model's output.
  • Symbol Explanation: No specific symbols beyond the score itself.

5.3. Baselines

MultiShotMaster is compared against both multi-shot and single-shot reference-to-video methods. All competing baselines are based on Wan2.1-T2V-1.3B [45] at a resolution of $480 \times 832$, while MultiShotMaster is based on a model with $\sim 1$B parameters at $384 \times 672$.

  • Multi-Shot Video Generation Methods:

    • CineTrans [57]: The latest open-source multi-shot narrative method. It manipulates attention scores using a mask matrix to weaken correlations across different shots, aiming for cinematic transitions. It uses a global caption for scene/camera transitions and per-shot captions.
    • EchoShot [46]: Focuses on identity-consistent multi-shot portrait videos and also designs a RoPE-based shot transition mechanism, but its primary goal is portrait consistency rather than general narrative coherence.
  • Single-Shot Reference-to-Video Methods (adapted for multi-shot comparison): These methods are designed for single shots and are used to generate multiple independent single-shot videos from a story, each with individual text prompts. This highlights the challenge of inter-shot consistency when shots are generated in isolation.

    • Phantom [29]: A subject-consistent video generation method via cross-modal alignment.
    • VACE [21]: An all-in-one video creation and editing framework that supports multi-reference video generation.

5.4. Implementation Details

  • Base Model: Pretrained single-shot T2V model with $\sim 1$ billion parameters.
  • Resolution: $384 \times 672$ (lower than the baselines; noted as a limitation).
  • Video Characteristics:
    • Narrative videos containing 77-308 frames (5-20 seconds) at 15 fps.
    • Each video comprises 1-5 shots.
  • VAE Encoding/Decoding: Each shot is encoded separately via 3D VAE [24]. A sliding window strategy is used for shots longer than 77 frames to maintain alignment between pixel space and latent space.
  • Training Hardware: 32 GPUs.
  • Hyperparameters:
    • Learning Rate: $1 \times 10^{-5}$.
    • Batch Size: 1.
    • Angular Phase Shift Factor ($\phi$) for Multi-Shot Narrative RoPE: 0.5 (default); a minimal sketch of this phase shift appears at the end of this subsection.
  • Inference Settings:
    • Classifier-Free Guidance Scale [18]: 7.5.
    • DDIM [42] Steps: 50.
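
For intuition about the phase-shift factor listed above, the following is a hypothetical sketch of how an angular shift with factor $\phi$ could be folded into the temporal RoPE angles at shot boundaries. The paper does not give the exact formula here, so the placement and scale of the shift (adding $\text{shot\_idx} \cdot \phi \cdot \pi$ to every rotation angle) are assumptions made for illustration only.

```python
import numpy as np

def temporal_rope_angles(frame_idx: int, shot_idx: int,
                         dim: int = 64, base: float = 10000.0, phi: float = 0.5):
    """
    Hypothetical Multi-Shot Narrative RoPE angle computation (illustrative only).
    Standard RoPE angles (frame_idx * theta_j) receive an extra phase offset
    proportional to the shot index, so tokens from different shots are separated
    in phase while the global temporal order encoded by frame_idx is preserved.
    """
    j = np.arange(dim // 2)
    theta = base ** (-2.0 * j / dim)          # per-channel rotation frequencies
    phase_shift = shot_idx * phi * np.pi      # assumed form of the shot-level shift
    return frame_idx * theta + phase_shift    # angles used to rotate query/key pairs

# Same frame index, adjacent shots: angles differ by a constant phase of phi * pi
a0 = temporal_rope_angles(frame_idx=10, shot_idx=0)
a1 = temporal_rope_angles(frame_idx=10, shot_idx=1)
print(np.allclose(a1 - a0, 0.5 * np.pi))  # True under this sketch
```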

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the superior performance and outstanding controllability of MultiShotMaster compared to baseline methods in multi-shot video generation.

The following figure (Figure 4 from the original paper) presents two different feature comparisons for multi-shot text-to-video generation and multi-shot reference-to-video generation. The simplified prompts for each shot are shown in the subtitle.

Figure 4 (description): A schematic comparison showing each shot of the multi-shot generation results together with the corresponding character descriptions and plot; shot labels and character interactions illustrate the framework's diversity and controllability.

Qualitative Comparison (Figure 4 Analysis):

  • Upper Part (Multi-Shot Text-to-Video):
    • CineTrans [57] (Row 1) shows limited variation in camera positioning and fails to preserve character identity consistency. For example, the car in Shot 2 and Shot 3 (if present) might not match. This is attributed to its attention score manipulation which impedes original token interactions.
    • EchoShot [46] (Row 2), while designed for identity consistency in portrait videos, shows limitations in broader narrative details like inconsistent clothing colors for the same character across shots.
    • Ours (w/o Ref) (Row 3) demonstrates text-driven cross-shot subject consistency and scene consistency. For example, the vehicle roof maintains consistent color between Shot 2 and Shot 3, even if it occupies a small area in Shot 3.
  • Lower Part (Multi-Shot Reference-to-Video):
    • VACE [21] (Row 4) and Phantom [29] (Row 5) are single-shot methods, so when used for multi-shot generation by multiple independent inferences, they fail to maintain inter-shot subject consistency (e.g., the woman wears different clothing in Shot 1 and Shot 3). They also struggle to fully preserve user-provided background reference images.
    • Ours (w/ Ref) (Row 6) achieves satisfactory reference-driven subject consistency and scene consistency. It further supports grounding signals to control subject injection into specified regions and background injection into specified shots, demonstrating superior control.
  • Hierarchical Captioning: MultiShotMaster's user-friendly hierarchical caption structure (global caption for subject appearance, indexed nouns in per-shot captions) is more convenient than baselines that require repeating character descriptions in every shot caption.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Method | Text Align.↑ | Inter-Shot Semantic↑ | Inter-Shot Subject↑ | Inter-Shot Scene↑ | Transition Deviation↓ | Narrative Coherence↑ | Ref. Subject↑ | Ref. Background↑ | Grounding↑ |
|---|---|---|---|---|---|---|---|---|---|
| CineTrans [57] | 0.174 | 0.683 | 0.437 | 0.389 | 5.27 | 0.496 | ✗ | ✗ | ✗ |
| EchoShot [46] | 0.183 | 0.617 | 0.425 | 0.346 | 3.54 | 0.213 | ✗ | ✗ | ✗ |
| Ours (w/o Ref) | 0.196 | 0.697 | 0.491 | 0.447 | 1.72 | 0.695 | ✗ | ✗ | ✗ |
| VACE [21] | 0.201 | 0.599 | 0.468 | 0.273 | ✗ | 0.325 | 0.475 | 0.361 | ✗ |
| Phantom [29] | 0.224 | 0.585 | 0.462 | 0.279 | ✗ | 0.362 | 0.490 | 0.328 | ✗ |
| Ours (w/ Ref) | 0.227 | 0.702 | 0.495 | 0.472 | 1.41 | 0.825 | 0.493 | 0.456 | 0.594 |

Quantitative Comparison (Table 1 Analysis):

  • Multi-Shot Text-to-Video Generation:
    • CineTrans [57] performs poorly on Transition Deviation (5.27; lower is better) and Text Alignment (0.174). This is attributed to its attention-mask strategy weakening inter-shot correlations, which leads to unsatisfactory transitions and weak text adherence. Its Narrative Coherence (0.496) is also only moderate.
    • EchoShot [46] performs poorly in Narrative Coherence (0.213) because it is specialized for portrait videos rather than narrative content. Its Inter-Shot Consistency (semantic 0.617, subject 0.425, scene 0.346) is also lower than MultiShotMaster.
    • Ours (w/o Ref) demonstrates superior Inter-Shot Consistency (semantic 0.697, subject 0.491, scene 0.447), Transition Deviation (1.72, much lower is better), and Narrative Coherence (0.695). This validates the effectiveness of the proposed Multi-Shot Narrative RoPE and other framework components for text-driven multi-shot generation.
  • Multi-Shot Reference-to-Video Generation:
    • VACE [21] and Phantom [29] (single-shot methods, run independently for multi-shot) naturally have no Transition Deviation scores (marked ✗). Their Inter-Shot Consistency (semantic, subject, scene) and Narrative Coherence are significantly lower than MultiShotMaster's due to independent per-shot generation; they also struggle with Reference Consistency for backgrounds (0.361 and 0.328, respectively) and Narrative Coherence (0.325 and 0.362).

    • Ours (w/ Ref) achieves the best performance across almost all metrics: the highest Text Alignment (0.227), Inter-Shot Consistency (semantic 0.702, subject 0.495, scene 0.472), and Narrative Coherence (0.825), and the lowest Transition Deviation (1.41). Crucially, it also shows excellent Reference Consistency for subjects (0.493) and backgrounds (0.456), and introduces strong Grounding capability (0.594), which the baselines do not support (marked ✗).

      In summary, MultiShotMaster consistently outperforms baselines, especially in inter-shot consistency, narrative coherence, and transition control, while simultaneously enabling comprehensive spatiotemporal-grounded reference injection.

6.3. Ablation Studies / Parameter Analysis

The paper presents ablation studies to validate the effectiveness of key components and the training strategy.

6.3.1. Ablation Study for Network Design

The following are the results from Table 2 of the original paper:

| Method | Inter-Shot Semantic↑ | Inter-Shot Subject↑ | Inter-Shot Scene↑ | Transition Deviation↓ | Narrative Coherence↑ |
|---|---|---|---|---|---|
| w/o MS RoPE | 0.702 | 0.486 | 0.455 | 4.68 | 0.645 |
| Ours (w/o Ref) | 0.697 | 0.491 | 0.447 | 1.72 | 0.695 |

Analysis of Table 2 (Multi-Shot RoPE):

  • w/o MS RoPE (without Multi-Shot Narrative RoPE): This setting relies solely on per-shot captions for shot transitions and uses continuous RoPE. It shows a significantly higher Transition Deviation (4.68 vs. 1.72 for Ours (w/o Ref)), indicating its inability to perform precise shot transitions by text prompts alone. The lack of explicit shot transitions also leads to less change between shots, which artificially inflates semantic consistency (0.702) and scene consistency (0.455) because the "shots" blend into one long continuous clip. However, Narrative Coherence (0.645) is lower due to imprecise transitions and potentially confused narrative flow.

  • Ours (w/o Ref): With Multi-Shot Narrative RoPE, the framework achieves a much lower Transition Deviation (1.72), demonstrating its effectiveness in precisely controlling shot transitions at user-specified timestamps. Its Narrative Coherence (0.695) is also superior.

    The following are the results from Table 3 of the original paper:

| Method | Aesthetic Score↑ | Narrative Coherence↑ | Ref. Subject↑ | Ref. Scene↑ | Grounding↑ |
|---|---|---|---|---|---|
| w/o Mean | 3.84 | 0.796 | 0.482 | 0.452 | 0.557 |
| w/o Attn Mask | 3.72 | 0.787 | 0.468 | 0.414 | 0.561 |
| w/o STPA RoPE | 3.79 | 0.761 | 0.425 | 0.363 | ✗ |
| Ours (w/ Ref) | 3.86 | 0.825 | 0.493 | 0.456 | 0.594 |

Analysis of Table 3 (Reference Injection):

  • w/o Mean (without averaging multiple subject token copies): This setting might cause information loss from the subject copies, resulting in slightly suboptimal Aesthetic Score (3.84) and Narrative Coherence (0.796), and lower Subject Reference Consistency (0.482).
  • w/o Attn Mask (without Multi-Shot & Multi-Reference Attention Mask): The lack of an attention mask leads to excessively long contexts and unnecessary interactions between irrelevant in-context tokens. This results in a weaker Aesthetic Score (3.72) and lower Reference Consistency (subject 0.468, scene 0.414), as content leakage degrades overall quality and control.
  • w/o STPA RoPE (without Spatiotemporal Position-Aware RoPE): This setting directly concatenates reference tokens along the temporal dimension and applies a generic RoPE(t=0, h, w) to each reference, without spatiotemporal grounding; it relies only on text prompts for positioning. Consequently, it shows significantly poorer Reference Consistency (subject 0.425, scene 0.363) and lacks Grounding capability (marked ✗). Narrative Coherence (0.761) is also notably lower.
  • Ours (w/ Ref): The full model achieves the best performance across all metrics, including Aesthetic Score (3.86), Narrative Coherence (0.825), Reference Consistency (subject 0.493, scene 0.456), and Grounding (0.594), validating the effectiveness of all proposed designs for spatiotemporal-grounded reference injection.

6.3.2. Ablation Study for Training Paradigm

The following are the results from Table 4 of the original paper:

| Training Paradigm | Text Align.↑ | Inter-Shot Semantic↑ | Inter-Shot Subject↑ | Inter-Shot Scene↑ | Ref. Subject↑ | Ref. Background↑ | Grounding↑ |
|---|---|---|---|---|---|---|---|
| I: Multi-Shot + Ref. Injection | 0.211 | 0.671 | 0.464 | 0.415 | 0.454 | 0.426 | 0.477 |
| I: Multi-Shot → II: Multi-Shot + Ref. Injection | 0.219 | 0.695 | 0.481 | 0.433 | 0.472 | 0.451 | 0.578 |
| I: Ref. Injection → II: Multi-Shot + Ref. Injection | 0.222 | 0.692 | 0.484 | 0.437 | 0.485 | 0.454 | 0.583 |
| I: Ref. Injection → II: Multi-Shot + Ref. Injection → III: Multi-Shot + Subject-Focused Ref. Injection | 0.227 | 0.702 | 0.495 | 0.472 | 0.493 | 0.456 | 0.594 |

Analysis of Table 4 (Training Paradigm):

  • I: Multi-Shot+Ref. Injection (Unified Training): Training both multi-shot and reference-to-video tasks simultaneously. This approach shows inadequate performance across metrics (e.g., Text Alignment 0.211, Subject Consistency 0.464, Grounding 0.477). This is because the diffusion loss, focused on global consistency, struggles to effectively optimize for both tasks concurrently, especially learning diverse subjects.

  • I: Multi-Shot → II: Multi-Shot + Ref. Injection (Two-stage, multi-shot first): First learns multi-shot text-to-video generation, then multi-shot reference-to-video generation, both using the multi-shot & multi-reference data. This improves performance over unified training (e.g., Text Alignment 0.219, Grounding 0.578). However, Subject Consistency (0.481) remains slightly lower because the multi-shot data alone offers limited exposure to diverse subjects.

  • I: Ref. Injection → II: Multi-Shot + Ref. Injection (Two-stage, reference first): This is the adopted strategy. It first trains spatiotemporal-grounded reference injection on 300k single-shot data, then learns both tasks on the multi-shot & multi-reference data. This yields better results on most metrics (e.g., Text Alignment 0.222, Subject Consistency 0.484, Grounding 0.583). Learning reference injection on a large single-shot dataset first provides a strong foundation for handling diverse subjects.

  • I: Ref. Injection → II: Multi-Shot + Ref. Injection → III: Multi-Shot + Subject-Focused Ref. Injection (Three-stage, with post-training): Adding the subject-focused post-training (assigning a $2\times$ loss weight to subject regions; a minimal sketch of this weighting follows the summary below) further boosts performance across all metrics. This final stage guides the model to prioritize subjects requiring higher consistency and to better comprehend cross-shot subject variations, leading to the best overall results (e.g., Text Alignment 0.227, Subject Consistency 0.495, Grounding 0.594).

    The ablation studies clearly validate the individual contributions of Multi-Shot Narrative RoPE, Spatiotemporal Position-Aware RoPE, the attention mask, and the carefully designed three-stage training paradigm, all of which contribute to the overall superior performance and controllability of MultiShotMaster.
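
For the subject-focused third stage, a plausible reading of the "$2\times$ loss weight on subject regions" is a per-position weighting of the standard denoising objective. The sketch below assumes the weight is applied via a broadcastable latent-space subject mask; the exact loss form and mask granularity are not specified by the paper, so treat this as an illustrative approximation.

```python
import torch

def subject_weighted_diffusion_loss(pred_noise: torch.Tensor,
                                    target_noise: torch.Tensor,
                                    subject_mask: torch.Tensor,
                                    subject_weight: float = 2.0) -> torch.Tensor:
    """
    Weighted denoising loss: latent positions inside subject regions
    (subject_mask == 1) receive `subject_weight` times the weight of the rest
    of the video. pred_noise / target_noise have shape (B, C, T, H, W);
    subject_mask is broadcastable to that shape.
    """
    weights = 1.0 + (subject_weight - 1.0) * subject_mask   # 1 outside, 2 inside
    weights = weights.expand_as(pred_noise)
    per_element = (pred_noise - target_noise) ** 2
    return (weights * per_element).sum() / weights.sum()    # weighted mean

# Toy usage with random tensors and an empty subject mask
b, c, t, h, w = 1, 4, 8, 12, 21
loss = subject_weighted_diffusion_loss(torch.randn(b, c, t, h, w),
                                       torch.randn(b, c, t, h, w),
                                       torch.zeros(b, 1, t, h, w))
```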

7. Conclusion & Reflections

7.1. Conclusion Summary

MultiShotMaster introduces a novel, highly controllable framework for generating narrative multi-shot videos, directly addressing the limitations of existing single-shot and less-controllable multi-shot techniques. The paper's core innovation lies in extending a pretrained single-shot Text-to-Video model through two critical enhancements to Rotary Position Embedding (RoPE):

  1. Multi-Shot Narrative RoPE: Enables the explicit recognition of shot boundaries and controllable transitions by applying angular phase shifts, preserving narrative temporal order while allowing flexible shot arrangements.

  2. Spatiotemporal Position-Aware RoPE: Facilitates precise spatiotemporal-grounded reference injection by incorporating reference tokens and grounding signals directly into RoPE, enabling customized subjects with motion control and background-driven scenes.

    In addition to these architectural innovations, MultiShotMaster establishes an automated multi-shot & multi-reference data curation pipeline. This pipeline effectively overcomes data scarcity by extracting complex video, caption, grounding, and reference image annotations from long internet videos. The framework leverages these intrinsic architectural modifications and a carefully designed three-stage training paradigm to integrate text prompts, subjects, grounding signals, and backgrounds into a versatile system for flexible multi-shot video generation. Extensive experiments qualitatively and quantitatively demonstrate its superior performance and outstanding controllability compared to state-of-the-art baselines.

7.2. Limitations & Future Work

The authors acknowledge several key limitations that require future research:

  1. Generation Quality and Resolution: The current implementation is based on a pretrained single-shot T2V model with $\sim 1$ billion parameters at a resolution of $384 \times 672$. This lags behind more recent open-source models like the WAN family [45], which operate at higher resolutions (e.g., $480 \times 832$) and likely higher quality. Future work will involve implementing MultiShotMaster on these larger, higher-resolution models (e.g., WAN 2.1/2.2) to improve overall generation quality.

  2. Motion Coupling Issue: The current framework explicitly controls subject motion, while camera position is controlled primarily by text prompts. This can lead to a motion coupling issue, where the generated video aligns with the grounding signals, but the alignment is achieved through coupled movement of both the camera and the object.

    The following figure (Figure 5 from the original paper) visualizes this limitation, where the subject's motion and camera motion might be inadvertently intertwined.

    Figure 5. Limitation visualization. We only explicitly control the subject motion, while the camera position is controlled by text prompts, which might cause the motion coupling issue. (Image description: a schematic comparing two groups of generated results, with undesirable generations on top and good generations below; the background and subject reference images on the left highlight how the model behaves under different control conditions.)

Future work aims to address this coupling, likely by introducing more explicit and disentangled control over camera motion independent of subject motion.

7.3. Personal Insights & Critique

MultiShotMaster represents a significant step forward in the complex domain of narrative video generation. The paper's strength lies in its elegant approach to a challenging problem: instead of introducing complex new modules or vastly larger models, it cleverly leverages and extends existing mechanisms like RoPE to achieve sophisticated control.

Inspirations and Applications:

  • Intrinsic Architectural Control: The idea of embedding control signals directly into positional encodings (RoPE) is highly inspiring. This suggests that existing architectural components might hold untapped potential for fine-grained control if rethought for new tasks. This could be applied to other generative tasks where sequential or spatial context is crucial, such as long-form audio generation or complex 3D scene synthesis.
  • Narrative Storytelling Tools: The framework moves AI video generation closer to becoming a practical tool for filmmakers and content creators. The hierarchical captioning, subject control, and background customization are features directly addressing real-world production needs. This approach could be adapted for interactive storytelling platforms or personalized content creation engines.
  • Automated Data Curation: The sophisticated automated data pipeline is a crucial contribution. Data scarcity is a perpetual bottleneck in AI research, especially for complex, multi-modal tasks. The methodology for extracting multi-shot videos, consistent captions, and grounding signals is a valuable blueprint for other researchers facing similar data challenges.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Generalizability of RoPE Phase Shift: While the fixed angular phase shift factor ($\phi = 0.5$) for Multi-Shot Narrative RoPE is stated as requiring no additional trainable parameters, its optimality across all possible narrative structures, shot lengths, and content types might be an implicit assumption. Exploring whether this factor could be dynamically determined or even softly learned (e.g., via a small hypernetwork) based on narrative complexity could be an interesting avenue.

  • Robustness of Automated Data Curation: While effective, automated data pipelines, especially those relying on large language models like Gemini-2.5 for complex reasoning (like cross-shot subject grouping or narrative coherence evaluation), can be prone to hallucination or subtle errors. The quality of the curated 235k dataset is paramount, and a manual auditing process, even for a subset, would strengthen confidence.

  • Disentangled Control Limitations: The acknowledged motion coupling issue is a significant practical limitation. Achieving truly independent control over camera movement, subject movement, and potentially other dynamic elements (e.g., lighting changes, object interactions) is the next frontier. This might require more explicit disentanglement mechanisms beyond positional embeddings, perhaps involving dedicated latent spaces or control signals for each aspect.

  • Computational Cost at Scale: While the RoPE modifications themselves are efficient, the overall framework still operates on a Diffusion Transformer model. Scaling to even higher resolutions, longer videos, or greater shot counts will inevitably push computational boundaries. Research into more efficient attention mechanisms or distillation techniques could be valuable.

  • Subjectivity in Narrative Coherence Evaluation: While using Gemini-2.5 for narrative coherence is innovative, it relies on the LLM's understanding of "cinematic narrative logic." This is inherently subjective, and the fidelity of the LLM's judgment to human perception might warrant further validation (e.g., with human evaluations on a small benchmark).

    Overall, MultiShotMaster is an ambitious and impactful work, pushing the boundaries of what's possible in controllable video generation by thoughtfully adapting core architectural components. Its innovations provide a strong foundation for future research in creating sophisticated, AI-driven cinematic content.
