CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
TL;DR Summary
CameraCtrl II enables large-scale dynamic scene exploration via a camera-controlled video diffusion model, overcoming limitations in video dynamics and viewpoint range by enhancing individual video clips and allowing user-defined camera trajectories for broader spatial exploration.
Abstract
This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
1.2. Authors
Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li. The authors are affiliated with:
- The Chinese University of Hong Kong
- ByteDance Seed
- Stanford University
- ByteDance

Ceyuan Yang is noted as the corresponding author.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. It had not yet been peer-reviewed or formally published in a journal or conference proceedings at the time of its arXiv release. arXiv is a highly influential platform for rapid dissemination of cutting-edge research in computer science, especially in areas like AI, machine learning, and computer vision, and papers posted there often represent significant advancements that are subsequently published in top-tier conferences or journals.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces CameraCtrl II, a framework designed for large-scale dynamic scene exploration using a camera-controlled video diffusion model. The authors identify two primary limitations in previous camera-conditioned video generative models: a degradation in video dynamics and a restricted range of viewpoints, particularly when large camera movements are involved. Their approach addresses these issues by progressively expanding the generation capabilities. First, it enhances dynamic content within individual video clips. Second, it extends this enhanced capability to create seamless explorations across broad viewpoint ranges. The methodology involves constructing a new dataset with a high degree of dynamics and precise camera parameter annotations for training. Concurrently, a lightweight camera injection module and a specialized training scheme are designed to preserve the inherent dynamics of pre-trained diffusion models. Building upon these single-clip improvements, CameraCtrl II enables extended scene exploration by allowing users to iteratively define camera trajectories for generating coherent video sequences. Experimental results across diverse scenarios demonstrate that CameraCtrl II significantly expands the spatial exploration capabilities of camera-controlled dynamic scene synthesis compared to prior methods.
1.6. Original Source Link
https://arxiv.org/abs/2503.10592 (Publication Status: Preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2503.10592v1.pdf
2. Executive Summary
2.1. Background & Motivation
The field of video diffusion models has seen remarkable progress, enabling the generation of high-fidelity, temporally coherent videos from text descriptions. Models like Sora can produce minute-long videos with realistic physics, making them promising tools for modeling dynamic real-world scenes. Beyond simply generating individual scenes, the ability for users to actively explore these digital worlds is gaining importance. Camera control has emerged as a natural interface for this scene exploration, with recent works attempting to inject camera parameters into pre-trained video diffusion models to manipulate viewpoints.
However, existing camera-controlled video generative models face two critical limitations:
- Diminished Video Dynamics: After incorporating camera control, these models often suffer from a significant reduction in their ability to generate dynamic content. The generated videos tend to be largely static, limiting the types of scenes that can be synthesized.
- Limited Viewpoint Range and Short Clip Generation: These models are typically restricted to generating short video clips (e.g., 25-49 frames) and cannot generate new clips in the same scene that are coherent with previously generated content and new user-defined camera trajectories. This severely limits the spatial range that can be explored.
These limitations collectively diminish the user experience and restrict the practical applications of camera-controlled video generation. CameraCtrl II aims to address these two fundamental issues.
2.2. Main Contributions / Findings
CameraCtrl II introduces a novel framework that significantly advances camera-controlled video diffusion models for dynamic scene exploration. Its primary contributions and findings are:
- Systematic Data Curation Pipeline for Dynamic Videos: The paper introduces REALCAM, a new dataset constructed by extracting camera trajectory annotations from real dynamic videos using Structure-from-Motion (SfM). It includes methods to address challenges like arbitrary scale and long-tailed camera trajectory distributions, ensuring the dataset is suitable for training models that generate dynamic content. This directly combats the issue of models learning from static datasets, which previously compromised dynamic capabilities.
- Lightweight Camera Control Injection and Training Strategy: CameraCtrl II proposes a novel, lightweight camera injection module that conditions camera parameters only at the initial layer of the diffusion model. This design, combined with a joint training scheme on both labeled and unlabeled video data, effectively preserves the dynamic content generation capabilities of pre-trained models while enabling precise camera control. The training scheme also facilitates camera classifier-free guidance (CFG) for enhanced control accuracy.
- Clip-wise Autoregressive Generation for Extended Scene Exploration: The framework introduces a clip-wise autoregressive generation recipe that allows the model to produce multiple coherent video clips sequentially. This technique enables continuous scene exploration across broader ranges by conditioning subsequent clips on clean frames from previous ones and new camera trajectories, maintaining visual consistency. This directly tackles the limitation of short clip generation and limited exploration range.

Key Findings:
- CameraCtrl II substantially outperforms previous methods (e.g., MotionCtrl, CameraCtrl, AC3D) across various metrics, including FVD (visual quality), Motion strength (dynamic content), TransErr and RotErr (camera control accuracy), and Geometric consistency.
- The curated REALCAM dataset with dynamic content and scale calibration is crucial for enhancing motion strength and camera control accuracy.
- The lightweight camera injection at the initial layer and the joint training strategy are effective in preserving dynamics and improving control accuracy without over-constraining the model.
- The global reference frame for relative camera poses and training with clean conditioning frames are critical for achieving high Appearance consistency and accurate camera control in sequential generation.
- The generated videos exhibit strong 3D consistency, allowing for high-quality 3D reconstruction into point clouds.
In essence, CameraCtrl II offers a robust solution for generating dynamic, camera-controlled videos that can be seamlessly extended for large-scale scene exploration, overcoming significant limitations of prior works.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand CameraCtrl II, a grasp of the following foundational concepts is essential:
3.1.1. Video Diffusion Models
Video Diffusion Models (VDMs) are a class of generative models that learn to create realistic videos by gradually denoising a random noise signal. They are inspired by diffusion processes in physics.
- Core Idea: A diffusion model has two main processes:
- Forward Diffusion (Noising Process): Gradually adds Gaussian noise to an input video until it becomes pure noise. This process is typically fixed and not learned.
- Reverse Diffusion (Denoising Process): Learns to reverse the forward process, gradually removing noise from a noisy input to reconstruct a clean video. This is the generative part of the model.
- Training: The model is trained to predict the noise added at each step, or directly predict the clean data, given a noisy input and a timestep.
- Generation: To generate a new video, the model starts with random noise and iteratively applies the learned denoising steps until a coherent video emerges.
- Conditional Generation: VDMs can be conditioned on various inputs, such as text descriptions (Text-to-Video, T2V), images (Image-to-Video, I2V), or, in this paper's case, camera parameters, to guide the generation process towards desired outputs.
- Latent Diffusion Models (LDMs): Many modern diffusion models, including those for video, operate in a latent space rather than directly on pixel space. This means videos are first encoded into a lower-dimensional latent representation using an autoencoder (specifically a Variational Autoencoder (VAE) or a similar visual tokenizer). The diffusion process then occurs in this latent space, making training and inference more computationally efficient. After denoising in latent space, the latent representation is decoded back into a high-resolution video.
3.1.2. Transformer Architecture
The Transformer architecture is a neural network design introduced in 2017, which revolutionized natural language processing and has since been widely adopted in computer vision, especially for diffusion models.
- Key Component: Attention Mechanism: Transformers rely heavily on self-attention mechanisms. Self-attention allows the model to weigh the importance of different parts of the input sequence (e.g., different tokens in a text, or different patches in an image/video) when processing each element.
  - The Attention mechanism calculates the output as a weighted sum of Value vectors, where the weight assigned to each Value is determined by a Query and Key comparison.
  - The standard formula for scaled dot-product attention is (a minimal sketch follows this list):
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing linear transformations of the input embeddings.
  - $d_k$ is the dimension of the key vectors, used for scaling to prevent vanishing gradients.
  - $QK^T$ computes the dot product similarity between queries and keys.
  - $\mathrm{softmax}$ normalizes these similarities to obtain attention weights.
- DiT (Diffusion Transformer): A specific variant where the UNet architecture, commonly used in diffusion models, is replaced by or integrated with a Transformer block. This allows for better modeling of long-range dependencies in the data.
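To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. It illustrates the standard mechanism only and is not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention over token sequences.

    Q, K: arrays of shape (num_tokens, d_k); V: (num_tokens, d_v).
    Returns an array of shape (num_tokens, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # weighted sum of values

# Toy usage: 4 tokens with 8-dimensional embeddings, self-attention (Q = K = V).
x = np.random.default_rng(0).normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```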
3.1.3. Structure-from-Motion (SfM)
Structure-from-Motion (SfM) is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences (like video frames) that are taken from different viewpoints.
- Process: SfM works by identifying and matching feature points across multiple images. From these matches, it can simultaneously reconstruct the 3D positions of the feature points in space and the 3D camera poses (position and orientation) from which each image was taken.
- Output: SfM outputs a sparse point cloud (3D coordinates of detected features) and the extrinsic parameters (rotation and translation) for each camera, along with intrinsic parameters (focal length, principal point, distortion coefficients).
- Scale Ambiguity: A common challenge with monocular SfM is scale ambiguity. Without prior knowledge of the scene or additional sensors (like a known distance between two points), the reconstructed 3D structure and camera poses are only known up to an arbitrary scale factor. This means a scene can appear larger or smaller, but its internal proportions are correct. This paper addresses this explicitly with scale calibration.
- VGGSfM: A specific, well-known SfM system often used as a baseline or tool for robust camera pose estimation.
3.1.4. Plücker Embedding
Plücker embedding is a mathematical representation used to describe lines in 3D space. In the context of camera control, it can represent rays from the camera center through pixels in an image.
- Geometric Interpretation: Each ray from the camera center to a pixel can be uniquely defined by a point on the ray (e.g., the camera center) and the direction vector of the ray. A Plücker embedding compactly combines this information.
- Calculation: Given a camera's extrinsic matrix $E = [R \mid t]$ (rotation $R$ and translation $t$) and intrinsic matrix $K$, for each pixel $(u, v)$, its Plücker embedding $p_{u,v} = (o \times d'_{u,v}, d'_{u,v})$ is computed (see the sketch after this list).
  - $o$ represents the camera center in world space (typically derived from $R$ and $t$).
  - $d_{u,v}$ is the direction vector of the ray from the camera to the pixel in world coordinates, transformed by the camera's pose, and $d'_{u,v}$ is its normalized version.
  - The cross product $o \times d'_{u,v}$ provides information about the plane containing the origin and the ray, complementing the direction vector.
- Why it's useful: It provides a compact, geometrically meaningful representation of camera-pixel relationships, which can be fed into neural networks. It contains fine-grained per-pixel camera information that helps guide the generative model.
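The per-pixel construction can be sketched as follows. This is an illustration that assumes world-to-camera extrinsics (conventions differ between papers), not the paper's exact implementation.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plücker embedding (o x d, d) for one camera.

    K: 3x3 intrinsics; [R | t]: world-to-camera extrinsics.
    Returns an array of shape (6, H, W).
    """
    o = -R.T @ t                                           # camera center in world coords

    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)   # homogeneous pixels

    d = R.T @ (np.linalg.inv(K) @ pix)                     # ray directions in world coords
    d = d / np.linalg.norm(d, axis=0, keepdims=True)       # normalized directions d'

    moment = np.cross(np.broadcast_to(o[:, None], d.shape), d, axis=0)  # o x d' per pixel
    return np.concatenate([moment, d], axis=0).reshape(6, height, width)
```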
3.1.5. Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) is a technique used in conditional diffusion models to improve the adherence of generated samples to a given condition (e.g., text prompt, camera parameters).
- Core Idea: During training, the model is sometimes trained with the condition (e.g., text embedding) and sometimes without it (using a null condition, represented by $\phi$). During inference, two predictions are made for the noise: one conditioned on the actual input, and one conditioned on the null input.
- Formula: The final noise prediction $\hat{\epsilon}_\theta(z_t, c)$ is a weighted combination of the unconditional prediction $\epsilon_\theta(z_t, \phi)$ and the conditional prediction $\epsilon_\theta(z_t, c)$:
$
\hat{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \phi) + w \cdot (\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \phi))
$
Where:
  - $z_t$ is the noisy latent at timestep $t$.
  - $c$ is the condition (e.g., text prompt).
  - $\phi$ represents the null condition.
  - $w$ is the guidance weight (a scalar, typically greater than 1), which controls how strongly the model adheres to the condition. Higher $w$ leads to stronger adherence but can sometimes reduce sample diversity or quality.
- Application in CameraCtrl II: This paper extends CFG to include camera guidance, meaning the model makes predictions conditioned on both text and camera parameters, as well as null conditions for each, to enhance control accuracy.
3.2. Previous Works
The paper contextualizes its work within the advancements of Video Diffusion Models and Camera-controlled Video Diffusion Models.
3.2.1. Video Diffusion Models (General)
- Early Works (e.g., [6, 7, 17, 19, 24]): These models often adapted Text-to-Image (T2I) models (e.g., UNet-based architectures) by adding temporal modeling layers to generate videos. They focused on efficient generation from text.
- Recent Models (e.g., [1, 9, 29, 38, 49, 58]): Shifted towards Transformer architectures (like Diffusion Transformers or DiTs) to achieve better temporal consistency and generation quality at scale. Models like Sora [9] are notable for generating minute-long videos with realistic physics.
- Role in CameraCtrl II: CameraCtrl II builds upon these general-purpose video generation capabilities. Its base model is an internal transformer-based text-to-video diffusion model with approximately 3B parameters, leveraging a temporal causal VAE tokenizer. The goal is to extend these powerful generative models with precise camera control and exploration capabilities.
3.2.2. Camera-controlled Video Diffusion Models
These models aim to inject camera pose information to manipulate the viewpoint of generated videos.
- Conditioning Mechanisms:
  - MotionCtrl [54], CameraCtrl [21], I2VControl-Camera [15]: Inject camera parameters (extrinsic matrices, Plücker embeddings, or point trajectories) directly into a pre-trained video diffusion model.
  - CamCo [56]: Integrates epipolar constraints into attention layers, leveraging geometric principles.
  - CamTrol [25]: Uses explicit 3D point cloud representations as conditioning.
  - AC3D [2], VD3D [3]: Focus on carefully designing camera representation injection for transformer-based video diffusion models.
- Multi-camera Scenarios: Some recent works have explored generating from multiple camera views: CVD [30], Caiva [55], Vivid-ZOO [32], SyncCamMaster [4].
- Diminished Dynamics: A common problem is that incorporating camera control often leads to a significant degradation in the ability to generate dynamic content. The generated scenes become largely static.
- Limited Clip Length/Exploration: Existing methods are restricted to generating short video clips (e.g., 25-49 frames) and lack the ability to seamlessly extend generation into new clips while maintaining scene consistency and following new camera trajectories.
- Datasets Used: Previous methods often relied on datasets like
RealEstate10K [62],ACID [36],DL3DV10K [35], orObjaverse [12]. While these provide camera annotations, they primarily featurestatic scenes. Training on such datasets inherently compromises the model's capacity to generate dynamic content when camera control is introduced.
3.3. Technological Evolution
Video generation has evolved from early frame-interpolation methods to complex recurrent neural networks and, more recently, to highly effective diffusion models. The shift from UNet-based architectures to Transformer-based ones (DiT) has been crucial for scalability and temporal consistency.
Concurrently, the integration of control signals has matured. Initially, control was mostly text-based. The natural progression led to explicit spatial and motion control, with camera control being a prime example for navigating synthetic environments. This paper sits at the intersection of powerful video diffusion models and advanced camera control, aiming to overcome the existing limitations.
3.4. Differentiation Analysis
CameraCtrl II differentiates itself from prior camera-controlled video diffusion models through several key innovations:
- Addressing Dynamic Content Degradation:
  - Previous Approaches: Suffered from generating largely static content when camera control was active, partly due to training on datasets predominantly featuring static scenes (e.g., RealEstate10K).
  - CameraCtrl II's Innovation:
    - REALCAM Dataset: Proactively constructs a new dataset (REALCAM) from real-world dynamic videos with precise camera annotations, explicitly designed to teach the model to generate dynamic content under camera control. This is a fundamental departure from relying on static scene datasets.
    - Lightweight Camera Injection: Instead of complex encoders or injecting camera features into every layer of the DiT (which can over-constrain dynamics), CameraCtrl II injects camera parameters only at the initial layer. This allows the model to preserve more of its inherent dynamic generation capabilities learned during pre-training.
    - Joint Training: Integrates training on both camera-labeled dynamic data and unlabeled general video data. This strategy maintains the model's ability to generate diverse and dynamic scenes from its pre-trained state while gaining camera control capabilities.
- Enabling Broad Spatial Exploration (Sequential Video Generation):
  - Previous Approaches: Limited to generating short, isolated video clips (e.g., 25-49 frames), making continuous exploration of a scene impossible without significant inconsistencies between clips.
  - CameraCtrl II's Innovation:
    - Clip-wise Autoregressive Generation: Introduces a novel technique to generate multiple coherent video clips sequentially. It conditions the generation of a new clip on clean frames from the end of the previous clip and a new camera trajectory.
    - Global Reference Frame: Uses a global reference frame (the first frame of the initial trajectory) for calculating relative camera poses across all generated clips. This ensures geometric consistency throughout extended sequences, preventing pose error accumulation that would occur with local, clip-wise references.
    - Targeted Loss Optimization: Optimizes the loss only on the newly generated frames, while the previous frames serve as conditioning, allowing the model to learn seamless extensions.
- Enhanced Camera Control Accuracy and Robustness:
  - Classifier-Free Guidance for Camera: Implements camera classifier-free guidance (CFG), analogous to text CFG, allowing for enhanced camera control accuracy during inference by appropriately weighting conditional and unconditional predictions.
  - Scale Calibration & Distribution Balancing: Addresses SfM's inherent scale ambiguity by aligning all video sequences to a unified metric space and balances the long-tailed distribution of camera trajectory types in real-world data. These steps are crucial for robust and accurate camera control across diverse movements.

In summary, CameraCtrl II moves beyond merely adding camera conditioning to diffusion models by fundamentally re-thinking the data, model architecture, and training strategy to preserve dynamism and enable truly extended, coherent scene exploration, which were major bottlenecks in prior works.
4. Methodology
CameraCtrl II is designed to overcome the limitations of previous camera-conditioned video diffusion models by enabling large-scale dynamic scene exploration. This is achieved through a three-pronged approach: careful dataset curation, an effective camera control injection mechanism, and a sequential video generation technique.
The overall framework is built upon a pre-trained latent video diffusion model. Such a model learns to generate video latents conditioned on a text prompt $c_{txt}$ and, in this case, camera parameters $c_{cam}$. The training objective for a standard conditional video diffusion model is to predict the noise added to a latent at a given timestep $t$:
$
\mathcal{L}(\theta) = \mathbb{E}_{z_0, \epsilon, c_{txt}, c_{cam}, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, c_{txt}, c_{cam}, t) \right\|_2^2 \right]
$
Where:
- $\mathcal{L}(\theta)$ is the loss function for the model with parameters $\theta$.
- $\mathbb{E}$ denotes the expectation over different samples of $z_0$ (encoded video latents), $\epsilon$ (Gaussian noise), $c_{txt}$ (text prompt), $c_{cam}$ (camera parameters), and $t$ (timestep).
- $z_0$ represents the encoded latents from a visual tokenizer (e.g., a VAE).
- $\epsilon$ is the noise added to $z_0$ to obtain $z_t$.
- $z_t$ is the noisy latent at timestep $t$.
- $\epsilon_\theta(z_t, c_{txt}, c_{cam}, t)$ is the prediction of the noise by the denoising network (e.g., a Diffusion Transformer or DiT) with parameters $\theta$, conditioned on $z_t$, $c_{txt}$, $c_{cam}$, and $t$.
- $\| \cdot \|_2^2$ denotes the squared L2 norm, indicating that the model aims to minimize the difference between the true noise $\epsilon$ and its prediction $\epsilon_\theta$.
For inference, the model starts from Gaussian noise and iteratively recovers the video latents using a sampler (e.g., Euler sampler), conditioning on both the input image/text and the specified camera parameters.
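To ground the objective above, here is a minimal PyTorch-style sketch of one camera-conditioned training step. `denoiser`, `vae`, and `scheduler` are hypothetical stand-ins for the paper's DiT, visual tokenizer, and noise schedule, not actual APIs.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, vae, video, text_emb, plucker, scheduler):
    """One camera-conditioned denoising step (illustrative only)."""
    with torch.no_grad():
        z0 = vae.encode(video)                              # clean video latents
    t = torch.randint(0, scheduler.num_steps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                              # Gaussian noise
    zt = scheduler.add_noise(z0, eps, t)                    # forward diffusion
    eps_pred = denoiser(zt, t, text=text_emb, camera=plucker)
    return F.mse_loss(eps_pred, eps)                        # squared L2 objective
```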
4.1. Camera Representation (Plücker Embedding)
For camera representation, CameraCtrl II follows previous works like CameraCtrl [21] and CamCo [56] by adopting the Plücker embedding [47]. This representation offers a strong geometric interpretation and provides fine-grained per-pixel camera information.
Given:
- Camera extrinsic matrix $E = [R \mid t]$, where $R$ is the rotation matrix and $t$ is the translation vector.
- Camera intrinsic matrix $K$.

For each pixel $(u, v)$ in the image, its Plücker embedding is computed as:
$
\mathbf{p}_{u,v} = \left( \mathbf{o} \times \mathbf{d}'_{u,v},\ \mathbf{d}'_{u,v} \right) \in \mathbb{R}^6
$
Where:
- $\mathbf{o}$ represents the camera center in world space.
- $\mathbf{d}_{u,v}$ denotes the ray direction from the camera to the pixel $(u, v)$ in world coordinates. This calculation transforms the pixel coordinates from image space to a ray in world space, considering the camera's intrinsic and extrinsic properties.
- $\mathbf{d}'_{u,v}$ is the normalized version of the direction vector $\mathbf{d}_{u,v}$.

The final Plücker embedding $\mathbf{P}_i \in \mathbb{R}^{6 \times h \times w}$ is constructed for each frame $i$, where $h$ and $w$ are the spatial dimensions matching those of the encoded visual tokens. The 6 channels come from the 3 components of $\mathbf{o} \times \mathbf{d}'_{u,v}$ and the 3 components of $\mathbf{d}'_{u,v}$.
4.2. Dataset Curation
High-quality datasets with accurate camera parameter annotations are critical. CameraCtrl II addresses the limitation of previous static-scene datasets by introducing REALCAM, a new dynamic video dataset. The overall data processing pipeline is shown in Figure 2.
4.2.1. Camera Estimation from Dynamic Videos
To create a diverse and dynamic dataset, CameraCtrl II curates videos from real-world scenarios (indoor, aerial, street). The pipeline includes:
- Dynamic Object Identification: The motion segmentation model TMO [11] is used to identify and mask dynamic foreground objects in a video. This helps distinguish background motion (due to camera movement) from foreground object motion.
- Optical Flow Estimation: RAFT [50] is employed to estimate the dense optical flow for the video.
- Camera Movement Selection: By averaging the optical flow in static background regions (identified by TMO), a quantitative measure of camera movement is obtained. Videos are selected only if their average flow exceeds an empirically determined threshold, ensuring sufficient camera motion (see the sketch after the figure).
- Camera Parameter Estimation: VGGSfM [53] is used to estimate camera parameters (intrinsic and extrinsic) for each frame.

Figure 2. Dataset curation pipeline. We omit the process of dynamic video selection.
This figure is a schematic of the dataset curation pipeline, covering motion segmentation, VGGSfM, scale calibration, and distribution balancing. It shows the dynamic video clips used for depth estimation and the related processing steps, with the outputs feeding into the RealCam dataset.
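The camera-movement selection step can be sketched as follows. The array layouts and the threshold value are illustrative assumptions, not values from the paper.

```python
import numpy as np

def camera_movement_score(flow, fg_mask):
    """Average optical-flow magnitude over static background pixels.

    flow: (T, H, W, 2) dense optical flow (e.g., from RAFT).
    fg_mask: (T, H, W) boolean mask of dynamic foreground pixels (e.g., from TMO).
    """
    magnitude = np.linalg.norm(flow, axis=-1)    # per-pixel flow magnitude
    background = ~fg_mask                        # static regions only
    return float(magnitude[background].mean())

def keep_video(flow, fg_mask, threshold=1.0):
    # `threshold` is a placeholder; the paper tunes it empirically.
    return camera_movement_score(flow, fg_mask) > threshold
```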
4.2.2. Camera Parameter Calibration for Unified Scales
A key challenge with monocular SfM is the arbitrary scale of reconstructions. To achieve consistent camera movements across different videos, a calibration pipeline is developed to align arbitrary scene scales to a metric space.
For each video sequence:
- Keyframe Selection: A set of keyframes is selected.
- Metric Depth Estimation: Metric depths for these keyframes are estimated using a metric depth estimator [8]. Metric depth refers to actual distances in meters.
- SfM Depth Extraction: Corresponding SfM depths are obtained from the VGGSfM output. SfM depths are relative, unscaled depths.
- Scale Factor Calculation: The scale factor between the metric depth and SfM depth for each frame is calculated by minimizing the difference between them (a sketch follows this list):
$
s_i = \arg\min_{s} \sum_{(u, v) \in \Omega} \mathcal{H}\left( s \cdot d_i^{sfm}(u, v) - d_i^{metric}(u, v) \right)
$
Where:
  - $s_i$ is the scale factor for frame $i$.
  - $\Omega$ denotes the set of pixel coordinates.
  - $\mathcal{H}$ is the Huber loss function, which is robust to outliers and depth estimation errors.
  - $d_i^{sfm}(u, v)$ is the SfM depth at pixel $(u, v)$ for frame $i$.
  - $d_i^{metric}(u, v)$ is the metric depth at pixel $(u, v)$ for frame $i$.
  This minimization problem is solved using RANSAC [20] to ensure robustness against noisy depth estimations.
- Final Scale Factor: The final scale factor for the entire scene is computed as the mean of the individual frame scales. It is then multiplied with the camera position (translation) vector of each extrinsic matrix to obtain the scaled extrinsic matrices.
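A simplified sketch of the per-frame scale estimation is given below, using a basic random-hypothesis loop with a Huber cost in the spirit of RANSAC; the paper's exact solver and parameters may differ.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber penalty, robust to outlier depth residuals."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def frame_scale(sfm_depth, metric_depth, n_trials=200, sample=50, delta=1.0):
    """Search for a per-frame scale aligning SfM depth to metric depth."""
    valid = (sfm_depth > 0) & (metric_depth > 0)
    d_sfm, d_met = sfm_depth[valid], metric_depth[valid]
    rng = np.random.default_rng(0)
    best_s, best_cost = 1.0, np.inf
    for _ in range(n_trials):
        idx = rng.choice(d_sfm.size, size=min(sample, d_sfm.size), replace=False)
        s = np.median(d_met[idx] / d_sfm[idx])          # candidate scale hypothesis
        cost = huber(s * d_sfm - d_met, delta).mean()   # score on all valid pixels
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s

# Scene scale = mean of per-frame scales, per the paper's description.
```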
4.2.3. Camera Trajectory Distribution Balancing
Real-world videos often have imbalanced camera trajectory distributions (e.g., forward motion is overrepresented). To prevent overfitting and ensure robust performance across diverse movements, CameraCtrl II balances this distribution:
- Keypoint Detection: Key camera positions (keypoints) are detected on a camera trajectory. For each point, two lines are fitted through its preceding and following points. If the angle between these lines exceeds a threshold, it is marked as a keypoint.
- Segment and Direction Analysis: These keypoints divide a trajectory into segments. The direction of each segment is determined by the fitted line vectors. The segment with the longest camera movement defines the trajectory's primary movement direction.
- View Change and Turn Analysis: Along each segment, camera rotation matrices are analyzed for significant view changes. Between adjacent segments, turns are identified by measuring their angular deviations. Turns after the main segment are defined as main turns.
- Importance Weighting: Each trajectory is assigned an importance weight based on the number and magnitude of both view changes and turns.
- Categorization and Pruning: Trajectories are categorized into types (N primary directions, M main turns). To balance the dataset, redundant trajectory types are pruned by removing trajectories with lower importance scores, resulting in a more uniform distribution.
4.3. Adding Camera Control to Video Generation
The core of CameraCtrl II's model design focuses on integrating camera control while preserving the generation of dynamic content.
4.3.1. Lightweight Camera Injection Module
Previous methods often inject camera features (extracted by a dedicated encoder) into multiple Diffusion Transformer (DiT) layers or convolution layers, which can over-constrain video dynamics. CameraCtrl II proposes a lightweight approach:
- Initial Layer Injection: Camera parameters are injected only at the initial layer of the diffusion model.
- New Patchify Layer: A new patchify layer is designed specifically for camera tokenization. This layer processes the Plücker embeddings to produce camera features that match the dimensions and downsample ratios of the visual features.
- Combination: The visual tokens (from the encoded video latents) and the camera features are combined via element-wise addition. This combined feature then flows through the remaining DiT layers.

This simple injection strategy, as illustrated in Figure 3 (a), effectively guides the generation without overly restricting the model's ability to create dynamic motions; a minimal sketch follows the figure.
Figure 3. CameraCtrl II's architecture for (a) single-clip generation and (b) clip-wise sequential generation.
This figure is a schematic showing the camera representation injection and video extension processes, including how camera patches and visual patches are processed by the pretrained DiT blocks. The notation $\mathbb{R}^{(n \times h \times w) \times c}$ in the figure denotes the feature dimensions, highlighting the key techniques for dynamic scene generation.
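A minimal sketch of the injection under these assumptions is shown below; module and argument names are hypothetical, and only the patchify-then-add structure mirrors the description.

```python
import torch
import torch.nn as nn

class CameraInjection(nn.Module):
    """Patchify Plücker embeddings and add them to visual tokens (illustrative)."""

    def __init__(self, hidden_dim, patch_size=(1, 2, 2)):
        super().__init__()
        # 6 input channels: (o x d, d) per pixel; stride matches the visual patchify.
        self.camera_patchify = nn.Conv3d(6, hidden_dim,
                                         kernel_size=patch_size, stride=patch_size)

    def forward(self, visual_tokens, plucker):
        # plucker: (B, 6, T, H, W) -> camera tokens (B, N, hidden_dim)
        cam = self.camera_patchify(plucker)
        cam = cam.flatten(2).transpose(1, 2)
        return visual_tokens + cam   # element-wise addition, then the remaining DiT blocks
```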
4.3.2. Joint Training with Camera-labeled and Unlabeled Data
To ensure the model retains its capacity for diverse content generation (which REALCAM alone might not fully cover) and to enable advanced control techniques, a joint training strategy is employed:
- Labeled Data: For videos in REALCAM (camera-labeled data), the actual Plücker embeddings are used as the camera condition input.
- Unlabeled Data: For videos without camera annotations (from the pre-training dataset), an all-zero dummy Plücker embedding is used as the camera condition input.
- Benefits:
  - Preserves Diversity: Exposes the model to a wider range of scenes and motions than REALCAM alone.
  - Enables Classifier-Free Guidance (CFG): This joint training setup naturally allows for the implementation of camera classifier-free guidance, analogous to text CFG. During training, the model learns to predict noise both with and without camera conditions.
4.3.3. Camera Classifier-Free Guidance (CFG)
During inference, classifier-free guidance is applied to enhance both text and camera control accuracy. The final noise prediction is a weighted combination of multiple predictions:
$
\hat{\epsilon}_\theta = \epsilon_\theta(z_t, \phi_{txt}, \phi_{cam}) + w_{txt} \cdot \left( \epsilon_\theta(z_t, c_{txt}, \phi_{cam}) - \epsilon_\theta(z_t, \phi_{txt}, \phi_{cam}) \right) + w_{cam} \cdot \left( \epsilon_\theta(z_t, c_{txt}, c_{cam}) - \epsilon_\theta(z_t, c_{txt}, \phi_{cam}) \right)
$
Where:
- $z_t$ denotes the noised latent at timestep $t$.
- $\epsilon_\theta(\cdot)$ represents the denoising network's noise prediction.
- $\phi_{txt}$ and $\phi_{cam}$ indicate null conditioning for text and camera, respectively. This means an input that provides no specific guidance for that condition.
- $w_{txt}$ and $w_{cam}$ are guidance weights for text and camera conditions. These weights control how strongly the generation should adhere to the respective conditions.

The equation combines three terms:
- $\epsilon_\theta(z_t, \phi_{txt}, \phi_{cam})$: The unconditional noise prediction (no text, no camera guidance).
- $w_{txt} \cdot \left( \epsilon_\theta(z_t, c_{txt}, \phi_{cam}) - \epsilon_\theta(z_t, \phi_{txt}, \phi_{cam}) \right)$: The text guidance term, amplifying the difference between text-conditioned and unconditioned predictions.
- $w_{cam} \cdot \left( \epsilon_\theta(z_t, c_{txt}, c_{cam}) - \epsilon_\theta(z_t, c_{txt}, \phi_{cam}) \right)$: The camera guidance term, amplifying the difference between camera-conditioned and camera-unconditioned predictions (while still being text-conditioned).

This formulation allows for precise control over the influence of both text and camera conditions during video generation.
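A sketch of how the two guidance terms could be combined at inference time is shown below; `denoiser` and the weight values are placeholders, not the paper's implementation.

```python
import torch

@torch.no_grad()
def guided_noise_prediction(denoiser, z_t, t, text, camera, null_text, null_cam,
                            w_txt=7.5, w_cam=4.0):
    """Combine unconditional, text-guided, and camera-guided predictions."""
    eps_uncond = denoiser(z_t, t, text=null_text, camera=null_cam)   # no guidance
    eps_text   = denoiser(z_t, t, text=text,      camera=null_cam)   # text only
    eps_full   = denoiser(z_t, t, text=text,      camera=camera)     # text + camera
    return (eps_uncond
            + w_txt * (eps_text - eps_uncond)     # text guidance term
            + w_cam * (eps_full - eps_text))      # camera guidance term
```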
4.4. Sequential Video Generation for Scene Exploration
To enable broader scene exploration, CameraCtrl II extends its single-clip generation capability to a clip-wise autoregressive video generation scheme.
4.4.1. Clip-level Video Extension for Scene Exploration
This technique allows generating multiple coherent video clips sequentially (a minimal sketch follows the figure):
- Contextual Conditioning: For generating a new clip, visual tokens are extracted from the last several frames of the previously generated video clip. These frames act as contextual conditioning.
- Noise Addition: Noise is added to the visual tokens of the current clip to obtain its noisy tokens.
- Token Concatenation: The noisy tokens of the current clip are concatenated along the sequence dimension with the clean conditioning tokens from the previous clip to form $\boldsymbol{z}_t \in \mathbb{R}^{q \times c}$, where $q$ is the total token count and $c$ is the channel dimension.
- Binary Mask: A binary mask $\boldsymbol{m}$ is introduced: 1 for conditioning tokens (from the previous clip) and 0 for tokens being generated (from the current clip). This mask is concatenated with $\boldsymbol{z}_t$ along the channel dimension to form $\boldsymbol{z}_t = [\boldsymbol{z}_t ; \boldsymbol{m}] \in \mathbb{R}^{q \times (c+1)}$.
- Denoising and Loss Calculation: The model (from Section 4.3) takes this combined feature along with the corresponding Plücker embeddings for the new camera trajectory. The loss from Eq. (1) is then computed only over tokens from the generated clip (i.e., where $\boldsymbol{m} = 0$). This process is illustrated in Figure 3 (b).

Figure 3. CameraCtrl II's architecture for (a) single-clip generation and (b) clip-wise sequential generation.
This figure is a schematic showing the camera representation injection and video extension processes, including how camera patches and visual patches are processed by the pretrained DiT blocks. The notation $\mathbb{R}^{(n \times h \times w) \times c}$ in the figure denotes the feature dimensions, highlighting the key techniques for dynamic scene generation.
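A rough sketch of one extension training step under these assumptions is shown below; `denoiser` and `scheduler` are hypothetical stand-ins, and only the concatenation, masking, and loss-on-generated-tokens structure follows the description.

```python
import torch
import torch.nn.functional as F

def extension_training_step(denoiser, cond_tokens, target_tokens, plucker, t, scheduler):
    """Clip-wise extension: clean conditioning tokens + noisy target tokens.

    cond_tokens: (B, q_c, c) clean tokens from the previous clip.
    target_tokens: (B, q_g, c) tokens of the clip being generated.
    """
    eps = torch.randn_like(target_tokens)
    noisy_target = scheduler.add_noise(target_tokens, eps, t)

    z = torch.cat([cond_tokens, noisy_target], dim=1)            # sequence concatenation
    mask = torch.cat([torch.ones_like(cond_tokens[..., :1]),     # 1 = conditioning
                      torch.zeros_like(noisy_target[..., :1])], dim=1)
    z = torch.cat([z, mask], dim=-1)                             # channel concat -> c + 1

    eps_pred = denoiser(z, t, camera=plucker)                    # (B, q_c + q_g, c)
    eps_pred_gen = eps_pred[:, cond_tokens.shape[1]:, :]         # tokens where mask == 0
    return F.mse_loss(eps_pred_gen, eps)                         # loss on generated clip only
```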
4.4.2. Global Coordinate System for Consistency
To ensure geometric consistency throughout an extended sequence and prevent the accumulation of pose errors, a unified coordinate system is used:
- The first frame of the initial trajectory serves as the global reference for calculating relative poses across all generated clips. This means all subsequent camera trajectories are defined relative to this initial global frame; a minimal sketch follows.
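A small sketch of re-expressing poses against a single global reference, assuming 4x4 camera-to-world matrices:

```python
import numpy as np

def poses_relative_to_global(c2w_list):
    """Express camera-to-world poses relative to the very first frame.

    c2w_list holds 4x4 camera-to-world matrices for all frames of all clips, in order.
    Using the same reference for every clip avoids accumulating per-clip pose errors.
    """
    ref_inv = np.linalg.inv(c2w_list[0])        # global reference frame
    return [ref_inv @ c2w for c2w in c2w_list]  # first pose becomes the identity
```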
4.4.3. Model Distillation for Speedup
To accelerate inference and improve user experience, a two-phase distillation approach is implemented:
- Progressive Distillation [44]: This technique reduces the number of neural function evaluations (NFEs) required for sampling. The original model required 96 NFEs (32 for unconditional, 32 for text CFG, 32 for camera CFG). Progressive distillation reduced this to 16 NFEs. The following are the results from Table 1 of the original paper:

| Method | TransErr↓ | RotErr↓ | Sample time (s)↓ |
|---|---|---|---|
| Before distillation | 0.1892 | 1.66 | 13.83 |
| Progressive dist. [44] | 0.2001 | 1.90 | 2.61 |
| APT [34] | 0.2500 | 2.56 | 0.59 |

As shown in Table 1, Progressive distillation significantly reduced sample time (from 13.83s to 2.61s for a 4-second, 12fps video on 4 H800 GPUs) with only a minor degradation in camera control accuracy (TransErr from 0.1892 to 0.2001, RotErr from 1.66 to 1.90).
- APT (Diffusion Adversarial Post-Training) [34]: This method is applied for one-step generation, offering even greater speedup. From Table 1, APT further reduced sample time to 0.59s. However, this came at the cost of more noticeable degradation in TransErr (0.2500) and RotErr (2.56), indicating a trade-off between speed and conditional generation quality with current resources.
5. Experimental Setup
5.1. Datasets
CameraCtrl II utilizes a combination of publicly available and newly curated datasets for training and evaluation.
5.1.1. Training Datasets
- Internal Transformer-based Text-to-Video Diffusion Model's Pre-training Data: The base model is pre-trained on a large, diverse dataset for general video generation, which inherently contains a wide variety of dynamic content and scenes. This provides the foundation for generating dynamic content.
- REALCAM Dataset (New, Curated):
- Source: Real-world dynamic videos.
- Characteristics: Features a high degree of dynamics and precise camera parameter annotations.
- Curation Process: Involves identifying dynamic foreground objects (using TMO [11]), estimating optical flow (RAFT [50]), selecting videos with sufficient camera movement, estimating camera parameters using VGGSfM [53], scale calibration to a unified metric space (using a metric depth estimator [8] and RANSAC [20]), and balancing camera trajectory distributions to handle overrepresented movement types.
- Data Composition during Joint Training: The training process uses a 4:1 ratio between camera-labeled data (from REALCAM) and unlabeled data (from the base model's pre-training data). This ensures the model learns camera control while retaining its general dynamic generation capabilities.
5.1.2. Evaluation Dataset
- Composition: Consists of 800 video clips.
  - 240 videos sampled from the RealEstate10K [62] test set.
  - 560 videos from the processed real-world dynamic videos with camera annotations (presumably a held-out portion of REALCAM or similar).
- Selection: Videos are sampled across different camera trajectory categories, reflecting the balancing efforts during data curation (as described in Section 3.2).
- Purpose: To comprehensively evaluate the model's performance on dynamic videos, camera control accuracy, and scene exploration capabilities across diverse scenarios.
5.2. Evaluation Metrics
CameraCtrl II employs six metrics to provide a comprehensive evaluation, covering visual quality, dynamism, camera control, and consistency.
5.2.1. Visual Quality: Fréchet Video Distance (FVD)
- Conceptual Definition:
Fréchet Video Distance (FVD) [51] is a metric used to assess the quality and diversity of generated videos by comparing their feature distributions to those of real videos. It measures the "distance" between the distributions of features extracted from real and generated video clips. A lower FVD score indicates higher quality and realism, suggesting that the generated videos are more similar to real videos in terms of their visual and temporal characteristics.
- Mathematical Formula: FVD is computed using the Fréchet Inception Distance (FID) formula, but applied to video features. The formula for the Fréchet distance between two multivariate Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ is (a minimal sketch follows this list): $ \mathrm{FVD} = ||\mu_1 - \mu_2||_2^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
- Symbol Explanation:
  - $\mu_1$ and $\Sigma_1$: The mean and covariance matrix of features extracted from the real video dataset.
  - $\mu_2$ and $\Sigma_2$: The mean and covariance matrix of features extracted from the generated video dataset.
  - $||\mu_1 - \mu_2||_2^2$: The squared Euclidean distance (L2 norm) between the mean vectors.
  - $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).
  - $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
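A minimal sketch of the Fréchet distance on precomputed features; the choice of video feature extractor (for FVD, typically a pretrained video backbone) is outside this snippet.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two sets of features (rows = video clips)."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```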
5.2.2. Video Dynamic Fidelity: Motion Strength
- Conceptual Definition: Motion strength is a proposed metric to quantify the degree of dynamism within generated videos. It specifically measures the average magnitude of motion of foreground objects across video frames, isolating object movement from camera movement. A higher motion strength indicates more dynamic content.
- Mathematical Formula (a minimal sketch follows this list):
  - Dense Optical Flow: For each video, RAFT extracts dense optical flow fields for each pixel and frame.
  - Foreground Masking: TMO-generated segmentation masks are applied to the flow fields to identify foreground objects.
  - Pixel-wise Motion Magnitude: For each foreground pixel with flow components $(u, v)$, the motion magnitude is calculated as $\sqrt{u^2 + v^2}$.
  - Average Motion: The final motion strength for a video is the average of these motion magnitudes across all foreground pixels in all frames. The paper mentions converting from radians to degrees, which suggests an intermediate angular quantity, but the final output is a magnitude. Assuming the flow components are given in pixel units, the magnitude is a pixel displacement. Note: The paper does not provide a formal mathematical formula, so the explanation is derived from the conceptual description.
- Symbol Explanation:
  - $(u, v)$: Components of the optical flow vector at a pixel.
  - $\sqrt{u^2 + v^2}$: Magnitude of the optical flow vector.
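A rough sketch of the metric as described; array layouts are assumptions and details may differ from the paper.

```python
import numpy as np

def motion_strength(flow, fg_mask):
    """Average flow magnitude over foreground pixels of a video.

    flow: (T, H, W, 2) optical flow (e.g., from RAFT).
    fg_mask: (T, H, W) boolean foreground masks (e.g., from TMO).
    """
    magnitude = np.linalg.norm(flow, axis=-1)   # sqrt(u^2 + v^2) per pixel
    if not fg_mask.any():
        return 0.0
    return float(magnitude[fg_mask].mean())
```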
5.2.3. Camera Control Accuracy: TransErr and RotErr
- Conceptual Definition: TransErr (Translation Error) and RotErr (Rotation Error) measure how accurately the generated video's camera poses align with the conditioned camera poses. They quantify the discrepancy between the desired camera trajectory and the trajectory estimated from the generated content. Lower values indicate better camera control.
- Mathematical Formula (a minimal sketch follows this list):
  - Pose Estimation: TMO [11] is used to extract motion patterns from generated videos, and VGGSfM [53] is used to estimate camera parameters from these patterns.
  - Trajectory Alignment: The estimated camera trajectory is aligned to the ground truth (conditioned) trajectory using Absolute Trajectory Error (ATE) [48]. This involves:
    - Centering both trajectories.
    - Finding an optimal scale factor.
    - Computing optimal rotation via Singular Value Decomposition (SVD).
    - Determining alignment translation.
  - TransErr: Average Euclidean distance between corresponding camera positions after alignment. $ \mathrm{TransErr} = \frac{1}{N} \sum_{i=1}^N ||\mathbf{t}_{est,i} - \mathbf{t}_{gt,i}||_2 $
  - RotErr: Average angular difference (e.g., in degrees) between corresponding camera orientations (rotation matrices) after alignment. The angular difference between two rotation matrices is computed from the trace of their relative rotation and converted to degrees. $ \mathrm{RotErr} = \frac{1}{N} \sum_{i=1}^N \mathrm{acos}\left(\frac{\mathrm{Tr}(\mathbf{R}_{est,i}^T \mathbf{R}_{gt,i}) - 1}{2}\right) \times \frac{180}{\pi} $
- Symbol Explanation:
  - $N$: Number of frames/poses in the trajectory.
  - $\mathbf{t}_{est,i}$: Estimated camera translation vector for frame $i$.
  - $\mathbf{t}_{gt,i}$: Ground truth (conditioned) camera translation vector for frame $i$.
  - $\mathbf{R}_{est,i}$: Estimated camera rotation matrix for frame $i$.
  - $\mathbf{R}_{gt,i}$: Ground truth (conditioned) camera rotation matrix for frame $i$.
  - $||\cdot||_2$: Euclidean norm.
  - $\mathrm{Tr}(\cdot)$: Trace of a matrix.
  - $\mathrm{acos}(\cdot)$: Arccosine function.
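A small sketch of the per-pose errors, assuming the estimated trajectory has already been ATE-aligned (centered, scaled, rotated) to the ground truth.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angular difference (degrees) between two rotation matrices."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)        # guard against numeric drift
    return np.degrees(np.arccos(cos_angle))

def trajectory_errors(t_est, t_gt, R_est, R_gt):
    """Mean TransErr / RotErr over N aligned poses."""
    trans_err = float(np.mean(np.linalg.norm(t_est - t_gt, axis=1)))
    rot_err = float(np.mean([rotation_error_deg(a, b) for a, b in zip(R_est, R_gt)]))
    return trans_err, rot_err
```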
5.2.4. Geometry Consistency
- Conceptual Definition: Geometric consistency assesses the 3D plausibility and reconstructibility of a generated scene. It measures the success rate of VGGSfM in estimating camera parameters from the generated videos. A higher success ratio implies that the generated video frames are geometrically consistent enough for a 3D reconstruction algorithm to accurately determine camera poses and scene structure.
- Mathematical Formula: $ \mathrm{Geometric\ Consistency} = \frac{\text{Number of videos with successful SfM estimation}}{\text{Total number of generated videos}} \times 100\% $
- Symbol Explanation:
  - Successful SfM estimation: Refers to VGGSfM successfully completing the 3D reconstruction process and outputting valid camera parameters for the given video.
5.2.5. Appearance Consistency
- Conceptual Definition: Appearance consistency evaluates how visually consistent consecutive video clips are when generated sequentially to explore a scene. When extending a scene, the subsequent clips should maintain the same scene content and visual style.
- Mathematical Formula (a minimal sketch follows this list):
  - Feature Extraction: A pre-trained visual encoder (e.g., CLIP [43]) extracts features for each frame in a video clip.
  - Video Feature Averaging: These frame features are averaged to obtain a single video feature for each clip.
  - Cosine Similarity: The cosine similarity is computed between the video features of adjacent clips (or perhaps all clips in a sequence, though the description implies consecutive). Higher cosine similarity indicates better appearance consistency. $ \mathrm{Cosine\ Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \cdot ||\mathbf{B}||} $
- Symbol Explanation:
  - $\mathbf{A}, \mathbf{B}$: Video feature vectors for two clips.
  - $\mathbf{A} \cdot \mathbf{B}$: Dot product.
  - $||\cdot||$: Euclidean norm (magnitude) of a vector.
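A minimal sketch of the metric on precomputed per-frame features (e.g., CLIP image embeddings); treat it as an illustration of the described procedure.

```python
import numpy as np

def appearance_consistency(clip_a_feats, clip_b_feats):
    """Cosine similarity between averaged per-frame features of two clips.

    clip_*_feats: arrays of shape (num_frames, D).
    """
    a = clip_a_feats.mean(axis=0)
    b = clip_b_feats.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```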
5.3. Baselines
CameraCtrl II compares its performance against several representative existing methods in two main settings: Image-to-Video (I2V) and Text-to-Video (T2V) generation with camera control.
5.3.1. Image-to-Video (I2V) Setting Baselines
In this setting, the model takes an initial image and camera trajectory to generate a video. Since these baselines cannot directly extend videos, the last frame of a generated clip is used as the condition image for the next clip.
- MotionCtrl [54]: A unified and flexible motion controller for video generation. It injects camera parameters (point trajectories) to guide video synthesis.
- CameraCtrl [21]: A method specifically designed to enable camera control for text-to-video generation, often using Plücker embeddings for camera conditioning.
5.3.2. Text-to-Video (T2V) Setting Baselines
In this setting, the model takes a text prompt and a camera trajectory to generate a video.
- AC3D [2]: Focuses on analyzing and improving 3D camera control in video diffusion transformers, with careful design of camera representation injection.

These baselines are representative as they are prominent works in camera-controlled video generation and utilize different approaches for injecting camera information. The paper notes that due to differing camera parameter support, camera parameters are temporally downsampled for all baselines to ensure a fair comparison.
5.4. Implementation Details
- Base Model: An internal transformer-based text-to-video diffusion model with approximately 3 billion parameters.
- Latent Diffusion Model: Uses a temporal causal VAE tokenizer (similar to MAGViT2 [60]) with a downsampling rate of 4 for temporal and 8 for spatial dimensions.
- Camera Pose Sampling: Camera poses are sampled every 4 frames, matching the number of visual features.
- Training Strategy:
  - All parameters of the base video diffusion model are unfrozen and jointly optimized.
  - Phase 1 (Single-Clip CameraCtrl II):
    - Resolution: .
    - Steps: 100,000 steps.
    - Batch Size: 640.
    - Video Clip Duration: 2 to 10 seconds.
    - Data Composition: 4:1 ratio of camera-labeled (REALCAM) to unlabeled data.
  - Phase 2 (Finetuning and Video Extension):
    - Resolution: Higher resolution of .
    - Steps: 50,000 steps.
    - Batch Size: 512.
    - Condition Frames for Extension: The number of condition frames from the previous clip ranges from a minimum of 5 frames to a maximum of 50% of the total frames in the new clip.
- Optimizer: AdamW optimizer.
- Learning Rate:
  - Initial: .
  - Warm-up: over 500 steps.
  - Decay: Cosine learning rate scheduler, decaying to .
- Other Hyperparameters: Weight decay of 0.01, betas of 0.9 and 0.95.
- Computational Resources:
  - Phase 1: 64 H100 GPUs.
  - Phase 2: 128 H100 GPUs.
- Inference:
  - Sampler: Euler sampler with 32 steps and a shift of 12 [31].
  - CFG Scales: separate guidance weights for text and for camera.
6. Results & Analysis
6.1. Core Results Analysis
The experiments quantitatively and qualitatively demonstrate CameraCtrl II's superior performance in dynamic scene synthesis with precise camera control and extended spatial exploration compared to existing approaches.
The following are the results from Table 2 of the original paper:
| Model | FVD↓ | Motion strength↑ | TransErr↓ | RotErr↓ | Geometric consistency↑ | Appearance consistency↑ |
|---|---|---|---|---|---|---|
| MotionCtrl [54] | 221.23 | 102.21 | 0.3221 | 2.78 | 57.87 | 0.7431 |
| CameraCtrl [21] | 199.53 | 133.37 | 0.2812 | 2.81 | 52.12 | 0.7784 |
| CameraCtrl II (I2V) | 73.11 | 698.51 | 0.1527 | 1.58 | 88.70 | 0.8893 |
| AC3D [2] | 987.34 | 162.21 | 0.2976 | 2.98 | 69.20 | N/A |
| CameraCtrl II (T2V) | 641.23 | 574.21 | 0.1892 | 1.66 | 85.00 | N/A |
Note: The first three rows correspond to the I2V setting and the last two to the T2V setting. Appearance consistency is reported as N/A for the T2V models, likely because the metric was not applicable or not reported for that baseline.
6.1.1. Quantitative Comparison
As presented in Table 2, CameraCtrl II consistently and significantly outperforms baseline methods across all relevant metrics in both I2V and T2V settings.
- I2V Setting (vs. MotionCtrl, CameraCtrl):
  - Visual Quality (FVD): CameraCtrl II achieves an FVD of 73.11, substantially lower than MotionCtrl (221.23) and CameraCtrl (199.53). This indicates that CameraCtrl II generates videos that are much closer in distribution to real videos, demonstrating higher overall quality and realism.
  - Motion Strength: CameraCtrl II boasts a Motion strength of 698.51, dramatically higher than MotionCtrl (102.21) and CameraCtrl (133.37). This is a crucial validation of its core claim: CameraCtrl II effectively preserves and enhances dynamic content generation, overcoming a major limitation of previous models.
  - Camera Control Accuracy (TransErr, RotErr): CameraCtrl II shows superior camera control with lower TransErr (0.1527 vs. 0.3221/0.2812) and RotErr (1.58 vs. 2.78/2.81). This indicates better alignment between the generated camera poses and the conditioned trajectories.
  - Geometric Consistency: With an 88.70% Geometric consistency score, CameraCtrl II's generated videos are more amenable to 3D reconstruction than those from MotionCtrl (57.87%) or CameraCtrl (52.12%), highlighting better 3D plausibility.
  - Appearance Consistency: CameraCtrl II achieves 0.8893, higher than MotionCtrl (0.7431) and CameraCtrl (0.7784), demonstrating its ability to maintain visual coherence across sequentially generated clips.
- T2V Setting (vs. AC3D):
  - Even in the T2V setting, where the FVD is generally higher (possibly due to the complexity of T2V and the specific evaluation data), CameraCtrl II (641.23 FVD) significantly outperforms AC3D (987.34 FVD).
  - Similar trends are observed for Motion strength (574.21 vs. 162.21), TransErr (0.1892 vs. 0.2976), RotErr (1.66 vs. 2.98), and Geometric consistency (85.00% vs. 69.20%). This confirms CameraCtrl II's robustness and superior performance even when conditioning on text prompts in addition to camera control.
6.1.2. Qualitative Comparison
Figure 4 provides qualitative examples that visually reinforce the quantitative findings.
Figure 4. Qualitative comparison of CameraCtrl II with CameraCtrl and AC3D. The figure shows generated video sequences with different camera movements. CameraCtrl II excels in accurately following camera trajectories and generating dynamic content (e.g., moving cars), whereas CameraCtrl tends to ignore specific camera movements (like upward motion) and generates static scenes. AC3D struggles with complex camera movements (ignoring forward motion) and text prompts (failing to generate a bus).
This figure is a schematic comparing CameraCtrl II with CameraCtrl. It shows video sequences generated under different camera controls, including enhanced dynamic content and extended viewpoints, highlighting CameraCtrl II's advantage in large-scale dynamic scene exploration.
- I2V Setting (CameraCtrl II vs. CameraCtrl):
  - The first two rows in Figure 4 show that CameraCtrl II more accurately follows the input camera trajectories (e.g., upward movements) compared to CameraCtrl [21], which tends to ignore certain camera movements.
  - CameraCtrl II generates more dynamic videos, such as moving cars, while CameraCtrl produces largely static scenes, visually confirming the higher Motion strength metric.
- T2V Setting (CameraCtrl II vs. AC3D):
  - The third and fourth rows illustrate CameraCtrl II's ability to effectively combine camera control with object motion. It successfully generates dynamic elements like moving vehicles (e.g., a bus) while adhering to the camera path.
  - In contrast, AC3D [2] fails to strictly follow the text prompt (not generating a bus) and ignores forward camera motion, resulting in less dynamic and less accurate generations.

Overall, the results consistently demonstrate that CameraCtrl II achieves superior visual quality, significantly higher dynamic content, more accurate camera control, and better consistency for sequential scene exploration compared to state-of-the-art baselines.
6.2. Ablation Studies / Parameter Analysis
Ablation studies are crucial for validating the design choices of CameraCtrl II. The experiments are conducted at a reduced resolution with 50,000 steps of single-clip training, followed by 30,000 steps of multi-clip video extension training.
6.2.1. Effectiveness of Each Component of Data Construction Pipeline
The following are the results from Table 3 of the original paper:
| Model | Motion strength↑ | TransErr↓ | RotErr↓ | Geometric Consistency↑ |
|---|---|---|---|---|
| w/o Dyn. Vid | 129.40 | 0.2069 | 2.02 | 78.50 |
| w/o Scale Calib. | 301.68 | 0.2121 | 2.14 | 82.10 |
| w/o Dist. Balance | 309.24 | 0.2834 | 4.56 | 85.96 |
| Full Pipeline | 306.99 | 0.1830 | 1.74 | 86.50 |
- w/o Dyn. Vid (Training only with static data like RealEstate10K):
  - Motion strength (129.40) is significantly lower than the Full Pipeline (306.99). This confirms that training on dynamic videos with camera annotations is absolutely crucial for enabling high-dynamic content generation.
  - Camera control abilities (TransErr, RotErr, Geometric Consistency) also degrade, highlighting the holistic benefit of dynamic data.
- w/o Scale Calib. (Removing scale calibration):
  - TransErr (0.2121 vs 0.1830) and RotErr (2.14 vs 1.74) increase, and Geometric consistency (82.10 vs 86.50) decreases. This validates that normalizing scene scales to a unified metric space helps the model learn consistent geometric relationships, leading to more accurate camera control and easier 3D reconstruction.
- w/o Dist. Balance (Removing trajectory distribution balancing):
  - Significant degradation in RotErr (4.56 vs 1.74) and TransErr (0.2834 vs 0.1830). This demonstrates the importance of balancing the long-tailed distribution of camera trajectory types. Without it, the model overfits common movements and performs poorly on diverse camera paths.

Conclusion: All components of the data curation pipeline (incorporating dynamic videos, scale calibration, and distribution balancing) are essential for achieving high-quality, dynamic, and accurately camera-controlled video generation.
6.2.2. Effectiveness of Model Design and Training Strategy (Single-Clip Model)
The following are the results from Table 4 of the original paper:
| Model | Motion strength↑ | TransErr↓ | RotErr↓ | Geometric consistency↑ |
|---|---|---|---|---|
| Complex Encoder | 301.23 | 0.1826 | 1.88 | 84.00 |
| Multilayer Inj. | 247.23 | 0.1865 | 1.78 | 85.00 |
| w/o Joint Training | 279.82 | 0.2098 | 1.97 | 81.92 |
| CameraCtrl II | 306.99 | 0.1830 | 1.74 | 86.50 |
-
Complex Encoder(Using a more sophisticated camera encoder similar to CameraCtrl [21]):- Achieves comparable
TransErr(0.1826) to CameraCtrl II (0.1830) but performs slightly worse onRotErr(1.88 vs 1.74) andGeometric consistency(84.00 vs 86.50). More importantly, itsMotion strength(301.23) is slightly lower than CameraCtrl II (306.99). This validates that the proposedsimple patchify layerfor camera tokenization is sufficient and even superior for preserving dynamics.
- Achieves comparable
-
Multilayer Inj.(Injecting camera features at every DiT layer):- While camera control accuracy (TransErr, RotErr, Geometric consistency) remains comparable,
Motion strength(247.23) significantly reduces compared to CameraCtrl II (306.99). This supports the hypothesis that injecting camera information too deeply or globally can over-constrain the model, limiting its ability to generate dynamic movements. Injecting at the initial layer guides overall generation without stifling dynamics.
- While camera control accuracy (TransErr, RotErr, Geometric consistency) remains comparable,
-
w/o Joint Training(Training only with camera-labeled data, no unlabeled data):-
Results in reduced
Motion strength(279.82 vs 306.99) and degraded camera control performance (higher TransErr, RotErr, lower Geometric consistency). This confirms the importance ofjoint trainingwith unlabeled data. It exposes the model to broader visual domains and object motion types, improving generalization and enabling effectivecamera classifier-free guidance.Conclusion: The lightweight camera injection at the initial layer and the joint training strategy are key to CameraCtrl II's ability to combine precise camera control with robust dynamic content generation.
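Below is a minimal PyTorch sketch of the kind of lightweight, initial-layer injection the ablation favors: a per-pixel camera (Plücker-style) embedding is tokenized by a single patchify layer and added element-wise to the video tokens before the first DiT block only. The channel count, patch sizes, and class name are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CameraInjection(nn.Module):
    """Toy initial-layer camera injection (illustrative, not the exact module).

    One patchify layer turns a per-pixel camera embedding into tokens whose
    count matches the video tokens; the two are then combined by simple
    element-wise addition before the first DiT block only.
    """

    def __init__(self, cam_channels: int = 6, patch_size: int = 2, hidden_dim: int = 1024):
        super().__init__()
        # A single conv acts as the "simple patchify layer" for camera tokens.
        self.patchify = nn.Conv3d(
            cam_channels, hidden_dim,
            kernel_size=(1, patch_size, patch_size),
            stride=(1, patch_size, patch_size),
        )

    def forward(self, video_tokens: torch.Tensor, cam_embedding: torch.Tensor) -> torch.Tensor:
        # video_tokens:  (B, N, C) tokens entering the first DiT block.
        # cam_embedding: (B, cam_channels, T, H, W) per-pixel ray embedding,
        #                assumed to patchify to the same N tokens.
        cam_tokens = self.patchify(cam_embedding)           # (B, C, T, H/p, W/p)
        cam_tokens = cam_tokens.flatten(2).transpose(1, 2)  # (B, N, C)
        # Element-wise addition at the entry point only; deeper DiT blocks are
        # left untouched so the pretrained dynamics are preserved.
        return video_tokens + cam_tokens
```

Confining the injection to the entry point keeps the rest of the pretrained backbone unchanged, which matches the ablation's finding that injecting at every layer (Multilayer Inj.) costs Motion strength.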
6.2.3. Key Design Choices for Video Extension
The following are the results from Table 5 of the original paper:
| Model | FVD↓ | TransErr↓ | RotErr↓ | Appearance consistency↑ |
|---|---|---|---|---|
| Different Ref. | 118.32 | 0.1963 | 1.94 | 0.8032 |
| Noised Condition | 136.78 | 0.1847 | 1.85 | 0.7843 |
| Noised Condition* | 140.98 | 0.1901 | 1.88 | 0.7982 |
| CameraCtrl II | 112.46 | 0.1830 | 1.74 | 0.8654 |
- Different Ref. (using each clip's first frame as a local reference for relative camera poses): leads to higher FVD (118.32 vs 112.46), TransErr (0.1963 vs 0.1830), RotErr (1.94 vs 1.74), and lower Appearance consistency (0.8032 vs 0.8654) compared to CameraCtrl II's approach. This confirms that using a global reference frame (the first frame of the first clip) is crucial for maintaining consistent geometric relationships and enabling smooth transitions across clips in extended sequences (a small relative-pose sketch follows this list).
- Noised Condition (adding noise to all clips during training, including conditioning frames, and computing loss over both conditioning and target clips): noticeably degrades performance, most visibly in FVD (136.78) and Appearance consistency (0.7843). This is due to a training-inference mismatch: clean conditioning frames are used at inference time, but noisy ones were used in training.
- Noised Condition* (attempting to bridge the gap by adding a small amount of noise to conditioning frames during inference): still yields suboptimal performance, with FVD (140.98) even worse than Noised Condition and Appearance consistency (0.7982) still well below CameraCtrl II.

Conclusion: The strategy of using clean (noise-free) conditioning frames during both training and inference, combined with the global reference frame for camera poses, is critical for achieving high-quality, coherent, and accurate clip-wise video extension.
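As a small illustration of the global-reference convention that the Different Ref. ablation supports, the sketch below re-expresses every frame's camera pose relative to the first frame of the first clip. The 4x4 world-to-camera convention and the function name are assumptions made for illustration.

```python
import numpy as np

def poses_relative_to_global_reference(world_to_cams):
    """Re-express per-frame camera poses relative to one global reference.

    `world_to_cams` is assumed to be a sequence of 4x4 world-to-camera
    matrices covering every frame of an extended (multi-clip) generation.
    Anchoring all poses to the first frame of the first clip keeps every clip
    in a single shared coordinate system, instead of resetting the reference
    at each clip boundary.
    """
    ref_inv = np.linalg.inv(world_to_cams[0])
    # Pose of frame i expressed in the global reference camera's frame.
    return [w2c @ ref_inv for w2c in world_to_cams]
```

Conditioning each new clip on clean frames from the previous clip then matches what the model sees at inference time, which is exactly the mismatch the Noised Condition variants expose.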
6.3. Visualization Results
6.3.1. Different Scenario Scenes Exploration
Figure 5 showcases the generalization performance of CameraCtrl II across a diverse range of scenarios. Figure 5. The figure illustrates the dynamic scene exploration capabilities of CameraCtrl II across various environments. It demonstrates the model's ability to handle different scene types, such as Minecraft-like game scenes, historical London streets, abandoned hospitals, fantasy outdoor settings, and anime palaces, while executing complex camera movements like panning and complete turns, all while preserving appropriate dynamic effects.
The image is an illustration showing dynamic scene exploration results generated by the CameraCtrl II framework, comparing different viewpoints and dynamic content. The top shows the camera trajectories and the bottom shows the generated results for each scene, demonstrating continuous video synthesis across a wide range of viewpoints.
- The model can be applied to varied environments, including:
  - Minecraft-like game scenes
  - Black-and-white 19th-century foggy London streets
  - An indoor abandoned hospital
  - Outdoor hiking in a fantasy world
  - Anime-style palace scenes
- It effectively controls complex camera movements such as panning left and right and complete turns, while maintaining appropriate dynamic effects within the scene. This highlights its strong generalization and control capabilities.
6.3.2. 3D Reconstruction of Generated Scenes
Figure 6 demonstrates the inherent 3D consistency of the videos generated by CameraCtrl II, enabling high-quality 3D reconstruction. Figure 6. This figure displays videos generated by CameraCtrl II, alongside their corresponding 3D point cloud reconstructions using FLARE. It features various scenes, including indoor and outdoor environments, and close-ups, illustrating the model's capability to generate dynamic content with strong 3D consistency that supports detailed 3D structure inference.
The image is a schematic showing videos generated by CameraCtrl II together with the corresponding point cloud data. It includes indoor and outdoor scenes from multiple viewpoints, as well as close-up details of food, reflecting the model's dynamic behavior and spatial exploration capability across different environments.
- The authors use FLARE [61] to infer detailed 3D point clouds from frames extracted from the generated videos.
- The ability to reconstruct high-quality point clouds from the generated videos serves as strong evidence for the superior 3D consistency achieved by CameraCtrl II, transforming video generative models into effective view synthesizers.
7. Conclusion & Reflections
7.1. Conclusion Summary
CameraCtrl II introduces a robust framework that significantly advances camera-controlled video diffusion models, enabling large-scale dynamic scene exploration. The paper successfully addresses two critical limitations of prior work: diminished video dynamics and restricted viewpoint ranges. This is achieved through a multi-faceted approach:
- Dataset Innovation: The creation of REALCAM, a dynamic video dataset with meticulously calibrated camera parameter annotations, provides the necessary data for learning dynamic camera-controlled generation.
- Model Architecture & Training: A lightweight camera injection module, integrated at the initial layer of a Diffusion Transformer and coupled with a joint training strategy using both labeled and unlabeled data, preserves the model's ability to generate dynamic content while enabling precise camera control. The integration of camera classifier-free guidance further refines this control.
- Extended Exploration Capability: A novel clip-level autoregressive generation technique allows for seamless, coherent extension of video sequences, supporting iterative scene exploration across broad spatial ranges. This is underpinned by using a global reference frame for camera poses and conditioning on clean previous frames.

Experimental results consistently demonstrate CameraCtrl II's superior performance across metrics such as visual quality (FVD), motion strength, camera control accuracy (TransErr, RotErr), geometric consistency, and appearance consistency, validating its effectiveness in diverse scenarios.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Conflict Resolution with Scene Geometry: CameraCtrl II occasionally struggles to resolve conflicts between camera movement and scene geometry. For instance, a physically implausible outcome can occur where a camera trajectory through a fence results in the fence structure appearing damaged, rather than the camera stopping or navigating around it.
  - Future Work: Incorporating more sophisticated physics-aware modeling or 3D scene understanding could help the model generate more physically plausible camera paths and object interactions.
- Geometric Consistency with Complex Trajectories: While geometric consistency is improved, it could be further enhanced, especially when dealing with highly complex camera trajectories.
  - Future Work: Exploring more explicit 3D representations or geometric priors within the generative process could lead to more robust 3D consistency for intricate camera movements.
7.3. Personal Insights & Critique
CameraCtrl II represents a significant leap forward in controllable video generation, particularly for interactive scene exploration.
Strengths and Inspirations:
- Holistic Approach: The paper's strength lies in its holistic approach, tackling data, model architecture, and training strategies simultaneously. This comprehensive design is often necessary for breakthroughs in complex generative tasks.
- Data as a Solution: The emphasis on curating a dedicated dynamic dataset (REALCAM) is a powerful lesson. Often, limitations in generative models are not purely architectural but stem from inadequacies in training data. Proactively addressing this by creating a domain-specific, high-quality dataset is commendable.
- Elegant Simplicity in Injection: The lightweight camera injection at the initial layer, combined with element-wise addition, is surprisingly effective. It demonstrates that simpler mechanisms are sometimes better for control, especially when the goal is to preserve the inherent capabilities (such as dynamism) of a powerful pre-trained model. This suggests a principle of minimal intervention for maximal preservation.
- Seamless Exploration: The clip-wise autoregressive generation with a global reference frame is an elegant solution for extending video sequences coherently. This opens up possibilities for applications like virtual tourism, game environment creation, and interactive storytelling, where continuous scene navigation is paramount.
- Transferability: The methods, particularly the data curation pipeline (SfM, scale calibration, distribution balancing) and the lightweight injection strategy, could potentially be adapted to other forms of controllable generation (e.g., object manipulation, character posing) in video diffusion models, or even extended to 3D asset generation where consistent control is required.

Potential Issues/Areas for Improvement (Critique):
- Computational Cost: While distillation is applied for speedup, training still requires substantial resources (e.g., 128 H100 GPUs for phase 2). This limits accessibility for smaller research groups or individual practitioners. Further research into more efficient training or distillation methods would be beneficial.
- Failure Case Analysis (Physics): The failure case shown in Figure 7 (a camera passing through a fence) highlights a common challenge in generative AI: physical plausibility. While the geometric consistency is good enough for 3D reconstruction, truly world-simulating models need to respect physical laws. Future work could integrate implicit or explicit collision detection and response mechanisms to prevent such implausible outcomes, for example by incorporating differentiable physics engines or learning stronger priors about object solidity.
- User Interface for Trajectory Specification: The paper mentions that users "iteratively specify camera trajectories." The ease and intuitiveness of this specification process are crucial for user experience. While beyond the scope of this technical paper, the practical utility depends heavily on how users define these complex, multi-clip camera paths.
- Generalization to Novel Objects/Interactions: While the REALCAM dataset helps with dynamic scenes, the model's ability to generate novel objects or complex interactions not seen in the training data under camera control remains an open question, especially in extended sequences.
- Long-Term Scene Consistency: While Appearance consistency is measured, maintaining perfect scene identity (e.g., specific objects, lighting, background details) over very long, multi-clip explorations is inherently challenging for generative models and will likely require further innovations in memory or explicit scene representations.

CameraCtrl II is an impressive step towards truly interactive and dynamic world generation, pushing the boundaries of what is possible with video diffusion models. Its innovations provide a solid foundation for future research in controllable content creation.