Paper status: completed

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

Published: 03/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

CameraCtrl II enables large-scale dynamic scene exploration via a camera-controlled video diffusion model, overcoming limitations in video dynamics and viewpoint range by enhancing individual video clips and allowing user-defined camera trajectories for broader spatial exploration.

Abstract

This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

1.2. Authors

Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li. The authors are affiliated with:

  • The Chinese University of Hong Kong
  • ByteDance Seed
  • Stanford University
  • ByteDance

Ceyuan Yang is noted as the corresponding author.

1.3. Journal/Conference

This paper is published as a preprint on arXiv. While not yet peer-reviewed or formally published in a specific journal or conference proceedings at the time of its arXiv release, arXiv is a highly influential platform for rapid dissemination of cutting-edge research in computer science, especially in areas like AI, machine learning, and computer vision. Papers posted on arXiv often represent significant advancements and are frequently published later in top-tier conferences or journals.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces CameraCtrl II, a framework designed for large-scale dynamic scene exploration using a camera-controlled video diffusion model. The authors identify two primary limitations in previous camera-conditioned video generative models: a degradation in video dynamics and a restricted range of viewpoints, particularly when large camera movements are involved. Their approach addresses these issues by progressively expanding the generation capabilities. First, it enhances dynamic content within individual video clips. Second, it extends this enhanced capability to create seamless explorations across broad viewpoint ranges. The methodology involves constructing a new dataset with a high degree of dynamics and precise camera parameter annotations for training. Concurrently, a lightweight camera injection module and a specialized training scheme are designed to preserve the inherent dynamics of pre-trained diffusion models. Building upon these single-clip improvements, CameraCtrl II enables extended scene exploration by allowing users to iteratively define camera trajectories for generating coherent video sequences. Experimental results across diverse scenarios demonstrate that CameraCtrl II significantly expands the spatial exploration capabilities of camera-controlled dynamic scene synthesis compared to prior methods.

https://arxiv.org/abs/2503.10592 (Publication Status: Preprint on arXiv)

https://arxiv.org/pdf/2503.10592v1.pdf

2. Executive Summary

2.1. Background & Motivation

The field of video diffusion models has seen remarkable progress, enabling the generation of high-fidelity, temporally coherent videos from text descriptions. Models like Sora can produce minute-long videos with realistic physics, making them promising tools for modeling dynamic real-world scenes. Beyond simply generating individual scenes, the ability for users to actively explore these digital worlds is gaining importance. Camera control has emerged as a natural interface for this scene exploration, with recent works attempting to inject camera parameters into pre-trained video diffusion models to manipulate viewpoints.

However, existing camera-controlled video generative models face two critical limitations:

  1. Diminished Video Dynamics: After incorporating camera control, these models often suffer from a significant reduction in their ability to generate dynamic content. The generated videos tend to be largely static, limiting the types of scenes that can be synthesized.

  2. Limited Viewpoint Range and Short Clip Generation: These models are typically restricted to generating short video clips (e.g., 25-49 frames) and cannot generate new clips in the same scene that are coherent with previously generated content and new user-defined camera trajectories. This severely limits the spatial range that can be explored.

    These limitations collectively diminish the user experience and restrict the practical applications of camera-controlled video generation. CameraCtrl II aims to address these two fundamental issues.

2.2. Main Contributions / Findings

CameraCtrl II introduces a novel framework that significantly advances camera-controlled video diffusion models for dynamic scene exploration. Its primary contributions and findings are:

  1. Systematic Data Curation Pipeline for Dynamic Videos: The paper introduces REALCAM, a new dataset constructed by extracting camera trajectory annotations from real dynamic videos using Structure-from-Motion (SfM). It includes methods to address challenges like arbitrary scale and long-tailed camera trajectory distributions, ensuring the dataset is suitable for training models that generate dynamic content. This directly combats the issue of models learning from static datasets, which previously compromised dynamic capabilities.

  2. Lightweight Camera Control Injection and Training Strategy: CameraCtrl II proposes a novel, lightweight camera injection module that conditions camera parameters only at the initial layer of the diffusion model. This design, combined with a joint training scheme on both labeled and unlabeled video data, effectively preserves the dynamic content generation capabilities of pre-trained models while enabling precise camera control. The training scheme also facilitates camera classifier-free guidance (CFG) for enhanced control accuracy.

  3. Clip-wise Autoregressive Generation for Extended Scene Exploration: The framework introduces a clip-wise autoregressive generation recipe that allows the model to produce multiple coherent video clips sequentially. This technique enables continuous scene exploration across broader ranges by conditioning subsequent clips on clean frames from previous ones and new camera trajectories, maintaining visual consistency. This directly tackles the limitation of short clip generation and limited exploration range.

    Key Findings:

  • CameraCtrl II substantially outperforms previous methods (e.g., MotionCtrl, CameraCtrl, AC3D) across various metrics, including FVD (visual quality), Motion strength (dynamic content), TransErr and RotErr (camera control accuracy), and Geometric consistency.

  • The curated REALCAM dataset with dynamic content and scale calibration is crucial for enhancing motion strength and camera control accuracy.

  • The lightweight camera injection at the initial layer and the joint training strategy are effective in preserving dynamics and improving control accuracy without over-constraining the model.

  • The global reference frame for relative camera poses and training with clean conditioning frames are critical for achieving high Appearance consistency and accurate camera control in sequential generation.

  • The generated videos exhibit strong 3D consistency, allowing for high-quality 3D reconstruction into point clouds.

    In essence, CameraCtrl II offers a robust solution for generating dynamic, camera-controlled videos that can be seamlessly extended for large-scale scene exploration, overcoming significant limitations of prior works.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand CameraCtrl II, a grasp of the following foundational concepts is essential:

3.1.1. Video Diffusion Models

Video Diffusion Models (VDMs) are a class of generative models that learn to create realistic videos by gradually denoising a random noise signal. They are inspired by diffusion processes in physics.

  • Core Idea: A diffusion model has two main processes:
    1. Forward Diffusion (Noising Process): Gradually adds Gaussian noise to an input video until it becomes pure noise. This process is typically fixed and not learned.
    2. Reverse Diffusion (Denoising Process): Learns to reverse the forward process, gradually removing noise from a noisy input to reconstruct a clean video. This is the generative part of the model.
  • Training: The model is trained to predict the noise added at each step, or directly predict the clean data, given a noisy input and a timestep.
  • Generation: To generate a new video, the model starts with random noise and iteratively applies the learned denoising steps until a coherent video emerges.
  • Conditional Generation: VDMs can be conditioned on various inputs, such as text descriptions (Text-to-Video, T2V), images (Image-to-Video, I2V), or, in this paper's case, camera parameters, to guide the generation process towards desired outputs.
  • Latent Diffusion Models (LDMs): Many modern diffusion models, including those for video, operate in a latent space rather than directly on pixel space. This means videos are first encoded into a lower-dimensional latent representation using an autoencoder (specifically a Variational Autoencoder (VAE) or a similar visual tokenizer). The diffusion process then occurs in this latent space, making training and inference more computationally efficient. After denoising in latent space, the latent representation is decoded back into a high-resolution video.

3.1.2. Transformer Architecture

The Transformer architecture is a neural network design introduced in 2017, which revolutionized natural language processing and has since been widely adopted in computer vision, especially for diffusion models.

  • Key Component: Attention Mechanism: Transformers rely heavily on self-attention mechanisms. Self-attention allows the model to weigh the importance of different parts of the input sequence (e.g., different tokens in a text, or different patches in an image/video) when processing each element.
    • The Attention mechanism calculates the output as a weighted sum of Value vectors, where the weight assigned to each Value is determined by a Query and Key comparison.
    • The standard formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing linear transformations of the input embeddings.
      • $d_k$ is the dimension of the key vectors, used for scaling to prevent vanishing gradients.
      • $QK^T$ computes the dot-product similarity between queries and keys.
      • $\mathrm{softmax}$ normalizes these similarities to obtain attention weights.
  • DiT (Diffusion Transformer): A specific variant where the UNet architecture, commonly used in diffusion models, is replaced by or integrated with a Transformer block. This allows for better modeling of long-range dependencies in the data.
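To make the attention computation concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function and array names are illustrative and not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns an (n, d_v) weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # weighted sum of value vectors
```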

3.1.3. Structure-from-Motion (SfM)

Structure-from-Motion (SfM) is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences (like video frames) that are taken from different viewpoints.

  • Process: SfM works by identifying and matching feature points across multiple images. From these matches, it can simultaneously reconstruct the 3D positions of the feature points in space and the 3D camera poses (position and orientation) from which each image was taken.
  • Output: SfM outputs a sparse point cloud (3D coordinates of detected features) and the extrinsic parameters (rotation and translation) for each camera, along with intrinsic parameters (focal length, principal point, distortion coefficients).
  • Scale Ambiguity: A common challenge with monocular SfM is scale ambiguity. Without prior knowledge of the scene or additional sensors (like a known distance between two points), the reconstructed 3D structure and camera poses are only known up to an arbitrary scale factor. This means a scene can appear larger or smaller, but its internal proportions are correct. This paper addresses this explicitly with scale calibration.
  • VGGSfM: A specific, well-known SfM system often used as a baseline or tool for robust camera pose estimation.

3.1.4. Plücker Embedding

Plücker embedding is a mathematical representation used to describe lines in 3D space. In the context of camera control, it can represent rays from the camera center through pixels in an image.

  • Geometric Interpretation: Each ray from the camera center to a pixel can be uniquely defined by a point on the ray (e.g., the camera center) and the direction vector of the ray. A Plücker embedding compactly combines this information.
  • Calculation: Given a camera's extrinsic matrix $\mathbf{E} = [\mathbf{R}; \mathbf{t}] \in \mathbb{R}^{3 \times 4}$ (rotation $\mathbf{R}$ and translation $\mathbf{t}$) and intrinsic matrix $\mathbf{K} \in \mathbb{R}^{3 \times 3}$, for each pixel (u, v), its Plücker embedding $\mathbf{p} = (\mathbf{o} \times \mathbf{d}', \mathbf{d}')$ is computed.
    • $\mathbf{o}$ represents the camera center in world space (typically derived from $\mathbf{t}$).
    • $\mathbf{d} = \mathbf{R} \mathbf{K}^{-1} [u, v, 1]^T$ is the direction vector of the ray from the camera to the pixel in world coordinates, transformed by the camera's pose, and $\mathbf{d}'$ is its normalized version.
    • The cross product $\mathbf{o} \times \mathbf{d}'$ provides information about the plane containing the origin and the ray, complementing the direction vector.
  • Why it's useful: It provides a compact, geometrically meaningful representation of camera-pixel relationships, which can be fed into neural networks. It contains fine-grained per-pixel camera information that helps guide the generative model.

3.1.5. Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) is a technique used in conditional diffusion models to improve the adherence of generated samples to a given condition (e.g., text prompt, camera parameters).

  • Core Idea: During training, the model is sometimes trained with the condition (e.g., text embedding) and sometimes without it (using a null condition, represented by $\phi$). During inference, two predictions are made for the noise: one conditioned on the actual input, and one conditioned on the null input.
  • Formula: The final noise prediction $\hat{\epsilon}_\theta$ is a weighted combination of the unconditional prediction $\epsilon_\theta(z_t, \phi)$ and the conditional prediction $\epsilon_\theta(z_t, c)$: $ \hat{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \phi) + w \cdot (\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \phi)) $ Where:
    • $z_t$ is the noisy latent at timestep $t$.
    • $c$ is the condition (e.g., text prompt).
    • $\phi$ represents the null condition.
    • $w$ is the guidance weight (a scalar $\ge 1$), which controls how strongly the model adheres to the condition. Higher $w$ leads to stronger adherence but can sometimes reduce sample diversity or quality.
  • Application in CameraCtrl II: This paper extends CFG to include camera guidance, meaning the model makes predictions conditioned on both text and camera parameters, as well as null conditions for each, to enhance control accuracy.

3.2. Previous Works

The paper contextualizes its work within the advancements of Video Diffusion Models and Camera-controlled Video Diffusion Models.

3.2.1. Video Diffusion Models (General)

  • Early Works (e.g., [6, 7, 17, 19, 24]): These models often adapted Text-to-Image (T2I) models (e.g., UNet-based architectures) by adding temporal modeling layers to generate videos. They focused on efficient generation from text.
  • Recent Models (e.g., [1, 9, 29, 38, 49, 58]): Shifted towards Transformer architectures (like Diffusion Transformers or DiTs) to achieve better temporal consistency and generation quality at scale. Models like Sora [9] are notable for generating minute-long videos with realistic physics.
  • Role in CameraCtrl II: CameraCtrl II builds upon these general-purpose video generation capabilities. Its base model is an internal transformer-based text-to-video diffusion model with approximately 3B parameters, leveraging a temporal causal VAE tokenizer. The goal is to extend these powerful generative models with precise camera control and exploration capabilities.

3.2.2. Camera-controlled Video Diffusion Models

These models aim to inject camera pose information to manipulate the viewpoint of generated videos.

  • Conditioning Mechanisms:
    • MotionCtrl [54], CameraCtrl [21], I2VControl-Camera [15]: Inject camera parameters (extrinsic matrices, Plücker embeddings, or point trajectories) directly into a pre-trained video diffusion model.
    • CamCo [56]: Integrates epipolar constraints into attention layers, leveraging geometric principles.
    • CamTrol [25]: Uses explicit 3D point cloud representations as conditioning.
    • AC3D [2], VD3D [3]: Focus on carefully designing camera representation injection for transformer-based video diffusion models.
  • Multi-camera Scenarios: Some recent works have explored generating from multiple camera views: CVD [30], Cavia [55], Vivid-ZOO [32], SyncCamMaster [4].
  • Limitations of Previous Models:
    1. Diminished Dynamics: A common problem is that incorporating camera control often leads to a significant degradation in the ability to generate dynamic content. The generated scenes become largely static.
    2. Limited Clip Length/Exploration: Existing methods are restricted to generating short video clips (e.g., 25-49 frames) and lack the ability to seamlessly extend generation into new clips while maintaining scene consistency and following new camera trajectories.
  • Datasets Used: Previous methods often relied on datasets like RealEstate10K [62], ACID [36], DL3DV10K [35], or Objaverse [12]. While these provide camera annotations, they primarily feature static scenes. Training on such datasets inherently compromises the model's capacity to generate dynamic content when camera control is introduced.

3.3. Technological Evolution

Video generation has evolved from early frame-interpolation methods to complex recurrent neural networks and, more recently, to highly effective diffusion models. The shift from UNet-based architectures to Transformer-based ones (DiT) has been crucial for scalability and temporal consistency.

Concurrently, the integration of control signals has matured. Initially, control was mostly text-based. The natural progression led to explicit spatial and motion control, with camera control being a prime example for navigating synthetic environments. This paper sits at the intersection of powerful video diffusion models and advanced camera control, aiming to overcome the existing limitations.

3.4. Differentiation Analysis

CameraCtrl II differentiates itself from prior camera-controlled video diffusion models through several key innovations:

  1. Addressing Dynamic Content Degradation:

    • Previous Approaches: Suffered from generating largely static content when camera control was active, partly due to training on datasets predominantly featuring static scenes (e.g., RealEstate10K).
    • CameraCtrl II's Innovation:
      • REALCAM Dataset: Proactively constructs a new dataset (REALCAM) from real-world dynamic videos with precise camera annotations, explicitly designed to teach the model to generate dynamic content under camera control. This is a fundamental departure from relying on static scene datasets.
      • Lightweight Camera Injection: Instead of complex encoders or injecting camera features into every layer of the DiT (which can over-constrain dynamics), CameraCtrl II injects camera parameters only at the initial layer. This allows the model to preserve more of its inherent dynamic generation capabilities learned during pre-training.
      • Joint Training: Integrates training on both camera-labeled dynamic data and unlabeled general video data. This strategy maintains the model's ability to generate diverse and dynamic scenes from its pre-trained state while gaining camera control capabilities.
  2. Enabling Broad Spatial Exploration (Sequential Video Generation):

    • Previous Approaches: Limited to generating short, isolated video clips (e.g., 25-49 frames), making continuous exploration of a scene impossible without significant inconsistencies between clips.
    • CameraCtrl II's Innovation:
      • Clip-wise Autoregressive Generation: Introduces a novel technique to generate multiple coherent video clips sequentially. It conditions the generation of a new clip on clean frames from the end of the previous clip and a new camera trajectory.
      • Global Reference Frame: Uses a global reference frame (the first frame of the initial trajectory) for calculating relative camera poses across all generated clips. This ensures geometric consistency throughout extended sequences, preventing pose error accumulation that would occur with local, clip-wise references.
      • Targeted Loss Optimization: Optimizes the loss only on the newly generated frames, while the previous frames serve as conditioning, allowing the model to learn seamless extensions.
  3. Enhanced Camera Control Accuracy and Robustness:

    • Classifier-Free Guidance for Camera: Implements camera classifier-free guidance (CFG), analogous to text CFG, allowing for enhanced camera control accuracy during inference by appropriately weighting conditional and unconditional predictions.

    • Scale Calibration & Distribution Balancing: Addresses SfM's inherent scale ambiguity by aligning all video sequences to a unified metric space and balances the long-tailed distribution of camera trajectory types in real-world data. These steps are crucial for robust and accurate camera control across diverse movements.

      In summary, CameraCtrl II moves beyond merely adding camera conditioning to diffusion models by fundamentally re-thinking the data, model architecture, and training strategy to preserve dynamism and enable truly extended, coherent scene exploration, which were major bottlenecks in prior works.

4. Methodology

CameraCtrl II is designed to overcome the limitations of previous camera-conditioned video diffusion models by enabling large-scale dynamic scene exploration. This is achieved through a three-pronged approach: careful dataset curation, an effective camera control injection mechanism, and a sequential video generation technique.

The overall framework is built upon a pre-trained latent video diffusion model. Such a model learns to generate video latents $z_0$ conditioned on text $c$ and, in this case, camera parameters $s$. The training objective for a standard conditional video diffusion model is to predict the noise added to a latent $z_t$ at a given timestep $t$:

$ L(\theta) = \mathbb{E}_{z_0, \epsilon, c, s, t} \left[ \| \epsilon - \hat{\epsilon}_\theta(z_t, c, s, t) \|_2^2 \right] $

Where:

  • $L(\theta)$ is the loss function for the model with parameters $\theta$.

  • $\mathbb{E}$ denotes the expectation over different samples of $z_0$ (encoded video latents), $\epsilon$ (Gaussian noise), $c$ (text prompt), $s$ (camera parameters), and $t$ (timestep).

  • $z_0$ represents the encoded latents from a visual tokenizer (e.g., a VAE).

  • $\epsilon$ is the noise added to $z_0$ to get $z_t$.

  • $z_t$ is the noisy latent at timestep $t$.

  • $\hat{\epsilon}_\theta(z_t, c, s, t)$ is the prediction of the noise by the denoising network (e.g., a Diffusion Transformer, DiT) with parameters $\theta$, conditioned on $z_t$, $c$, $s$, and $t$.

  • $\|\cdot\|_2^2$ denotes the squared L2 norm, indicating that the model aims to minimize the difference between the true noise $\epsilon$ and its prediction $\hat{\epsilon}_\theta$.

    For inference, the model starts from Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_t^2 \mathbf{I})$ and iteratively recovers the video latents $z_0$ using a sampler (e.g., an Euler sampler), conditioning on both the input image/text and the specified camera parameters.
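A minimal PyTorch-style sketch of this training objective follows. The `denoiser` interface, the noising rule, and the noise schedule are simplifying assumptions for illustration; the paper's base model uses its own (unspecified) schedule.

```python
import torch

def diffusion_loss(denoiser, z0, text_emb, cam_emb, sigmas):
    """One training step of Eq. (1): predict the noise added to clean latents z0.

    denoiser: hypothetical network taking (z_t, text, camera, t) and returning a noise estimate.
    sigmas:   1-D tensor of noise levels (an assumed schedule).
    """
    t = torch.randint(0, len(sigmas), (z0.shape[0],))            # random timestep per sample
    sigma = sigmas[t].view(-1, *([1] * (z0.dim() - 1)))          # broadcast to latent shape
    eps = torch.randn_like(z0)                                   # Gaussian noise
    z_t = z0 + sigma * eps                                       # noised latents (sketch of the forward process)
    eps_hat = denoiser(z_t, text_emb, cam_emb, t)                # noise prediction
    return torch.mean((eps - eps_hat) ** 2)                      # squared L2 objective
```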

4.1. Camera Representation (Plücker Embedding)

For camera representation, CameraCtrl II follows previous works like CameraCtrl [21] and CamCo [56] by adopting the Plücker embedding [47]. This representation offers a strong geometric interpretation and provides fine-grained per-pixel camera information.

Given:

  • Camera extrinsic matrix $\mathbf{E} = [\mathbf{R}; \mathbf{t}] \in \mathbb{R}^{3 \times 4}$, where $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ is the rotation matrix and $\mathbf{t} \in \mathbb{R}^{3 \times 1}$ is the translation vector.

  • Camera intrinsic matrix $\mathbf{K} \in \mathbb{R}^{3 \times 3}$.

    For each pixel (u, v) in the image, its Plücker embedding $\mathbf{p}$ is computed as:

$ \mathbf{p} = (\mathbf{o} \times \mathbf{d}', \mathbf{d}') $

Where:

  • $\mathbf{o}$ represents the camera center in world space.

  • $\mathbf{d} = \mathbf{R} \mathbf{K}^{-1} [u, v, 1]^T + \mathbf{t}$ denotes the ray direction from the camera to the pixel in world coordinates. This calculation transforms the pixel coordinates from image space to a ray in world space, taking into account the camera's intrinsic and extrinsic properties.

  • $\mathbf{d}'$ is the normalized version of the direction vector $\mathbf{d}$.

    The final Plücker embedding $\mathbf{P}_i \in \mathbb{R}^{6 \times h \times w}$ is constructed for each frame $i$, where $h$ and $w$ are the spatial dimensions matching those of the encoded visual tokens. The 6 channels come from the 3 components of $\mathbf{o} \times \mathbf{d}'$ and the 3 components of $\mathbf{d}'$.
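A minimal NumPy sketch of this per-pixel Plücker computation is shown below. It follows the ray formula as quoted above; the extrinsic convention (camera-to-world vs. world-to-camera) and the pixel-center offset are assumptions that may need adjusting for a particular dataset.

```python
import numpy as np

def plucker_embedding(K, R, t, h, w):
    """Per-frame Plücker embedding of shape (6, h, w).

    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation.
    The camera center o is taken as t, following the description above.
    """
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)       # pixel centers
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)   # (3, h*w) homogeneous pixels
    d = R @ np.linalg.inv(K) @ pix + t[:, None]                      # rays in world coordinates
    d = d / np.linalg.norm(d, axis=0, keepdims=True)                 # normalized directions d'
    o = t[:, None] * np.ones_like(d)                                 # camera center repeated per pixel
    moment = np.cross(o.T, d.T).T                                    # o x d' (3, h*w)
    return np.concatenate([moment, d], axis=0).reshape(6, h, w)
```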

4.2. Dataset Curation

High-quality datasets with accurate camera parameter annotations are critical. CameraCtrl II addresses the limitation of previous static-scene datasets by introducing REALCAM, a new dynamic video dataset. The overall data processing pipeline is shown in Figure 2.

4.2.1. Camera Estimation from Dynamic Videos

To create a diverse and dynamic dataset, CameraCtrl II curates videos from real-world scenarios (indoor, aerial, street). The pipeline includes:

  1. Dynamic Object Identification: The motion segmentation model TMO [11] is used to identify and mask dynamic foreground objects in a video. This helps distinguish background motion (due to camera movement) from foreground object motion.

  2. Optical Flow Estimation: RAFT [50] is employed to estimate the dense optical flow for the video.

  3. Camera Movement Selection: By averaging the optical flow in static background regions (identified by TMO), a quantitative measure of camera movement is obtained. Videos are selected only if their average flow exceeds an empirically determined threshold, ensuring sufficient camera motion.

  4. Camera Parameter Estimation: VGGSfM [53] is used to estimate camera parameters (intrinsic and extrinsic) for each frame.

    Figure 2. Dataset curation pipeline (the dynamic video selection step is omitted). The figure depicts motion segmentation, VGGSfM camera estimation, depth-based scale calibration, and distribution balancing applied to dynamic video clips, which together produce the RealCam dataset.

4.2.2. Camera Parameter Calibration for Unified Scales

A key challenge with monocular SfM is the arbitrary scale of reconstructions. To achieve consistent camera movements across different videos, a calibration pipeline is developed to align arbitrary scene scales to a metric space.

For each video sequence:

  1. Keyframe Selection: $N$ keyframes are selected.

  2. Metric Depth Estimation: Metric depths $\{ \mathbf{M}_i \}_{i=1}^N$ for these keyframes are estimated using a metric depth estimator [8]. Metric depth refers to actual distances in meters.

  3. SfM Depth Extraction: Corresponding SfM depths $\{ \mathbf{S}_i \}_{i=1}^N$ are obtained from the VGGSfM output. SfM depths are relative, unscaled depths.

  4. Scale Factor Calculation: The scale factor $s_i$ between the metric depth and SfM depth for each frame $i$ is calculated by minimizing the difference between them:

     $ s_i = \underset{s}{\arg\min} \sum_{p \in \mathcal{P}} \rho \left( \left| s \cdot \mathbf{S}_i(p) - \mathbf{M}_i(p) \right| \right) $ Where:

     • $s_i$ is the scale factor for frame $i$.
     • $\mathcal{P}$ denotes the set of pixel coordinates.
     • $\rho(\cdot)$ is the Huber loss function, which is robust to outliers and depth estimation errors.
     • $\mathbf{S}_i(p)$ is the SfM depth at pixel $p$ for frame $i$.
     • $\mathbf{M}_i(p)$ is the metric depth at pixel $p$ for frame $i$. This minimization problem is solved using RANSAC [20] to ensure robustness against noisy depth estimations.
  5. Final Scale Factor: The final scale factor $s$ for the entire scene is computed as the mean of the individual frame scales. This factor $s$ is then multiplied with the camera position vector $\mathbf{t}$ of the extrinsic matrix to obtain the scaled extrinsic matrix $\mathbf{E} = [\mathbf{R}; s \cdot \mathbf{t}] \in \mathbb{R}^{3 \times 4}$.
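A minimal sketch of the per-frame scale estimation is shown below, combining a Huber cost with a simple one-point RANSAC loop. The helper names, sampling strategy, and thresholds are illustrative assumptions; the paper does not specify its exact RANSAC configuration.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss applied elementwise to residuals r."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def frame_scale(sfm_depth, metric_depth, iters=200, inlier_thresh=0.5):
    """Estimate s minimizing sum of Huber(|s * S - M|) with a simple RANSAC loop."""
    S, M = sfm_depth.ravel(), metric_depth.ravel()
    valid = (S > 0) & (M > 0)
    S, M = S[valid], M[valid]
    best_s, best_cost = 1.0, np.inf
    rng = np.random.default_rng(0)
    for _ in range(iters):
        i = rng.integers(len(S))                       # minimal sample: one depth correspondence
        s = M[i] / S[i]
        inliers = np.abs(s * S - M) < inlier_thresh
        if inliers.sum() < 10:
            continue
        s_ref = np.median(M[inliers] / S[inliers])     # refine scale on the inlier set
        cost = huber(s_ref * S - M).sum()
        if cost < best_cost:
            best_s, best_cost = s_ref, cost
    return best_s

# The final scene scale is the mean of the per-keyframe scales; it rescales t in E = [R; s * t].
```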

4.2.3. Camera Trajectory Distribution Balancing

Real-world videos often have imbalanced camera trajectory distributions (e.g., forward motion is overrepresented). To prevent overfitting and ensure robust performance across diverse movements, CameraCtrl II balances this distribution:

  1. Keypoint Detection: Key camera positions (keypoints) are detected on a camera trajectory. For each point, two lines are fitted through its preceding and following $n$ points. If the angle between these lines exceeds a threshold $\gamma$, it is marked as a keypoint (a minimal sketch of this test follows the list below).
  2. Segment and Direction Analysis: These keypoints divide a trajectory into segments. The direction of each segment is determined by the fitted line vectors. The segment with the longest camera movement defines the trajectory's primary movement direction.
  3. View Change and Turn Analysis: Along each segment, camera rotation matrices are analyzed for significant view changes. Between adjacent segments, turns are identified by measuring their angular deviations. Turns after the main segment are defined as main turns.
  4. Importance Weighting: Each trajectory is assigned an importance weight based on the number and magnitude of both view changes and turns.
  5. Categorization and Pruning: Trajectories are categorized into $N \times M$ types ($N$ primary directions, $M$ main turns). To balance the dataset, redundant trajectory types are pruned by removing trajectories with lower importance scores, resulting in a more uniform distribution.
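As referenced in step 1, the following sketch illustrates one plausible keypoint test: fit a dominant direction (via PCA) to the $n$ camera positions before and after each point and flag the point when the angle between the two directions exceeds $\gamma$. The fitting method and parameter values are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def _direction(points):
    """Dominant direction of a set of 3D points via PCA (first right singular vector)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0] / np.linalg.norm(vt[0])

def trajectory_keypoints(positions, n=5, gamma_deg=30.0):
    """Indices of camera positions where the before/after directions differ by more than gamma."""
    positions = np.asarray(positions)          # (T, 3) camera centers
    keypoints = []
    for i in range(n, len(positions) - n):
        d_before = _direction(positions[i - n:i])
        d_after = _direction(positions[i:i + n])
        cos = np.clip(np.abs(d_before @ d_after), 0.0, 1.0)   # sign-invariant angle between lines
        if np.degrees(np.arccos(cos)) > gamma_deg:
            keypoints.append(i)
    return keypoints
```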

4.3. Adding Camera Control to Video Generation

The core of CameraCtrl II's model design focuses on integrating camera control while preserving the generation of dynamic content.

4.3.1. Lightweight Camera Injection Module

Previous methods often inject camera features (extracted by a dedicated encoder) into multiple Diffusion Transformer (DiT) layers or convolution layers, which can over-constrain video dynamics. CameraCtrl II proposes a lightweight approach:

  1. Initial Layer Injection: Camera parameters are injected only at the initial layer of the diffusion model.

  2. New Patchify Layer: A new patchify layer is designed specifically for camera tokenization. This layer processes the Plücker embeddings to produce camera features $p_{feat}$ that match the dimensions and downsample ratios of the visual features.

  3. Combination: The visual tokens $z_{feat}$ (from the encoded video latents) and the camera features $p_{feat}$ are combined via element-wise addition ($z_{feat} = z_{feat} + p_{feat}$). This combined feature then flows through the remaining DiT layers.

    This simple injection strategy, as illustrated in Figure 3 (a), effectively guides the generation without overly restricting the model's ability to create dynamic motions.
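A minimal PyTorch sketch of this injection is shown below: a dedicated patchify layer (here a strided 3D convolution, an assumption) maps the per-frame Plücker maps to the visual tokens' channel dimension and downsampling, and the result is added element-wise before the first DiT block. The patch sizes must match the base model's tokenizer, which is not specified here.

```python
import torch
import torch.nn as nn

class CameraPatchify(nn.Module):
    """Hypothetical camera patchify layer: tokenize Plücker maps and add them to visual tokens."""

    def __init__(self, hidden_dim, patch=(1, 2, 2)):
        super().__init__()
        # 6 input channels: (o x d', d') per pixel; stride = kernel gives non-overlapping patches
        self.proj = nn.Conv3d(6, hidden_dim, kernel_size=patch, stride=patch)

    def forward(self, visual_tokens, plucker):
        # plucker: (B, 6, T, H, W) -> (B, hidden_dim, T', H', W') -> (B, T'*H'*W', hidden_dim)
        p = self.proj(plucker).flatten(2).transpose(1, 2)
        return visual_tokens + p   # element-wise addition before the first DiT block
```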

Figure 3. CameraCtrl II's architecture for (a) single-clip generation and (b) clip-wise sequential generation.

The figure illustrates camera representation injection and video extension: Plücker embeddings are patchified into camera tokens and added to the visual tokens before the pre-trained DiT blocks, with combined features of shape $\mathbb{R}^{(n \times h \times w) \times c}$.

4.3.2. Joint Training with Camera-labeled and Unlabeled Data

To ensure the model retains its capacity for diverse content generation (which REALCAM alone might not fully cover) and to enable advanced control techniques, a joint training strategy is employed:

  1. Labeled Data: For videos in REALCAM (camera-labeled data), the actual Plücker embeddings are used as the condition input ss.
  2. Unlabeled Data: For videos without camera annotations (from the pre-training dataset), an all-zero dummy Plücker embedding is used as the condition input ss.
  3. Benefits:
    • Preserves Diversity: Exposes the model to a wider range of scenes and motions than REALCAM alone.
    • Enables Classifier-Free Guidance (CFG): This joint training setup naturally allows for the implementation of camera classifier-free guidance, analogous to text CFG. During training, the model learns to predict noise both with and without camera conditions.

4.3.3. Camera Classifier-Free Guidance (CFG)

During inference, classifier-free guidance is applied to enhance both text and camera control accuracy. The final noise prediction $\hat{\epsilon}_\theta$ is a weighted combination of multiple predictions:

$ \hat{\epsilon}_\theta(z_t, c, s, t) = \epsilon_\theta(z_t, \phi_{text}, \phi_{cam}) + w_{text} \left( \epsilon_\theta(z_t, c, \phi_{cam}) - \epsilon_\theta(z_t, \phi_{text}, \phi_{cam}) \right) + w_{cam} \left( \epsilon_\theta(z_t, c, s) - \epsilon_\theta(z_t, c, \phi_{cam}) \right) $ Where:

  • $z_t$ denotes the noised latent at timestep $t$.
  • $\epsilon_\theta$ represents the denoising network's noise prediction.
  • $\phi_{text}$ and $\phi_{cam}$ indicate null conditioning for text and camera, respectively, i.e., an input that provides no specific guidance for that condition.
  • $w_{text}$ and $w_{cam}$ are guidance weights for text and camera conditions. These weights control how strongly the generation should adhere to the respective conditions. The equation combines three terms:
  1. $\epsilon_\theta(z_t, \phi_{text}, \phi_{cam})$: The unconditional noise prediction (no text, no camera guidance).

  2. $w_{text} \left( \epsilon_\theta(z_t, c, \phi_{cam}) - \epsilon_\theta(z_t, \phi_{text}, \phi_{cam}) \right)$: The text guidance term, amplifying the difference between text-conditioned and unconditioned predictions.

  3. $w_{cam} \left( \epsilon_\theta(z_t, c, s) - \epsilon_\theta(z_t, c, \phi_{cam}) \right)$: The camera guidance term, amplifying the difference between camera-conditioned and camera-unconditioned predictions (while still being text-conditioned).

    This formulation allows for precise control over the influence of both text and camera conditions during video generation.
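A minimal sketch of how these three predictions could be combined at each sampling step is shown below; the `denoiser` call signature is hypothetical.

```python
import torch

def camera_cfg(denoiser, z_t, t, text, cam, null_text, null_cam, w_text=7.5, w_cam=8.0):
    """Combine unconditional, text-conditioned, and text+camera-conditioned noise predictions
    following the guidance equation above (the network interface is an assumption)."""
    eps_uncond = denoiser(z_t, null_text, null_cam, t)   # no text, no camera guidance
    eps_text = denoiser(z_t, text, null_cam, t)          # text-conditioned only
    eps_full = denoiser(z_t, text, cam, t)               # text + camera conditioned
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)           # text guidance term
            + w_cam * (eps_full - eps_text))             # camera guidance term
```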

4.4. Sequential Video Generation for Scene Exploration

To enable broader scene exploration, CameraCtrl II extends its single-clip generation capability to a clip-wise autoregressive video generation scheme.

4.4.1. Clip-level Video Extension for Scene Exploration

This technique allows generating multiple coherent video clips sequentially:

  1. Contextual Conditioning: For generating a new clip $(i+1)$, visual tokens $z_0^i$ are extracted from the last $n$ frames of the previously generated video clip $i$. These frames act as contextual conditioning.

  2. Noise Addition: Noise is added to the visual tokens $z_0^{i+1}$ of the current clip $(i+1)$ to obtain $z_t^{i+1}$.

  3. Token Concatenation: The noisy tokens of the current clip $z_t^{i+1}$ are concatenated along the sequence dimension with the clean conditioning tokens from the previous clip $z_0^i$ to form $z_t = [z_0^i; z_t^{i+1}] \in \mathbb{R}^{q \times c}$, where $q$ is the total token count and $c$ is the channel dimension.

  4. Binary Mask: A binary mask $m \in \mathbb{R}^{q \times 1}$ is introduced: 1 for conditioning tokens (from $z_0^i$) and 0 for tokens being generated (from $z_t^{i+1}$). This mask is concatenated with $z_t$ along the channel dimension to form $\boldsymbol{z}_t = [z_t; m] \in \mathbb{R}^{q \times (c+1)}$.

  5. Denoising and Loss Calculation: The model (from Section 4.3) takes this combined feature $\boldsymbol{z}_t$ along with the corresponding Plücker embeddings for the new camera trajectory. The loss from Eq. (1) is then computed only over tokens from the generated clip (i.e., where $m = 0$). This process is illustrated in Figure 3 (b).
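The following sketch illustrates the token concatenation, the mask channel, and the restriction of the prediction to the new clip's tokens; the tensor shapes and the `denoiser_step` interface are assumptions for illustration.

```python
import torch

def extend_clip_step(denoiser_step, z_prev_clean, z_new_noisy, plucker, t):
    """One denoising step for clip-wise extension (all interfaces are hypothetical).

    z_prev_clean: (B, n_cond, C) clean tokens from the end of the previous clip (conditioning).
    z_new_noisy:  (B, n_new, C) noisy tokens of the clip currently being generated.
    """
    B, n_cond, _ = z_prev_clean.shape
    z = torch.cat([z_prev_clean, z_new_noisy], dim=1)                  # concat along sequence dim
    mask = torch.cat([torch.ones(B, n_cond, 1),                        # 1 = conditioning tokens
                      torch.zeros(B, z_new_noisy.shape[1], 1)], dim=1) # 0 = tokens to generate
    z = torch.cat([z, mask.to(z)], dim=-1)                             # append binary mask channel
    eps_hat = denoiser_step(z, plucker, t)                             # (B, n_cond + n_new, C)
    return eps_hat[:, n_cond:]                                         # loss/prediction on new tokens only
```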


4.4.2. Global Coordinate System for Consistency

To ensure geometric consistency throughout an extended sequence and prevent the accumulation of pose errors, a unified coordinate system is used:

  • The first frame of the initial trajectory serves as the global reference for calculating relative poses across all generated clips. This means all subsequent camera trajectories are defined relative to this initial global frame.

4.4.3. Model Distillation for Speedup

To accelerate inference and improve user experience, a two-phase distillation approach is implemented:

  1. Progressive Distillation [44]: This technique reduces the number of neural function evaluations (NFEs) required for sampling. The original model required 96 NFEs (32 for unconditional, 32 for text CFG, 32 for camera CFG). Progressive distillation reduced this to 16 NFEs. The following are the results from Table 1 of the original paper:

    Method TransErr↓ RotErr↓ Sample time (s)↓
    Before distillation 0.1892 1.66 13.83
    Progressive dist. [44] 0.2001 1.90 2.61
    APT [34] 0.2500 2.56 0.59

    As shown in Table 1, Progressive distillation significantly reduced sample time (from 13.83s to 2.61s for a 4-second, 12fps video on 4 H800 GPUs) with only a minor degradation in camera control accuracy (TransErr from 0.1892 to 0.2001, RotErr from 1.66 to 1.90).

  2. APT (Diffusion Adversarial Post-Training) [34]: This method is applied for one-step generation, offering even greater speedup. From Table 1, APT further reduced sample time to 0.59s. However, this came at the cost of more noticeable degradation in TransErr (0.2500) and RotErr (2.56), indicating a trade-off between speed and conditional generation quality with current resources.

5. Experimental Setup

5.1. Datasets

CameraCtrl II utilizes a combination of publicly available and newly curated datasets for training and evaluation.

5.1.1. Training Datasets

  • Internal Transformer-based Text-to-Video Diffusion Model's Pre-training Data: The base model is pre-trained on a large, diverse dataset for general video generation, which inherently contains a wide variety of dynamic content and scenes. This provides the foundation for generating dynamic content.
  • REALCAM Dataset (New, Curated):
    • Source: Real-world dynamic videos.
    • Characteristics: Features a high degree of dynamics and precise camera parameter annotations.
    • Curation Process: Involves identifying dynamic foreground objects (using TMO [11]), estimating optical flow (RAFT [50]), selecting videos with sufficient camera movement, estimating camera parameters using VGGSfM [53], scale calibration to a unified metric space (using a metric depth estimator [8] and RANSAC [20]), and balancing camera trajectory distributions to handle overrepresented movement types.
    • Purpose: Crucial for teaching the model to generate dynamic content under camera control, which static datasets failed to do effectively.
  • Data Composition during Joint Training: The training process uses a 4:1 ratio between camera-labeled data (from REALCAM) and unlabeled data (from the base model's pre-training data). This ensures the model learns camera control while retaining its general dynamic generation capabilities.

5.1.2. Evaluation Dataset

  • Composition: Consists of 800 video clips.
    • 240 videos sampled from the RealEstate10K [62] test set.
    • 560 videos from the processed real-world dynamic videos with camera annotations (presumably a held-out portion of REALCAM or similar).
  • Selection: Videos are sampled across different camera trajectory categories, reflecting the balancing efforts during data curation (as described in Section 3.2).
  • Purpose: To comprehensively evaluate the model's performance on dynamic videos, camera control accuracy, and scene exploration capabilities across diverse scenarios.

5.2. Evaluation Metrics

CameraCtrl II employs six metrics to provide a comprehensive evaluation, covering visual quality, dynamism, camera control, and consistency.

5.2.1. Visual Quality: Fréchet Video Distance (FVD)

  • Conceptual Definition: Fréchet Video Distance (FVD) [51] is a metric used to assess the quality and diversity of generated videos by comparing their feature distributions to those of real videos. It measures the "distance" between the distributions of features extracted from real and generated video clips. A lower FVD score indicates higher quality and realism, suggesting that the generated videos are more similar to real videos in terms of their visual and temporal characteristics.
  • Mathematical Formula: FVD is computed using the Fréchet Inception Distance (FID) formula, but applied to video features. The formula for the Fréchet distance between two multivariate Gaussian distributions $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ is: $ \mathrm{FVD} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
  • Symbol Explanation:
    • $\mu_1$ and $\Sigma_1$: The mean and covariance matrix of features extracted from the real video dataset.
    • $\mu_2$ and $\Sigma_2$: The mean and covariance matrix of features extracted from the generated video dataset.
    • $\|\cdot\|_2^2$: The squared Euclidean distance (L2 norm) between the mean vectors.
    • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).
    • $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
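A minimal sketch of this computation, given feature matrices already extracted from real and generated videos (one row per video), might look as follows; the feature extractor itself is not shown.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets, each of shape (num_videos, dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):            # discard small numerical imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))
```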

5.2.2. Video Dynamic Fidelity: Motion Strength

  • Conceptual Definition: Motion strength is a proposed metric to quantify the degree of dynamism within generated videos. It specifically measures the average magnitude of motion of foreground objects across video frames, isolating object movement from camera movement. A higher motion strength indicates more dynamic content.
  • Mathematical Formula:
    1. Dense Optical Flow: For each video, RAFT extracts dense optical flow fields $(u_x, u_y)$ for each pixel $p = (x, y)$ and frame.
    2. Foreground Masking: TMO-generated segmentation masks are applied to the flow fields to identify foreground objects.
    3. Pixel-wise Motion Magnitude: For each foreground pixel, the motion magnitude is calculated as $\sqrt{u_x^2 + u_y^2}$.
    4. Average Motion: The final motion strength for a video is the average of these motion magnitudes across all foreground pixels in all frames. The paper mentions converting from radians to degrees, which implies a potential step in calculating angular velocity or direction, but the final output is a magnitude. Assuming the flow components are given in pixel units, the magnitude is typically a pixel displacement. Note: The paper does not provide a formal mathematical formula, so the explanation is derived from the conceptual description.
  • Symbol Explanation:
    • $u_x, u_y$: Components of the optical flow vector at a pixel.
    • $\sqrt{u_x^2 + u_y^2}$: Magnitude of the optical flow vector.
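A minimal sketch of this metric, assuming precomputed flow fields and foreground masks (e.g., from RAFT and TMO), is given below.

```python
import numpy as np

def motion_strength(flows, fg_masks):
    """Average optical-flow magnitude over foreground pixels across all frames.

    flows:    (T, H, W, 2) dense flow per frame.
    fg_masks: (T, H, W) boolean foreground masks.
    """
    mag = np.sqrt(flows[..., 0] ** 2 + flows[..., 1] ** 2)   # per-pixel flow magnitude
    fg = fg_masks.astype(bool)
    return float(mag[fg].mean()) if fg.any() else 0.0        # average over foreground pixels only
```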

5.2.3. Camera Control Accuracy: TransErr and RotErr

  • Conceptual Definition: TransErr (Translation Error) and RotErr (Rotation Error) measure how accurately the generated video's camera poses align with the conditioned camera poses. They quantify the discrepancy between the desired camera trajectory and the trajectory estimated from the generated content. Lower values indicate better camera control.
  • Mathematical Formula:
    1. Pose Estimation: TMO [11] is used to extract motion patterns from generated videos, and VGGSfM [53] is used to estimate camera parameters from these patterns.
    2. Trajectory Alignment: The estimated camera trajectory is aligned to the ground truth (conditioned) trajectory using Absolute Trajectory Error (ATE) [48]. This involves:
      • Centering both trajectories.
      • Finding an optimal scale factor.
      • Computing optimal rotation via Singular Value Decomposition (SVD).
      • Determining alignment translation.
    3. TransErr: Average Euclidean distance between corresponding camera positions after alignment. $ \mathrm{TransErr} = \frac{1}{N} \sum_{i=1}^N \|\mathbf{t}_{est,i} - \mathbf{t}_{gt,i}\|_2 $
    4. RotErr: Average angular difference (e.g., in degrees) between corresponding camera orientations (rotation matrices) after alignment. The angular difference between two rotation matrices $\mathbf{R}_1$ and $\mathbf{R}_2$ is typically computed as $\mathrm{acos}((\mathrm{Tr}(\mathbf{R}_1^T \mathbf{R}_2) - 1)/2)$, converted to degrees. $ \mathrm{RotErr} = \frac{1}{N} \sum_{i=1}^N \mathrm{acos}\left(\frac{\mathrm{Tr}(\mathbf{R}_{est,i}^T \mathbf{R}_{gt,i}) - 1}{2}\right) \times \frac{180}{\pi} $
  • Symbol Explanation:
    • $N$: Number of frames/poses in the trajectory.
    • $\mathbf{t}_{est,i}$: Estimated camera translation vector for frame $i$.
    • $\mathbf{t}_{gt,i}$: Ground truth (conditioned) camera translation vector for frame $i$.
    • $\mathbf{R}_{est,i}$: Estimated camera rotation matrix for frame $i$.
    • $\mathbf{R}_{gt,i}$: Ground truth (conditioned) camera rotation matrix for frame $i$.
    • $\|\cdot\|_2$: Euclidean norm.
    • $\mathrm{Tr}(\cdot)$: Trace of a matrix.
    • $\mathrm{acos}(\cdot)$: Arccosine function.
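Assuming the estimated and ground-truth trajectories have already been aligned (ATE-style scale, rotation, and translation alignment), the two errors can be computed as in the following sketch.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """TransErr and RotErr over N aligned poses.

    R_est, R_gt: (N, 3, 3) rotation matrices; t_est, t_gt: (N, 3) translations.
    """
    trans_err = np.mean(np.linalg.norm(t_est - t_gt, axis=-1))           # mean Euclidean distance
    # trace(R_est^T R_gt) equals the sum of elementwise products of the two matrices
    cos = (np.einsum('nij,nij->n', R_est, R_gt) - 1.0) / 2.0
    rot_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()      # mean angular error in degrees
    return trans_err, rot_err
```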

5.2.4. Geometry Consistency

  • Conceptual Definition: Geometric consistency assesses the 3D plausibility and reconstructibility of a generated scene. It measures the success rate of VGGSfM in estimating camera parameters from the generated videos. A higher successful ratio implies that the generated video frames are geometrically consistent enough for a 3D reconstruction algorithm to accurately determine camera poses and scene structure.
  • Mathematical Formula: $ \mathrm{Geometric\ Consistency} = \frac{\text{Number of videos with successful SfM estimation}}{\text{Total number of generated videos}} \times 100\% $
  • Symbol Explanation:
    • Successful SfM estimation: Refers to VGGSfM successfully completing the 3D reconstruction process and outputting valid camera parameters for the given video.

5.2.5. Appearance Consistency

  • Conceptual Definition: Appearance consistency evaluates how visually consistent consecutive video clips are when generated sequentially to explore a scene. When extending a scene, the subsequent clips should maintain the same scene content and visual style.
  • Mathematical Formula:
    1. Feature Extraction: A pre-trained visual encoder (e.g., CLIP [43]) extracts features for each frame in a video clip.
    2. Video Feature Averaging: These frame features are averaged to obtain a single video feature for each clip.
    3. Cosine Similarity: The cosine similarity is computed between the video features of adjacent clips (or perhaps all clips in a sequence, though the description implies consecutive). Higher cosine similarity indicates better appearance consistency. $ \mathrm{Cosine\ Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \cdot ||\mathbf{B}||} $
  • Symbol Explanation:
    • $\mathbf{A}, \mathbf{B}$: Video feature vectors for two clips.
    • \cdot: Dot product.
    • ||\cdot||: Euclidean norm (magnitude) of a vector.

5.3. Baselines

CameraCtrl II compares its performance against several representative existing methods in two main settings: Image-to-Video (I2V) and Text-to-Video (T2V) generation with camera control.

5.3.1. Image-to-Video (I2V) Setting Baselines

In this setting, the model takes an initial image and camera trajectory to generate a video. Since these baselines cannot directly extend videos, the last frame of a generated clip is used as the condition image for the next clip.

  • MotionCtrl [54]: A unified and flexible motion controller for video generation. It injects camera parameters (point trajectories) to guide video synthesis.
  • CameraCtrl [21]: A method specifically designed to enable camera control for text-to-video generation, often using Plücker embeddings for camera conditioning.

5.3.2. Text-to-Video (T2V) Setting Baselines

In this setting, the model takes a text prompt and a camera trajectory to generate a video.

  • AC3D [2]: Focuses on analyzing and improving 3D camera control in video diffusion transformers, with careful design of camera representation injection.

    These baselines are representative as they are prominent works in camera-controlled video generation and utilize different approaches for injecting camera information. The paper notes that due to differing camera parameter support, camera parameters are temporally downsampled for all baselines to ensure a fair comparison.

5.4. Implementation Details

  • Base Model: An internal transformer-based text-to-video diffusion model with approximately 3 billion parameters.
  • Latent Diffusion Model: Uses a temporal causal VAE tokenizer (similar to MAGViT2 [60]) with a downsampling rate of 4 for temporal and 8 for spatial dimensions.
  • Camera Pose Sampling: Camera poses are sampled every 4 frames, matching the number of visual features.
  • Training Strategy:
    • All parameters of the base video diffusion model are unfrozen and jointly optimized.
    • Phase 1 (Single-Clip CameraCtrl II):
      • Resolution: $192 \times 320$.
      • Steps: 100,000 steps.
      • Batch Size: 640.
      • Video Clip Duration: 2 to 10 seconds.
      • Data Composition: 4:1 ratio of camera-labeled (REALCAM) to unlabeled data.
    • Phase 2 (Finetuning and Video Extension):
      • Resolution: Higher resolution of $384 \times 640$.
      • Steps: 50,000 steps.
      • Batch Size: 512.
      • Condition Frames for Extension: The number of condition frames from the previous clip ranges from a minimum of 5 frames to a maximum of 50% of the total frames in the new clip.
  • Optimizer: AdamW optimizer.
  • Learning Rate:
    • Initial: $1 \times 10^{-4}$.
    • Warm-up: $5 \times 10^{-5}$ over 500 steps.
    • Decay: Cosine learning rate scheduler, decaying to $1 \times 10^{-5}$.
  • Other Hyperparameters: Weight decay of 0.01, betas of 0.9 and 0.95.
  • Computational Resources:
    • Phase 1: 64 H100 GPUs.
    • Phase 2: 128 H100 GPUs.
  • Inference:
    • Sampler: Euler sampler with 32 steps and a shift of 12 [31].
    • CFG Scales: $w_{text} = 7.5$ for text, $w_{cam} = 8.0$ for camera.

6. Results & Analysis

6.1. Core Results Analysis

The experiments quantitatively and qualitatively demonstrate CameraCtrl II's superior performance in dynamic scene synthesis with precise camera control and extended spatial exploration compared to existing approaches.

The following are the results from Table 2 of the original paper:

Setting Model FVD↓ Motion strength↑ TransErr↓ RotErr↓ Geometric consistency↑ Appearance consistency↑
I2V MotionCtrl [54] 221.23 102.21 0.3221 2.78 57.87 0.7431
I2V CameraCtrl [21] 199.53 133.37 0.2812 2.81 52.12 0.7784
I2V CameraCtrl II 73.11 698.51 0.1527 1.58 88.70 0.8893
T2V AC3D [2] 987.34 162.21 0.2976 2.98 69.20 N/A
T2V CameraCtrl II 641.23 574.21 0.1892 1.66 85.00 N/A

Note: The first three rows correspond to the I2V setting and the last two to the T2V setting. Appearance consistency is listed as N/A for the T2V models, likely because the metric was not applicable or not reported in that setting.

6.1.1. Quantitative Comparison

As presented in Table 2, CameraCtrl II consistently and significantly outperforms baseline methods across all relevant metrics in both I2V and T2V settings.

  • I2V Setting (vs. MotionCtrl, CameraCtrl):

    • Visual Quality (FVD): CameraCtrl II achieves an FVD of 73.11, substantially lower than MotionCtrl (221.23) and CameraCtrl (199.53). This indicates that CameraCtrl II generates videos that are much closer in distribution to real videos, demonstrating higher overall quality and realism.
    • Motion Strength: CameraCtrl II boasts a Motion strength of 698.51, dramatically higher than MotionCtrl (102.21) and CameraCtrl (133.37). This is a crucial validation of its core claim: CameraCtrl II effectively preserves and enhances dynamic content generation, overcoming a major limitation of previous models.
    • Camera Control Accuracy (TransErr, RotErr): CameraCtrl II shows superior camera control with lower TransErr (0.1527 vs. 0.3221/0.2812) and RotErr (1.58 vs. 2.78/2.81). This indicates better alignment between the generated camera poses and the conditioned trajectories.
    • Geometric Consistency: With an 88.70% Geometric consistency score, CameraCtrl II's generated videos are more amenable to 3D reconstruction than those from MotionCtrl (57.87%) or CameraCtrl (52.12%), highlighting better 3D plausibility.
    • Appearance Consistency: CameraCtrl II achieves 0.8893, higher than MotionCtrl (0.7431) and CameraCtrl (0.7784), demonstrating its ability to maintain visual coherence across sequentially generated clips.
  • T2V Setting (vs. AC3D):

    • Even in the T2V setting, where the FVD is generally higher (possibly due to the complexity of T2V and the specific evaluation data), CameraCtrl II (641.23 FVD) significantly outperforms AC3D (987.34 FVD).
    • Similar trends are observed for Motion strength (574.21 vs. 162.21), TransErr (0.1892 vs. 0.2976), RotErr (1.66 vs. 2.98), and Geometric consistency (85.00% vs. 69.20%). This confirms CameraCtrl II's robustness and superior performance even when conditioning on text prompts in addition to camera control.

6.1.2. Qualitative Comparison

Figure 4 provides qualitative examples that visually reinforce the quantitative findings.

Figure 4. Qualitative comparison of CameraCtrl II with CameraCtrl and AC3D. The figure shows generated video sequences with different camera movements. CameraCtrl II excels in accurately following camera trajectories and generating dynamic content (e.g., moving cars), whereas CameraCtrl tends to ignore specific camera movements (like upward motion) and generates static scenes. AC3D struggles with complex camera movements (ignoring forward motion) and text prompts (failing to generate a bus).


  • I2V Setting (CameraCtrl II vs. CameraCtrl):

    • The first two rows in Figure 4 show that CameraCtrl II more accurately follows the input camera trajectories (e.g., upward movements) compared to CameraCtrl [21], which tends to ignore certain camera movements.
    • CameraCtrl II generates more dynamic videos, such as moving cars, while CameraCtrl produces largely static scenes, visually confirming the higher Motion strength metric.
  • T2V Setting (CameraCtrl II vs. AC3D):

    • The third and fourth rows illustrate CameraCtrl II's ability to effectively combine camera control with object motion. It successfully generates dynamic elements like moving vehicles (e.g., a bus) while adhering to the camera path.

    • In contrast, AC3D [2] fails to strictly follow the text prompt (not generating a bus) and ignores forward camera motion, resulting in less dynamic and less accurate generations.

      Overall, the results consistently demonstrate that CameraCtrl II achieves superior visual quality, significantly higher dynamic content, more accurate camera control, and better consistency for sequential scene exploration compared to state-of-the-art baselines.

6.2. Ablation Studies / Parameter Analysis

Ablation studies are crucial for validating the design choices of CameraCtrl II. The experiments are conducted at a resolution of 192×384 with 50,000 steps of single-clip training, followed by 30,000 steps of multi-clip video extension training.

6.2.1. Effectiveness of Each Component of Data Construction Pipeline

The following are the results from Table 3 of the original paper:

| Model | Motion strength↑ | TransErr↓ | RotErr↓ | Geometric consistency↑ |
| --- | --- | --- | --- | --- |
| w/o Dyn. Vid | 129.40 | 0.2069 | 2.02 | 78.50 |
| w/o Scale Calib. | 301.68 | 0.2121 | 2.14 | 82.10 |
| w/o Dist. Balance | 309.24 | 0.2834 | 4.56 | 85.96 |
| Full Pipeline | 306.99 | 0.1830 | 1.74 | 86.50 |

  • w/o Dyn. Vid (Training only with static data like RealEstate10K):

    • Motion strength (129.40) is significantly lower than the Full Pipeline (306.99). This confirms that training on dynamic videos with camera annotations is absolutely crucial for enabling high-dynamic content generation.
    • Camera control abilities (TransErr, RotErr, Geometric Consistency) also degrade, highlighting the holistic benefit of dynamic data.
  • w/o Scale Calib. (Removing scale calibration):

    • TransErr (0.2121 vs 0.1830), RotErr (2.14 vs 1.74) increase, and Geometric consistency (82.10 vs 86.50) decreases. This validates that normalizing scene scales to a unified metric space helps the model learn consistent geometric relationships, leading to more accurate camera control and easier 3D reconstruction.
  • w/o Dist. Balance (Removing trajectory distribution balancing):

    • Significant degradation in RotErr (4.56 vs 1.74) and TransErr (0.2834 vs 0.1830). This demonstrates the importance of balancing the long-tailed distribution of camera trajectory types. Without it, the model overfits common movements and performs poorly on diverse camera paths.

      Conclusion: All components of the data curation pipeline—incorporating dynamic videos, scale calibration, and distribution balancing—are essential for achieving high-quality, dynamic, and accurately camera-controlled video generation.
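
The paper's exact balancing procedure is not reproduced in this section; the sketch below shows one standard way to counter a long-tailed distribution of trajectory types, namely inverse-frequency sampling weights over discrete trajectory labels. The label names (e.g. "forward", "pan_left") are hypothetical and used only for illustration.

```python
from collections import Counter

def inverse_frequency_weights(traj_labels):
    """Per-clip sampling weights that equalize draws across trajectory types.
    traj_labels: list of per-clip trajectory categories, e.g. 'forward', 'pan_left'."""
    counts = Counter(traj_labels)
    weights = [1.0 / counts[label] for label in traj_labels]
    total = sum(weights)
    return [w / total for w in weights]

# Example: clips dominated by forward motion get down-weighted.
labels = ["forward"] * 8 + ["pan_left"] * 2
print(inverse_frequency_weights(labels))  # each rare 'pan_left' clip gets 4x the weight of a 'forward' clip
```

Such weights could feed a weighted sampler in the data loader, so that rare trajectory categories appear roughly as often as common ones during training.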

6.2.2. Effectiveness of Model Design and Training Strategy (Single-Clip Model)

The following are the results from Table 4 of the original paper:

| Model | Motion strength↑ | TransErr↓ | RotErr↓ | Geometric consistency↑ |
| --- | --- | --- | --- | --- |
| Complex Encoder | 301.23 | 0.1826 | 1.88 | 84.00 |
| Multilayer Inj. | 247.23 | 0.1865 | 1.78 | 85.00 |
| w/o Joint Training | 279.82 | 0.2098 | 1.97 | 81.92 |
| CameraCtrl II | 306.99 | 0.1830 | 1.74 | 86.50 |

  • Complex Encoder (Using a more sophisticated camera encoder similar to CameraCtrl [21]):

    • Achieves comparable TransErr (0.1826) to CameraCtrl II (0.1830) but performs slightly worse on RotErr (1.88 vs 1.74) and Geometric consistency (84.00 vs 86.50). More importantly, its Motion strength (301.23) is slightly lower than CameraCtrl II (306.99). This validates that the proposed simple patchify layer for camera tokenization is sufficient and even superior for preserving dynamics.
  • Multilayer Inj. (Injecting camera features at every DiT layer):

    • While camera control accuracy (TransErr, RotErr, Geometric consistency) remains comparable, Motion strength (247.23) significantly reduces compared to CameraCtrl II (306.99). This supports the hypothesis that injecting camera information too deeply or globally can over-constrain the model, limiting its ability to generate dynamic movements. Injecting at the initial layer guides overall generation without stifling dynamics.
  • w/o Joint Training (Training only with camera-labeled data, no unlabeled data):

    • Results in reduced Motion strength (279.82 vs 306.99) and degraded camera control performance (higher TransErr, RotErr, lower Geometric consistency). This confirms the importance of joint training with unlabeled data. It exposes the model to broader visual domains and object motion types, improving generalization and enabling effective camera classifier-free guidance.

      Conclusion: The lightweight camera injection at the initial layer and the joint training strategy are key to CameraCtrl II's ability to combine precise camera control with robust dynamic content generation.
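
The ablation above only names the design (a simple patchify layer for camera tokenization, injected at the initial DiT layer via element-wise addition); the snippet below is a minimal PyTorch sketch of that idea under assumed shapes. The per-pixel camera representation, channel count, patch size, and hidden width are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LightweightCameraInjection(nn.Module):
    """Tokenize a per-frame camera map with a single patchify layer and add it
    element-wise to the video tokens entering the first DiT block only."""
    def __init__(self, cam_channels=6, patch_size=2, hidden_dim=1024):
        super().__init__()
        # A single conv doubles as the patchify layer: it produces one camera
        # token per spatio-temporal video patch, matching the token width.
        self.patchify = nn.Conv3d(cam_channels, hidden_dim,
                                  kernel_size=(1, patch_size, patch_size),
                                  stride=(1, patch_size, patch_size))

    def forward(self, video_tokens, cam_map):
        # video_tokens: (B, N, hidden_dim); cam_map: (B, cam_channels, T, H, W)
        cam_tokens = self.patchify(cam_map)                  # (B, D, T, H/p, W/p)
        cam_tokens = cam_tokens.flatten(2).transpose(1, 2)   # (B, N, D)
        # Element-wise addition at the input of the first block; deeper layers
        # receive no camera features, which helps preserve scene dynamics.
        return video_tokens + cam_tokens
```

In use, this would amount to `tokens = injector(tokens, cam_map)` immediately before the first transformer block, with all remaining blocks left untouched.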

6.2.3. Key Design Choices for Video Extension

The following are the results from Table 5 of the original paper:

| Model | FVD↓ | TransErr↓ | RotErr↓ | Appearance consistency↑ |
| --- | --- | --- | --- | --- |
| Different Ref. | 118.32 | 0.1963 | 1.94 | 0.8032 |
| Noised Condition | 136.78 | 0.1847 | 1.85 | 0.7843 |
| Noised Condition* | 140.98 | 0.1901 | 1.88 | 0.7982 |
| CameraCtrl II | 112.46 | 0.1830 | 1.74 | 0.8654 |

  • Different Ref. (Using each clip's first frame as a local reference for relative camera poses):

    • Leads to higher FVD (118.32 vs 112.46), TransErr (0.1963 vs 0.1830), RotErr (1.94 vs 1.74), and lower Appearance consistency (0.8032 vs 0.8654) compared to CameraCtrl II's approach. This confirms that using a global reference frame (the first frame of the first clip) is crucial for maintaining consistent geometric relationships and enabling smooth transitions across clips in extended sequences.
  • Noised Condition (Adding noise to all clips during training, including conditioning frames, and computing loss over both conditioning and target clips):

    • Significantly degrades performance across all metrics (FVD 136.78, Appearance consistency 0.7843). This is due to a training-inference mismatch where clean conditioning frames are used at inference time, but noisy ones were used in training.
  • Noised Condition* (Attempting to bridge the training-inference gap by adding a small amount of noise to the conditioning frames at inference time):

    • Still yields suboptimal performance (FVD 140.98, Appearance consistency 0.7982); its FVD is even worse than the plain Noised Condition variant, showing that the mismatch cannot simply be patched at inference time.

      Conclusion: The strategy of using clean (noise-free) conditioning frames during both training and inference, combined with the global reference frame for camera poses, is critical for achieving high-quality, coherent, and accurate clip-wise video extension.
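
The "global reference frame" choice can be made concrete with a small pose transform: every clip's extrinsics are re-expressed relative to the first frame of the first clip, rather than relative to each clip's own first frame. The sketch below assumes 4×4 world-to-camera matrices; the paper's exact pose convention is not restated in this section.

```python
import numpy as np

def poses_relative_to_global_ref(poses_w2c, global_ref_w2c):
    """Re-express a clip's camera extrinsics relative to one global reference.
    poses_w2c: (N, 4, 4) world-to-camera matrices for the current clip.
    global_ref_w2c: (4, 4) world-to-camera matrix of the first frame of the
    first clip. Returns (N, 4, 4) poses in the reference camera's coordinates."""
    ref_c2w = np.linalg.inv(global_ref_w2c)
    # x_cam_i = P_i @ x_world and x_world = P_ref^{-1} @ x_cam_ref,
    # so the pose relative to the reference is P_i @ P_ref^{-1} for every frame i.
    return poses_w2c @ ref_c2w
```

Keeping this single reference across all clips is what lets camera trajectories specified for later clips stay geometrically consistent with the content already generated.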

6.3. Visualization Results

6.3.1. Different Scenario Scenes Exploration

Figure 5 showcases the generalization performance of CameraCtrl II across a diverse range of scenarios.

Figure 5. The figure illustrates the dynamic scene exploration capabilities of CameraCtrl II across various environments. It demonstrates the model's ability to handle different scene types, such as Minecraft-like game scenes, historical London streets, abandoned hospitals, fantasy outdoor settings, and anime palaces, while executing complex camera movements like panning and complete turns, all while preserving appropriate dynamic effects.


  • The model can be applied to varied environments including:
    • Minecraft-like game scenes
    • Black and white 19th-century foggy London streets
    • Indoor abandoned hospital
    • Outdoor hiking in a fantasy world
    • Anime-style palace scenes
  • It effectively controls complex camera movements such as panning left and right and complete turns, while maintaining appropriate dynamic effects within the scene. This highlights its strong generalization and control capabilities.

6.3.2. 3D Reconstruction of Generated Scenes

Figure 6 demonstrates the inherent 3D consistency of the videos generated by CameraCtrl II, enabling high-quality 3D reconstruction.

Figure 6. This figure displays videos generated by CameraCtrl II, alongside their corresponding 3D point cloud reconstructions using FLARE. It features various scenes, including indoor and outdoor environments, and close-ups, illustrating the model's capability to generate dynamic content with strong 3D consistency that supports detailed 3D structure inference.


  • The authors use FLARE [61] to infer detailed 3D point clouds from frames extracted from the generated videos.
  • The ability to reconstruct high-quality point clouds from the generated videos serves as strong evidence for the superior 3D consistency achieved by CameraCtrl II, transforming video generative models into effective view synthesizers.
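
As a rough illustration of the "frames extracted from the generated videos" step, the sketch below evenly samples frames from a generated clip with OpenCV; how FLARE itself consumes those frames is not assumed here, and the sampling count is an arbitrary example value.

```python
import cv2

def sample_frames(video_path, num_frames=16):
    """Evenly sample RGB frames from a generated clip for downstream
    3D reconstruction (the reconstruction tool's own API is not assumed)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```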

7. Conclusion & Reflections

7.1. Conclusion Summary

CameraCtrl II introduces a robust framework that significantly advances camera-controlled video diffusion models, enabling large-scale dynamic scene exploration. The paper successfully addresses two critical limitations of prior work: diminished video dynamics and restricted viewpoint ranges. This is achieved through a multi-faceted approach:

  1. Dataset Innovation: The creation of REALCAM, a dynamic video dataset with meticulously calibrated camera parameter annotations, provides the necessary data for learning dynamic camera-controlled generation.

  2. Model Architecture & Training: A lightweight camera injection module, integrated at the initial layer of a Diffusion Transformer, coupled with a joint training strategy using both labeled and unlabeled data, preserves the model's ability to generate dynamic content while enabling precise camera control. The integration of camera classifier-free guidance further refines this control (a minimal guidance sketch follows this summary).

  3. Extended Exploration Capability: A novel clip-level autoregressive generation technique allows for seamless, coherent extension of video sequences, supporting iterative scene exploration across broad spatial ranges. This is underpinned by using a global reference frame for camera poses and conditioning on clean previous frames.

    Experimental results consistently demonstrate CameraCtrl II's superior performance across metrics like visual quality (FVD), motion strength, camera control accuracy (TransErr, RotErr), geometric consistency, and appearance consistency, validating its effectiveness in diverse scenarios.
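
Item 2 above mentions camera classifier-free guidance. The sketch below shows how guidance over the camera condition is commonly applied at sampling time; the guidance weight and the way the unconditional branch is formed (dropping the camera condition, which joint training on unlabeled data makes possible) follow standard classifier-free guidance and are assumptions rather than the paper's exact sampler.

```python
import torch

def camera_cfg(denoiser, x_t, t, cam_cond, guidance_scale=2.0):
    """Classifier-free guidance over the camera condition.
    denoiser(x_t, t, cam) predicts the noise; cam=None means the camera
    condition is dropped (the unconditional branch)."""
    eps_cond = denoiser(x_t, t, cam_cond)
    eps_uncond = denoiser(x_t, t, None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```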

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  1. Conflict Resolution with Scene Geometry: CameraCtrl II occasionally struggles to resolve conflicts between camera movement and scene geometry. For instance, a physically implausible outcome can occur where a camera trajectory through a fence results in the fence structure appearing damaged, rather than the camera stopping or navigating around it.
    • Future Work: Incorporating more sophisticated physics-aware modeling or 3D scene understanding could help the model generate more physically plausible camera paths and object interactions.
  2. Geometric Consistency with Complex Trajectories: While geometric consistency is improved, it could be further enhanced, especially when dealing with highly complex camera trajectories.
    • Future Work: Exploring more explicit 3D representations or geometric priors within the generative process could lead to more robust 3D consistency for intricate camera movements.

7.3. Personal Insights & Critique

CameraCtrl II represents a significant leap forward in controllable video generation, particularly for interactive scene exploration.

Strengths and Inspirations:

  • Holistic Approach: The paper's strength lies in its holistic approach, tackling data, model architecture, and training strategies simultaneously. This comprehensive design is often necessary for breakthroughs in complex generative tasks.

  • Data as a Solution: The emphasis on curating a dedicated dynamic dataset (REALCAM) is a powerful lesson. Often, limitations in generative models are not purely architectural but stem from inadequacies in training data. Proactively addressing this by creating a domain-specific, high-quality dataset is commendable.

  • Elegant Simplicity in Injection: The lightweight camera injection at the initial layer, combined with element-wise addition, is surprisingly effective. It demonstrates that sometimes simpler mechanisms are better for control, especially when aiming to preserve the inherent capabilities (like dynamism) of a powerful pre-trained model. This suggests a principle of minimal intervention for maximal preservation.

  • Seamless Exploration: The clip-wise autoregressive generation with a global reference frame is a brilliant solution for extending video sequences coherently. This opens up possibilities for applications like virtual tourism, gaming environment creation, and interactive storytelling where continuous scene navigation is paramount.

  • Transferability: The methods, particularly the data curation pipeline (SfM, scale calibration, distribution balancing) and the lightweight injection strategy, could potentially be adapted to other forms of controllable generation (e.g., object manipulation, character posing) in video diffusion models, or even extended to 3D asset generation where consistent control is required.

    Potential Issues/Areas for Improvement (Critique):

  • Computational Cost: While distillation is applied for speedup, the training still requires substantial resources (e.g., 128 H100 GPUs for phase 2). This limits accessibility for smaller research groups or individual practitioners. Further research into more efficient training or distillation methods would be beneficial.

  • Failure Case Analysis (Physics): The failure case shown in Figure 7 (camera passing through a fence) highlights a common challenge in generative AI: physical plausibility. While the geometric consistency is good enough for 3D reconstruction, truly "world-simulating" models need to respect physical laws. Future work could integrate implicit or explicit collision detection and response mechanisms to prevent such physically implausible outcomes, for example by incorporating differentiable physics engines or learning stronger priors about object solidity.

  • User Interface for Trajectory Specification: The paper mentions "iteratively specify camera trajectories." The ease and intuitiveness of this specification process are crucial for user experience. While beyond the scope of this technical paper, the practical utility depends heavily on how users define these complex, multi-clip camera paths.

  • Generalization to Novel Objects/Interactions: While the REALCAM dataset helps with dynamic scenes, the model's ability to generate novel objects or complex interactions not seen in the training data under camera control remains an open question, especially in extended sequences.

  • Long-Term Scene Consistency: While Appearance consistency is measured, maintaining perfect scene identity (e.g., specific objects, lighting, background details) over very long, multi-clip explorations is inherently challenging for generative models and will likely require further innovations in memory or explicit scene representations.

    CameraCtrl II is an impressive step towards truly interactive and dynamic world generation, pushing the boundaries of what is possible with video diffusion models. Its innovations provide a solid foundation for future research in controllable content creation.
