- Title: UniVideo: Unified Understanding, Generation, and Editing for Videos
- Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen.
- Affiliations: The authors are from the University of Waterloo and the Kling Team at Kuaishou Technology. This indicates a collaboration between academia and a major tech industry lab known for its work in video technology.
- Journal/Conference: The paper is available as a preprint on arXiv. Preprints are common in fast-moving fields like AI for rapid dissemination of results, often before or during peer review for a major conference (e.g., CVPR, NeurIPS, ICLR).
- Publication Year: 2025. The arXiv identifier 2510.08377 corresponds to an October 2025 submission, which is consistent with the paper citing other 2025 work.
- Abstract: The paper introduces UniVideo, a unified framework designed to extend multimodal understanding, generation, and editing capabilities from the image domain to the more complex video domain. It employs a dual-stream architecture, combining a Multimodal Large Language Model (MLLM) for interpreting complex instructions and a Multimodal Diffusion Transformer (MMDiT) for high-quality video generation. UniVideo is jointly trained on a diverse set of tasks (e.g., text-to-video, image-to-video, in-context editing) and demonstrates performance matching or exceeding specialized, state-of-the-art models. A key finding is its generalization ability: it can compose tasks and transfer editing skills learned from image data to free-form video editing tasks without explicit training. The model also supports novel interaction methods like visual prompting.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2510.08377
- PDF Link: http://arxiv.org/pdf/2510.08377v1
- Project Webpage: https://congwei1230.github.io/UniVideo/
4. Methodology (Core Technology & Implementation)
UniVideo's architecture is a carefully designed dual-stream system to balance high-level understanding with low-level visual fidelity.
Figure: Schematic of the model architecture, illustrating UniVideo's dual-stream design. It combines an MLLM for instruction understanding with an MMDiT for video generation; the MMDiT is internally divided into understanding-stream blocks and generation-stream blocks, and supports encoding and decoding of both images and videos.
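To make the dual-stream wiring concrete, here is a minimal PyTorch-style sketch of how the two conditioning paths could fit together. The module interfaces (`mllm`, `vae`, `mmdit`), the connector, and the assumed hidden sizes are illustrative assumptions based on the paper's description, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class UniVideoSketch(nn.Module):
    """Minimal sketch of a dual-stream MLLM + MMDiT pipeline (illustrative only)."""

    def __init__(self, mllm, vae, mmdit, mllm_hidden=3584, cond_dim=3072):
        super().__init__()
        self.mllm = mllm      # frozen multimodal LLM (e.g., a Qwen2.5-VL-like model)
        self.vae = vae        # video VAE mapping pixels <-> latents
        self.mmdit = mmdit    # multimodal diffusion transformer (the denoiser)
        # Connector projecting MLLM hidden states into the MMDiT conditioning space.
        # 3584 is the assumed Qwen2.5-VL-7B hidden size; cond_dim is an assumption.
        self.connector = nn.Linear(mllm_hidden, cond_dim)

    def forward(self, noisy_latents, timestep, instruction_ids, ref_pixels):
        # Stream 1 (understanding): the MLLM reads the instruction together with the
        # reference images/video and emits high-level semantic tokens.
        with torch.no_grad():
            mllm_states = self.mllm(instruction_ids, images=ref_pixels)
        semantic_cond = self.connector(mllm_states)

        # Stream 2 (fidelity): the same references are VAE-encoded and handed directly
        # to the MMDiT, so fine-grained appearance never has to pass through the MLLM.
        with torch.no_grad():
            ref_latents = self.vae.encode(ref_pixels)

        # The MMDiT denoises the noisy video latents conditioned on both streams.
        return self.mmdit(noisy_latents, timestep,
                          text_cond=semantic_cond, visual_cond=ref_latents)
```

The key design point the sketch captures is that the direct VAE-to-MMDiT pathway carries fine-grained visual detail, while the MLLM pathway carries instruction semantics; the ablation in Section 6 shows what happens when the former is removed.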
5. Experimental Setup
- Datasets: The model is trained on a large, diverse collection of data synthesized for various tasks.
- In-context Generation/Editing: Data was created by leveraging existing tools. For example, SAM2 was used for object segmentation and a video inpainting model was used to create "before" (with object) and "after" (object removed) pairs for deletion/insertion training (a hypothetical pipeline sketch follows Table 1 below).
- Stylization: Pairs of realistic and stylized videos were created using a T2V model and a ControlNet model.
- Image Editing: Data was sourced from open-source datasets like OmniEdit and ImgEdit, and also generated using state-of-the-art models like FLUX.1 Kontext.
- The following table, transcribed from the paper, provides an overview of the training data composition.
Table 1: Overview of the multimodal training data used for UniVideo.
| Task | Input | #Examples |
|---|---|---|
| Text to Image | txt | O(40)M |
| Text to Image (High Quality) | txt | O(10)K |
| Image Reconstruction | image | O(40)M |
| Text to Video | txt | O(10)M |
| Text to Video (High Quality) | txt | O(10)K |
| Image to Video | img+txt | O(10)K |
| Image Editing | img+txt | O(1)M |
| Image Style Transfer | img+txt | O(10)K |
| In-Context Video Editing (swap, addition, delete, style) | ref-img × n + video + txt | O(10)K |
| In-Context Video Generation | ref-img × n + txt | O(10)K |
| In-Context Image Style Transfer | ref-img × n + img + txt | O(10)K |
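As referenced in the In-context Generation/Editing item above, the sketch below illustrates one plausible way to synthesize "before/after" pairs for deletion/insertion training by segmenting an object and inpainting it away. The helpers `segment_object_sam2` and `inpaint_video` are hypothetical wrappers standing in for a SAM2-style video segmenter and a video inpainting model; this is not the paper's actual pipeline code.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class EditingPair:
    """One training example for object deletion/insertion editing."""
    source_frames: List[np.ndarray]   # "before": original clip containing the object
    target_frames: List[np.ndarray]   # "after": clip with the object removed
    instruction: str                  # natural-language edit instruction

def build_deletion_pair(frames: List[np.ndarray], object_name: str,
                        segment_object_sam2: Callable,
                        inpaint_video: Callable) -> EditingPair:
    # 1) Track/segment the named object across frames (SAM2-style video segmentation).
    masks = segment_object_sam2(frames, prompt=object_name)

    # 2) Remove the object with a video inpainting model to obtain the "after" clip.
    inpainted = inpaint_video(frames, masks)

    # 3) Pair the clips with a templated instruction. Swapping source/target
    #    (and rephrasing the instruction) yields an insertion example from the same assets.
    return EditingPair(
        source_frames=frames,
        target_frames=inpainted,
        instruction=f"Remove the {object_name} from the video.",
    )
```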
- Evaluation Metrics:
- Visual Understanding: MMBench, MMMU, MM-Vet. These are standard benchmarks that evaluate an MLLM's ability to answer questions about images/videos, testing its reasoning and perception.
- Video Generation: VBench. A comprehensive benchmark for evaluating video generation quality across multiple dimensions like temporal consistency, motion quality, and aesthetics.
- Identity Consistency: CLIP-I (CLIP image similarity) and DINO-I (DINO feature similarity). These metrics measure how well the identity of a subject from a reference image is preserved in the generated video. Higher is better.
- Prompt Following: CLIP-Score. Measures the semantic similarity between the text prompt and the generated video frames using CLIP. Higher is better. (A sketch of how these CLIP-based metrics can be computed follows this metrics list.)
- Human Evaluation: For subjective qualities, human annotators rated videos on:
- Subject Consistency (SC): Does the subject in the video match the reference?
- Prompt Following (PF): Does the video accurately reflect the text prompt?
- Overall: The overall quality and appeal of the video.
- Stylization Quality: CSD-Score and ArtFID. These metrics measure how well the style of a reference image is transferred to a video while preserving the video's original content.
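As referenced above, here is a minimal sketch of how CLIP-I and CLIP-Score could be computed over sampled frames using Hugging Face's CLIP. The checkpoint choice and frame sampling are assumptions; the paper does not specify its exact evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_i(reference: Image.Image, frames: list) -> float:
    """CLIP-I: mean cosine similarity between the reference image and each generated frame."""
    inputs = processor(images=[reference] + frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    ref, gen = feats[0:1], feats[1:]
    return (gen @ ref.T).mean().item()

@torch.no_grad()
def clip_score(prompt: str, frames: list) -> float:
    """CLIP-Score: mean cosine similarity between the text prompt and each generated frame."""
    img_inputs = processor(images=frames, return_tensors="pt")
    txt_inputs = processor(text=[prompt], return_tensors="pt", padding=True, truncation=True)
    img_feats = model.get_image_features(**img_inputs)
    txt_feats = model.get_text_features(**txt_inputs)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return (img_feats @ txt_feats.T).mean().item()
```

DINO-I follows the same recipe with a DINO image encoder in place of CLIP's image tower.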
- Baselines: UniVideo was compared against a strong set of models, including:
- Open-Source Models: LLaVA-1.5, I2VGen-XL, VACE, UNIC, AnyV2V.
- Closed-Source/Commercial Models: Pika2.2, Kling1.6.
- Other Unified Models: Emu3, Show-o2.
6. Results & Analysis
The experiments robustly demonstrate UniVideo's effectiveness and versatility.
- Core Results:
- Visual Understanding and Generation (Table 3): UniVideo achieves top-tier scores on the understanding benchmarks (MMBench, MMMU, MM-Vet) because it inherits the capabilities of its frozen Qwen-2.5VL-7B MLLM. Simultaneously, it scores competitively on the VBench generation benchmark, showing that integrating the MLLM does not compromise generation quality.
This table is transcribed from the paper.
Table 3: Quantitative comparison on Visual Understanding and Video Generation.
| Model | MMB | MMMU | MM-Vet | VBench T2V |
|---|---|---|---|---|
| *Video Understanding Model* | | | | |
| LLaVA-1.5 (Liu et al., 2024a) | 36.4 | 67.8 | 36.3 | × |
| LLaVA-NeXT (Liu et al., 2024b) | 79.3 | 51.1 | 57.4 | × |
| *Video Generation Model* | | | | |
| CogVideoX (T2V/I2V) | × | × | × | 81.61 |
| I2VGen-XL | × | × | × | × |
| HunyuanVideo (T2V/I2V) | × | × | × | 83.24 |
| Step-Video (T2V/TI2V) | × | × | × | 81.83 |
| Wan2.1 (T2V/I2V) | × | × | × | 84.70 |
| *Unified Understanding & Generation Model* | | | | |
| Emu3 | 58.5 | 31.6 | 37.2 | 80.96 |
| Show-o2 | 79.3 | 48.9 | 56.6 | 81.34 |
| UniVideo* | 83.5 | 58.6 | 66.6 | 82.58 |
*Understanding-task results for UniVideo are reported using its MLLM component, Qwen-2.5VL-7B.
- In-Context Generation and Editing (Tables 4 & 5): Across both in-context generation and editing tasks, UniVideo consistently performs at or near the top, often outperforming specialized commercial models. Notably, in Table 4, it achieves the highest Subject Consistency (SC) score, a critical metric for this task. In Table 5, it achieves these strong results in a more challenging mask-free setting, where it must infer the editing region from instructions alone, unlike baselines that require an explicit mask.
This table is transcribed from the paper.
Table 4: Quantitative comparison on In-Context Generation.
| Model | SC↑ | PF↑ | Overall↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑ |
|---|---|---|---|---|---|---|
| *Single Reference Generation* | | | | | | |
| VACE | 0.31 | 0.65 | 0.42 | 0.922 | 40.341 | 5.426 |
| Kling1.6 | 0.68 | 0.95 | 0.88 | 0.938 | 86.641 | 5.896 |
| Pika2.2 | 0.45 | 0.43 | 0.15 | 0.928 | 104.768 | 5.125 |
| UniVideo | 0.88 | 0.93 | 0.95 | 0.943 | 56.336 | 5.740 |
| *Multi Reference (≥ 2) Generation* | | | | | | |
| VACE | 0.48 | 0.53 | 0.48 | 0.53 | | |
| Kling1.6 | 0.73 | 0.45 | 0.95 | 0.916 | 65.606 | 5.941 |
| Pika2.2 | 0.71 | 0.48 | 0.43 | 0.898 | 76.796 | 5.176 |
| UniVideo | 0.81 | 0.75 | 0.85 | 0.942 | 59.393 | 6.128 |
- Ablations / Parameter Sensitivity:
- Multi-task vs. Single-task (Table 6): An ablation study compared the final multi-task UniVideo model against single-task versions (same architecture but trained only on one task's data). The multi-task model performed significantly better across the board, with an average overall score of 0.85 compared to 0.79. This confirms that joint training allows the model to leverage knowledge across tasks (e.g., using image editing data to improve video editing).
- Importance of Dual-Stream Visual Input (Table 7): Another ablation removed the direct visual input stream to the MMDiT, forcing all visual information to pass through the MLLM. This caused a catastrophic drop in performance, especially for Subject Consistency (SC fell from 0.78 to 0.18). This result validates the dual-stream design, proving that the direct VAE-to-MMDiT pathway is essential for preserving the fine-grained visual details needed for consistent generation and editing.
- Zero-Shot Generalization and Visual Prompting: The paper's most exciting findings are highlighted in Figures 5 and 6.
- Zero-Shot Generalization: UniVideo demonstrates two powerful generalization capabilities. It can compose tasks it was trained on separately (e.g., adding a character and changing the style) and, more impressively, perform entirely new free-form video editing tasks (e.g., turning a woman into glass, green-screening a subject) by transferring skills learned from its large image editing dataset.
Figure: Examples of UniVideo's video generation and editing. The left shows the original video frames, the middle shows a green-screen background replacement, and the right shows editing the material of a person in the video, illustrating the model's unified multi-task video editing capability.
- Visual Prompting: The model successfully interprets complex visual prompts, like a hand-drawn storyboard, to generate a coherent video sequence. This showcases the advanced reasoning enabled by its MLLM component.
Figure: A schematic showing a three-step scene transition: a dynamic low-angle tracking shot, a car bursting out in an explosion, and a tunnel opening onto a coastal city. Arrows indicate the order of actions, conveying the narrative flow of the video.
7. Conclusion & Reflections
- Conclusion Summary: The paper presents UniVideo, a unified and powerful framework that significantly advances the state of video generation and editing. By integrating a strong MLLM for understanding and an MMDiT for generation in a dual-stream architecture, UniVideo achieves SOTA performance on a wide range of tasks. Its most significant contribution is demonstrating that a unified, multi-task training approach leads to remarkable generalization, allowing the model to compose tasks and transfer skills across modalities (image-to-video) in a zero-shot manner.
- Limitations & Future Work: The authors are transparent about the model's limitations:
- It can sometimes "over-edit" regions not mentioned in the prompt.
- Its ability to preserve the original motion in a video is limited by the backbone model.
- While it can perform free-form video editing, its success rate is lower than for image editing, highlighting the inherent difficulty of the video domain.
- Future work aims to address these issues by exploring better video backbones, curating large-scale video editing datasets, and eventually moving from an "assembled" system of pre-trained parts to a native, end-to-end trained multimodal video model.
- Personal Insights & Critique:
- UniVideo represents a significant and practical step towards creating a general-purpose "AI video assistant." The dual-stream architecture is an intelligent engineering choice, effectively leveraging the strengths of massive pre-trained models (MLLM for reasoning, DiT for synthesis) without requiring training from scratch.
- The demonstrated zero-shot transfer from image editing to video editing is a compelling result. It suggests that with a sufficiently general architecture and diverse training data, models can learn abstract concepts of "editing" that are not tied to a single modality. This has profound implications for data efficiency and model capability.
- The distinction between an "assembled" system and a "native" end-to-end model is important. While UniVideo is highly effective, it relies on combining existing expert models. The next frontier will be to build a single model that learns all these capabilities from the ground up, which may unlock even deeper forms of reasoning and creativity.
- Overall, UniVideo is a landmark paper that not only delivers a highly capable system but also provides a clear blueprint and strong evidence for the power of unified modeling in the video domain.