UniVideo: Unified Understanding, Generation, and Editing for Videos
TL;DR Summary
UniVideo introduces a unified framework for video understanding, generation, and editing. Using a dual-stream design that pairs an MLLM for instruction comprehension with an MMDiT for video synthesis, it unifies diverse video tasks under one paradigm. UniVideo matches or surpasses SOTA task-specific models and generalizes to task composition and zero-shot free-form video editing.
Abstract
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
English Analysis
1. Bibliographic Information
- Title: UniVideo: Unified Understanding, Generation, and Editing for Videos
- Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen.
- Affiliations: The authors are from the University of Waterloo and the Kling Team at Kuaishou Technology. This indicates a collaboration between academia and a major tech industry lab known for its work in video technology.
- Journal/Conference: The paper is available as a preprint on arXiv. Preprints are common in fast-moving fields like AI for rapid dissemination of results, often before or during peer review for a major conference (e.g., CVPR, NeurIPS, ICLR).
- Publication Year: The arXiv identifier 2510.08377 corresponds to an October 2025 submission, and the paper cites work from 2025, so UniVideo is a very recent (2025) release.
- Abstract: The paper introduces UniVideo, a unified framework designed to extend multimodal understanding, generation, and editing capabilities from the image domain to the more complex video domain. It employs a dual-stream architecture, combining a Multimodal Large Language Model (MLLM) for interpreting complex instructions and a Multimodal Diffusion Transformer (MMDiT) for high-quality video generation. UniVideo is jointly trained on a diverse set of tasks (e.g., text-to-video, image-to-video, in-context editing) and demonstrates performance matching or exceeding specialized, state-of-the-art models. A key finding is its generalization ability: it can compose tasks and transfer editing skills learned from image data to free-form video editing tasks without explicit training. The model also supports novel interaction methods like visual prompting.
- Original Source Link:
  - arXiv Link: https://arxiv.org/abs/2510.08377
  - PDF Link: http://arxiv.org/pdf/2510.08377v1
  - Project Webpage: https://congwei1230.github.io/UniVideo/
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: While unified multimodal models have shown great success in generating and editing images from complex instructions, the video domain has been left behind. Existing video models are typically specialized for a single task (e.g., text-to-video generation or video editing) and rely on simple text encoders, limiting their ability to understand rich, multimodal instructions (e.g., text combined with reference images and videos).
- Importance & Gaps: This specialization prevents the development of a single, versatile AI assistant that can seamlessly handle a wide range of video-related requests. Key capabilities like in-context video generation (creating a video featuring subjects from multiple reference images), complex free-form editing (e.g., "make the woman look like glass"), and compositional tasks (e.g., "add Superman and change the style to anime") are beyond the reach of any single existing model.
- Innovation: UniVideo introduces a unified framework that bridges this gap. Its core innovation is a dual-stream architecture that synergizes the high-level reasoning of an MLLM with the high-fidelity synthesis power of a diffusion model, all trained jointly across a multitude of video tasks.
- Main Contributions / Findings (What):
- A Unified Video Framework: The paper presents UniVideo, a single, coherent framework that unifies understanding, generation, and editing of videos.
- Dual-Stream Architecture: It proposes a novel architecture that decouples understanding from generation. An MLLM interprets complex multimodal instructions, while an MMDiT handles the video synthesis, ensuring both semantic accuracy and visual fidelity.
- State-of-the-Art Performance: Through extensive experiments, UniVideo is shown to match or surpass specialized, state-of-the-art models in tasks like text-to-video, image-to-video, in-context generation, and in-context editing.
- Impressive Generalization: The unified training approach enables two powerful forms of generalization without explicit training:
- Task Composition: UniVideo can combine multiple learned skills in a single instruction (e.g., performing an object swap and a style transfer simultaneously).
- Zero-Shot Transfer: It can perform free-form video editing tasks (like green-screening or changing an object's material) by transferring knowledge gained from large-scale image editing data.
- Visual Prompt Understanding: The model can interpret visual prompts, such as diagrams or annotations on an image, to guide video generation, offering a more intuitive way for users to express their intent.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Multimodal Large Language Models (MLLMs): These are advanced AI models, extensions of Large Language Models (LLMs) like GPT, that can process and reason about information from multiple modalities simultaneously, including text, images, and video. In UniVideo, the MLLM acts as the "brain," understanding the user's intent from a mix of inputs.
- Diffusion Models: These are a class of generative models that have become the standard for high-quality image and video synthesis. They work by learning to reverse a process of gradually adding noise to an image or video. By starting with pure noise and applying the learned "denoising" process, they can generate new, coherent content.
- Diffusion Transformer (DiT): A recent architectural innovation for diffusion models. Instead of using the traditional U-Net architecture, DiTs use a Transformer, which has been shown to scale more effectively and can lead to better performance, especially for complex generation tasks. UniVideo's generation component is based on a Multimodal DiT (MMDiT).
- Variational Autoencoder (VAE): A type of neural network used to compress data into a compact latent (hidden) representation and then reconstruct it. In generative models, VAEs are often used to encode high-resolution images/videos into a smaller latent space. The diffusion process then happens in this efficient latent space, significantly reducing computational cost (a latent-diffusion sketch follows this list).
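To make the role of the VAE concrete, here is a minimal, hedged sketch (PyTorch assumed; the "VAE encoder" and "denoiser" below are toy stand-ins, not UniVideo's actual modules) of one diffusion training step carried out in latent space rather than on raw pixels.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a simple average pool plays the role of the VAE encoder, and a single
# 3D convolution plays the role of the denoiser. The point is the flow: pixels ->
# compact latents -> add noise -> predict the noise -> regression loss, with the
# expensive diffusion computation happening entirely in latent space.
video = torch.randn(1, 3, 8, 64, 64)                       # (B, C, T, H, W) pixel-space video
latents = F.avg_pool3d(video, kernel_size=(1, 8, 8))       # crude "VAE encode" to (1, 3, 8, 8, 8)
denoiser = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

noise = torch.randn_like(latents)
t = torch.rand(latents.shape[0]).view(-1, 1, 1, 1, 1)      # one random diffusion time per sample
noisy_latents = (1 - t) * latents + t * noise              # simple linear corruption schedule (assumed)
loss = F.mse_loss(denoiser(noisy_latents), noise)          # train the denoiser to recover the noise
loss.backward()
```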
- Previous Works & Differentiation:
  - Unified Image Models: The paper builds on the success of models like Qwen-Image, OmniGen2, and GPT-image-1, which unified understanding and generation for static images. UniVideo's primary contribution is extending this powerful paradigm to the dynamic and more challenging video domain.
  - Task-Specific Video Models:
    - Generation: Models like HunyuanVideo are experts at text-to-video generation but cannot perform editing or follow complex multimodal instructions.
    - Editing: Models like AnyV2V, UNIC, and VACE are designed for specific editing tasks (e.g., object replacement, style transfer) and often require task-specific modules, pipelines, or user-provided masks, making them inflexible.
    - UniVideo's Differentiation: Unlike these specialized models, UniVideo handles all these tasks and more within a single architecture, using a unified instruction format and without needing task-specific components or masks.
  - Related Unified Video Models: The paper acknowledges contemporary works like Omni-Video and UniVid. However, it distinguishes itself by explicitly investigating and demonstrating the emergent benefits of unification, such as compositional generalization and zero-shot transfer of skills, which it argues are underexplored in those works.
4. Methodology (Core Technology & Implementation)
UniVideo's architecture is a carefully designed dual-stream system to balance high-level understanding with low-level visual fidelity.
Figure: Schematic of the model architecture, showing UniVideo's dual-stream design. It comprises an MLLM for instruction understanding and an MMDiT for video generation; the MMDiT is internally divided into understanding-stream and generation-stream blocks and supports encoding and decoding of both images and videos.
- Principles: The Dual-Stream Design. As shown in Figure 2, the model consists of two main components:
  - Understanding Branch (MLLM): A frozen, pre-trained MLLM (Qwen2.5-VL-7B) serves as the instruction processing unit. It takes all inputs (text prompt, reference images, reference video) and produces rich semantic features (the last-layer hidden states) that encode the user's intent. Keeping the MLLM frozen is a crucial design choice to preserve its powerful, pre-existing reasoning and understanding capabilities.
  - Generation Branch (MMDiT): A Multimodal Diffusion Transformer (HunyuanVideo-T2V-13B backbone) is responsible for synthesizing the output video. It operates with two internal streams:
    - An Understanding Stream receives the high-level semantic features from the MLLM. These features are passed through a trainable MLP connector to align their dimensions. This stream guides the generation process to ensure it follows the instructions.
    - A Generation Stream receives fine-grained visual features directly from the input images/video after they are encoded by a VAE. This direct pathway preserves crucial details like object identity, texture, and structure, which is essential for high-fidelity editing and in-context generation.
  - This dual-path design is superior to approaches that rely only on semantic encoders (which lose visual detail) or bottlenecked query tokens (which struggle with high-information inputs like video). A minimal wiring sketch follows this list.
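The sketch below illustrates how the two conditioning pathways could be wired, assuming hypothetical dimensions and module names (an illustration of the design, not the released implementation): frozen-MLLM hidden states pass through a trainable MLP connector into the understanding stream, while VAE latent tokens of the reference inputs feed the generation stream directly.

```python
import torch
import torch.nn as nn

class DualStreamConditioning(nn.Module):
    """Illustrative wiring of the two conditioning pathways (all dimensions are assumptions)."""

    def __init__(self, mllm_dim: int = 3584, dit_dim: int = 3072, latent_dim: int = 16):
        super().__init__()
        # Trainable MLP connector that aligns frozen-MLLM hidden states to the MMDiT width.
        self.connector = nn.Sequential(
            nn.Linear(mllm_dim, dit_dim), nn.GELU(), nn.Linear(dit_dim, dit_dim)
        )
        # Projection of VAE latent tokens into the generation stream.
        self.latent_proj = nn.Linear(latent_dim, dit_dim)

    def forward(self, mllm_hidden: torch.Tensor, vae_latent_tokens: torch.Tensor) -> torch.Tensor:
        # Understanding stream: semantic guidance from the MLLM's last-layer hidden states.
        understanding_tokens = self.connector(mllm_hidden)         # (B, N_text, dit_dim)
        # Generation stream: fine-grained visual tokens taken directly from the VAE.
        generation_tokens = self.latent_proj(vae_latent_tokens)    # (B, N_vis, dit_dim)
        # The MMDiT would jointly attend over both token sets plus the noisy video tokens.
        return torch.cat([understanding_tokens, generation_tokens], dim=1)

cond = DualStreamConditioning()
fused = cond(torch.randn(1, 77, 3584), torch.randn(1, 256, 16))
print(fused.shape)  # torch.Size([1, 333, 3072])
```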
- Steps & Procedures:
  - Unifying Multiple Tasks: All tasks are standardized into a single multimodal instruction format. Visual inputs are assigned ID tags (e.g., "Generate a video of the woman in <Image 1> holding the flowers in <Image 2> in the scene of <Video 1>"). For tasks with multiple visual conditions, their VAE-encoded latents are concatenated along the time dimension. To help the model distinguish between different inputs and the noisy video being generated, it uses 3D positional embeddings (see the sketch below).
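As a rough illustration of this instruction format, the snippet below assembles a tagged prompt and concatenates condition latents along the time axis with distinct temporal indices (all tensor shapes and the position-ID scheme are assumptions, not the paper's exact implementation).

```python
import torch

# Hypothetical latents standing in for VAE encodings: one latent frame per reference
# image, several latent frames for the reference video, and the noisy latent of the
# video being generated.
prompt = ("Generate a video of the woman in <Image 1> holding the flowers "
          "in <Image 2> in the scene of <Video 1>.")

image1_latent = torch.randn(16, 1, 32, 32)   # (C, T, H, W), single latent frame
image2_latent = torch.randn(16, 1, 32, 32)
video1_latent = torch.randn(16, 8, 32, 32)   # 8 latent frames
noisy_target  = torch.randn(16, 8, 32, 32)   # the video to be denoised/generated

# All visual conditions are concatenated with the noisy target along the time axis.
mmdit_input = torch.cat([image1_latent, image2_latent, video1_latent, noisy_target], dim=1)

# Give each segment its own temporal index range so 3D positional embeddings can
# distinguish the conditions from the generation target (a simplified stand-in).
time_ids = torch.cat([
    torch.tensor([0]),        # <Image 1>
    torch.tensor([1]),        # <Image 2>
    torch.arange(2, 10),      # <Video 1> frames
    torch.arange(100, 108),   # generation target, offset into a separate range
])
print(mmdit_input.shape, time_ids.shape)  # torch.Size([16, 18, 32, 32]) torch.Size([18])
```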
  - Understanding Visual Prompts: As illustrated in the figure below, UniVideo can handle non-textual prompts. The MLLM branch interprets visual diagrams or annotations, translates them into a dense, structured textual description, and then feeds the corresponding embeddings into the MMDiT to guide generation. This turns a complex visual instruction into a standard in-context generation task for the model (a rough pipeline sketch follows the figure).
Figure: Schematic of the dual-stream design of the Multimodal DiT (MMDiT) in UniVideo, where the understanding stream relies on the MLLM to parse complex multimodal instructions for video generation, and the generation stream synthesizes the video frame by frame via the VAE module, enabling coherent image-to-video conversion.
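A rough sketch of this two-step idea is shown below; the `mllm` / `mmdit` objects and their methods are placeholders, not UniVideo's actual API.

```python
# Hypothetical glue code: the MLLM first rewrites a visual prompt, such as an annotated
# storyboard, into a dense textual shot plan, which then conditions generation like any
# other multimodal instruction.
def visual_prompt_to_video(annotated_image, mllm, mmdit):
    shot_plan = mllm.describe(
        image=annotated_image,
        instruction="Rewrite the arrows and notes in this image as an ordered shot list.",
    )
    # The plan's embeddings guide the MMDiT exactly like a standard text instruction.
    return mmdit.generate(instruction=shot_plan, references=[annotated_image])
```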
  - Training Strategy: The model is trained in a three-stage curriculum (a freezing-schedule sketch follows the list):
- Stage 1: Connector Alignment: Only the MLP connector between the frozen MLLM and frozen MMDiT is trained. This is a fast stage to align the feature spaces of the two large models using a massive dataset of text-to-image/video pairs.
- Stage 2: Fine-tuning MMDiT: The MLLM remains frozen, while the MMDiT and connector are fine-tuned on a smaller set of high-quality data. This brings the generation quality of the integrated system up to par with the original, standalone MMDiT backbone.
- Stage 3: Multi-task Training: The MLLM is still frozen, and the connector and MMDiT are trained jointly on a diverse mix of all tasks: generation, editing, in-context generation, etc. This is the crucial stage where the model learns to be a versatile, general-purpose video system and develops its generalization abilities.
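A condensed sketch of how this staged freezing could be expressed (the module handles below are assumptions; the actual training code is not shown in the paper excerpt):

```python
import torch.nn as nn

def configure_stage(stage: int, mllm: nn.Module, connector: nn.Module, mmdit: nn.Module) -> None:
    """Illustrative freezing schedule for the three-stage curriculum."""
    for p in mllm.parameters():
        p.requires_grad = False          # the MLLM stays frozen in every stage
    for p in connector.parameters():
        p.requires_grad = True           # the MLP connector is trained in all three stages
    for p in mmdit.parameters():
        p.requires_grad = stage >= 2     # MMDiT is frozen in stage 1, fine-tuned in stages 2-3

# Stage 1: connector alignment on large text-to-image/video data.
# Stage 2: fine-tune MMDiT + connector on a smaller, high-quality set.
# Stage 3: joint multi-task training across generation, editing, and in-context tasks.
```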
5. Experimental Setup
- Datasets: The model is trained on a large, diverse collection of data synthesized for various tasks.
  - In-context Generation/Editing: Data was created by leveraging existing tools. For example, SAM2 was used for object segmentation and a video inpainting model was used to create "before" (with object) and "after" (object removed) pairs for deletion/insertion training.
  - Stylization: Pairs of realistic and stylized videos were created using a T2V model and a ControlNet model.
  - Image Editing: Data was sourced from open-source datasets like OmniEdit and ImgEdit, and also generated using state-of-the-art models like FLUX.1 Kontext.
  - The following table, transcribed from the paper, provides an overview of the training data composition.
Table 1: Overview of the multimodal training data used for UniVideo.

| Task | Input | #Examples |
| --- | --- | --- |
| Text to Image | txt | O(40)M |
| Text to Image (High Quality) | txt | O(10)K |
| Image Reconstruction | image | O(40)M |
| Text to Video | txt | O(10)M |
| Text to Video (High Quality) | txt | O(10)K |
| Image to Video | img+txt | O(10)K |
| Image Editing | img+txt | O(1)M |
| Image Style Transfer | img+txt | O(10)K |
| In-Context Video Editing (swap, addition, delete, style) | ref-img × n + video + txt | O(10)K |
| In-Context Video Generation | ref-img × n + txt | O(10)K |
| In-Context Image Style Transfer | ref-img × n + img + txt | O(10)K |
- Evaluation Metrics:
  - Visual Understanding: MMBench, MMMU, MM-Vet. These are standard benchmarks that evaluate an MLLM's ability to answer questions about images/videos, testing its reasoning and perception.
  - Video Generation: VBench. A comprehensive benchmark for evaluating video generation quality across multiple dimensions like temporal consistency, motion quality, and aesthetics.
  - Identity Consistency: CLIP-I (CLIP image similarity) and DINO-I (DINO feature similarity). These metrics measure how well the identity of a subject from a reference image is preserved in the generated video. Higher is better.
  - Prompt Following: CLIP-Score. Measures the semantic similarity between the text prompt and the generated video frames using CLIP. Higher is better.
  - Human Evaluation: For subjective qualities, human annotators rated videos on:
    - Subject Consistency (SC): Does the subject in the video match the reference?
    - Prompt Following (PF): Does the video accurately reflect the text prompt?
    - Overall: The overall quality and appeal of the video.
  - Stylization Quality: CSD-Score and ArtFID. These metrics measure how well the style of a reference image is transferred to a video while preserving the video's original content.
  A minimal similarity-metric sketch follows this list.
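For concreteness, the CLIP-I/DINO-I style metrics reduce to cosine similarities between a reference embedding and per-frame embeddings of the generated video. The sketch below uses random vectors as stand-ins for real CLIP/DINO features (this is not the paper's exact evaluation script).

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_consistency(ref_embedding: np.ndarray, frame_embeddings: list) -> float:
    """CLIP-I / DINO-I style score: mean cosine similarity between the reference image
    embedding and each generated frame's embedding (higher means better identity preservation)."""
    return float(np.mean([cosine(ref_embedding, f) for f in frame_embeddings]))

# Usage with made-up embeddings; an actual evaluation would run the CLIP/DINO encoders
# on the reference image and on every generated frame.
ref = np.random.randn(512)
frames = [np.random.randn(512) for _ in range(16)]
print(identity_consistency(ref, frames))
```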
- Baselines: UniVideo was compared against a strong set of models, including:
  - Open-Source Models: LLaVA-1.5, I2VGen-XL, VACE, UNIC, AnyV2V.
  - Closed-Source/Commercial Models: Pika2.2, Kling1.6.
  - Other Unified Models: Emu3, Show-o2.
6. Results & Analysis
The experiments robustly demonstrate UniVideo's effectiveness and versatility.
- Core Results:
  - Visual Understanding and Generation (Table 3): UniVideo achieves top-tier scores on understanding benchmarks (MMBench, MMMU, MM-Vet) because it inherits the capabilities of its frozen Qwen-2.5VL-7B MLLM. Simultaneously, it scores competitively on the VBench generation benchmark, proving that integrating the MLLM does not compromise its generation quality.

  This table is transcribed from the paper. Table 3: Quantitative comparison on Visual Understanding and Video Generation.

| Model | MMB | MMMU | MM-Vet | VBench T2V |
| --- | --- | --- | --- | --- |
| Video Understanding Model | | | | |
| LLaVA-1.5 (Liu et al., 2024a) | 36.4 | 67.8 | 36.3 | × |
| LLaVA-NeXT (Liu et al., 2024b) | 79.3 | 51.1 | 57.4 | × |
| Video Generation Model | | | | |
| CogVideoX (T2V/I2V) | × | × | × | 81.61 |
| I2VGen-XL | × | × | × | × |
| HunyuanVideo (T2V/I2V) | × | × | × | 83.24 |
| Step-Video (T2V/TI2V) | × | × | × | 81.83 |
| Wan2.1 (T2V/I2V) | × | × | × | 84.70 |
| Unified Understanding & Generation Model | | | | |
| Emu3 | 58.5 | 31.6 | 37.2 | 80.96 |
| Show-o2 | 79.3 | 48.9 | 56.6 | 81.34 |
| UniVideo* | 83.5 | 58.6 | 66.6 | 82.58 |

  *We report understanding task results for UniVideo using its MLLM component (Qwen-2.5VL-7B).
  - In-Context Generation and Editing (Tables 4 & 5): Across both in-context generation and editing tasks, UniVideo consistently performs at or near the top, often outperforming specialized commercial models. Notably, in Table 4, it achieves the highest Subject Consistency (SC) score, a critical metric for this task. In Table 5, it achieves these strong results in a more challenging mask-free setting, where it must infer the editing region from instructions alone, unlike baselines that require an explicit mask.

  This table is transcribed from the paper. Table 4: Quantitative comparison on In-Context Generation.

| Model | SC↑ | PF↑ | Overall↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Single Reference Generation | | | | | | |
| VACE | 0.31 | 0.65 | 0.42 | 0.922 | 40.341 | 5.426 |
| Kling1.6 | 0.68 | 0.95 | 0.88 | 0.938 | 86.641 | 5.896 |
| Pika2.2 | 0.45 | 0.43 | 0.15 | 0.928 | 104.768 | 5.125 |
| UniVideo | 0.88 | 0.93 | 0.95 | 0.943 | 56.336 | 5.740 |
| Multi Reference (≥ 2) Generation | | | | | | |
| VACE | 0.48 | 0.53 | 0.48 | 0.53 | | |
| Kling1.6 | 0.73 | 0.45 | 0.95 | 0.916 | 65.606 | 5.941 |
| Pika2.2 | 0.71 | 0.48 | 0.43 | 0.898 | 76.796 | 5.176 |
| UniVideo | 0.81 | 0.75 | 0.85 | 0.942 | 59.393 | 6.128 |
- Ablations / Parameter Sensitivity:
- Multi-task vs. Single-task (Table 6): An ablation study compared the final multi-task UniVideo model against single-task versions (same architecture but trained only on one task's data). The multi-task model performed significantly better across the board, with an average overall score of 0.85 compared to 0.79. This confirms that joint training allows the model to leverage knowledge across tasks (e.g., using image editing data to improve video editing).
- Importance of Dual-Stream Visual Input (Table 7): Another ablation removed the direct visual input stream to the MMDiT, forcing all visual information to pass through the MLLM. This caused a catastrophic drop in performance, especially for Subject Consistency (SC fell from 0.78 to 0.18). This result validates the dual-stream design, proving that the direct VAE-to-MMDiT pathway is essential for preserving the fine-grained visual details needed for consistent generation and editing.
- Zero-Shot Generalization and Visual Prompting: The paper's most exciting findings are highlighted in Figures 5 and 6.
  - Zero-Shot Generalization: UniVideo demonstrates two powerful generalization capabilities. It can compose tasks it was trained on separately (e.g., adding a character AND changing the style) and, more impressively, perform entirely new free-form video editing tasks (e.g., changing a woman into glass, green-screening) by transferring skills learned from its large image editing dataset.
Figure: Illustration of UniVideo's video generation and editing results. The left shows the original video frames, the middle shows a green-screen background replacement example, and the right shows an edit that changes the material of a character in the video, demonstrating the model's unified multi-task video editing capability.
  - Visual Prompting: The model successfully interprets complex visual prompts, like a hand-drawn storyboard, to generate a coherent video sequence. This showcases the advanced reasoning enabled by its MLLM component.
Figure: Schematic showing a three-step scene transition: a dynamic low-angle tracking shot, a car bursting out of an explosion, and a tunnel opening onto a coastal city. Arrows indicate the order of the actions, illustrating the narrative flow of the video content.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents UniVideo, a unified and powerful framework that significantly advances the state of video generation and editing. By integrating a strong MLLM for understanding and an MMDiT for generation in a dual-stream architecture, UniVideo achieves SOTA performance on a wide range of tasks. Its most significant contribution is demonstrating that a unified, multi-task training approach leads to remarkable generalization, allowing the model to compose tasks and transfer skills across modalities (image-to-video) in a zero-shot manner.
- Limitations & Future Work: The authors are transparent about the model's limitations:
- It can sometimes "over-edit" regions not mentioned in the prompt.
- Its ability to preserve the original motion in a video is limited by the backbone model.
- While it can perform free-form video editing, its success rate is lower than for image editing, highlighting the inherent difficulty of the video domain.
- Future work aims to address these issues by exploring better video backbones, curating large-scale video editing datasets, and eventually moving from an "assembled" system of pre-trained parts to a native, end-to-end trained multimodal video model.
- Personal Insights & Critique:
- UniVideo represents a significant and practical step towards creating a general-purpose "AI video assistant." The dual-stream architecture is an intelligent engineering choice, effectively leveraging the strengths of massive pre-trained models (MLLM for reasoning, DiT for synthesis) without requiring training from scratch.
- The demonstrated zero-shot transfer from image editing to video editing is a compelling result. It suggests that with a sufficiently general architecture and diverse training data, models can learn abstract concepts of "editing" that are not tied to a single modality. This has profound implications for data efficiency and model capability.
- The distinction between an "assembled" system and a "native" end-to-end model is important. While UniVideo is highly effective, it relies on combining existing expert models. The next frontier will be to build a single model that learns all these capabilities from the ground up, which may unlock even deeper forms of reasoning and creativity.
- Overall, UniVideo is a landmark paper that not only delivers a highly capable system but also provides a clear blueprint and strong evidence for the power of unified modeling in the video domain.