
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Published: 05/05/2025

TL;DR Summary

This paper provides a comprehensive survey on unified multimodal understanding and generation models, exploring the challenges posed by architectural differences between autoregressive and diffusion models. It highlights three main unified framework paradigms (diffusion-based, autoregressive-based, and hybrid) and compiles datasets, benchmarks, and open challenges tailored to unified models to guide future research.

Abstract

Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

1.2. Authors

Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

1.3. Journal/Conference

This paper is a preprint published on arXiv. The abstract and the content do not specify a journal or conference publication at the time of this analysis. arXiv is a well-regarded open-access archive for scholarly articles, particularly in fields like physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to share their work before formal peer review and publication, making it influential for disseminating cutting-edge research rapidly.

1.4. Publication Year

2025 (published on arXiv on May 5, 2025).

1.5. Abstract

The paper surveys the rapidly evolving field of unified multimodal understanding and generation models. It notes that while multimodal understanding models (often autoregressive-based) and image generation models (predominantly diffusion-based) have seen independent successes, recent interest, exemplified by GPT-4o, drives their integration. The architectural disparities between these domains present significant challenges. This survey offers a comprehensive overview, starting with foundational concepts and advancements in both multimodal understanding and text-to-image generation. It then categorizes existing unified models into three architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches. For each category, it details structural designs and innovations. Additionally, it compiles relevant datasets and benchmarks to guide future research. Finally, the paper discusses key challenges such as tokenization strategies, cross-modal attention, and data construction, and identifies opportunities in this nascent field, aiming to inspire further research and serve as a valuable reference.

https://arxiv.org/abs/2505.02567v5 (Preprint)

https://arxiv.org/pdf/2505.02567v5.pdf (Preprint)

2. Executive Summary

2.1. Background & Motivation

The field of Artificial Intelligence has witnessed rapid advancements in two distinct areas: multimodal understanding models and image generation models.

  • Multimodal Understanding Models: Large Language Models (LLMs) like LLaMa and GPT have expanded into multimodal domains, giving rise to powerful models such as LLaVa, Qwen-VL, and GPT-4. These models excel at tasks ranging from simple image captioning to complex reasoning based on user instructions, primarily utilizing autoregressive-based architectures (decoder-only, next-token prediction).

  • Image Generation Models: Technologies like the Stable Diffusion (SD) series and FLUX have achieved remarkable success in producing high-quality images from text prompts, with diffusion-based models becoming the dominant paradigm (leveraging U-Net or DiT architectures).

    The core problem the paper aims to address is the architectural and methodological divergence between these two successful domains. Despite their individual achievements, they have largely evolved independently. This independence has led to distinct paradigms that are not inherently compatible, posing a significant challenge for creating systems that can seamlessly perform both understanding and generation tasks.

The motivation for unification is compelling:

  • Enhanced Capabilities: A unified model could generate images from complex instructions, reason about visual data, and then visualize those analyses through generated outputs, offering a much richer human-AI interaction.

  • Real-world Impact: The emergence of advanced models like GPT-4o, demonstrating enhanced capabilities in both understanding and generation, has highlighted the immense potential and growing interest in this unification trend. This development underscores the need for consolidated research and understanding of the challenges and opportunities in this nascent field.

  • Efficiency and Scalability: Managing separate models for understanding and generation tasks is complex and inefficient. A unified model promises a more scalable solution capable of handling a wider variety of tasks within a single framework.

    The paper's entry point is to provide a comprehensive survey that bridges this gap, guiding future research by clearly outlining the current state, categorizing existing approaches, compiling resources, and discussing challenges and opportunities.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Comprehensive Survey and Categorization: It provides a thorough overview of existing efforts towards unifying multimodal understanding and generation models. It categorizes these models into three main architectural paradigms:

    • Diffusion-based models.
    • Autoregressive-based models (further sub-categorized by image tokenization strategies: pixel-based, semantic-based, learnable query-based, and hybrid encoding).
    • Hybrid approaches that fuse autoregressive and diffusion mechanisms (also sub-categorized by encoding strategies). This systematic classification helps in understanding the diverse approaches and their underlying design principles.
  2. Foundational Concepts and Advancements: It details the foundational concepts and recent advancements in both multimodal understanding and text-to-image generation, providing a necessary background for novices.

  3. Resource Compilation: It compiles and reviews relevant datasets and benchmarks tailored for training and evaluating unified models. These resources span various tasks including multimodal understanding, text-to-image generation, image editing, and interleaved image-text processing, serving as a valuable guide for researchers.

  4. Identification of Challenges and Opportunities: The survey critically discusses key challenges facing the field, such as:

    • Tokenization strategy: How to effectively represent visual information for different architectures.
    • Cross-modal attention: Managing computational complexity and ensuring effective interaction between modalities.
    • Data construction: The need for large-scale, high-quality, and diverse multimodal datasets.
    • Evaluation protocols: Developing comprehensive benchmarks that assess both understanding and generation holistically. It also highlights opportunities for future research, including incorporating Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL), addressing demographic and social biases, and enabling personalized knowledge-driven generation.

In essence, the paper serves as a structured reference point for researchers entering or working in the field of unified multimodal AI, aiming to inspire innovation and facilitate progress toward truly general-purpose multimodal models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the paper's analysis of unified multimodal models, several foundational concepts from Large Language Models (LLMs), Multimodal Understanding, and Image Generation are crucial.

3.1.1. Large Language Models (LLMs) and Autoregressive Generation

  • Large Language Models (LLMs): These are neural networks with billions of parameters, pre-trained on massive text datasets to understand, generate, and process human language. They have shown remarkable capabilities in natural language processing (NLP) tasks. Examples include GPT series, LLaMa, and Qwen.
  • Autoregressive Generation: This is a sequential process where a model predicts the next element in a sequence based on all previously generated elements. In LLMs, this means predicting the next word or token given the preceding context. The decoder-only Transformer architecture is commonly used for this, allowing efficient generation of text sequences. The model learns to factorize the joint probability of a sequence $\boldsymbol{x} = (x_1, x_2, ..., x_N)$ as a product of conditional probabilities: $ p(\boldsymbol{x}) = \prod_{i=1}^{N} p(x_i | x_1, x_2, ..., x_{i-1}; \theta) $, where $\theta$ represents the model parameters. The training objective is typically to minimize the negative log-likelihood (NLL) loss: $ \mathcal{L}(\boldsymbol{\theta}) = - \sum_{i=1}^{N} \log p(x_i | x_1, x_2, ..., x_{i-1}; \boldsymbol{\theta}) $. This process is analogous to how humans read or write, predicting one word after another.
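
As a concrete illustration of the factorization and NLL objective above, here is a minimal PyTorch sketch of next-token prediction training; the tensor names and shapes are illustrative assumptions, not taken from any specific model.

```python
import torch
import torch.nn.functional as F

def next_token_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood for autoregressive next-token prediction.

    logits: (batch, seq_len, vocab_size) -- model scores for p(x_i | x_<i)
    tokens: (batch, seq_len)             -- ground-truth token ids
    """
    # Predict token i from positions < i: drop the last prediction, shift targets.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    # Cross-entropy over the vocabulary is exactly the NLL in the formula above.
    return F.cross_entropy(pred, target)

# Toy usage with random data.
logits = torch.randn(2, 16, 1000)
tokens = torch.randint(0, 1000, (2, 16))
loss = next_token_nll(logits, tokens)
```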

3.1.2. Multimodal Understanding Models (MLLMs/MMMs)

  • Multimodal Understanding Models: These models extend the capabilities of LLMs to process and reason over multiple input modalities, most commonly vision-language understanding (VLU) which integrates visual (images, video) and textual inputs. They aim to achieve a comprehensive understanding of scenes, objects, relationships, and concepts across modalities.
  • Vision-Language Understanding (VLU): A subfield of multimodal AI focusing on tasks that require joint reasoning over visual and linguistic data, such as image captioning, visual question answering (VQA), and visual grounding.
  • Dual-Encoder Architectures: Early VLU models used this, where separate encoders process images and text, then their latent representations are aligned. For example, CLIP (Contrastive Language-Image Pre-training) learns to embed images and text into a shared latent space such that relevant image-text pairs are close together.
  • LLM Backbones with Connectors: Modern VLU models often use a frozen or fine-tuned LLM as their core. A connector module bridges the visual input (encoded by a vision encoder) to the LLM's token space. As shown in the VLM description for images/2.jpg, these connectors can be projection-based (simple linear layers), query-based (e.g., using a transformer to query visual features), or fusion-based (more complex integration of features).

3.1.3. Image Generation Models

  • Diffusion Models: These generative models have become state-of-the-art for high-quality image synthesis. They operate by learning to reverse a Markov chain of Gaussian noise addition.

    • Forward Process: Gradually adds Gaussian noise to an image $x_0$ over $T$ time steps, corrupting it into pure noise $x_T$: $ q(\boldsymbol{x}_{1:T} | \boldsymbol{x}_0) := \prod_{t=1}^{T} q(\boldsymbol{x}_t | \boldsymbol{x}_{t-1}) $, with $ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathbf{I}) $. Here, $\beta_t$ are small variance hyperparameters controlling the noise added at each step.
    • Reverse Process: A neural network (often a U-Net) is trained to predict and remove this noise iteratively, transforming $x_T$ back into $x_0$ and thereby generating a clean image from noise: $ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $. The network parameterizes the mean $\mu_\theta(x_t, t)$ and variance $\Sigma_\theta(x_t, t)$, learning to predict the noise $\epsilon_\theta(x_t, t)$ that was added at timestep $t$ (a minimal training-loss sketch is given after this list).
    • Score Function: Early diffusion models often aimed to learn the score function $\nabla_{x_t} \log q(x_t | x_0)$, which points in the direction of higher probability density.
    • U-Net: A convolutional neural network architecture with a symmetrical U-shape, featuring skip connections that allow direct information flow from earlier (downsampled) layers to later (upsampled) layers. This helps preserve fine-grained details during image generation.
    • Latent Diffusion Models (LDMs): To improve computational efficiency, LDMs (e.g., Stable Diffusion) perform the diffusion process in a compressed latent space learned by a Variational Autoencoder (VAE) or VQ-GAN. This reduces the dimensionality of data, making diffusion faster while maintaining high-quality generation.
    • Diffusion Transformers (DiT): These models replace the traditional U-Net with a Transformer architecture for the noise prediction network, processing images as sequences of patches, similar to how Vision Transformers (ViT) process images for classification. They take conditional information (timestep, text embeddings) as input.
  • Generative Adversarial Networks (GANs): An older class of generative models consisting of a generator (which creates samples) and a discriminator (which tries to distinguish real from generated samples), trained in an adversarial manner. While powerful, GANs often suffer from mode collapse (generating a limited variety of samples) and training instability.
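
To ground the forward/reverse formulation above, the following is a minimal, illustrative PyTorch sketch of the simplified DDPM training step (closed-form noising plus noise-prediction regression); the linear beta schedule and the `eps_model(x_t, t)` interface are assumptions for illustration, not details taken from the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # beta_t schedule (assumed linear)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def diffusion_loss(eps_model, x0: torch.Tensor) -> torch.Tensor:
    """Sample a timestep, noise x0 with the closed-form forward process,
    and regress the injected noise (the simplified DDPM objective)."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)       # x0 assumed to be (b, c, h, w)
    # Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    eps_pred = eps_model(x_t, t)                     # U-Net or DiT noise predictor
    return torch.mean((eps_pred - noise) ** 2)
```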

3.1.4. Tokenization Strategies

  • Tokenization: The process of converting raw input data (like images or text) into discrete units (tokens) that a model can process.
  • Text Tokens: For text, this usually involves splitting sentences into words or sub-word units (e.g., using Byte Pair Encoding - BPE).
  • Visual Tokens: For images, tokenization is more complex:
    • Next-Pixel Prediction: Flattens image pixels into a 1D sequence and predicts the next pixel. Computationally expensive.

    • Next-Token Prediction (Discrete Vector Quantization): Uses a visual tokenizer (like VQ-VAE or VQ-GAN) to compress an image into a sequence of discrete latent codes or tokens from a learned codebook. This reduces sequence length and makes images compatible with Transformer architectures.

    • Semantic Encoders (e.g., CLIP, EVA-CLIP, SigLIP): These encoders are trained to align visual and text modalities in a shared latent space. They produce continuous semantic embeddings that capture high-level conceptual information, making them valuable for understanding tasks.

    • Next-Multiple-Tokens Prediction: Predicts multiple tokens at once, often by grouping pixels or tokens into larger units (patches, blocks) or using multi-scale approaches, to improve generation efficiency.

      As seen in the VLM description for images/3.jpg, autoregressive models use various sequence representation strategies:

  • Next-Pixel Prediction: Treats each pixel as a unit, flattening the image into a pixel sequence.
  • Next-Token Prediction: Converts the image into a token sequence via a visual tokenizer (using discrete vector quantization; see the sketch after this list).
  • Next-Multiple-Tokens Prediction: Groups multiple tokens as a unit, often using spatial scale or frequency groups.
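
As a concrete illustration of the next-token prediction strategy above, here is a minimal sketch of how a VQ-style tokenizer maps encoder latents to discrete token ids via nearest-codebook lookup; the shapes and codebook size are illustrative assumptions.

```python
import torch

def vq_tokenize(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous encoder latents to discrete token ids by nearest codebook entry.

    latents:  (batch, h, w, dim) -- output of a VQGAN/VQ-VAE-style encoder
    codebook: (codebook_size, dim)
    returns:  (batch, h * w) token ids, ready to be flattened into a sequence
    """
    b, h, w, d = latents.shape
    flat = latents.reshape(-1, d)          # (b*h*w, d)
    dists = torch.cdist(flat, codebook)    # pairwise L2 distances to every code
    ids = dists.argmin(dim=-1)             # nearest code per latent position
    return ids.reshape(b, h * w)

# Toy usage: a 16x16 latent grid and an 8192-entry codebook.
latents = torch.randn(2, 16, 16, 256)
codebook = torch.randn(8192, 256)
image_tokens = vq_tokenize(latents, codebook)   # (2, 256) discrete visual tokens
```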

3.2. Previous Works

The paper references a vast array of previous works, which it then categorizes. Here's a summary of some core ones relevant to the foundational concepts:

  • LLMs:

    • LLaMa [1, 2]: A family of large language models from Meta AI, known for their open-source nature and strong performance, often used as backbones in multimodal models.
    • GPT Series [7, 13]: OpenAI's generative pre-trained transformer series, particularly GPT-4 [13] and GPT-4o [27], are highlighted for their advanced language and increasingly multimodal capabilities, influencing the trend toward unification.
    • Qwen Series [5, 6]: Alibaba Cloud's LLM series, with Qwen-VL [5] being a prominent multimodal extension.
  • Multimodal Understanding (VLU/MLLMs):

    • CLIP [22]: (Contrastive Language-Image Pre-training) A seminal dual-encoder model by OpenAI that aligns images and text into a shared embedding space using contrastive learning. Widely used as a vision encoder or for feature extraction.
    • LLaVa [8]: (Large Language and Vision Assistant) One of the early visual instruction tuning models that adapts LLMs for multimodal understanding tasks, leveraging CLIP and Vicuna.
    • MiniGPT-4 [57]: Utilized a simple projection layer to connect CLIP-derived image embeddings to a frozen LLM (Vicuna), demonstrating efficient vision-language alignment.
    • BLIP-2 [53]: Introduced a Q-Former (Querying Transformer) to bridge a frozen visual encoder with a frozen LLM, improving efficiency and alignment.
    • Flamingo [60]: Used gated cross-attention layers to connect a pre-trained vision encoder with a frozen LLM decoder.
    • InternVL [11]: Explores unified multimodal pre-training for various visual-linguistic tasks.
    • Ovis [12]: Introduces structural embedding alignment for multimodal LLMs.
    • DeepSeek-VL2 [68]: Employs a Mixture-of-Experts (MoE) architecture for cross-modal reasoning, showcasing advanced scalability.
  • Image Generation (Diffusion Models):

    • SD Series [14, 15]: (Stable Diffusion) Latent Diffusion Models (LDMs) that achieve high-quality image synthesis efficiently by operating in a latent space. SD3.0 [15] uses rectified flow transformers.
    • FLUX [16]: A recent, high-performance image generation model.
    • DiT [20]: (Diffusion Transformers) Replaces the U-Net in diffusion models with a Transformer, processing images as patches, leading to highly scalable diffusion models.
    • GLIDE [71]: Introduced classifier-free guidance for text-guided diffusion, improving controllability.
    • Imagen [72]: Used large pre-trained LLMs (T5) as text encoders for powerful text-to-image generation.
    • VQ-Diffusion [73]: Applies vector quantization to diffusion models.
  • Image Generation (Autoregressive Models):

    • PixelRNN [88] / PixelCNN [89]: Early pixel-based autoregressive models for images, generating pixels sequentially.
    • VQ-VAE-2 [93] / VQGAN [32]: Key models for Vector Quantization (VQ) that compress images into discrete latent codes, enabling Transformer-based autoregressive image generation (e.g., LlamaGen [24]).
    • MAR [25]: Autoregressive image generation without discrete vector quantization, using continuous representations and a diffusion loss.
  • Connectors/Tokenizers:

    • VAE [31] / VQ-GAN [32]: Used for pixel-based encoding, compressing images into continuous latents or discrete tokens.

    • CLIP [22] / EVA-CLIP [36] / SigLIP [209]: Used for semantic encoding, producing continuous embeddings aligned with text.

    • SEED Tokenizer [163]: An example of a learnable query-based encoding mechanism.

      The VLM description for images/2.jpg (which is Fig. 2 in the paper) illustrates the typical architecture of multimodal understanding models. It shows multimodal encoders (for images, audio, video) feeding into a connector, which then provides input to an LLM. The connector types are broadly projection-based, query-based, and fusion-based. These concepts are central to how visual information is integrated into a language model.

The VLM description for images/2.jpg (which is Fig. 3 in the paper) illustrates the typical process of diffusion-based text-to-image generation. It details the forward process of adding noise and the reverse process of denoising, conditioned by various inputs like text, identity, style, and sketches. This highlights the conditional generation capabilities of diffusion models.

The VLM description for images/3.jpg (which is Fig. 4 in the paper) illustrates core components of autoregressive models. It details autoregression sequence modeling and discrete vector quantization. It categorizes AR models into Next-Pixel Prediction, Next-Token Prediction, and Next-Multiple-Tokens Prediction, based on how they define the unit for sequential generation.

3.3. Technological Evolution

The evolution towards unified multimodal understanding and generation models can be summarized in several stages:

  1. Early Unimodal Dominance (Pre-2018): Separate research streams for NLP (e.g., RNNs, LSTMs, early Transformers for text) and Computer Vision (e.g., CNNs for image classification, GANs for image generation).
  2. Emergence of Multimodal Understanding (2018-2021):
    • Development of Transformer architectures revolutionized NLP.
    • Initial attempts to align modalities with dual-encoder architectures (e.g., ViLBERT, VisualBERT, UNITER) or simple projection layers (CLIP).
    • Focus on tasks like VQA and image captioning.
  3. Rise of Large Language Models (2020-Present):
    • Scaling of LLMs (GPT-3, LLaMa) demonstrated powerful reasoning and generation capabilities.
    • Adaptation of LLMs for multimodal understanding by integrating vision encoders through connector modules (e.g., Flamingo, BLIP-2, LLaVa), leveraging the LLM's reasoning power.
  4. Dominance of Diffusion Models in Image Generation (2020-Present):
    • Diffusion models (DDPM, GLIDE, Imagen, Stable Diffusion) surpassed GANs in sample quality and controllability for image generation.
    • Introduction of Latent Diffusion Models (LDMs) for efficiency and Diffusion Transformers (DiT) for scalability.
  5. The Push for Unification (2023-Present):
    • Recognition of the limitations of separate systems and the desire for models that can seamlessly understand and generate across modalities.

    • Initial explorations into using LLM-inspired autoregressive architectures for image generation, despite quality gaps compared to diffusion models.

    • Growing interest in unified frameworks that integrate the strengths of both autoregressive models (for reasoning and text generation) and diffusion models (for high-quality image synthesis). The release of GPT-4o exemplified this trend, demonstrating advanced multimodal capabilities.

    • Development of hybrid architectures combining AR and diffusion mechanisms.

    • Expansion beyond text-image to any-to-any multimodal models (audio, video, speech).

      The paper's work fits into this timeline by systematically surveying the current state of this unification effort, categorizing the various architectural approaches, and outlining the challenges that remain as the field moves towards truly general-purpose multimodal AI.

3.4. Differentiation Analysis

This paper is a survey, so its innovation lies in its comprehensive scope and structured analysis of a rapidly emerging field, rather than proposing a new model or algorithm. Compared to prior research (which it also cites), its core differentiation points are:

  1. Specific Focus on Unification: While there are excellent surveys on LLMs, multimodal understanding, and image generation individually [40, 41, 42, 43, 44, 45, 46], this work specifically focuses on the integration of understanding and generation tasks within a unified framework. It addresses the unique challenges and opportunities that arise from trying to combine these often disparate paradigms.

  2. Architectural Categorization: The paper introduces a clear and detailed architectural taxonomy for unified models, dividing them into diffusion-based, autoregressive-based, and fused AR + diffusion approaches. Crucially, it further sub-categorizes autoregressive and fused AR + diffusion models based on their image tokenization strategies (pixel-based, semantic-based, learnable query-based, hybrid encoding), which is a key differentiator in how these models process visual information. This fine-grained categorization provides a much-needed structured view of the field.

  3. Comprehensive Resource Compilation: Beyond architectural analysis, the survey provides curated lists of datasets and benchmarks specifically relevant to unified multimodal tasks, covering understanding, generation, editing, and interleaved modalities. This serves as a practical guide for researchers.

  4. Forward-Looking Analysis: It not only reviews existing work but also critically discusses current challenges (e.g., tokenization, cross-modal attention, data construction, evaluation) and opportunities (e.g., CoT, RL, bias mitigation, personalization), offering concrete directions for future research.

    In essence, this survey differentiates itself by providing a holistic, structured, and forward-looking perspective on the emerging trend of unified multimodal AI, addressing the unique complexities of integrating understanding and generation capabilities.

4. Methodology

The paper's methodology is a systematic survey of unified multimodal understanding and generation models. Its core approach involves categorizing existing models based on their architectural paradigms and modality encoding methods, then detailing their structural designs, innovations, and limitations. It also compiles relevant datasets and benchmarks to provide a comprehensive resource for the field.

4.1. Principles

The core idea is to structure the nascent field of unified multimodal models into a coherent taxonomy. This taxonomy helps to:

  1. Clarify Architectural Choices: By grouping models based on their fundamental generative mechanisms (diffusion, autoregressive, fused AR + diffusion).

  2. Highlight Key Design Decisions: Especially regarding image tokenization, which is a critical aspect for integrating visual data into language-centric architectures.

  3. Identify Trends and Gaps: By systematically reviewing models, the survey can pinpoint common solutions, unique innovations, and persistent challenges.

    The theoretical basis is that despite the diversity of implementations, unified multimodal models often share underlying generative principles (autoregressive vs. diffusion) and face similar challenges in representing and integrating different modalities. By analyzing these commonalities and differences, the survey aims to provide a guiding framework for future research.

4.2. Core Methodology In-depth (Layer by Layer)

The paper abstracts a typical unified multimodal framework into three core components:

  1. Modality-Specific Encoders: Project different input modalities (e.g., text, image, video, audio) into a shared or compatible representation space.

  2. Modality-Fusion Backbone: Integrates information from multiple modalities, enabling cross-modal reasoning. This often takes the form of a large Transformer model.

  3. Modality-Specific Decoders: Generate output in the desired modality (e.g., text generation, image synthesis).

    The primary focus of this survey is on models supporting vision-language understanding and generation (i.e., taking image and text as input and producing either text or image as output). The core categorization, as illustrated in the VLM description of images/4.jpg (which is Fig. 5 in the original paper), divides existing unified models into three main types: Diffusion Models, Autoregressive Models (MLLM (AR)), and Fused AR + Diffusion Models (MLLM (AR) + Diffusion).
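
The three-component abstraction above can be summarized with a minimal structural sketch; the module names and interfaces below are illustrative placeholders rather than the design of any particular surveyed model.

```python
import torch
import torch.nn as nn

class UnifiedMultimodalModel(nn.Module):
    """Structural sketch of the three-component abstraction: modality-specific
    encoders, a shared fusion backbone, and modality-specific decoders.
    All submodules are assumed to be supplied externally (illustrative only)."""
    def __init__(self, image_encoder: nn.Module, text_embed: nn.Module,
                 backbone: nn.Module, text_head: nn.Module, image_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. CLIP/SigLIP features or a VQ tokenizer
        self.text_embed = text_embed         # text tokenizer + embedding layer
        self.backbone = backbone             # modality-fusion Transformer
        self.text_head = text_head           # language-modeling head
        self.image_decoder = image_decoder   # VQ decoder or diffusion decoder

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor, target: str):
        # Encode each modality into token embeddings of matching width,
        # fuse them in one sequence, then decode into the requested modality.
        tokens = torch.cat([self.image_encoder(image), self.text_embed(text_ids)], dim=1)
        fused = self.backbone(tokens)
        return self.text_head(fused) if target == "text" else self.image_decoder(fused)
```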

4.2.1. Diffusion Models

General Mechanism: As introduced in Section 3.1.3, diffusion models formulate generation as a pair of Markov chains: a forward process that adds Gaussian noise to data $x_0$ over $T$ timesteps to produce $x_T$, and a reverse process that learns to iteratively denoise back to $x_0$. The VLM description of images/2.jpg (which is Fig. 3 in the paper) visually explains this process.

  • Forward Process: $ q(\boldsymbol{x}_{1:T} | \boldsymbol{x}_0) := \prod_{t=1}^{T} q(\boldsymbol{x}_t | \boldsymbol{x}_{t-1}) $, with $ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathbf{I}) $, where $\beta_t$ are the variance hyperparameters controlling the noise.

  • Reverse Process: $ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $, where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are parameterized by a neural network. The training objective is to minimize a variational bound on the negative log-likelihood, typically simplified to a noise prediction loss: $ \mathcal{L} = \mathbb{E}_{q(x_0, x_{1:T})} \left[ \lVert \epsilon_\theta(x_t, t) - \epsilon^*(x_t, t) \rVert^2 \right] $. Here, $\epsilon_\theta(x_t, t)$ is the model's prediction of the noise at timestep $t$, and $\epsilon^*(x_t, t)$ is the true noise added.

In multimodal diffusion models, the denoising process is conditioned on multimodal contexts (textual descriptions, images, or joint embeddings) to enable synchronized generation and semantic alignment.
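
To illustrate how the denoising process is conditioned on a multimodal context, here is a minimal sketch of ancestral DDPM sampling with a conditioning input; the `eps_model(x_t, t, cond)` interface and the choice sigma_t = sqrt(beta_t) are illustrative assumptions, not details from any surveyed model.

```python
import torch

@torch.no_grad()
def sample_conditional(eps_model, cond, shape, betas: torch.Tensor) -> torch.Tensor:
    """Ancestral DDPM sampling conditioned on a multimodal context `cond`
    (e.g. text embeddings); `eps_model(x_t, t, cond)` is an assumed interface."""
    alphas = 1.0 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                              # start from pure noise x_T
    for t in reversed(range(betas.numel())):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t), cond)
        # Posterior mean of p_theta(x_{t-1} | x_t) using the predicted noise.
        x = (x - betas[t] / (1.0 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = x + betas[t].sqrt() * z                     # sigma_t = sqrt(beta_t)
    return x
```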

Representative Models:

  • Dual Diffusion [127]:

    • Concept: Introduces a dual-branch diffusion process for joint text and image generation.
    • Encoders: Text encoded by a pre-trained T5 encoder [23] (with softmax probability modeling for discrete text representations). Image encoded by VAE encoder from Stable Diffusion [14] (for continuous image latents).
    • Forward Process: Both text and image latents are independently noised through separate forward diffusion processes.
    • Reverse Process: Two modality-specific denoisers (Transformer-based for text, U-Net-based for image) jointly denoise the latents. Crucially, cross-modal conditioning occurs at each timestep, where text latents attend to image latents and vice versa.
    • Decoders: Text latent decoded by a T5 decoder, image latent by VAE decoder.
    • Training: Image branch minimizes standard noise prediction loss. Text branch minimizes contrastive log-loss.
  • UniDisc [128]:

    • Concept: Employs a fully discrete diffusion framework for both text and images, training a Diffusion Transformer from scratch.
    • Tokenization: Text tokenized using LLaMA2 tokenizer [2]. Images converted to discrete tokens with MAGVIT-v2 encoder [207].
    • Diffusion Process: Discrete forward diffusion adds structured noise simultaneously across modalities. The reverse process progressively denoises these discrete tokens.
    • Decoders: LLaMA2 and MAGVIT-v2 decoders transform sequences into text and images.
  • FUDOKI [130]:

    • Concept: A novel generative approach based on discrete flow matching [208], modeling a direct path between noise and data distributions.
    • Architecture: Based on Janus1.5B [174], but with modifications: replaces causal mask with full attention mask for global contextual understanding. Infers corruption state directly instead of explicit timestep embeddings.
    • Encoders: SigLIP encoder [209] for high-level semantic features (understanding). VQGAN-based tokenizer from LlamaGen [24] for low-level discrete tokens (generation).
    • Output: Feature embeddings from Janus1.5B backbone passed through modality-specific output heads for final text and image outputs.
    • Decoupling: Decouples understanding and generation paths.
  • Muddit [131]:

    • Concept: Unified model for bidirectional generation using a purely discrete diffusion framework.
    • Architecture: Single Multimodal Diffusion Transformer (MM-DiT) similar to FLUX [210], initialized from Meissonic [211] (trained for high-resolution synthesis).
    • Quantization: Both modalities quantized into a shared discrete space: VQ-VAE [32] encodes images, CLIP [22] provides text token embeddings.
    • Training: Cosine scheduling strategy to mask tokens. MM-DiT trained to predict clean tokens conditioned on the other modality.
    • Output: Lightweight linear head for text, VQ-VAE decoder for image reconstruction.
  • MMaDA [129]:

    • Concept: Scales up the diffusion paradigm towards a unified multimodal foundation model.
    • Language Backbone: LLaDA-8B-Instruct [212].
    • Image Tokenizer: MAGVIT-v2 [213] converts images to discrete semantic tokens.
    • Alignment: Introduces mixed chain-of-thought (CoT) fine-tuning to unify reasoning formats between text and vision tasks.
    • Optimization: Incorporates UniGRPO, a unified policy-gradient-based Reinforcement Learning (RL) algorithm for diffusion models, enabling post-training optimization across reasoning and generation tasks using diversified reward signals.

Challenges:

  • Inference Efficiency: Discrete diffusion models can be slower than autoregressive counterparts due to lack of key-value cache support and quality degradation with parallel decoding.
  • Training Difficulties: Sparse supervision in discrete diffusion (loss on subset of masked tokens) leads to inefficient training and high variance.
  • Length Bias: Models struggle to generalize across different output lengths, lacking a natural stopping mechanism like end-of-sequence tokens.
  • Architectural Suitability: Many models reuse autoregressive designs which might not be optimal for capturing joint data distributions inherent to diffusion.
  • Infrastructure: Limited mature frameworks and open-source options compared to autoregressive models.

4.2.2. Autoregressive Models

Autoregressive (AR) models serialize both vision and language tokens and model them sequentially, typically using a Transformer backbone adapted from LLMs (e.g., LLaMA, Qwen, Gemma) as a unified modality-fusion module. The VLM description of images/3.jpg (which is Fig. 4 in the paper) illustrates the core components and types of AR models.

Image Tokenization Strategies: Models are categorized by how they tokenize (encode) images for the AR framework:

4.2.2.1. Pixel-based Encoding (Fig. 5 (b-1) in images/4.jpg)

  • Principle: Images are represented as continuous or discrete tokens obtained from pre-trained autoencoders (like VQGAN-like models [32, 220, 221, 222]) optimized for image reconstruction. These encoders compress high-dimensional pixel space into a compact latent space. The resulting image tokens are then processed similarly to text tokens within a single sequence.
  • Encoder/Decoder: VQGAN-like models are typically used as both encoder and decoder. The decoder is usually a lightweight convolutional architecture.
  • Training: Often uses causal attention masks and next-token prediction loss for unified training across modalities.
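
A minimal sketch of the pixel-based encoding idea above: discrete VQ image tokens are offset past the text vocabulary and concatenated with text tokens into a single causal sequence trained with next-token prediction. The special-token id and vocabulary sizes are illustrative assumptions.

```python
import torch

def build_unified_sequence(text_ids: torch.Tensor, image_ids: torch.Tensor,
                           text_vocab_size: int, boi_id: int) -> torch.Tensor:
    """Concatenate text tokens and discrete VQ image tokens into one causal sequence.

    Image ids are offset past the text vocabulary so both modalities share a single
    embedding table and a single next-token-prediction loss. The <begin-of-image>
    marker and all sizes are illustrative, not a specific model's format.
    """
    boi = torch.full((text_ids.size(0), 1), boi_id, dtype=text_ids.dtype)
    image_shifted = image_ids + text_vocab_size
    return torch.cat([text_ids, boi, image_shifted], dim=1)

# Toy usage: 32 text tokens followed by a 16x16 = 256-token image.
text_ids = torch.randint(0, 32000, (2, 32))
image_ids = torch.randint(0, 8192, (2, 256))
seq = build_unified_sequence(text_ids, image_ids, text_vocab_size=32001, boi_id=32000)
```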

Representative Models:

  • LWM [29]: Uses a VQGAN tokenizer [32] to encode images into discrete latent codes without semantic supervision. Learns world dynamics through reconstruction-based visual tokens and textual descriptions.
  • Chameleon [30] & ANOLE [132]: Adopt VQ-IMG [222], an improved VQ-VAE variant with a deeper encoder and residual prediction for better detail preservation.
  • Emu3 [133], SynerGen-VL [136], UGen [138]: Employ SBER-MoVQGAN [220, 221], a multiscale VQGAN variant for expressive visual representations.
  • Liquid [137]: Uses a VQGAN-style tokenizer, highlighting mutual benefits of unified understanding and generation under a shared visual token representation.
  • MMAR [134], Orthus [135], Harmon [139]: Use continuous-valued image tokens (e.g., from SD-VAE or EmbeddingViT) and decouple the diffusion process from the AR backbone by employing lightweight diffusion MLPs as decoders.
  • TokLIP [140]: Integrates a low-level discrete VQGAN tokenizer with a ViT-based SigLIP [209] encoder for high-level continuous semantics, enhancing both generative capacity and semantic understanding.
  • Selftok [141]: Introduces a discrete visual self-consistency tokenizer for high-quality reconstruction and compression, optimizing for visual reinforcement learning.

Limitations:

  • Lack of Semantic Abstraction: Visual tokens optimized for pixel reconstruction often lack high-level semantics, complicating cross-modal alignment.
  • Computational Overhead: Dense token grids lead to long sequence lengths, increasing computational and memory costs for high-resolution images.
  • Modality-Specific Biases: Reconstruction-centric objectives can result in visual tokens retaining biases towards textures and low-level patterns, which may not be optimal for semantic reasoning.

4.2.2.2. Semantic Encoding (Fig. 5 (b-2) in images/4.jpg)

  • Principle: Image inputs are processed using pre-trained text-aligned vision encoders (e.g., OpenAI-CLIP [22], SigLIP [209], EVA-CLIP [36], UNIT [223]). These encoders produce visual embeddings that align closely with language features in a shared semantic space, improving cross-modal alignment.
  • Decoder: Most models pair semantic encoders with diffusion-based decoders (e.g., SD family [14], IP-adapter [227], FLUX [16], Lumina-Next [228]), trained independently from the MLLM. The MLLM produces semantic-level visual tokens, which the diffusion decoder refines into high-resolution images.
  • Training: Typically uses causal attention masks and next-token prediction loss on text and semantic visual tokens.
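
The typical pipeline above (the MLLM emits semantic-level visual tokens that condition a separately trained diffusion decoder) reduces, at its simplest, to projecting MLLM hidden states into the decoder's conditioning space; the sketch below is an illustrative assumption about that interface, not a specific model's implementation.

```python
import torch
import torch.nn as nn

class SemanticConditioner(nn.Module):
    """Project MLLM semantic visual tokens into the conditioning space of a
    separately trained diffusion decoder (dimensions illustrative)."""
    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, semantic_tokens: torch.Tensor) -> torch.Tensor:
        # semantic_tokens: (batch, num_tokens, mllm_dim) emitted by the MLLM;
        # the projected tokens feed the diffusion decoder's cross-attention.
        return self.proj(semantic_tokens)

cond = SemanticConditioner()(torch.randn(2, 64, 4096))   # -> (2, 64, 768)
```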

Representative Models:

  • Emu [142], Emu2 [33], LaViT [143]: Employ EVA-CLIP [36] as the vision encoder. Emu unifies VQA, image captioning, and image generation. Emu2 scales the MLLM to 37B parameters. LaViT uses dynamic visual tokenization.
  • DreamLLM [34], VL-GPT [35], MM-Interleaved [144], PUMA [147]: Utilize OpenAI-CLIP [22]. DreamLLM uses a linear projection. VL-GPT employs a causal Transformer. MM-Interleaved and PUMA extract multi-granular features.
  • Mini-Gemini [145]: Uses dual semantic encoders (CLIP for global features, RQ-VAE for dense local features) with a cross-attention module to refine global tokens.
  • MetaMorph [148]: Employs SigLIP [209] and introduces modality-specific adapters within the LLM.
  • ILLUME [149]: Adopts UNIT [223] as its vision encoder, which is trained with both image reconstruction and contrastive alignment losses for balanced representation.
  • VILA-U [146], UniTok [150]: Mimic UNIT to obtain novel text-aligned vision tokenizers.
  • QLIP [151]: Addresses reconstruction vs. text-image alignment conflict using binary-spherical quantization.
  • Tar [157]: Leverages LLM vocabulary for visual codebook, uses scale-adaptive pooling, and employs diffusion for visual generation enhancement.
  • UniFork [153]: Shares shallow-layer parameters for understanding/generation, but uses distinct deeper networks.
  • UniCode2 [154]: Employs a cascaded codebook (frozen foundational codebook from clustered SigLIP features + supplementary learnable codebooks).
  • DualToken [152]: Uses shallow SigLIP features for reconstruction and deep SigLIP features for semantic learning.
  • X-Omni [160]: Uses SigLIP-VQ and reinforcement learning to mitigate errors and information loss.
  • Qwen-Image [161], OmniGen2 [158], Ovis-U1 [159], Bifrost-1 [162], UniWorld [155], Pisces [156]: Integrate diffusion models (e.g., MMDiT, OmniGen, FLUX, DiT, SD-1.5) as decoders, using the MLLM's visual comprehension outputs as conditions.

Limitations:

  • Limited Pixel-level Control: Semantic embeddings lack spatial density for fine-grained editing (inpainting, structure-preserving transformations).
  • Insufficient Spatial Correspondence: Global/mid-level representations may be inadequate for tasks requiring precise spatial alignment (e.g., referring expression segmentation).
  • Mismatch between MLLM and Decoder: Separate training of semantic encoder and diffusion decoder can lead to semantic drift or generation artifacts due to lack of end-to-end optimization.

4.2.2.3. Learnable Query Encoding (Fig. 5 (b-3) in images/4.jpg)

  • Principle: Introduces a set of learnable query tokens that dynamically extract informative content from image features, creating compact and semantically aligned embeddings. These queries act as content-aware probes interacting with visual encoders.
  • Training: Usually involves image-text contrastive learning and/or image reconstruction supervision.
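
A minimal sketch of learnable query encoding in the spirit of a Q-Former: a fixed set of trainable query tokens cross-attends to frozen vision-encoder features to yield a compact visual representation. Sizes and module choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnableQueryEncoder(nn.Module):
    """Learnable query tokens cross-attend to frozen vision features
    (illustrative sketch; dimensions and head count are assumptions)."""
    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, dim) from a frozen encoder (e.g. a ViT).
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, vision_feats, vision_feats)
        return out   # (batch, num_queries, dim) compact visual tokens

tokens = LearnableQueryEncoder()(torch.randn(2, 196, 768))
```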

Representative Models:

  • SEED [163], SEED-LLAMA [164], SEED-X [165]:
    • Concept: Use a seed tokenizer to learn causal visual embeddings.
    • Process: Input image encoded by BLIP-2 ViT encoder [53] into dense token features. These features are concatenated with learnable query tokens and processed by a causal Q-Former.
    • Decoders: SEED-LLAMA and SEED-X upgrade to UnCLIP-SD [14] or SDXL [226] decoders.
  • MetaQueries [166], OpenUni [170], Nexus-Gen [167], Ming-Lite-Uni [168], Ming-Omni [171], BLIP3-o [169], UniLIP [172], TBAC-UniImage [173]:
    • Concept: Simplified learnable query encoding where image features (e.g., from SigLIP [209]) are extracted by a frozen encoder, concatenated with learnable query tokens, and directly passed through a frozen vision-language backbone (e.g., LLaVA, Qwen2.5-VL).
    • Conditioning: Output causal embeddings condition a diffusion decoder.
    • Innovations: Nexus-Gen uses a more powerful diffusion decoder (FLUX.1-dev). Ming-Lite-Uni and Ming-Omni use advanced MLLMs (M2-omni [200]), multi-scale learnable tokens, and multi-scale representation alignment. Ming-Omni also integrates audio and uses a dual-stage training strategy. BLIP3-o explores a flow matching loss for better image quality. UniLIP incorporates reconstruction ability into CLIP tokens. TBAC-UniImage applies learnable queries at multiple MLLM layers.

Limitations:

  • Computational Overhead: Increased complexity with more query tokens (memory, computation).
  • Fixed Encoder Limitations: Reliance on fixed encoders can limit adaptability to novel visual inputs.
  • Adaptability of Visual Features: Frozen backbones can restrict dynamic alignment of image features with evolving query semantics.
  • Coverage of Diverse Content: A small, fixed query set may not adequately capture the richness of complex scenes or fine-grained details, especially for highly detailed outputs.

4.2.2.4. Hybrid Encoding

  • Principle: Combines the strengths of pixel-based encoding (preserving fine-grained visual details) and semantic-based encoding (providing high-level semantic abstraction).

4.2.2.4.1. Pseudo Hybrid Encoding (Fig. 5 (b-4) in images/4.jpg)

  • Principle: Uses dual encoders—typically a semantic encoder (e.g., SigLIP) and a pixel encoder (e.g., VQGAN or VAE)—but activates them in a task-specific manner. The semantic encoder is for understanding tasks, and the pixel encoder for generation tasks.
  • Decoders: Typically engage pixel decoders to reconstruct images from latent codes.
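
Pseudo hybrid encoding as described above amounts to task-dependent routing between two encoders; the following is an illustrative sketch of that routing, with the encoder modules assumed to be supplied externally.

```python
import torch
import torch.nn as nn

class PseudoHybridEncoder(nn.Module):
    """Task-dependent routing between a semantic and a pixel encoder
    (both are trained, but only one is active per task at inference)."""
    def __init__(self, semantic_encoder: nn.Module, pixel_encoder: nn.Module):
        super().__init__()
        self.semantic_encoder = semantic_encoder   # e.g. a SigLIP-style encoder
        self.pixel_encoder = pixel_encoder         # e.g. a VQGAN/VAE-style encoder

    def forward(self, image: torch.Tensor, task: str) -> torch.Tensor:
        if task == "understanding":
            return self.semantic_encoder(image)    # high-level semantic tokens
        return self.pixel_encoder(image)           # reconstruction-oriented tokens
```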

Representative Models:

  • Janus [174], Janus-Pro [175], OmniMamba [176], Unifluid [177], MindOmni [178], Skywork UniPic [179]:
    • Concept: Train dual encoders concurrently with combined understanding and generation datasets, but one encoder is disabled during inference for the other task.
    • Example: For understanding, the semantic encoder is active; for text-to-image generation, the pixel encoder is active. For image editing, Unifluid uses the semantic encoder, while MindOmni uses both VAE and semantic encoder to encode the source image.
    • Limitation: Underutilizes potential synergy because both encoders are not leveraged simultaneously during inference for a single task. Lack of semantic grounding in generation tasks and high-fidelity details in understanding tasks.

4.2.2.4.2. Joint Hybrid Encoding (Fig. 5 (b-5) in images/4.jpg)

  • Principle: Integrates both semantic and pixel tokens into a single unified input for the language model or decoder, enabling simultaneous utilization of both representations. These models differ in their fusion strategies.
  • Decoders: Support both pixel decoders (e.g., VQGAN, Infinity [231], VAR-D30 [113]) and diffusion-based decoders (e.g., SDXL [226]).
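
The two fusion strategies mentioned above (channel-wise versus sequence-wise concatenation of semantic and pixel tokens) can be sketched as follows; the shapes are illustrative, and the style attributions follow the model descriptions in this subsection.

```python
import torch

def fuse_channel(semantic: torch.Tensor, pixel: torch.Tensor) -> torch.Tensor:
    """Channel-wise fusion (MUSE-VL / UniToken style): features for the same
    spatial positions are concatenated along the feature dimension."""
    return torch.cat([semantic, pixel], dim=-1)   # (b, n, d_sem + d_pix)

def fuse_sequence(semantic: torch.Tensor, pixel: torch.Tensor) -> torch.Tensor:
    """Sequence-wise fusion (VARGPT / ILLUME+ style): both token streams are
    kept and concatenated along the sequence dimension."""
    return torch.cat([semantic, pixel], dim=1)    # (b, n_sem + n_pix, d)
```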

Representative Models:

  • MUSE-VL [180], UniToken [186]: Concatenate features from SigLIP and VQGAN along the channel dimension before passing them to the LLM.
  • Tokenflow [181]: Incorporates dual encoders and codebooks with a shared mapping for joint optimization of high-level semantics and low-level pixel details.
  • VARGPT [182], VARGPT-1.1 [184], ILLUME+ [185]: Concatenate semantic and pixel tokens along the sequence dimension, maintaining both types in the LLM's input.
  • SemHiTok [183]: Introduces Semantic Guided Hierarchical Codebook (SGHC) to inherit semantic information while incorporating texture information for pixel reconstruction.
  • Show-o2 [187]: Uses separate network branches for processing latent features from 3DVAE [230], aggregating outputs with a spatial-temporal fusion module.

Limitations:

  • Increased Complexity: Dual-encoder architectures and token fusion increase computational costs and training times.
  • Alignment Challenges: Ensuring effective alignment between semantic and pixel-level features is difficult.
  • Trade-offs: Balancing understanding and generation objectives can lead to compromises in performance.
  • Implicit Alignment Assumption: Assumes implicit alignment between pixel and semantic tokens, which is non-trivial and can lead to conflicting supervision signals.

4.2.3. Fused Autoregressive and Diffusion Models

This paradigm combines autoregressive generation for text tokens with multi-step denoising via diffusion for image tokens. This allows for compositional reasoning (AR) and high visual quality (diffusion).

4.2.3.1. Pixel-based Encoding (Fig. 5 (c-1) in images/4.jpg)

  • Principle: Images are transformed into discrete tokens or continuous latent vectors (e.g., via SD-VAE), which become targets for a diffusion-based denoising process conditioned on autoregressively generated text.
  • Training: Combines autoregressive loss for language modeling and diffusion loss for image reconstruction. Uses bidirectional attention for spatial coherence.
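
A minimal sketch of the combined training objective described above: next-token cross-entropy on text positions plus a noise-prediction (diffusion) loss on image latents, as in Transfusion-style joint training. The loss weighting and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fused_training_loss(text_logits: torch.Tensor, text_targets: torch.Tensor,
                        eps_pred: torch.Tensor, eps_true: torch.Tensor,
                        lambda_img: float = 1.0) -> torch.Tensor:
    """Autoregressive loss on text tokens + diffusion (noise-prediction) loss on
    image latents; the relative weight lambda_img is an illustrative choice."""
    ar_loss = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                              text_targets.reshape(-1))
    diff_loss = torch.mean((eps_pred - eps_true) ** 2)
    return ar_loss + lambda_img * diff_loss
```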

Representative Models:

  • Transfusion [38], MonoFormer [37], LMFusion [188]:
    • Concept: Use continuous latent representations extracted via SD-VAE.
    • Architectural Innovations: Transfusion proposes a unified Transformer backbone with modality-specific layers. MonoFormer introduces a compact architecture with shared blocks and task-dependent attention masking. LMFusion uses a lightweight visual injection module for frozen LLMs.
  • Show-o [39]:
    • Concept: Employs a discrete pixel-based tokenizer based on MAGVIT-v2 [213], generating symbolic image tokens compatible with Transformer-style decoding.
    • Process: Supports AR-based text token generation and diffusion-based image synthesis.

Limitations:

  • Computational Overhead: Continuous latent spaces and iterative diffusion sampling are resource-intensive.
  • Text-Visual Alignment: Latent representations from unsupervised reconstruction objectives might not be optimally aligned with semantic language tokens, affecting controllability.
  • Discrete Tokenization Issues: Codebook collapse and limited capacity for subtle visual nuances in VQ-based discrete tokens.

4.2.3.2. Hybrid Encoding (Fig. 5 (c-2) in images/4.jpg)

  • Principle: Fuses both semantic features (e.g., from CLIP or ViT encoders) and pixel-level latents (e.g., from SD-VAE) for a more expressive image representation.
  • Decoder: Usually rectified flow based.

Representative Models:

  • Janus-flow [189], Mogao [190], BAGEL [191]:
    • Concept: Adopt a dual-encoder architecture and use rectified flow for generation.
    • Encoders: Decouple understanding and generation encoders. SigLIP or SigLIP + SDXL-VAE for understanding; SDXL-VAE or FLUX-VAE for image generation.
    • Limitation: The pseudo hybrid encoding design limits simultaneous leveraging of semantic and pixel-level features during generation, as only the pixel encoder is active during image synthesis.

Challenges:

  • Increased Model Complexity: Dual-encoder architectures and combined AR/diffusion processes lead to higher computational costs and longer training.

  • Effective Alignment: Difficult to ensure effective alignment between semantic and pixel-level features.

  • Trade-offs: Balancing understanding and generation objectives often leads to compromises.

    The VLM description of images/4.jpg (Fig. 5 in the paper) visually summarizes this entire categorization, showing the flow from modality encoders through different backbones (Diffusion, MLLM (AR), MLLM (AR + Diffusion)) and their respective modality decoders and masking strategies.

4.2.4. Any-to-Any Multimodal Models

  • Principle: Aims to process and generate across a diverse set of modalities beyond text and image, including audio, video, speech, and music. These models unify modality-specific encoders and decoders within a single architecture, using a shared backbone for cross-modal representation learning and sequence modeling.

Representative Models:

  • Modular Designs:
    • OmniFlow [199]: Integrates HiFiGen [232] for audio/music, SD-VAE [14] for images, and a DiT-like diffusion model (MMDiT) [15] as the backbone.
  • Shared Embedding Spaces:
    • Spider [198], X-VILA [196], Next-GPT [192]: Leverage ImageBind to map six modalities (text, image, video, audio, depth, thermal) into a single embedding space, then use modality-specific decoders (e.g., Stable Diffusion, Zeroscope, LLM-based text decoders).
  • Sequence-to-Sequence Paradigm:
    • AnyGPT [195]: Uses EnCodec [233] for audio, SpeechTokenizer [234] for speech, and trains a unified Transformer with modality-specific prefixes.
    • Unified-IO 2 [193]: Structured encoder-decoder design for visual, audio, and language, supporting tasks like AST-to-text or speech-to-image.
  • M2-omni [200]:
    • Concept: Highly versatile architecture for text, image, video, and audio.
    • Tokenizers/Decoders: Uses NaViT [235] for video/image encoding, SD-3 [226] as image decoder. For audio, paraformer-zh [236] extracts tokens, and CosyVoice [237] flow matching/vocoder model generates audio streams.

Challenges:

  • Modality Imbalance: Text and image modalities often dominate, limiting diversity.
  • Scalability: Supporting many modalities increases complexity, latency, and resource requirements.
  • Semantic Consistency: Maintaining grounded and aligned outputs across diverse modalities is difficult.

5. Experimental Setup

As a survey paper, this work does not present novel experimental results from its own methodology. Instead, it systematically reviews the datasets and benchmarks used by the models it surveys. This section will summarize those resources as presented in the paper.

5.1. Datasets

The paper emphasizes that large-scale, high-quality, and diverse training data are crucial for building powerful unified multimodal models. It categorizes these datasets by their primary use and modality characteristics, focusing on those released from 2020 onwards. It also notes that before multimodal training, many models are initialized with parameters from large-scale natural language corpora (e.g., Common Crawl, RedPajama, WebText), but these text-only datasets are excluded from the detailed discussion.

5.1.1. Multimodal Understanding Datasets

These datasets train models for tasks like image captioning, visual question answering (VQA), image-text retrieval, and visual grounding. They consist of images paired with textual descriptions.

  • RedCaps [238]: Contains 12 million image-text pairs sourced from Reddit, focusing on everyday items and moments.
  • Wukong [239]: A large-scale Chinese multimodal pre-training dataset with 100 million Chinese image-text pairs filtered from the web, addressing the need for Chinese multimodal data.
  • LAION [240]: One of the largest publicly available image-text pair datasets. LAION-5B contains nearly 6 billion image-text pairs crawled from the web, filtered using CLIP models for relevance. Laion-COCO [242], a subset, contains 600 million high-quality samples stylistically closer to MS COCO.
  • COYO [241]: Another large-scale image-text pair dataset with approximately 747 million samples, sourced from web crawls.
  • DataComp [243]: Comprises 1.4 billion samples derived from Common Crawl, using CLIP score and Image-based filtering for higher quality image-text pairs.
  • ShareGPT4V [246]: Provides approximately 100K high-quality image-text conversational data points, designed to enhance instruction-following and dialogue capabilities.
  • ALLaVA [216]: A synthetic dataset with 1.4 million samples, generated using proprietary models like GPT-4V to create fine-grained captions and complex reasoning VQA pairs, for training resource-friendly Lite Vision-Language Models (LVLMs).
  • CapsFusion-120M [245]: A collection of 120M image-text pairs from LaionCOCO, with captions integrated using CapsFusionLLaMA.
  • Cambrian-10M(7M) [247]: A large-scale dataset for multimodal instruction tuning, with an initial 10M samples, refined to 7M after quality filtering.
  • LLaVA-OneVision [248]: A visual instruction tuning collection with 3.2 million single-image samples (QA, OCR, math) and 1.6 million mixed-modal samples (video, multi-image).
  • Infinity-MM [249]: A comprehensive multimodal training dataset with over 40 million samples, collected from existing open-source datasets and newly generated data, including image captions, general/selective visual instructions, and GPT-4 or VLM-based pipeline synthesized data.
  • Other Datasets: GRIT (Grid-based Representation for Image-Text) [244] (20M samples) focuses on fine-grained image region-text phrase alignment. SAM Dataset [251] (11M high-resolution images with segmentation masks) can enhance fine-grained understanding.

5.1.2. Text-to-Image Datasets

These datasets train models to generate images corresponding to textual descriptions, often with an emphasis on aesthetic quality, content richness, or stylistic attributes.

  • CC-12M (Conceptual Captions 12M) [250]: About 12 million image-text pairs from web Alt-text, known for concise and descriptive captions.
  • LAION-Aesthetics [240]: A subset of LAION with 120 million images and texts, filtered using an aesthetic scoring model.
  • Text Rendering Datasets:
    • Mario-10M [252]: 10 million samples used to train TextDiffuser for improved text placement and legibility.
    • RenderedText [253]: 12 million high-resolution synthetic images of handwritten text for understanding and generation.
    • AnyWord-3M [255]: 3 million samples crucial for AnyText [255] and enhancing generated text quality.
    • TextAtlas5M [265]: Targets dense text generation with interleaved documents, synthetic data, and real-world images with long captions.
  • JourneyDB [254]: 4 million high-quality image-prompt pairs generated by Midjourney, valuable for training models for complex, detailed, and artistic text-to-image mappings.
  • CosmicMan-HQ 1.0 [256]: 6 million high-quality real-world human images with precise text annotations (from 115 million attributes), for generating human images.
  • DOCCI [257]: 15K curated images with long, human-annotated English descriptions (avg. 136 words), for fine-grained details and complex compositions.
  • PixelProse [258]: Extracted from DataComp, CC-12M, and RedCaps, with richly annotated images and metadata (watermark, aesthetic scores).
  • Megalith [260]: 10 million links to Flickr images (photo category, no copyright restrictions) with community-made captions (ShareCaptioner, Florence2, InternVL2).
  • PD12M [262]: 12.4 million high-quality public domain/CC0-licensed images paired with synthetic captions from Florence-2-large [294].
  • Synthesized Datasets:
    • text-to-image-2M [261]: 2 million enhanced text-image pairs for fine-tuning, curated by T2I and captioning models.
    • SFHQ-T2I [263]: 122K diverse, high-resolution synthetic face images generated by multiple T2I models.
    • EliGen TrainSet [264]: Uses images from FLUX.1-dev and MLLM-generated prompts for stylistic consistency and detailed annotation.
    • BLIP3-o 60k [169]: 60,000 instruction-tuning samples distilled from GPT-4o.
    • ShareGPT4o-Image [266]: 45K text-to-image pairs, with prompts generated via attribute-first and image-first approaches, and images synthesized by GPT-4o.
    • Echo-4o-Image [267]: Over 100K samples targeting surreal fantasy scenarios and complex instructions to enhance model imagination.
  • Other Datasets: SAM dataset [251] and DenseFusion [259] (1M samples).

5.1.3. Image Editing Datasets

These datasets train models to alter input images according to textual commands, comprising (source image, editing instruction, target image) triplets.
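
To make the triplet format concrete, the following is a minimal Python sketch; the field names and the to_training_example helper are illustrative assumptions, not the actual schema of InstructPix2Pix, MagicBrush, or any other dataset listed below.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class EditingTriplet:
    """One training record for instruction-based image editing."""
    source_image: Image.Image   # image before editing
    instruction: str            # e.g. "make the sky look like a sunset"
    target_image: Image.Image   # image after the edit is applied

def to_training_example(triplet: EditingTriplet) -> dict:
    # An editing model typically conditions on (source image, instruction)
    # and is supervised to reproduce the target image.
    return {
        "condition": {"image": triplet.source_image, "text": triplet.instruction},
        "label": triplet.target_image,
    }
```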

  • InstructPix2Pix [268]: 313K (instruction, input image, output image) samples generated synthetically using GPT-3 and Stable Diffusion to create before-and-after image pairs based on instructions.
  • MagicBrush [269]: 10K high-quality, manually annotated samples for instruction-based image editing (object manipulation, style transfer, with masks).
  • HIVE [270]: Introduces human feedback, providing a 1.1M training dataset (similar to InstructPix2Pix generation) and a 3.6K reward dataset for human ranking.
  • EditWorld [272]: Focuses on "world-instructed image editing"; data is curated both via GPT-3.5 and T2I models for complex input-output pairs and via video-frame extraction paired with VLM-generated instructions.
  • PromptFix [274]: 1.01M triplets focusing on low-level image processing tasks (inpainting, dehazing, super-resolution).
  • HQ-Edit [271], SEED-Data-Edit [165], UltraEdit [273], OmniEdit [275], AnyEdit [276]: Large-scale datasets (e.g., SEED-Data-Edit: 3.7M, UltraEdit: 4M, AnyEdit: 2.5M, OmniEdit: 1.2M, HQ-Edit: 197K) often combining automated generation with human filtering for diversity and quality.
  • RefEdit [277]: Synthetic dataset targeting instruction-based editing with referring expressions, generated using GPT-4o for text, FLUX for images, Grounded SAM for masks, and specialized models for controlled edits.
  • ImgEdit [278]: 1.2M edit pairs for high-quality single/multi-turn editing, using a multi-stage pipeline with LAION-Aesthetics, VLMs, detection/segmentation models, state-of-the-art generative models (FLUX, SDXL), and GPT-4o for quality filtering.
  • ByteMorph-6M [279]: Over 6 million image editing pairs for non-rigid motions (camera viewpoint, object deformations), constructed by VLM-generated motion captions, image-to-video models, and LLM-generated instructions for frame transformations.
  • ShareGPT-4o-Image (Editing) [266]: 46K instruction-guided image editing triplets, using GPT-4o to synthesize edited images from source images and LLM-generated instructions.
  • GPT-Image-Edit-1.5M [280]: Over 1.5 million high-quality instruction-guided image editing triplets, leveraging GPT-4o to unify and refine existing datasets (OmniEdit, HQ-Edit, UltraEdit) by regenerating outputs and rewriting prompts.
  • X2Edit [281]: 3.7 million samples for arbitrary-instruction image editing, balanced across 14 tasks, constructed via VLM-generated task-aware instructions and expert generative models, with rigorous filtering.

5.1.4. Interleaved Image-Text Datasets

These datasets contain documents or sequences where text and images naturally follow each other, enhancing models' ability to comprehend and generate multimodal content in context.
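
As a rough illustration, an interleaved document can be thought of as an ordered list of text and image segments. The sketch below is an assumed, simplified schema for such a record; MMC4, OBELICS, and similar corpora each define their own fields.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

@dataclass
class Segment:
    kind: Literal["text", "image"]
    text: Optional[str] = None       # populated when kind == "text"
    image_url: Optional[str] = None  # populated when kind == "image"

@dataclass
class InterleavedDocument:
    segments: List[Segment] = field(default_factory=list)

# A toy document: text -> image -> text, in reading order.
doc = InterleavedDocument(segments=[
    Segment(kind="text", text="Step 1: whisk the eggs until foamy."),
    Segment(kind="image", image_url="https://example.com/eggs.jpg"),
    Segment(kind="text", text="Step 2: fold in the flour."),
])
```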

  • Multimodal C4 (MMC4) [282]: Augments the text-only C4 corpus with over 101 million documents and 571 million images, algorithmically interleaved from Common Crawl.
  • OBELICS [283]: 141 million multimodal web documents from Common Crawl, featuring 353 million images interleaved with 115 billion text tokens, focusing on full document structure.
  • CoMM [284]: 227K high-quality, curated samples focusing on the coherence and consistency of interleaved image-text sequences, primarily from instructional and visual storytelling websites.
  • OmniCorpus [285]: A very large-scale (10 billion-level) image-text interleaved dataset, with 8.6 billion images and 1,696 billion text tokens from 2.2 billion documents, extracted and filtered from diverse sources (web, video platforms) with human-feedback filtering.

5.1.5. Other Text+Image-to-Image Datasets

These datasets enhance specific capabilities like generating images based on provided subject images or control signals.

  • LAION-Face [286]: 50 million image-text pairs used for ID-preserving image generation.
  • MultiGen-20M [287]: 20 million samples to train models for unified image generation conditioned on multiple control signals (text, edge maps, depth maps, segmentation masks, sketches), such as UniControl [287]. Data is structured as triplets (e.g., "depth map, instruction with prompt, target image").
  • Subjects200K [288]: 200K samples for subject-driven image generation, synthetically generated using LLM (ChatGPT-4o) for structured descriptions and FLUX [16] for image synthesis, with LLM quality assessment.
  • SynCD [289]: Synthetic Customization Dataset with 95K sets of images for text + image-to-image customization, generated using existing T2I models and 3D asset datasets (Objaverse [296]) for multiple consistent views of objects.
  • X2I-subject-driven [84]: Facilitates subject-driven generation via GRIT-Entity dataset (objects detected/segmented from GRIT, optionally repainted by MS-Diffusion [297]) and Web-Images (notable individuals identified via text analysis/LLM, web images scraped, visually verified, and captioned).
  • Graph200K [290]: A graph-structured dataset built on Subjects200K, augmenting images with 49 types of annotations across five meta-tasks (conditional generation, IP preservation, style transfer, image editing, restoration), enabling models to learn shared knowledge for universal image generation.
  • Echo-4o-Image (Multi-Reference) [267]: 73K synthetic samples for "Multi-to-one" generation, with diverse instructions and rich reference information for multi-reference image composition.

5.1.6. Data Synthesis Pipelines

The paper notes the growing reliance on data synthesis to address the shortage of high-quality public data for complex tasks.

  • Data synthesis from images: Uses models like BLIP-2 [53] or Kosmos2 [244] for initial captioning and grounding, followed by object detection (Grounding DINO [300]) and segmentation (SAM [251]) to extract subject masks and region captions (a pseudocode sketch follows this list).
  • Data synthesis from videos: Alleviates "copy-paste" issues by extracting subjects from different frames using video segmentation models (e.g., SAM2 [301]), also enabling training data for image editing.
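
The image-based pipeline described above can be sketched as follows. The helper functions are hypothetical placeholders standing in for a captioner (e.g., BLIP-2 or Kosmos-2), an open-vocabulary detector (e.g., Grounding DINO), and a promptable segmenter (e.g., SAM); the sketch only illustrates the data flow, not any specific implementation.

```python
from typing import Dict, List
from PIL import Image

def caption_image(image: Image.Image) -> str:
    """Placeholder for a captioning/grounding model such as BLIP-2 or Kosmos-2."""
    raise NotImplementedError

def detect_regions(image: Image.Image, phrases: List[str]) -> List[Dict]:
    """Placeholder for an open-vocabulary detector such as Grounding DINO.
    Returns entries of the form {"phrase": str, "box": (x0, y0, x1, y1)}."""
    raise NotImplementedError

def segment_region(image: Image.Image, box) -> Image.Image:
    """Placeholder for a promptable segmenter such as SAM (box prompt -> mask)."""
    raise NotImplementedError

def synthesize_from_image(image: Image.Image) -> Dict:
    # 1) A global caption provides the text side of the pair.
    caption = caption_image(image)
    # 2) Candidate phrases from the caption are grounded to boxes (naive split shown here).
    phrases = [p.strip() for p in caption.split(",") if p.strip()]
    regions = detect_regions(image, phrases)
    # 3) Each grounded box is segmented to obtain a subject mask / crop.
    subjects = [
        {"phrase": r["phrase"], "mask": segment_region(image, r["box"]), "box": r["box"]}
        for r in regions
    ]
    # The record can then supply (subject image, region caption, full image)
    # supervision for subject-driven generation or editing.
    return {"image": image, "caption": caption, "subjects": subjects}
```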

5.2. Evaluation Metrics

The paper discusses various benchmarks and evaluation suites rather than individual low-level metrics. These benchmarks are designed to assess the capabilities of unified multimodal models across understanding, generation, and interleaved tasks.

5.2.1. Evaluation on Understanding

5.2.1.1. Perception

These benchmarks evaluate how accurately models connect visual inputs with linguistic descriptions through grounding, recognition, and retrieval.

  • Flickr30k [365]: An early benchmark for image-text retrieval and captioning.
  • MS COCO Captions [366]: Evaluates models' ability to retrieve relevant captions and localize textual phrases to image regions.
  • VQA [302], VQA v2 [303]: Visual Question Answering benchmarks where models answer free-form queries about objects, attributes, and relationships in images.
  • VisDial [308]: Visual Dialog, assessing models' ability to engage in multi-turn conversations about an image.
  • TextVQA [310]: Focuses on Visual Question Answering where questions involve reading text within images.
  • ChartQA [309]: Assesses understanding of structured charts and graphs, combining visual perception with quantitative reasoning.
  • VSR [8]: Probes spatial relation reasoning in real-world images.
  • MMBench [316]: Supplies 3K bilingual multiple-choice questions spanning grounding, recognition, and retrieval for cross-lingual comparison.
  • MMMU [317]: Adds ~11.5K college-level multimodal problems across six disciplines to probe domain knowledge and logical deduction.
  • HaluEval [312]: Diagnoses hallucination recognition on diverse model-generated and annotated statements.
  • MM-Vet [318] & MM-Vet v2 [319]: Covers recognition, OCR, spatial reasoning, math, and open question answering, with v2 evaluating interleaved image-text sequences.
  • SEED-Bench [321]: Offers 19K multi-choice items over 12 dimensions, generated via a pipeline targeting specific evaluation dimensions.
  • LLaVA-Bench [314]: Provides COCO and in-the-wild image sets with dense queries for generalization checks.
  • LAMM [313]: Supplies instruction-tuning examples covering 2D and 3D modalities for agent development.
  • Open-VQA [322]: Formulates hierarchical follow-up questions to refine coarse VQA answers.
  • OwlEval [315]: Offers human-rated open-ended visual questions assessing relevance and informativeness.
  • MMStar [320]: Curates balanced challenge samples across six core skills and 18 axes for high-precision evaluation.

5.2.1.2. Reasoning

These benchmarks probe progressively richer cognitive skills, requiring logical inference, commonsense, and explanation.

  • CLEVR [304]: Systematically varies object attributes and spatial relations, testing counting, comparison, and relational logic.
  • GQA [305]: Leverages dense scene graphs to generate compositional questions for testing consistency, grounding, and plausibility in natural images.
  • OK-VQA [306] & A-OKVQA [311]: Commonsense extensions requiring world knowledge retrieval or inference for answers.
  • VCR [307]: Demands a model not only to answer but also to justify it by selecting a coherent rationale, coupling recognition with explanation.
  • MathVista [323]: Broadens scope to mathematical problem solving in visually grounded contexts, combining fine-grained visual understanding with symbolic manipulation.
  • General-Bench [324]: An ultra-large benchmark comprising over 700 tasks and 325,800 instances across varied modalities and capabilities, for multimodal generalist models.

5.2.2. Evaluation on Image Generation

5.2.2.1. Text-to-Image Generation

These benchmarks emphasize compositional reasoning, prompt alignment, and real-world applicability beyond basic image quality metrics like FID [367] and CLIPScore [22].
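
For context, the sketch below shows a CLIPScore-style alignment metric of the kind these benchmarks aim to go beyond: the prompt and the generated image are embedded with a CLIP model and scored by their clipped cosine similarity, scaled by a constant w as in the original CLIPScore formulation. The specific checkpoint name is only a common choice, not one mandated by any benchmark here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str, w: float = 2.5) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)  # clipped cosine, scaled as in the CLIPScore formulation
```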

  • PaintSkills [326], DrawBench [72], PartiPrompts [325]: Evaluate core compositional capabilities.
  • GenEval [329]: Evaluates six fine-grained tasks (single-object generation, object co-occurrence, counting, color control, relative positioning, attribute binding) by comparing outputs with pre-trained detectors.
  • GenAI-Bench [334]: Presents 1.6K human prompts covering relational, logical, and attribute-based categories, combining human preference with automated alignment scores.
  • HRS-Bench [327]: Evaluates 13 distinct skills grouped into five categories: accuracy, robustness, generalization, fairness, and bias.
  • DPG-Bench [336]: Focuses on dense prompts describing multiple objects with various attributes and relationships.
  • T2I-CompBench [330] & T2I-CompBench++ [337]: Specifically target compositional generalization using detector-based scoring.
  • VISOR [368]: Proposes an automatic method for evaluating spatial understanding.
  • Commonsense-T2I [332]: Challenges models to depict everyday concepts requiring commonsense grounding.
  • EvalMuse40K [369]: Provides 40K crowdsourced prompts focusing on nuanced concept representation.
  • HEIM [331]: Identifies 12 aspects including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
  • FlashEval [370]: Shrinks large-scale evaluation sets to accelerate benchmark testing.
  • MEMOBench [371]: Comprehensive benchmark for emotional understanding and expression capabilities.
  • ConceptMix [335]: Evaluates compositional generation by sampling k-tuples of visual concepts and automatically verifying concept presence.
  • TIFA [328]: Fine-grained benchmark for evaluating text-to-image faithfulness via VQA (see the sketch after this list).
  • DSG-1k [333]: Refines VQA questions using a multi-level semantic graph for enhanced image-prompt alignment.
  • MMIG-Bench [338]: Multi-dimensional assessment framework for T2I models.
  • OneIG-Bench [339]: Comprehensive fine-grained evaluation framework for T2I models.
  • WISE [340] & WorldGenBench [342]: Evaluate world knowledge understanding, emphasizing semantic consistency, realism, and aesthetics.
  • CVTG-2K [341]: Evaluates visual-text generation on complex multi-region layouts, diverse text attributes, and fine-grained positioning.
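
As referenced under TIFA above, VQA-based faithfulness scoring follows a simple protocol: derive question-answer pairs from the prompt, answer them with a VQA model on the generated image, and report the fraction answered as expected. The sketch below uses hypothetical placeholder functions rather than the benchmarks' released tooling.

```python
from typing import List, Tuple
from PIL import Image

def generate_qa_pairs(prompt: str) -> List[Tuple[str, str]]:
    """Placeholder: an LLM turns the prompt into (question, expected answer) pairs,
    e.g. "a red bicycle next to a bench" -> [("What color is the bicycle?", "red"), ...]."""
    raise NotImplementedError

def vqa_answer(image: Image.Image, question: str) -> str:
    """Placeholder: any visual question answering model."""
    raise NotImplementedError

def faithfulness_score(image: Image.Image, prompt: str) -> float:
    qa_pairs = generate_qa_pairs(prompt)
    correct = sum(
        vqa_answer(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / max(len(qa_pairs), 1)  # fraction of questions answered as expected
```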

5.2.2.2. Image Editing

Benchmarks for instruction-guided image editing have grown in scale and scope.

  • MagicBrush [269]: Large-scale, manually annotated dataset for instruction-guided real image editing (single-turn, multi-turn, mask-provided, mask-free).
  • HQ-Edit [271]: ~200K high-resolution edits with computed alignment and coherence scores, using GPT-4V.
  • I2EBench [347]: Consolidates over 2K images and 4K multi-step instructions across 16 editing dimensions.
  • EditVal [344]: Standardized benchmark with fine-grained edit annotations and an automated evaluation pipeline aligned with human judgment.
  • EmuEdit [345]: Covers seven editing tasks (background changes, object-level edits, style modifications) with paired instructions.
  • Reason-Edit [346]: Diagnostic benchmark targeting causal and counterfactual reasoning (object relations, attribute dependencies, multi-step inference).
  • EditBench [343]: Diagnostic benchmark for text-guided image inpainting with masked input/reference pairs.
  • HumanEdit [348]: 5,751 high-resolution images and open-form instructions across six edit types, with annotated masks and multi-stage human feedback.
  • IE-Bench [349]: Human-aligned benchmark for text-driven image editing quality.
  • GEdit-Bench [350]: 606 real-world instruction-image pairs.
  • CompBench [351]: Decomposes edits into location, appearance, dynamics, and object dimensions via MLLM and human collaboration.
  • GIE-Bench [352]: Uses multi-choice VQA and object-aware masking to evaluate editing accuracy and content preservation.
  • EditInspector [353], ComplexBench-Edit [354]: Undertake comprehensive evaluation of text-guided image editing, assessing vision consistency, artifact detection, instruction adherence, quality, and detail preservation.
  • ByteMorph-Bench [279]: Tackles non-rigid image manipulation.
  • RefEdit-Bench [277]: Evaluates referring-expression-based edits in complex multi-entity scenes.
  • ImgEdit-Bench [278]: Unified image editing dataset and benchmark.
  • KRIS-Bench [355]: Cognitively grounded suite assessing factual, conceptual, and procedural reasoning.

5.2.2.3. Other Types of Image Generation

  • MultiGen-20M [287]: 20 million image-prompt-condition triplets (LAION-Aesthetics-V2 [372]) for automated evaluation across diverse visual conditions.
  • DreamBench [362]: Benchmarks personalized generation using 30 reference objects with curated prompts and human fidelity ratings.
  • DreamBench++ [363]: Scales to 150 subjects and 1,350 prompts, using advanced VLMs for human-aligned scoring of concept preservation, composition, and style.
  • VTBench [364]: Systematic benchmark for evaluating visual tokenizers in autoregressive image generation across reconstruction, detail preservation, and text preservation.

5.2.3. Evaluation on Interleaved Generation

These benchmarks challenge models to seamlessly alternate between text and image modalities across multiple turns, reflecting realistic dialogue and storytelling.

  • InterleavedBench [356]: Human-curated benchmark for interleaved text-and-image generation, evaluating text quality, perceptual fidelity, multimodal coherence, and helpfulness.
  • ISG [358]: Introduces scene-graph annotations and a four-tier evaluation (holistic, structural, block-level, image-specific) over 1K samples in eight scenarios and 21 subtasks.
  • OpenING [360]: Assembles 5K human-annotated instances across 56 real-world tasks with IntJudge to test open-ended multimodal generation.
  • OpenLEAF [357]: Gathers 30 open-domain queries to probe foundational interleaved text-image generation, measuring entity and style consistency via LMM evaluators and human validation.
  • MMIE [359]: Unified interleaved suite sampling from 12 fields and 102 subfields, with multiple-choice and open-ended questions.
  • UniBench [361]: Comprehensive compositional benchmark with 81 fine-grained tags for diversity.

5.3. Baselines

As this paper is a comprehensive survey, it does not propose a new model or conduct its own experiments against baselines. Instead, it systematically reviews and categorizes existing unified multimodal models, detailing their architectures and capabilities. The "baselines" in this context are the various models themselves, which are presented within each category and subcategory as examples of different approaches to unification. The paper aims to provide a structured overview of these models and their underlying design principles rather than a comparative performance evaluation against a single "baseline."

6. Results & Analysis

As a survey, this paper does not present new experimental results in the traditional sense (e.g., performance metrics of a new model compared to baselines). Instead, its "results" are the comprehensive categorization and analysis of existing unified multimodal understanding and generation models, along with the identification of their advances, challenges, and opportunities. The paper's findings are presented through its structured review of architectural paradigms, model implementations, and available resources.

6.1. Core Results Analysis

The core analysis presented in the paper revolves around how different models attempt to unify multimodal understanding and generation, highlighting the various architectural choices and their implications.

1. Architectural Paradigms for Unification: The survey identifies three primary architectural paradigms, reflecting distinct strategies for integration:

  • Diffusion Models: These models leverage the inherent strengths of diffusion for high-quality image generation. They extend the denoising process to be conditioned on multimodal contexts, sometimes employing discrete diffusion for text and continuous for images, or fully discrete frameworks. The key advantage is superior sample quality and flexibility in conditioning. However, they face challenges in inference efficiency, sparse supervision during training, and architectural suitability.
  • Autoregressive (AR) Models: These approaches serialize multimodal inputs (text and visual tokens) and process them sequentially using Transformer backbones. Their strength lies in structural consistency with LLMs and their ability to perform compositional reasoning (a serialization sketch follows this list). The choice of image tokenization is critical here, leading to sub-categories:
    • Pixel-based Encoding: Good for low-level reconstruction but lacks semantic abstraction and increases sequence length.
    • Semantic Encoding: Provides strong text-image alignment but lacks pixel-level control, often requiring a separate diffusion decoder.
    • Learnable Query Encoding: Offers adaptive and compact semantic representations but introduces computational overhead and relies on fixed encoders.
    • Hybrid Encoding: Aims to combine pixel-level detail with semantic abstraction, but pseudo-hybrid methods may underutilize synergy, while joint-hybrid methods increase complexity and face alignment challenges.
  • Fused AR + Diffusion Models: These models combine AR for text generation with diffusion for image generation. This hybrid strategy aims to balance symbolic control (from AR) and visual fidelity (from diffusion). Similar to AR models, their effectiveness depends on image tokenization, with both pixel-based and hybrid encoding variants. They offer a promising trade-off but increase inference costs due to multi-step sampling.
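
To make the AR paradigm above concrete, the sketch below shows one way text tokens and discrete visual tokens (e.g., codebook indices from a VQGAN-style tokenizer) could be serialized into a single causal sequence. The special-token IDs and vocabulary offset are illustrative assumptions, not the scheme of any particular model surveyed here.

```python
from typing import List

BOI, EOI = 50000, 50001          # hypothetical "begin/end of image" special tokens
IMAGE_VOCAB_OFFSET = 50002       # visual codes are shifted past the text vocabulary

def serialize_example(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Flatten a (caption, image) pair into one token stream for causal LM training.

    text_ids    -- token ids from the text tokenizer
    image_codes -- codebook indices from a discrete image tokenizer
                   (e.g. a 32x32 grid flattened to 1024 codes)
    """
    image_ids = [IMAGE_VOCAB_OFFSET + c for c in image_codes]
    # The model is trained with next-token prediction over the whole sequence,
    # so generating an image amounts to sampling tokens between BOI and EOI.
    return text_ids + [BOI] + image_ids + [EOI]

# Toy usage: a 4-token caption followed by a 3x3 grid of visual codes.
sequence = serialize_example([11, 42, 7, 99], [5, 1, 8, 2, 2, 9, 0, 3, 4])
```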

2. Advances in Modality Integration: The paper illustrates significant advances in how modalities are integrated:

  • Improved Tokenization: From simple pixel flattening to advanced VQGAN-like models, semantic encoders (CLIP, SigLIP), and learnable queries, researchers are constantly seeking more efficient and semantically rich ways to represent visual data for language models.
  • Specialized Connectors: The evolution from simple projection layers to Q-Formers and adapter-based approaches has allowed more effective bridging between vision encoders and LLMs (see the connector sketch after this list).
  • Scaling Laws: Models like Emu2 and MMaDA demonstrate that scaling up parameters and leveraging techniques like MoE (Mixture-of-Experts) in backbones can significantly enhance both understanding and generation capabilities.
  • Any-to-Any Modality Support: Beyond text and image, models are now exploring unification across audio, video, and speech, often through modular designs or shared embedding spaces (ImageBind).
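
A minimal sketch of the simplest connector mentioned above: a small MLP that projects frozen vision-encoder features into the LLM embedding space so that image patches can be consumed as soft visual tokens. The dimensions are arbitrary illustrations; Q-Former-style connectors would additionally compress the patch sequence with learnable queries and cross-attention.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision features (B, N_patches, d_vision) to LLM embeddings (B, N_patches, d_llm)."""
    def __init__(self, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_features)

# Toy usage: 256 patch features from a frozen ViT become 256 "visual tokens"
# that are concatenated with text embeddings before entering the LLM.
connector = MLPConnector()
visual_tokens = connector(torch.randn(1, 256, 1024))   # -> (1, 256, 4096)
```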

3. Resource Development: The paper highlights a burgeoning ecosystem of datasets and benchmarks tailored for unified models. This includes massive web-scale paired data (LAION, COYO), fine-grained conversational data (ShareGPT4V), synthetic instruction tuning data (ALLaVA), and specialized datasets for image editing (InstructPix2Pix, MagicBrush), subject-driven generation (Subjects200K), and interleaved image-text generation (Multimodal C4, OBELICS). The increasing use of data synthesis pipelines is a critical development for addressing data scarcity in complex tasks.

4. Emerging Challenges: Despite rapid progress, the field faces significant hurdles that need addressing for true unification:

  • Tokenization: Finding optimal strategies for representing images (discrete vs. continuous, efficient compression, semantic richness).

  • Cross-Modal Attention: Managing computational overhead and ensuring effective interaction in high-dimensional, long-sequence contexts.

  • Data Construction: The need for higher quality, less biased, and more diverse training data, including complex compositional and interleaved data.

  • Evaluation: Developing comprehensive benchmarks that can holistically assess understanding, generation, and multi-turn interactions.

  • Cognitive Integration: Incorporating Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL) for improved interpretability and performance.

  • Fairness and Bias: Addressing ethical concerns related to potential amplification of biases in generated content.

  • Personalization: Enabling personalized knowledge-driven generation within unified models.

  • Task-Specific Limitations: Many unified models still struggle with advanced functionalities like spatially controlled generation, multi-subject generation, and complex image editing without extensive post-finetuning.

    Overall, the survey demonstrates that while significant progress has been made, the field is still in its early stages, characterized by diverse architectural explorations and a clear path forward marked by persistent challenges and abundant opportunities.

6.2. Data Presentation (Tables)

The paper includes two tables, TABLE 1 and TABLE 2, which summarize various unified multimodal models. Both are crucial for understanding the current landscape and are transcribed below.


The following are the results from TABLE 1 of the original paper, which categorizes unified multimodal models by their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion), and further subdivides them by encoding variations and encoder-decoder configurations.

Model Type Backbone Architecture Gen. Enc. Und. Enc. Gen. Dec. Mask Date
Dual Diffusion [127] a D-DiT Diffusion Model SD-VAE SD-VAE Bidirect. 2024-12
UniDisc [128] a DiT MAGVIT-v2 MAGVIT-v2 Bidirect. 2025-03
MMaDA [129] a LLaDA MAGVIT-v2 MAGVIT-v2 Bidirect. 2025-05
FUDOKI [130] a DeepSeek-LLM SigLIP VQGAN VQGAN Bidirect. 2025-05
Muddit [131] a Meissonic (MM-DiT) VQGAN VQGAN Bidirect. 2025-05
b-1 Autoregressive Model
LWM [29] b-1 LLaMa-2 VQGAN VQGAN Causal 2024-02
Chameleon [30] b-1 LLaMa-2 VQ-IMG VQ-IMG Causal 2024-05
ANOLE [132] b-1 LLaMa-2 VQ-IMG Causal 2024-07
Emu3 [133] b-1 LLaMA-2 SBER-MoVQGAN SBER-MoVQGAN Causal 2024-09
MMAR [134] b-1 Qwen2 SD-VAE + EmbeddingViT Diffusion MLP Bidirect. 2024-10
Orthus [135] b-1 Chameleon VQ-IMG+Vision embed. Diffusion MLP Causal 2024-11
SynerGen-VL [136] b-1 InternLM2 SBER-MoVQGAN SBER-MoVQGAN Causal 2024-12
Liquid [137] b-1 GEMMA VQGAN VQGAN Causal 2024-12
UGen [138] b-1 TinyLlama SBER-MoVQGAN SBER-MoVQGAN Causal 2025-03
Harmon [139] b-1 Qwen2.5 MAR MAR Bidirect. 2025-03
TokLIP [140] b-1 Qwen2.5 VQGAN+SigLIP VQGAN Causal 2025-05
Selftok [141] b-1 LLaMA3.1 SD3-VAE+MMDiT SD3 Causal 2025-05
Emu [142] b-2 LLaMA EVA-CLIP SD Causal 2023-07
LaVIT [143] b-2 LLaMA EVA-CLIP SD-2.1 Causal 2023-09
DreamLLM [34] b-2 LLaMA OpenAI-CLIP SDXL Causal 2023-09
Emu2 [33] b-2 LLaMA EVA-CLIP OpenAI-CLIP IP-Adapter Causal 2023-12
VL-GPT [35] b-2 LLaMA Open-CLIP SD-v2.1 Causal 2023-12
MM-Interleaved [144] b-2 Vicuna OpenAI-CLIP+ConvNext SDXL Causal 2024-01
Mini-Gemini [145] b-2 Gemma&Vicuna SigLIP+RQ RQ-VAE Causal 2024-03
VILA-U [146] b-2 LLaMA-2 OpenAI-CLIP SDXL Bidirect. 2024-09
PUMA [147] b-2 LLaMA-3 SigLIP SD-1.5 Causal 2024-10
MetaMorph [148] b-2 Vicuna UNIT SDXL Causal 2024-12
ILLUME [149] b-2 LLaMa-2 ViTamin ViTamin Causal 2024-12
UniTok [150] b-2 LlaMa-3 QLIP-ViT+BSQ BSQ-AE Causal 2025-02
QLIP [151] b-2 Qwen2.5 SigLIP RQVAE Causal 2025-02
DualToken [152] b-2 Qwen2.5 SigLIP+RQ RQ-VAE Causal 2025-03
UniFork [153] b-2 Qwen2.5 SigLIP+RQ FLUX.1-dev / SD-1.5 Causal 2025-05
UniCode2 [154] b-2 Qwen2.5-VL SigLIP2 DiT Bidirect. 2025-06
UniWorld [155] b-2 Qwen2.5-VL SigLIP EVA-CLIP Diffusion Causal 2025-06
Pisces [156] b-2 LLaMA-3.1 SigLIP2+VQ VQGAN / SANA Causal 2025-06
Tar [157] b-2 Qwen2.5 SigLIP OmniGen Causal 2025-06
OmniGen2 [158] b-2 Qwen2.5-VL AimV2 MMDiT Causal 2025-06
Ovis-U1 [159] b-2 Ovis MMDiT Causal 2025-06
X-Omni [160] b-2 Qwen2.5-VL QwenViT Siglip FLUX Causal 2025-07
Qwen-Image [161] b-2 Qwen2.5-VL QwenViT MMDiT Causal 2025-08
Bifrost-1 [162] b-2 Qwen2.5-VL QwenViT ViT FLUX Causal 2025-08
SEED [163] b-3 OPT SEED Tokenizer Learnable Query SD Causal 2023-07
SEED-LLaMA [164] b-3 LLaMa-2 &Vicuna SEED Tokenizer Learnable Query unCLIP-SD Causal 2023-10
SEED-X [165] b-3 LLaMa-2 SEED Tokenizer Learnable Query SDXL Causal 2024-04
MetaQueries [166] b-3 LLaVA&Qwen2.5-VL SigLIP Learnable Query Sana Causal 2025-04
Nexus-Gen [167] b-3 Qwen2.5-VL QwenViT Learnable Query FLUX Causal 2025-04
Ming-Lite-Uni [168] b-3 M2-omni NaViT Sana Causal 2025-05
BLIP3-o [169] b-3 Qwen2.5-VL OpenAI-CLIP Learnable Query Lumina-Next Causal 2025-05
OpenUni [170] b-3 InternVL3 InternViT Learnable Query Sana Causal 2025-05

The following are the results from `TABLE 2` of the original paper, which summarizes `Any-to-Any Multimodal Models` that support broader multimodal interactions.
Model Backbone Modality Enc. Modality Dec. Mask Date
Next-GPT [192] Vicuna ImageBind AudioLDM+SD-1.5+Zeroscope-v2 Causal 2023-09
Unified-IO 2 [193] T5 Audio Spectrogram Transformer+Vision ViT Audio ViT-VQGAN + Vision VQGAN Causal 2023-12
Video-LaVIT [194] LLaVA-1.5 LaVIT+Motion VQ-VAE SVD img2vid-xt Causal 2024-02
AnyGPT [195] LLaMA-2 Encodec+SEED Tokenizer+SpeechTokenizer Encodec+SD+SoundStorm Causal 2024-02
X-VILA [196] Vicuna ImageBind AudioLDM+SD-1.5+Zeroscope-v2 Causal 2024-05
MIO [197] Yi-Base SpeechTokenizer+SEED-Tokenizer SpeechTokenizer+SEED Tokenizer Causal 2024-09
Spider [198] LLaMA-2 ImageBind AudioLDM+SD-1.5+Zeroscope-v2 +Grounding DINO+SAM Causal 2024-11
OmniFlow [199] MMDiT HiFiGen+SD-VAE+Flan-T5 HiFiGen+SD-VAE+TinyLlama Bidirect. 2024-12
M2-omni [200] LLaMA-3 paraformer-zh+NaViT CosyVoice-vocoder+SD-3 Causal 2025-02

6.3. Ablation Studies / Parameter Analysis

As a survey paper, this work does not conduct its own ablation studies or parameter analyses. Instead, it synthesizes findings and architectural details from the surveyed literature. The paper implicitly highlights the impact of various design choices (e.g., choice of image tokenizer, type of backbone, fusion strategy) by categorizing models accordingly and discussing their respective strengths and limitations, which are often derived from the ablation studies and analyses performed by the original authors of the models. For example, the discussions on the limitations of pixel-based encoding (lack of semantic abstraction, computational overhead) versus semantic encoding (lack of pixel-level control, mismatch with decoder) directly reflect insights gained from comparing different architectural components in the field.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a comprehensive overview of the rapidly developing field of unified multimodal understanding and generation models. It highlights the increasing convergence of two previously distinct domains: multimodal understanding (traditionally autoregressive-based) and image generation (primarily diffusion-based). The paper systematically categorizes existing unified models into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. Within the autoregressive and hybrid categories, it further delineates models based on their image tokenization strategies (pixel-based, semantic-based, learnable query-based, and hybrid encoding), illustrating the diverse ways visual information is processed and integrated.

The survey compiles relevant datasets and benchmarks across tasks like understanding, text-to-image generation, image editing, and interleaved image-text generation, offering a valuable resource for researchers. Finally, it identifies critical challenges such as tokenization complexity, cross-modal attention scalability, data scarcity and quality, and the need for comprehensive evaluation protocols. Despite these hurdles, the paper concludes that the field is in its infancy with abundant opportunities for innovation, including the application of Chain-of-Thought (CoT) and Reinforcement Learning (RL), addressing demographic and social biases, and enabling personalized knowledge-driven generation. The ultimate goal is to move towards truly universal, general-purpose generative models capable of seamless understanding and generation across a full spectrum of modalities.

7.2. Limitations & Future Work

The authors explicitly point out several key challenges that represent limitations in the current state of unified multimodal models and suggest future research directions:

  1. Tokenization Strategy: The high dimensionality of visual data leads to extremely long token sequences, causing significant memory and computation costs (see the back-of-the-envelope sketch after this list). Future work needs to focus on efficient tokenization and compression strategies that preserve representational fidelity while reducing overhead. The debate between discrete and continuous representations for image tokens remains active.
  2. Cross-Modal Attention: As image resolution and context length increase, cross-modal attention can become a performance bottleneck. Research into scalable alternatives such as sparse or hierarchical attention mechanisms is crucial.
  3. Data Construction: Pre-training datasets often contain noisy or biased image-text pairs, particularly for complex compositions and interleaved data. Developing reliable data filtering, debiasing, and synthesizing techniques is essential to ensure fairness and robustness.
  4. Model Evaluation: Current evaluation protocols are typically designed for single tasks in isolation. There is a growing need for comprehensive benchmarks that can assess both understanding and generation capabilities in an integrated manner, especially for sophisticated tasks like image editing and interleaved image-text generation.
  5. Cognitive Integration: Exploring the application of Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL) techniques is a promising avenue to improve both interpretability and performance. CoT can guide models through intermediate reasoning steps, beneficial for complex visual-question answering, while RL can optimize for long-horizon objectives like factual consistency and user satisfaction.
  6. Fairness and Bias: Investigating and mitigating demographic and social biases of existing unified models is critical for responsible deployment. Unintentional amplification of stereotypes embedded in pre-training data can lead to harmful outputs, necessitating fairness-aware training pipelines.
  7. Personalized Knowledge-Driven Generation: Enabling personalized knowledge-driven generation within unified MLLMs is an emerging and important direction. Current approaches often separate understanding and generation for personalized concepts, limiting generalization to compositional prompts requiring implicit knowledge. Unifying this under a shared modeling framework would improve semantic grounding and contextual generalization.
  8. Advanced Generation Capabilities: Most current unified models primarily emphasize image understanding and text-to-image generation, with capabilities like image editing often achieved through post-finetuning. Advanced functionalities such as spatially controlled image generation, subject(s)-driven image generation, and interleaved image-text generation remain largely unexplored within a truly unified framework. This represents a significant opportunity for future research to develop more integrated and capable models.
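
As referenced in point 1, a back-of-the-envelope calculation shows why tokenization and cross-modal attention costs escalate together: patch-token count grows quadratically with image resolution for a fixed patch size, and self-attention cost grows quadratically again in sequence length. The patch size and resolutions below are illustrative only, not tied to any specific model.

```python
def image_token_count(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image at the given resolution."""
    return (resolution // patch_size) ** 2

for res in (256, 512, 1024, 2048):
    n = image_token_count(res)
    # Self-attention over n tokens requires O(n^2) pairwise interactions,
    # which is what motivates sparse / hierarchical alternatives.
    print(f"{res}x{res}: {n:5d} tokens, ~{n * n:>12,} attention pairs")
```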

7.3. Personal Insights & Critique

This survey provides a remarkably structured and timely overview of a field that is rapidly accelerating, largely driven by the impressive capabilities demonstrated by models like GPT-4o. The clear categorization of models based on their underlying architectures (diffusion, autoregressive, hybrid) and, crucially, their image tokenization strategies is highly valuable. For a beginner, this breakdown offers an essential map to navigate the complex landscape of various model names and their technical nuances. The inclusion of extensive lists of datasets and benchmarks further solidifies its utility as a reference.

One key insight is the inherent tension between the strengths of autoregressive models (compositional reasoning, text generation) and diffusion models (high-fidelity image synthesis). The various tokenization strategies and hybrid architectures are all attempts to reconcile this tension. The paper effectively shows that there's no single "best" approach yet, but rather a spectrum of trade-offs between semantic understanding, pixel-level control, computational efficiency, and architectural complexity. This suggests that future breakthroughs might come from novel ways to represent visual information that are both compact and semantically rich for language models, while also being easily convertible into high-fidelity pixels.

A critical reflection is on the data bottleneck. The sheer volume and quality required for training these increasingly sophisticated models are immense. The reliance on synthetic data generation pipelines is a testament to this, but it also introduces potential pitfalls such as mode collapse or bias amplification if the synthetic data generation process itself isn't diverse and unbiased enough. The paper correctly highlights data construction, fairness, and bias as major challenges, underscoring that without robust and ethical data, even the most advanced architectures will struggle.

The opportunity to integrate Chain-of-Thought and Reinforcement Learning seems particularly promising. CoT could bring much-needed transparency and reasoning capabilities to visual generation, allowing models to explain why they generated certain visual elements. RL could fine-tune models to align better with complex human preferences and abstract goals, moving beyond simple pixel-level fidelity to capture more nuanced aspects of "good" generation.

The explicit mention of current limitations in image editing and spatially controlled generation within unified frameworks indicates a clear gap. While LLMs excel at understanding complex instructions, translating these into precise visual manipulations is still largely a specialized task. Bridging this gap will require deeper integration of spatial and semantic understanding directly into the generative process, rather than relying on post-hoc finetuning or external modules.

Overall, the paper is an excellent and timely guide. Its systematic approach not only summarizes the current state but also provides a clear roadmap for future innovation in building truly intelligent, versatile, and ethical multimodal AI systems.
