Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
TL;DR Summary
This paper provides a comprehensive survey of unified multimodal understanding and generation models, examining the challenges posed by architectural differences between autoregressive and diffusion models. It organizes existing work into three main unified framework paradigms, compiles tailored datasets and benchmarks, and discusses key challenges and opportunities for future research.
Abstract
Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
1.2. Authors
Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
1.3. Journal/Conference
This paper is a preprint published on arXiv. The abstract and the content do not specify a journal or conference publication at the time of this analysis. arXiv is a well-regarded open-access archive for scholarly articles, particularly in fields like physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to share their work before formal peer review and publication, making it influential for disseminating cutting-edge research rapidly.
1.4. Publication Year
2025 (Specifically, published at 2025-05-05T11:18:03.000Z, which is May 5, 2025).
1.5. Abstract
The paper surveys the rapidly evolving field of unified multimodal understanding and generation models. It notes that while multimodal understanding models (often autoregressive-based) and image generation models (predominantly diffusion-based) have seen independent successes, recent interest, exemplified by GPT-4o, drives their integration. The architectural disparities between these domains present significant challenges. This survey offers a comprehensive overview, starting with foundational concepts and advancements in both multimodal understanding and text-to-image generation. It then categorizes existing unified models into three architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches. For each category, it details structural designs and innovations. Additionally, it compiles relevant datasets and benchmarks to guide future research. Finally, the paper discusses key challenges such as tokenization strategies, cross-modal attention, and data construction, and identifies opportunities in this nascent field, aiming to inspire further research and serve as a valuable reference.
1.6. Original Source Link
https://arxiv.org/abs/2505.02567v5 (Preprint)
1.7. PDF Link
https://arxiv.org/pdf/2505.02567v5.pdf (Preprint)
2. Executive Summary
2.1. Background & Motivation
The field of Artificial Intelligence has witnessed rapid advancements in two distinct areas: multimodal understanding models and image generation models.
- Multimodal Understanding Models: Large Language Models (LLMs) like LLaMa and GPT have expanded into multimodal domains, giving rise to powerful models such as LLaVa, Qwen-VL, and GPT-4. These models excel at tasks ranging from simple image captioning to complex reasoning based on user instructions, primarily utilizing autoregressive-based architectures (decoder-only, next-token prediction).
- Image Generation Models: Technologies like the Stable Diffusion (SD) series and FLUX have achieved remarkable success in producing high-quality images from text prompts, with diffusion-based models becoming the dominant paradigm (leveraging U-Net or DiT architectures).
The core problem the paper aims to address is the architectural and methodological divergence between these two successful domains. Despite their individual achievements, they have largely evolved independently. This independence has led to distinct paradigms that are not inherently compatible, posing a significant challenge for creating systems that can seamlessly perform both understanding and generation tasks.
The motivation for unification is compelling:
- Enhanced Capabilities: A unified model could generate images from complex instructions, reason about visual data, and then visualize those analyses through generated outputs, offering a much richer human-AI interaction.
- Real-world Impact: The emergence of advanced models like GPT-4o, demonstrating enhanced capabilities in both understanding and generation, has highlighted the immense potential and growing interest in this unification trend. This development underscores the need for consolidated research and understanding of the challenges and opportunities in this nascent field.
- Efficiency and Scalability: Managing separate models for understanding and generation tasks is complex and inefficient. A unified model promises a more scalable solution capable of handling a wider variety of tasks within a single framework.
The paper's entry point is to provide a comprehensive survey that bridges this gap, guiding future research by clearly outlining the current state, categorizing existing approaches, compiling resources, and discussing challenges and opportunities.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Comprehensive Survey and Categorization: It provides a thorough overview of existing efforts towards unifying multimodal understanding and generation models. It categorizes these models into three main architectural paradigms:
  - Diffusion-based models.
  - Autoregressive-based models (further sub-categorized by image tokenization strategies: pixel-based, semantic-based, learnable query-based, and hybrid encoding).
  - Hybrid approaches that fuse autoregressive and diffusion mechanisms (also sub-categorized by encoding strategies).
  This systematic classification helps in understanding the diverse approaches and their underlying design principles.
- Foundational Concepts and Advancements: It details the foundational concepts and recent advancements in both multimodal understanding and text-to-image generation, providing a necessary background for novices.
- Resource Compilation: It compiles and reviews relevant datasets and benchmarks tailored for training and evaluating unified models. These resources span various tasks including multimodal understanding, text-to-image generation, image editing, and interleaved image-text processing, serving as a valuable guide for researchers.
- Identification of Challenges and Opportunities: The survey critically discusses key challenges facing the field, such as:
  - Tokenization strategy: How to effectively represent visual information for different architectures.
  - Cross-modal attention: Managing computational complexity and ensuring effective interaction between modalities.
  - Data construction: The need for large-scale, high-quality, and diverse multimodal datasets.
  - Evaluation protocols: Developing comprehensive benchmarks that assess both understanding and generation holistically.
  It also highlights opportunities for future research, including incorporating Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL), addressing demographic and social biases, and enabling personalized knowledge-driven generation.
In essence, the paper serves as a structured reference point for researchers entering or working in the field of unified multimodal AI, aiming to inspire innovation and facilitate progress toward truly general-purpose multimodal models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the paper's analysis of unified multimodal models, several foundational concepts from Large Language Models (LLMs), Multimodal Understanding, and Image Generation are crucial.
3.1.1. Large Language Models (LLMs) and Autoregressive Generation
- Large Language Models (LLMs): These are neural networks with billions of parameters, pre-trained on massive text datasets to understand, generate, and process human language. They have shown remarkable capabilities in natural language processing (NLP) tasks. Examples include GPT series, LLaMa, and Qwen.
- Autoregressive Generation: This is a sequential process where a model predicts the next element in a sequence based on all previously generated elements. In LLMs, this means predicting the next word or token given the preceding context. The decoder-only Transformer architecture is commonly used for this, allowing efficient generation of text sequences. The model learns to factorize the joint probability of a sequence as a product of conditional probabilities:
$ p(\boldsymbol{x}) = \prod_{i=1}^{N} p(x_i | x_1, x_2, ..., x_{i-1}; \theta) $
where $\theta$ represents the model parameters. The training objective is typically to minimize the negative log-likelihood (NLL) loss:
$ \mathcal{L}(\theta) = - \sum_{i=1}^{N} \log p(x_i | x_1, x_2, ..., x_{i-1}; \theta) $
This process is analogous to how humans read or write, predicting one word after another.
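To make the factorization and NLL objective above concrete, here is a minimal, illustrative PyTorch sketch of the next-token prediction loss. The tensor shapes and toy random inputs are assumptions for demonstration, not code from the paper.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not from the paper): next-token prediction loss for an
# autoregressive LM. `logits` would come from a decoder-only Transformer.
def next_token_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); tokens: (batch, seq_len) integer ids."""
    # Predict token i from positions < i: shift logits left, targets right.
    pred = logits[:, :-1, :]              # predictions for positions 1..N-1
    target = tokens[:, 1:]                # ground-truth next tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Toy usage with random values standing in for a real model's output.
logits = torch.randn(2, 16, 1000)
tokens = torch.randint(0, 1000, (2, 16))
loss = next_token_nll(logits, tokens)     # equals the NLL objective above
```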
3.1.2. Multimodal Understanding Models (MLLMs/MMMs)
- Multimodal Understanding Models: These models extend the capabilities of LLMs to process and reason over multiple input modalities, most commonly vision-language understanding (VLU), which integrates visual (images, video) and textual inputs. They aim to achieve a comprehensive understanding of scenes, objects, relationships, and concepts across modalities.
- Vision-Language Understanding (VLU): A subfield of multimodal AI focusing on tasks that require joint reasoning over visual and linguistic data, such as image captioning, visual question answering (VQA), and visual grounding.
- Dual-Encoder Architectures: Early VLU models used this design, where separate encoders process images and text, and their latent representations are then aligned. For example, CLIP (Contrastive Language-Image Pre-training) learns to embed images and text into a shared latent space such that relevant image-text pairs are close together.
- LLM Backbones with Connectors: Modern VLU models often use a frozen or fine-tuned LLM as their core. A connector module bridges the visual input (encoded by a vision encoder) to the LLM's token space. As shown in the VLM description for images/2.jpg, these connectors can be projection-based (simple linear layers), query-based (e.g., using a transformer to query visual features), or fusion-based (more complex integration of features).
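As a rough illustration of the simplest connector variant described above (projection-based), the following sketch maps frozen vision-encoder patch features into an LLM's embedding space. The dimensions (1024 and 4096) and module names are assumptions, not taken from any specific model.

```python
import torch
import torch.nn as nn

# Illustrative projection-based connector: a single linear layer maps frozen
# vision-encoder patch features into the LLM's token-embedding space.
class ProjectionConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. from a CLIP ViT
        return self.proj(patch_features)   # (batch, num_patches, llm_dim)

visual_tokens = ProjectionConnector()(torch.randn(1, 256, 1024))
# `visual_tokens` would be concatenated with text embeddings before the LLM.
```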
3.1.3. Image Generation Models
- Diffusion Models: These generative models have become state-of-the-art for high-quality image synthesis. They operate by learning to reverse a Markov chain of Gaussian noise addition. (A small numerical sketch of the forward process follows this list.)
  - Forward Process: Gradually adds Gaussian noise to an image $x_0$ over $T$ time steps, corrupting it into pure noise $x_T$:
    $ q(x_{1:T} | x_0) := \prod_{t=1}^{T} q(x_t | x_{t-1}) $
    $ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}) $
    Here, the $\beta_t$ are small variance hyperparameters controlling the noise added at each step.
  - Reverse Process: A neural network (often a U-Net) is trained to predict and remove this noise iteratively, transforming $x_T$ back into $x_0$. This process generates a clean image from noise:
    $ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $
    The network parameterizes the mean $\mu_\theta$ and variance $\Sigma_\theta$, learning to predict the noise that was added at timestep $t$.
  - Score Function: Early diffusion models often aimed to learn the score function $\nabla_x \log p(x)$, which points in the direction of higher probability density.
  - U-Net: A convolutional neural network architecture with a symmetrical U-shape, featuring skip connections that allow direct information flow from earlier (downsampled) layers to later (upsampled) layers. This helps preserve fine-grained details during image generation.
  - Latent Diffusion Models (LDMs): To improve computational efficiency, LDMs (e.g., Stable Diffusion) perform the diffusion process in a compressed latent space learned by a Variational Autoencoder (VAE) or VQ-GAN. This reduces the dimensionality of data, making diffusion faster while maintaining high-quality generation.
  - Diffusion Transformers (DiT): These models replace the traditional U-Net with a Transformer architecture for the noise prediction network, processing images as sequences of patches, similar to how Vision Transformers (ViT) process images for classification. They take conditional information (timestep, text embeddings) as input.
- Generative Adversarial Networks (GANs): An older class of generative models consisting of a generator (which creates samples) and a discriminator (which tries to distinguish real from generated samples), trained in an adversarial manner. While powerful, GANs often suffer from mode collapse (generating a limited variety of samples) and training instability.
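The following minimal sketch illustrates the forward (noising) process defined above, using the standard closed-form sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$. The linear beta schedule and tensor sizes are illustrative assumptions, not any surveyed model's settings.

```python
import torch

# Sketch of the DDPM forward (noising) process under the usual closed form.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise variances beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) for a batch of timesteps t."""
    a = alphas_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

x0 = torch.randn(4, 3, 64, 64)                  # stand-in for image latents
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)                      # progressively noisier as t grows
```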
3.1.4. Tokenization Strategies
- Tokenization: The process of converting raw input data (like images or text) into discrete units (tokens) that a model can process.
- Text Tokens: For text, this usually involves splitting sentences into words or sub-word units (e.g., using Byte Pair Encoding - BPE).
- Visual Tokens: For images, tokenization is more complex:
  - Next-Pixel Prediction: Flattens image pixels into a 1D sequence and predicts the next pixel. Computationally expensive.
  - Next-Token Prediction (Discrete Vector Quantization): Uses a visual tokenizer (like VQ-VAE or VQ-GAN) to compress an image into a sequence of discrete latent codes or tokens from a learned codebook. This reduces sequence length and makes images compatible with Transformer architectures.
  - Semantic Encoders (e.g., CLIP, EVA-CLIP, SigLIP): These encoders are trained to align visual and text modalities in a shared latent space. They produce continuous semantic embeddings that capture high-level conceptual information, making them valuable for understanding tasks.
  - Next-Multiple-Tokens Prediction: Predicts multiple tokens at once, often by grouping pixels or tokens into larger units (patches, blocks) or using multi-scale approaches, to improve generation efficiency.
As seen in the VLM description for images/3.jpg, autoregressive models use various sequence representation strategies:
- Next-Pixel Prediction: Treats each pixel as a unit, flattening the image into a pixel sequence.
- Next-Token Prediction: Converts the image into a token sequence via a visual tokenizer (using discrete vector quantization).
- Next-Multiple-Tokens Prediction: Groups multiple tokens as a unit, often using spatial scale or frequency groups.
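To illustrate the discrete next-token route described above, here is a toy sketch of VQ-style visual tokenization: continuous patch latents are snapped to their nearest codebook entries, and the resulting indices are the "image tokens" an autoregressive Transformer would consume. The codebook size and dimensions are assumptions.

```python
import torch

# Sketch of discrete visual tokenization in the VQ-VAE/VQ-GAN style.
codebook = torch.randn(8192, 256)                # (codebook_size, latent_dim)

def quantize(latents: torch.Tensor) -> torch.Tensor:
    """latents: (batch, num_patches, latent_dim) -> integer token ids."""
    # Squared distances via ||x||^2 + ||c||^2 - 2 x.c (avoids a huge intermediate).
    x_sq = (latents ** 2).sum(-1, keepdim=True)   # (B, P, 1)
    c_sq = (codebook ** 2).sum(-1)                # (K,)
    dots = latents @ codebook.t()                 # (B, P, K)
    d = x_sq + c_sq - 2 * dots
    return d.argmin(dim=-1)                       # (B, P) nearest codebook index

patch_latents = torch.randn(2, 16 * 16, 256)      # e.g. a 16x16 grid of patches
image_tokens = quantize(patch_latents)            # fed to the Transformer like text
```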
3.2. Previous Works
The paper references a vast array of previous works, which it then categorizes. Here's a summary of some core ones relevant to the foundational concepts:
- LLMs:
- LLaMa [1, 2]: A family of large language models from Meta AI, known for their open-source nature and strong performance, often used as backbones in multimodal models.
- GPT Series [7, 13]: OpenAI's generative pre-trained transformer series, particularly GPT-4 [13] and GPT-4o [27], are highlighted for their advanced language and increasingly multimodal capabilities, influencing the trend toward unification.
- Qwen Series [5, 6]: Alibaba Cloud's LLM series, with Qwen-VL [5] being a prominent multimodal extension.
- Multimodal Understanding (VLU/MLLMs):
- CLIP [22]: (Contrastive Language-Image Pre-training) A seminal dual-encoder model by OpenAI that aligns images and text into a shared embedding space using contrastive learning. Widely used as a vision encoder or for feature extraction.
- LLaVa [8]: (Large Language and Vision Assistant) One of the early visual instruction tuning models that adapts LLMs for multimodal understanding tasks, leveraging CLIP and Vicuna.
- MiniGPT-4 [57]: Utilized a simple projection layer to connect CLIP-derived image embeddings to a frozen LLM (Vicuna), demonstrating efficient vision-language alignment.
- BLIP-2 [53]: Introduced a Q-Former (Querying Transformer) to bridge a frozen visual encoder with a frozen LLM, improving efficiency and alignment.
- Flamingo [60]: Used gated cross-attention layers to connect a pre-trained vision encoder with a frozen LLM decoder.
- InternVL [11]: Explores unified multimodal pre-training for various visual-linguistic tasks.
- Ovis [12]: Introduces structural embedding alignment for multimodal LLMs.
- DeepSeek-VL2 [68]: Employs a Mixture-of-Experts (MoE) architecture for cross-modal reasoning, showcasing advanced scalability.
- Image Generation (Diffusion Models):
- SD Series [14, 15]: (Stable Diffusion) Latent Diffusion Models (LDMs) that achieve high-quality image synthesis efficiently by operating in a latent space. SD3.0 [15] uses rectified flow transformers.
- FLUX [16]: A recent, high-performance image generation model.
- DiT [20]: (Diffusion Transformers) Replaces the U-Net in diffusion models with a Transformer, processing images as patches, leading to highly scalable diffusion models.
- GLIDE [71]: Introduced classifier-free guidance for text-guided diffusion, improving controllability.
- Imagen [72]: Used large pre-trained LLMs (T5) as text encoders for powerful text-to-image generation.
- VQ-Diffusion [73]: Applies vector quantization to diffusion models.
- Image Generation (Autoregressive Models):
- PixelRNN [88] / PixelCNN [89]: Early pixel-based autoregressive models for images, generating pixels sequentially.
- VQ-VAE-2 [93] / VQGAN [32]: Key models for Vector Quantization (VQ) that compress images into discrete latent codes, enabling Transformer-based autoregressive image generation (e.g., LlamaGen [24]).
- MAR [25]: Autoregressive image generation without discrete vector quantization, using continuous representations and a diffusion loss.
- Connectors/Tokenizers:
- VAE [31] / VQ-GAN [32]: Used for pixel-based encoding, compressing images into continuous latents or discrete tokens.
- CLIP [22] / EVA-CLIP [36] / SigLIP [209]: Used for semantic encoding, producing continuous embeddings aligned with text.
- SEED Tokenizer [163]: An example of a learnable query-based encoding mechanism.
The VLM description for images/2.jpg (which is Fig. 2 in the paper) illustrates the typical architecture of multimodal understanding models. It shows multimodal encoders (for images, audio, video) feeding into a connector, which then provides input to an LLM. The connector types are broadly projection-based, query-based, and fusion-based. These concepts are central to how visual information is integrated into a language model.
The VLM description for images/2.jpg (which is Fig. 3 in the paper) illustrates the typical process of diffusion-based text-to-image generation. It details the forward process of adding noise and the reverse process of denoising, conditioned by various inputs like text, identity, style, and sketches. This highlights the conditional generation capabilities of diffusion models.
The VLM description for images/3.jpg (which is Fig. 4 in the paper) illustrates core components of autoregressive models. It details autoregression sequence modeling and discrete vector quantization. It categorizes AR models into Next-Pixel Prediction, Next-Token Prediction, and Next-Multiple-Tokens Prediction, based on how they define the unit for sequential generation.
3.3. Technological Evolution
The evolution towards unified multimodal understanding and generation models can be summarized in several stages:
- Early Unimodal Dominance (Pre-2018): Separate research streams for NLP (e.g., RNNs, LSTMs, early Transformers for text) and Computer Vision (e.g., CNNs for image classification, GANs for image generation).
- Emergence of Multimodal Understanding (2018-2021):
  - Development of Transformer architectures revolutionized NLP.
  - Initial attempts to align modalities with dual-encoder architectures (e.g., ViLBERT, VisualBERT, UNITER) or simple projection layers (CLIP).
  - Focus on tasks like VQA and image captioning.
- Rise of Large Language Models (2020-Present):
  - Scaling of LLMs (GPT-3, LLaMa) demonstrated powerful reasoning and generation capabilities.
  - Adaptation of LLMs for multimodal understanding by integrating vision encoders through connector modules (e.g., Flamingo, BLIP-2, LLaVa), leveraging the LLM's reasoning power.
- Dominance of Diffusion Models in Image Generation (2020-Present):
  - Diffusion models (DDPM, GLIDE, Imagen, Stable Diffusion) surpassed GANs in sample quality and controllability for image generation.
  - Introduction of Latent Diffusion Models (LDMs) for efficiency and Diffusion Transformers (DiT) for scalability.
- The Push for Unification (2023-Present):
  - Recognition of the limitations of separate systems and the desire for models that can seamlessly understand and generate across modalities.
  - Initial explorations into using LLM-inspired autoregressive architectures for image generation, despite quality gaps compared to diffusion models.
  - Growing interest in unified frameworks that integrate the strengths of both autoregressive models (for reasoning and text generation) and diffusion models (for high-quality image synthesis). The release of GPT-4o exemplified this trend, demonstrating advanced multimodal capabilities.
  - Development of hybrid architectures combining AR and diffusion mechanisms.
  - Expansion beyond text-image to any-to-any multimodal models (audio, video, speech).
The paper's work fits into this timeline by systematically surveying the current state of this unification effort, categorizing the various architectural approaches, and outlining the challenges that remain as the field moves towards truly general-purpose multimodal AI.
3.4. Differentiation Analysis
This paper is a survey, so its innovation lies in its comprehensive scope and structured analysis of a rapidly emerging field, rather than proposing a new model or algorithm. Compared to prior research (which it also cites), its core differentiation points are:
- Specific Focus on Unification: While there are excellent surveys on LLMs, multimodal understanding, and image generation individually [40, 41, 42, 43, 44, 45, 46], this work specifically focuses on the integration of understanding and generation tasks within a unified framework. It addresses the unique challenges and opportunities that arise from trying to combine these often disparate paradigms.
- Architectural Categorization: The paper introduces a clear and detailed architectural taxonomy for unified models, dividing them into diffusion-based, autoregressive-based, and fused AR + diffusion approaches. Crucially, it further sub-categorizes autoregressive and fused AR + diffusion models based on their image tokenization strategies (pixel-based, semantic-based, learnable query-based, hybrid encoding), which is a key differentiator in how these models process visual information. This fine-grained categorization provides a much-needed structured view of the field.
- Comprehensive Resource Compilation: Beyond architectural analysis, the survey provides curated lists of datasets and benchmarks specifically relevant to unified multimodal tasks, covering understanding, generation, editing, and interleaved modalities. This serves as a practical guide for researchers.
- Forward-Looking Analysis: It not only reviews existing work but also critically discusses current challenges (e.g., tokenization, cross-modal attention, data construction, evaluation) and opportunities (e.g., CoT, RL, bias mitigation, personalization), offering concrete directions for future research.
In essence, this survey differentiates itself by providing a holistic, structured, and forward-looking perspective on the emerging trend of unified multimodal AI, addressing the unique complexities of integrating understanding and generation capabilities.
4. Methodology
The paper's methodology is a systematic survey of unified multimodal understanding and generation models. Its core approach involves categorizing existing models based on their architectural paradigms and modality encoding methods, then detailing their structural designs, innovations, and limitations. It also compiles relevant datasets and benchmarks to provide a comprehensive resource for the field.
4.1. Principles
The core idea is to structure the nascent field of unified multimodal models into a coherent taxonomy. This taxonomy helps to:
- Clarify Architectural Choices: By grouping models based on their fundamental generative mechanisms (diffusion, autoregressive, fused AR + diffusion).
- Highlight Key Design Decisions: Especially regarding image tokenization, which is a critical aspect for integrating visual data into language-centric architectures.
Identify Trends and Gaps: By systematically reviewing models, the survey can pinpoint common solutions, unique innovations, and persistent challenges.
The theoretical basis is that despite the diversity of implementations, unified multimodal models often share underlying generative principles (autoregressive vs. diffusion) and face similar challenges in representing and integrating different modalities. By analyzing these commonalities and differences, the survey aims to provide a guiding framework for future research.
4.2. Core Methodology In-depth (Layer by Layer)
The paper abstracts a typical unified multimodal framework into three core components:
- Modality-Specific Encoders: Project different input modalities (e.g., text, image, video, audio) into a shared or compatible representation space.
- Modality-Fusion Backbone: Integrates information from multiple modalities, enabling cross-modal reasoning. This often takes the form of a large Transformer model.
- Modality-Specific Decoders: Generate output in the desired modality (e.g., text generation, image synthesis).
The primary focus of this survey is on models supporting vision-language understanding and generation (i.e., taking image and text as input and producing either text or image as output). The core categorization, as illustrated in the VLM description of images/4.jpg (which is Fig. 5 in the original paper), divides existing unified models into three main types: Diffusion Models, Autoregressive Models (MLLM (AR)), and Fused AR + Diffusion Models (MLLM (AR) + Diffusion).
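The three-component abstraction above can be summarized in a small skeleton. The class below is a generic sketch of how modality-specific encoders, a shared fusion backbone, and modality-specific decoders might be wired together; it is not an implementation of any specific surveyed model.

```python
import torch
import torch.nn as nn

# Skeleton of the abstract three-part framework (encoders, fusion backbone,
# decoders). All module choices are placeholders supplied by the caller.
class UnifiedModel(nn.Module):
    def __init__(self, encoders: dict, backbone: nn.Module, decoders: dict):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)   # e.g. {"image": ..., "text": ...}
        self.backbone = backbone                  # shared Transformer (AR, diffusion, or hybrid)
        self.decoders = nn.ModuleDict(decoders)   # e.g. {"text": ..., "image": ...}

    def forward(self, inputs: dict, target_modality: str):
        # Assumes each encoder returns a (batch, seq_len, dim) token sequence.
        tokens = [self.encoders[name](x) for name, x in inputs.items()]
        fused = self.backbone(torch.cat(tokens, dim=1))
        return self.decoders[target_modality](fused)
```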
4.2.1. Diffusion Models
General Mechanism: As introduced in Section 3.1.3, diffusion models formulate generation as a pair of Markov chains: a forward process that adds Gaussian noise to data $x_0$ over $T$ timesteps to produce $x_T$, and a reverse process that learns to iteratively denoise $x_T$ back to $x_0$. The VLM description of images/2.jpg (which is Fig. 3 in the paper) visually explains this process.
- Forward Process:
  $ q(x_{1:T} | x_0) := \prod_{t=1}^{T} q(x_t | x_{t-1}) $
  $ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}) $
  where the $\beta_t$ are the variance hyperparameters controlling the noise.
- Reverse Process:
  $ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $
  where $\mu_\theta$ and $\Sigma_\theta$ are parameterized by a neural network. The training objective is to minimize the variational lower bound on the negative log-likelihood, typically simplified to a noise prediction loss:
  $ \mathcal{L} = \mathbb{E}_{q(x_0, x_{1:T})} \left[ \lVert \epsilon_\theta(x_t, t) - \epsilon^*(x_t, t) \rVert^2 \right] $
  Here, $\epsilon_\theta(x_t, t)$ is the model's prediction of the noise at timestep $t$, and $\epsilon^*(x_t, t)$ is the true noise added.
In multimodal diffusion models, the denoising process is conditioned on multimodal contexts (textual descriptions, images, or joint embeddings) to enable synchronized generation and semantic alignment.
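A minimal sketch of the simplified noise-prediction objective with text conditioning is shown below. The tiny MLP denoiser, the conditioning interface, and all dimensions are illustrative assumptions standing in for a real U-Net or DiT.

```python
import torch
import torch.nn as nn

# Toy conditional denoiser: predicts the noise added to a (flattened) latent,
# given the timestep and a text-encoder embedding as conditioning.
class TinyDenoiser(nn.Module):
    def __init__(self, dim: int = 64, cond_dim: int = 77 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, text_emb):
        inp = torch.cat([x_t, text_emb.flatten(1), t.float().unsqueeze(1)], dim=1)
        return self.net(inp)                     # predicted noise epsilon_theta

denoiser = TinyDenoiser()
x0 = torch.randn(8, 64)                          # flattened image latents (toy)
text_emb = torch.randn(8, 77, 64)                # e.g. frozen text-encoder output
t = torch.randint(0, 1000, (8,))
eps = torch.randn_like(x0)
x_t = 0.5 * x0 + 0.866 * eps                     # stand-in for q(x_t | x_0)
loss = ((denoiser(x_t, t, text_emb) - eps) ** 2).mean()   # ||eps_theta - eps||^2
```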
Representative Models:
-
Dual Diffusion [127]:
- Concept: Introduces a
dual-branch diffusion processfor joint text and image generation. - Encoders: Text encoded by a pre-trained
T5 encoder[23] (with softmax probability modeling for discrete text representations). Image encoded byVAE encoderfrom Stable Diffusion [14] (for continuous image latents). - Forward Process: Both text and image latents are independently noised through separate forward diffusion processes.
- Reverse Process: Two modality-specific denoisers (Transformer-based for text, U-Net-based for image) jointly denoise the latents. Crucially,
cross-modal conditioningoccurs at each timestep, where text latents attend to image latents and vice versa. - Decoders: Text latent decoded by a
T5 decoder, image latent byVAE decoder. - Training: Image branch minimizes standard
noise prediction loss. Text branch minimizescontrastive log-loss.
-
UniDisc [128]:
- Concept: Employs a
fully discrete diffusion frameworkfor both text and images, training aDiffusion Transformerfrom scratch. - Tokenization: Text tokenized using
LLaMA2 tokenizer[2]. Images converted to discrete tokens withMAGVIT-v2 encoder[207]. - Diffusion Process: Discrete forward diffusion adds structured noise simultaneously across modalities. The reverse process progressively denoised these discrete tokens.
- Decoders:
LLaMA2andMAGVIT-v2decoders transform sequences into text and images.
-
FUDOKI [130]:
- Concept: A novel generative approach based on
discrete flow matching[208], modeling a direct path between noise and data distributions. - Architecture: Based on
Janus1.5B[174], but with modifications: replaces causal mask withfull attention maskfor global contextual understanding. Infers corruption state directly instead of explicit timestep embeddings. - Encoders:
SigLIP encoder[209] for high-level semantic features (understanding).VQGAN-based tokenizerfrom LlamaGen [24] for low-level discrete tokens (generation). - Output: Feature embeddings from
Janus1.5Bbackbone passed through modality-specific output heads for final text and image outputs. - Decoupling: Decouples understanding and generation paths.
-
Muddit [131]:
- Concept: Unified model for bidirectional generation using a
purely discrete diffusion framework. - Architecture: Single
Multimodal Diffusion Transformer (MM-DiT)similar toFLUX[210], initialized fromMeissonic[211] (trained for high-resolution synthesis). - Quantization: Both modalities quantized into a shared discrete space:
VQ-VAE[32] encodes images,CLIP[22] provides text token embeddings. - Training: Cosine scheduling strategy to mask tokens. MM-DiT trained to predict clean tokens conditioned on the other modality.
- Output: Lightweight linear head for text,
VQ-VAE decoderfor image reconstruction.
-
MMaDA [129]:
- Concept: Scales up the diffusion paradigm towards a unified multimodal foundation model.
- Language Backbone:
LLaDA-8B-Instruct[212]. - Image Tokenizer:
MAGVIT-v2[213] converts images to discrete semantic tokens. - Alignment: Introduces
mixed chain-of-thought (CoT) fine-tuningto unify reasoning formats between text and vision tasks. - Optimization: Incorporates
UniGRPO, a unified policy-gradient-basedReinforcement Learning (RL)algorithm for diffusion models, enabling post-training optimization across reasoning and generation tasks using diversified reward signals.
Challenges:
- Inference Efficiency: Discrete diffusion models can be slower than autoregressive counterparts due to lack of key-value cache support and quality degradation with parallel decoding.
- Training Difficulties: Sparse supervision in discrete diffusion (loss on subset of masked tokens) leads to inefficient training and high variance.
- Length Bias: Models struggle to generalize across different output lengths, lacking a natural stopping mechanism like end-of-sequence tokens.
- Architectural Suitability: Many models reuse autoregressive designs which might not be optimal for capturing joint data distributions inherent to diffusion.
- Infrastructure: Limited mature frameworks and open-source options compared to autoregressive models.
4.2.2. Autoregressive Models
Autoregressive (AR) models serialize both vision and language tokens and model them sequentially, typically using a Transformer backbone adapted from LLMs (e.g., LLaMA, Qwen, Gemma) as a unified modality-fusion module. The VLM description of images/3.jpg (which is Fig. 4 in the paper) illustrates the core components and types of AR models.
Image Tokenization Strategies: Models are categorized by how they tokenize (encode) images for the AR framework:
4.2.2.1. Pixel-based Encoding (Fig. 5 (b-1) in images/4.jpg)
- Principle: Images are represented as continuous or discrete tokens obtained from pre-trained autoencoders (like VQGAN-style models [32, 220, 221, 222]) optimized for image reconstruction. These encoders compress the high-dimensional pixel space into a compact latent space. The resulting image tokens are then processed similarly to text tokens within a single sequence.
- Encoder/Decoder: VQGAN-like models are typically used as both encoder and decoder. The decoder is usually a lightweight convolutional architecture.
- Training: Often uses causal attention masks and a next-token prediction loss for unified training across modalities.
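The sketch below illustrates the core idea of pixel-based encoding in AR models: discrete image tokens are merged into one causal sequence with text tokens (here by offsetting them past the text vocabulary, with assumed special boundary tokens), so that a single next-token loss covers both modalities.

```python
import torch

# Illustrative sketch: interleaving text tokens and VQ image tokens into one
# causal sequence. Vocabulary sizes and special-token ids are assumptions.
TEXT_VOCAB = 32_000
IMAGE_VOCAB = 8_192
BOI, EOI = TEXT_VOCAB + IMAGE_VOCAB, TEXT_VOCAB + IMAGE_VOCAB + 1  # image delimiters

def build_sequence(text_ids: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
    """text_ids: (T_text,); image_ids: (T_img,) with values in [0, IMAGE_VOCAB)."""
    shifted = image_ids + TEXT_VOCAB              # move image tokens past text vocab
    boi, eoi = torch.tensor([BOI]), torch.tensor([EOI])
    return torch.cat([text_ids, boi, shifted, eoi])  # one causal sequence

seq = build_sequence(torch.randint(0, TEXT_VOCAB, (12,)),
                     torch.randint(0, IMAGE_VOCAB, (256,)))
# A decoder-only Transformer with a (TEXT_VOCAB + IMAGE_VOCAB + 2)-way softmax
# can now model text and image tokens with a single next-token loss.
```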
Representative Models:
- LWM [29]: Uses a VQGAN tokenizer [32] to encode images into discrete latent codes without semantic supervision. Learns world dynamics through reconstruction-based visual tokens and textual descriptions.
- Chameleon [30] & ANOLE [132]: Adopt VQ-IMG [222], an improved VQ-VAE variant with a deeper encoder and residual prediction for better detail preservation.
- Emu3 [133], SynerGen-VL [136], UGen [138]: Employ SBER-MoVQGAN [220, 221], a multiscale VQGAN variant for expressive visual representations.
- Liquid [137]: Uses a VQGAN-style tokenizer, highlighting mutual benefits of unified understanding and generation under a shared visual token representation.
- MMAR [134], Orthus [135], Harmon [139]: Use continuous-valued image tokens (e.g., from SD-VAE or EmbeddingViT) and decouple the diffusion process from the AR backbone by employing lightweight diffusion MLPs as decoders.
- TokLIP [140]: Integrates a low-level discrete VQGAN tokenizer with a ViT-based SigLIP [209] encoder for high-level continuous semantics, enhancing both generative capacity and semantic understanding.
- Selftok [141]: Introduces a discrete visual self-consistency tokenizer for high-quality reconstruction and compression, optimizing for visual reinforcement learning.
Limitations:
- Lack of Semantic Abstraction: Visual tokens optimized for pixel reconstruction often lack high-level semantics, complicating cross-modal alignment.
- Computational Overhead: Dense token grids lead to long sequence lengths, increasing computational and memory costs for high-resolution images.
- Modality-Specific Biases: Reconstruction-centric objectives can result in visual tokens retaining biases towards textures and low-level patterns, which may not be optimal for semantic reasoning.
4.2.2.2. Semantic Encoding (Fig. 5 (b-2) in images/4.jpg)
- Principle: Image inputs are processed using pre-trained text-aligned vision encoders (e.g., OpenAI-CLIP [22], SigLIP [209], EVA-CLIP [36], UNIT [223]). These encoders produce visual embeddings that align closely with language features in a shared semantic space, improving cross-modal alignment.
- Decoder: Most models pair semantic encoders with diffusion-based decoders (e.g., the SD family [14], IP-adapter [227], FLUX [16], Lumina-Next [228]), trained independently from the MLLM. The MLLM produces semantic-level visual tokens, which the diffusion decoder refines into high-resolution images.
- Training: Typically uses causal attention masks and a next-token prediction loss on text and semantic visual tokens.
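As a rough sketch of this encode-then-decode split, the snippet below shows semantic visual tokens produced by an MLLM being projected into a diffusion decoder's conditioning space and consumed through cross-attention. All modules and dimensions are toy assumptions.

```python
import torch
import torch.nn as nn

# Illustrative semantic-encoding pipeline: the MLLM emits continuous "semantic
# visual tokens", and a separately trained diffusion decoder treats them as its
# conditioning context.
mllm_hidden, cond_dim = 4096, 768
to_condition = nn.Linear(mllm_hidden, cond_dim)    # bridge MLLM -> decoder space

semantic_tokens = torch.randn(1, 64, mllm_hidden)  # produced autoregressively by the MLLM
condition = to_condition(semantic_tokens)          # (1, 64, cond_dim)

# A diffusion decoder would consume `condition` via cross-attention at every
# denoising step; a single attention layer stands in for that interface here.
cross_attn = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)
noisy_latent_tokens = torch.randn(1, 256, cond_dim)
refined, _ = cross_attn(noisy_latent_tokens, condition, condition)
```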
Representative Models:
- Emu [142], Emu2 [33], LaViT [143]: Employ
EVA-CLIP[36] as the vision encoder. Emu unifies VQA, image captioning, and image generation. Emu2 scales the MLLM to 37B parameters. LaViT usesdynamic visual tokenization. - DreamLLM [34], VL-GPT [35], MM-Interleaved [144], PUMA [147]: Utilize
OpenAI-CLIP[22]. DreamLLM uses a linear projection. VL-GPT employs a causal Transformer. MM-Interleaved and PUMA extract multi-granular features. - Mini-Gemini [145]: Uses dual semantic encoders (
CLIPfor global features,RQ-VAEfor dense local features) with a cross-attention module to refine global tokens. - MetaMorph [148]: Employs
SigLIP[209] and introduces modality-specific adapters within the LLM. - ILLUME [149]: Adopts
UNIT[223] as its vision encoder, which is trained with both image reconstruction and contrastive alignment losses for balanced representation. - VILA-U [146], UniTok [150]: Mimic
UNITto obtain novel text-aligned vision tokenizers. - QLIP [151]: Addresses reconstruction vs. text-image alignment conflict using
binary-spherical quantization. - Tar [157]: Leverages LLM vocabulary for visual codebook, uses scale-adaptive pooling, and employs diffusion for visual generation enhancement.
- UniFork [153]: Shares shallow-layer parameters for understanding/generation, but uses distinct deeper networks.
- UniCode2 [154]: Employs a
cascaded codebook(frozen foundational codebook from clusteredSigLIPfeatures + supplementary learnable codebooks). - DualToken [152]: Uses shallow
SigLIPfeatures for reconstruction and deepSigLIPfeatures for semantic learning. - X-Omni [160]: Uses
SigLIP-VQand reinforcement learning to mitigate errors and information loss. - Qwen-Image [161], OmniGen2 [158], Ovis-U1 [159], Bifrost-1 [162], UniWorld [155], Pisces [156]: Integrate diffusion models (e.g.,
MMDiT,OmniGen,FLUX,DiT,SD-1.5) as decoders, using the MLLM's visual comprehension outputs as conditions.
Limitations:
- Limited Pixel-level Control: Semantic embeddings lack spatial density for fine-grained editing (inpainting, structure-preserving transformations).
- Insufficient Spatial Correspondence: Global/mid-level representations may be inadequate for tasks requiring precise spatial alignment (e.g., referring expression segmentation).
- Mismatch between MLLM and Decoder: Separate training of the semantic encoder and the diffusion decoder can lead to semantic drift or generation artifacts due to the lack of end-to-end optimization.
4.2.2.3. Learnable Query Encoding (Fig. 5 (b-3) in images/4.jpg)
- Principle: Introduces a set of learnable query tokens that dynamically extract informative content from image features, creating compact and semantically aligned embeddings. These queries act as content-aware probes interacting with visual encoders.
- Training: Usually involves image-text contrastive learning and/or image reconstruction supervision.
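The following sketch shows the essence of learnable query encoding: a small set of trainable query vectors cross-attends to frozen image features and returns a compact visual summary. A single attention layer stands in for a full Q-Former, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of learnable query encoding: fixed-size queries probe image features.
class QueryEncoder(nn.Module):
    def __init__(self, num_queries: int = 32, dim: int = 768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)   # queries attend to the image
        return out                                        # (batch, num_queries, dim)

summary = QueryEncoder()(torch.randn(2, 256, 768))        # 256 patch features -> 32 tokens
```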
Representative Models:
- SEED [163], SEED-LLAMA [164], SEED-X [165]:
- Concept: Use a
seed tokenizerto learn causal visual embeddings. - Process: Input image encoded by
BLIP-2 ViT encoder[53] into dense token features. These features are concatenated withlearnable query tokensand processed by acausal Q-Former. - Decoders: SEED-LLAMA and SEED-X upgrade to
UnCLIP-SD[14] orSDXL[226] decoders.
- Concept: Use a
- MetaQueries [166], OpenUni [170], Nexus-Gen [167], Ming-Lite-Uni [168], Ming-Omni [171], BLIP3-o [169], UniLIP [172], TBAC-Unilmage [173]:
- Concept: Simplified learnable query encoding where
image features(e.g., fromSigLIP[209]) are extracted by a frozen encoder, concatenated withlearnable query tokens, and directly passed through a frozenvision-language backbone(e.g., LLaVA, Qwen2.5-VL). - Conditioning: Output causal embeddings condition a diffusion decoder.
- Innovations: Nexus-Gen uses a more powerful diffusion decoder (
FLUX-1.dev). Ming-Lite-Uni and Ming-Omni use advanced MLLMs (M2-omini[200]), multi-scale learnable tokens, and multi-scale representation alignment. Ming-Omni also integrates audio and uses a dual-stage training strategy. BLIP3-o exploresflow matching lossfor better image quality. UniLIP incorporates reconstruction ability into CLIP tokens. TBAC-Unilmage applies learnable queries at multiple MLLM layers.
- Concept: Simplified learnable query encoding where
Limitations:
- Computational Overhead: Increased complexity with more query tokens (memory, computation).
- Fixed Encoder Limitations: Reliance on fixed encoders can limit adaptability to novel visual inputs.
- Adaptability of Visual Features: Frozen backbones can restrict dynamic alignment of image features with evolving query semantics.
- Coverage of Diverse Content: A small, fixed query set may not adequately capture the richness of complex scenes or fine-grained details, especially for highly detailed outputs.
4.2.2.4. Hybrid Encoding
- Principle: Combines the strengths of pixel-based encoding (preserving fine-grained visual details) and semantic-based encoding (providing high-level semantic abstraction).
4.2.2.4.1. Pseudo Hybrid Encoding (Fig. 5 (b-4) in images/4.jpg)
- Principle: Uses dual encoders, typically a semantic encoder (e.g., SigLIP) and a pixel encoder (e.g., VQGAN or VAE), but activates them in a task-specific manner. The semantic encoder serves understanding tasks, and the pixel encoder serves generation tasks.
- Decoders: Typically engage pixel decoders to reconstruct images from latent codes.
Representative Models:
- Janus [174], Janus-Pro [175], OmniMamba [176], Unifluid [177], MindOmni [178], Skywork UniPic [179]:
- Concept: Train dual encoders concurrently with combined understanding and generation datasets, but one encoder is disabled during inference for the other task.
- Example: For understanding, the
semantic encoderis active; for text-to-image generation, thepixel encoderis active. For image editing, Unifluid uses the semantic encoder, while MindOmni uses both VAE and semantic encoder to encode the source image. - Limitation: Underutilizes potential synergy because both encoders are not leveraged simultaneously during inference for a single task. Lack of semantic grounding in generation tasks and high-fidelity details in understanding tasks.
4.2.2.4.2. Joint Hybrid Encoding (Fig. 5 (b-5) in images/4.jpg)
- Principle: Integrates both semantic and pixel tokens into a single unified input for the language model or decoder, enabling simultaneous utilization of both representations. These models differ in their fusion strategies.
- Decoders: Support both pixel decoders (e.g., VQGAN, Infinity [231], VAR-D30 [113]) and diffusion-based decoders (e.g., SDXL [226]).
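A tiny sketch of the two fusion strategies mentioned for joint hybrid encoding, channel-wise versus sequence-wise concatenation of semantic and pixel token features (shapes are illustrative):

```python
import torch

# Two common fusion strategies for joint hybrid encoders.
semantic = torch.randn(1, 256, 1152)      # e.g. SigLIP patch features
pixel = torch.randn(1, 256, 1152)         # e.g. projected VQGAN token embeddings

channel_fused = torch.cat([semantic, pixel], dim=-1)   # (1, 256, 2304): one token, both views
sequence_fused = torch.cat([semantic, pixel], dim=1)   # (1, 512, 1152): both token streams kept

# Either tensor is then projected to the LLM width and processed as its visual input.
```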
Representative Models:
- MUSE-VL [180], UniToken [186]: Concatenate features from
SigLIPandVQGANalong the channel dimension before passing them to the LLM. - Tokenflow [181]: Incorporates dual encoders and codebooks with a shared mapping for joint optimization of high-level semantics and low-level pixel details.
- VARGPT [182], VARGPT-1.1 [184], ILLUME [185]: Concatenate semantic and pixel tokens along the sequence dimension, maintaining both types in the LLM's input.
- SemHiTok [183]: Introduces
Semantic Guided Hierarchical Codebook (SGHC)to inherit semantic information while incorporating texture information for pixel reconstruction. - Show-o2 [187]: Uses separate network branches for processing latent features from
3DVAE[230], aggregating outputs with a spatial-temporal fusion module.
Limitations:
- Increased Complexity: Dual-encoder architectures and token fusion increase computational costs and training times.
- Alignment Challenges: Ensuring effective alignment between semantic and pixel-level features is difficult.
- Trade-offs: Balancing understanding and generation objectives can lead to compromises in performance.
- Implicit Alignment Assumption: Assumes implicit alignment between pixel and semantic tokens, which is non-trivial and can lead to conflicting supervision signals.
4.2.3. Fused Autoregressive and Diffusion Models
This paradigm combines autoregressive generation for text tokens with multi-step denoising via diffusion for image tokens. This allows for compositional reasoning (AR) and high visual quality (diffusion).
4.2.3.1. Pixel-based Encoding (Fig. 5 (c-1) in images/4.jpg)
- Principle: Images are transformed into discrete tokens or continuous latent vectors (e.g., via SD-VAE), which become targets for a diffusion-based denoising process conditioned on autoregressively generated text.
- Training: Combines an autoregressive loss for language modeling and a diffusion loss for image reconstruction. Uses bidirectional attention for spatial coherence.
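The combined objective can be sketched as follows: a next-token cross-entropy on text positions plus a noise-prediction MSE on image latents, summed into one training loss. The equal weighting and the toy tensors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Toy combined objective for a fused AR + diffusion backbone.
text_logits = torch.randn(2, 31, 32_000)          # AR predictions for text positions
text_targets = torch.randint(0, 32_000, (2, 31))
ar_loss = F.cross_entropy(text_logits.reshape(-1, 32_000), text_targets.reshape(-1))

pred_noise = torch.randn(2, 4, 32, 32)            # denoiser output on image latents
true_noise = torch.randn(2, 4, 32, 32)
diff_loss = ((pred_noise - true_noise) ** 2).mean()

total_loss = ar_loss + 1.0 * diff_loss            # jointly optimized in one backbone
```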
Representative Models:
- Transfusion [38], MonoFormer [37], LMFusion [188]:
- Concept: Use continuous latent representations extracted via
SD-VAE. - Architectural Innovations: Transfusion proposes a unified Transformer backbone with modality-specific layers. MonoFormer introduces a compact architecture with shared blocks and task-dependent attention masking. LMFusion uses a lightweight visual injection module for frozen LLMs.
- Concept: Use continuous latent representations extracted via
- Show-o [39]:
- Concept: Employs a discrete pixel-based tokenizer based on
MAGVIT-v2[213], generating symbolic image tokens compatible with Transformer-style decoding. - Process: Supports AR-based text token generation and diffusion-based image synthesis.
- Concept: Employs a discrete pixel-based tokenizer based on
Limitations:
- Computational Overhead: Continuous latent spaces and iterative diffusion sampling are resource-intensive.
- Text-Visual Alignment: Latent representations from unsupervised reconstruction objectives might not be optimally aligned with semantic language tokens, affecting controllability.
- Discrete Tokenization Issues: Codebook collapse and limited capacity for subtle visual nuances in VQ-based discrete tokens.
4.2.3.2. Hybrid Encoding (Fig. 5 (c-2) in images/4.jpg)
- Principle: Fuses both semantic features (e.g., from CLIP or ViT encoders) and pixel-level latents (e.g., from SD-VAE) for a more expressive image representation.
- Decoder: Usually rectified-flow based.
Representative Models:
- Janus-flow [189], Mogao [190], BAGEL [191]:
- Concept: Adopt a
dual-encoder architectureand userectified flowfor generation. - Encoders: Decouple understanding and generation encoders.
SigLIPorSigLIP+SDXL-VAEfor understanding;SDXL-VAEorFLUX-VAEfor image generation. - Limitation: The
pseudo hybrid encodingdesign limits simultaneous leveraging of semantic and pixel-level features during generation, as only the pixel encoder is active during image synthesis.
- Concept: Adopt a
Challenges:
-
Increased Model Complexity: Dual-encoder architectures and combined AR/diffusion processes lead to higher computational costs and longer training.
-
Effective Alignment: Difficult to ensure effective alignment between semantic and pixel-level features.
-
Trade-offs: Balancing understanding and generation objectives often leads to compromises.
The VLM description of images/4.jpg (Fig. 5 in the paper) visually summarizes this entire categorization, showing the flow from modality encoders through different backbones (Diffusion, MLLM (AR), MLLM (AR + Diffusion)) and their respective modality decoders and masking strategies.
4.2.4. Any-to-Any Multimodal Models
- Principle: Aims to process and generate across a diverse set of modalities beyond text and image, including audio, video, speech, and music. These models unify modality-specific encoders and decoders within a single architecture, using a shared backbone for cross-modal representation learning and sequence modeling.
Representative Models:
- Modular Designs:
- OmniFlow [199]: Integrates
HiFiGen[232] for audio/music,SD-VAE[14] for images, and aDiT-like diffusion model (MMDiT)[15] as the backbone.
- OmniFlow [199]: Integrates
- Shared Embedding Spaces:
- Spider [198], X-VILA [196], Next-GPT [192]: Leverage
ImageBindto map six modalities (text, image, video, audio, depth, thermal) into a single embedding space, then use modality-specific decoders (e.g.,Stable Diffusion,Zeroscope,LLM-based text decoders).
- Spider [198], X-VILA [196], Next-GPT [192]: Leverage
- Sequence-to-Sequence Paradigm:
- AnyGPT [195]: Uses
EnCodec[233] for audio,SpeechTokenizer[234] for speech, and trains a unified Transformer with modality-specific prefixes. - Unified-IO 2 [193]: Structured encoder-decoder design for visual, audio, and language, supporting tasks like
AST-to-textorspeech-to-image.
- AnyGPT [195]: Uses
- M2-omni [200]:
- Concept: Highly versatile architecture for text, image, video, and audio.
- Tokenizers/Decoders: Uses
NaViT[235] for video/image encoding,SD-3[226] as image decoder. For audio,paraformer-zh[236] extracts tokens, andCosyVoice[237] flow matching/vocoder model generates audio streams.
Challenges:
- Modality Imbalance: Text and image modalities often dominate, limiting diversity.
- Scalability: Supporting many modalities increases complexity, latency, and resource requirements.
- Semantic Consistency: Maintaining grounded and aligned outputs across diverse modalities is difficult.
5. Experimental Setup
As a survey paper, this work does not present novel experimental results from its own methodology. Instead, it systematically reviews the datasets and benchmarks used by the models it surveys. This section will summarize those resources as presented in the paper.
5.1. Datasets
The paper emphasizes that large-scale, high-quality, and diverse training data are crucial for building powerful unified multimodal models. It categorizes these datasets by their primary use and modality characteristics, focusing on those released from 2020 onwards. It also notes that, before multimodal training, many models are initialized from LLMs pre-trained on large-scale natural language corpora (e.g., Common Crawl, RedPajama, WebText); these text-only datasets are excluded from the detailed discussion.
5.1.1. Multimodal Understanding Datasets
These datasets train models for tasks like image captioning, visual question answering (VQA), image-text retrieval, and visual grounding. They consist of images paired with textual descriptions.
- RedCaps [238]: Contains 12 million image-text pairs sourced from Reddit, focusing on everyday items and moments.
- Wukong [239]: A large-scale Chinese multimodal pre-training dataset with 100 million Chinese image-text pairs filtered from the web, addressing the need for Chinese multimodal data.
- LAION [240]: One of the largest publicly available image-text pair datasets.
LAION-5B contains nearly 6 billion image-text pairs crawled from the web, filtered using CLIP models for relevance. Laion-COCO [242], a subset, contains 600 million high-quality samples stylistically closer to MS COCO.
- COYO [241]: Another large-scale image-text pair dataset with approximately 747 million samples, sourced from web crawls.
- DataComp [243]: Comprises 1.4 billion samples derived from Common Crawl, using
CLIP scoreandImage-based filteringfor higher quality image-text pairs. - ShareGPT4V [246]: Provides approximately 100K high-quality image-text conversational data points, designed to enhance instruction-following and dialogue capabilities.
- ALLaVA [216]: A synthetic dataset with 1.4 million samples, generated using proprietary models like
GPT-4Vto create fine-grained captions and complex reasoning VQA pairs, for training resource-friendlyLite Vision-Language Models (LVLMs). - CapsFusion-120M [245]: A collection of 120M image-text pairs from
LaionCOCO, with captions integrated usingCapsFusionLLaMA. - Cambrian-10M(7M) [247]: A large-scale dataset for multimodal instruction tuning, with an initial 10M samples, refined to 7M after quality filtering.
- LLaVA-OneVision [248]: A visual instruction tuning collection with 3.2 million single-image samples (QA, OCR, math) and 1.6 million mixed-modal samples (video, multi-image).
- Infinity-MM [249]: A comprehensive multimodal training dataset with over 40 million samples, collected from existing open-source datasets and newly generated data, including image captions, general/selective visual instructions, and
GPT-4orVLM-based pipelinesynthesized data. - Other Datasets:
GRIT (Grid-based Representation for Image-Text)[244] (20M samples) focuses on fine-grained image region-text phrase alignment.SAM Dataset[251] (11M high-resolution images with segmentation masks) can enhance fine-grained understanding.
5.1.2. Text-to-Image Datasets
These datasets train models to generate images corresponding to textual descriptions, often with an emphasis on aesthetic quality, content richness, or stylistic attributes.
- CC-12M (Conceptual Captions 12M) [250]: About 12 million image-text pairs from web Alt-text, known for concise and descriptive captions.
- LAION-Aesthetics [240]: A subset of
LAIONwith 120 million images and texts, filtered using an aesthetic scoring model. - Text Rendering Datasets:
- Mario-10M [252]: 10 million samples used to train
TextDiffuserfor improved text placement and legibility. - RenderedText [253]: 12 million high-resolution synthetic images of handwritten text for understanding and generation.
- AnyWord-3M [255]: 3 million samples crucial for
AnyText[255] and enhancing generated text quality. - TextAtlas5M [265]: Targets dense text generation with interleaved documents, synthetic data, and real-world images with long captions.
- Mario-10M [252]: 10 million samples used to train
- JourneyDB [254]: 4 million high-quality image-prompt pairs generated by
Midjourney, valuable for training models for complex, detailed, and artistic text-to-image mappings. - CosmicMan-HQ 1.0 [256]: 6 million high-quality real-world human images with precise text annotations (from 115 million attributes), for generating human images.
- DOCCI [257]: 15K curated images with long, human-annotated English descriptions (avg. 136 words), for fine-grained details and complex compositions.
- PixelProse [258]: Extracted from
DataComp,CC-12M, andRedCaps, with richly annotated images and metadata (watermark, aesthetic scores). - Megalith [260]: 10 million links to Flickr images (photo category, no copyright restrictions) with community-made captions (ShareCaptioner, Florence2, InternVL2).
- PD12M [262]: 12.4 million high-quality public domain/CC0-licensed images paired with synthetic captions from
Florence-2-large[294]. - Synthesized Datasets:
- text-to-image-2M [261]: 2 million enhanced text-image pairs for fine-tuning, curated by T2I and captioning models.
- SFHQ-T2I [263]: 122K diverse, high-resolution synthetic face images generated by multiple T2I models.
- EliGen TrainSet [264]: Uses images from
FLUX.1-devandMLLM-generated promptsfor stylistic consistency and detailed annotation. - BLIP-3o 60k [169]: 60,000 instruction tuning samples distilled from
GPT-4o. - ShareGPT4o-Image [266]: 45K text-to-image pairs, with prompts generated via attribute-first and image-first approaches, and images synthesized by
GPT-4o. - Echo-4o-Image [267]: Over 100K samples targeting surreal fantasy scenarios and complex instructions to enhance model imagination.
- Other Datasets:
SAM dataset[251] andDenseFusion[259] (1M samples).
5.1.3. Image Editing Datasets
These datasets train models to alter input images according to textual commands, comprising (source image, editing instruction, target image) triplets.
- InstructPix2Pix [268]: 313K (instruction, input image, output image) samples generated synthetically using GPT-3 and Stable Diffusion to create before-and-after image pairs based on instructions.
- MagicBrush [269]: 10K high-quality, manually annotated samples for instruction-based image editing (object manipulation, style transfer, with masks).
- HIVE [270]: Introduces human feedback, providing a 1.1M training dataset (similar to InstructPix2Pix generation) and a 3.6K reward dataset for human ranking.
- EditWorld [272]: Focuses on "world-instructed image editing," curated via GPT-3.5 and T2I models for complex input-output generation, and video frame extraction with VLM-generated instructions.
- PromptFix [274]: 1.01M triplets focusing on low-level image processing tasks (inpainting, dehazing, super-resolution).
- HQ-Edit [271], SEED-Data-Edit [165], UltraEdit [273], OmniEdit [275], AnyEdit [276]: Large-scale datasets (e.g., SEED-Data-Edit: 3.7M, UltraEdit: 4M, AnyEdit: 2.5M, OmniEdit: 1.2M, HQ-Edit: 197K) often combining automated generation with human filtering for diversity and quality.
- RefEdit [277]: Synthetic dataset targeting instruction-based editing with referring expressions, generated using GPT-4o for text, FLUX for images, Grounded SAM for masks, and specialized models for controlled edits.
- ImgEdit [278]: 1.2M edit pairs for high-quality single/multi-turn editing, using a multi-stage pipeline with LAION-Aesthetics, VLMs, detection/segmentation models, state-of-the-art generative models (FLUX, SDXL), and GPT-4o for quality filtering.
- ByteMorph-6M [279]: Over 6 million image editing pairs for non-rigid motions (camera viewpoint, object deformations), constructed by VLM-generated motion captions, image-to-video models, and LLM-generated instructions for frame transformations.
- ShareGPT-4o-Image (Editing) [266]: 46K instruction-guided image editing triplets, using GPT-4o to synthesize edited images from source images and LLM-generated instructions.
- GPT-Image-Edit-1.5M [280]: Over 1.5 million high-quality instruction-guided image editing triplets, leveraging GPT-4o to unify and refine existing datasets (OmniEdit, HQ-Edit, UltraEdit) by regenerating outputs and rewriting prompts.
- X2Edit [281]: 3.7 million samples for arbitrary-instruction image editing, balanced across 14 tasks, constructed via VLM-generated task-aware instructions and expert generative models, with rigorous filtering.
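To make the (source image, editing instruction, target image) layout concrete, here is a minimal, hypothetical record sketch; the class, field names, and file paths are illustrative and are not taken from any specific dataset release.

```python
from dataclasses import dataclass
from pathlib import Path
from PIL import Image

@dataclass
class EditSample:
    """One (source image, editing instruction, target image) triplet."""
    source_path: Path   # image before editing
    instruction: str    # natural-language edit command
    target_path: Path   # image after editing

    def load(self) -> tuple[Image.Image, str, Image.Image]:
        return Image.open(self.source_path), self.instruction, Image.open(self.target_path)

# Illustrative usage with made-up paths:
sample = EditSample(Path("imgs/0001_src.jpg"),
                    "make the sky look like sunset",
                    Path("imgs/0001_tgt.jpg"))
```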
5.1.4. Interleaved Image-Text Datasets
These datasets contain documents or sequences where text and images naturally follow each other, enhancing models' ability to comprehend and generate multimodal content in context.
- Multimodal C4 (MMC4) [282]: Augments the text-only C4 corpus with over 101 million documents and 571 million images, algorithmically interleaved from Common Crawl.
- OBELICS [283]: 141 million multimodal web documents from Common Crawl, featuring 353 million images interleaved with 115 billion text tokens, focusing on full document structure.
- CoMM [284]: 227K high-quality, curated samples focusing on the coherence and consistency of interleaved image-text sequences, primarily from instructional and visual storytelling websites.
- OmniCorpus [285]: A very large-scale (10 billion-level) image-text interleaved dataset, with 8.6 billion images and 1,696 billion text tokens from 2.2 billion documents, extracted and filtered from diverse sources (web, video platforms) with human-feedback filtering.
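A simplified view of what one interleaved training document might look like; the schema below is illustrative only, and actual releases such as MMC4 or OBELICS define their own formats.

```python
from dataclasses import dataclass

@dataclass
class InterleavedDoc:
    """A web document as an ordered mix of text and image segments."""
    segments: list[dict]  # each: {"type": "text", "content": ...} or {"type": "image", "url": ...}

doc = InterleavedDoc(segments=[
    {"type": "text", "content": "Step 1: whisk the eggs."},
    {"type": "image", "url": "https://example.com/step1.jpg"},
    {"type": "text", "content": "Step 2: fold in the flour."},
])
# A unified model consumes the segments in order, so image tokens appear
# exactly where the surrounding text references them.
```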
5.1.5. Other Text+Image-to-Image Datasets
These datasets enhance specific capabilities like generating images based on provided subject images or control signals.
- LAION-Face [286]: 50 million image-text pairs used for ID-preserving image generation.
- MultiGen-20M [287]: 20 million samples to train models for unified image generation conditioned on multiple control signals (text, edge maps, depth maps, segmentation masks, sketches), such as UniControl [287]. Data is structured as triplets (e.g., "depth map, instruction with prompt, target image").
- Subjects200K [288]: 200K samples for subject-driven image generation, synthetically generated using an LLM (ChatGPT-4o) for structured descriptions and FLUX [16] for image synthesis, with LLM quality assessment.
- SynCD [289]: Synthetic Customization Dataset with 95K sets of images for text + image-to-image customization, generated using existing T2I models and 3D asset datasets (Objaverse [296]) for multiple consistent views of objects.
- X2I-subject-driven [84]: Facilitates subject-driven generation via the GRIT-Entity dataset (objects detected/segmented from GRIT, optionally repainted by MS-Diffusion [297]) and Web-Images (notable individuals identified via text analysis/LLM, web images scraped, visually verified, and captioned).
- Graph200K [290]: A graph-structured dataset built on Subjects200K, augmenting images with 49 types of annotations across five meta-tasks (conditional generation, IP preservation, style transfer, image editing, restoration), enabling models to learn shared knowledge for universal image generation.
- Echo-4o-Image (Multi-Reference) [267]: 73K synthetic samples for "Multi-to-one" generation, with diverse instructions and rich reference information for multi-reference image composition.
5.1.6. Data Synthesis Pipelines
The paper notes the growing reliance on data synthesis to address the shortage of high-quality public data for complex tasks.
- Data synthesis from images: Uses models like BLIP-2 [53] or Kosmos2 [244] for initial captioning and grounding, followed by object detection (Grounding DINO [300]) and segmentation (SAM [251]) to extract subject masks and region captions (a pseudocode sketch follows this list).
- Data synthesis from videos: Alleviates "copy-paste" issues by extracting subjects from different frames using video segmentation models (e.g., SAM2 [301]), also enabling training data for image editing.
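The image-based pipeline referenced above can be summarized as a short sequence of model calls. This is a hedged pseudocode sketch: every function name is a placeholder standing in for the cited models, not a real API.

```python
def synthesize_subject_data(image):
    """Illustrative image-to-training-data pipeline (placeholder functions)."""
    caption = caption_and_ground(image)        # e.g., a BLIP-2 / Kosmos-2 style captioner
    boxes = detect_objects(image, caption)     # e.g., an open-vocabulary detector such as Grounding DINO
    samples = []
    for box in boxes:
        mask = segment(image, box)             # e.g., a SAM-style promptable segmenter
        region_caption = caption_region(image, box)
        samples.append({
            "image": image,
            "subject_mask": mask,
            "region_caption": region_caption,
            "global_caption": caption,
        })
    return samples
```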
5.2. Evaluation Metrics
The paper discusses various benchmarks and evaluation suites rather than individual low-level metrics. These benchmarks are designed to assess the capabilities of unified multimodal models across understanding, generation, and interleaved tasks.
5.2.1. Evaluation on Understanding
5.2.1.1. Perception
These benchmarks evaluate how accurately models connect visual inputs with linguistic descriptions through grounding, recognition, and retrieval.
- Flickr30k [365]: An early benchmark for image-text retrieval and captioning.
- MS COCO Captions [366]: Evaluates models' ability to retrieve relevant captions and localize textual phrases to image regions.
- VQA [302], VQA v2 [303]: Visual Question Answering benchmarks where models answer free-form queries about objects, attributes, and relationships in images.
- VisDial [308]: Visual Dialog, assessing models' ability to engage in multi-turn conversations about an image.
- TextVQA [310]: Focuses on Visual Question Answering where questions involve reading text within images.
- ChartQA [309]: Assesses understanding of structured charts and graphs, combining visual perception with quantitative reasoning.
- VSR [8]: Probes spatial relation reasoning in real-world images.
- MMBench [316]: Supplies 3K bilingual multiple-choice questions spanning grounding, recognition, and retrieval for cross-lingual comparison.
- MMMU [317]: Adds ~11.5K college-level multimodal problems across six disciplines to probe domain knowledge and logical deduction.
- HaluEval [312]: Diagnoses hallucination recognition on diverse model-generated and annotated statements.
- MM-Vet [318] & MM-Vet v2 [319]: Covers recognition, OCR, spatial reasoning, math, and open question answering, with v2 evaluating interleaved image-text sequences.
- SEED-Bench [321]: Offers 19K multi-choice items over 12 dimensions, generated via a pipeline targeting specific evaluation dimensions.
- LLaVA-Bench [314]: Provides COCO and in-the-wild image sets with dense queries for generalization checks.
- LAMM [313]: Supplies instruction-tuning examples covering 2D and 3D modalities for agent development.
- Open-VQA [322]: Formulates hierarchical follow-up questions to refine coarse VQA answers.
- OwlEval [315]: Offers human-rated open-ended visual questions assessing relevance and informativeness.
- MMStar [320]: Curates balanced challenge samples across six core skills and 18 axes for high-precision evaluation.
5.2.1.2. Reasoning
These benchmarks probe progressively richer cognitive skills, requiring logical inference, commonsense, and explanation.
- CLEVR [304]: Systematically varies object attributes and spatial relations, testing counting, comparison, and relational logic.
- GQA [305]: Leverages dense scene graphs to generate compositional questions for testing consistency, grounding, and plausibility in natural images.
- OK-VQA [306] & A-OKVQA [311]: Commonsense extensions requiring world knowledge retrieval or inference for answers.
- VCR [307]: Demands a model not only to answer but also to justify it by selecting a coherent rationale, coupling recognition with explanation.
- MathVista [323]: Broadens scope to mathematical problem solving in visually grounded contexts, combining fine-grained visual understanding with symbolic manipulation.
- General-Bench [324]: An ultra-large benchmark comprising over 700 tasks and 325,800 instances across varied modalities and capabilities, for multimodal generalist models.
5.2.2. Evaluation on Image Generation
5.2.2.1. Text-to-Image Generation
These benchmarks emphasize compositional reasoning, prompt alignment, and real-world applicability beyond basic image quality metrics like FID [367] and CLIPScore [22].
- PaintSkills [326], DrawBench [72], PartiPrompts [325]: Evaluate core compositional capabilities.
- GenEval [329]: Evaluates six fine-grained tasks (single-object generation, object co-occurrence, counting, color control, relative positioning, attribute binding) by comparing outputs with pre-trained detectors.
- GenAI-Bench [334]: Presents 1.6K human prompts covering relational, logical, and attribute-based categories, combining human preference with automated alignment scores.
- HRS-Bench [327]: Evaluates 13 distinct skills grouped into five categories: accuracy, robustness, generalization, fairness, and bias.
- DPG-Bench [336]: Focuses on dense prompts describing multiple objects with various attributes and relationships.
- T2I-CompBench [330] & T2I-CompBench++ [337]: Specifically target compositional generalization using detector-based scoring.
- VISOR [368]: Proposes an automatic method for evaluating spatial understanding.
- Commonsense-T2I [332]: Challenges models to depict everyday concepts requiring commonsense grounding.
- EvalMuse40K [369]: Provides 40K crowdsourced prompts focusing on nuanced concept representation.
- HEIM [331]: Identifies 12 aspects including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
- FlashEval [370]: Shrinks large-scale evaluation sets to accelerate benchmark testing.
- MEMOBench [371]: Comprehensive benchmark for emotional understanding and expression capabilities.
- ConceptMix [335]: Evaluates compositional generation by sampling k-tuples of visual concepts and automatically verifying concept presence.
- TIFA [328]: Fine-grained benchmark for evaluating text-to-image faithfulness via VQA.
- DSG-1k [333]: Refines VQA questions using a multi-level semantic graph for enhanced image-prompt alignment.
- MMIG-Bench [338]: Multi-dimensional assessment framework for T2I models.
- OneIG-Bench [339]: Comprehensive fine-grained evaluation framework for T2I models.
- WISE [340] & WorldGenBench [342]: Evaluate world knowledge understanding, emphasizing semantic consistency, realism, and aesthetics.
- CVTG-2K [341]: Evaluates visual-text generation on complex multi-region layouts, diverse text attributes, and fine-grained positioning.
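For reference, the CLIPScore metric mentioned at the start of this subsection is straightforward to compute from CLIP embeddings. The sketch below uses the Hugging Face transformers CLIP implementation; the model checkpoint and the 2.5 scaling with a zero floor follow the original CLIPScore formulation, and this is an illustration rather than the survey's evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """CLIPScore = 2.5 * max(cosine(image_emb, text_emb), 0)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)

# score = clip_score(Image.open("generated.png"), "a red cube on a blue sphere")
```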
5.2.2.2. Image Editing
Benchmarks for instruction-guided image editing have grown in scale and scope.
- MagicBrush [269]: Large-scale, manually annotated dataset for instruction-guided real image editing (single-turn, multi-turn, mask-provided, mask-free).
- HQ-Edit [271]: ~200K high-resolution edits with computed alignment and coherence scores, using GPT-4V.
- I2EBench [347]: Consolidates over 2K images and 4K multi-step instructions across 16 editing dimensions.
- EditVal [344]: Standardized benchmark with fine-grained edit annotations and an automated evaluation pipeline aligned with human judgment.
- EmuEdit [345]: Covers seven editing tasks (background changes, object-level edits, style modifications) with paired instructions.
- Reason-Edit [346]: Diagnostic benchmark targeting causal and counterfactual reasoning (object relations, attribute dependencies, multi-step inference).
- EditBench [343]: Diagnostic benchmark for text-guided image inpainting with masked input/reference pairs.
- HumanEdit [348]: 5,751 high-resolution images and open-form instructions across six edit types, with annotated masks and multi-stage human feedback.
- IE-Bench [349]: Human-aligned benchmark for text-driven image editing quality.
- GEdit-Bench [350]: 606 real-world instruction-image pairs.
- CompBench [351]: Decomposes edits into location, appearance, dynamics, and object dimensions via MLLM and human collaboration.
- GIE-Bench [352]: Uses multi-choice VQA and object-aware masking to evaluate editing accuracy and content preservation.
- EditInspector [353], ComplexBench-Edit [354]: Undertake comprehensive evaluation of text-guided image editing, assessing vision consistency, artifact detection, instruction adherence, quality, and detail preservation.
- ByteMorph-Bench [279]: Tackles non-rigid image manipulation.
- RefEdit-Bench [277]: Evaluates referring-expression-based edits in complex multi-entity scenes.
- ImgEdit-Bench [278]: Unified image editing dataset and benchmark.
- KRIS-Bench [355]: Cognitively grounded suite assessing factual, conceptual, and procedural reasoning.
5.2.2.3. Other Types of Image Generation
- MultiGen-20M [287]: 20 million image-prompt-condition triplets (from LAION-Aesthetics-V2 [372]) for automated evaluation across diverse visual conditions.
- DreamBench [362]: Benchmarks personalized generation using 30 reference objects with curated prompts and human fidelity ratings.
- DreamBench++ [363]: Scales to 150 subjects and 1,350 prompts, using advanced VLMs for human-aligned scoring of concept preservation, composition, and style.
- VTBench [364]: Systematic benchmark for evaluating visual tokenizers in autoregressive image generation across reconstruction, detail preservation, and text preservation.
5.2.3. Evaluation on Interleaved Generation
These benchmarks challenge models to seamlessly alternate between text and image modalities across multiple turns, reflecting realistic dialogue and storytelling.
- InterleavedBench [356]: Human-curated benchmark for interleaved text-and-image generation, evaluating text quality, perceptual fidelity, multimodal coherence, and helpfulness.
- ISG [358]: Introduces scene-graph annotations and a four-tier evaluation (holistic, structural, block-level, image-specific) over 1K samples in eight scenarios and 21 subtasks.
- OpenING [360]: Assembles 5K human-annotated instances across 56 real-world tasks with IntJudge to test open-ended multimodal generation.
- OpenLEAF [357]: Gathers 30 open-domain queries to probe foundational interleaved text-image generation, measuring entity and style consistency via LMM evaluators and human validation.
- MMIE [359]: Unified interleaved suite sampling from 12 fields and 102 subfields, with multiple-choice and open-ended questions.
- UniBench [361]: Comprehensive compositional benchmark with 81 fine-grained tags for diversity.
5.3. Baselines
As this paper is a comprehensive survey, it does not propose a new model or conduct its own experiments against baselines. Instead, it systematically reviews and categorizes existing unified multimodal models, detailing their architectures and capabilities. The "baselines" in this context are the various models themselves, which are presented within each category and subcategory as examples of different approaches to unification. The paper aims to provide a structured overview of these models and their underlying design principles rather than a comparative performance evaluation against a single "baseline."
6. Results & Analysis
As a survey, this paper does not present new experimental results in the traditional sense (e.g., performance metrics of a new model compared to baselines). Instead, its "results" are the comprehensive categorization and analysis of existing unified multimodal understanding and generation models, along with the identification of their advances, challenges, and opportunities. The paper's findings are presented through its structured review of architectural paradigms, model implementations, and available resources.
6.1. Core Results Analysis
The core analysis presented in the paper revolves around how different models attempt to unify multimodal understanding and generation, highlighting the various architectural choices and their implications.
1. Architectural Paradigms for Unification: The survey identifies three primary architectural paradigms, reflecting distinct strategies for integration:
- Diffusion Models: These models leverage the inherent strengths of diffusion for high-quality image generation. They extend the denoising process to be conditioned on multimodal contexts, sometimes employing discrete diffusion for text and continuous for images, or fully discrete frameworks. The key advantage is superior sample quality and flexibility in conditioning. However, they face challenges in inference efficiency, sparse supervision during training, and architectural suitability.
- Autoregressive (AR) Models: These approaches serialize multimodal inputs (text and visual tokens) and process them sequentially using Transformer backbones; a minimal serialization sketch is shown after this list. Their strength lies in structural consistency with LLMs and their ability to perform compositional reasoning. The choice of image tokenization is critical here, leading to sub-categories:
  - Pixel-based Encoding: Good for low-level reconstruction but lacks semantic abstraction and increases sequence length.
  - Semantic Encoding: Provides strong text-image alignment but lacks pixel-level control, often requiring a separate diffusion decoder.
  - Learnable Query Encoding: Offers adaptive and compact semantic representations but introduces computational overhead and relies on fixed encoders.
  - Hybrid Encoding: Aims to combine pixel-level detail with semantic abstraction, but pseudo-hybrid methods may underutilize synergy, while joint-hybrid methods increase complexity and face alignment challenges.
- Fused AR + Diffusion Models: These models combine AR for text generation with diffusion for image generation. This hybrid strategy aims to balance symbolic control (from AR) and visual fidelity (from diffusion). Similar to AR models, their effectiveness depends on image tokenization, with both pixel-based and hybrid encoding variants. They offer a promising trade-off but increase inference costs due to multi-step sampling.
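As referenced in the AR item above, the serialization step can be pictured as concatenating text tokens and discrete image-tokenizer codes into one causal stream. The special token IDs below are invented purely for illustration and do not correspond to any specific model's vocabulary.

```python
# Illustrative only: serializing a text+image sample for next-token prediction.
BOI, EOI = 50001, 50002  # hypothetical begin/end-of-image markers

def serialize(text_tokens: list[int], image_tokens: list[int]) -> list[int]:
    """Interleave text tokens and discrete image codes into one causal sequence."""
    return text_tokens + [BOI] + image_tokens + [EOI]

seq = serialize([101, 7592, 2088], [321, 654, 987, 123])  # dummy IDs
inputs, targets = seq[:-1], seq[1:]  # standard shifted next-token objective
```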
2. Advances in Modality Integration: The paper illustrates significant advances in how modalities are integrated:
- Improved Tokenization: From simple pixel flattening to advanced VQGAN-like models, semantic encoders (CLIP, SigLIP), and learnable queries, researchers are constantly seeking more efficient and semantically rich ways to represent visual data for language models.
- Specialized Connectors: The evolution from simple projection layers to Q-Formers and adapter-based approaches has allowed more effective bridging between vision encoders and LLMs (see the connector sketch after this list).
- Scaling Laws: Models like Emu2 and MMaDA demonstrate that scaling up parameters and leveraging techniques like MoE (Mixture-of-Experts) in backbones can significantly enhance both understanding and generation capabilities.
- Any-to-Any Modality Support: Beyond text and image, models are now exploring unification across audio, video, and speech, often through modular designs or shared embedding spaces (ImageBind).
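The learnable-query connector idea can be sketched in a few lines of PyTorch: a fixed set of query vectors cross-attends to frozen vision features and is projected into the LLM embedding space. Dimensions and module layout here are illustrative assumptions, not a reproduction of the BLIP-2 Q-Former.

```python
import torch
import torch.nn as nn

class LearnableQueryConnector(nn.Module):
    """Minimal Q-Former-style bridge between a vision encoder and an LLM."""
    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, vision_feats):                 # (B, N_patches, vision_dim)
        kv = self.vision_proj(vision_feats)          # (B, N_patches, hidden_dim)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # (B, num_queries, hidden_dim)
        return self.to_llm(out)                      # compact visual tokens for the LLM

# feats = torch.randn(2, 257, 1024)                  # e.g., ViT patch features
# llm_tokens = LearnableQueryConnector()(feats)      # (2, 32, 4096)
```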
3. Resource Development:
The paper highlights a burgeoning ecosystem of datasets and benchmarks tailored for unified models. This includes massive web-scale paired data (LAION, COYO), fine-grained conversational data (ShareGPT4V), synthetic instruction tuning data (ALLaVA), and specialized datasets for image editing (InstructPix2Pix, MagicBrush), subject-driven generation (Subjects200K), and interleaved image-text generation (Multimodal C4, OBELICS). The increasing use of data synthesis pipelines is a critical development for addressing data scarcity in complex tasks.
4. Emerging Challenges: Despite rapid progress, the field faces significant hurdles that need addressing for true unification:
- Tokenization: Finding optimal strategies for representing images (discrete vs. continuous, efficient compression, semantic richness).
- Cross-Modal Attention: Managing computational overhead and ensuring effective interaction in high-dimensional, long-sequence contexts.
- Data Construction: The need for higher quality, less biased, and more diverse training data, including complex compositional and interleaved data.
- Evaluation: Developing comprehensive benchmarks that can holistically assess understanding, generation, and multi-turn interactions.
- Cognitive Integration: Incorporating Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL) for improved interpretability and performance.
- Fairness and Bias: Addressing ethical concerns related to potential amplification of biases in generated content.
- Personalization: Enabling personalized knowledge-driven generation within unified models.
- Task-Specific Limitations: Many unified models still struggle with advanced functionalities like spatially controlled generation, multi-subject generation, and complex image editing without extensive post-finetuning.
Overall, the survey demonstrates that while significant progress has been made, the field is still in its early stages, characterized by diverse architectural explorations and a clear path forward marked by persistent challenges and abundant opportunities.
6.2. Data Presentation (Tables)
The paper includes two tables, TABLE 1 and TABLE 2, which summarize various unified multimodal models. These tables are crucial for understanding the current landscape and are transcribed below.
The following are the results from TABLE 1 of the original paper, which categorizes unified multimodal models by their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion), and further subdivides them by encoding variations and encoder-decoder configurations.
| Model | Type | Backbone | Architecture | Gen. Enc. | Und. Enc. | Gen. Dec. | Mask | Date |
|---|---|---|---|---|---|---|---|---|
| Dual Diffusion [127] | a | D-DiT | Diffusion Model | SD-VAE | SD-VAE | Bidirect. | 2024-12 | |
| UniDisc [128] | a | DiT | MAGVIT-v2 | MAGVIT-v2 | Bidirect. | 2025-03 | ||
| MMaDA [129] | a | LLaDA | MAGVIT-v2 | MAGVIT-v2 | Bidirect. | 2025-05 | ||
| FUDOKI [130] | a | DeepSeek-LLM | SigLIP | VQGAN | VQGAN | Bidirect. | 2025-05 | |
| Muddit [131] | a | Meissonic (MM-DiT) | VQGAN | VQGAN | Bidirect. | 2025-05 | ||
| b-1 | Autoregressive Model | |||||||
| LWM [29] | b-1 | LLaMa-2 | VQGAN | VQGAN | Causal | 2024-02 | ||
| Chameleon [30] | b-1 | LLaMa-2 | VQ-IMG | VQ-IMG | Causal | 2024-05 | ||
| ANOLE [132] | b-1 | LLaMa-2 | VQ-IMG | Causal | 2024-07 | |||
| Emu3 [133] | b-1 | LLaMA-2 | SBER-MoVQGAN | SBER-MoVQGAN | Causal | 2024-09 | ||
| MMAR [134] | b-1 | Qwen2 | SD-VAE + EmbeddingViT | Diffusion MLP | Bidirect. | 2024-10 | ||
| Orthus [135] | b-1 | Chameleon | VQ-IMG+Vision embed. | Diffusion MLP | Causal | 2024-11 | ||
| SynerGen-VL [136] | b-1 | InternLM2 | SBER-MoVQGAN | SBER-MoVQGAN | Causal | 2024-12 | | |
| Liquid [137] | b-1 | GEMMA | VQGAN | VQGAN | Causal | 2024-12 | ||
| UGen [138] | b-1 | TinyLlama | SBER-MoVQGAN | SBER-MoVQGAN | Causal | 2025-03 | ||
| Harmon [139] | b-1 | Qwen2.5 | MAR | MAR | Bidirect. | 2025-03 | ||
| TokLIP [140] | b-1 | Qwen2.5 | VQGAN+SigLIP | VQGAN | Causal | 2025-05 | ||
| Selftok [141] | b-1 | LLaMA3.1 | SD3-VAE+MMDiT | SD3 | Causal | 2025-05 | ||
| Emu [142] | b-2 | LLaMA | EVA-CLIP | SD | Causal | 2023-07 | ||
| LaVIT [143] | b-2 | LLaMA | EVA-CLIP | SD-2.1 | Causal | 2023-09 | ||
| DreamLLM [34] | b-2 | LLaMA | OpenAI-CLIP | SDXL | Causal | 2023-09 | ||
| Emu2 [33] | b-2 | LLaMA | EVA-CLIP | OpenAI-CLIP | IP-Adapter | Causal | 2023-12 | |
| VL-GPT [35] | b-2 | LLaMA | Open-CLIP | SD-v2.1 | Causal | 2023-12 | ||
| MM-Interleaved [144] | b-2 | Vicuna | OpenAI-CLIP+ConvNext | SDXL | Causal | 2024-01 | ||
| Mini-Gemini [145] | b-2 | Gemma&Vicuna | SigLIP+RQ | RQ-VAE | Causal | 2024-03 | ||
| VILA-U [146] | b-2 | LLaMA-2 | OpenAI-CLIP | SDXL | Bidirect. | 2024-09 | ||
| PUMA [147] | b-2 | LLaMA-3 | SigLIP | SD-1.5 | Causal | 2024-10 | ||
| MetaMorph [148] | b-2 | Vicuna | UNIT | SDXL | Causal | 2024-12 | ||
| ILLUME [149] | b-2 | LLaMa-2 | ViTamin | ViTamin | Causal | 2024-12 | ||
| UniTok [150] | b-2 | LlaMa-3 | QLIP-ViT+BSQ | BSQ-AE | Causal | 2025-02 | ||
| QLIP [151] | b-2 | Qwen2.5 | SigLIP | RQVAE | Causal | 2025-02 | ||
| DualToken [152] | b-2 | Qwen2.5 | SigLIP+RQ | RQ-VAE | Causal | 2025-03 | ||
| UniFork [153] | b-2 | Qwen2.5 | SigLIP+RQ | FLUX.1-dev / SD-1.5 | Causal | 2025-05 | ||
| UniCode2 [154] | b-2 | Qwen2.5-VL | SigLIP2 | DiT | Bidirect. | 2025-06 | ||
| UniWorld [155] | b-2 | Qwen2.5-VL | SigLIP | EVA-CLIP | Diffusion | Causal | 2025-06 | |
| Pisces [156] | b-2 | LLaMA-3.1 | SigLIP2+VQ | VQGAN / SANA | Causal | 2025-06 | ||
| Tar [157] | b-2 | Qwen2.5 | SigLIP | OmniGen | Causal | 2025-06 | ||
| OmniGen2 [158] | b-2 | Qwen2.5-VL | AimV2 | MMDiT | Causal | 2025-06 | ||
| Ovis-U1 [159] | b-2 | Ovis | MMDiT | Causal | 2025-06 | |||
| X-Omni [160] | b-2 | Qwen2.5-VL | QwenViT | SigLIP | FLUX | Causal | 2025-07 | |
| Qwen-Image [161] | b-2 | Qwen2.5-VL | QwenViT | MMDiT | Causal | 2025-08 | ||
| Bifrost-1 [162] | b-2 | Qwen2.5-VL | QwenViT | ViT | FLUX | Causal | 2025-08 | |
| SEED [163] | b-3 | OPT | SEED Tokenizer | Learnable Query | SD | Causal | 2023-07 | |
| SEED-LLaMA [164] | b-3 | LLaMa-2 &Vicuna | SEED Tokenizer | Learnable Query | unCLIP-SD | Causal | 2023-10 | |
| SEED-X [165] | b-3 | LLaMa-2 | SEED Tokenizer | Learnable Query | SDXL | Causal | 2024-04 | |
| MetaQueries [166] | b-3 | LLaVA&Qwen2.5-VL | SigLIP | Learnable Query | Sana | Causal | 2025-04 | |
| Nexus-Gen [167] | b-3 | Qwen2.5-VL | QwenViT | Learnable Query | FLUX | Causal | 2025-04 | |
| Ming-Lite-Uni [168] | b-3 | M2-omni | NaViT | Sana | Causal | 2025-05 | ||
| BLIP3-o [169] | b-3 | Qwen2.5-VL | OpenAI-CLIP | Learnable Query | Lumina-Next | Causal | 2025-05 | |
| OpenUni [170] | b-3 | InternVL3 | InternViT | Learnable Query | Sana | Causal | 2025-05 |
The following are the results from `TABLE 2` of the original paper, which summarizes `Any-to-Any Multimodal Models` that support broader multimodal interactions.
| Model | Backbone | Modality Enc. | Modality Dec. | Mask | Date |
|---|---|---|---|---|---|
| Next-GPT [192] | Vicuna | ImageBind | AudioLDM+SD-1.5+Zeroscope-v2 | Causal | 2023-09 |
| Unified-IO 2 [193] | T5 | Audio Spectrogram Transformer+Vision ViT | Audio ViT-VQGAN + Vision VQGAN | Causal | 2023-12 |
| Video-LaVIT [194] | LLaVA-1.5 | LaVIT+Motion VQ-VAE | SVD img2vid-xt | Causal | 2024-02 |
| AnyGPT [195] | LLaMA-2 | Encodec+SEED Tokenizer+SpeechTokenizer | Encodec+SD+SoundStorm | Causal | 2024-02 |
| X-VILA [196] | Vicuna | ImageBind | AudioLDM+SD-1.5+Zeroscope-v2 | Causal | 2024-05 |
| MIO [197] | Yi-Base | SpeechTokenizer+SEED-Tokenizer | SpeechTokenizer+SEED Tokenizer | Causal | 2024-09 |
| Spider [198] | LLaMA-2 | ImageBind | AudioLDM+SD-1.5+Zeroscope-v2 +Grounding DINO+SAM | Causal | 2024-11 |
| OmniFlow [199] | MMDiT | HiFiGen+SD-VAE+Flan-T5 | HiFiGen+SD-VAE+TinyLlama | Bidirect. | 2024-12 |
| M2-omni [200] | LLaMA-3 | paraformer-zh+NaViT | CosyVoice-vocoder+SD-3 | Causal | 2025-02 |
6.3. Ablation Studies / Parameter Analysis
As a survey paper, this work does not conduct its own ablation studies or parameter analyses. Instead, it synthesizes findings and architectural details from the surveyed literature. The paper implicitly highlights the impact of various design choices (e.g., choice of image tokenizer, type of backbone, fusion strategy) by categorizing models accordingly and discussing their respective strengths and limitations, which are often derived from the ablation studies and analyses performed by the original authors of the models. For example, the discussions on the limitations of pixel-based encoding (lack of semantic abstraction, computational overhead) versus semantic encoding (lack of pixel-level control, mismatch with decoder) directly reflect insights gained from comparing different architectural components in the field.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a comprehensive overview of the rapidly developing field of unified multimodal understanding and generation models. It highlights the increasing convergence of two previously distinct domains: multimodal understanding (traditionally autoregressive-based) and image generation (primarily diffusion-based). The paper systematically categorizes existing unified models into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. Within the autoregressive and hybrid categories, it further delineates models based on their image tokenization strategies (pixel-based, semantic-based, learnable query-based, and hybrid encoding), illustrating the diverse ways visual information is processed and integrated.
The survey compiles relevant datasets and benchmarks across tasks like understanding, text-to-image generation, image editing, and interleaved image-text generation, offering a valuable resource for researchers. Finally, it identifies critical challenges such as tokenization complexity, cross-modal attention scalability, data scarcity and quality, and the need for comprehensive evaluation protocols. Despite these hurdles, the paper concludes that the field is in its infancy with abundant opportunities for innovation, including the application of Chain-of-Thought (CoT) and Reinforcement Learning (RL), addressing demographic and social biases, and enabling personalized knowledge-driven generation. The ultimate goal is to move towards truly universal, general-purpose generative models capable of seamless understanding and generation across a full spectrum of modalities.
7.2. Limitations & Future Work
The authors explicitly point out several key challenges that represent limitations in the current state of unified multimodal models and suggest future research directions:
- Tokenization Strategy: The high dimensionality of visual data leads to extremely long token sequences, causing significant memory and computation costs (a back-of-the-envelope example follows this list). Future work needs to focus on efficient tokenization and compression strategies that preserve representational fidelity while reducing overhead. The debate between discrete and continuous representations for image tokens remains active.
- Cross-Modal Attention: As image resolution and context length increase, cross-modal attention can become a performance bottleneck. Research into scalable alternatives such as sparse or hierarchical attention mechanisms is crucial.
- Data Construction: Pre-training datasets often contain noisy or biased image-text pairs, particularly for complex compositions and interleaved data. Developing reliable data filtering, debiasing, and synthesizing techniques is essential to ensure fairness and robustness.
- Model Evaluation: Current evaluation protocols are typically designed for single tasks in isolation. There is a growing need for comprehensive benchmarks that can assess both understanding and generation capabilities in an integrated manner, especially for sophisticated tasks like image editing and interleaved image-text generation.
- Cognitive Integration: Exploring the application of Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL) techniques is a promising avenue to improve both interpretability and performance. CoT can guide models through intermediate reasoning steps, beneficial for complex visual question answering, while RL can optimize for long-horizon objectives like factual consistency and user satisfaction.
- Fairness and Bias: Investigating and mitigating demographic and social biases of existing unified models is critical for responsible deployment. Unintentional amplification of stereotypes embedded in pre-training data can lead to harmful outputs, necessitating fairness-aware training pipelines.
- Personalized Knowledge-Driven Generation: Enabling personalized knowledge-driven generation within unified MLLMs is an emerging and important direction. Current approaches often separate understanding and generation for personalized concepts, limiting generalization to compositional prompts requiring implicit knowledge. Unifying this under a shared modeling framework would improve semantic grounding and contextual generalization.
- Advanced Generation Capabilities: Most current unified models primarily emphasize image understanding and text-to-image generation, with capabilities like image editing often achieved through post-finetuning. Advanced functionalities such as spatially controlled image generation, subject(s)-driven image generation, and interleaved image-text generation remain largely unexplored within a truly unified framework. This represents a significant opportunity for future research to develop more integrated and capable models.
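The back-of-the-envelope example referenced in the first item: assuming a VQGAN-style tokenizer with 16× spatial downsampling (an illustrative, commonly used factor, not a claim about any specific model), sequence length grows quadratically with image resolution.

```python
def image_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Discrete tokens a VQGAN-style tokenizer would emit for one image,
    assuming `downsample`x spatial reduction."""
    return (height // downsample) * (width // downsample)

for side in (256, 512, 1024):
    print(side, image_token_count(side, side))
# 256 -> 256 tokens, 512 -> 1024 tokens, 1024 -> 4096 tokens:
# quadratic growth, which in turn drives quadratic attention cost.
```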
7.3. Personal Insights & Critique
This survey provides a remarkably structured and timely overview of a field that is rapidly accelerating, largely driven by the impressive capabilities demonstrated by models like GPT-4o. The clear categorization of models based on their underlying architectures (diffusion, autoregressive, hybrid) and, crucially, their image tokenization strategies is highly valuable. For a beginner, this breakdown offers an essential map to navigate the complex landscape of various model names and their technical nuances. The inclusion of extensive lists of datasets and benchmarks further solidifies its utility as a reference.
One key insight is the inherent tension between the strengths of autoregressive models (compositional reasoning, text generation) and diffusion models (high-fidelity image synthesis). The various tokenization strategies and hybrid architectures are all attempts to reconcile this tension. The paper effectively shows that there's no single "best" approach yet, but rather a spectrum of trade-offs between semantic understanding, pixel-level control, computational efficiency, and architectural complexity. This suggests that future breakthroughs might come from novel ways to represent visual information that are both compact and semantically rich for language models, while also being easily convertible into high-fidelity pixels.
A critical reflection is on the data bottleneck. The sheer volume and quality required for training these increasingly sophisticated models are immense. The reliance on synthetic data generation pipelines is a testament to this, but it also introduces potential pitfalls such as mode collapse or bias amplification if the synthetic data generation process itself isn't diverse and unbiased enough. The paper correctly highlights data construction, fairness, and bias as major challenges, underscoring that without robust and ethical data, even the most advanced architectures will struggle.
The opportunity to integrate Chain-of-Thought and Reinforcement Learning seems particularly promising. CoT could bring much-needed transparency and reasoning capabilities to visual generation, allowing models to explain why they generated certain visual elements. RL could fine-tune models to align better with complex human preferences and abstract goals, moving beyond simple pixel-level fidelity to capture more nuanced aspects of "good" generation.
The explicit mention of current limitations in image editing and spatially controlled generation within unified frameworks indicates a clear gap. While LLMs excel at understanding complex instructions, translating these into precise visual manipulations is still largely a specialized task. Bridging this gap will require deeper integration of spatial and semantic understanding directly into the generative process, rather than relying on post-hoc finetuning or external modules.
Overall, the paper is an excellent and timely guide. Its systematic approach not only summarizes the current state but also provides a clear roadmap for future innovation in building truly intelligent, versatile, and ethical multimodal AI systems.