Paper status: completed

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

Published: 01/19/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

MM-Interleaved is an end-to-end interleaved image-text generative model featuring a multi-scale multi-image feature synchronizer, improving fine-grained visual detail access and enhancing multimodal instruction following and consistent generation.

Abstract

Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

1.2. Authors

Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai.

  • Affiliations: Shanghai AI Laboratory (OpenGVLab), MMLab CUHK, Tsinghua University, SenseTime Research, University of Toronto, Fudan University, Nanjing University, CAIR HKISI CAS.
  • Research Backgrounds: The authors come from prominent AI research institutions, suggesting expertise in areas such as computer vision, natural language processing, large language models, and multi-modal learning. The presence of researchers from MMLab, CUHK and Tsinghua University, along with Shanghai AI Laboratory, indicates a strong background in foundational AI research and large-scale model development.

1.3. Journal/Conference

Published at arXiv, a preprint server.

  • Reputation and Influence: arXiv is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it is not a peer-reviewed journal or conference, papers published on arXiv are often subsequently submitted to and published in top-tier conferences (e.g., NeurIPS, ICCV, ICLR, CVPR) or journals. Its reputation lies in its role as a primary means for researchers to share new work quickly and receive feedback from the community.

1.4. Publication Year

2024 (Specifically, published at UTC: 2024-01-18T18:50:16.000Z).

1.5. Abstract

The paper addresses the challenge of developing generative models for interleaved image-text data, which involves understanding and generating both images and text in sequences. A key limitation of existing models is their inability to efficiently capture fine-grained image details, especially in multi-image scenarios, due to a fixed and often small number of visual tokens. To overcome this, the authors propose MM-Interleaved, an end-to-end generative model. Its core innovation is the Multi-Modal Feature Synchronizer (MMFS) module, which allows the model to directly access fine-grained, multi-scale image features from previous contexts during both text and image generation. MM-Interleaved is pre-trained end-to-end on both paired and interleaved image-text corpora and further refined through supervised fine-tuning to follow complex multi-modal instructions. Experiments demonstrate its versatility in recognizing visual details and generating consistent images based on both textual and visual conditions, achieving state-of-the-art results on various benchmarks.

https://arxiv.org/abs/2401.10208

  • Publication Status: This is a preprint available on arXiv. It indicates the research has been shared publicly but may not yet have undergone formal peer review or official publication in a conference or journal.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The paper aims to solve the problem of effectively processing and generating interleaved image-text data. This data format, where images and text are interspersed (like in blogs or news articles), is ubiquitous on the internet. A major challenge for existing multi-modal generative models is their limited ability to capture fine-grained visual details, particularly when dealing with multiple images in a sequence. This is often due to the fixed, small number of visual tokens used to represent images within Large Language Models (LLMs).

  • Importance of the Problem:

    • Research Value: It expands the scope of multi-modal modeling beyond simple image-text pairs, enabling more unified processing of diverse multi-modal data. It bridges previously disparate fields like text-to-image generation and visual question-answering.
    • Practical Application Potential: Models capable of comprehending and generating such complex sequences could revolutionize content creation, interactive AI assistants, and multi-modal dialogue systems.
    • Specific Challenges/Gaps:
      1. Limited Context Length of LLMs: Multi-modal LLMs have constrained context windows (e.g., 2048-4096 tokens).
      2. Information Loss from Perceiver Resamplers: To fit images into LLM contexts, Perceiver Resamplers compress high-resolution images into a fixed, small number of visual tokens (e.g., 32 or 64). This compression leads to a loss of critical image details, especially problematic for tasks requiring fine-grained observation or in multi-image scenarios.
      3. Context Insensitivity: Standard Resamplers operate in a context-insensitive manner, extracting features from images without considering the broader multi-modal context, leading to a static representation that might not fulfill dynamic perceptual requirements.
  • Paper's Entry Point / Innovative Idea: The paper's innovative idea is to mitigate the information loss from fixed visual tokens by enabling dynamic feature extraction from original images within the intermediate layers of the LLM. Instead of relying solely on the initial compressed visual tokens, the model can dynamically extract relevant, fine-grained information from multi-scale image feature maps on demand, guided by the LLM's intermediate features. This is achieved through a novel Multi-Modal Feature Synchronizer (MMFS) module.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  • Multi-Modal Feature Synchronizer (MMFS): It proposes a novel Multi-Modal Feature Synchronizer (MMFS) module. This module efficiently extracts fine-grained visual details from multi-scale feature maps of multiple images. It operates dynamically, responding to the intermediate context features of the multi-modal LLMs, thereby reducing the reliance on a large, fixed number of visual tokens.
  • MM-Interleaved Model: It introduces MM-Interleaved, an end-to-end generative model specifically designed for interleaved image-text data. Built upon MMFS, this model can preserve fine-grained image information with a small number of visual tokens per image and is optimized end-to-end for both text and image generation.
  • State-of-the-Art Performance: The method achieves state-of-the-art results across a wide range of multi-modal comprehension and generation benchmarks. This is accomplished without using any in-house data, demonstrating its effectiveness in generating accurate text descriptions and visually consistent images from interleaved image-text inputs. It outperforms previous methods supporting interleaved generation and is competitive with or surpasses specialists in either text or image generation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Generative Models: In machine learning, generative models are a class of models that learn the distribution of training data and can then generate new data samples that resemble the training data. For example, a generative model trained on images of faces can generate new, realistic-looking faces. In the context of this paper, the model generates both text and images.

  • Interleaved Image-Text Data: This refers to sequences of content where images and text are mixed together, rather than being separate (like a single image with a caption). Examples include blog posts, news articles, or online tutorials, where text explains images, and images illustrate text, creating a coherent narrative or informational flow.

  • Multi-modal Large Language Models (LLMs):

    • Large Language Models (LLMs): These are advanced neural networks trained on vast amounts of text data to understand, generate, and process human language. They can perform tasks like translation, summarization, and question-answering. Examples include GPT-3, LLaMA, and Vicuna.
    • Multi-modal LLMs: These are LLMs extended to process and understand multiple types of data (modalities), such as text and images. They achieve this by converting non-textual data (like images) into a format that the LLM can understand, typically through visual tokens.
  • Visual Tokens: To integrate images into LLMs, images are first processed by a vision encoder (e.g., a Vision Transformer) which extracts features. These features are then often compressed into a sequence of discrete tokens, similar to how words are tokenized in text. These visual tokens represent the image's content in a compact form that can be fed into an LLM alongside text tokens.

  • Perceiver Resamplers: A Perceiver Resampler (introduced by Flamingo) is a module used to reduce the dimensionality of input features, particularly in the context of multi-modal LLMs. It takes a large number of input features (e.g., many image patches) and uses a cross-attention mechanism in which a much smaller, fixed set of learnable latent query vectors attends to them. This effectively compresses a high-dimensional input into a fixed-size representation (e.g., hundreds of image patches into 32 or 64 visual tokens), making it computationally feasible to process images within the LLM's limited context window.

  • Diffusion Models: These are a class of generative models that have achieved significant success in image generation. They work by gradually adding noise to an image until it becomes pure noise, and then learning to reverse this process, step-by-step, to generate a new image from noise. They are particularly known for generating high-quality and diverse images. Stable Diffusion is a popular example.

  • Attention Mechanism: A core component of Transformers, attention allows a model to weigh the importance of different parts of its input data when processing a specific element.

    • Self-attention: Used within a single modality (e.g., text or image tokens) to understand relationships between different parts of the input. For example, in a sentence, it helps the model decide which other words are most relevant when interpreting a particular word. The formula for Self-attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
      • $d_k$ is the dimension of the keys.
      • $QK^T$ measures the similarity between queries and keys.
      • $\mathrm{softmax}$ normalizes these similarities to get attention weights.
      • $V$ is weighted by these attention weights.
    • Cross-attention: Used to establish relationships between different modalities (e.g., text and image). For instance, in a multi-modal LLM, text tokens might cross-attend to visual tokens to gather visual information relevant to the current text being processed.
  • Deformable Attention: An enhancement over standard self-attention or cross-attention, particularly useful in computer vision. Instead of attending to all positions in a feature map, Deformable Attention learns a small set of sampling offsets to dynamically sample features only at a few relevant locations. This makes it more efficient and allows it to focus on fine-grained details without incurring high computational costs, especially with high-resolution inputs. The intuition is that not all pixels are equally important, and the model can learn where to look.
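To make the idea concrete, here is a minimal, single-image, single-scale PyTorch sketch of deformable-attention-style sampling (learned offsets around a reference point, followed by a weighted sum of the sampled features). The module, shapes, and names are illustrative assumptions, not the implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDeformableAttention(nn.Module):
    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 2)   # predicts (dx, dy) per sampling point
        self.weight_proj = nn.Linear(dim, num_points)       # predicts one attention weight per point
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, feat_map: torch.Tensor, ref_point: torch.Tensor):
        # query:     (B, C)            feature of the querying token
        # feat_map:  (B, C, H, W)      dense image features to sample from
        # ref_point: (B, 2) in [0, 1]  normalized (x, y) reference coordinate
        B, C, H, W = feat_map.shape
        offsets = self.offset_proj(query).view(B, self.num_points, 2)       # (B, K, 2)
        weights = self.weight_proj(query).softmax(dim=-1)                   # (B, K)
        # sampling locations in [0, 1], mapped to grid_sample's [-1, 1] range
        loc = (ref_point.unsqueeze(1) + offsets).clamp(0, 1) * 2 - 1        # (B, K, 2)
        sampled = F.grid_sample(feat_map, loc.view(B, self.num_points, 1, 2),
                                align_corners=False)                        # (B, C, K, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)                      # (B, K, C)
        out = (weights.unsqueeze(-1) * sampled).sum(dim=1)                  # (B, C)
        return self.out_proj(out)
```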

3.2. Previous Works

The paper categorizes related work into three main areas:

  • Modeling of Paired Image-Text Data:

    • Concept: Early multi-modal research primarily focused on image-text pairs, where a single image is associated with a single piece of text (e.g., a caption).
    • Examples:
      • CLIP [71], BLIP [48], EVA-CLIP [83]: Models trained with image-text contrastive learning, excelling at recognizing and understanding open-world semantics. These models learn to align visual and textual representations.
      • LLaVA [54], BLIP-2 [47]: LLM-based multi-modal models that connect LLMs with pre-trained visual encoders, often using simple linear projections. LLaVA demonstrated the effectiveness of LLMs for processing additional visual inputs and focused on visual instruction tuning.
      • Image captioning models (COCA [99]) and text-to-image generation models (DALL-E [73], Stable Diffusion [74]).
    • Relevance to Paper: These models provide the foundational components (visual encoders, LLMs) that MM-Interleaved builds upon. However, they typically focus on single image-text interactions, not interleaved sequences.
  • Modeling of Interleaved Image-Text Data:

    • Concept: More recent works have addressed the complexity of interleaved image-text data.
    • Early Works: Kosmos [33] and Flamingo [3] were pioneers in understanding such data, often relying on non-public datasets. Flamingo introduced the Perceiver Resampler to manage visual token length.
    • Public Datasets: Obelics [45] and MMC4 [111] were later released to promote public research in this field.
    • Generative Capabilities:
      • CM3Leon [101]: An early attempt at generating both images and text, using VQ-VAE [87] to convert images into discrete tokens for auto-regressive modeling. However, it was noted for weaknesses in image understanding.
      • DreamLLM [17]: Focused on single-stage end-to-end modeling, using raw image pixels as inputs.
      • Emu2 [82], SEED-LLaMA [25], VL-GPT [110]: Adopted additional training stages for image tokenizers/detokenizers.
    • Limitations of Prior Work: The paper highlights that these previous efforts (except for MM-Interleaved) are limited because they primarily feed image information only at the LLM inputs and suffer from the problem that a fixed number of visual tokens cannot efficiently preserve image details, especially in multi-image contexts. Many are also limited to generating only text.
  • Integrating Image Details into LLMs:

    • Concept: Addressing the challenge of preserving image details when integrating them into LLMs.
    • Perceiver Resamplers: Most works, like Flamingo [3], BLIP-2 [47], and Mini-GPT4 [109], use Perceiver Resamplers to map each image to a fixed, small number of visual tokens (e.g., 32 or 64). Flamingo injects these features in intermediate layers, while BLIP-2 and Mini-GPT4 insert them at the LLM's input.
    • Increasing Visual Tokens: To retain more details, some recent works (LLaVA [54], Qwen-VL [6], Shikra [10], Kosmos-2.5 [56]) increased the number of input visual tokens per image to hundreds. SPHINX [85] even pushed this to 2,890 tokens.
    • Limitations of Increasing Tokens: While this mitigates information loss, it significantly increases computational and memory demands for LLMs, making it particularly problematic and inefficient in multi-image scenarios where the total token count quickly explodes.

3.3. Technological Evolution

The evolution of multi-modal AI has progressed from:

  1. Text-only Models: Initial LLMs focusing solely on language.

  2. Paired Image-Text Understanding: Models like CLIP and BLIP learning alignments between single images and texts.

  3. Paired Image-Text Generation: Models like DALL-E and Stable Diffusion generating images from text, or captioning images.

  4. Multi-modal LLMs (for comprehension): Connecting LLMs with vision encoders (e.g., LLaVA, BLIP-2) to enable LLMs to understand visual input and generate text responses. These typically relied on Perceiver Resamplers for token reduction.

  5. Interleaved Multi-modal Comprehension: Models like Flamingo and Kosmos began to tackle sequences of images and text, but often focused on understanding rather than full generation, and still faced visual token limitations.

  6. Interleaved Multi-modal Generation (with limitations): Early attempts (CM3Leon, DreamLLM, VL-GPT) to generate both images and text in interleaved sequences, but still constrained by the fixed-token problem for images.

    MM-Interleaved fits into the latest stage, aiming to overcome the fixed visual token limitation that plagued previous interleaved multi-modal generation models.

3.4. Differentiation Analysis

The core difference and innovation of MM-Interleaved compared to previous multi-modal LLMs, especially those handling interleaved image-text data, lies in its approach to integrating fine-grained image details.

  • Previous Approaches:

    • Fixed Visual Tokens: Most prior multi-modal LLMs (e.g., Flamingo, BLIP-2, CM3Leon, DreamLLM) compress each image into a small, fixed number of visual tokens using Perceiver Resamplers before feeding them into the LLM. This is done to manage computational costs and LLM context length.
    • Increasing Visual Tokens: Later models (LLaVA, Qwen-VL, SPHINX) tried to mitigate detail loss by simply increasing the number of visual tokens per image.
    • Context Insensitivity: These Resamplers are typically context-insensitive; they extract a fixed set of features regardless of the LLM's current understanding or generation task.
  • MM-Interleaved's Core Innovation:

    • Dynamic Feature Extraction: Instead of relying solely on a fixed set of visual tokens at the input, MM-Interleaved introduces the Multi-Modal Feature Synchronizer (MMFS). This module allows the LLM to dynamically and on-demand access fine-grained, multi-scale image features directly from the original image context during the generation process.
    • Reduced Token Count, Retained Detail: This dynamic mechanism means MM-Interleaved can achieve competitive performance with significantly fewer input visual tokens (e.g., 64 vs. 576 or 2890 in other models) while still preserving critical image details. This makes it much more efficient and scalable for multi-image scenarios.
    • End-to-End Generative Modeling: It provides a unified framework capable of generating both images and text coherently within an interleaved sequence, addressing a broader range of tasks than many comprehension-focused multi-modal LLMs.
    • Fine-grained Control for Image Generation: By integrating MMFS into the image decoder as well, MM-Interleaved can leverage pixel-level spatial information from previous images for tasks requiring precise visual alignment (e.g., segmentation-to-image translation), a capability that models relying only on coarse Resampler features struggle with.

4. Methodology

4.1. Principles

The core idea behind MM-Interleaved is to overcome the inherent information bottleneck created by compressing images into a fixed, small number of visual tokens for multi-modal LLMs. This bottleneck leads to a loss of fine-grained image details, which is particularly detrimental in multi-image scenarios or tasks requiring precise visual understanding and generation.

The theoretical basis is that while a Perceiver Resampler can provide a coarse representation of an image, the LLM's intermediate context can provide clues about which specific fine-grained details are relevant for the current generation step. By allowing the LLM and image decoder to dynamically query and extract these details from the original high-resolution image features on demand using a specialized attention mechanism, the model can maintain efficiency (low visual token count) while achieving high fidelity and detail recognition. This dynamic extraction mechanism, embodied by the Multi-Modal Feature Synchronizer (MMFS), synchronizes the LLM's high-level contextual understanding with the low-level visual details.

4.2. Core Methodology In-depth

The MM-Interleaved model is an end-to-end generative model for interleaved image-text data. It builds upon and integrates three key components: a Visual Foundation Model (VFM), a Large Language Model (LLM), and a Diffusion Model (DM).

4.2.1. Task Formulation

The model aims to generate a sequence of interleaved image-text data $X=\{x_{1}, x_{2}, x_{3}, \ldots\}$, where each element $x_{n}$ can either be a text token ($x_{n}^{L}$) or an entire image ($x_{n}^{V}$). The training objective is to maximize the log-likelihood of observing this sequence.

Embedding Inputs: First, individual text tokens and images are mapped into embeddings that can be processed by the LLM:

  • For a text token $x_{n}^{L}$, its embedding is $e_{n}^{L}=\mathcal{E}_{L}\left(x_{n}^{L}\right)$, where $\mathcal{E}_{L}$ is a standard word embedding layer.
  • For an image $x_{n}^{V}$, its embedding is $e_{n}^{V}=\mathcal{E}_{V}\left(x_{n}^{V}\right)$, where $\mathcal{E}_{V}$ is an image encoder followed by a Perceiver Resampler to reduce the image to a fixed number of visual tokens.

Log-Likelihood Objective: The overall generative modeling objective is to maximize the joint log-likelihood of the sequence, which can be decomposed auto-regressively: $ \log p(X)=\sum_{n} \log p\left(x_{n} \mid e_{<n}\right)=\sum_{n \in \mathcal{I}_{L}} \log p\left(x_{n}^{L} \mid e_{<n}\right)+\sum_{n \in \mathcal{I}_{V}} \log p\left(x_{n}^{V} \mid e_{<n}\right) $ Where:

  • $\log p(X)$: The logarithm of the probability of the entire interleaved sequence $X$.

  • $\sum_{n} \log p\left(x_{n} \mid e_{<n}\right)$: The sum of log-probabilities of each element $x_n$ conditioned on all preceding elements $e_{<n}$ (i.e., $\{e_1, e_2, \ldots, e_{n-1}\}$). This represents an auto-regressive generation process.

  • $\mathcal{I}_{L}$: The set of indices corresponding to text tokens in the sequence.

  • $\mathcal{I}_{V}$: The set of indices corresponding to images in the sequence.

    This equation implies that the model learns to predict the next element in the sequence, whether it's a text token or an image, given all previously generated or observed multi-modal elements.

Text Generation with Multi-modal Condition: For text generation, the objective is to predict the next text token $x_{n}^{L}$ given the preceding multi-modal context $e_{<n}$. This is formulated as a next-token prediction task, similar to traditional causal language modeling in LLMs: $ L_{\mathrm{NTP}}\left(x_{n}^{L} \mid e_{<n}\right)=-\bar{x}_{n}^{L} \cdot \log \operatorname{softmax}\left(W \cdot \mathcal{D}_{\mathrm{LLM}}\left(e_{<n}\right)\right) $ Where:

  • $L_{\mathrm{NTP}}$: The loss function for Next-Text-Token Prediction.
  • $\bar{x}_{n}^{L}$: A one-hot vector representing the ground-truth (actual) next text token.
  • $W$: A linear classification weight matrix used to project the LLM's output feature to the vocabulary size.
  • $\mathcal{D}_{\mathrm{LLM}}\left(e_{<n}\right)$: The contextual features produced by the LLM (e.g., LLaMA, Vicuna) after processing all elements up to $n-1$.
  • $\operatorname{softmax}(\cdot)$: The softmax function, which converts the LLM's raw output scores into a probability distribution over the vocabulary.
  • The negative log-likelihood (cross-entropy) is used as the loss, penalizing incorrect predictions.
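A minimal PyTorch sketch of this next-text-token loss is given below; `llm_features`, `lm_head`, and the text mask are illustrative assumptions about how a batch might be organized, not the paper's code.

```python
import torch
import torch.nn.functional as F

def next_text_token_loss(llm_features, lm_head, target_ids, text_mask):
    # llm_features: (B, K, C) contextual features D_LLM(e_<n) for every position
    # lm_head:      nn.Linear(C, vocab_size), playing the role of the weight W
    # target_ids:   (B, K) ground-truth next tokens (inputs shifted left by one)
    # text_mask:    (B, K) bool, True where the target is a text token (not an image)
    logits = lm_head(llm_features)          # (B, K, vocab_size)
    return F.cross_entropy(
        logits[text_mask],                  # only positions whose target is text
        target_ids[text_mask],
    )
```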

Image Generation with Multi-modal Condition: For image generation, the objective is to predict the next image $x_{n}^{V}$ given the preceding multi-modal context $e_{<n}$. This is framed as a denoising-diffusion process, which is commonly used in modern image generation: $ L_{\mathrm{NIP}}\left(x_{n}^{V} \mid e_{<n}\right)=\mathbb{E}_{\epsilon, t}\left\|\epsilon-\mathcal{D}_{\mathrm{DM}}\left(x_{n, t}^{V}, t, \mathcal{D}_{\mathrm{LLM}}\left(e_{<n}\right)\right)\right\|^{2} $ Where:

  • $L_{\mathrm{NIP}}$: The loss function for Next-Image Prediction.

  • $\mathbb{E}_{\epsilon, t}$: Expectation over the noise $\epsilon$ and time step $t$.

  • $\epsilon$: The ground-truth noise that was added to the original image $x_n^V$.

  • $\mathcal{D}_{\mathrm{DM}}$: The Diffusion Model (e.g., Stable Diffusion), which acts as a denoising network. Its task is to predict the noise component of a noisy image.

  • $x_{n, t}^{V}$: The noisy version of the original image $x_{n}^{V}$ at a specific denoising step $t$.

  • $\mathcal{D}_{\mathrm{LLM}}\left(e_{<n}\right)$: The conditional features from the LLM that guide the Diffusion Model during image generation, ensuring consistency with the multi-modal context.

  • $\|\cdot\|^2$: The squared $L_2$ norm, indicating that the model aims to minimize the difference between the predicted noise and the actual noise.

    This formulation allows for flexibility in combining pre-trained models for different modalities and enables end-to-end optimization of the entire framework.
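The image branch can be sketched as a standard epsilon-prediction diffusion loss. The snippet below assumes a diffusers-style U-Net and noise scheduler operating on VAE latents; all names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def next_image_loss(unet, noise_scheduler, latents, llm_condition):
    # latents:       (B, 4, h, w)  latent encoding of the ground-truth image x_n^V
    # llm_condition: (B, 77, C)    conditional tokens derived from D_LLM(e_<n)
    noise = torch.randn_like(latents)                                   # epsilon
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)        # x_{n,t}^V
    pred = unet(noisy_latents, t, encoder_hidden_states=llm_condition).sample
    return F.mse_loss(pred, noise)                                      # ||eps - D_DM(...)||^2
```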

4.2.2. Architecture

The MM-Interleaved architecture integrates three main components: a Visual Foundation Model (VFM), a Large Language Model (LLM), and a Diffusion Model (DM). The red lines in Figure 4 visually represent the flow of multi-scale image features and how they are utilized, especially by the Multi-Modal Feature Synchronizer (MMFS).

The following figure (Figure 4 from the original paper) shows the architecture of MM-Interleaved:

Figure 4 (from the original paper): Architecture of MM-Interleaved. The red lines represent how the multi-scale image features are generated and utilized. MM-Interleaved incorporates an image encoder to both extract high-resolution multi-scale image features (red lines) and map each image into a fixed number of low-resolution visual tokens. These visual tokens are fed into the multi-modal LLM along with text tokens. The LLM then uses the proposed feature synchronizer (MMFS) to extract additional high-resolution image details and auto-regressively generate text tokens. After that, a diffusion-based image decoder generates the next image conditioned on the previous context features from the LLM, where MMFS is also utilized to capture accurate visual conditions.

1. VFM-based Image Tokenizer $\mathcal{E}_{V}$:

  • Purpose: This component processes input images to extract both low-resolution visual tokens for the LLM and high-resolution multi-scale image features for the MMFS module.
  • Components:
    • Pre-trained Vision Encoder: A model like CLIP-ViT-L/14 [71] is used for initial feature extraction from the raw image $x^{V} \in \mathbb{R}^{H \times W \times 3}$ (e.g., $H=W=224$).
    • Perceiver Resampler: This module takes the features from the vision encoder and maps each image to a fixed, small number of low-resolution visual tokens $e^{V} \in \mathbb{R}^{N \times C}$ (e.g., $N=32$ tokens, where $C$ is the channel dimension). These tokens are fed directly into the LLM's input sequence.
    • ViT-Adapter: A ViT-Adapter [12] is integrated to extract multi-scale image features $F^{V} \in \mathbb{R}^{\left(\sum_{i=1}^{L} H_{i} \times W_{i}\right) \times C}$. These features retain fine-grained spatial information from the original image at different resolutions (e.g., $L=3$ scales, with $H_i = H/2^{i+2}$, $W_i = W/2^{i+2}$). These multi-scale features are crucial for the MMFS module to dynamically access detailed visual information.
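A minimal sketch of the Perceiver-Resampler-style compression (a fixed set of learned latent queries cross-attends to many patch features and yields N visual tokens) is given below; layer counts and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)   # learned query vectors
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, P, dim) features from the vision encoder (P patches)
        B = patch_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)            # (B, N, dim)
        tokens, _ = self.cross_attn(q, patch_feats, patch_feats)   # latent queries attend to patches
        return tokens + self.ffn(tokens)                           # (B, N=32, dim) visual tokens
```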

2. LLM-based Multi-modal Model $\mathcal{D}_{\mathrm{LLM}}$:

  • Purpose: This is the central processing unit, responsible for understanding the interleaved multi-modal sequence and generating text. It also provides contextual conditions for image generation.
  • Components:
    • Pre-trained LLM: A model like Vicuna-13B [107] is used as the backbone.
    • Input Sequence: The LLM receives an input sequence $E \in \mathbb{R}^{K \times C}$, which is a concatenation of word embeddings ($e_{n}^{L} \in \mathbb{R}^{1 \times C}$) and the low-resolution visual tokens ($e_{n}^{V} \in \mathbb{R}^{N \times C}$) from the Image Tokenizer. These are arranged in their original interleaved order.
    • Special Tokens: A special token <BoI> ("Begin of Image") is introduced and placed in front of each image's visual tokens to explicitly signal the start of an image (see the sketch after this list).
    • Multi-Modal Feature Synchronizer (MMFS): This is the key innovation. MMFS modules are strategically inserted within the LLM's intermediate layers. They allow any query token (text or visual) to directly access and extract fine-grained multi-scale image features ($F^V$) from all previous images in the context on demand. This enables dynamic retrieval of image details that might have been lost in the initial low-resolution visual tokens.
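The sketch below illustrates the input-sequence construction described above (word embeddings and per-image visual tokens concatenated in interleaved order, with a learned <BoI> embedding placed before each image); the data layout is an assumption chosen for illustration.

```python
import torch

def build_interleaved_input(segments, boi_embed):
    # segments:  list of (emb, is_image) pairs, where emb is either (n_i, C) word
    #            embeddings for a text span or the (N=32, C) visual tokens of one image
    # boi_embed: (1, C) learned embedding of the <BoI> token
    parts = []
    for emb, is_image in segments:
        if is_image:
            parts.append(boi_embed)      # signal "an image starts here"
        parts.append(emb)
    return torch.cat(parts, dim=0)       # (K, C) input sequence E for the LLM
```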

3. DM-based Image Decoder $\mathcal{D}_{\mathrm{DM}}$:

  • Purpose: This component generates images based on the rich multi-modal context provided by the LLM.
  • Components:
    • Pre-trained Diffusion Model: A model like Stable Diffusion v2.1 [74] is used for image generation.
    • Conditional Input Preparation: The output features from the multi-modal LLM ($\mathcal{D}_{\mathrm{LLM}}$) are first processed by another Resampler to map them to a fixed number of conditional tokens (e.g., 77 tokens), matching the expected input length for the diffusion model's conditioning mechanism.
    • Multi-Modal Feature Synchronizer (MMFS): Crucially, MMFS modules are also integrated into the DM (specifically, after each downsampling block in its U-Net architecture). This allows the image decoder to access and leverage fine-grained visual details from previous images (e.g., the original images in the input context or previously generated images) directly. This is particularly useful for tasks requiring precise visual alignment, like image translation.

4.2.3. Multi-Modal Feature Synchronizer (MMFS)

The Multi-Modal Feature Synchronizer (MMFS) is designed to enable dynamic and efficient extraction of image details, compensating for the information loss inherent in using a limited number of input visual tokens. It leverages Deformable Attention [112] to achieve sparse and efficient attention over multi-scale image feature maps and across multiple images.

The following figure (Figure 5 from the original paper) shows the architecture of the MMFS module:

Figure 5 (from the original paper): Architecture of the MMFS module. The query feature is passed through a linear projection and added with the image index embedding. Two linear projections are used to predict the sampling offsets and unnormalized attention weights of each image, respectively. The sampling offsets are added with the query's reference point to form the corresponding sampling locations, which are shared across all feature maps of the same image. The output is the weighted sum of the sampled spatial features.

Given a query token (from the LLM or DM) that requires detailed image features, the MMFS module operates as follows:

The output feature $f_{o} \in \mathbb{R}^{C}$ is computed by: $ \begin{aligned} q^{(m)} & = W_{q} \cdot f_{q}+\operatorname{PosEmbed}(m) \\ p_{q}^{(m)} & = W_{p} \cdot q^{(m)}+\hat{p}_{q}, \quad A_{q}^{(m)}=W_{A} \cdot q^{(m)} \\ p_{q} & = \operatorname{Concat}\left(p_{q}^{(1)}, \cdots, p_{q}^{(M)}\right) \\ A_{q} & = \operatorname{softmax}\left(\operatorname{Concat}\left(A_{q}^{(1)}, \cdots, A_{q}^{(M)}\right)\right) \\ f_{o} & = \operatorname{DeformAttn}\left(\left\{F_{m}^{V}\right\}_{m=1}^{M}, A_{q}, p_{q}\right) \end{aligned} $ Where:

  • $f_{q} \in \mathbb{R}^{C}$: This is the feature of the query token from the LLM or DM that needs to access detailed image information. $C$ is the channel dimension.

  • $\operatorname{PosEmbed}(m) \in \mathbb{R}^{\tilde{M} \times C}$: A learnable positional embedding that encodes the index $m$ of the reference image. This helps the MMFS distinguish between multiple images, where $\tilde{M}$ is the maximum number of reference images.

  • $W_{q}, W_{p}, W_{A}$: These are learnable linear projection weights (matrices) that transform the query feature.

  • $q^{(m)}$: The query feature $f_q$ transformed by $W_q$ and enhanced with the positional embedding for image $m$. This query is specific to how $f_q$ interacts with image $m$.

  • $\hat{p}_{q} \in[0,1]^{2}$: The relative image coordinate of a reference point. This point guides where the Deformable Attention should sample. By default, if no spatial prior is available (e.g., for general text tokens), it's set to $(0.5, 0.5)$ (the center of the image). If a spatial prior exists (e.g., for pixels in the image decoder), it uses that coordinate.

  • $p_{q}^{(m)}$: The predicted sampling offsets for image $m$. It's calculated by transforming $q^{(m)}$ with $W_p$ and adding the reference point $\hat{p}_q$. This determines where to sample features from image $m$.

  • $A_{q}^{(m)}$: The unnormalized attention weights for image $m$, derived by transforming $q^{(m)}$ with $W_A$. This determines how much attention to give to image $m$.

  • $\operatorname{Concat}(\cdot)$: Concatenates the values across different images.

  • $p_{q} \in \mathbb{R}^{\tilde{M} \times L \times K \times 2}$: The concatenated sampling point coordinates for all $M$ reference images, across $L$ multi-scale feature levels, each with $K$ sampling points.

  • $A_{q} \in \mathbb{R}^{\tilde{M} \times L \times K}$: The concatenated attention weights for all reference images and sampling points. The softmax function normalizes these weights across all $M$ images, ensuring that the attention dynamically focuses on the most relevant images.

  • $F^{V}$: This represents the multi-scale image feature maps extracted by the VFM-based Image Tokenizer. These are the high-resolution features from which MMFS samples.

  • $\operatorname{DeformAttn}(\cdot)$: The Deformable Attention operator. It samples features from the multi-scale image feature maps $F^V$ at the specified coordinates $p_q$ and performs a weighted summation according to the attention weights $A_q$.

    In essence: A query token decides (1) which images to look at (via softmax over $A_q^{(m)}$), (2) where within those images to look (via $p_q^{(m)}$), and (3) how much to weigh the sampled features. This dynamic, sparse attention allows for efficient and precise detail extraction.
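The following simplified PyTorch sketch mirrors the computation above for a single feature scale per image: per-image query modulation with an image-index embedding, predicted sampling offsets and attention weights, softmax normalization across all images and points, and a weighted sum of bilinearly sampled features. It is illustrative only, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMMFS(nn.Module):
    def __init__(self, dim: int, max_images: int = 6, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.img_pos = nn.Embedding(max_images, dim)       # PosEmbed(m)
        self.w_q = nn.Linear(dim, dim)                      # W_q
        self.w_p = nn.Linear(dim, num_points * 2)           # W_p -> sampling offsets
        self.w_a = nn.Linear(dim, num_points)               # W_A -> unnormalized weights

    def forward(self, f_q, feat_maps, ref_point):
        # f_q:       (B, C) query token feature from the LLM or image decoder
        # feat_maps: list of M tensors, each (B, C, H, W), one per previous image
        # ref_point: (B, 2) in [0, 1]; (0.5, 0.5) for LLM tokens, pixel coords in the DM
        B, C = f_q.shape
        weights, samples = [], []
        for m, fmap in enumerate(feat_maps):
            q_m = self.w_q(f_q) + self.img_pos.weight[m]                         # q^(m)
            p_m = self.w_p(q_m).view(B, self.num_points, 2) + ref_point[:, None]  # p_q^(m)
            a_m = self.w_a(q_m)                                                   # A_q^(m)
            grid = (p_m.clamp(0, 1) * 2 - 1).view(B, self.num_points, 1, 2)
            s_m = F.grid_sample(fmap, grid, align_corners=False)                  # (B, C, K, 1)
            samples.append(s_m.squeeze(-1).permute(0, 2, 1))                      # (B, K, C)
            weights.append(a_m)
        A_q = torch.cat(weights, dim=1).softmax(dim=-1)     # (B, M*K), normalized over all images
        S = torch.cat(samples, dim=1)                       # (B, M*K, C)
        return (A_q.unsqueeze(-1) * S).sum(dim=1)           # f_o: (B, C)
```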

4.2.4. Multi-modal LLM with MMFS

  • Integration: MMFS modules are inserted into the multi-modal LLM every fixed number of blocks (e.g., every 4 blocks).
  • Query Token: The query token $f_q$ for MMFS is each token in the LLM's intermediate layers.
  • Causality: Crucially, the MMFS module only attends to previous images in the context, maintaining the causal relationship required for auto-regressive generation.
  • Reference Point: For the LLM's tokens, there is no inherent spatial location, so $\hat{p}_{q}$ is typically set to $(0.5, 0.5)$, meaning the Deformable Attention starts its search from the center of the image and learns offsets from there.
  • Output Integration: The output of MMFS ($f_o$) is scaled by $\tanh(\alpha)$ (where $\alpha$ is a zero-initialized learnable scalar) before being added back to the original query token $f_q$. This allows the model to control the contribution of the dynamically extracted visual features.
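A minimal sketch of this zero-initialized, tanh-gated residual integration (an assumption-level illustration, not the paper's code): since alpha starts at 0, MMFS initially contributes nothing and its influence is learned during training.

```python
import torch
import torch.nn as nn

class GatedMMFSBlock(nn.Module):
    def __init__(self, mmfs: nn.Module):
        super().__init__()
        self.mmfs = mmfs
        self.alpha = nn.Parameter(torch.zeros(1))    # zero-initialized learnable scalar

    def forward(self, f_q, feat_maps, ref_point):
        f_o = self.mmfs(f_q, feat_maps, ref_point)
        return f_q + torch.tanh(self.alpha) * f_o    # gated residual update of the query token
```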

4.2.5. Image Decoder with MMFS

  • Conditional Input: The output features from the multi-modal LLM are first compressed by another Resampler to match the expected condition token length of the pre-trained diffusion model (e.g., 77 tokens for Stable Diffusion).
  • Integration: MMFS modules are inserted after each downsampling block within the U-Net architecture of the diffusion model (the DM's core component).
  • Query Token: Here, the query token $f_q$ iterates over each pixel (or feature location) in the feature maps within the U-Net.
  • Reference Point: For the image decoder, there is a clear spatial prior. Thus, $\hat{p}_{q}$ is set as the spatial coordinate of the query pixel itself. This means the Deformable Attention can sample relevant features directly around the pixel being processed in the image generation.
  • Output Integration: The output of MMFS ($f_o$) is processed by a zero-initialized convolution before being added back to the query feature $f_q$.

4.2.6. Training Target and Inference Pipeline

  • Unified Training Objective: The overall training objective is a combination of the Next-Text-Token Prediction loss ($L_{\mathrm{NTP}}$) and the Next-Image Prediction loss ($L_{\mathrm{NIP}}$): $ \mathcal{L}=\mathcal{L}_{\mathrm{NTP}}+\lambda \mathcal{L}_{\mathrm{NIP}} $ Where:

    • $\mathcal{L}$: The total loss to be minimized.
    • $\mathcal{L}_{\mathrm{NTP}}$: The loss for predicting the next text token (Eq. 2).
    • $\mathcal{L}_{\mathrm{NIP}}$: The loss for predicting the next image (Eq. 3).
    • $\lambda$: A hyperparameter that controls the relative weighting between the image and text decoding branches. The entire framework is optimized end-to-end.
  • Inference Pipeline:

    • Auto-regressive Generation: During inference, images and text are generated sequentially (auto-regressively).
    • Text Generation: Text tokens are sampled from the probability distribution predicted by the multi-modal LLM (using the $W \cdot \mathcal{D}_{\mathrm{LLM}}$ output).
    • Image Generation Trigger: When the LLM generates the special <BoI> token, it signals that an image needs to be generated next. At this point, the diffusion model ($\mathcal{D}_{\mathrm{DM}}$) is activated.
    • Conditional Image Generation: The diffusion model generates the next image, conditioned on the previous context features from the LLM. The integrated MMFS in the image decoder ensures that this generation process can access and utilize fine-grained visual details from any preceding images, leading to more consistent and accurate visual outputs.
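The overall inference loop can be summarized in hedged pseudocode. Every helper name below (generate_next_token, run_diffusion, BOI_ID, and so on) is hypothetical and only meant to illustrate the control flow described above.

```python
def generate_interleaved(llm, image_decoder, context, max_steps=512):
    outputs = []
    for _ in range(max_steps):
        token = llm.generate_next_token(context)          # sample from softmax(W . D_LLM(e_<n))
        if token == llm.BOI_ID:                           # "Begin of Image" -> switch branches
            image = image_decoder.run_diffusion(
                condition=llm.context_features(context),  # resampled LLM features (77 tokens)
                previous_images=context.images,           # MMFS samples fine details from these
            )
            context = context.append_image(image)
            outputs.append(image)
        elif token == llm.EOS_ID:                         # end of sequence
            break
        else:
            context = context.append_text(token)
            outputs.append(token)
    return outputs
```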

4.3. Comparison with other generative models

The paper presents Figure 2 to illustrate different types of image-text generative modeling, highlighting the limitations of previous approaches compared to MM-Interleaved.

The following figure (Figure 2 from the original paper) illustrates different types of image-text generative modeling:

Figure 2 (from the original paper): Different types of image-text generative modeling. (a) and (b) can only generate text. (c) and (d) can generate both images and text. All types except (d) are limited by the fixed number of visual tokens extracted by a context-insensitive Resampler, which will lose image details and is problematic in multi-image scenarios.

Let's break down each type:

  • **(a) BLIP-like (Text generation only):**

    • Description: Models in this category (like BLIP [48], BLIP-2 [47], LLaVA [54]) primarily focus on understanding multi-modal inputs (image and text) to generate text responses.
    • Visual Input: They use a Resampler to compress an image into a fixed number of visual tokens which are then fed into the multi-modal LLM.
    • Output: Only text is generated.
    • Limitation: Relies on a fixed number of visual tokens, potentially losing image details, and cannot generate images.
  • **(b) Flamingo-like (Text generation only):**

    • Description: Similar to BLIP-like models in purpose (text generation), but with a different architecture for integrating visual information. Flamingo [3] injects Resampler-extracted visual features into intermediate layers of the LLM via gated cross-attention.
    • Visual Input: Also uses a Resampler for a fixed number of visual tokens.
    • Output: Only text is generated.
    • Limitation: Shares the fixed number of visual tokens problem and cannot generate images.
  • **(c) GILL/Emu-like (Image and Text generation):**

    • Description: These models (GILL [43], Emu [84]) represent attempts to generate both images and text. They typically follow a two-stage process or use methods like VQ-VAE [87] to convert images to discrete tokens for auto-regressive generation.
    • Visual Input: Still relies on Resampler or similar mechanisms to produce a fixed, limited number of visual tokens.
    • Output: Can generate both images and text.
    • Limitation: Despite generating images, they are still fundamentally limited by the fixed number of visual tokens paradigm, which means critical image details can be lost, especially in multi-image scenarios.
  • **(d) MM-Interleaved (Image and Text generation with MMFS):**

    • Description: This is the proposed MM-Interleaved model. It represents an advancement over previous generative models by addressing the core limitation of fixed visual tokens.
    • Visual Input: It initially uses a Resampler for a small, fixed number of visual tokens fed to the LLM. However, its key innovation is the Multi-Modal Feature Synchronizer (MMFS).
    • MMFS Role: The MMFS allows the LLM and image decoder to directly access multi-scale high-resolution image features from previous contexts on demand. This means the model isn't restricted to the coarse information in the initial visual tokens.
    • Output: Generates both images and text.
    • Advantage: By using MMFS, it overcomes the fixed number of visual tokens problem, efficiently captures image details, and performs well in multi-image scenarios, making it more robust and versatile.

5. Experimental Setup

5.1. Datasets

MM-Interleaved is pre-trained and fine-tuned on a diverse set of public datasets.

5.1.1. Pre-training Datasets

The model is pre-trained on a mixture of image-text pairs and interleaved image-text sequences. No in-house data is used.

  • MMC4 [111]: A large-scale corpus of interleaved image-text documents. This dataset is crucial for training the model on the interleaved format it aims to handle.

    • Characteristics: Contains web-scraped documents with images interspersed with text, mimicking real-world online content.
    • Filtering: To ensure quality, images with CLIP similarity scores below 0.24 are discarded, and a maximum of 6 images are kept per document. Documents without images or with only one image are also partially filtered.
    • Data Example: A typical sample might be a blog post with several paragraphs of text, followed by an image, then more text, another image, and so on.
  • LAION-2B [77]: A massive dataset of image-text pairs.

    • Characteristics: English subset of LAION-5B, containing billions of image-text pairs collected from the web. It's known for its diversity and scale.
    • Filtering: Further filtered based on aesthetic scores.
    • Data Example: An image of a cat with the caption "A fluffy cat sleeping on a sunny windowsill."
  • LAION-COCO [78]: A large subset of LAION-2B.

    • Characteristics: Contains 600 million image-text pairs. The captions for these images were generated by pre-trained BLIP [48] models, rather than human annotations.
    • Filtering: Text prompts shorter than 10 tokens are filtered out.
    • Data Example: An image of a city street at night with a generated caption like "A city street with cars and buildings at night."
  • CC-12M [8]: Conceptual Captions 12 Million.

    • Characteristics: A large dataset of image-text pairs where captions are conceptual descriptions rather than simple object labels.
    • Captioning Strategy: Instead of using original annotations, the paper uses a pre-trained BLIP-2 model [47] to caption the images, ensuring consistency in caption quality with LAION-COCO.
    • Data Example: An image of a person riding a bicycle with a generated caption such as "A person on a bicycle on a paved road."
  • Objects365 [79]: A large-scale dataset primarily for object detection.

    • Characteristics: Contains millions of images with bounding box annotations for 365 object categories.
    • Captioning Strategy: Similar to CC-12M, the BLIP-2 model [47] is used to generate captions for its images, transforming it into an image-text pair dataset for pre-training.
    • Data Example: An image featuring multiple objects (e.g., a chair, a table, a laptop) with a generated caption describing the scene.

Data Concatenation Strategy: For image-text-pair datasets (LAION-2B, LAION-COCO, CC-12M, Objects365), multiple pairs are randomly sampled and concatenated to reach the maximum context length (2048 tokens). For interleaved datasets (MMC4), documents are split and concatenated. This strategy maximizes data efficiency and LLM context window utilization.

5.1.2. Supervised Fine-tuning (SFT) Datasets

  • VQA and Image Captioning:

    • Instruction Format: Based on the image, please answer the question. {image}{question}. The answer is: {answer}
    • Datasets:
      • LLaVAMix-665K [53]: A mixed dataset for visual instruction tuning.
      • COCO Caption [11]: Standard dataset for image captioning.
      • VQAv2 [27]: Visual Question Answering.
      • ChartQA [61]: QA on charts.
      • DocVQA [13]: QA on documents.
      • EST-VQA [93]: Bilingual (English-Chinese) scene-text QA.
      • InfoVQA [62]: QA on infographics.
      • STVQA [93]: Scene text VQA.
      • TextCaps [80]: Image captioning with reading comprehension.
      • LLaVAR [104]: Enhanced visual instruction tuning for text-rich images.
      • OCRVQA [63]: QA by reading text in images.
      • DVQA [41]: QA on data visualizations.
    • Example (VQAv2): Image of a yellow school bus. Question: "What color is the bus?". Answer: "yellow".
  • Referring Expression Comprehension (REC):

    • Instruction Format: Question: Provide the bounding box coordinate of the region this sentence describes: {object}, Answer: (x1,y1)(x2,y2)
    • Datasets:
      • RefCOCO [42]
      • RefCOCO+ [59]
      • RefCOCOg [59]
    • Characteristics: These datasets contain images with natural language descriptions (referring expressions) that uniquely identify an object or region within the image.
    • Example (RefCOCO): Image of a kitchen scene. Referring expression: "the red mug on the counter". The model should output coordinates (x1,y1)(x2,y2) corresponding to the mug.
  • Segmentation-to-Image Translation:

    • Instruction Format: {segmentation image}{caption}{ground truth image}
    • Dataset: ADE20K [108]
    • Characteristics: A scene parsing dataset providing detailed semantic segmentation masks for images.
    • Captioning Strategy: BLIP [48] is used to generate text captions for each image.
    • Example: Input: a semantic segmentation map of a forest. Caption: "A dense forest with tall trees and green foliage." Output: a realistic image rendering of that forest.
  • Visual Storytelling:

    • Datasets:
      • VIST [34]: Visual Storytelling dataset with sequences of 5 images and 5 corresponding captions.
      • PororoSV [51] and FlintstonesSV [57]: Cartoon storytelling datasets with sequences of images and captions.
    • Characteristics: These datasets contain sequences of images and text that tell a coherent story, ideal for evaluating interleaved generation.
    • Example (VIST): A sequence of images depicting a cat playing, then sleeping, with accompanying descriptive captions for each frame.

5.2. Evaluation Metrics

The paper employs a comprehensive suite of evaluation metrics for various tasks, ensuring a thorough assessment of MM-Interleaved's capabilities.

The following table (Table 14 from the original paper) summarizes the evaluation benchmarks:

| Dataset | Task | Split | Metric |
| --- | --- | --- | --- |
| COCO [11] | Scene description | test | CIDEr (\uparrow) [88] |
| Flickr30k [69] | Scene description | test | CIDEr (\uparrow) [88] |
| NoCaps [2] | Scene description | test | CIDEr (\uparrow) [88] |
| Image2Paragraph [44] | Scene description | test | CIDEr (\uparrow) [88] |
| VQAv2 [27] | Scene understanding QA | test-dev | VQA Acc (\uparrow) [4] |
| OKVQA [60] | External knowledge QA | val | VQA Acc (\uparrow) [4] |
| GQA [35] | Scene understanding QA | test-dev | VQA Acc (\uparrow) [4] |
| VizWiz [29] | Scene understanding QA | test-dev | VQA Acc (\uparrow) [4] |
| TextVQA [81] | Text reading QA | val | VQA Acc (\uparrow) [4] |
| VisDial [15] | Image dialogue | val | NDCG (\uparrow) |
| RefCOCO [42] | Referring expression comprehension | - | IoU Acc (\uparrow) |
| RefCOCO+ [59] | Referring expression comprehension | - | IoU Acc (\uparrow) |
| RefCOCOg [29] | Referring expression comprehension | - | IoU Acc (\uparrow) |
| MS-COCO [52] | Text-to-image generation | val-30K | FID (\downarrow) [31] |
| LN-COCO [70] | Text-to-image generation | val | FID (\downarrow) [31] |
| ADE20k [108] | Segmentation-to-image generation | val | mIoU (\uparrow) |
| VIST [34] | Interleaved-context image generation | val | CLIP-Sim (\uparrow) [71], FID (\downarrow) [31] |
| PororoSV [51] | Interleaved-context multi-image generation | test | FID (\downarrow) [31] |
| FlintstonesSV [28] | Interleaved-context multi-image generation | test | FID (\downarrow) [31] |

1. CIDEr (Consensus-based Image Description Evaluation) [88]

  • Conceptual Definition: CIDEr measures the similarity between a generated image caption and a set of human-written reference captions. It focuses on how well the generated caption captures the salient objects and their attributes, as well as the overall semantic meaning, rewarding consensus with human descriptions. It uses tf-idf (term frequency-inverse document frequency) weighting for n-grams to give more importance to less frequent but more discriminative words.
  • Mathematical Formula: $ \text{CIDEr}_n(c, S) = \frac{1}{|S|} \sum_{s \in S} \frac{g^n(c) \cdot g^n(s)}{\left\|g^n(c)\right\| \left\|g^n(s)\right\|} $ The final CIDEr score is a weighted sum of CIDEr scores for different n-gram lengths: $ \text{CIDEr}(c, S) = \sum_{n=1}^N w_n \cdot \text{CIDEr}_n(c, S) $
  • Symbol Explanation:
    • $c$: The candidate (generated) caption.

    • $S$: A set of human-written reference captions for the same image.

    • $s$: A single reference caption from the set $S$.

    • $n$: The n-gram length (e.g., $n=1$ for unigrams, $n=2$ for bigrams).

    • $g^n(\cdot)$: The vector of tf-idf weights over all n-grams of length $n$ in a caption; the term-frequency part is based on n-gram counts in the caption, and the $\text{idf}(w_i)$ part downweights common n-grams and upweights rare, informative ones.

    • $N$: Maximum n-gram length (typically 4).

    • $w_n$: Weight for the CIDEr score of n-gram length $n$ (typically uniform, $w_n = 1/N$).

    • \uparrow: Indicates that a higher score is better.

2. FID (Fréchet Inception Distance) [31]

  • Conceptual Definition: FID measures the similarity between the distribution of generated images and real images. It computes the Fréchet distance between two Gaussian distributions fitted to the feature representations of real and generated images (typically extracted from an Inception-v3 network). A lower FID score indicates higher quality and diversity of generated images, implying that they are more visually similar to real images.
  • Mathematical Formula: $ \text{FID} = ||\mu_1 - \mu_2||^2 + \text{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}) $
  • Symbol Explanation:
    • $\mu_1$: The mean of the feature vectors for real images.

    • $\mu_2$: The mean of the feature vectors for generated images.

    • $\Sigma_1$: The covariance matrix of the feature vectors for real images.

    • $\Sigma_2$: The covariance matrix of the feature vectors for generated images.

    • $\|\cdot\|^2$: Squared Euclidean distance.

    • $\text{Tr}(\cdot)$: The trace of a matrix.

    • \downarrow: Indicates that a lower score is better.
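Given precomputed Inception feature statistics, the FID formula above can be transcribed directly; this is a minimal sketch, not a full evaluation pipeline.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    # mu1, mu2:       (D,)   feature means of real / generated images
    # sigma1, sigma2: (D, D) feature covariance matrices
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)   # matrix square root of Sigma1 Sigma2
    covmean = covmean.real                                   # drop tiny imaginary parts
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```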

3. VQA Acc (Visual Question Answering Accuracy) [4]

  • Conceptual Definition: VQA Acc measures the accuracy of a model's answers to questions about images. For each question, it compares the model's answer to several human-annotated ground-truth answers. If the model's answer matches at least 3 out of 10 human answers, it receives full credit; otherwise, partial credit is awarded based on the number of matches.
  • Mathematical Formula: $ \text{Accuracy}(a) = \min\left(\frac{\text{count}(a)}{3}, 1\right) $ The overall accuracy is the average of these scores across all questions.
  • Symbol Explanation:
    • $a$: The answer predicted by the model.
    • $\text{count}(a)$: The number of human annotators (out of typically 10 per question) who provided the same answer $a$.
    • \uparrow: Indicates that a higher score is better.
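A simplified transcription of this rule (the official metric additionally averages over subsets of annotators and applies answer normalization, which is omitted here):

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    # Full credit if at least 3 of the human annotators gave the same answer.
    matches = sum(a.strip().lower() == predicted.strip().lower() for a in human_answers)
    return min(matches / 3.0, 1.0)

# e.g. vqa_accuracy("yellow", ["yellow"] * 4 + ["gold"] * 6) == 1.0
```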

4. NDCG (Normalized Discounted Cumulative Gain)

  • Conceptual Definition: NDCG is a measure of ranking quality, commonly used in information retrieval and dialogue systems. It evaluates the usefulness, or gain, of items based on their position in the result list. The gain is accumulated from the top to the bottom of the result list, with a penalty for items that appear lower in the list (discounted). It's "normalized" to range from 0 to 1, where 1 represents a perfect ranking. In Visual Dialog, it assesses how well the model ranks the correct answer among a set of candidates given the dialogue history.
  • Mathematical Formula: $ \text{DCG}_p = \sum_{i=1}^p \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)} $ $ \text{IDCG}_p = \sum_{i=1}^p \frac{2^{\text{ideal\_rel}_i} - 1}{\log_2(i+1)} $ $ \text{NDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p} $
  • Symbol Explanation:
    • $p$: The position in the ranked list.
    • $\text{rel}_i$: The relevance score of the item at position $i$ in the predicted list.
    • $\text{ideal\_rel}_i$: The relevance score of the item at position $i$ in the ideal (perfectly sorted) list.
    • $\text{DCG}_p$: Discounted Cumulative Gain at position $p$.
    • $\text{IDCG}_p$: Ideal Discounted Cumulative Gain at position $p$.
    • \uparrow: Indicates that a higher score is better.
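A minimal sketch of NDCG@p following the formulas above, given per-item relevance scores in predicted order:

```python
import numpy as np

def ndcg_at_p(relevances, p):
    rel = np.asarray(relevances, dtype=float)[:p]
    discounts = np.log2(np.arange(2, rel.size + 2))            # log2(i + 1) for i = 1..p
    dcg = np.sum((2.0 ** rel - 1.0) / discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:p]
    idcg = np.sum((2.0 ** ideal - 1.0) / discounts[:ideal.size])
    return dcg / idcg if idcg > 0 else 0.0
```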

5. IoU Acc (Intersection over Union Accuracy)

  • Conceptual Definition: IoU (also known as the Jaccard index) measures the overlap between two bounding boxes (or segmentation masks). For Referring Expression Comprehension (REC), it evaluates whether the model's predicted bounding box for a described object sufficiently overlaps with the ground-truth bounding box. IoU Acc is typically reported as the percentage of predictions where IoU is greater than a certain threshold (e.g., 0.5).
  • Mathematical Formula: $ \text{IoU}(B_p, B_{gt}) = \frac{\text{Area}(B_p \cap B_{gt})}{\text{Area}(B_p \cup B_{gt})} $
  • Symbol Explanation:
    • $B_p$: The predicted bounding box (or mask).

    • $B_{gt}$: The ground-truth bounding box (or mask).

    • $\text{Area}(B_p \cap B_{gt})$: The area of overlap (intersection) between the predicted and ground-truth boxes.

    • $\text{Area}(B_p \cup B_{gt})$: The area of the union of the predicted and ground-truth boxes.

    • \uparrow: Indicates that a higher score is better.
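A direct transcription of the box IoU formula, with boxes given as (x1, y1, x2, y2):

```python
def box_iou(box_p, box_gt):
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

# REC accuracy then counts a prediction as correct when IoU > 0.5.
```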

6. CLIP-Sim (CLIP Similarity) [71]

  • Conceptual Definition: CLIP-Sim measures the semantic similarity between a generated image and a given text prompt using the CLIP model. CLIP is a contrastively pre-trained vision-language model that learns to embed images and text into a shared latent space. The cosine similarity of the embeddings from CLIP serves as a proxy for semantic alignment. Higher similarity indicates that the generated image is more semantically consistent with the text.
  • Mathematical Formula: $ \text{CLIP-Sim}(I, T) = \frac{E_I \cdot E_T}{||E_I|| \cdot ||E_T||} $
  • Symbol Explanation:
    • $I$: The generated image.
    • $T$: The text prompt (e.g., story caption).
    • $E_I$: The CLIP embedding of the image $I$.
    • $E_T$: The CLIP embedding of the text $T$.
    • $\cdot$: Dot product.
    • $\|\cdot\|$: L2 norm (magnitude) of the embedding vector.
    • \uparrow: Indicates that a higher score is better.
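Given precomputed CLIP embeddings, CLIP-Sim is simply their cosine similarity:

```python
import torch
import torch.nn.functional as F

def clip_sim(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: (D,) embeddings from a CLIP image / text encoder
    return F.cosine_similarity(image_emb.unsqueeze(0), text_emb.unsqueeze(0)).squeeze(0)
```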

7. mIoU (Mean Intersection over Union)

  • Conceptual Definition: mIoU is a standard metric for image segmentation tasks. It calculates the IoU for each semantic class present in the ground truth, and then averages these IoU scores across all classes. In Segmentation-to-image Generation, it measures how accurately the generated image's semantic segmentation (when re-segmented by an external tool) matches the input semantic map. A higher mIoU indicates better structural and semantic fidelity of the generated image to the input segmentation.
  • Mathematical Formula: $ \text{IoU}_k = \frac{TP_k}{TP_k + FP_k + FN_k} $ $ \text{mIoU} = \frac{1}{N_{classes}} \sum_{k=1}^{N_{classes}} \text{IoU}_k $
  • Symbol Explanation:
    • $k$: Index for a specific semantic class.
    • $TP_k$: True Positives for class $k$ (pixels correctly identified as class $k$).
    • $FP_k$: False Positives for class $k$ (pixels incorrectly identified as class $k$).
    • $FN_k$: False Negatives for class $k$ (pixels of class $k$ incorrectly identified as something else).
    • $N_{classes}$: The total number of semantic classes.
    • \uparrow: Indicates that a higher score is better.
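
A minimal sketch of mIoU over per-pixel class maps; skipping classes absent from both prediction and ground truth is one common convention and is assumed here.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU from per-pixel class maps, averaged over classes that actually occur."""
    ious = []
    for k in range(num_classes):
        tp = np.sum((pred == k) & (gt == k))
        fp = np.sum((pred == k) & (gt != k))
        fn = np.sum((pred != k) & (gt == k))
        if tp + fp + fn == 0:   # class absent from both maps: skip it
            continue
        ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))

# Toy 2x2 class maps with 3 classes.
pred = np.array([[0, 1], [2, 2]])
gt   = np.array([[0, 1], [1, 2]])
print(mean_iou(pred, gt, num_classes=3))  # (1.0 + 0.5 + 0.5) / 3
```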

5.3. Baselines

The paper compares MM-Interleaved against a wide array of state-of-the-art multi-modal models, categorized by their primary capabilities:

Models for Text-Generation Only: These models primarily focus on visual comprehension and generating text responses, often struggling with image generation or multi-image scenarios.

  • MetaLM [30]

  • OF-9B (OpenFlamingo) [5]

  • IDEFICS-80B [36]

  • KOSMOS-1 [33], KOSMOS-2 [68]

  • Flamingo-9B [3], Flamingo-80B [5]

  • IDEFICS-80B-I [36]

  • mPLUG-DocOwl [97]

  • BLIP-2 [27] (Vicuna-7B, Vicuna-13B versions)

  • InstructBLIP [14] (Vicuna-7B, Vicuna-13B versions)

  • Shikra [28]

  • LLaVA-1.5 [53] (Vicuna-7B, Vicuna-13B versions)

  • Qwen-VL [6], Qwen-VL-Chat [6]

Text-to-Image Specialists: These are models specifically designed for high-quality image generation from text prompts. They typically do not handle interleaved data or text generation.

  • Retrieval Result (a baseline for image quality)

  • DALL-E [73]

  • CogView2 [16]

  • Stable Diffusion [74]

  • GLIDE [64]

  • Make-A-Scene [24]

  • DALL-E 2 [72]

  • Muse-3B [96]

  • Imagen-3.4B [76]

  • Parti-20B [100]

Models for both Image and Text Generation: These are the most direct competitors, as they also aim for joint image and text generation in interleaved contexts.

  • CM3-13B [1] (or CM3Leon [101])

  • VL-GPT [110]

  • GILL [43]

  • Emu-13B [84] (or Emu [84], Emu-I [84], Emu2 [81])

  • Next-GPT [95]

  • DreamLLM [17] (DreamLLM-7B-Stage1, DreamLLM-7B versions)

  • SEED-LLaMA [25], SEED-LLaMA-I [25]

Baselines for Specific Tasks:

  • Referring Expression Comprehension: OFA-L [89], VisionLLM-H [92], MiniGPT-V2 [9], Ferret [98].

  • Segmentation-to-image Generation: VQGAN [19], LDM [75], PITI [90], ControlNet [102].

  • Visual Storytelling: StoryDALL-E [58], AR-LDM [67], ACM-VSG [22], MiniGPT-5 [106].

These baselines are representative because they cover the spectrum of multi-modal research, from comprehension-focused LLMs to specialized image generators, and recent attempts at joint image-text generation. Comparing against such a diverse set allows the authors to demonstrate MM-Interleaved's versatility and superior performance across various benchmarks.

6. Results & Analysis

6.1. Core Results Analysis

The experiments demonstrate the versatility and strong performance of MM-Interleaved across a wide range of multi-modal comprehension and generation tasks.

6.1.1. Zero-shot Results after Pre-training

The paper first evaluates the zero-shot capabilities of the pre-trained MM-Interleaved model, meaning without any specific fine-tuning for the downstream tasks.

The following table (Table 1 from the original paper) presents the multi-modal comprehension evaluation:

Model LLM H I A COCO Flickr NoCaps I2Para. VQAv2 OKVQA GQA VizWiz TextVQA VisDial
Models for Text-Generation Only
MetaLM [30] MetaLM - - - 82.2 43.3 58.7 - 41.1 11.4 - 41.1 11.4 -
OF-9B [5] MPT-7B - - - 79.5 59.5 - - 52.7 37.8 - 27.5 24.2 -
IDEFICS-80B [36] LLaMA-65B - - - 91.8 53.7 65.0 - 60.0 - 45.2 36.0 30.9 -
KOSMOS-1 [33] MetaLM H - - - 65.2 - - 46.7 - - - - -
KOSMOS-2 [68] KOSMOS-1 H - - - 66.7 - - 45.6 - - - - -
Flamingo-9B [3] Chinchilla-7B H - - 79.4 61.5 - - 51.8 44.7 - 28.8 31.8 48.0
Flamingo-80B [5] Chinchilla-70B H - - 84.3 67.2 - - 56.3 50.6 - 31.6 35.0 52.0
IDEFICS-80B-I [36] LLaMA-65B - I - 117.2 65.3 104.5 - 37.4 - - 26.0 - -
mPLUG-DocOwl [97] LLaMA-7B - I A 52.6 62.2 57.4 - - - - - - -
BLIP-2 [27] Vicuna-7B - I A - 74.9 107.5 - - - - 38.6 25.3 40.1
BLIP-2 [27] Vicuna-13B - I A - 71.6 103.9 - 41.0 - 41.0 19.6 42.5 -
InstructBLIP [14] Vicuna-7B - I A - 82.4 123.1 - - - - 49.2 34.5 50.1
InstructBLIP [14] Vicuna-13B - I A - 82.8 121.9 - - - - 49.5 33.4 50.7
Shikra [28] Vicuna-13B - I A 117.5 73.9 - - 77.4 - - - - -
LLaVA-1.5 [53] Vicuna-7B - I A - - - - 78.5 - 62.0 50.0 58.2 -
LLaVA-1.5 [53] Vicuna-13B - I A - - - - 80.0 - 63.3 53.6 61.3 -
Qwen-VL [6] Qwen-7B H I A - 85.8 121.4 - 78.8 - 59.3 35.2 63.8 -
Qwen-VL-Chat [6] Qwen-7B H I A - 81.0 120.2 - 78.2 - 57.5 38.9 61.5 -
Models for both Image and Text Generation
CM3Leon [101] - H - - 61.6 - - 10.5 47.6 23.8 - 37.6 - 22.6
Emu [84] Vicuna-13B H - - 112.4 - - - 52.0 38.2 - 34.2 - 47.4
Emu-I [84] Vicuna-13B H - - 117.7 - - - 40.0 34.7 - 35.4 - 48.0
Emu2 [81] LLaMA-33B H - - - - - - 33.3 26.7 - 40.4 26.2 -
DreamLLM [17] Vicuna-7B - I - 115.4 - - 17.4 56.6 44.3 - 38.1 34.9 -
VL-GPT [110] LLaMA-7B - - - 116.4 - - - 51.7 35.8 34.6 34.7 - 49.9
VL-GPT-I [110] LLaMA-7B - I A 133.7 - - - 67.2 50.3 51.5 38.9 - 51.8
SEED-LLaMA [25] LLaMA2-Chat-13B - I A 135.0 - - - 48.1 27.1 - 23.3 - -
SEED-LLaMA-I [25] LLaMA2-Chat-13B - I A 126.0 - - - 63.4 43.2 - 49.4 - -
MM-Interleaved Vicuna-13B - - - 129.0 85.8 106.4 23.5 57.0 40.0 46.7 40.8 37.2 48.7
MM-Interleaved-SFT Vicuna-13B - I A 140.5 93.0 123.2 30.3 80.2 51.7 60.5 54.9 61.0 53.7

The following are the results from Table 1 of the original paper:

  • Multi-modal Comprehension (Table 1):
    • MM-Interleaved (pre-trained, without any dataset-specific fine-tuning, indicated by "- - -") achieves state-of-the-art (SOTA) performance in the fully decontaminated setting (where training data does not overlap with evaluation data) across all evaluated captioning (COCO, Flickr30K, NoCaps, I2Para), VQA (VQAv2, OKVQA, VizWiz, TextVQA), and visual dialogue (VisDial) tasks.

    • It significantly outperforms other models that also do not use in-house data.

    • Even without fine-tuning, it surpasses many models (Flamingo-9B, Emu) that leverage vast in-house datasets (indicated by "H" in the "H I A" column). This highlights the architectural superiority of MM-Interleaved for image-text interaction.

    • For example, on Flickr captioning, MM-Interleaved achieves 85.8 CIDEr, significantly higher than IDEFICS-80B's 53.7 and even Flamingo-80B's 67.2.

      The following table (Table 3 from the original paper) presents the zero-shot text-to-image generation results:

      Model MS-COCO FID (\downarrow) LN-COCO FID (\downarrow)
      Text-to-Image Specialists
      Retrieval Result 17.97
      DALL-E [73] ~28
      CogView2 [16] 24.00
      Stable Diffusion [74] 12.43
      GLIDE [64] 12.24
      Make-A-Scene [24] 11.84
      DALL-E 2 [72] 10.39
      Muse-3B [96] 7.88
      Imagen-3.4B [76] 7.27
      Parti-20B [100] 7.23
      Models for both Image and Text Generation
      CM3-13B [1] 29.56
      VL-GPT [110] 12.25
      GILL [43] 12.20
      Emu-13B [84] 11.66
      Next-GPT [95] 11.28
      CM3Leon-7B [101] 10.82
      DreamLLM-7B-Stage1 [17] 8.76
      DreamLLM-7B [17] 8.46
      MM-Interleaved 7.90 23.88

The following are the results from Table 3 of the original paper:

  • Text-to-Image Generation (Table 3):
    • MM-Interleaved achieves competitive FID scores (7.90 on MS-COCO, 23.88 on LN-COCO) compared to existing image and text generation models, and even some specialist text-to-image models.

    • This is noteworthy given that many high-performing baselines (Emu, Muse, Imagen, Parti) rely on in-house data, which MM-Interleaved does not.

    • For example, its FID of 7.90 on MS-COCO is better than DreamLLM-7B (8.46) and CM3Leon-7B (10.82).

      The following table (Table 2 from the original paper) presents the zero-shot results for interleaved image-text comprehension and generation on SEED-Bench-2:

      Model LLM L1 (Part-1) Part-2 L2 Part-3 L3
      Emu [84] LLaMA-13B 42.5 41.1 42.4 41.4 42.3
      Next-GPT [95] Vicuna-7B 30.7 35.6 31.1 33.9 31.4
      SEED-LLaMA [25] LLaMA2-Chat-13B 43.9 43.4 43.8 52.3 44.8
      MM-Interleaved Vicuna-13B 43.9 46.1 44.1 52.1 45.0

The following are the results from Table 2 of the original paper:

  • Interleaved Comprehension and Generation (Table 2):
    • On SEED-Bench-2, a benchmark specifically designed for interleaved image-text data, MM-Interleaved achieves SOTA results for both comprehension ($L_1$, $L_2$) and generation ($L_3$) tasks.
    • For L1 (image and text comprehension) and L2 (interleaved comprehension), MM-Interleaved matches or slightly outperforms SEED-LLaMA.
    • For L3 (image and text generation), MM-Interleaved achieves 45.0, slightly surpassing SEED-LLaMA's 44.8. This is significant as it demonstrates MM-Interleaved's robust capability in managing complex interleaved sequences for both understanding and creative tasks.

6.1.2. Fine-tuning Results

Supervised fine-tuning (MM-Interleaved-SFT) further enhances the model's capabilities.

  • Multi-modal Comprehension (Table 1, MM-Interleaved-SFT row):
    • After fine-tuning, MM-Interleaved-SFT achieves SOTA performance across the board without using any in-house data.

    • It significantly improves upon its pre-trained zero-shot performance and outperforms all other baselines, including those using in-house data or more visual tokens.

    • For example, MM-Interleaved-SFT achieves 80.2 on VQAv2, matching LLaVA-1.5 (Vicuna-13B)'s 80.0.

    • Advantage over LLaVA-1.5: MM-Interleaved-SFT achieves this competitive understanding with significantly fewer visual tokens (64 tokens vs. LLaVA-1.5's 576 tokens), making it far more efficient and better suited for multi-image scenarios. Crucially, MM-Interleaved can also generate images, unlike LLaVA-1.5.

      The following table (Table 4 from the original paper) presents the supervised fine-tuning results on the referring expression comprehension task:

      Model RefCOCO [42] RefCOCO+ [59] RefCOCOg [59]
      Val Test-A Test-B Val Test-A Test-B Val Test
      OFA-L [89] 79.96 83.67 76.39 68.29 76.00 61.75 67.57 67.50
      VisionLLM-H [92] - 86.70 - - - - - -
      Shikra [10] 87.01 90.61 80.24 81.60 87.36 72.12 82.27 82.19
      MiniGPT-V2 [9] 88.69 91.65 85.33 79.97 85.12 74.45 84.44 84.66
      Ferret [98] 89.48 92.41 84.36 82.81 88.14 75.17 85.83 86.34
      * Qwen-VL [6] 89.36 92.26 85.34 83.12 88.25 77.21 85.58 85.48
      MM-Interleaved 89.92 92.59 86.54 82.99 88.57 77.07 85.21 84.92

The following are the results from Table 4 of the original paper:

  • Referring Expression Comprehension (REC) (Table 4):
    • MM-Interleaved (fine-tuned) outperforms other methods on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.

    • It matches or exceeds Qwen-VL, which utilizes an additional 22M in-house grounding dataset and trains at a higher image resolution (448 vs. MM-Interleaved's 224 pixels). This underscores the effectiveness of MMFS in synchronizing fine-grained image features for tasks requiring precise visual grounding.

      The following table (Table 5 from the original paper) presents segmentation-to-image generation and visual storytelling results:

      Method Groundtruth VQGAN [19] LDM [75] PITI [90] ControlNet [102] Ours
      mIoU (\uparrow) 0.58 0.21 0.31 0.26 0.35 0.44

The following are the results from Table 5a of the original paper:

  • Segmentation-to-Image Translation (Table 5a):
    • MM-Interleaved significantly outperforms other baselines, including ControlNet, achieving an mIoU of 0.44 on ADE20K. This is a unique capability not typically shown by other multi-modal LLMs.

    • This task demands precise pixel-level alignment, and the superior performance indicates MM-Interleaved's ability to leverage better representations learned from large-scale pre-training data and the effectiveness of MMFS for pixel-level visual conditions.

      Model CLIP Sim.(\uparrow) FID (\downarrow)
      GILL [43] 0.64 -
      MiniGPT-5 [106] 0.70 59.5
      MM-Interleaved 0.70 39.7

The following are the results from Table 5b of the original paper:

  • Visual Storytelling (Table 5b):
    • For generating the last image in a sequence on the VIST dataset, MM-Interleaved achieves SOTA performance. It matches MiniGPT-5 in CLIP Similarity (0.70) and significantly outperforms it in FID (39.7 vs. 59.5), indicating higher visual quality and diversity.

      Model Pororo FID (\downarrow) Flintstones FID (\downarrow)
      StoryDALL-E [58] 25.9 26.5
      AR-LDM [67] 17.4 19.3
      ACM-VSG [22] 15.4 18.4
      MM-Interleaved 14.7 18.7

The following are the results from Table 5c of the original paper:

  • Multi-image Generation in Visual Storytelling (Table 5c):
    • For auto-regressively generating multiple images on PororoSV and FlintstonesSV datasets, MM-Interleaved again achieves SOTA FID scores (14.7 and 18.7 respectively), outperforming previous specialist methods. This highlights its capability to maintain visual consistency and quality across generated image sequences.

6.2. Ablation Studies / Parameter Analysis

Ablation studies were conducted using CLIP-ViT-L/14 (image encoder), OpenLLaMA-3B v2 (LLM), and miniSD (image decoder) on a reduced pre-training setup (10000 steps). The MMFS in these studies initially focuses on a single-image, single-scale setting ($16 \times 16$ feature map of the nearest preceding image).

The following table (Table 6 from the original paper) presents an ablation study on using MMFS:

img-5.jpeg Image 13: The image is a table excerpt from the MM-Interleaved paper, showing the performance comparison of the model on Caption, Generation, OK-VQA, and TextVQA tasks under different configurations, highlighting the effect of using MMFS and varying token numbers.

6.2.1. Token Efficiency (Table 6a & 6b)

  • Impact of MMFS with Limited Visual Tokens (Table 6a):

    • When equipped with MMFS (32 visual tokens), the model achieves 110.6 CIDEr on COCO Caption, 30.0 FID on COCO Generation, 29.8 on OK-VQA, and 27.7 on TextVQA.
    • Without MMFS, even with a much larger number of visual tokens (256), the results are worse across the board (107.0 Caption, 32.2 FID for Generation, 28.7 OK-VQA, 22.5 TextVQA).
    • Finding: This demonstrates that MMFS is highly effective. It allows the model to outperform a configuration using 8 times more visual tokens while maintaining high efficiency, showcasing its ability to preserve fine-grained details even with limited input visual tokens.
  • Impact of Input Image Resolution (Table 6b):

    • Increasing the input image resolution from 224 to 448 (+MMFS (448)) leads to improved performance on COCO Caption (111.0 vs. 110.6) and OK-VQA (30.0 vs. 29.8).
    • Finding: This indicates that MMFS can effectively leverage additional information available in higher-resolution images, even when the LLM still only receives a small fixed number of visual tokens. The dynamic feature extraction mechanism allows the model to exploit these finer details.

6.2.2. MMFS for Image Generation (Table 6c)

  • Segmentation-to-Image Translation (Table 6c):
    • The task requires precise pixel-level alignment. With MMFS, the model achieves 0.40 mIoU on ADE20K.
    • Without MMFS, the mIoU drops drastically to 0.00, effectively indicating failure.
    • Finding: This strongly highlights the criticality of MMFS for tasks that demand fine-grained spatial control and detail. Simply relying on coarse Perceiver Resampler features is insufficient, as they lose the pixel-level information needed for such precise generation.

6.2.3. Comparison between Different Cross Attention Mechanisms (Table 7)

The following table (Table 7 from the original paper) presents an ablation on the design choice of MMFS:

Cross-Attn Transition Attn Input COCO Cap.(\uparrow) COCO Gen.(\downarrow) OK-VQA (\uparrow) TextVQA (\uparrow)
Deformable None 16x16 110.6 30.0 29.8 27.7
Dense None 16x16 108.5 30.6 28.4 23.6
Dense Resampler 32 tokens 107.2 30.7 28.9 24.0

The following are the results from Table 7 of the original paper:

  • Comparison of Attention Mechanisms:
    • The table compares Deformable Attention (used in MMFS) against Dense cross-attention.
    • Directly replacing MMFS with vanilla dense cross-attention (Dense, None, 16x16 row) leads to a performance drop across all metrics. For instance, TextVQA drops from 27.7 to 23.6. This suggests that Deformable Attention is more effective, possibly due to its ability to focus on relevant spatial locations and its inherent efficiency.
    • The Dense, Resampler, 32 tokens row is analogous to how Flamingo [3] integrates visual features. This configuration also performs worse than MMFS.
    • Finding: Deformable Attention proves superior, especially for tasks like TextVQA that require capturing fine-grained information (e.g., text in images), indicating its efficiency and effectiveness in discerning specific details.
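
To illustrate the mechanism being compared, below is a minimal single-image, single-scale PyTorch sketch of deformable cross-attention, where each query predicts a few sampling offsets on the image feature map and gathers features there via bilinear interpolation. It is a simplified stand-in for MMFS under assumed shapes and invented names (DeformableCrossAttention, ref_points, the 0.05 offset scale), not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    """Sketch: each query samples K offset locations on a feature map and
    combines them with learned attention weights (single image, single scale)."""
    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, 2 * num_points)   # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(dim, num_points)       # one attention weight per point
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, feat_map, ref_points):
        # queries:    (B, Q, C) query tokens from the LLM side
        # feat_map:   (B, C, H, W) fine-grained image features
        # ref_points: (B, Q, 2) reference points in normalized [0, 1] coordinates
        B, Q, C = queries.shape
        offsets = self.offset_proj(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)           # (B, Q, K)
        # Sampling locations mapped to [-1, 1] for grid_sample; 0.05 is an arbitrary offset scale.
        locs = (ref_points.unsqueeze(2) + 0.05 * offsets) * 2 - 1      # (B, Q, K, 2)
        sampled = F.grid_sample(feat_map, locs, align_corners=False)   # (B, C, Q, K)
        sampled = sampled.permute(0, 2, 3, 1)                          # (B, Q, K, C)
        out = (weights.unsqueeze(-1) * sampled).sum(dim=2)             # (B, Q, C)
        return self.out_proj(out)

# Toy usage: 2 queries attending over a 16x16 feature map.
attn = DeformableCrossAttention(dim=64)
q = torch.randn(1, 2, 64)
fmap = torch.randn(1, 64, 16, 16)
ref = torch.rand(1, 2, 2)
print(attn(q, fmap, ref).shape)  # torch.Size([1, 2, 64])
```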

6.2.4. MMFS with Multi-Image and Multi-Scale (Table 8 & Figure 6 left)

The following table (Table 8 from the original paper) presents an ablation on multi-scale and multi-image usage of MMFS:

multi-scale multi-image Caption (\uparrow) Generation (\downarrow) OK-VQA (\uparrow) TextVQA (\uparrow)
110.6 30.0 29.8 27.7
\checkmark 111.2 29.5 30.3 30.3
\checkmark \checkmark 111.2 29.9 31.1 31.1

The following are the results from Table 8 of the original paper:

  • Multi-scale and Multi-image Impact:

    • Adding multi-scale capabilities to MMFS (first row with \checkmark) improves performance across tasks (e.g., COCO Caption from 110.6 to 111.2, TextVQA from 27.7 to 30.3).

    • Further adding multi-image capabilities (second row with \checkmark) provides additional gains, particularly in VQA tasks (OK-VQA from 30.3 to 31.1, TextVQA from 30.3 to 31.1).

    • Finding: This confirms that MMFS benefits from accessing visual features at multiple resolutions and from multiple preceding images, allowing for richer contextual understanding.

      The following figure (Figure 6 from the original paper) shows few-shot results on OKVQA and TextVQA, and additional GFLOPs over using 32 visual tokens with different numbers of image and text inputs:

      img-6.jpeg Image 14: The image is a chart showing the comparison of accuracy and computational complexity across different model configurations. The left graph presents accuracy changes on OKVQA and TextVQA datasets, while the right graph illustrates GFLOPs trends under various resolutions and attention mechanisms.

  • Few-shot Learning (Figure 6, left):

    • The left graph shows performance on OKVQA and TextVQA as the number of few-shot examples ($k$) increases.
    • MMFS consistently outperforms the baseline (32 visual tokens only) as the number of context images grows.
    • Finding: The ability to attend to multi-scale features from multiple images in the context enhances the model's in-context learning capabilities.

6.2.5. Computational Efficiency (Figure 6 right)

  • GFLOPs Comparison (Figure 6, right):
    • Integrating MMFS into the LLM results in only a minimal increase in computational cost:
      • FLOPs: ~2% increase.
      • #parameters: ~2% increase.
      • Runtime: ~6% increase.
      • Memory: ~3% increase.
    • When compared to using 256 visual tokens without MMFS (which performs worse, as seen in Table 6a), MMFS with 32 visual tokens is much more efficient:
      • 2.8x fewer FLOPs.
      • 1.3x faster runtime.
    • Finding: MMFS provides a significant efficiency gain, achieving better performance with less computational overhead than simply increasing the number of fixed visual tokens or using dense cross-attention. This makes it practical for multi-image scenarios.

6.2.6. Pre-training with Different Loss Terms (Table 16)

The following table (Table 16 from the original paper) presents an ablation study on pre-training with different loss terms:

Loss Term Caption (\uparrow) Generation (\downarrow) OK-VQA (\uparrow) TextVQA (\uparrow)
$\mathcal{L}_{NTP}+100\mathcal{L}_{NIP}$ 106.2 31.1 29.8 24.5
$\mathcal{L}_{NTP}+10\mathcal{L}_{NIP}$ 110.6 30.0 29.8 27.7
$\mathcal{L}_{NTP}+\mathcal{L}_{NIP}$ 110.0 31.4 29.3 26.0
$\mathcal{L}_{NTP}$ only 105.7 - 29.9 27.6
$\mathcal{L}_{NIP}$ only - 34.2 - -

The following are the results from Table 16 of the original paper:

  • Joint Training Benefits: Training with both Next-Text-Token Prediction ($\mathcal{L}_{NTP}$) and Next-Image Prediction ($\mathcal{L}_{NIP}$) losses significantly improves performance on both comprehension (Caption, OK-VQA, TextVQA) and generation tasks. For example, Caption jumps from 105.7 ($\mathcal{L}_{NTP}$ only) to 110.6 ($\mathcal{L}_{NTP}+10\mathcal{L}_{NIP}$).
  • Hyperparameter $\lambda$: The weighting factor $\lambda=10$ for $\mathcal{L}_{NIP}$ (i.e., $\mathcal{L}_{NTP}+10\mathcal{L}_{NIP}$) yields the best overall balance and performance. Too high ($\lambda=100$) or too low ($\lambda=1$) weighting for the image generation loss degrades results, suggesting a crucial interplay between text and image learning.
  • Finding: The mutual benefits of joint training are evident, as learning to generate both modalities helps each other, leading to better overall multi-modal capabilities.
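
A schematic sketch of the weighted objective discussed above, with placeholder tensors standing in for the LLM's token logits and the diffusion decoder's noise prediction; all names and shapes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def interleaved_loss(text_logits, text_targets, noise_pred, noise_target, lam=10.0):
    """Weighted sum of a next-text-token cross-entropy loss and a next-image
    (diffusion) MSE loss, mirroring L_NTP + lambda * L_NIP with lambda = 10."""
    l_ntp = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    l_nip = F.mse_loss(noise_pred, noise_target)
    return l_ntp + lam * l_nip

# Toy tensors: 2 sequences of 8 tokens over a 100-token vocab, 2 latent "images".
logits  = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
eps_hat = torch.randn(2, 4, 32, 32)
eps     = torch.randn(2, 4, 32, 32)
print(interleaved_loss(logits, targets, eps_hat, eps))
```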

6.2.7. The Relationship between MMFS and Resampler (Table 17)

The following table (Table 17 from the original paper) presents the complementary relationship between MMFS and Resampler:

w/ Resampler w/ MMFS Caption (\uparrow) Generation (\downarrow) OK-VQA (\uparrow) TextVQA (\uparrow)
\checkmark \checkmark 110.6 30.0 29.8 27.7
\checkmark 107.0 32.2 28.7 22.5
\checkmark 102.7 32.0 27.3 22.0

The following are the results from Table 17 of the original paper:

  • Complementary Roles:
    • Using both Resampler and MMFS (\checkmark / \checkmark row) yields the best performance.
    • Removing MMFS (but keeping Resampler) leads to a notable drop across tasks (e.g., Caption from 110.6 to 107.0, TextVQA from 27.7 to 22.5). This confirms MMFS's role in fine-grained detail.
    • Removing Resampler (but keeping MMFS) results in an even larger performance degradation (e.g., Caption 102.7, TextVQA 22.0). In this setup, 32 randomly-initialized learnable embeddings are used as input visual tokens instead of Resampler outputs.
  • Finding: This demonstrates that the Resampler and MMFS play complementary roles. The Resampler provides effective coarse-grained initial visual tokens that offer a general understanding of the image, while MMFS then augments this by dynamically providing fine-grained details on demand. Both are necessary for optimal performance.

6.3. Qualitative Results

The paper provides several qualitative visualizations to further illustrate MM-Interleaved's capabilities.

  • Zero-shot Text Generation with Interleaved Images and Texts (Figure 7, 8):
    • Figure 7 demonstrates emergent multi-modal in-context learning. The model can answer questions or generate text based on a sequence of images and text, showing reasoning capabilities.

    • Figure 8 illustrates the model's ability to understand complex scenarios like robotics, gaming, and GUI interfaces. This implies robust visual parsing and contextual understanding.

      The following figure (Figure 15 from the original paper) shows examples of multi-modal text understanding and recognition in interleaved image-text data:

      img-7.jpeg Image 15: The image is an illustration demonstrating examples of multi-modal text understanding and recognition in interleaved image-text data. The left side shows fruit addition problems, the center displays company logos with descriptions, and the right side presents images with red-circled text, reflecting the model's ability to correlate visual details and textual information.

The following figure (Figure 16 from the original paper) shows examples of complex scenarios:

img-8.jpeg Image 16: The image is a diagram composed of multiple illustrations showing a robot picking and placing objects in 3D space, steps to adjust an iPhone screen to black and white mode through settings, and a demonstration of creating a fence in the Minecraft game.

  • Zero-shot Image Generation (Figure 9):

    • Figure 9 shows that the model can generate appropriate images based on provided contexts, including desired styles or concepts, demonstrating its creative generative capabilities.

      The following figure (Figure 2 from the original paper) shows a conceptual diagram illustrating the image generation process based on style and description:

      img-9.jpeg Image 2: The image is a conceptual diagram illustrating the image generation process based on style and description. On the left, vibrant images of cats and dogs with specific styles, as well as photographs of sunflowers and a countryside house, are shown. On the right, images generated from textual descriptions include an elephant in a similar style, a portrait of an old man, an oil painting of sunflowers, and a photo depicting ocean waves.

  • Zero-shot Interleaved Image and Text Generation (Figure 10):

    • Figure 10 illustrates the model's capacity to generate both images and text coherently and interleavedly, suitable for tasks like visual storytelling or multi-modal instructions.

      The following figure (Figure 3 from the original paper) shows an educational illustration rich in text and images:

      img-10.jpeg Image 3: The image is an educational illustration rich in text and images, presenting a story about a "beautiful rose" alongside steps for making apple juice. The left side contains the story narrative with corresponding pictures, while the right side details the apple juice preparation process with step-by-step instructions and images.

  • Text Reading QA (Figure 11):

    • Visualizations on TextVQA show that MM-Interleaved with MMFS provides more accurate answers requiring fine-grained text details (e.g., reading numbers or brand names in images).

      The following figure (Figure 11 from the original paper) shows qualitative results on TextVQA:

      img-11.jpeg Image 4: The image is a set of question-and-answer illustrations demonstrating the MM-Interleaved model's ability to answer detailed questions about images, comparing results with and without the multi-modal feature synchronizer (MMFS), highlighting MMFS's improvement in recognizing numbers, brands, and contents.

  • Referring Expression Comprehension (Figure 12):

    • Qualitative results on RefCOCOg show that MMFS leads to more accurate bounding box predictions, confirming its ability to better localize described objects.

      The following figure (Figure 12 from the original paper) shows referring expression comprehension on RefCOCOg:

      img-12.jpeg Image 5: The image is a multi-set comparison illustration showing differences in target localization coordinates between the MM-Interleaved model with and without the multi-modal feature synchronizer (MMFS), highlighting MMFS's enhancement in detail capturing ability.

  • Segmentation-to-Image Translation (Figure 13):

    • This visualization clearly demonstrates that images generated with MMFS maintain a spatial layout significantly closer to the ground-truth image compared to results without MMFS, which often lack spatial alignment.

      The following figure (Figure 13 from the original paper) shows segmentation-to-image generation on ADE20k:

      img-13.jpeg Image 6: The image is a set of comparison illustrations showing segmentation maps, ground truth images, and generated images with and without the multi-scale multi-image feature synchronizer (MMFS) under different descriptive conditions, demonstrating that using MMFS allows more precise visual detail capture.

  • Multi-image Generation and Interleaved Image-Text Generation (Figure 14, 15):

    • Figure 14 (Multi-image Generation for storytelling) shows that images generated with MMFS exhibit better spatial consistency (e.g., consistent backgrounds, character positions) and semantic alignment across a sequence of images than those generated without MMFS.

    • Figure 15 (Interleaved Image-Text Generation) further highlights this, where MMFS enables the generation of coherent images and texts sequentially, balancing diversity with spatial semantic consistency in visual storytelling tasks.

      The following figure (Figure 14 from the original paper) shows a comparison of the MM-Interleaved model's performance in generating multi-frame animated images:

      img-14.jpeg Image 7: The image is Figure 14, showing a comparison of the MM-Interleaved model's performance in generating multi-frame animated images. It includes input context, ground truth multi-frame images (GT), images generated with the multi-modal feature synchronizer (MMFS), and images generated without MMFS, demonstrating the model's advantages in detail restoration and consistency.

The following figure (Figure 15 from the original paper) shows a comparative experimental illustration from the paper:

img-15.jpeg Image 8: The image is a comparative experimental illustration from the paper, showing the differences in image-text generation results with and without the multi-modal feature synchronizer (MMFS). It presents multiple frames of paired image-text comparisons to highlight the improved accuracy and detail capture enabled by MMFS.

6.4. Summary of Key Findings

  • Superior Performance: MM-Interleaved achieves SOTA results in zero-shot multi-modal comprehension and generation without using in-house data, and further improves to SOTA across various fine-tuned tasks.
  • Token Efficiency: MMFS significantly enhances token efficiency, allowing MM-Interleaved to achieve better or comparable performance with substantially fewer visual tokens than previous methods, especially beneficial in multi-image scenarios.
  • Fine-grained Detail Capture: MMFS is critical for tasks requiring pixel-level precision (e.g., segmentation-to-image translation, referring expression comprehension) and fine-grained visual details (e.g., TextVQA), where it vastly outperforms models without it.
  • Architectural Advantages: The Deformable Attention within MMFS is crucial for its effectiveness, outperforming dense cross-attention. The multi-scale and multi-image capabilities of MMFS further boost performance.
  • Mutual Benefits of Joint Training: End-to-end training with both text and image generation losses yields mutual benefits, improving both comprehension and generation capabilities.
  • Computational Efficiency: MMFS integrates with minimal overhead in FLOPs, parameters, runtime, and memory, making it a highly efficient solution for enhancing detail.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces MM-Interleaved, an innovative end-to-end generative model for interleaved image-text data. Its core contribution is the Multi-Modal Feature Synchronizer (MMFS), a novel module that enables dynamic, on-demand extraction of fine-grained visual details from multi-scale image features during both text and image generation. This mechanism effectively addresses the limitation of fixed visual tokens in multi-modal LLMs, which often leads to information loss, especially in multi-image scenarios. MM-Interleaved is robustly pre-trained on a diverse mix of paired and interleaved image-text corpora and further refined through supervised fine-tuning. The extensive experimental results unequivocally demonstrate its superior versatility and performance, achieving state-of-the-art (SOTA) results across numerous multi-modal comprehension and generation benchmarks, all without relying on any in-house data. Key findings highlight MMFS's role in improving token efficiency, enabling fine-grained visual understanding, and enhancing consistency in multi-image generation.

7.2. Limitations & Future Work

The authors explicitly acknowledge the following limitations:

  • Quality and Quantity of Public Data: A significant limitation is the relatively low quality and quantity of publicly available interleaved image-text data. This scarcity hinders the full exploitation of the potential of interleaved generative modeling. More high-quality, diverse datasets are needed to further push the boundaries of such models.

  • Hallucination Issues: Like other multi-modal models, MM-Interleaved may suffer from hallucination issues, where it generates content that is plausible but factually incorrect or not present in the visual input.

  • Biased Contents: The model might generate biased content due to noisy training data. As large web-scraped datasets can reflect societal biases, these can be inadvertently learned and perpetuated by generative models.

    For future work, the authors implicitly suggest:

  • Improving Data: Investing more effort into improving the quality and quantity of interleaved image-text data.

  • Safety and Reliability: Enhancing the safety and reliability of multi-modal generative models by mitigating issues like hallucination and bias.

7.3. Personal Insights & Critique

  • Innovation of MMFS: The Multi-Modal Feature Synchronizer (MMFS) is a genuinely elegant solution to a critical problem in multi-modal LLMs. Instead of brute-forcing detail retention by increasing visual token count (which leads to computational explosion), MMFS introduces a dynamic, context-aware mechanism. This "look-up on demand" approach, akin to a human selectively focusing their attention, is highly intuitive and efficient. The use of Deformable Attention is a smart choice for this purpose, given its proven efficiency in computer vision tasks.
  • Generalizability and Versatility: The model's SOTA performance across such a broad range of tasks (captioning, VQA, text-to-image, segmentation-to-image, visual storytelling) without in-house data is highly impressive. This suggests a robust generalizability stemming from its architectural design and end-to-end training strategy. Its ability to handle tasks requiring pixel-level precision demonstrates a deeper integration of visual and linguistic understanding than many multi-modal LLMs that primarily focus on high-level concepts.
  • Efficiency as a Key Advantage: In an era where LLMs are becoming increasingly resource-intensive, the fact that MM-Interleaved achieves superior performance with fewer visual tokens and minimal computational overhead (2-6% increase) is a significant practical advantage. This makes it more deployable and sustainable.
  • Application Potential: The methods and conclusions are highly transferable. The dynamic feature synchronization concept could be applied to other modalities (e.g., audio, video) or other complex multi-modal tasks where fine-grained detail is crucial but raw input size is prohibitive. It could significantly impact interactive multi-modal AI assistants, automated content creation tools for blogs/news, and educational platforms.
  • Critique and Areas for Improvement:
    • Complexity of MMFS Tuning: While efficient, the MMFS itself introduces additional hyperparameters (e.g., cross-attention frequency in LLM, number of sampling points, multi-scale levels). The sensitivity and optimal tuning of these for different tasks might be complex. The paper mentions $L=3$ and $K$ for sampling points, but a deeper dive into their impact on performance vs. efficiency could be valuable.
    • Reliance on Pre-trained Models: MM-Interleaved heavily relies on powerful pre-trained VFM (CLIP-ViT), LLM (Vicuna), and DM (Stable Diffusion). While this is a common and effective strategy, the model's overall performance and any biases are inherently tied to these foundational models. A multi-modal LLM with a fully integrated training from scratch could potentially uncover different dynamics, though at immense computational cost.
    • Interpretability: Like many deep learning models, understanding why MMFS chooses certain sampling points or how it prioritizes multi-scale features could be further explored for better interpretability and trustworthiness.
    • Mitigating Hallucination and Bias: While acknowledged as limitations, the paper doesn't propose specific architectural or training solutions to directly address hallucination or bias. Future work could investigate how MMFS or its extensions could specifically help in grounding generations more firmly in visual facts or in detecting and correcting biases. The dynamic, fine-grained access to visual information should theoretically help reduce hallucination by allowing the model to "verify" details against the image more robustly. This is an exciting avenue to explore.
