Emu: Generative Pretraining in Multimodality
TL;DR Summary
Emu is a Transformer-based multimodal foundation model trained with a single autoregressive objective: classify the next text token or regress the next visual embedding. This unified formulation lets it train on diverse interleaved data (images, text, and video) and deliver strong zero-/few-shot performance in multimodal understanding, generation, and instruction following.
Abstract
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Emu: Generative Pretraining in Multimodality
- Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang.
- Affiliations: The authors are from the Beijing Academy of Artificial Intelligence (BAAI), Tsinghua University, and Peking University.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for a conference or journal publication at the time of this version's release. However, arXiv is a standard platform for disseminating cutting-edge research in fields like AI.
- Publication Year: The first version was submitted in July 2023.
- Abstract: The paper introduces Emu, a Transformer-based multimodal foundation model designed to generate both images and text from varied multimodal inputs (e.g., interleaved text, images, video). Emu is an "omnivore" model trained autoregressively with a unified objective: to predict the next element in a sequence, which can be either a text token (via classification) or a visual embedding (via regression). This approach allows it to learn from diverse data sources like image-text pairs, web pages, and videos with subtitles. Emu serves as a generalist interface for tasks like image captioning, visual question answering (VQA), and text-to-image generation, demonstrating strong performance in zero-shot and few-shot settings compared to other large multimodal models. The paper also shows its potential as a multimodal assistant after instruction tuning.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2307.05222
- PDF Link: http://arxiv.org/pdf/2307.05222v2
2. Executive Summary
- Background & Motivation (Why):
- Large Language Models (LLMs) have shown remarkable capabilities in text understanding and generation. The next frontier is extending this power to multimodality, creating Large Multimodal Models (LMMs).
- However, most existing LMMs (like Flamingo and its derivatives) primarily focus on a "vision-to-text" paradigm. They take images as input and only generate text. The visual part of the model (the vision encoder) is often frozen, and the training objective only involves predicting the next text token. This limits the model's capacity to generate visual content and fully integrate modalities.
- Furthermore, these models are often trained on static image-text pairs or documents, underutilizing rich, temporally correlated data like videos, which are an abundant source of naturally interleaved visual frames and text (subtitles).
- Main Contributions / Findings (What):
- Unified Generative Pretraining: Emu introduces a novel training objective that unifies text and image generation. Instead of only predicting the next text token, Emu is trained to predict the next element in a multimodal sequence, whether it is a discrete text token or a continuous visual embedding. This makes it a true "omnivore" model capable of both understanding and generating across modalities.
- Causal Transformer for Vision: The paper proposes a `Causal Transformer` module to convert non-sequential 2D image features into a 1D sequence of causal latent embeddings. This allows images to be modeled autoregressively in a latent space, similar to how text is modeled, without resorting to pixel-level generation.
- Leveraging Diverse Data Sources: Emu's architecture allows it to learn from a wide array of data formats, including not just image-text pairs and web documents, but also video-text pairs and a new dataset introduced by the authors, `YT-Storyboard-1B`, which consists of YouTube storyboard frames interleaved with subtitles.
- A Generalist Multimodal Interface: As a result of its unified training, Emu can act as a single, generalist model for a wide range of tasks. It can perform image-to-text tasks (captioning, VQA) and text-to-image generation, including novel capabilities like in-context image generation (e.g., generating an image in the style of a preceding one) and image blending.
- State-of-the-Art Performance: Emu demonstrates excellent performance on various zero-shot and few-shot benchmarks, often outperforming comparable or even larger models in image/video question answering and showing competitive text-to-image generation quality.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Transformer: A neural network architecture that relies on a mechanism called self-attention to weigh the importance of different parts of the input sequence. It is the foundation for most modern LLMs and LMMs.
- Autoregressive Model: A model that generates a sequence of data one step at a time, where the prediction for each step is conditioned on the previously generated steps. For example, when generating the sentence "The cat sat", the model first predicts "The", then predicts "cat" based on "The", and finally predicts "sat" based on "The cat". Emu extends this concept to sequences containing both text and image embeddings (a minimal decoding sketch follows this list).
- Large Language Model (LLM): A massive autoregressive Transformer model trained on vast amounts of text data to predict the next word in a sentence (e.g., LLaMA, GPT-3). They exhibit emergent abilities in reasoning, understanding, and fluent text generation.
- Large Multimodal Model (LMM): An extension of LLMs that can process and reason about inputs from multiple modalities, typically vision (images/videos) and language (text).
- CLIP (Contrastive Language-Image Pre-training): A model trained to learn the relationship between images and text by mapping them into a shared embedding space. It learns to associate a text description like "a photo of a dog" with images of dogs. Emu uses a powerful variant, `EVA-CLIP`, as its vision encoder.
- Latent Diffusion Models (e.g., Stable Diffusion): A class of generative models that can create high-quality images from text prompts. They work by gradually removing noise from a random signal (in a compressed "latent" space) while being guided by a condition (like text or, in Emu's case, visual embeddings).
- Instruction Tuning: A fine-tuning process where a pretrained model is further trained on a dataset of "instructions" and their desired responses. This helps align the model's behavior with human intent, turning it into a helpful assistant (e.g., ChatGPT).
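To make the autoregressive idea concrete, here is a minimal greedy-decoding sketch in PyTorch. The `model` is a generic placeholder (any module returning next-token logits), not Emu's actual interface.

```python
# Minimal autoregressive (greedy) decoding sketch. The "model" here is a toy
# stand-in that returns random logits; a real LLM would be used in practice.
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=20, eos_id=None):
    """Generate one token at a time, each conditioned on everything before it."""
    ids = input_ids.clone()                      # (1, seq_len) token ids
    for _ in range(max_new_tokens):
        logits = model(ids)                      # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=1)   # append it and condition on it next step
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids

class RandomLM(torch.nn.Module):
    """Toy stand-in model with a fake vocabulary of 100 tokens."""
    def forward(self, ids):
        return torch.randn(ids.shape[0], ids.shape[1], 100)

print(greedy_decode(RandomLM(), torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 23])
```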
-
Previous Works:
- Flamingo: A pioneering LMM that demonstrated strong few-shot learning on multimodal tasks. It connected a pretrained vision encoder to a pretrained LLM using special cross-attention layers, but it was only capable of text output and kept the vision and language models mostly frozen.
- BLIP-2 / InstructBLIP: These models introduced a `Q-Former`, a small, learnable module to bridge the gap between a frozen vision encoder and a frozen LLM. They are highly efficient but, like Flamingo, are primarily designed for vision-to-text tasks.
- Kosmos-1: A multimodal model trained from scratch on web-scale multimodal corpora. It tokenizes images into discrete codes, but its training objective remains focused on next-token prediction for text.
- Prevailing LMMs: Most contemporary LMMs follow the Flamingo paradigm: they connect a vision encoder to an LLM and train only on a language modeling loss (predicting text). They do not generate images and do not have a training signal for the vision components.
-
Differentiation: Emu's primary innovation lies in its unified generative objective. Unlike previous models that are uni-directional (vision -> text), Emu is bi-directional in its generative capability (vision -> text AND text -> vision). It achieves this by:
- Regressing visual embeddings: It explicitly trains the model to predict and generate continuous visual embeddings, not just text tokens.
- Causal latent representation for images: The `Causal Transformer` creates a sequential, autoregressive-friendly representation of an image, making it possible to model image generation in the same framework as text generation.
- End-to-end training: The entire model, including the vision components (via the Causal Transformer), is trained jointly, allowing for deeper cross-modal fusion.
4. Methodology (Core Technology & Implementation)
Emu's architecture is a composite system designed for unified multimodal modeling. It consists of four main parts, as shown in Figure 2 from the paper.
This image is the paper's schematic of Emu's multimodal modeling pipeline (Figure 2): images are encoded into embedding vectors by the EVA-CLIP encoder and, interleaved with text tokens, form the input sequence fed through the Causal Transformer; during training the model alternates between text-token classification and visual-embedding regression, and at inference a Stable Diffusion decoder generates images.
-
Principles: The core idea is to treat all data—text, images, and videos—as a single, interleaved sequence. The model learns to predict the "next thing" in this sequence, regardless of its modality. This is achieved by representing images not as raw pixels or discrete tokens, but as a sequence of continuous, causal embeddings that live in the same space as text token embeddings.
-
Steps & Procedures:
- Input Processing: An input sequence can contain text, images, and video frames.
- Text is tokenized into standard discrete tokens.
- Images and video frames are passed through the Visual Encoder.
- Visual Encoding: The Visual Encoder is a pretrained `EVA-CLIP` model. It takes an image and converts it into a set of dense feature embeddings. These embeddings capture the semantic content of the image but are in a 2D spatial format and are not inherently sequential.
- Causal Transformation: The 2D image embeddings from `EVA-CLIP` are fed into the Causal Transformer. This module's job is to transform the spatial features into a 1D sequence of embeddings that have a causal (left-to-right) dependency.
  - Architecture: The Causal Transformer is similar to a standard Transformer decoder. It uses causal self-attention to ensure that predicting an embedding only depends on the preceding embeddings, and cross-attention to incorporate the spatial information from the `EVA-CLIP` image features.
- Multimodal Sequence Formation: The resulting sequence of visual embeddings is then interleaved with the text tokens. Special tokens `[IMG]` and `[/IMG]` are used to mark the beginning and end of an image's representation in the sequence. For a video, each sampled frame is treated as an image, resulting in multiple blocks of visual embeddings.
- Autoregressive Modeling: This interleaved sequence is fed into the main Multimodal Modeling component, a large language model (`LLaMA-13B`). The LLM processes the sequence autoregressively, learning to predict the next element at each position.
- Output Generation (Inference):
- Text Generation: If the model needs to generate text (e.g., answer a question), it uses its standard language modeling head to predict the next text token.
- Image Generation: To generate an image, a prompt is given, followed by the `[IMG]` token. The model then autoregressively generates the visual embeddings. These embeddings are then passed to the Visual Decoder.
- Visual Decoding: The Visual Decoder is a fine-tuned `Stable Diffusion` model. It takes the visual embeddings generated by the LLM as a condition and uses the diffusion process to generate a realistic, high-resolution image. (A toy sketch of the overall data flow follows this list.)
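As referenced above, the following is a toy sketch of the data flow under stated assumptions: a stand-in vision encoder, a heavily simplified Causal Transformer (cross-attention only), and randomly initialized `[IMG]`/`[/IMG]` embeddings. It only illustrates how an image becomes a fixed number of causal visual embeddings interleaved with text embeddings; it is not the paper's implementation.

```python
# Toy sketch of Emu-style multimodal sequence formation. The real system uses
# EVA-CLIP, a Causal Transformer (causal self-attention + cross-attention),
# LLaMA-13B, and a Stable Diffusion decoder; the classes below are simplified
# placeholders that only show the shapes and the interleaving.
import torch
import torch.nn as nn

D = 512         # shared embedding width (illustrative)
N_VISUAL = 32   # number of causal visual embeddings per image (illustrative)

class ToyVisionEncoder(nn.Module):
    """Stands in for EVA-CLIP: returns a grid of spatial feature vectors."""
    def forward(self, image):                       # image: (B, 3, H, W)
        return torch.randn(image.shape[0], 256, D)  # (B, num_patches, D)

class ToyCausalTransformer(nn.Module):
    """Greatly simplified: learned queries cross-attend to the image features.
    (The actual module also applies causal self-attention over the queries.)"""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_VISUAL, D))
        self.cross_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
    def forward(self, spatial_feats):               # (B, num_patches, D)
        q = self.queries.unsqueeze(0).expand(spatial_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, spatial_feats, spatial_feats)
        return out                                  # (B, N_VISUAL, D)

def build_sequence(text_embeds, visual_embeds, img_tok, img_end_tok):
    """Interleave: [text tokens] [IMG] [visual embeddings] [/IMG]."""
    return torch.cat([text_embeds, img_tok, visual_embeds, img_end_tok], dim=1)

# Usage: one caption plus one image becomes a single sequence for the LLM.
encoder, causal = ToyVisionEncoder(), ToyCausalTransformer()
image = torch.randn(1, 3, 224, 224)
text_embeds = torch.randn(1, 10, D)                                 # 10 text-token embeddings
img_tok, img_end_tok = torch.randn(1, 1, D), torch.randn(1, 1, D)   # [IMG], [/IMG]
seq = build_sequence(text_embeds, causal(encoder(image)), img_tok, img_end_tok)
print(seq.shape)  # torch.Size([1, 44, 512]) = 10 text + 1 + 32 visual + 1
```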
-
Mathematical Formulas & Key Details: The training objective is to maximize the log-likelihood of the multimodal sequences in the dataset $\mathcal{D}$. A sequence $x = (x_1, x_2, \dots, x_n)$ can contain both text tokens and visual embeddings. The objective is:

$$\max_{\theta} \sum_{x \in \mathcal{D}} \sum_{i=1}^{n} \log p_{\theta}(x_i \mid x_1, \dots, x_{i-1})$$

- Symbol Explanation:
  - $\theta$: The parameters of the Emu model.
  - $\mathcal{D}$: The training dataset of multimodal sequences.
  - $x$: A single multimodal sequence from the dataset, with elements $x_1, \dots, x_n$.
  - $x_i$: The $i$-th element in the sequence, which can be a text token or a visual embedding.
  - $p_{\theta}(x_i \mid x_1, \dots, x_{i-1})$: The probability of the $i$-th element given all previous elements, as predicted by the model. This is the core of autoregressive modeling.

This unified objective is implemented with two different loss functions depending on the modality of the target element $x_i$:

- For discrete text tokens: A standard cross-entropy loss is used. The model predicts a probability distribution over the entire vocabulary, and the loss measures how different this is from the ground-truth token.
- For continuous visual embeddings: An $\ell_2$ regression loss (mean squared error) is used. The model predicts a continuous vector, and the loss measures the Euclidean distance between the predicted embedding and the ground-truth embedding (which was generated by the Causal Transformer from the real image).
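A minimal sketch of how the two losses could be combined in practice, assuming precomputed LLM hidden states, a classification head, and a regression head. The head names, shapes, and the simple 1:1 loss weighting are illustrative assumptions, not the authors' training code.

```python
# Unified objective sketch: cross-entropy where the target is a discrete text
# token, l2 (MSE) regression where the target is a continuous visual embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

def unified_loss(hidden, text_head, regress_head,
                 text_targets, visual_targets, is_text):
    """
    hidden:         (B, L, D) LLM output states, aligned so position i predicts target i
    text_targets:   (B, L)    target token ids (used where is_text is True)
    visual_targets: (B, L, D) target visual embeddings (used where is_text is False)
    is_text:        (B, L)    bool mask selecting text-token positions
    """
    logits = text_head(hidden)                                    # (B, L, vocab)
    ce = F.cross_entropy(logits[is_text], text_targets[is_text])  # classification loss
    pred_visual = regress_head(hidden)                            # (B, L, D)
    l2 = F.mse_loss(pred_visual[~is_text], visual_targets[~is_text])  # regression loss
    return ce + l2   # equal weighting here is an assumption for illustration

# Toy usage with random tensors (B=2, L=6, D=16, vocab=100).
B, L, D, V = 2, 6, 16, 100
text_head, regress_head = nn.Linear(D, V), nn.Linear(D, D)
is_text = torch.tensor([[True, True, False, False, True, False]] * B)
loss = unified_loss(torch.randn(B, L, D), text_head, regress_head,
                    torch.randint(V, (B, L)), torch.randn(B, L, D), is_text)
print(loss.item())
```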
5. Experimental Setup
-
Datasets: Emu is pretrained on a massive and diverse collection of data:
- Image-Text Pairs: `LAION-2B` (2 billion image-text pairs from the web) and `LAION-COCO` (a 600M subset of LAION with higher-quality captions generated by BLIP).
- Interleaved Image and Text: `Multimodal-C4 (MMC4)`, a large dataset of web documents containing interleaved images and text, crucial for learning in-context reasoning.
- Video-Text Pairs: `WebVid-10M`, a dataset of 10 million short videos with descriptive text captions.
- Interleaved Video and Text: `YT-Storyboard-1B`, a new dataset collected by the authors from YouTube. It consists of 1.8 billion storyboard thumbnails (key frames) paired with their corresponding subtitles, ordered by timestamp to form a naturally interleaved sequence (as shown in Figure 3; a toy interleaving sketch follows this list).
This image is Figure 3: video storyboard images and subtitles are ordered by timestamp to form an interleaved video-text data sequence, illustrating the temporal alignment of the multimodal data.
For instruction tuning, the model is fine-tuned on public datasets including `ShareGPT` and `Alpaca` (language instructions), `LLaVA` (image-text instructions), and `VideoChat`/`Video-ChatGPT` (video instructions).
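As referenced in the dataset list, here is a toy sketch of timestamp-based interleaving in the spirit of `YT-Storyboard-1B`. The `Frame`/`Subtitle` records and their field names are hypothetical, chosen only to illustrate the idea of merging visual and textual elements into one temporally ordered stream.

```python
# Toy sketch: merge storyboard frames and subtitle snippets into a single
# timestamp-ordered, interleaved video-text sequence.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Frame:
    timestamp: float
    image_path: str

@dataclass
class Subtitle:
    timestamp: float
    text: str

def interleave(frames: List[Frame], subs: List[Subtitle]) -> List[Union[Frame, Subtitle]]:
    """Return one sequence containing both element types, ordered by timestamp."""
    return sorted(frames + subs, key=lambda item: item.timestamp)

# Usage: the result alternates naturally between visual and text elements.
seq = interleave(
    [Frame(0.0, "shot_000.jpg"), Frame(5.0, "shot_001.jpg")],
    [Subtitle(1.2, "welcome back to the channel"), Subtitle(6.0, "today we build a table")],
)
print([type(x).__name__ for x in seq])  # ['Frame', 'Subtitle', 'Frame', 'Subtitle']
```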
-
Evaluation Metrics:
- CIDEr (Consensus-based Image Description Evaluation): Used for image captioning.
  - Conceptual Definition: CIDEr measures the similarity of a machine-generated sentence to a set of human-written reference sentences. It evaluates consensus by treating each sentence as a "bag of words" (represented by TF-IDF vectors) and calculating the average cosine similarity between the candidate sentence and the reference sentences. It is designed to reward captions that capture concepts shared among human descriptions.
  - Mathematical Formula:
    $$\text{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j=1}^{m} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}$$
  - Symbol Explanation:
    - $c_i$: The candidate caption for image $i$.
    - $S_i = \{s_{i1}, \dots, s_{im}\}$: The set of reference (human) captions for image $i$.
    - $g^n(\cdot)$: A function that maps a sentence to a vector of its n-grams, weighted by TF-IDF.
    - $\cdot$ (between vectors): The dot product operator.
    - $\lVert \cdot \rVert$: The L2 norm (magnitude) of the vector. The final CIDEr score is a weighted sum of $\text{CIDEr}_n$ scores over different n-gram lengths (typically $n = 1$ to $4$).
- VQA Accuracy: Used for Visual Question Answering tasks.
  - Conceptual Definition: This metric measures the percentage of questions for which the model provides the correct answer. For the VQAv2 dataset, answers can be multi-word, and the evaluation is soft, giving partial credit if the model's answer matches one of the top human answers.
  - Mathematical Formula:
    $$\text{Acc}(a) = \min\left( \frac{\#\{\, t \in T : t = a \,\}}{3}, \; 1 \right)$$
  - Symbol Explanation:
    - $a$: The answer predicted by the model.
    - $T$: The set of 10 ground-truth answers provided by humans.
    - The formula counts how many of the 10 human annotators agree with the model's answer. This count is divided by 3 and capped at 1 to give partial credit. The final accuracy is the average over all questions.
- FID (Fréchet Inception Distance): Used for text-to-image generation.
  - Conceptual Definition: FID measures the quality and diversity of generated images compared to a set of real images. It calculates the "distance" between the feature distributions of the generated images and real images. A lower FID score indicates that the generated images are statistically more similar to real images, suggesting higher quality and better diversity.
  - Mathematical Formula:
    $$\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$
  - Symbol Explanation:
    - $\mu_r$ and $\mu_g$: The means of the feature vectors (from an InceptionV3 model) for real images ($r$) and generated images ($g$), respectively.
    - $\Sigma_r$ and $\Sigma_g$: The covariance matrices of the feature vectors for real and generated images.
    - $\mathrm{Tr}$: The trace of a matrix (sum of diagonal elements). (A small code sketch of the VQA accuracy and FID formulas follows this list.)
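As referenced above, here is a small sketch of the VQA accuracy and FID computations under stated assumptions: feature extraction with InceptionV3 is omitted (`real_feats`/`gen_feats` are precomputed arrays), and no benchmark-specific answer normalization is applied. It mirrors the standard formulas, not any particular benchmark's reference implementation.

```python
# Sketch of the VQA accuracy and FID formulas using numpy/scipy.
import numpy as np
from scipy.linalg import sqrtm
from typing import List

def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """min(#humans who gave this answer / 3, 1) for a single question."""
    pred = prediction.strip().lower()
    matches = sum(a.strip().lower() == pred for a in human_answers)
    return min(matches / 3.0, 1.0)

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}) over (N, d) feature arrays."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # tiny imaginary parts can appear numerically
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))

# Usage on toy data.
print(vqa_accuracy("two", ["two", "2", "two", "two", "2", "two", "two", "two", "2", "two"]))  # 1.0
rng = np.random.default_rng(0)
print(round(fid(rng.normal(size=(256, 8)), rng.normal(loc=0.5, size=(256, 8))), 3))
```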
-
Baselines: Emu is compared against several state-of-the-art models, including:
- LMMs for understanding: `PALI-X-55B`, `MetaLM`, `Kosmos-1`, `Flamingo-9B`.
- LMMs for generation: `GILL`.
- Unimodal generative models: `GLIDE`, `DALL-E 2`, `Imagen`, `Stable Diffusion v1.5 (SDv1.5)`.
- Instruction-tuned models: `LLaMA-Adapter`, `MiniGPT-4`, `InstructBLIP`, `LLaVA-65B`.
6. Results & Analysis
-
Core Results:
Multimodal Understanding (Zero-shot): Table 1 shows Emu's zero-shot performance on various understanding tasks. The instruction-tuned version is `Emu-I`. An asterisk (*) indicates prompting with two text-only examples.

Manual Transcription of Table 1

| Models | COCO | NoCaps | Flickr30K | VQAv2 | OKVQA | VizWiz | MSVDQA | MSRVTTQA | NExTQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PALI-X-55B | 149.2 | 126.3 | - | 86.0 | 66.1 | - | - | 47.1 | 38.3 |
| MetaLM | 82.2 | 58.7 | - | 43.3 | 41.1 | 11.4 | - | - | - |
| Kosmos-1 | 84.7 | - | 67.1 | 51.0 | - | 29.2 | - | - | - |
| Flamingo-9B* | 79.4 | - | 61.5 | 51.8 | 44.7 | 28.8 | 48.0 | 30.2 | 13.7 |
| Emu | 112.4 | 96.5 | 72.0 | 52.0 | 38.2 | 34.2 | 47.4 | 8.3 | 19.6 |
| Emu* | - | - | - | 52.9 | 42.8 | 34.4 | 47.8 | 18.8 | 17.8 |
| Emu-I | 120.4 | 108.8 | 77.4 | 57.2 | 43.4 | 32.2 | 43.0 | 34.6 | 21.2 |
| Emu-I* | - | - | - | 62.0 | 49.2 | 38.3 | 51.1 | - | 19.9 |

(COCO through VizWiz are image-text tasks; MSVDQA, MSRVTTQA, and NExTQA are video-text tasks.)

- Analysis: Emu achieves a very strong CIDEr score of 112.4 on COCO captioning, significantly outperforming `Flamingo-9B` and `Kosmos-1`. The instruction-tuned `Emu-I` further boosts this to 120.4. On VQA tasks, Emu is competitive with or better than `Flamingo-9B`, and `Emu-I` shows substantial gains, even surpassing the much larger `Flamingo-80B` on some tasks (as mentioned in the text). The powerful `PALI-X-55B` remains the top performer on most benchmarks it was evaluated on, but Emu's results are impressive for its size and novel training approach.
Text-to-Image Generation Table 2 shows the zero-shot FID score on MS-COCO. Lower is better.
Manual Transcription of Table 2
| Models | FID (↓) |
| --- | --- |
| *Unimodal generation models* | |
| GLIDE | 12.24 |
| Make-A-Scene | 11.84 |
| DALL-E 2 | 10.39 |
| SDv1.5 | 9.93 |
| Imagen | 7.27 |
| Parti | 7.23 |
| *Multimodal generation models* | |
| GILL | 12.20 |
| Emu (ours) | 11.66 |
GILL(FID 12.20), another LMM capable of image generation. However, it is still inferior to specialized text-to-image models likeStable Diffusion v1.5(FID 9.93) and state-of-the-art models likeImagen. The authors suggest this gap could be due to the short training duration for the visual decoder and the difference between the conditioning spaces (Emu's visual embeddings vs. text embeddings used for SD's pretraining).
- Analysis: Emu achieves a very strong
-
Ablations / Parameter Sensitivity (Few-shot Evaluation):
The few-shot evaluation in Table 3 acts as a study of Emu's in-context learning ability. Here, k is the number of example shots provided in the prompt.
Manual Transcription of Table 3
| Models | VQAv2 k=2 | VQAv2 k=4 | VQAv2 k=8 | VizWiz k=2 | VizWiz k=4 | VizWiz k=8 | MSVDQA k=2 | MSVDQA k=4 | MSVDQA k=8 | MSRVTTQA k=2 | MSRVTTQA k=4 | MSRVTTQA k=8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Kosmos-1 | 51.4 | 51.8 | 51.4 | 31.4 | 35.3 | 39.0 | - | - | - | - | - | - |
| Flamingo-9B | - | 56.3 | 58.0 | - | 34.9 | 39.4 | - | 36.2 | 40.8 | - | 18.2 | 23.9 |
| PALI-X | - | 56.9 | 57.1 | - | - | - | - | - | - | - | - | - |
| Emu | 56.4 | 58.4 | 59.0 | 37.8 | 41.3 | 43.9 | 36.0 | 37.1 | 39.8 | 21.2 | 21.8 | 24.1 |

- Analysis: Emu consistently outperforms `Flamingo-9B` and `Kosmos-1` in few-shot settings. For instance, on VQAv2 with 4 shots, Emu scores 58.4% vs. Flamingo's 56.3%. More importantly, there is a clear positive correlation between the number of shots (k) and performance, demonstrating Emu's strong in-context learning ability, which is a key benefit of training on interleaved data. (An illustrative k-shot prompt-assembly sketch follows.)
-
Qualitative Evaluation: The paper provides compelling qualitative examples that showcase capabilities beyond standard benchmarks.
-
Generalist Interface (Figure 1): This figure illustrates Emu's versatility, showing it can handle captioning, VQA, in-context learning (generating a description for a third image after seeing two examples), and text-to-image generation all within a single interface.
This image is a schematic showing Emu's capability as a generalist interface for multimodal tasks, with examples spanning image captioning, image question answering, in-context completion, image-to-text generation, text-to-image generation, and video question answering.
-
Advanced Understanding (Figure 4): This figure demonstrates Emu's ability to reason over multiple interleaved images, ground its answers in real-world knowledge (identifying Mahatma Gandhi), and provide detailed descriptions of video content.
This image is the paper's Figure 4, showing examples of multimodal understanding: reasoning over multiple interleaved images, real-world knowledge question answering, and fine-grained understanding of video content.
-
Multimodal Assistant (Figure 5, 6, 11): After instruction tuning, `Emu-I` acts as a capable multimodal assistant, engaging in multi-turn dialogues (Figure 6), following complex instructions (Figure 11, where it correctly lists books by an author before recommending one), and providing nuanced captions (Figure 10).
This image compares video-content understanding: given input video frames and a question, it shows the answers of Emu, Video-ChatGPT, and ImageBind-LLM, highlighting Emu's advantage in capturing video details.
This image is an impressionist-style oil painting of boats on the water with a sunset in the distance; the soft overall palette captures the natural play of light on the water and sky.
This image is a chart comparing Emu with other multimodal models (LLaVA, mPLUG-Owl, InstructBLIP) on generating relevant text responses from an input image and text instruction.
In-context Image Generation (Figure 8): This is a standout capability. When prompted with an image in a specific style (e.g., an oil painting) followed by a text prompt, Emu generates a new image that adheres to both the text prompt and the visual style of the context image. This shows that the visual information is being used as a style condition for generation.
This image is Figure 8, showing examples of in-context text-to-image generation: input text prompts on the left and the corresponding generated images on the right (including a portrait of an elderly person, an oil painting of sunflowers, a cute dog, and a cat), demonstrating the model's multimodal generation capability.
-
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully presents Emu, a large multimodal model that pioneers a unified autoregressive pretraining objective for both text and vision. By learning to predict the next element in a sequence—be it a text token or a visual embedding—Emu becomes a versatile, generalist model. It effectively leverages diverse data sources, including videos, and demonstrates impressive capabilities in both understanding and generation tasks, such as in-context image generation and image blending, setting a new direction for the development of LMMs.
-
Limitations & Future Work: The authors transparently acknowledge several limitations:
- Hallucination: Like other LMMs, Emu can generate factually incorrect or nonsensical content, both in text and image domains.
- Inference Speed: The autoregressive nature of generation is inherently slow.
- Static Knowledge: The model's knowledge is frozen after pretraining and does not update.
- Language Bias: The model is trained predominantly on English data and performs poorly in other languages.
- Ethical Risks: The model inherits biases from its training data (sourced from the internet) and its base models (LLaMA, Stable Diffusion), and could potentially generate harmful or inappropriate content.
-
Personal Insights & Critique:
- Novelty & Impact: The core contribution, unifying image and text generation under a single autoregressive objective, is highly significant. It moves beyond the prevalent "vision-encoder + LLM" paradigm and treats vision as a first-class citizen in the generative process. The `Causal Transformer` is a clever solution to make non-sequential image data compatible with autoregressive models. This work likely paves the way for future "omnivore" models that can seamlessly process and generate an even wider range of modalities (e.g., audio, 3D).
- Potential Improvements: While the FID score for image generation is good, it doesn't match specialized models. As the authors note, more extensive training of the visual decoder could close this gap. Additionally, exploring alternatives to the $\ell_2$ loss for visual embeddings, such as a perceptual or adversarial loss, might improve the quality of the generated latent codes.
- Untested Assumptions: The paper assumes that a 1D causal sequence is a sufficient representation for a 2D image's content in the latent space. While the results are strong, the inherent 2D structure of images is lost. Exploring more complex latent structures (e.g., 2D grids of latent variables) could be a fruitful area for future research.
- Overall: Emu is a strong and well-executed piece of research that pushes the boundaries of what multimodal models can do. It provides a solid architectural and training framework for building models that not only understand the world through multiple senses but can also create new content across those same senses.