Show-o2: Improved Native Unified Multimodal Models

Keywords: Autoregressive Modeling, 3D Causal Variational Autoencoder, Unified Multimodal Models, Flow Matching, Multimodal Vision-Language Generation

TL;DR Summary

Show-o2 unifies autoregressive and flow matching on a 3D causal VAE space with dual-path fusion, enabling scalable image and video understanding and generation through a native multimodal model with two-stage training.

Abstract

This paper presents improved native unified multimodal models, \emph{i.e.,} Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

English Analysis

1. Bibliographic Information

  • Title: Show-o2: Improved Native Unified Multimodal Models
  • Authors: Jinheng Xie (Show Lab, National University of Singapore), Zhenheng Yang (ByteDance), Mike Zheng Shou (Show Lab, National University of Singapore).
  • Journal/Conference: This paper is a preprint available on arXiv. An arXiv paper has not yet undergone formal peer review for publication in a journal or conference. It represents an early dissemination of research findings.
  • Publication Year: 2025 (first version submitted in June 2025).
  • Abstract: The paper introduces Show-o2, an improved "native" unified multimodal model that combines autoregressive modeling for text and flow matching for visuals. The model operates on a unified visual representation derived from a 3D causal variational autoencoder (VAE) space, making it scalable to both images and videos. This representation is built using a dual-path fusion of spatial and temporal information. The model architecture is based on a large language model (LLM) with a dedicated language head and flow head. A novel two-stage training recipe is proposed to effectively train the model and scale it to larger sizes. The authors report that Show-o2 excels at a wide variety of multimodal tasks, including understanding and generating content across text, images, and videos.
  • Original Source Link: https://arxiv.org/abs/2506.15564

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: The field of AI has seen separate successes in Large Language Models (LLMs) for text understanding, Large Multimodal Models (LMMs) for visual understanding, and generative models for creating images/videos. The next frontier is to create a single, Unified Multimodal Model (UMM) that can both understand and generate content across multiple modalities (text, image, video).
    • Gaps in Prior Work: Existing approaches to creating UMMs either stitch together separate, pre-trained models (which can be inefficient and complex) or struggle to train a single "native" model without degrading its original language abilities. Furthermore, many UMMs focus only on text and images, lacking robust, scalable support for video.
    • Fresh Angle / Innovation: Show-o2 proposes a "native" UMM architecture that is inherently designed for both understanding and generation across text, images, and video. Its key innovations are:
      1. A unified visual representation built on a 3D causal VAE, making it naturally scalable to video.
      2. A hybrid modeling approach using autoregressive modeling for text and flow matching (a modern, efficient generative technique) for visuals within a single LLM backbone.
      3. A practical two-stage training recipe that first teaches the model visual generation without harming its language skills, and then fine-tunes the entire system for integrated performance.
  • Main Contributions / Findings (What):

    • Novel Model Architecture: The paper presents Show-o2, a native UMM that integrates an LLM with a language head for text prediction and a flow head for image/video generation.
    • Scalable Visual Representation: A dual-path mechanism is designed to create a unified visual representation from a 3D causal VAE space. This captures both high-level semantics and low-level details, and is inherently compatible with both images and videos.
    • Efficient Training Strategy: A two-stage training pipeline is introduced. This method avoids the need for massive text corpora to prevent "catastrophic forgetting" of language abilities, making it more efficient and effective for scaling up models.
    • State-of-the-Art Performance: The Show-o2 models are shown to achieve state-of-the-art or highly competitive results on a wide array of benchmarks for both multimodal understanding (e.g., visual question answering) and visual generation (e.g., text-to-image, text-to-video).

3. Prerequisite Knowledge & Related Work

This section explains the foundational concepts needed to understand the paper, based on its Introduction and Related Work sections.

  • Foundational Concepts:

    • Large Language Models (LLMs): These are massive neural networks (e.g., GPT, Llama) trained on vast amounts of text data. Their core capability is predicting the next word in a sequence, which allows them to perform tasks like text generation, summarization, and question answering.
    • Large Multimodal Models (LMMs): These are an extension of LLMs that can process and reason about inputs from multiple modalities, most commonly text and images. They typically use a "vision encoder" to convert an image into a format that the LLM can understand, allowing for tasks like describing an image or answering questions about it.
    • Visual Generative Models: These models create new visual content (images or videos) from a given prompt, usually text. Two dominant paradigms are:
      • Autoregressive (AR) Modeling: This approach generates an image pixel by pixel or patch by patch, where each new piece is predicted based on the ones generated before it. It is similar to how LLMs generate text word by word.
      • Diffusion Modeling: This technique starts with random noise and gradually "denoises" it over several steps to form a coherent image, guided by the text prompt. It is known for producing high-quality images but can be slow due to the iterative steps.
    • Flow Matching: A more recent generative modeling technique related to diffusion. Instead of learning to reverse a noising process step by step, flow matching learns a continuous transformation (a "vector field" or "flow") that directly maps a simple noise distribution to the complex distribution of real images. This can be more efficient and stable to train than diffusion models (a minimal training-step sketch follows this list).
    • Variational Autoencoder (VAE): A type of neural network consisting of an encoder and a decoder. The encoder compresses input data (like an image) into a low-dimensional latent space, and the decoder reconstructs the original data from this compressed representation. The 3D causal VAE used in this paper is a specialized version designed to handle video data by considering the temporal dimension.
    • Unified Multimodal Models (UMMs): The central topic of the paper. These models aim to combine the capabilities of LMMs (understanding) and visual generative models (generation) into a single, cohesive system. The paper distinguishes between two main types:
      1. Native UMMs: A single model is trained from the ground up to perform both understanding and generation tasks (e.g., Show-o2, Chameleon).
      2. Assembling Tailored Models: Separate, specialized models for understanding (e.g., an LMM) and generation (e.g., Stable Diffusion) are connected using "adapters" or other lightweight components (e.g., NExT-GPT).
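
To make the flow-matching idea concrete, here is a minimal training-step sketch in PyTorch. It is illustrative only: the tiny `velocity_net` MLP and all tensor sizes are hypothetical stand-ins, not anything from the paper; the point is that the network regresses the constant velocity `x1 - x0` along a straight-line path between noise and data.

```python
# Minimal flow-matching training step (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn

# Hypothetical velocity network: takes (x_t, t) and predicts a velocity the same size as x_t.
velocity_net = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))

def flow_matching_loss(x1: torch.Tensor) -> torch.Tensor:
    """x1: clean data latents of shape (batch, 16)."""
    x0 = torch.randn_like(x1)                          # pure noise sample
    t = torch.rand(x1.size(0), 1)                      # time step in [0, 1]
    xt = t * x1 + (1 - t) * x0                         # linear interpolation between noise and data
    target_v = x1 - x0                                 # ground-truth velocity along the straight path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))  # predicted velocity field
    return ((pred_v - target_v) ** 2).mean()           # simple regression objective

loss = flow_matching_loss(torch.randn(8, 16))
loss.backward()
```

At inference time, the learned velocity field is integrated from t = 0 to t = 1 (e.g., with a few Euler steps) to carry noise toward data, which is where the efficiency advantage over many-step diffusion sampling comes from.
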
  • Differentiation: The paper positions Show-o2 against other UMMs in Table 1.

    Manually transcribed Table 1: Comparative analysis of selected unified multimodal models. The original ✓/✗ entries for unified vs. decoupled representation, video support, and native vs. assembled modeling did not transcribe cleanly; the × and * marks below are reproduced as transcribed and may be shifted, so refer to Table 1 in the paper for the full classification.

    | Method | ✓/✗ (as transcribed) | Paradigm |
    | --- | --- | --- |
    | Chameleon [102] | | AR |
    | Transfusion [147] | | AR + Diff. |
    | Show-o [128] | | AR + Diff. |
    | VILA-U [123] | | AR |
    | Emu3 [114] | | AR |
    | LlamaFusion [95] | | AR + Diff. |
    | Show-o2 (Ours) | | AR + Diff. |
    | Janus-Series [26, 27, 79] | × | AR (+ Diff.) |
    | UniFluid [38] | × | AR + MAR |
    | Mogao [65] | × | AR + Diff. |
    | BAGEL [32] | | AR + Diff. |
    | NExT-GPT [120] | | AR + Diff. |
    | SEED-X [40] | × | AR + Diff. |
    | ILLUME [111] | × | AR + Diff. |
    | MetaMorph [106] | × | AR + Diff. |
    | MetaQueries [83] | × | AR + Diff. |
    | TokenFlow* [89] | * | AR |

    (Note: The paradigm for Show-o2 is listed as "AR + Diff.", but the paper uses flow matching, which is a diffusion-like paradigm.)

    Show-o2 distinguishes itself by being a native UMM that uses a unified representation for both understanding and generation, and crucially, offers native support for video. This combination is rare among existing models.

4. Methodology (Core Technology & Implementation)

The core of Show-o2 is its architecture and training process, designed to natively unify understanding and generation for text, images, and videos.

  • Principles: The main idea is to build upon a pre-trained LLM and equip it with the ability to "read" and "write" visual information. This is achieved by converting visuals into a special sequence of tokens that the LLM can process, and then training dedicated "heads" to handle text output (understanding) and visual output (generation) separately but within the same model.

  • Steps & Procedures (Overall Architecture): The overall workflow is illustrated in Figure 1 from the paper.

    Figure 1: The approach begins by encoding input texts, images, and videos into continuous embeddings and visual latents; the visual latents are processed through dual-path extraction and spatial (-temporal) fusion before entering the language model. (The illustration shows a multimodal narrative-generation example: consecutive video frames of changing clouds on top, with the corresponding text descriptions below, depicting joint image-text representation and generation.)

    1. Input Processing: Input text, images, and videos are fed into the model.
      • Text is converted into text embeddings by a standard tokenizer.
      • Images and videos are processed by a 3D causal VAE encoder, which converts them into a compressed sequence of visual latents.
    2. Unified Visual Representation: The raw visual latents are not directly fed to the LLM. Instead, they pass through a dual-path mechanism to create a richer representation.
      • Semantic Path: Semantic layers $\mathcal{S}(\cdot)$, based on a pre-trained SigLIP model, extract high-level, abstract information (the "what").
      • Detail Path: A projector $\mathcal{P}(\cdot)$ preserves low-level, fine-grained details and structural information (the "how").
      • Fusion: The outputs of both paths are combined via Spatial (-Temporal) Fusion (STF) to create the final unified visual representations. This ensures the model gets both the semantic gist and the pixel-level details.
    3. Sequence Modeling: The text embeddings and unified visual representations are arranged into a single sequence, which is then fed into the pre-trained LLM. The model uses omni-attention, which allows it to process the sequence causally (left-to-right) while enabling full self-attention within the visual tokens.
    4. Output Generation: The LLM's output is directed to one of two heads:
      • Language Head: For text-based tasks (e.g., answering a question), this head uses standard autoregressive modeling to predict the next text token.
      • Flow Head: For visual generation tasks, this head is trained with flow matching to predict a "velocity field" that transforms noise into the desired image/video latents.
    5. Decoding: The output from the language head is de-tokenized into text. The output from the flow head (visual latents) is passed to the 3D causal VAE decoder to reconstruct the final image or video. (A toy end-to-end sketch of this pipeline appears right after this list.)
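
To tie the five steps together, the toy PyTorch sketch below mirrors the described pipeline under strong simplifying assumptions. Every module (`semantic_layers`, `projector`, `fusion`, the tiny transformer backbone, and the two heads) is a hypothetical stand-in with made-up sizes rather than the actual Show-o2 implementation, and the omni-attention mask, the 3D causal VAE, and flow-matching time conditioning are omitted for brevity.

```python
# Toy sketch of the dual-path + two-head pipeline described above (not the authors' code).
import torch
import torch.nn as nn

D = 64  # hidden size of the toy "LLM"

semantic_layers = nn.Linear(8, D)        # stand-in for the SigLIP-distilled semantic path S(.)
projector       = nn.Linear(8, D)        # stand-in for the low-level detail path P(.)
fusion          = nn.Linear(2 * D, D)    # stand-in for spatial(-temporal) fusion (STF)
llm_backbone    = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
language_head   = nn.Linear(D, 1000)     # next-text-token logits (autoregressive head)
flow_head       = nn.Linear(D, 8)        # velocity prediction over visual latents (flow head)

def forward(text_emb: torch.Tensor, visual_latents: torch.Tensor):
    # Dual-path unified visual representation: u = STF(S(x_t), P(x_t)).
    sem = semantic_layers(visual_latents)
    det = projector(visual_latents)
    u = fusion(torch.cat([sem, det], dim=-1))
    # Arrange text embeddings and unified visual tokens into a single sequence.
    seq = torch.cat([text_emb, u], dim=1)
    h = llm_backbone(seq)                # omni-attention mask omitted in this sketch
    text_logits = language_head(h[:, : text_emb.size(1)])  # for next-token prediction
    velocity    = flow_head(h[:, text_emb.size(1):])       # for flow-matching generation
    return text_logits, velocity

text_logits, velocity = forward(torch.randn(2, 5, D), torch.randn(2, 16, 8))
```
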
  • Mathematical Formulas & Key Details:

    • Unified Visual Representation: A key component is how the model handles visual inputs, especially for generation, which involves noise. For a visual input $\mathbf{X}$, the VAE produces latents, which are noised when training the generative component. The noised latents $\mathbf{x}_t$ at time $t$ are a linear interpolation between pure noise $\mathbf{x}_0$ and the clean data latents $\mathbf{x}_1$:

      $$\mathbf{x}_t = t \cdot \mathbf{x}_1 + (1 - t) \cdot \mathbf{x}_0$$

      • $\mathbf{x}_t$: The visual latents at noise level $t$.

      • $\mathbf{x}_1$: The clean visual latents from the VAE.

      • $\mathbf{x}_0$: A random noise tensor, typically sampled from a standard normal distribution $\mathcal{N}(0, 1)$.

      • $t$: A time-step value in $[0, 1]$, where $t = 0$ corresponds to pure noise and $t = 1$ to the clean data.

        The semantic layers $\mathcal{S}(\cdot)$ are pre-trained via distillation to mimic the behavior of a powerful vision model (SigLIP) on both clean and noised latents. The distillation loss is:

        $$\mathcal{L}_{\mathrm{distill}} = -\frac{1}{n} \sum \log \, \mathrm{sim}\big(\mathcal{S}(\mathbf{x}_t), \mathrm{SigLIP}(\mathbf{X})\big)$$

      • $\mathcal{L}_{\mathrm{distill}}$: The distillation loss.

      • $\mathcal{S}(\mathbf{x}_t)$: The feature representation produced by the semantic layers for the (possibly noised) latents.

      • $\mathrm{SigLIP}(\mathbf{X})$: The feature representation from the original SigLIP model for the clean input image $\mathbf{X}$.

      • $\mathrm{sim}(\cdot, \cdot)$: Cosine similarity. This loss encourages the semantic layers to produce features similar to SigLIP's.

        The final unified representation $\mathbf{u}$ is created by the fusion mechanism:

        $$\mathbf{u} = \mathrm{STF}\big(\mathcal{S}(\mathbf{x}_t), \mathcal{P}(\mathbf{x}_t)\big)$$

      • $\mathbf{u}$: The unified visual representation fed to the LLM.

      • $\mathrm{STF}$: The Spatial (-Temporal) Fusion function, which combines the semantic and detail features.

      • $\mathcal{P}(\mathbf{x}_t)$: The low-level feature representation from the projector.

    • Training Objective: The model is trained with a combined loss function that addresses both text prediction and visual generation:

      $$\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{NTP}} + \mathcal{L}_{\mathrm{FM}}$$

      • $\mathcal{L}$: The total loss.
      • $\mathcal{L}_{\mathrm{NTP}}$: Next Token Prediction loss for the language head, a standard cross-entropy loss over text tokens.
      • $\mathcal{L}_{\mathrm{FM}}$: Flow Matching loss for the flow head, which measures the difference between the predicted velocity field and the true velocity needed to transform noise into the target image/video latents.
      • $\alpha$: A weighting factor that balances the two loss components. (A toy computation of this combined objective, together with the distillation term, is sketched right after this list.)
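
As a concrete reading of these formulas, the sketch below evaluates the distillation term and the combined objective (α·L_NTP + L_FM) on dummy tensors. It is a minimal interpretation of the equations above, not the authors' training code; the value of α and all shapes are arbitrary placeholders.

```python
# Toy evaluation of the distillation and combined losses (illustrative sketch only).
import torch
import torch.nn.functional as F

def distill_loss(sem_features: torch.Tensor, siglip_features: torch.Tensor) -> torch.Tensor:
    # L_distill = -(1/n) * sum log sim(S(x_t), SigLIP(X)), with sim = cosine similarity.
    cos = F.cosine_similarity(sem_features, siglip_features, dim=-1).clamp_min(1e-6)
    return -cos.log().mean()

def combined_loss(text_logits, text_targets, pred_velocity, x1, x0, alpha=0.2):
    # Next-token prediction: cross-entropy over the text vocabulary.
    l_ntp = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                            text_targets.reshape(-1))
    # Flow matching: regress the velocity field x1 - x0 along x_t = t*x1 + (1-t)*x0.
    l_fm = F.mse_loss(pred_velocity, x1 - x0)
    return alpha * l_ntp + l_fm            # alpha is an arbitrary placeholder weight here

l_distill = distill_loss(torch.randn(2, 16, 64), torch.randn(2, 16, 64))
loss = combined_loss(torch.randn(2, 5, 1000), torch.randint(0, 1000, (2, 5)),
                     torch.randn(2, 16, 8), torch.randn(2, 16, 8), torch.randn(2, 16, 8))
```
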
    • Two-Stage Training Recipe: This is a critical part of the methodology, designed to efficiently train the model without requiring a massive text corpus to retain language knowledge.

      Manually transcribed Table 2: Trainable components and datasets in the training stages.

      | Stage | Trainable Components | Image-Text Data | Video-Text Data | Interleaved Data |
      | --- | --- | --- | --- | --- |
      | Stage-1 | Projector; Spatial (-Temporal) Fusion; Flow Head | 66M | WebVid [8]; Pandas [23] | OmniCorpus [60] |
      | Stage-2 | Full Model (w/o VAE) | 9M HQ Und.; 16M HQ Gen. | OpenVid-1M [80] (Gen.); 1.5M Internal Data (Gen.); 1.6M Video Und. | VIST [47]; CoMM [24] |
      • Stage 1: Learning Visual Generation. The core LLM is mostly frozen. Training focuses only on the newly added generative components: the projector, the fusion module, and the flow head. This stage uses a large dataset of image-text and video-text pairs to teach the model how to generate visuals from text prompts, without corrupting the LLM's pre-existing language abilities.
      • Stage 2: Integrated Fine-tuning. The entire model (except the VAE) is unfrozen and fine-tuned on a smaller, higher-quality dataset containing both multimodal understanding and generation instruction data. This stage integrates the two capabilities and aligns the model's behavior with user instructions. (A toy sketch of this freezing schedule follows below.)
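
A minimal sketch of how such a two-stage freezing schedule could be expressed in PyTorch is shown below. The module names (`projector`, `fusion`, `flow_head`, `vae`, and the dummy model itself) are hypothetical, and the real recipe of course also differs in data, schedules, and optimization details.

```python
# Sketch of a two-stage freezing schedule in the spirit of the recipe above (hypothetical names).
import torch.nn as nn

class DummyUMM(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(4, 4)        # stand-in for the pre-trained language model
        self.projector = nn.Linear(4, 4)  # newly added components
        self.fusion = nn.Linear(8, 4)
        self.flow_head = nn.Linear(4, 4)
        self.vae = nn.Linear(4, 4)        # stand-in for the (always frozen) 3D causal VAE

def set_stage(model: nn.Module, stage: int) -> None:
    new_components = ("projector", "fusion", "flow_head")
    for name, param in model.named_parameters():
        if stage == 1:
            # Stage 1: keep the LLM (and VAE) frozen; train only the new generative parts.
            param.requires_grad = name.startswith(new_components)
        else:
            # Stage 2: unfreeze everything except the VAE for integrated fine-tuning.
            param.requires_grad = not name.startswith("vae")

model = DummyUMM()
set_stage(model, stage=1)   # stage-1 pre-training of projector, fusion, and flow head
set_stage(model, stage=2)   # stage-2 full fine-tuning (VAE still frozen)
```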

5. Experimental Setup

  • Datasets:

    • Pre-training (Stage 1): A large-scale dataset of ~66M image-text pairs curated from public sources (CC12M, COYO, LAION-Aesthetic) and synthetic data. For video, WebVid and Pandas datasets were used. For interleaved text-image data, OmniCorpus was used.
    • Fine-tuning (Stage 2): High-quality instruction-following datasets were used.
      • Understanding: 9M samples from Densefusion-1M and LLaVA-OneVision.
      • Generation: 16M high-quality samples filtered from the pre-training data.
      • Video: OpenVid-1M, internal data, VIST, and CoMM.
    • Text Rendering Improvement: A subset of TextAtlas was used to improve the model's ability to render text in generated images.
  • Evaluation Metrics: The paper evaluates Show-o2 on a comprehensive set of benchmarks for both understanding and generation.

    • Multimodal Understanding Benchmarks: MME, GQA, SEED-Bench, MMBench, MMMU, MMStar, AI2D. These benchmarks test various capabilities like perception, attribute recognition, spatial awareness, and reasoning. The primary metric is typically Accuracy.
    • Video Understanding Benchmarks: VOTA, NExT-QA, Perception-Test, MV-Bench. These measure zero-shot video question-answering accuracy.
    • Image Generation Benchmarks:
      • GenEval: Measures how well a model can generate images that follow compositional instructions (e.g., object counts, colors, positions). The metric is Accuracy.
      • DPG-Bench: Evaluates fine-grained text-to-image generation with prompts covering entities, attributes, and relations. The metric is a score based on human preference.
      • OneIG-Bench: A comprehensive benchmark that evaluates alignment, text rendering, reasoning, style, and diversity.
    • Video Generation Benchmarks (VBench): This benchmark evaluates text-to-video and image-to-video generation across many dimensions. The metrics are scores from 0-100 or 0-1.
      • Quality Score (QS): Overall visual quality.
      • Semantic Score (SS): How well the video matches the text prompt.
      • Temporal Flickering (TF): Measures frame-to-frame flicker; less flickering is better.
      • Motion Smoothness (MS): Quality of motion.
      • And many others like Subject Consistency (SC), Dynamic Degree (DD), etc.
  • Baselines: The paper compares Show-o2 against a wide range of models:

    • Understanding-Only Models: LLaVA-v1.5, Qwen-VL-Chat, LLaVA-OV.
    • Generation-Only Models: Stable Diffusion 3 (SD3), DALL-E 3, PixArt-Σ.
    • Assembled UMMs: NExT-GPT, SEED-X, ILLUME.
    • Native UMMs: Show-o (the predecessor), Janus-Pro, Emu3, BAGEL, Mogao.

6. Results & Analysis

  • Core Results:

    Multimodal Understanding on Images and Videos Table 3 shows that Show-o2 performs exceptionally well on understanding tasks.

    Manually transcribed Table 3: Evaluation on multimodal understanding benchmarks.

    Types Models # Params. MME↑ (p) GQA↑ SEED ↑ (all) MMB↑ (en) MMMU↑ (val) MMStar ↑ AI2D ↑
    Und. Only LLaVA-v1.5 [71] 7B 1510.7 62.0 58.6 64.3
    Qwen-VL-Chat [6] 7B 1487.6 57.5 58.2 60.6 57.7
    LLaVA-OV [56] 7B 1580.0 - 80.8 48.8 57.5 81.4
    Unify via Assembling Tailored Models NExT-GPT [120] 13B 57.5 58.0
    SEED-X [40] 17B 1457.0 49.1 66.5 70.1 35.6
    MetaMorph [106] 8B - - 71.8 75.2 - -
    TokenFlow-XL* [89] 14B 1551.1 62.5 72.6 76.8 43.2 75.9
    ILLUME [111] 7B 1445.3 - 72.9 75.1 38.2 71.4
    Native Unified BAGEL [32] 14B 1687.0 85.0 55.3
    Show-o [128] 1.3B 1097.2 58.0 51.5 - 27.4 -
    JanusFlow [79] 1.5B 1333.1 60.3 70.5 74.9 29.3 -
    SynerGen-VL [58] 2.4B 1381.0 - - 53.7 34.2 -
    Janus-Pro [26] 1.5B 1444.0 59.3 68.3 75.5 36.3 -
    Show-o2 (Ours) 1.5B 1450.9 60.0 65.6 67.4 37.1 43.4 69.0
    Emu3 [114] 8B 60.3 68.2 58.5 31.6 - 70.0
    VILA-U [123] 7B 1401.8 60.8 59.0 - - - -
    MUSE-VL [129] 7B - 69.1 72.1 39.7 49.6 69.8
    Liquid [118] 8B 1448.0 61.1 - - - - -
    Janus-Pro [26] 7B 1567.1 62.0 72.1 79.2 41.0 -
    Mogao [65] 7B 1592.0 60.9 74.6 75.0 44.2
    Show-o2 (Ours) 7B 1620.5 63.1 69.8 79.3 48.9 56.6 78.6
    • The 1.5B Show-o2 model is highly competitive with other models of similar size.
    • The 7B Show-o2 model surpasses other powerful native UMMs like Janus-Pro and Mogao on most benchmarks, and even outperforms the much larger TokenFlow-XL (14B) on several key metrics like MME, GQA, and MMMU. This demonstrates the effectiveness of the dual-path visual representation and the overall architecture.

    Visual Generation Tables 5, 6, and 7 show that Show-o2 is also a very strong generative model.

    Manually transcribed Table 5: Evaluation on the GenEval [41] benchmark.

    | Type | Method | # Params. | # Data | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall↑ |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Gen. Only | SD3-Medium [37] | - | - | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
    | Unifying via Assembling Tailored Models | SEED-X [40] | 17B | 158M+ | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
    | | TokenFlow-XL* [89] | 14B | 60M | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
    | | ILLUME [111] | 7B | 15M+ | 0.99 | 0.86 | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
    | | MetaQuery-XL [83] | 7B | 28M+ | - | - | - | - | - | - | 0.80 |
    | Native Unified | Show-o [128] | 1.3B | 2.0B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 |
    | | Emu3 [114] | 8B | - | - | - | - | - | - | - | 0.66 |
    | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
    | | Janus-Pro [26] | 7B | 144M | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
    | | Mogao [65] | 7B | - | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
    | | Show-o2 (Ours) | 7B | 66M | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
    • On compositional benchmarks like GenEval and DPG-Bench (Table 6), Show-o2 achieves competitive or superior performance to other UMMs like Janus-Pro and Mogao, despite being trained on significantly less data (66M vs. 144M+). This highlights the efficiency of the training recipe.
    • On DPG-Bench, the 7B Show-o2 model achieves the highest overall score (86.14) among all listed models, including specialized generation-only models like SD3-Medium.
    • For video generation (Table 8), the Show-o2 model (with only ~2B effective parameters for generation) is competitive with or outperforms much larger models like Show-1 (6B) and Emu3 (8B).

    Qualitative Results Figure 2 showcases the model's versatility: it can describe images in detail, count objects, read text within an image, provide step-by-step instructions based on world knowledge, and generate high-quality images from text prompts.

    Figure 2: Multimodal understanding and generation examples. (The illustration shows Show-o2 applied to image-to-video generation and multimodal generation: consecutively generated video frames on top, with the model's text descriptions for different input images below.)

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents Show-o2, an improved native unified multimodal model that effectively bridges the gap between multimodal understanding and generation. By leveraging a dual-path unified visual representation on a 3D causal VAE space, combining autoregressive modeling with flow matching, and designing an efficient two-stage training recipe, the authors have developed a system that is scalable to images and videos. The model demonstrates state-of-the-art performance on numerous benchmarks, proving that a single, native model can excel at both understanding and generation without compromising on performance or requiring prohibitively large training resources.

  • Limitations & Future Work:

    • Text Rendering: While the paper mentions adding TextAtlas data to improve text rendering, this remains a challenging area for most generative models. The results on OneIG-Bench (Table 7) show a low text score, indicating room for improvement.
    • Computational Cost for Video: The authors note that they did not include interleaved and video data when training the larger 7B model due to the huge computational cost. This suggests that scaling up video capabilities in such unified models is still a significant engineering and resource challenge.
    • Reliance on VAE: The quality of the VAE is a potential bottleneck. Any artifacts or information loss from the VAE encoder/decoder will directly impact the final output quality.
  • Personal Insights & Critique:

    • The two-stage training recipe is arguably the most impactful practical contribution of this paper. It offers an elegant solution to the common problem of "catastrophic forgetting" in LLMs when they are adapted for new, non-textual tasks. By first teaching the new skill (generation) with the LLM frozen, and then holistically fine-tuning, the method preserves valuable language knowledge while efficiently adding new capabilities.
    • The architectural choice of a dual-path visual representation is very insightful. It recognizes that understanding and generation have different feature requirements: understanding needs high-level semantics, while generation needs low-level, pixel-perfect detail. By explicitly modeling and fusing both, the model is better equipped for both tasks.
    • The adoption of flow matching over the more traditional diffusion modeling is a forward-looking choice. Flow matching promises faster training and inference, making the development of these complex UMMs more tractable.
    • Overall, Show-o2 represents a significant step forward in the quest for truly unified AI systems. It provides a strong architectural blueprint and a practical training strategy that will likely influence future work in building versatile, all-in-one multimodal agents.
