Show-o2: Improved Native Unified Multimodal Models
TL;DR Summary
Show-o2 unifies autoregressive modeling and flow matching in a 3D causal VAE space with dual-path fusion, enabling scalable image and video understanding and generation in a native multimodal model trained with a two-stage recipe.
Abstract
This paper presents improved native unified multimodal models, \emph{i.e.,} Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
English Analysis
1. Bibliographic Information
- Title: Show-o2: Improved Native Unified Multimodal Models
- Authors: Jinheng Xie (Show Lab, National University of Singapore), Zhenheng Yang (ByteDance), Mike Zheng Shou (Show Lab, National University of Singapore).
- Journal/Conference: This paper is a preprint available on arXiv. An arXiv paper has not yet undergone formal peer review for publication in a journal or conference. It represents an early dissemination of research findings.
- Publication Year: 2025 (first version submitted in June 2025).
- Abstract: The paper introduces Show-o2, an improved "native" unified multimodal model that combines autoregressive modeling for text and flow matching for visuals. The model operates on a unified visual representation derived from a 3D causal variational autoencoder (VAE) space, making it scalable to both images and videos. This representation is built using a dual-path fusion of spatial and temporal information. The model architecture is based on a large language model (LLM) with a dedicated language head and flow head. A novel two-stage training recipe is proposed to effectively train the model and scale it to larger sizes. The authors report that Show-o2 excels at a wide variety of multimodal tasks, including understanding and generating content across text, images, and videos.
- Original Source Link: https://arxiv.org/abs/2506.15564
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The field of AI has seen separate successes in Large Language Models (LLMs) for text understanding, Large Multimodal Models (LMMs) for visual understanding, and generative models for creating images/videos. The next frontier is to create a single, Unified Multimodal Model (UMM) that can both understand and generate content across multiple modalities (text, image, video).
- Gaps in Prior Work: Existing approaches to creating UMMs either stitch together separate, pre-trained models (which can be inefficient and complex) or struggle to train a single "native" model without degrading its original language abilities. Furthermore, many UMMs focus only on text and images, lacking robust, scalable support for video.
- Fresh Angle / Innovation: Show-o2 proposes a "native" UMM architecture that is inherently designed for both understanding and generation across text, images, and video. Its key innovations are:
  - A unified visual representation built on a 3D causal VAE, making it naturally scalable to video.
  - A hybrid modeling approach using autoregressive modeling for text and flow matching (a modern, efficient generative technique) for visuals within a single LLM backbone.
  - A practical two-stage training recipe that first teaches the model visual generation without harming its language skills, and then fine-tunes the entire system for integrated performance.
- Main Contributions / Findings (What):
  - Novel Model Architecture: The paper presents Show-o2, a native UMM that integrates an LLM with a language head for text prediction and a flow head for image/video generation.
  - Scalable Visual Representation: A dual-path mechanism is designed to create a unified visual representation from a 3D causal VAE space. This captures both high-level semantics and low-level details, and is inherently compatible with both images and videos.
  - Efficient Training Strategy: A two-stage training pipeline is introduced. This method avoids the need for massive text corpora to prevent "catastrophic forgetting" of language abilities, making it more efficient and effective for scaling up models.
  - State-of-the-Art Performance: The Show-o2 models are shown to achieve state-of-the-art or highly competitive results on a wide array of benchmarks for both multimodal understanding (e.g., visual question answering) and visual generation (e.g., text-to-image, text-to-video).
3. Prerequisite Knowledge & Related Work
This section explains the foundational concepts needed to understand the paper, based on its Introduction and Related Work sections.
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT, Llama) trained on vast amounts of text data. Their core capability is predicting the next word in a sequence, which allows them to perform tasks like text generation, summarization, and question answering.
- Large Multimodal Models (LMMs): These are an extension of LLMs that can process and reason about inputs from multiple modalities, most commonly text and images. They typically use a "vision encoder" to convert an image into a format that the LLM can understand, allowing for tasks like describing an image or answering questions about it.
- Visual Generative Models: These models create new visual content (images or videos) from a given prompt, usually text. Two dominant paradigms are:
- Autoregressive (AR) Modeling: This approach generates an image pixel by pixel or patch by patch, where each new piece is predicted based on the ones generated before it. It is similar to how LLMs generate text word by word.
- Diffusion Modeling: This technique starts with random noise and gradually "denoises" it over several steps to form a coherent image, guided by the text prompt. It is known for producing high-quality images but can be slow due to the iterative steps.
- Flow Matching: A more recent generative modeling technique related to diffusion. Instead of learning to reverse a noising process step by step, flow matching learns a continuous transformation (a "vector field" or "flow") that directly maps a simple noise distribution to the complex distribution of real images. This can be more efficient and stable to train than diffusion models (a minimal sketch follows at the end of this list).
- Variational Autoencoder (VAE): A type of neural network consisting of an encoder and a decoder. The encoder compresses input data (like an image) into a low-dimensional latent space, and the decoder reconstructs the original data from this compressed representation. The 3D causal VAE used in this paper is a specialized version designed to handle video data by considering the temporal dimension.
- Unified Multimodal Models (UMMs): The central topic of the paper. These models aim to combine the capabilities of LMMs (understanding) and visual generative models (generation) into a single, cohesive system. The paper distinguishes between two main types:
  - Native UMMs: A single model is trained from the ground up to perform both understanding and generation tasks (e.g., Show-o2, Chameleon).
  - Assembling Tailored Models: Separate, specialized models for understanding (e.g., an LMM) and generation (e.g., Stable Diffusion) are connected using "adapters" or other lightweight components (e.g., NExT-GPT).
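To make the flow-matching description above concrete, here is a minimal, self-contained sketch in plain NumPy (illustrative only, not the authors' implementation). It builds the linear noise-to-data interpolation and the constant velocity target that a flow-matching head would regress, using the convention that t = 0 is pure noise and t = 1 is clean data.

```python
import numpy as np

def flow_matching_pair(x1, rng, t=None):
    """Build one flow-matching training example.

    x1: clean data sample (e.g., VAE latents), any shape.
    Returns the interpolated point x_t, the timestep t, and the
    target velocity (x1 - x0) that a flow head would be trained to predict.
    """
    x0 = rng.standard_normal(x1.shape)   # pure Gaussian noise
    if t is None:
        t = rng.uniform(0.0, 1.0)        # random timestep in [0, 1]
    xt = t * x1 + (1.0 - t) * x0         # linear interpolation path
    velocity_target = x1 - x0            # constant velocity along this path
    return xt, t, velocity_target

rng = np.random.default_rng(0)
clean_latent = rng.standard_normal((16, 16, 4))   # toy "latent" tensor
xt, t, v = flow_matching_pair(clean_latent, rng)
print(xt.shape, round(t, 3), v.shape)
```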
- Differentiation: The paper positions Show-o2 against other UMMs in Table 1.

Manually transcribed table — Table 1: Comparative analysis of selected unified multimodal models. The source columns are Und. & Gen. Representation (Unified / Decoupled), Support Video, Type of Unified Modeling (Native Und. & Gen. / Assembling Tailored Models), and Paradigm; the marks below are kept in the order they were transcribed.

| Method | Marks (as transcribed) | Paradigm |
| --- | --- | --- |
| Chameleon [102] | ✓ ✓ | AR |
| Transfusion [147] | ✓ ✓ | AR + Diff. |
| Show-o [128] | ✓ ✓ | AR + Diff. |
| VILA-U [123] | ✓ ✓ | AR |
| Emu3 [114] | ✓ ✓ | AR |
| LlamaFusion [95] | ✓ ✓ | AR + Diff. |
| Show-o2 (Ours) | ✓ ✓ ✓ | AR + Diff. |
| Janus-Series [26, 27, 79] | × ✓ | AR (+Diff.) |
| UniFluid [38] | × ✓ | AR + MAR |
| Mogao [65] | × ✓ | AR + Diff. |
| BAGEL [32] | ✓ ✓ | AR + Diff. |
| NExT-GPT [120] | ✓ ✓ | AR + Diff. |
| SEED-X [40] | × ✓ | AR + Diff. |
| ILLUME [111] | × ✓ | AR + Diff. |
| MetaMorph [106] | × ✓ | AR + Diff. |
| MetaQueries [83] | × ✓ | AR + Diff. |
| TokenFlow* [89] | ✓ * | AR |

(Note: In the original paper, the checkmarks and crosses for some models, such as Transfusion and Show-o, are shifted, so the per-column placement of the marks is not fully reliable here; the classification reflects the authors' intended grouping based on the text. The paradigm for Show-o2 is listed as "AR + Diff.", but the paper uses flow matching, which is a diffusion-like paradigm.)
Show-o2 distinguishes itself by being a native UMM that uses a unified representation for both understanding and generation and, crucially, offers native support for video. This combination is rare among existing models.
4. Methodology (Core Technology & Implementation)
The core of Show-o2 is its architecture and training process, designed to natively unify understanding and generation for text, images, and videos.
- Principles: The main idea is to build upon a pre-trained LLM and equip it with the ability to "read" and "write" visual information. This is achieved by converting visuals into a special sequence of tokens that the LLM can process, and then training dedicated "heads" to handle text output (understanding) and visual output (generation) separately but within the same model.
- Steps & Procedures (Overall Architecture): The overall workflow is illustrated in Figure 1 from the paper.
(Image caption, translated from Chinese): A schematic of a multimodal narrative-generation process; the top shows consecutive video frames of changing clouds, and the bottom shows the corresponding text descriptions, illustrating the joint representation and generation of images and text.
- Input Processing: Input text, images, and videos are fed into the model.
  - Text is converted into text embeddings by a standard tokenizer.
  - Images and videos are processed by a 3D causal VAE encoder, which converts them into a compressed sequence of visual latents.
- Unified Visual Representation: The raw visual latents are not directly fed to the LLM. Instead, they pass through a dual-path mechanism to create a richer representation.
  - Semantic Path: Semantic layers, based on a pre-trained SigLIP model, extract high-level, abstract information (the "what").
  - Detail Path: A projector preserves low-level, fine-grained details and structural information (the "how").
  - Fusion: The outputs of both paths are combined via Spatial (-Temporal) Fusion (STF) to create the final unified visual representations. This ensures the model gets both the semantic gist and the pixel-level details.
- Sequence Modeling: The text embeddings and unified visual representations are arranged into a single sequence, which is then fed into the pre-trained LLM. The model uses omni-attention, which allows it to process the sequence causally (left to right) while enabling full self-attention within the visual tokens (a mask sketch follows this list).
- Output Generation: The LLM's output is directed to one of two heads:
  - Language Head: For text-based tasks (e.g., answering a question), this head uses standard autoregressive modeling to predict the next text token.
  - Flow Head: For visual generation tasks, this head is trained with flow matching to predict a velocity field that transforms noise into the desired image/video latents.
- Decoding: The output from the language head is de-tokenized into text. The output from the flow head (visual latents) is passed to the 3D causal VAE decoder to reconstruct the final image or video.
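The omni-attention mentioned in the Sequence Modeling step can be pictured as a block-wise mask: causal over the whole sequence, but fully bidirectional inside each span of visual tokens. The sketch below is an illustrative reconstruction of that idea (the token layout and function name are assumptions, not the paper's code).

```python
import numpy as np

def omni_attention_mask(segments):
    """Build a boolean attention mask (True = may attend).

    segments: list of (kind, length) tuples in sequence order, where kind is
              "text" (causal) or "visual" (full attention within the segment,
              causal with respect to all earlier tokens).
    """
    total = sum(n for _, n in segments)
    mask = np.tril(np.ones((total, total), dtype=bool))        # start fully causal
    start = 0
    for kind, n in segments:
        if kind == "visual":
            mask[start:start + n, start:start + n] = True      # bidirectional inside the visual block
        start += n
    return mask

# Example: a 4-token text prompt followed by 6 visual tokens.
m = omni_attention_mask([("text", 4), ("visual", 6)])
print(m.astype(int))
```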
- Mathematical Formulas & Key Details:
- Unified Visual Representation: A key component is how the model handles visual inputs, especially for generation, which involves noise. For a visual input $\mathbf{v}$, the VAE produces clean latents $\mathbf{x}_1$. These latents are noised for training the generative component. The noised latents at time $t$ are a linear interpolation between pure noise and the clean data latents:

  $$\mathbf{x}_t = t\,\mathbf{x}_1 + (1 - t)\,\mathbf{x}_0$$

  - $\mathbf{x}_t$: The visual latents at noise level $t$.
  - $\mathbf{x}_1$: The clean visual latents from the VAE.
  - $\mathbf{x}_0$: A random noise tensor, typically sampled from a standard normal distribution $\mathcal{N}(0, \mathbf{I})$.
  - $t$: A time-step value in $[0, 1]$, where $t = 0$ is pure noise and $t = 1$ is the clean data.

  The semantic layers are pre-trained via distillation to mimic the behavior of a powerful vision model (SigLIP) on both clean and noised latents. The loss for this distillation is:

  $$\mathcal{L}_{\text{distill}} = 1 - \cos\big(S(\mathbf{x}_t),\, F_{\text{SigLIP}}(\mathbf{v})\big)$$

  - $\mathcal{L}_{\text{distill}}$: The distillation loss.
  - $S(\mathbf{x}_t)$: The feature representation from the semantic layers for the (possibly noised) latents.
  - $F_{\text{SigLIP}}(\mathbf{v})$: The feature representation from the original SigLIP model for the clean input image $\mathbf{v}$.
  - $\cos(\cdot,\cdot)$: Cosine similarity. This loss encourages the semantic layers to produce features similar to SigLIP's features.

  The final unified representation is created by the fusion mechanism:

  $$\mathbf{u} = \mathrm{STF}\big(S(\mathbf{x}_t),\, P(\mathbf{x}_t)\big)$$

  - $\mathbf{u}$: The unified visual representation fed to the LLM.
  - $\mathrm{STF}(\cdot,\cdot)$: The Spatial (-Temporal) Fusion function, which concatenates and processes the semantic and detailed features.
  - $P(\mathbf{x}_t)$: The low-level feature representation from the projector.
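Taken together, the formulas above amount to: noise the VAE latents, run the semantic layers on the (possibly noised) latents, align them with SigLIP features of the clean image via cosine similarity, and fuse the semantic and low-level features into one representation. Below is a hedged PyTorch sketch with toy shapes and stand-in modules (`semantic_layers`, `projector`, and `fuse` are placeholders, not the actual Show-o2 components).

```python
import torch
import torch.nn.functional as F

def make_noised_latents(clean, t):
    """x_t = t * x_1 + (1 - t) * eps, with t in [0, 1] (t = 1 is clean data)."""
    eps = torch.randn_like(clean)
    return t * clean + (1.0 - t) * eps

def distillation_loss(semantic_feats, siglip_feats):
    """1 - cosine similarity between semantic-layer features and SigLIP features."""
    return 1.0 - F.cosine_similarity(semantic_feats, siglip_feats, dim=-1).mean()

def spatial_temporal_fusion(semantic_feats, detail_feats, fuse_proj):
    """Concatenate semantic and low-level features, then project them."""
    return fuse_proj(torch.cat([semantic_feats, detail_feats], dim=-1))

# Toy shapes: batch of 2, 64 visual tokens, 256-dim features.
clean_latents = torch.randn(2, 64, 256)          # patchified clean VAE latents
x_t = make_noised_latents(clean_latents, t=0.7)
semantic_layers = torch.nn.Linear(256, 256)      # stand-in for the SigLIP-distilled layers
projector = torch.nn.Linear(256, 256)            # stand-in for the low-level projector
fuse = torch.nn.Linear(512, 256)                 # stand-in for the STF module
u = spatial_temporal_fusion(semantic_layers(x_t), projector(x_t), fuse)
print(u.shape)  # torch.Size([2, 64, 256])
```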
- Training Objective: The model is trained with a combined loss function that addresses both text prediction and visual generation:

  $$\mathcal{L} = \mathcal{L}_{\text{NTP}} + \alpha\,\mathcal{L}_{\text{FM}}$$

  - $\mathcal{L}$: The total loss.
  - $\mathcal{L}_{\text{NTP}}$: Next Token Prediction loss for the language head, a standard cross-entropy loss for text generation.
  - $\mathcal{L}_{\text{FM}}$: Flow Matching loss for the flow head, which measures the difference between the predicted velocity field and the true velocity field needed to transform noise into the target image.
  - $\alpha$: A weighting factor to balance the two loss components.
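A minimal sketch of how the combined objective could be computed in a training step; the tensors, the `alpha` value, and the use of `ignore_index=-100` to mask visual positions in the text loss are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, predicted_velocity, target_velocity, alpha=1.0):
    """L = L_NTP + alpha * L_FM.

    text_logits:        [B, T, vocab] language-head outputs
    text_targets:       [B, T] next-token ids (-100 marks positions to ignore)
    predicted_velocity: [B, N, D] flow-head outputs on noised visual latents
    target_velocity:    [B, N, D] ground-truth velocity (clean latents minus noise)
    """
    l_ntp = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten(), ignore_index=-100)
    l_fm = F.mse_loss(predicted_velocity, target_velocity)
    return l_ntp + alpha * l_fm

# Toy example
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
pred_v, true_v = torch.randn(2, 64, 16), torch.randn(2, 64, 16)
print(unified_loss(logits, targets, pred_v, true_v, alpha=0.5).item())
```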
- Two-Stage Training Recipe: This is a critical part of the methodology, designed to efficiently train the model without requiring a massive text corpus to retain language knowledge.

Manually transcribed table — Table 2: Trainable components and datasets in the training stages.

| Stage | Trainable Components | # Image-Text | # Video-Text | # Interleaved Data |
| --- | --- | --- | --- | --- |
| Stage-1 | Projector, Spatial (-Temporal) Fusion, Flow Head | 66M | WebVid [8], Pandas [23] | OmniCorpus [60] |
| Stage-2 | Full Model (w/o VAE) | 9M HQ Und., 16M HQ Gen. | OpenVid-1M [80] Gen., 1.5M Internal Data Gen., 1.6M Video Und. | VIST [47], CoMM [24] |

  - Stage 1: Learning Visual Generation. The core LLM is mostly frozen. Training focuses only on the newly added generative components: the projector, the fusion module, and the flow head. This stage uses a large dataset of image-text and video-text pairs to teach the model how to generate visuals from text prompts, without corrupting the LLM's pre-existing language abilities.
  - Stage 2: Integrated Fine-tuning. The entire model (except the VAE) is unfrozen and fine-tuned on a smaller, higher-quality dataset containing both multimodal understanding and generation instruction data. This stage integrates the two capabilities and aligns the model's behavior with user instructions.
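In practice, the two-stage recipe boils down to toggling which parameters receive gradients. The sketch below uses a toy model with made-up sub-module names (`vae`, `llm`, `projector`, `fusion`, `flow_head`) purely to illustrate the freezing pattern; it is not the authors' training code.

```python
import torch
import torch.nn as nn

class ToyUnifiedModel(nn.Module):
    """Toy stand-in with the same sub-module layout described in the recipe."""
    def __init__(self):
        super().__init__()
        self.vae = nn.Linear(8, 8)        # stand-in for the (always frozen) 3D causal VAE
        self.llm = nn.Linear(8, 8)        # stand-in for the language model backbone
        self.projector = nn.Linear(8, 8)  # low-level projector (new component)
        self.fusion = nn.Linear(16, 8)    # spatial (-temporal) fusion (new component)
        self.flow_head = nn.Linear(8, 8)  # flow head (new component)

def configure_stage(model, stage):
    """Stage 1: train only the new generative parts; Stage 2: unfreeze everything but the VAE."""
    for p in model.parameters():
        p.requires_grad = (stage == 2)
    for module in (model.projector, model.fusion, model.flow_head):
        for p in module.parameters():
            p.requires_grad = True
    for p in model.vae.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)

model = ToyUnifiedModel()
opt = configure_stage(model, stage=1)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params in stage 1")
```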
5. Experimental Setup
- Datasets:
  - Pre-training (Stage 1): A large-scale dataset of ~66M image-text pairs curated from public sources (CC12M, COYO, LAION-Aesthetic) and synthetic data. For video, the WebVid and Pandas datasets were used. For interleaved text-image data, OmniCorpus was used.
  - Fine-tuning (Stage 2): High-quality instruction-following datasets were used.
    - Understanding: 9M samples from DenseFusion-1M and LLaVA-OneVision.
    - Generation: 16M high-quality samples filtered from the pre-training data.
    - Video: OpenVid-1M, internal data, VIST, and CoMM.
  - Text Rendering Improvement: A subset of TextAtlas was used to improve the model's ability to render text in generated images.
- Evaluation Metrics: The paper evaluates Show-o2 on a comprehensive set of benchmarks for both understanding and generation.
  - Multimodal Understanding Benchmarks: MME, GQA, SEED-Bench, MMBench, MMMU, MMStar, AI2D. These benchmarks test various capabilities like perception, attribute recognition, spatial awareness, and reasoning. The primary metric is typically Accuracy.
  - Video Understanding Benchmarks: VOTA, NExT-QA, Perception-Test, MV-Bench. These measure zero-shot video question-answering accuracy.
  - Image Generation Benchmarks:
    - GenEval: Measures how well a model can generate images that follow compositional instructions (e.g., object counts, colors, positions). The metric is Accuracy.
    - DPG-Bench: Evaluates fine-grained text-to-image generation with prompts covering entities, attributes, and relations. The metric is a score based on human preference.
    - OneIG-Bench: A comprehensive benchmark that evaluates alignment, text rendering, reasoning, style, and diversity.
  - Video Generation Benchmarks (VBench): This benchmark evaluates text-to-video and image-to-video generation across many dimensions. The metrics are scores from 0-100 or 0-1.
    - Quality Score (QS): Overall visual quality.
    - Semantic Score (SS): How well the video matches the text prompt.
    - Temporal Flickering (TF): Measures inconsistency between frames; less flickering is better.
    - Motion Smoothness (MS): Quality of motion.
    - And many others, such as Subject Consistency (SC) and Dynamic Degree (DD).
- Baselines: The paper compares Show-o2 against a wide range of models:
  - Understanding-Only Models: LLaVA-v1.5, Qwen-VL-Chat, LLaVA-OV.
  - Generation-Only Models: Stable Diffusion 3 (SD3), DALL-E 3, PixArt-Σ.
  - Assembled UMMs: NExT-GPT, SEED-X, ILLUME.
  - Native UMMs: Show-o (the predecessor), Janus-Pro, Emu3, BAGEL, Mogao.
6. Results & Analysis
- Core Results:

Multimodal Understanding on Images and Videos. Table 3 shows that Show-o2 performs exceptionally well on understanding tasks.

Manually transcribed table — Table 3: Evaluation on multimodal understanding benchmarks.

| Type | Model | # Params. | MME↑ (p) | GQA↑ | SEED↑ (all) | MMB↑ (en) | MMMU↑ (val) | MMStar↑ | AI2D↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Und. Only | LLaVA-v1.5 [71] | 7B | 1510.7 | 62.0 | 58.6 | 64.3 | | | |
| | Qwen-VL-Chat [6] | 7B | 1487.6 | 57.5 | 58.2 | 60.6 | | | 57.7 |
| | LLaVA-OV [56] | 7B | 1580.0 | − | − | 80.8 | 48.8 | 57.5 | 81.4 |
| Unify via Assembling Tailored Models | NExT-GPT [120] | 13B | − | 57.5 | 58.0 | − | − | | |
| | SEED-X [40] | 17B | 1457.0 | 49.1 | 66.5 | 70.1 | 35.6 | − | |
| | MetaMorph [106] | 8B | − | − | 71.8 | 75.2 | − | − | |
| | TokenFlow-XL* [89] | 14B | 1551.1 | 62.5 | 72.6 | 76.8 | 43.2 | | 75.9 |
| | ILLUME [111] | 7B | 1445.3 | − | 72.9 | 75.1 | 38.2 | | 71.4 |
| Native Unified | BAGEL [32] | 14B | 1687.0 | − | | 85.0 | 55.3 | − | |
| | Show-o [128] | 1.3B | 1097.2 | 58.0 | 51.5 | − | 27.4 | − | |
| | JanusFlow [79] | 1.5B | 1333.1 | 60.3 | 70.5 | 74.9 | 29.3 | − | |
| | SynerGen-VL [58] | 2.4B | 1381.0 | − | − | 53.7 | 34.2 | − | |
| | Janus-Pro [26] | 1.5B | 1444.0 | 59.3 | 68.3 | 75.5 | 36.3 | − | |
| | Show-o2 (Ours) | 1.5B | 1450.9 | 60.0 | 65.6 | 67.4 | 37.1 | 43.4 | 69.0 |
| | Emu3 [114] | 8B | | 60.3 | 68.2 | 58.5 | 31.6 | − | 70.0 |
| | VILA-U [123] | 7B | 1401.8 | 60.8 | 59.0 | − | − | − | − |
| | MUSE-VL [129] | 7B | − | | 69.1 | 72.1 | 39.7 | 49.6 | 69.8 |
| | Liquid [118] | 8B | 1448.0 | 61.1 | − | − | − | − | − |
| | Janus-Pro [26] | 7B | 1567.1 | 62.0 | 72.1 | 79.2 | 41.0 | − | |
| | Mogao [65] | 7B | 1592.0 | 60.9 | 74.6 | 75.0 | 44.2 | | |
| | Show-o2 (Ours) | 7B | 1620.5 | 63.1 | 69.8 | 79.3 | 48.9 | 56.6 | 78.6 |
Show-o2
model is highly competitive with other models of similar size. - The 7B
Show-o2
model surpasses other powerful native UMMs likeJanus-Pro
andMogao
on most benchmarks, and even outperforms the much largerTokenFlow-XL
(14B) on several key metrics likeMME
,GQA
, andMMMU
. This demonstrates the effectiveness of thedual-path
visual representation and the overall architecture.
Visual Generation. Tables 5, 6, and 7 show that Show-o2 is also a very strong generative model.

Manually transcribed table — Table 5: Evaluation on the GenEval [41] benchmark.

| Type | Method | # Params. | # Data | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only | SD3-Medium [37] | − | − | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Unifying via Assembling Tailored Models | SEED-X [40] | 17B | 158M+ | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| | TokenFlow-XL* [89] | 14B | 60M | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| | ILLUME [111] | 7B | 15M+ | 0.99 | 0.86 | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
| | MetaQuery-XL [83] | 7B | 28M+ | − | − | − | − | − | − | 0.80 |
| Native Unified | Show-o [128] | 1.3B | 2.0B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 |
| | Emu3 [114] | 8B | − | − | − | − | − | − | − | 0.66 |
| | … | … | … | … | … | … | … | … | … | … |
| | Janus-Pro [26] | 7B | 144M | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| | Mogao [65] | 7B | − | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| | Show-o2 (Ours) | 7B | 66M | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
GenEval
andDPG-Bench
(Table 6),Show-o2
achieves competitive or superior performance to other UMMs likeJanus-Pro
andMogao
, despite being trained on significantly less data (66M vs. 144M+). This highlights the efficiency of the training recipe. - On
DPG-Bench
, the 7BShow-o2
model achieves the highest overall score (86.14) among all listed models, including specialized generation-only models likeSD3-Medium
. - For video generation (Table 8), the
Show-o2
model (with only ~2B effective parameters for generation) is competitive with or outperforms much larger models likeShow-1
(6B) andEmu3
(8B).
Qualitative Results. Figure 2 showcases the model's versatility. It can describe images in detail, count objects, read text within an image, provide step-by-step instructions based on world knowledge, and generate high-quality images from text prompts.

(Image caption, translated from Chinese): A schematic from the paper showing Show-o2 applied to image-to-video generation and multimodal generation. The upper part shows consecutively generated video frames; the middle and lower parts show the model's generated text descriptions for different input images.
7. Conclusion & Reflections
- Conclusion Summary: The paper presents Show-o2, an improved native unified multimodal model that effectively bridges the gap between multimodal understanding and generation. By leveraging a dual-path unified visual representation on a 3D causal VAE space, combining autoregressive modeling with flow matching, and designing an efficient two-stage training recipe, the authors have developed a system that is scalable to images and videos. The model demonstrates state-of-the-art performance on numerous benchmarks, showing that a single, native model can excel at both understanding and generation without compromising on performance or requiring prohibitively large training resources.
- Limitations & Future Work:
  - Text Rendering: While the paper mentions adding TextAtlas data to improve text rendering, this remains a challenging area for most generative models. The results on OneIG-Bench (Table 7) show a low text score, indicating room for improvement.
  - Computational Cost for Video: The authors note that they did not include interleaved and video data when training the larger 7B model due to the huge computational cost. This suggests that scaling up video capabilities in such unified models is still a significant engineering and resource challenge.
  - Reliance on the VAE: The quality of the VAE is a potential bottleneck. Any artifacts or information loss from the VAE encoder/decoder will directly impact the final output quality.
- Personal Insights & Critique:
- The two-stage training recipe is arguably the most impactful practical contribution of this paper. It offers an elegant solution to the common problem of "catastrophic forgetting" in LLMs when they are adapted for new, non-textual tasks. By first teaching the new skill (generation) with the LLM frozen, and then holistically fine-tuning, the method preserves valuable language knowledge while efficiently adding new capabilities.
- The architectural choice of a dual-path visual representation is very insightful. It recognizes that understanding and generation have different feature requirements: understanding needs high-level semantics, while generation needs low-level, pixel-perfect detail. By explicitly modeling and fusing both, the model is better equipped for both tasks.
- The adoption of flow matching over the more traditional diffusion modeling is a forward-looking choice. Flow matching promises faster training and inference, making the development of these complex UMMs more tractable.
- Overall, Show-o2 represents a significant step forward in the quest for truly unified AI systems. It provides a strong architectural blueprint and a practical training strategy that will likely influence future work in building versatile, all-in-one multimodal agents.