X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
TL;DR Summary
X-Omni employs reinforcement learning to enhance discrete autoregressive image generation, integrating a semantic tokenizer, a unified language-image model, and an offline diffusion decoder to improve visual fidelity and instruction adherence.
Abstract
Numerous efforts have been made to extend the ``next token prediction'' paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
1.2. Authors
Zigang Geng*, Yibing Wang*, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu†, Xiaosong Zhang‡, Linus Di Wang, Jie Jiang *Equal contribution †Project Lead ‡Corresponding Author Affiliation: Tencent Hunyuan X
1.3. Journal/Conference
This paper is published as a preprint on arXiv. The arXiv platform is a widely recognized and influential repository for research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published on arXiv are typically peer-reviewed later for formal conference or journal publication, but the platform allows for rapid dissemination of research findings.
1.4. Publication Year
The paper was published on arXiv on 2025-07-29.
1.5. Abstract
This paper addresses the persistent challenges faced by autoregressive (AR) models using discrete tokens for image generation, such as low visual fidelity, distorted outputs, and difficulty in following complex instructions. These issues are often attributed to cumulative errors during AR inference and information loss from discretization. While recent trends have shifted towards hybrid models combining diffusion objectives for images and AR for language, this work re-establishes the viability of unified AR modeling. The authors demonstrate that reinforcement learning (RL) can effectively mitigate these artifacts and significantly enhance generation quality. Their proposed framework, X-Omni, integrates a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder. X-Omni, utilizing a 7B language model, achieves state-of-the-art performance in image generation, producing aesthetically high-quality images, precisely following instructions, and accurately rendering long texts.
1.6. Original Source Link
https://arxiv.org/abs/2507.22058 Publication Status: Preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2507.22058v1.pdf
2. Executive Summary
2.1. Background & Motivation
The "next token prediction" paradigm has revolutionized language modeling, leading to powerful large language models (LLMs). Researchers have attempted to extend this paradigm to visual content, aiming for a unified approach to image generation and understanding. However, directly applying autoregressive (AR) modeling with discrete image tokens has consistently faced significant hurdles. These include:
- Low visual fidelity: Generated images often lack detail and realism.
- Distorted outputs: Images can appear structurally incorrect or unnatural.
- Failure to adhere to complex instructions: The models struggle to precisely render intricate details or follow nuanced prompts.
These shortcomings are primarily theorized to arise from two sources:
- Cumulative errors during autoregressive inference: As tokens are generated sequentially, small errors at early steps can propagate and amplify, leading to significant degradation in later parts of the image.
- Information loss during the discretization process: Converting continuous visual information into discrete tokens (quantization) inevitably involves some loss of detail, which can limit the ultimate quality of the generated image.
Due to these challenges, recent research has increasingly moved away from fully unified AR modeling for images. Instead, the trend has been towards hybrid approaches that jointly train image generation with diffusion objectives and language generation with autoregressive objectives. While these hybrid models can achieve high fidelity, they introduce architectural and modeling heterogeneity, posing challenges for deep integration of semantic capabilities and knowledge transfer across modalities.
The core problem the paper aims to solve is to "make discrete autoregressive image generative models great again" by addressing their inherent limitations, thereby enabling a truly unified approach for both image and language generation within a single framework. The paper's innovative idea is that reinforcement learning (RL) can be a powerful tool to overcome these long-standing issues in discrete AR image generation.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Demonstration of RL's Efficacy in Discrete AR Image Generation: The most pivotal contribution is showing that reinforcement learning can effectively mitigate artifacts, reduce cumulative errors, and largely enhance the generation quality of discrete autoregressive image models. This directly addresses the main limitations that pushed research away from unified AR approaches.
- Introduction of X-Omni Framework: The paper proposes X-Omni, a novel framework for unified image and language modeling. This framework comprises:
  - A semantic image tokenizer, which converts images into discrete tokens while preserving rich semantic information.
  - A unified autoregressive model, a single model capable of processing and generating both language and image tokens.
  - An offline diffusion decoder, used for reconstructing high-fidelity images from the generated discrete tokens.
- State-of-the-Art Performance: X-Omni achieves state-of-the-art performance in image generation tasks using a relatively compact 7B language model. It produces images with high aesthetic quality and demonstrates robust capabilities in following complex instructions.
- Exceptional Long Text Rendering: A key finding and capability of X-Omni is its ability to accurately render long texts in both English and Chinese, a notoriously difficult task for text-to-image models. The paper introduces a new benchmark, LongText-Bench, to evaluate this specific capability.
- Independence from Classifier-Free Guidance (CFG): X-Omni is shown to generate high-quality images without relying on classifier-free guidance (CFG) during autoregressive sampling. This not only reduces computational costs but also indicates a higher intrinsic consistency in the model's generation process for visual and language tokens, which is a unique advantage over other AR models.
- Holistic Optimization through RL: The work highlights that RL provides comprehensive supervision throughout the sampling process, bridging the distribution gap between AR-generated tokens and the diffusion decoder's expectations, which supervised fine-tuning (SFT) alone struggles with. This holistic optimization is crucial for performance gains.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Autoregressive Models and Next Token Prediction
Autoregressive models are a class of statistical models that predict future values based on past values. In the context of machine learning, especially for sequences like text or tokens, an autoregressive model predicts the next element in a sequence given all preceding elements. This process is often called "next token prediction."
- Next Token Prediction: Imagine writing a sentence word by word. An autoregressive language model works similarly: given "The quick brown fox," it predicts the next word "jumps." This prediction is based on the probability distribution over all possible next words, learned from vast amounts of text data. The model then appends the predicted word and continues the process.
- Sequential Generation: This sequential nature means that each prediction depends on all previous predictions. While powerful for coherent text, in image generation, errors can accumulate, leading to degraded quality over longer sequences.
- Application in Language (LLMs): Large Language Models (LLMs) like GPT-series are prime examples of autoregressive models, excelling at generating human-like text, translating, summarizing, and answering questions by predicting the most probable next word or token.
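The sequential dependence described above can be made concrete with a minimal sampling loop; `model` is a hypothetical callable returning vocabulary logits, not X-Omni's actual interface.

```python
import numpy as np

def sample_autoregressively(model, prompt_tokens, max_new_tokens, temperature=1.0):
    """Generate tokens one at a time, each conditioned on all previously generated tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = np.asarray(model(tokens), dtype=np.float64)  # hypothetical: scores over the vocabulary
        logits = (logits - logits.max()) / temperature        # stabilize before softmax
        probs = np.exp(logits)
        probs /= probs.sum()
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens.append(next_token)                             # becomes context for the next step
    return tokens
```

Because every step conditions on previously sampled tokens, an early mistake remains in the context for all later steps, which is precisely the cumulative-error problem discussed throughout this analysis.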
3.1.2. Discrete Tokens vs. Continuous Tokens
In generative models, especially for images, visual information can be represented in different forms:
- Continuous Tokens (Latent Codes): These are typically floating-point vectors that represent features of an image in a continuous latent space. They allow for fine-grained variations but can be harder to interpret and might struggle with complex, multimodal distributions without strong distributional assumptions (e.g., Gaussian). Many early visual language models (VLMs) used continuous tokens, often optimized with Mean Squared Error (MSE) loss or cosine similarity.
- Discrete Tokens: These are integer indices, similar to words in a vocabulary. Images are first passed through a tokenizer that quantizes (maps) continuous visual features into a finite set of discrete codes (a "codebook"). This approach allows for modeling more complex, non-Gaussian distributions and aligns well with the next token prediction paradigm of language models. However, the quantization process inherently involves information loss, which can limit the level of detail in generated images. The challenge is to make this discretization compact enough to be efficient yet rich enough to retain visual quality.
3.1.3. Diffusion Models
Diffusion models are a class of generative models that have achieved state-of-the-art results in image generation. They work by learning to reverse a gradual diffusion process (noise addition).
- How they work:
- Forward (Diffusion) Process: A diffusion model starts with a clean image and gradually adds Gaussian noise to it over many steps, eventually transforming it into pure noise.
- Reverse (Denoising) Process: The model is trained to learn the reverse of this process: starting from pure noise, it iteratively removes noise in small steps, guided by a text prompt or other conditions, until a coherent image emerges.
- Strengths: Diffusion models are excellent at producing high-fidelity and diverse images.
- Role in X-Omni: In X-Omni, a pre-trained diffusion decoder acts as the final stage, taking the discrete semantic tokens generated by the autoregressive model and reconstructing them into high-resolution pixel images. This means the AR model does not generate pixels directly but semantic codes that guide the diffusion process.
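A rough numerical illustration of the two processes (not the FLUX.1-dev implementation): the forward step mixes a clean image with Gaussian noise, and generation repeatedly applies a hypothetical `denoise_step` network in reverse.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t):
    """Forward process: mix a clean image with Gaussian noise at cumulative noise level alpha_bar_t."""
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def generate(denoise_step, shape, num_steps):
    """Reverse process: start from pure noise and iteratively remove a little noise per step."""
    x = np.random.randn(*shape)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)  # hypothetical trained network predicts a slightly cleaner image
    return x
```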
3.1.4. Reinforcement Learning (RL)
Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment to maximize the cumulative reward.
- Agent, Environment, Action, Reward, Policy:
- Agent: The model (e.g., the autoregressive image generator in X-Omni) that makes decisions.
- Environment: The external system with which the agent interacts (e.g., the process of generating image tokens and then decoding them into an image).
- Action: The output generated by the agent (e.g., predicting the next discrete image token).
- Reward: A scalar feedback signal from the environment that indicates how good or bad the agent's action was (e.g., a score representing image quality, alignment, or text rendering accuracy).
- Policy (π): A strategy that the agent uses to determine its next action based on the current state. In X-Omni, this is the autoregressive model's token prediction probabilities.
- Goal: The agent's goal is to learn a policy that maximizes the total expected reward over time.
- Role in X-Omni: RL is used to fine-tune the autoregressive model. Instead of relying solely on supervised fine-tuning (SFT), which predicts ground truth tokens, RL allows the model to explore different token sequences and receive feedback (rewards) on the quality of the final generated image. This helps the AR model generate tokens that are better aligned with the diffusion decoder's expectations and human preferences, effectively mitigating cumulative errors.
3.1.5. Vector Quantization (VQ)
Vector Quantization (VQ) is a signal processing technique used for data compression. In the context of image generation, it's used to discretize continuous visual features.
- How it works:
  - Encoder: A neural network (e.g., the ViT in X-Omni) processes a continuous input (an image) and outputs a continuous feature vector (embedding).
  - Codebook: A fixed-size dictionary (or codebook) contains a set of learned discrete "codes" or "prototypes" (e.g., 16,384 vocabulary size, 2,048 dimension in X-Omni).
  - Quantization: For each continuous feature vector from the encoder, the vector quantizer finds the closest code vector in the codebook (e.g., using Euclidean distance) and replaces the continuous vector with the index of that closest code. This index is the discrete token.
- Benefit: VQ enables the use of discrete tokens for images, making them compatible with autoregressive models designed for discrete sequences (like text).
- Challenge: The choice of codebook size and the quality of the learned codes directly impact the information loss during discretization.
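A minimal sketch of the quantization step just described, assuming the codebook is stored as a plain array; the toy sizes below stand in for the 16,384 × 2,048 codebook quoted for X-Omni.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry.

    features: (N, D) encoder outputs; codebook: (K, D) learned code vectors
    (e.g. K = 16384, D = 2048 for the numbers quoted above).
    """
    # ||f - c||^2 = ||f||^2 + ||c||^2 - 2 f.c, computed without a huge intermediate tensor.
    dists = (features ** 2).sum(1, keepdims=True) + (codebook ** 2).sum(1) - 2.0 * features @ codebook.T
    return dists.argmin(axis=1)  # (N,) integer "discrete tokens"

codebook = np.random.randn(512, 64).astype(np.float32)   # small toy codebook
tokens = quantize(np.random.randn(4, 64).astype(np.float32), codebook)
```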
3.1.6. Transformer Architecture
The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which has become the de-facto standard for sequence modeling tasks, particularly in natural language processing.
- Self-Attention Mechanism: The core innovation of the Transformer is the self-attention mechanism. It allows the model to weigh the importance of different parts of the input sequence when processing each element. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the dot-product similarity between queries and keys, representing attention scores.
  - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax into regions with extremely small gradients.
  - $\mathrm{softmax}$ normalizes the scores to create probability distributions.
  - The result is a weighted sum of the value vectors, where the weights are the attention scores.
- Positional Encoding: Since Transformers process sequences in parallel and lack an inherent notion of order, positional encodings are added to the input embeddings to inject information about the relative or absolute position of tokens in the sequence. X-Omni uses 1D RoPE (Rotary Positional Embedding), which applies a rotation matrix to the query and key vectors before their dot product, effectively embedding relative position information.
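The attention formula above translates almost line-for-line into code; this minimal NumPy version is for illustration only (single head, no masking).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # weighted sum of value vectors

# Example with a sequence of 5 tokens and d_k = 8.
Q = K = V = np.random.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
```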
3.2. Previous Works
3.2.1. Discrete Token Autoregressive (AR) Models for Image Generation
Early pioneers sought to extend the next token prediction paradigm to images by converting them into sequences of discrete tokens.
- iGPT (Image GPT) and DALL-E: These models were among the first to demonstrate the feasibility of generating images autoregressively from text. DALL-E, in particular, showed impressive capability in generating images from diverse text prompts.
- Limitations: Despite their groundbreaking nature, these early models (and even subsequent improvements like Chameleon [15] and Emu3 [12]) were plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions. This was primarily attributed to cumulative errors during sequential generation and information loss during the discretization process, which led to a perception that discrete AR models were inherently limited in image quality.
3.2.2. Continuous Token Autoregressive Models
Some approaches [10, 11, 34] focused on autoregressively predicting continuous visual tokens or using query tokens to predict them in parallel.
- Characteristics: These methods often rely on MSE loss or cosine similarity for supervision and make distributional assumptions (e.g., Gaussian) about the latent space.
- Limitations: These assumptions can limit the range of image distributions they can effectively represent. Furthermore, the paper notes that the means by which reinforcement learning can be integrated into continuous token methods remains unclear, potentially constraining their upper limits.
3.2.3. Hybrid AR and Diffusion Models
Given the limitations of pure AR models for images and the rise of diffusion models for high-fidelity generation, a new trend emerged: combining the strengths of both.
- Serial Architectures [13, 28, 27]: In this setup, an autoregressive model might generate query tokens or continuous conditions that then guide a separate diffusion model's generation process.
- Parallel Architectures [25, 40, 24, 32, 41, 42, 31, 33, 30]: More prevalent, these involve the autoregressive model (processing language) and the diffusion model (processing visual latents) performing forward passes simultaneously, with information exchanged via self-attention layers.
- Limitations: While achieving high fidelity, these hybrid approaches often suffer from a loose connection between the two components and a cross-modal modeling mismatch. The architectural heterogeneity makes integrating robust semantic capabilities challenging. Additionally, reinforcement learning techniques for diffusion models are less mature compared to those for AR models, limiting the application of RL in optimizing these hybrid models.
3.2.4. Reinforcement Learning for Generative Models
RL has been explored for optimizing generative models, but predominantly for diffusion models.
- RL for Diffusion Models:
  - Reward gradients [43, 44, 45]: Fine-tuning diffusion models using differentiable rewards.
  - Reward-weighted negative log-likelihood [46, 47]: Optimizing through specific objectives.
  - Extensions of DPO [59], PPO [60], and GRPO [61] to diffusion models [48, 47, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58].
- Limitations: The paper argues that the potential of RL for diffusion models remains limited due to insufficient diversity in sampling.
- Contrast with X-Omni: X-Omni, in contrast, applies RL to autoregressive generative models, leveraging the greater maturity and applicability of RL techniques in this domain to achieve strong results.
3.3. Technological Evolution
The evolution of generative AI for images has seen a journey from initial attempts at direct pixel generation (e.g., GANs), to autoregressive models tokenizing images for sequential generation (e.g., DALL-E), then the emergence of diffusion models offering superior quality and diversity. As AR models faced quality and instruction-following limitations, research largely shifted to hybrid diffusion-AR architectures. This paper represents a pivotal moment in this timeline, aiming to demonstrate that with the right technique (Reinforcement Learning), the original discrete autoregressive paradigm can be revived and made competitive, even superior, to hybrid approaches, fostering a truly unified modeling approach. X-Omni fits within this timeline as an effort to bring full unification back into focus by solving the inherent problems of discrete AR models.
3.4. Differentiation Analysis
Compared to the main methods in related work, X-Omni offers several core differences and innovations:
- Unified AR Rejuvenation: Unlike the trend of moving away from unified AR models towards hybrid AR-diffusion architectures (e.g., MetaQuery [28], UniWorld-V1 [29], OmniGen2 [31], Show-o2 [32]) that maintain separate generative mechanisms, X-Omni advocates for a discrete autoregressive framework for image generation that unifies the modeling of text and images. It aims to make next token prediction truly effective for both modalities in a single model.
- RL for AR (not Diffusion): The paper uniquely applies reinforcement learning to directly optimize the autoregressive image token generation process. While RL has been explored for diffusion models, X-Omni leverages the maturity of RL for AR methods to mitigate cumulative errors and align the token distribution with the diffusion decoder, a gap that SFT alone cannot bridge. This is a crucial distinction from models that use RL primarily on diffusion components.
- Semantic-Rich Tokenization: X-Omni employs semantic encoders like SigLIP 2 to generate discrete tokens. This contrasts with approaches that rely on lossy quantization without explicit semantic guidance (e.g., Chameleon [15], Emu3 [12]) or those that reconstruct semantic features for discretization but still face a distribution gap (e.g., LaViT [38]). X-Omni's approach, combined with RL, aims to ensure these semantic tokens are optimized for both generation and understanding.
- Elimination of CFG Dependency: A significant innovation is that X-Omni can generate high-quality images without relying on classifier-free guidance (CFG) in the autoregressive component. This is a stark contrast to other autoregressive image generation models (e.g., Emu3 [12], Janus-Pro [21]) that rely heavily on CFG to enhance sampling quality. This suggests X-Omni has a more intrinsically aligned and consistent generative process.
- Strong Instruction and Long Text Following: X-Omni demonstrates robust capability in following complex instructions and accurately rendering long texts, an area where many prior unified models struggle, particularly for non-English languages. The introduction of LongText-Bench specifically highlights this strength.
- Streamlined Architecture for Multi-turn Interaction: By using isomorphic autoregressive modeling for both image and text, X-Omni allows seamless integration of image generation and image understanding within a unified network. This eliminates the need to re-extract semantic embeddings from generated images for understanding purposes in multi-turn interactions, making it more streamlined than approaches with heterogeneous image encoders [21, 30, 33].
4. Methodology
4.1. Principles
The core principle behind X-Omni is to revive and significantly enhance the capabilities of discrete autoregressive models for unified image generation and understanding by leveraging Reinforcement Learning (RL). The intuition is that while next token prediction offers a powerful paradigm for sequential data, its application to complex visual content suffers from cumulative errors and a distribution gap between the autoregressively generated tokens and what a high-fidelity image decoder expects. Supervised fine-tuning (SFT) alone struggles to bridge this gap because it optimizes for ground truth tokens, not the quality of the final decoded image. RL, by providing holistic, reward-based feedback on the decoded image quality, can guide the autoregressive model to generate token sequences that are intrinsically more compatible with the image decoder and better aligned with human preferences and instructions. This effectively mitigates error propagation and leads to higher quality, more controllable image generation within a unified, text-like token framework.
4.2. Core Methodology In-depth (Layer by Layer)
The X-Omni framework integrates image and text tokens within a unified autoregressive architecture, depicted in Figure 3. It consists of three main components: a semantic image tokenizer, a unified autoregressive model, and an offline diffusion decoder.
This image is a schematic of the X-Omni architecture, showing the unified autoregressive model acting as the hub between the text tokenizer and SigLIP-VQ, and producing text via detokenization and images via the diffusion decoder.
Figure 3: The architecture of X-Omni.
4.2.1. Image Tokenization
The first step in processing images for the autoregressive model is tokenization, converting continuous visual information into discrete tokens.
- Goal: To convert continuous images into discrete tokens while retaining rich semantic information, rather than focusing solely on pixel-level reconstruction.
- Visual Semantic Extractor: X-Omni uses a pre-trained SigLIP2-g [62] Vision Transformer (ViT) as the core visual semantic extractor. This ViT is known for its strong visual understanding capabilities.
- Vector Quantizer (VQ): Building on the ViT encoder, a vector quantizer is incorporated to perform the discretization. This VQ has:
  - A codebook of 16,384 vocabulary size.
  - Code vectors of dimension 2,048.
- Alignment with LLM: The SigLIP-VQ tokenizer is aligned with a pre-trained large language model (LLM), Qwen2.5-1.5B [9], on visual understanding tasks. This ensures that the discrete tokens carry meaningful semantic information that the LLM can interpret.
- Adapter: A residual block serves as an adapter between the visual tokenizer and the LLM, facilitating smooth information transfer.
- Frozen Components: Both the visual encoder and the vector quantizer are kept frozen during subsequent training phases (pre-training, supervised fine-tuning, and reinforcement learning). This ensures stability and consistency in the tokenization process, meaning the mapping from images to discrete tokens remains fixed.
4.2.2. Autoregressive Modeling
Once images are tokenized into discrete tokens, both visual and language tokens are naturally unified within a single autoregressive architecture for multimodal modeling.
- Base Model: X-Omni adopts Qwen2.5-7B [9] as its base pre-trained language model.
- Integrating Visual Perception: To enable this text-based LLM to handle visual information:
  - Vision-Specific Blocks: Four randomly initialized vision-specific blocks are inserted both before and after the original Transformer layers of the Qwen2.5-7B model.
  - Structure: These vision-specific blocks have the same structural configuration as standard Transformer blocks (e.g., self-attention, feed-forward networks).
  - Functionality: Crucially, these blocks operate exclusively on image tokens, leaving text tokens unaffected. This allows the model to learn visual-specific patterns without interfering with its linguistic capabilities.
  - Embedding and Classification Heads: Randomly initialized embedding layers and classification heads are introduced specifically for image tokens.
- Architectural Efficiency: These modifications introduce no extra infrastructure complexity compared to the original language model and maintain full compatibility with distributed training strategies (tensor, pipeline, and context parallelism).
- Unified Multimodal Sequence: For training, visual tokens and language tokens are concatenated into a unified multimodal sequence, which is then fed into the autoregressive model for next-token prediction training.
  - Supervision:
    - For understanding tasks (e.g., VQA, captioning), only the language tokens are supervised. The model learns to predict the next word in a text sequence given an image.
    - For generation tasks (e.g., text-to-image), only the visual tokens are supervised. The model learns to predict the next image token given a text prompt.
- Arbitrary Resolution Handling: To support various image resolutions, a resolution information prefix is prepended to the visual tokens (a toy assembly sketch appears at the end of this subsection). The prefix format contains:
  - Special tokens denoting the start and end of the multimodal sequence.
  - height and width: text tokens representing the spatial dimensions of the 2D image tokens.
  - A special token indicating the start of the flattened image tokens.
  - visual tokens: the flattened sequence of discrete image tokens, with a length equal to height multiplied by width.
- Positional Encoding: The model utilizes 1D RoPE [63] (Rotary Positional Embedding), consistent with the original language model, instead of requiring additional 2D positional encoding for images. This allows the model to inherently understand the relative positions of tokens within the sequence, whether text or image.
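As a toy illustration of the sequence layout just described, the following sketch interleaves text tokens, a textual resolution prefix, and flattened image tokens. The special-token names `<som>`, `<soi>`, and `<eom>` are placeholders invented for this example (the source does not preserve the actual token strings), and `tokenize_text` is a hypothetical helper.

```python
def build_multimodal_sequence(text_tokens, image_tokens_2d, tokenize_text):
    """Concatenate text tokens with a resolution-prefixed, flattened image-token grid.

    image_tokens_2d: list of rows of discrete image-token ids (height x width).
    tokenize_text:   callable turning a string into text-token ids (hypothetical).
    """
    height, width = len(image_tokens_2d), len(image_tokens_2d[0])
    flat_image_tokens = [tok for row in image_tokens_2d for tok in row]  # length = height * width
    return (
        ["<som>"]                             # placeholder: start of the multimodal sequence
        + list(text_tokens)
        + tokenize_text(f"{height} {width}")  # resolution written as ordinary text tokens
        + ["<soi>"]                           # placeholder: start of the flattened image tokens
        + flat_image_tokens
        + ["<eom>"]                           # placeholder: end of the multimodal sequence
    )
```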
4.2.3. Diffusion Decoder
The final stage of image generation involves reconstructing high-resolution images from the discrete semantic tokens generated by the autoregressive model.
- Choice of Decoder: A well-pretrained diffusion model is used for this purpose; specifically, FLUX.1-dev [64] is chosen.
- Integration:
  - A linear layer is added to map the 2,048-dimensional semantic embedding tokens (output by the VQ) to the feature channel dimensions of the FLUX.1-dev model.
  - These mapped features are then integrated into the intermediate layer features of the diffusion model.
- Input and Objective: The diffusion decoder uses the semantic tokens extracted by the image tokenizer (and subsequently generated by the AR model) as input conditions. It is trained with an image reconstruction objective, learning to turn these semantic codes into a high-fidelity image. This component operates offline: its training is separate from the AR model's primary training, and it acts as a fixed "render engine" during the RL phase.
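The conditioning path can be pictured as a learned projection followed by feature injection. The sketch below assumes PyTorch; the decoder feature width (3072) and the additive injection are illustrative assumptions rather than the actual FLUX.1-dev integration.

```python
import torch
import torch.nn as nn

class SemanticCondition(nn.Module):
    """Project 2048-d semantic token embeddings to the decoder's feature width."""
    def __init__(self, semantic_dim=2048, decoder_dim=3072):  # decoder_dim is an assumed value
        super().__init__()
        self.proj = nn.Linear(semantic_dim, decoder_dim)

    def forward(self, semantic_embeddings, decoder_features):
        # semantic_embeddings: (batch, num_tokens, 2048) looked up from the VQ codebook
        # decoder_features:    (batch, num_tokens, decoder_dim) intermediate diffusion features
        return decoder_features + self.proj(semantic_embeddings)  # simple additive injection (assumption)
```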
4.2.4. Reinforcement Learning with GRPO
The critical innovation of X-Omni is the application of reinforcement learning to fine-tune the autoregressive model. This is done to address two main challenges:
- Bridging the Distribution Gap: Supervised training optimizes the AR model to predict ground truth tokens. However, these ground truth tokens might not be perfectly aligned with the semantic embeddings that the diffusion decoder expects for optimal image reconstruction. RL helps bridge this gap.
- Mitigating Error Propagation: The sequential nature of AR generation means early errors can accumulate. RL provides a holistic reward signal based on the final image quality, guiding the model to make better decisions throughout the entire sampling process.
The overall pipeline for the reinforcement learning process is illustrated in Figure 2(a).
This image is Figure 2, comprising a diagram of the reinforcement learning pipeline, training curves, and comparisons of generated results; it shows that as training steps increase, X-Omni's image generation reward quickly surpasses SFT BoN, while text rendering, aesthetic quality, and instruction-following ability improve progressively.
Figure 2: During the reinforcement learning process, X-Omni's image generation reward quickly surpassed the best-of-N results achieved by the SFT model (SFT BoN) and demonstrated steady improvement. This progress was reflected in the model's text rendering capabilities, the aesthetic quality of the generated images, and its ability to follow instructions, all of which improved gradually.
4.2.4.1. GRPO Algorithm
X-Omni adopts the Group Relative Policy Optimization (GRPO) [61] algorithm. This algorithm is chosen because it sidesteps the need for a separate critic network, which reduces computational overhead and simplifies the training process compared to some other RL algorithms (e.g., standard Proximal Policy Optimization (PPO)).
The steps for GRPO are:
- Trajectory Generation: For each text prompt $p$ sampled from the training distribution $\mathcal{D}$, the model generates a group of $G$ trajectories (sequences of autoregressive tokens), denoted as $\{o_i\}_{i=1}^{G}$. These trajectories are generated using the previous policy $\pi_{\theta_{\mathrm{old}}}$, which is a snapshot of the model's policy before the current update.
- Image Decoding: Each generated token trajectory $o_i$ is then passed through the fixed diffusion decoder to produce a corresponding image.
- Reward Scoring: Each image is evaluated by the reward function (described below) to yield a scalar reward $r_i$.
- Advantage Computation: The advantages $A_i$ for each trajectory are computed by normalizing this group of rewards, following the original GRPO procedure. This normalization helps stabilize training by providing a relative measure of goodness within the generated group.
- Policy Optimization: The current policy model $\pi_\theta$ is then optimized by maximizing the following objective function:
$ \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{p \sim \mathcal{D},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot|p)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)} A_i,\ \mathrm{clip}\left( \frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}}\right) \right) \right] $
Where:
- $\mathcal{J}_{GRPO}(\theta)$: The objective function for Group Relative Policy Optimization, which is to be maximized with respect to the current policy parameters $\theta$.
- $\mathbb{E}[\cdot]$: Expectation over prompts $p$ sampled from the distribution $\mathcal{D}$, and over groups of trajectories $\{o_i\}_{i=1}^{G}$ generated by the old policy $\pi_{\theta_{\mathrm{old}}}$ given prompt $p$.
- $G$: The number of trajectories generated for each prompt (group size).
- $\pi_\theta(o_i|p)$: The probability of generating trajectory $o_i$ given prompt $p$ under the current policy (which is being optimized).
- $\pi_{\theta_{\mathrm{old}}}(o_i|p)$: The probability of generating trajectory $o_i$ given prompt $p$ under the old policy (before the current update), used for importance sampling.
- $A_i$: The advantage score for trajectory $o_i$, derived from the normalized rewards. A positive advantage indicates that the trajectory was better than average for its group.
- $\min(\cdot,\cdot)$: This applies a clipping mechanism, similar to PPO. It takes the minimum of two terms:
  - $\frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)} A_i$: The standard policy gradient term, weighted by the ratio of new to old policy probabilities.
  - $\mathrm{clip}\left(\frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)}, 1-\epsilon, 1+\epsilon\right) A_i$: A clipped version of the policy ratio. This term ensures that the policy updates do not deviate too far from the old policy in a single step, preventing large, unstable updates.
- $\epsilon$: A hyperparameter that defines the clipping range (e.g., 0.2).
- $\beta$: A hyperparameter that weights the KL divergence regularization term.
- $\mathbb{D}_{KL}(\pi_\theta \| \pi_{\theta_{\mathrm{ref}}})$: The Kullback-Leibler (KL) divergence between the current policy and a reference policy. This term acts as regularization, preventing the new policy from diverging too much from a stable reference policy, which could be the initial SFT model or an earlier version of the RL policy. It is estimated using an unbiased estimator [65].

This formulation helps efficiently fine-tune the policy by balancing reward maximization against divergence from a stable reference model, which is crucial for stable RL training.
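To ground these symbols, the sketch below computes group-relative advantages from a set of rewards and evaluates the clipped, KL-regularized objective on toy scalar values. It is a simplified illustration of the formula above (per-trajectory log-probabilities collapsed to scalars), not X-Omni's training code.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards (one per sampled trajectory) into advantages."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(logp_new, logp_old, advantages, kl, clip_eps=0.2, beta=0.01):
    """Clipped surrogate objective with a KL penalty, averaged over the group."""
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_old per trajectory
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return (surrogate - beta * kl).mean()            # quantity to maximize

# Example: one prompt, G = 4 decoded images scored by the reward system.
adv = group_advantages([0.71, 0.55, 0.62, 0.80])
obj = grpo_objective(np.array([-1.2, -0.9, -1.0, -0.8]),
                     np.array([-1.1, -1.0, -1.0, -0.9]),
                     adv, kl=np.array([0.02, 0.01, 0.015, 0.03]))
```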
4.2.4.2. Reward System
A comprehensive reward system is constructed to provide multifaceted supervision for the autoregressive model during RL. This system combines several specialized components, each targeting a distinct aspect of image generation quality. The individual reward signals are then combined through a weighted aggregation mechanism to form the final reward score.
- Human Preference Score (HPSv2) [66]:
  - Purpose: To assess the aesthetic quality and human preference alignment of generated images.
  - Mechanism: HPSv2 is a model specifically trained to predict human preferences for images. It generalizes well across various image distributions.
  - Limitation addressed: HPSv2 operates at a limited input resolution, whereas X-Omni specializes in high-resolution generation.
- Unified Reward Score [67]:
  - Purpose: To address the resolution limitation of HPSv2 and provide holistic feedback for human alignment, aggregating multiple quality dimensions into a unified score.
  - Mechanism: This model is designed for comprehensive human alignment assessment and is suitable for high-resolution images.
- Text-Image Alignment Score:
  - Purpose: To ensure semantic consistency between the input prompt and the generated image, and to minimize semantic hallucinations.
  - Mechanism: The Qwen2.5-VL-32B [68] vision-language model computes an alignment reward by assessing how accurately the generated image reflects the content described in the prompt, leveraging the VLM's sophisticated image understanding capabilities.
- OCR Accuracy Score:
  - Purpose: To address the critical challenge of text rendering accuracy within images, especially for prompts requiring textual generation.
  - Mechanism: For such prompts, an OCR-based reward quantifies the fidelity of rendered text against the ground truth. GOT-OCR2.0 [69] and PaddleOCR [70] are jointly employed to evaluate the generated images and calculate an accuracy score for text rendering.

This multi-faceted approach ensures that the model learns to balance various quality criteria simultaneously, leading to high-fidelity images that meet human expectations across multiple dimensions.
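As a schematic of the weighted aggregation step, the sketch below combines the component scores into one scalar. The equal weights and the rule of applying the OCR term only when text rendering is requested are illustrative assumptions; the paper's actual weighting is not reproduced here.

```python
def total_reward(hps, unified, alignment, ocr=None,
                 weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine individual reward signals into a single scalar score.

    hps, unified, alignment, ocr: component scores in [0, 1] from the respective models.
    ocr is only available for prompts that require text rendering.
    """
    w_hps, w_uni, w_align, w_ocr = weights
    score = w_hps * hps + w_uni * unified + w_align * alignment
    if ocr is not None:
        score += w_ocr * ocr
    return score
```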
5. Experimental Setup
5.1. Datasets
The training of X-Omni involves three main stages: pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL), each utilizing specific datasets.
5.1.1. Pre-training
The autoregressive model is pre-trained on a mixture of image generation and image understanding data.
- Image Generation Dataset:
  - Sources: Open-source datasets including COYO-700M [71], DataComp-1B [72], and LAION-2B [73].
  - Curation: The authors apply specific filters to these large datasets, collecting approximately 200M high-quality images.
  - Caption Enhancement: Recognizing the limitations of original captions (limited information, noise), the Qwen2.5-VL-72B model [68] is used to generate high-quality image-text pairs with dense captions. This ensures richer textual descriptions for training.
  - Image Resolution: All images are resized to a short edge of 384 pixels and a maximum long edge of 1152, maintaining their native aspect ratio.
  - Token Count: The final image generation dataset results in 600B multimodal tokens after tokenization.
- Image Understanding Dataset:
  - Sources: 59M data samples, including those used in LLaVA-OneVision [74], BLIP3-KALE [75], and Infinity-MM [76].
  - Image Resolution: The same resize strategy as for image generation data is applied.
  - Token Count: This dataset produces about 100B multimodal tokens for image understanding tasks.
5.1.2. Supervised Finetuning
After large-scale pre-training, a supervised fine-tuning (SFT) stage further refines the model.
- Data Sources:
  - 30K high-quality samples from BLIP3o-60k [27].
  - 30K synthetic text-to-image data.
  - High-quality pre-training tokens filtered from the pre-training dataset itself.
  - Mixed image understanding data from LLaVA-NeXt [77], Cauldron [78], and Cambrian-1 [79].
- Token Count: In total, 1.5B tokens are used for training during the SFT stage.
5.1.3. Reinforcement Learning
The reinforcement learning stage focuses on enhancing the model's capabilities through reward-based optimization.
- Input Requirement: The on-policy GRPO algorithm requires only text prompts as input, since the images are generated by the model itself during training and then evaluated.
- Dataset Curation: A diverse dataset of 180K prompt samples from three distinct categories is curated to comprehensively cover the desired capabilities:
  - 80K cleaned prompts from the Midjourney dataset [80]: randomly sampled to align training with real-world user preferences and expectations for creative content.
  - 50K text-rendering-focused prompts: to specifically enhance text rendering capabilities, a bucket-based sampling strategy is applied to text-rich image data. Prompts are sorted by text length into different buckets, and 10K samples are extracted from each, covering varied text lengths (a toy sketch of this bucketing appears at the end of this subsection).
  - 50K prompts from natural image data: randomly sampled to improve both aesthetic quality and instruction-following abilities.
- Rationale: This balanced composition provides a comprehensive training foundation for the RL phase, enabling the model to excel across diverse user interaction scenarios.
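As referenced above, the bucket-based sampling of text-rendering prompts can be pictured as follows. The bucket boundaries and the uniform draw are illustrative choices; only the idea of grouping prompts by rendered-text length and drawing 10K per bucket comes from the description.

```python
import random
from collections import defaultdict

def bucket_sample(prompts, text_lengths, boundaries=(20, 50, 100, 200), per_bucket=10_000):
    """Group prompts by the length of the text they ask to render, then sample evenly."""
    buckets = defaultdict(list)
    for prompt, length in zip(prompts, text_lengths):
        bucket_id = sum(length > b for b in boundaries)   # index of the length bucket
        buckets[bucket_id].append(prompt)
    sampled = []
    for bucket in buckets.values():
        sampled.extend(random.sample(bucket, min(per_bucket, len(bucket))))
    return sampled
```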
5.2. Evaluation Metrics
5.2.1. Text Rendering Metrics
For evaluating the model's ability to render text within images, two benchmarks are used: OneIG-Bench and the newly proposed LongText-Bench.
5.2.1.1. OneIG-Bench [81]
OneIG-Bench evaluates text rendering using a composite score derived from three metrics: Edit Distance, Completion Rate, and Word Accuracy.
- Edit Distance (Levenshtein Distance):
  - Conceptual Definition: Measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It quantifies the dissimilarity between two strings; a lower edit distance indicates higher accuracy (a reference implementation sketch appears after this metric list).
  - Mathematical Formula: For two strings $a$ and $b$ of lengths $|a|$ and $|b|$ respectively, the Levenshtein distance $lev(a, b)$ is given by: $ lev(a,b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ lev(a_{\text{tail}}, b_{\text{tail}}) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} lev(a_{\text{tail}}, b) \\ lev(a, b_{\text{tail}}) \\ lev(a_{\text{tail}}, b_{\text{tail}}) \end{cases} & \text{if } a[0] \neq b[0] \end{cases} $ (Note: this recursive definition is one common formulation; in practice it is implemented with dynamic programming for efficiency.)
  - Symbol Explanation:
    - $a$, $b$: The two strings being compared (e.g., ground truth text and OCR-recognized text).
    - $|a|$, $|b|$: The lengths of strings $a$ and $b$.
    - $a[0]$, $b[0]$: The first characters of strings $a$ and $b$.
    - $a_{\text{tail}}$, $b_{\text{tail}}$: The remainder of each string after its first character.
    - $lev(a, b)$: The Levenshtein distance between $a$ and $b$.
- Completion Rate:
- Conceptual Definition: Measures the proportion of target words from the prompt that are successfully recognized and rendered in the generated image. It quantifies how complete the text rendering is. A higher completion rate indicates more comprehensive text rendering.
- Word Accuracy:
- Conceptual Definition: Measures the percentage of correctly recognized words compared to the total number of words in the ground truth text. It is a direct measure of the correctness of individual words. A higher word accuracy indicates more precise text rendering.
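As referenced in the Edit Distance entry, a standard dynamic-programming implementation of the Levenshtein distance looks like this (a generic textbook version, not OneIG-Bench's evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))               # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # deleting i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```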
5.2.1.2. LongText-Bench
This is a novel benchmark proposed in the paper to specifically evaluate the model's ability to accurately render long texts.
- Conceptual Definition: It consists of 160 meticulously designed prompts covering 8 different scenarios aimed at evaluating the capacity to precisely render long Chinese and English texts. The benchmark uses a Text Accuracy score.
- Text Accuracy Score:
  - Conceptual Definition: For LongText-Bench, the Text Accuracy score is used as the final metric. This is because, for long texts where OCR recognition results might not preserve the relative order of different text segments, Edit Distance becomes less suitable. Text Accuracy likely refers to a modified word or phrase accuracy that accounts for correctness regardless of order, or a simple count of correctly identified words/phrases from the prompt.
  - OCR Model: Qwen2.5-VL-7B [68] is employed as the OCR model to parse the generated texts for evaluation.
  - Sampling: Four images are generated for each prompt to mitigate random errors in generation or OCR.
5.2.2. Text-to-Image Generation Metrics
The performance in general text-to-image generation is evaluated on two widely recognized benchmarks:
- DPG-Bench [90]:
  - Conceptual Definition: This benchmark evaluates the model's ability to follow complex generation instructions, often involving global attributes, entity-specific details, attributes of entities, and relations between entities. It assesses the faithfulness of the generated image to the prompt's detailed content.
- GenEval [91]:
  - Conceptual Definition: GenEval is an object-focused benchmark designed to evaluate text-to-image generation. It tests specific capabilities such as generating single objects, multiple objects, counting, color attributes, positional accuracy, and color attributes of specific objects, focusing on fine-grained control and understanding of compositional prompts.
5.2.3. Image Understanding Metrics
To assess image understanding capabilities, a comprehensive set of public benchmarks are used:
- POPE (Polling-based Object Probing Evaluation) [96]:
  - Conceptual Definition: Evaluates the tendency of Large Vision-Language Models (LVLMs) to hallucinate objects. It measures whether the model incorrectly claims the presence of common objects in an image. Higher POPE scores indicate less object hallucination.
- GQA (Visual Question Answering) [97]:
- Conceptual Definition: A dataset for real-world visual reasoning and compositional question answering. It requires models to perform complex reasoning over images to answer questions, including spatial understanding, object properties, and relationships. Scores typically reflect accuracy in answering questions.
- MMB (MMBench) [98]:
- Conceptual Definition: A comprehensive benchmark suite designed to evaluate the "all-around player" capability of multimodal models across various tasks and domains. It covers a broad range of multimodal understanding abilities. Scores are usually an aggregate performance measure.
- SEED (SEEDBench-Img) [99]:
  - Conceptual Definition: A benchmark specifically for Multimodal Large Language Models (MLLMs), focusing on diverse understanding tasks related to images, such as visual reasoning, fact-based Q&A, and fine-grained comprehension.
- DocVQA (Document Visual Question Answering) [100]:
- Conceptual Definition: A dataset and task focused on visual question answering over document images. It tests the model's ability to read and understand information presented in documents, often involving text extraction and reasoning on structured layouts.
- OCRB (OCRBench) [101]:
  - Conceptual Definition: A benchmark specifically designed to evaluate the Optical Character Recognition (OCR) capabilities of large multimodal models, assessing their ability to accurately extract and understand text from images. Higher scores indicate better OCR performance.
5.3. Baselines
The paper compares X-Omni against a range of models, categorized into "Gen. Only Models" (focused primarily on image generation) and "Unified Models" (aiming for both image generation and understanding).
- Generation Only Models:
  - FLUX.1-dev [64]
  - HiDream-I1-Full [82]
  - Kolors 2.0 [83]
  - Seedream 3.0 [84]
  - SDXL [86]
  - DALL-E [87]
  - SD3-medium [88]
- Unified Models:
  - Janus-Pro [21]
  - BLIP3-o [27]
  - BAGEL [30]
  - OmniGen2 [31]
  - Show-o2 [32]
  - GPT-4o [85]
  - Emu3 [12]
  - MetaQuery [28]
  - UniWorld-V1 [29]
  - Ovis-U1 [89]
  - Mogao [33]
  - LLaVA-1.5 [93] (for understanding-only comparison)
  - LLaVA-NeXT [77] (for understanding-only comparison)
  - VILA [94] (for understanding-only comparison)
  - MiniCPM-Llama3-V 2.5 [95] (for understanding-only comparison)
  - LLaVA-OneVision [74] (for understanding-only comparison)

These baselines represent a comprehensive selection of state-of-the-art models in both specialized image generation and unified multimodal AI, providing a strong context for X-Omni's performance claims.
5.4. Implementation Details
5.4.1. Pre-training
The autoregressive model undergoes a three-stage pre-training strategy:
- Stage 1 (Vision-Specific Blocks):
  - Only the randomly initialized vision-specific blocks and vision token embeddings are unfrozen and trained.
  - Data: Generation data.
  - Batch Size: 256.
  - Sequence Length: 16,384.
  - Steps: 10,000 steps.
  - Tokens Consumed: 42B tokens.
  - Learning Rate: held constant.
- Stage 2 (All Trainable Components):
  - All components in the autoregressive model are set to be trainable (unfrozen).
  - Data: A mixture of generation and understanding data.
  - Batch Size: 256.
  - Sequence Length: 16,384.
  - Steps: 150,000 steps.
  - Tokens Consumed: 629B tokens.
  - Learning Rate: held constant.
- Stage 3 (High-Quality Tokens with Annealing):
  - Data: 42B high-quality tokens filtered from the pre-training dataset.
  - Learning Rate: Learning rate annealing is applied, meaning the learning rate gradually decreases over time, helping the model converge better.
5.4.2. Supervised Finetuning
- Trainable Parameters: Only the learnable parameters of the autoregressive model are trained.
- Batch Size: 64.
- Sequence Length: 16,384.
- Tokens Trained: 1.5B tokens.
- Learning Rate Schedule: The learning rate is gradually reduced from its initial value to 0 during this stage.
- Consistency: Sequence packing (how sequences are combined to fill the context window) remains consistent with pre-training; only the data and the learning rate schedule differ.
5.4.3. Reinforcement Learning
- Learning Rate: .
- Training Steps: 200 steps.
- Batch Size: 512.
- Rollouts: Each prompt generates 16 rollouts (trajectories), ensuring adequate exploration of the action space.
- Loss Function: A combination of the standard policy gradient loss and a KL divergence term weighted at 0.01. This configuration balances reward maximization against stability by keeping the policy from deviating too far from the previous policy.
- Chinese Model Adaptation: The Chinese text rendering model is derived by incorporating training on Chinese data at an intermediate checkpoint during the reinforcement learning stage.
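Gathered into a configuration sketch, the reported reinforcement-learning hyperparameters look roughly as follows; the learning-rate value is not preserved in this write-up, so it is left unset.

```python
rl_config = {
    "training_steps": 200,       # total GRPO update steps
    "batch_size": 512,           # prompts per update
    "rollouts_per_prompt": 16,   # trajectories sampled per prompt (group size G)
    "kl_weight": 0.01,           # weight on the KL divergence term
    "learning_rate": None,       # value not stated in this write-up
}
```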
6. Results & Analysis
After the multi-stage training process (pre-training, SFT, and RL), X-Omni's performance is evaluated across text rendering, text-to-image generation, and image understanding benchmarks.
6.1. Core Results Analysis
6.1.1. Text Rendering
X-Omni demonstrates a significant enhancement in rendering long texts, a capability greatly improved through reinforcement learning.
The following are the results from Table 1 of the original paper:
| Method | OneIG-Bench [81] (English) | OneIG-Bench [81] (Chinese) | LongText-Bench (English) | LongText-Bench (Chinese) |
| --- | --- | --- | --- | --- |
| Gen. Only Models | ||||
| FLUX.1-dev [64] | 0.523 | - | 0.607 | 0.005 |
| HiDream-I1-Full [82] | 0.707 | 0.205 | 0.543 | 0.024 |
| Kolors 2.0 [83] | 0.427 | 0.502 | 0.258 | 0.329 |
| Seedream 3.0 [84] | 0.865 | 0.928 | 0.896 | 0.878 |
| Unified Models | ||||
| Janus-Pro [21] | 0.001 | 0.148 | 0.019 | 0.006 |
| BLIP3-o [27] | 0.013 | 0.092 | 0.021 | 0.018 |
| BAGEL [30] | 0.244 | 0.365 | 0.373 | 0.310 |
| OmniGen2 [31] | 0.680 | - | 0.561 | 0.059 |
| Show-o2 [32] | 0.002 | - | 0.006 | 0.002 |
| GPT-4o [85] | 0.857 | 0.650 | 0.956 | 0.619 |
| X-Omni | 0.901 | 0.895 | 0.900 | 0.814 |
Analysis of Table 1:
- OneIG-Bench (English): X-Omni achieves the highest score of 0.901, significantly outperforming recent open-source unified models like BAGEL (0.244), OmniGen2 (0.680), and Show-o2 (0.002). It also surpasses GPT-4o (0.857), indicating superior English text rendering in short to medium contexts.
- OneIG-Bench (Chinese): X-Omni scores 0.895, outperforming most models, including GPT-4o (0.650). It is comparable to the specialized commercial text-to-image system Seedream 3.0 (0.928), which is impressive for a unified model.
- LongText-Bench (English): X-Omni scores 0.900, significantly better than other unified models, though it falls slightly behind GPT-4o (0.956) in this specific category. This benchmark, designed for longer texts, further highlights X-Omni's robustness.
- LongText-Bench (Chinese): X-Omni achieves a leading score of 0.814, outperforming all other models by a large margin and demonstrating its strong capability in rendering complex, long Chinese texts. Many other models, including GPT-4o (0.619), struggle considerably with Chinese long text.

The qualitative comparison examples in Figure 4 further solidify these findings, showing X-Omni's ability to precisely render long text according to instructions, while many other unified models fail.
This image is Figure 4 of the paper, showing a comparison of unified multimodal models on text rendering. It contains several groups of poster-style layouts combining text and imagery, covering themes such as family, travel, and home-renovation tips, illustrating the models' ability to combine complex text generation with visual content.
Figure 4: Text rendering comparison with other unified multimodal models.
6.1.2. Text-to-Image Generation
X-Omni's text-to-image generation capabilities are evaluated on DPG-Bench and GenEval.
The following are the results from Table 2 of the original paper:
| Method | Global | Entity | Attribute | Relation | Other | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Gen. Only Models | ||||||
| SDXL [86] | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| DALL-E [87] | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SD3-medium [88] | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| FLUX.1-dev [64] | 82.10 | 89.50 | 88.70 | 91.10 | 89.40 | 84.00 |
| Unified Models | ||||||
| Emu3 [12] | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
| Janus-Pro [21] | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| MetaQuery [28] | - | - | - | - | - | 82.05 |
| BLIP3-o [27] | - | - | - | - | - | 81.60 |
| UniWorld-V1 [29] | 83.64 | 88.39 | 88.44 | 89.27 | 87.22 | 81.38 |
| Ovis-U1 [89] | 82.37 | 90.08 | 88.68 | 93.35 | 85.20 | 83.72 |
| Mogao [33] | 82.37 | 90.03 | 88.26 | 93.18 | 85.40 | 84.33 |
| BAGEL [30] | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
| OmniGen2 [31] | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 | 83.57 |
| Show-o2 [32] | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 | 86.14 |
| GPT-4o* [85] | 82.27 | 91.27 | 87.67 | 93.85 | 88.71 | 86.23 |
| X-Omni | 84.80 | 92.59 | 90.63 | 94.75 | 84.20 | 87.65 |
Analysis of Table 2 (DPG-Bench):
- X-Omni achieves an Overall score of 87.65, which is state-of-the-art compared to all other unified models listed, surpassing even GPT-4o (86.23) and Show-o2 (86.14).
- It particularly excels in the Entity (92.59) and Relation (94.75) categories, indicating strong capabilities in accurately rendering specific objects and their interactions as described in prompts. This highlights the effectiveness of RL in improving instruction following.
- The performance demonstrates X-Omni's precise adherence to complex generation instructions.
The following are the results from Table 3 of the original paper:
| Method | Single | Two | Counting | Colors | Position | Color Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only Models | | | | | | | |
| SDXL [86] | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| DALL-E [87] | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| SD3-medium [88] | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| FLUX.1-dev [64] | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Unified Models | | | | | | | |
| Emu3 [12] | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 |
| Janus-Pro [21] | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| MetaQuery [28] | - | - | - | - | - | - | 0.80 |
| BLIP3-o [27] | - | - | - | - | - | - | 0.84 |
| UniWorld-V1 [29] | 0.99 | 0.93 | 0.81 | 0.89 | 0.74 | 0.71 | 0.84 |
| Mogao [33] | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| BAGEL [30] | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| OmniGen2 [31] | 0.99 | 0.96 | 0.74 | 0.98 | 0.72 | 0.75 | 0.86 |
| Show-o2 [32] | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
| GPT-4o* [85] | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 | 0.84 |
| X-Omni | 0.98 | 0.95 | 0.75 | 0.91 | 0.71 | 0.68 | 0.83 |
Analysis of Table 3 (GenEval):
- X-Omni achieves an Overall score of 0.83, which is comparable to other strong unified models like UniWorld-V1 (0.84) and GPT-4o (0.84).
- While Mogao (0.89) and BAGEL (0.88) show slightly higher overall scores, X-Omni demonstrates solid performance across various fine-grained aspects of image generation, such as Single (0.98) and Two (0.95) objects, and Colors (0.91).
- This indicates that X-Omni maintains strong capabilities in generating images that align with compositional prompts.
Figure 5 presents additional qualitative examples, showcasing X-Omni's ability to generate aesthetically pleasing images at arbitrary resolutions. These images exhibit high detail, color fidelity, and thematic representation, further supporting the quantitative results.
This image is a montage of multiple high-quality pictures, presenting diverse visual content such as portraits, landscapes, artistic creations, and everyday scenes, showcasing X-Omni's fidelity and expressiveness in detail, color, and theme during image generation.
Figure 5: Qualitative cases of X-Omni.
6.1.3. Image Understanding
The image understanding performance is evaluated across various benchmarks.
The following are the results from Table 4 of the original paper:
| Method | #LLM | POPE | GQA | MMB | SEED | DocVQA | OCRB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Und. Only Models | |||||||
| LLaVA-1.5 [93] | 7B | 85.9 | 62.0 | 64.3 | 66.1 | - | 318 |
| LLaVA-NeXT [77] | 7B | 86.5 | 64.2 | 67.4 | 68.2 | 74.4 | 532 |
| VILA [94] | 7B | 85.5 | 62.3 | 68.9 | 61.1 | - | - |
| MiniCPM-Llama3-V 2.5 [95] | 8B | - | - | 77.2 | - | 84.8 | 725 |
| LLaVA-OneVision [74] | 7B | - | - | 80.8 | 75.4 | 87.5 | 622 |
| Unified Models | |||||||
| Emu3 [12] | 7B | 85.2 | 60.3 | 58.5 | 68.2 | 76.3 | 687 |
| Janus-Pro [21] | 7B | 87.4 | 62.0 | 79.2 | 72.1 | - | 595 |
| Mogao [33] | 7B | 88.9 | 60.9 | 75.0 | 74.6 | - | - |
| Show-o2 [32] | 7B | - | 63.1 | 79.3 | 69.8 | - | - |
| X-Omni | 7B | 89.3 | 62.8 | 74.8 | 74.1 | 88.6 | 704 |
Analysis of Table 4:
- Overall Performance: Despite X-Omni's primary focus on improving generative capabilities through RL, it achieves comparable results to other state-of-the-art unified models such as Show-o2 on various image understanding benchmarks.
- Outperforming Earlier Unified Models: X-Omni surpasses earlier unified works such as Emu3 and Janus-Pro in overall understanding.
- Strong OCR Performance: Notably, X-Omni's score on OCRBench (704) significantly exceeds that of Emu3 (687) and Janus-Pro (595), as well as the specialized image understanding model LLaVA-OneVision (622). This strong OCR capability aligns with its excellent text rendering results and suggests that the unified semantic tokenization and RL training contribute to robust text understanding in visual contexts.
- DocVQA: X-Omni also performs very well on DocVQA (88.6), outperforming LLaVA-OneVision (87.5) and MiniCPM-Llama3-V 2.5 (84.8), further demonstrating its ability to process and reason over text in images.
6.2. Ablation Studies / Parameter Analysis
6.2.1. X-Omni Does Not Rely on Classifier-Free Guidance (CFG)
A key finding is that X-Omni can generate high-quality images without relying on classifier-free guidance (CFG) in its autoregressive component.
- What is CFG? Classifier-Free Guidance (CFG) is a technique commonly used in generative models (especially diffusion models and some autoregressive models) to improve the alignment between generated content and the conditioning input (e.g., a text prompt). It works by combining predictions from a conditional model (guided by the prompt) and an unconditional model (not guided by the prompt) during sampling. By extrapolating between these two, CFG can strengthen the guidance, leading to higher quality and better prompt adherence, often at the cost of diversity. Autoregressive models such as Emu3 and Janus-Pro typically rely heavily on CFG to enhance the sampling of visual tokens (a minimal sketch of this logit-mixing idea appears after Figure 6 below).
- X-Omni's Independence: Figure 6 compares the CFG dependencies. It shows that X-Omni maintains consistently high generation quality regardless of whether CFG is present. In contrast, other autoregressive models (Emu3, Janus-Pro) suffer a significant drop in generation quality when CFG is absent (the cfg_scale=1 setting, which means no guidance).
This figure is a chart comparing images of a blue vase generated by three models (Emu3, Janus-Pro, and X-Omni) with and without classifier-free guidance (CFG), highlighting the richer detail and more natural colors of X-Omni's outputs.
Figure 6: Comparison of dependency on classifier-free guidance (CFG).
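To make the mechanism concrete, below is a minimal, hypothetical sketch of CFG-style logit mixing at a single autoregressive decoding step. The names (`model`, `cond_ids`, `uncond_ids`) are illustrative and do not correspond to X-Omni's actual API; setting `cfg_scale=1.0` reduces to plain conditional sampling, i.e., the no-guidance setting under which X-Omni is reported to remain stable.

```python
import torch

def cfg_next_token_logits(model, cond_ids, uncond_ids, cfg_scale=1.0):
    """One AR decoding step with classifier-free guidance (illustrative sketch).

    cond_ids:   token sequence that includes the text prompt
    uncond_ids: the same sequence with the prompt dropped or masked
    cfg_scale:  1.0 means no guidance (conditional prediction only)
    """
    with torch.no_grad():
        cond_logits = model(cond_ids).logits[:, -1, :]        # prompt-conditioned prediction
        if cfg_scale == 1.0:
            return cond_logits                                 # no-guidance setting
        uncond_logits = model(uncond_ids).logits[:, -1, :]     # unconditional prediction
    # Extrapolate away from the unconditional distribution to strengthen prompt guidance.
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```

Running only the conditional branch also halves the forward passes per generated token, which is exactly the computational saving noted in the implications below.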
Implications:
- Reduced Computational Cost: Operating without CFG means that X-Omni avoids running both conditional and unconditional inference paths during sampling, thereby lowering the computational cost of autoregressive inference.
- Higher Intrinsic Consistency: This independence suggests a higher level of consistency between the generation processes for visual and language tokens within X-Omni. The model inherently generates tokens that are well aligned with the prompt and with the diffusion decoder's expectations, without needing an external boost from CFG. This is a strong indicator of the effectiveness of the RL training in optimizing the intrinsic generation quality.

Unless specified otherwise, all qualitative examples in the paper were generated without CFG during autoregressive sampling, reinforcing this finding.
6.2.2. RL Outperforms SFT with Best-of-N Sampling
The paper highlights that Reinforcement Learning (RL) training provides significant advantages over Supervised Fine-Tuning (SFT) even when SFT is augmented with Best-of-N sampling.
- What is Best-of-N sampling? Best-of-N sampling is a technique where a generative model produces multiple (N) outputs for a given prompt, and the "best" one is then selected using some evaluation metric (e.g., a reward model or human preference). This can improve perceived quality by letting the model generate diverse options and cherry-picking the most favorable one (see the sketch after this list). In language modeling, SFT models combined with Best-of-N sampling often set a high bar that RL training struggles to surpass.
- RL's Advantage: As illustrated in Figure 2(b), the RL training in X-Omni yields a performance curve that quickly surpasses the best-of-N results achieved by the SFT model (SFT BoN). This indicates that RL is not just incrementally improving SFT but is fundamentally optimizing the generative process in a way that SFT cannot.
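As a reference point, here is a minimal sketch of Best-of-N selection. `generate_image` and `reward_fn` are placeholders for any stochastic generator (e.g., an SFT model) and any scoring model (e.g., a human-preference reward model); they are not functions from the paper.

```python
def best_of_n(prompt, generate_image, reward_fn, n=8):
    """Best-of-N sampling (illustrative): draw N candidates, keep the highest-reward one."""
    candidates = [generate_image(prompt) for _ in range(n)]      # N independent samples
    scores = [reward_fn(prompt, image) for image in candidates]  # score each candidate
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

Note that Best-of-N only reranks samples the SFT model can already produce; it never changes the model's own distribution, which is one reason an RL-trained policy can overtake it.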
Reasons for this disparity (different from language modeling):
- Alignment of AR and Diffusion Modules:
  - SFT: Trains the autoregressive (AR) module and the diffusion decoder separately. The AR model is optimized to predict ground-truth tokens, while the diffusion model reconstructs pixels from these tokens. This separation can create a distribution gap: AR-generated tokens, even when close to the ground truth, may not be optimal inputs for the diffusion decoder, resulting in performance degradation.
  - RL: Plays a crucial role in aligning these two modules. Because the reward is computed on the final image decoded by the diffusion model, the AR model learns to generate token sequences that are intrinsically compatible with the diffusion decoder's expectations, effectively bridging this gap.
- Holistic Optimization for Image Features:
  - Language has strong sequential dependencies, and errors are often localized. Image features, while generated sequentially, are inherently local and spatially complex; errors in early image tokens can have complex, non-local effects on the final image.
  - RL's holistic optimization harnesses rich, multifaceted information from various local regions within a single image (via the comprehensive reward system). This enables highly efficient learning of globally coherent, high-quality images by optimizing the entire token sequence with respect to the end goal (a sketch of this image-level credit assignment follows below).
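The image-level credit assignment described above can be sketched in a GRPO-like form: for each prompt, a group of rollouts is sampled, each rollout is decoded into an image and scored by the composite reward, and every rollout receives its reward standardized within the group as a shared advantage. The helper names (`decode_image`, `composite_reward`) are placeholders standing in for the offline diffusion decoder and the multi-faceted reward system, not the paper's actual code.

```python
import statistics

def group_relative_advantages(token_rollouts, prompt, decode_image, composite_reward):
    """GRPO-style advantages from image-level rewards (illustrative sketch).

    token_rollouts:   visual-token sequences sampled for the same prompt
    decode_image:     offline diffusion decoder, tokens -> image
    composite_reward: combined score (human preference, alignment, OCR accuracy, ...)
    """
    rewards = [composite_reward(prompt, decode_image(tokens)) for tokens in token_rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero for identical rewards
    # Every token in a rollout shares one advantage, so credit reflects the
    # global quality of the decoded image rather than a per-token target.
    return [(r - mean) / std for r in rewards]
```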
6.2.3. LongText-Bench Details
To fully evaluate X-Omni's ability to accurately render long texts, the authors propose a novel benchmark: LongText-Bench.
The following are the results from Figure 7 of the original paper:
This figure is a chart comparing the distributions of rendered-text lengths in LongText-Bench and OneIG-Bench for English (left) and Chinese (right); the bar charts make the difference between the two benchmarks across length ranges directly visible.
Figure 7: Comparison between our proposed LongText-Bench and OneIG-Bench with respect to the length of rendered texts in English (left) and Chinese (right).
Benchmark Statistics (Figure 7 Analysis):
- Focus: LongText-Bench focuses on evaluating text rendering performance for longer texts in both English and Chinese, addressing the limitation of OneIG-Bench, which primarily uses shorter prompts.
- English Text Length:
  - LongText-Bench "short" category: text lengths are concentrated within 10-30 words.
  - LongText-Bench "long" category: predominantly falls within 30-50 words.
  - In comparison, OneIG-Bench English texts are generally shorter than both categories of LongText-Bench.
- Chinese Text Length:
  - LongText-Bench "short" category: most prompts contain 20 to 40 characters.
  - LongText-Bench "long" category: typically exceeds 60 characters in length.
  - Similar to English, OneIG-Bench Chinese texts are shorter overall.
- Conclusion: The increased text length in LongText-Bench effectively highlights X-Omni's superior capability in rendering more extensive textual content, especially in Chinese, where it significantly outperforms other models.
Prompt Construction:
- Methodology: Prompts are meticulously curated using an automatic pipeline followed by manual post-review (a sketch of the automatic stage follows this list).
- Scenarios: 8 common text-rich scenarios are defined: signboards, objects with labels, printed materials, web pages, slides, posters, captions, and dialogues.
- GPT-4o Generation: For each category, GPT-4o is instructed to generate 20 prompts: 10 with short text contents and 10 with longer text contents.
- Manual Review: After initial generation, each prompt is manually reviewed, and text lengths are adjusted to achieve a balanced distribution across the "short" and "long" categories.
- Total Prompts: This process results in a total of 160 prompts covering 8 categories for evaluating long text rendering tasks.
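The paper does not reproduce the exact instructions given to GPT-4o, so the following is only a plausible sketch of the automatic stage, assuming the OpenAI Python client; the category list comes from the benchmark description, while the prompt wording and word-count hints are illustrative assumptions.

```python
from openai import OpenAI

CATEGORIES = ["signboards", "objects with labels", "printed materials", "web pages",
              "slides", "posters", "captions", "dialogues"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_prompts(category, length_hint, k=10):
    """Ask GPT-4o for k text-to-image prompts that quote the exact text to render."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (f"Write {k} text-to-image prompts for the scenario '{category}'. "
                        f"Each prompt must quote the exact text to appear in the image, "
                        f"containing {length_hint}. Output one prompt per line."),
        }],
    )
    return response.choices[0].message.content.strip().splitlines()

# 8 categories x (10 short + 10 long) = 160 draft prompts, followed by manual review.
drafts = {c: draft_prompts(c, "roughly 10-30 words") + draft_prompts(c, "roughly 30-50 words")
          for c in CATEGORIES}
```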
Evaluation Metric:
- OCR Model: Qwen2.5-VL-7B [68] is used as the OCR model to parse the generated texts.
- Metric: The Text Accuracy score is adopted as the final metric for LongText-Bench.
- Rationale for Text Accuracy: For long-text scenarios, the standard Edit Distance is deemed unsuitable because OCR results may not preserve the relative order of different text segments (e.g., multiple lines of text may be returned out of order even when the individual words are correct). Text Accuracy likely provides a more robust measure of overall correctness in such complex scenarios (an illustrative sketch of the metric follows this list).
- Sampling: Four images are generated for each prompt to minimize the impact of random errors during either image generation or OCR recognition.
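Since the exact formula for Text Accuracy is not spelled out here, the following is only a plausible sketch under the stated rationale: OCR-extracted words are matched against the target text without regard to order, and scores are averaged over the four images sampled per prompt. `run_ocr` stands in for the Qwen2.5-VL-7B OCR step.

```python
from collections import Counter

def text_accuracy(target_text, ocr_text):
    """Fraction of target words recovered by OCR, ignoring word order (illustrative)."""
    target = Counter(target_text.lower().split())
    found = Counter(ocr_text.lower().split())
    matched = sum(min(n, found[word]) for word, n in target.items())
    return matched / max(1, sum(target.values()))

def score_prompt(target_text, images, run_ocr):
    """Average over several generations (the paper samples 4 images per prompt)."""
    return sum(text_accuracy(target_text, run_ocr(img)) for img in images) / len(images)
```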
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully demonstrates that discrete autoregressive models for image generation can achieve state-of-the-art results by leveraging reinforcement learning (RL). The proposed X-Omni framework unifies image and language modeling within a single autoregressive architecture, composed of a semantic image tokenizer, a unified autoregressive model, and an offline diffusion decoder. Through a sophisticated GRPO-based RL training, guided by a multi-faceted reward system (including human preference, unified quality, text-image alignment, and OCR accuracy), X-Omni effectively mitigates the long-standing issues of low visual fidelity, distorted outputs, and poor instruction adherence that have plagued prior discrete AR models. Key achievements include producing high aesthetic quality images, robustly following complex instructions, and exceptionally accurate rendering of long texts in both English and Chinese. Furthermore, X-Omni uniquely operates without reliance on classifier-free guidance (CFG) during autoregressive sampling, indicating a higher intrinsic consistency and efficiency in its generative process.
7.2. Limitations & Future Work
The paper itself does not explicitly list a "Limitations" section. However, based on the challenges in the field and the solutions proposed, potential limitations of X-Omni or areas for future work can be inferred:
- Complexity of Reward System: While the multi-faceted reward system is a strength, it integrates several components (HPSv2, Unified Reward, Qwen2.5-VL-32B, GOT-OCR2.0, PaddleOCR) and is therefore substantial to develop, maintain, and scale. Future work might explore simpler or more generalized reward models.
- Computational Cost of RL: Although GRPO sidesteps a critic network, generating multiple rollouts for each prompt, decoding them with a diffusion model, and evaluating them with a comprehensive reward system can still be computationally intensive. Scaling RL training to even larger datasets or models might pose challenges.
- Reliance on External Models: X-Omni relies on several frozen, pre-trained external models (SigLIP2-g, Qwen2.5-1.5B, FLUX.1-dev, Qwen2.5-VL-32B, GOT-OCR2.0, PaddleOCR). While effective, this creates a dependency chain and assumes the capabilities of these external models are sufficient and perfectly aligned. End-to-end learning of all components could be a future direction, albeit a more challenging one.
- Generalizability to Novel Tasks: While X-Omni shows strong performance on generation and understanding, its generalizability to a wider array of multimodal tasks (e.g., video generation, 3D content generation, advanced interactive reasoning) might require further adaptation.
- Prompt Engineering for RL: The curation of diverse prompts for RL (Midjourney, text-rich, natural images) is crucial. Exploring more automated or adaptive ways to generate effective RL prompts could be valuable.
The paper's success in "making discrete autoregressive image generative models great again" inherently suggests future work will focus on:
- Further scaling the unified autoregressive model.
- Exploring the limits of RL in refining multimodal generation.
- Applying this unified AR framework to even more complex multimodal tasks beyond static image generation.
- Developing even more advanced and efficient semantic tokenizers.
7.3. Personal Insights & Critique
X-Omni is a highly impactful paper that successfully challenges a prevailing paradigm in multimodal AI. For a considerable period, the limitations of discrete autoregressive models in image generation seemed insurmountable, leading many researchers to pivot towards diffusion-based or hybrid approaches. This paper's core innovation lies in demonstrating that Reinforcement Learning is the critical missing piece to unlock the full potential of discrete autoregressive modeling for images.
My key insights are:
- The Power of Holistic Feedback: The paper effectively argues that supervised fine-tuning (SFT) is fundamentally limited when a distribution mismatch exists between intermediate representations (AR tokens) and the final desired output (a high-quality decoded image). RL, by providing holistic, end-to-end feedback based on the actual quality of the generated image, can directly optimize for the downstream task, effectively bridging the AR-diffusion gap. This insight is generalizable beyond image generation to any sequential generative task where intermediate steps have complex interactions with the final quality.
- Strategic Reward System Design: The success of the RL component hinges on a comprehensive and well-designed reward system. By combining human preference, unified quality, text-image alignment, and OCR accuracy signals, X-Omni effectively guides the model to optimize for multiple critical aspects simultaneously. This multi-objective reward engineering is a crucial takeaway for applying RL to complex creative tasks.
- Unified Architectures are Still Viable: The paper makes a compelling case for unified autoregressive architectures for vision and language. The ability to manage both modalities with a single next-token-prediction mechanism not only simplifies the architecture but also promotes knowledge transfer and seamless multi-turn interactions between generation and understanding. This contrasts with hybrid models that often treat vision and language as separate, albeit interacting, components.
- The Significance of CFG Independence: The finding that X-Omni operates effectively without classifier-free guidance (CFG) is particularly insightful. CFG is often seen as a necessary "hack" to boost quality in generative models. X-Omni's independence from it suggests that its internal representation learning and generative process are more intrinsically aligned and robust, a testament to the efficacy of its RL training in making the model's unconditional generation high-quality and consistent. This could pave the way for more efficient and fundamentally better-designed generative models.
- Long Text Rendering as a Benchmark: The introduction of LongText-Bench is an important contribution. Rendering long, accurate text within images has been a persistent challenge, and by creating a dedicated benchmark, the paper highlights a critical capability that many models (including GPT-4o) struggle with and X-Omni excels at. This is particularly valuable for applications requiring precise textual integration in visual content.
Potential issues or areas for improvement:
- Interpretability of RL Policy: While RL improves performance, understanding why the AR model makes certain token choices under the RL policy can be more opaque than with pure SFT. Further research into the interpretability of RL-trained generative policies could be beneficial.
- Generalization to Diverse Styles/Domains: While aesthetic quality is high, the reward system might implicitly favor certain aesthetic styles or content types seen during training. Testing generalization to highly diverse or niche artistic styles could be an interesting future exploration.
- Cold Start Problem for RL: The paper starts with a heavily pre-trained and SFT-tuned model before RL. The cold start problem for RL in generative models (i.e., training from scratch with RL) remains a significant challenge.

In conclusion, X-Omni provides a strong foundation for future research in unified multimodal generation, proving that with strategic application of reinforcement learning, discrete autoregressive models can indeed be "made great again" and push the boundaries of what is possible in AI.