X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
TL;DR Summary
X-Omni employs reinforcement learning to enhance discrete autoregressive image generation, integrating a semantic tokenizer, a unified language-image model, and an offline diffusion decoder to improve visual fidelity and instruction adherence.
Abstract
Numerous efforts have been made to extend the ``next token prediction'' paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
1.2. Authors
Zigang Geng*, Yibing Wang*, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu†, Xiaosong Zhang‡, Linus Di Wang, Jie Jiang *Equal contribution †Project Lead ‡Corresponding Author Affiliation: Tencent Hunyuan X
1.3. Journal/Conference
This paper is published as a preprint on arXiv. The arXiv platform is a widely recognized and influential repository for research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published on arXiv are typically peer-reviewed later for formal conference or journal publication, but the platform allows for rapid dissemination of research findings.
1.4. Publication Year
The paper was published on arXiv on 2025-07-29.
1.5. Abstract
This paper addresses the persistent challenges faced by autoregressive (AR) models using discrete tokens for image generation, such as low visual fidelity, distorted outputs, and difficulty in following complex instructions. These issues are often attributed to cumulative errors during AR inference and information loss from discretization. While recent trends have shifted towards hybrid models combining diffusion objectives for images and AR for language, this work re-establishes the viability of unified AR modeling. The authors demonstrate that reinforcement learning (RL) can effectively mitigate these artifacts and significantly enhance generation quality. Their proposed framework, X-Omni, integrates a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder. X-Omni, utilizing a 7B language model, achieves state-of-the-art performance in image generation, producing aesthetically high-quality images, precisely following instructions, and accurately rendering long texts.
1.6. Original Source Link
https://arxiv.org/abs/2507.22058 Publication Status: Preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2507.22058v1.pdf
2. Executive Summary
2.1. Background & Motivation
The "next token prediction" paradigm has revolutionized language modeling, leading to powerful large language models (LLMs). Researchers have attempted to extend this paradigm to visual content, aiming for a unified approach to image generation and understanding. However, directly applying autoregressive (AR) modeling with discrete image tokens has consistently faced significant hurdles. These include:
- Low visual fidelity: Generated images often lack detail and realism.
- Distorted outputs: Images can appear structurally incorrect or unnatural.
- Failure to adhere to complex instructions: The models struggle to precisely render intricate details or follow nuanced prompts.
These shortcomings are primarily theorized to arise from two sources:
- Cumulative errors during autoregressive inference: As tokens are generated sequentially, small errors at early steps can propagate and amplify, leading to significant degradation in later parts of the image.
- Information loss during the discretization process: Converting continuous visual information into discrete tokens (quantization) inevitably involves some loss of detail, which can limit the ultimate quality of the generated image.
Due to these challenges, recent research has increasingly moved away from fully unified AR modeling for images. Instead, the trend has been towards hybrid approaches that jointly train image generation with diffusion objectives and language generation with autoregressive objectives. While these hybrid models can achieve high fidelity, they introduce architectural and modeling heterogeneity, posing challenges for deep integration of semantic capabilities and knowledge transfer across modalities.
The core problem the paper aims to solve is to "make discrete autoregressive image generative models great again" by addressing their inherent limitations, thereby enabling a truly unified approach for both image and language generation within a single framework. The paper's innovative idea is that reinforcement learning (RL) can be a powerful tool to overcome these long-standing issues in discrete AR image generation.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Demonstration of RL's Efficacy in Discrete AR Image Generation: The most pivotal contribution is showing that reinforcement learning can effectively mitigate artifacts, reduce cumulative errors, and largely enhance the generation quality of discrete autoregressive image models. This directly addresses the main limitations that pushed research away from unified AR approaches.
- Introduction of X-Omni Framework: The paper proposes X-Omni, a novel framework for unified image and language modeling. This framework comprises:
  - A semantic image tokenizer, which converts images into discrete tokens while preserving rich semantic information.
  - A unified autoregressive model, a single model capable of processing and generating both language and image tokens.
  - An offline diffusion decoder, used for reconstructing high-fidelity images from the generated discrete tokens.
- State-of-the-Art Performance: X-Omni achieves state-of-the-art performance in image generation tasks using a relatively compact 7B language model. It produces images with high aesthetic quality and demonstrates robust capabilities in following complex instructions.
- Exceptional Long Text Rendering: A key finding and capability of X-Omni is its ability to accurately render long texts in both English and Chinese, a notoriously difficult task for text-to-image models. The paper introduces a new benchmark, LongText-Bench, to evaluate this specific capability.
- Independence from Classifier-Free Guidance (CFG): X-Omni is shown to generate high-quality images without relying on classifier-free guidance (CFG) during autoregressive sampling. This not only reduces computational costs but also indicates a higher intrinsic consistency in the model's generation process for visual and language tokens, which is a unique advantage over other AR models.
- Holistic Optimization through RL: The work highlights that RL provides comprehensive supervision throughout the sampling process, bridging the distribution gap between AR-generated tokens and the diffusion decoder's expectations, which supervised fine-tuning (SFT) alone struggles with. This holistic optimization is crucial for performance gains.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Autoregressive Models and Next Token Prediction
Autoregressive models are a class of statistical models that predict future values based on past values. In the context of machine learning, especially for sequences like text or tokens, an autoregressive model predicts the next element in a sequence given all preceding elements. This process is often called "next token prediction."
- Next Token Prediction: Imagine writing a sentence word by word. An autoregressive language model works similarly: given "The quick brown fox," it predicts the next word "jumps." This prediction is based on the probability distribution over all possible next words, learned from vast amounts of text data. The model then appends the predicted word and continues the process.
- Sequential Generation: This sequential nature means that each prediction depends on all previous predictions. While powerful for coherent text, in image generation, errors can accumulate, leading to degraded quality over longer sequences.
- Application in Language (LLMs): Large Language Models (LLMs) like GPT-series are prime examples of autoregressive models, excelling at generating human-like text, translating, summarizing, and answering questions by predicting the most probable next word or token.
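The sequential dependence described above can be made concrete with a minimal sampling loop; `model` is a hypothetical callable returning vocabulary logits, not X-Omni's actual interface.

```python
import numpy as np

def sample_autoregressively(model, prompt_tokens, max_new_tokens, temperature=1.0):
    """Generate tokens one at a time, each conditioned on all previously generated tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = np.asarray(model(tokens), dtype=np.float64)  # hypothetical: scores over the vocabulary
        logits = (logits - logits.max()) / temperature        # stabilize before softmax
        probs = np.exp(logits)
        probs /= probs.sum()
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens.append(next_token)                             # becomes context for the next step
    return tokens
```

Because every step conditions on previously sampled tokens, an early mistake remains in the context for all later steps, which is precisely the cumulative-error problem discussed throughout this analysis.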
3.1.2. Discrete Tokens vs. Continuous Tokens
In generative models, especially for images, visual information can be represented in different forms:
- Continuous Tokens (Latent Codes): These are typically floating-point vectors that represent features of an image in a continuous latent space. They allow for fine-grained variations but can be harder to interpret and might struggle with complex, multimodal distributions without strong distributional assumptions (e.g., Gaussian). Many early visual language models (VLMs) used continuous tokens, often optimized with Mean Squared Error (MSE) loss or cosine similarity.
- Discrete Tokens: These are integer indices, similar to words in a vocabulary. Images are first passed through a tokenizer that quantizes (maps) continuous visual features into a finite set of discrete codes (a "codebook"). This approach allows for modeling more complex, non-Gaussian distributions and aligns well with the next token prediction paradigm of language models. However, the quantization process inherently involves information loss, which can limit the level of detail in generated images. The challenge is to make this discretization compact enough to be efficient yet rich enough to retain visual quality.
3.1.3. Diffusion Models
Diffusion models are a class of generative models that have achieved state-of-the-art results in image generation. They work by learning to reverse a gradual diffusion process (noise addition).
- How they work:
- Forward (Diffusion) Process: A diffusion model starts with a clean image and gradually adds Gaussian noise to it over many steps, eventually transforming it into pure noise.
- Reverse (Denoising) Process: The model is trained to learn the reverse of this process: starting from pure noise, it iteratively removes noise in small steps, guided by a text prompt or other conditions, until a coherent image emerges.
- Strengths: Diffusion models are excellent at producing high-fidelity and diverse images.
- Role in X-Omni: In X-Omni, a pre-trained diffusion decoder acts as the final stage, taking the discrete semantic tokens generated by the autoregressive model and reconstructing them into high-resolution pixel images. This means the AR model does not generate pixels directly but semantic codes that guide the diffusion process.
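A rough numerical illustration of the two processes (not the FLUX.1-dev implementation): the forward step mixes a clean image with Gaussian noise, and generation repeatedly applies a hypothetical `denoise_step` network in reverse.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t):
    """Forward process: mix a clean image with Gaussian noise at cumulative noise level alpha_bar_t."""
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def generate(denoise_step, shape, num_steps):
    """Reverse process: start from pure noise and iteratively remove a little noise per step."""
    x = np.random.randn(*shape)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)  # hypothetical trained network predicts a slightly cleaner image
    return x
```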
3.1.4. Reinforcement Learning (RL)
Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment to maximize the cumulative reward.
- Agent, Environment, Action, Reward, Policy:
- Agent: The model (e.g., the autoregressive image generator in X-Omni) that makes decisions.
- Environment: The external system with which the agent interacts (e.g., the process of generating image tokens and then decoding them into an image).
- Action: The output generated by the agent (e.g., predicting the next discrete image token).
- Reward: A scalar feedback signal from the environment that indicates how good or bad the agent's action was (e.g., a score representing image quality, alignment, or text rendering accuracy).
- Policy (π): A strategy that the agent uses to determine its next action based on the current state. In X-Omni, this is the autoregressive model's token prediction probabilities.
- Goal: The agent's goal is to learn a policy that maximizes the total expected reward over time.
- Role in X-Omni: RL is used to fine-tune the autoregressive model. Instead of relying solely on supervised fine-tuning (SFT), which predicts ground truth tokens, RL allows the model to explore different token sequences and receive feedback (rewards) on the quality of the final generated image. This helps the AR model generate tokens that are better aligned with the diffusion decoder's expectations and human preferences, effectively mitigating cumulative errors.
3.1.5. Vector Quantization (VQ)
Vector Quantization (VQ) is a signal processing technique used for data compression. In the context of image generation, it's used to discretize continuous visual features.
- How it works:
  - Encoder: A neural network (e.g., the ViT in X-Omni) processes a continuous input (an image) and outputs a continuous feature vector (embedding).
  - Codebook: A fixed-size dictionary (or codebook) contains a set of learned discrete "codes" or "prototypes" (e.g., 16,384 vocabulary size, 2,048 dimension in X-Omni).
  - Quantization: For each continuous feature vector from the encoder, the vector quantizer finds the closest code vector in the codebook (e.g., using Euclidean distance) and replaces the continuous vector with the index of that closest code. This index is the discrete token.
- Benefit: VQ enables the use of discrete tokens for images, making them compatible with autoregressive models designed for discrete sequences (like text).
- Challenge: The choice of codebook size and the quality of the learned codes directly impact the information loss during discretization.
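A minimal sketch of the quantization step just described, assuming the codebook is stored as a plain array; the toy sizes below stand in for the 16,384 × 2,048 codebook quoted for X-Omni.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry.

    features: (N, D) encoder outputs; codebook: (K, D) learned code vectors
    (e.g. K = 16384, D = 2048 for the numbers quoted above).
    """
    # ||f - c||^2 = ||f||^2 + ||c||^2 - 2 f.c, computed without a huge intermediate tensor.
    dists = (features ** 2).sum(1, keepdims=True) + (codebook ** 2).sum(1) - 2.0 * features @ codebook.T
    return dists.argmin(axis=1)  # (N,) integer "discrete tokens"

codebook = np.random.randn(512, 64).astype(np.float32)   # small toy codebook
tokens = quantize(np.random.randn(4, 64).astype(np.float32), codebook)
```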
3.1.6. Transformer Architecture
The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which has become the de-facto standard for sequence modeling tasks, particularly in natural language processing.
- Self-Attention Mechanism: The core innovation of the Transformer is the self-attention mechanism. It allows the model to weigh the importance of different parts of the input sequence when processing each element. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the dot-product similarity between queries and keys, representing attention scores.
  - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax into regions with extremely small gradients.
  - $\mathrm{softmax}$ normalizes the scores to create probability distributions.
  - The result is a weighted sum of the value vectors, where the weights are the attention scores.
- Positional Encoding: Since Transformers process sequences in parallel and lack an inherent notion of order, positional encodings are added to the input embeddings to inject information about the relative or absolute position of tokens in the sequence. X-Omni uses 1D RoPE (Rotary Positional Embedding), which applies a rotation matrix to the query and key vectors before their dot product, effectively embedding relative position information.
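The attention formula above translates almost line-for-line into code; this minimal NumPy version is for illustration only (single head, no masking).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # weighted sum of value vectors

# Example with a sequence of 5 tokens and d_k = 8.
Q = K = V = np.random.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
```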
3.2. Previous Works
3.2.1. Discrete Token Autoregressive (AR) Models for Image Generation
Early pioneers sought to extend the next token prediction paradigm to images by converting them into sequences of discrete tokens.
- iGPT (Image GPT) and DALL-E: These models were among the first to demonstrate the feasibility of generating images autoregressively from text. DALL-E, in particular, showed impressive capability in generating images from diverse text prompts.
- Limitations: Despite their groundbreaking nature, these early models (and even subsequent improvements like Chameleon [15] and Emu3 [12]) were plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions. This was primarily attributed to cumulative errors during sequential generation and information loss during the discretization process, which led to a perception that discrete AR models were inherently limited in image quality.
3.2.2. Continuous Token Autoregressive Models
Some approaches [10, 11, 34] focused on autoregressively predicting continuous visual tokens or using query tokens to predict them in parallel.
- Characteristics: These methods often rely on MSE loss or cosine similarity for supervision and make distributional assumptions (e.g., Gaussian) about the latent space.
- Limitations: These assumptions can limit the range of image distributions they can effectively represent. Furthermore, the paper notes that the means by which reinforcement learning can be integrated into continuous token methods remains unclear, potentially constraining their upper limits.
3.2.3. Hybrid AR and Diffusion Models
Given the limitations of pure AR models for images and the rise of diffusion models for high-fidelity generation, a new trend emerged: combining the strengths of both.
- Serial Architectures [13, 28, 27]: In this setup, an autoregressive model might generate query tokens or continuous conditions that then guide a separate diffusion model's generation process.
- Parallel Architectures [25, 40, 24, 32, 41, 42, 31, 33, 30]: More prevalent, these involve the autoregressive model (processing language) and the diffusion model (processing visual latents) performing forward passes simultaneously, with information exchanged via self-attention layers.
- Limitations: While achieving high fidelity, these hybrid approaches often suffer from a loose connection between the two components and a cross-modal modeling mismatch. The architectural heterogeneity makes integrating robust semantic capabilities challenging. Additionally, reinforcement learning techniques for diffusion models are less mature compared to those for AR models, limiting the application of RL in optimizing these hybrid models.
3.2.4. Reinforcement Learning for Generative Models
RL has been explored for optimizing generative models, but predominantly for diffusion models.
- RL for Diffusion Models:
  - Reward gradients [43, 44, 45]: Fine-tuning diffusion models using differentiable rewards.
  - Reward-weighted negative log-likelihood [46, 47]: Optimizing through specific objectives.
  - Extensions of DPO [59], PPO [60], and GRPO [61] to diffusion models [48, 47, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58].
- Limitations: The paper argues that the potential of RL for diffusion models remains limited due to insufficient diversity in sampling.
- Contrast with X-Omni: X-Omni, in contrast, applies RL to autoregressive generative models, leveraging the greater maturity and applicability of RL techniques in this domain to achieve strong results.
3.3. Technological Evolution
The evolution of generative AI for images has seen a journey from initial attempts at direct pixel generation (e.g., GANs), to autoregressive models tokenizing images for sequential generation (e.g., DALL-E), then the emergence of diffusion models offering superior quality and diversity. As AR models faced quality and instruction-following limitations, research largely shifted to hybrid diffusion-AR architectures. This paper represents a pivotal moment in this timeline, aiming to demonstrate that with the right technique (Reinforcement Learning), the original discrete autoregressive paradigm can be revived and made competitive, even superior, to hybrid approaches, fostering a truly unified modeling approach. X-Omni fits within this timeline as an effort to bring full unification back into focus by solving the inherent problems of discrete AR models.
3.4. Differentiation Analysis
Compared to the main methods in related work, X-Omni offers several core differences and innovations:
- Unified AR Rejuvenation: Unlike the trend of moving away from unified AR models towards hybrid AR-diffusion architectures (e.g., MetaQuery [28], UniWorld-V1 [29], OmniGen2 [31], Show-o2 [32]) that maintain separate generative mechanisms, X-Omni advocates for a discrete autoregressive framework for image generation that unifies the modeling of text and images. It aims to make next token prediction truly effective for both modalities in a single model.
- RL for AR (not Diffusion): The paper uniquely applies reinforcement learning to directly optimize the autoregressive image token generation process. While RL has been explored for diffusion models, X-Omni leverages the maturity of RL for AR methods to mitigate cumulative errors and align the token distribution with the diffusion decoder, a gap that SFT alone cannot bridge. This is a crucial distinction from models that use RL primarily on diffusion components.
- Semantic-Rich Tokenization: X-Omni employs semantic encoders like SigLIP 2 to generate discrete tokens. This contrasts with approaches that rely on lossy quantization without explicit semantic guidance (e.g., Chameleon [15], Emu3 [12]) or those that reconstruct semantic features for discretization but still face a distribution gap (e.g., LaViT [38]). X-Omni's approach, combined with RL, aims to ensure these semantic tokens are optimized for both generation and understanding.
- Elimination of CFG Dependency: A significant innovation is that X-Omni can generate high-quality images without relying on classifier-free guidance (CFG) in the autoregressive component. This is a stark contrast to other autoregressive image generation models (e.g., Emu3 [12], Janus-Pro [21]) that rely heavily on CFG to enhance sampling quality. This suggests X-Omni has a more intrinsically aligned and consistent generative process.
- Strong Instruction and Long Text Following: X-Omni demonstrates robust capability in following complex instructions and accurately rendering long texts, an area where many prior unified models struggle, particularly for non-English languages. The introduction of LongText-Bench specifically highlights this strength.
- Streamlined Architecture for Multi-turn Interaction: By using isomorphic autoregressive modeling for both image and text, X-Omni allows seamless integration of image generation and image understanding within a unified network. This eliminates the need to re-extract semantic embeddings from generated images for understanding purposes in multi-turn interactions, making it more streamlined than approaches with heterogeneous image encoders [21, 30, 33].
4. Methodology
4.1. Principles
The core principle behind X-Omni is to revive and significantly enhance the capabilities of discrete autoregressive models for unified image generation and understanding by leveraging Reinforcement Learning (RL). The intuition is that while next token prediction offers a powerful paradigm for sequential data, its application to complex visual content suffers from cumulative errors and a distribution gap between the autoregressively generated tokens and what a high-fidelity image decoder expects. Supervised fine-tuning (SFT) alone struggles to bridge this gap because it optimizes for ground truth tokens, not the quality of the final decoded image. RL, by providing holistic, reward-based feedback on the decoded image quality, can guide the autoregressive model to generate token sequences that are intrinsically more compatible with the image decoder and better aligned with human preferences and instructions. This effectively mitigates error propagation and leads to higher quality, more controllable image generation within a unified, text-like token framework.
4.2. Core Methodology In-depth (Layer by Layer)
The X-Omni framework integrates image and text tokens within a unified autoregressive architecture, depicted in Figure 3. It consists of three main components: a semantic image tokenizer, a unified autoregressive model, and an offline diffusion decoder.
This image is a schematic of the X-Omni architecture, showing the unified autoregressive model acting as the hub between the text tokenizer and SigLIP-VQ, and producing text via detokenization and images via the diffusion decoder.
Figure 3: The architecture of X-Omni.
4.2.1. Image Tokenization
The first step in processing images for the autoregressive model is tokenization, converting continuous visual information into discrete tokens.
- Goal: To convert continuous images into discrete tokens while retaining rich semantic information, rather than focusing solely on pixel-level reconstruction.
- Visual Semantic Extractor: X-Omni uses a pre-trained SigLIP2-g [62] Vision Transformer (ViT) as the core visual semantic extractor. This ViT is known for its strong visual understanding capabilities.
- Vector Quantizer (VQ): Building on the ViT encoder, a vector quantizer is incorporated to perform the discretization. This VQ has:
  - A codebook of 16,384 vocabulary size.
  - Code vectors of dimension 2,048.
- Alignment with LLM: The SigLIP-VQ tokenizer is aligned with a pre-trained large language model (LLM), Qwen2.5-1.5B [9], on visual understanding tasks. This ensures that the discrete tokens carry meaningful semantic information that the LLM can interpret.
- Adapter: A residual block serves as an adapter between the visual tokenizer and the LLM, facilitating smooth information transfer.
- Frozen Components: Both the visual encoder and the vector quantizer are kept frozen during subsequent training phases (pre-training, supervised fine-tuning, and reinforcement learning). This ensures stability and consistency in the tokenization process, meaning the mapping from images to discrete tokens remains fixed.
4.2.2. Autoregressive Modeling
Once images are tokenized into discrete tokens, both visual and language tokens are naturally unified within a single autoregressive architecture for multimodal modeling.
- Base Model: X-Omni adopts Qwen2.5-7B [9] as its base pre-trained language model.
- Integrating Visual Perception: To enable this text-based LLM to handle visual information:
  - Vision-Specific Blocks: Four randomly initialized vision-specific blocks are inserted both before and after the original Transformer layers of the Qwen2.5-7B model.
  - Structure: These vision-specific blocks have the same structural configuration as standard Transformer blocks (e.g., self-attention, feed-forward networks).
  - Functionality: Crucially, these blocks operate exclusively on image tokens, leaving text tokens unaffected. This allows the model to learn visual-specific patterns without interfering with its linguistic capabilities.
  - Embedding and Classification Heads: Randomly initialized embedding layers and classification heads are introduced specifically for image tokens.
- Architectural Efficiency: These modifications introduce no extra infrastructure complexity compared to the original language model and maintain full compatibility with distributed training strategies (tensor, pipeline, and context parallelism).
- Unified Multimodal Sequence: For training, visual tokens and language tokens are concatenated into a unified multimodal sequence, which is then fed into the autoregressive model for next-token prediction training.
  - Supervision:
    - For understanding tasks (e.g., VQA, captioning), only the language tokens are supervised. The model learns to predict the next word in a text sequence given an image.
    - For generation tasks (e.g., text-to-image), only the visual tokens are supervised. The model learns to predict the next image token given a text prompt.
- Arbitrary Resolution Handling: To support various image resolutions, a resolution information prefix is prepended to the visual tokens (a toy assembly sketch appears at the end of this subsection). The prefix format contains:
  - Special tokens denoting the start and end of the multimodal sequence.
  - height and width: text tokens representing the spatial dimensions of the 2D image tokens.
  - A special token indicating the start of the flattened image tokens.
  - visual tokens: the flattened sequence of discrete image tokens, with a length equal to height multiplied by width.
- Positional Encoding: The model utilizes 1D RoPE [63] (Rotary Positional Embedding), consistent with the original language model, instead of requiring additional 2D positional encoding for images. This allows the model to inherently understand the relative positions of tokens within the sequence, whether text or image.
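As a toy illustration of the sequence layout just described, the following sketch interleaves text tokens, a textual resolution prefix, and flattened image tokens. The special-token names `<som>`, `<soi>`, and `<eom>` are placeholders invented for this example (the source does not preserve the actual token strings), and `tokenize_text` is a hypothetical helper.

```python
def build_multimodal_sequence(text_tokens, image_tokens_2d, tokenize_text):
    """Concatenate text tokens with a resolution-prefixed, flattened image-token grid.

    image_tokens_2d: list of rows of discrete image-token ids (height x width).
    tokenize_text:   callable turning a string into text-token ids (hypothetical).
    """
    height, width = len(image_tokens_2d), len(image_tokens_2d[0])
    flat_image_tokens = [tok for row in image_tokens_2d for tok in row]  # length = height * width
    return (
        ["<som>"]                             # placeholder: start of the multimodal sequence
        + list(text_tokens)
        + tokenize_text(f"{height} {width}")  # resolution written as ordinary text tokens
        + ["<soi>"]                           # placeholder: start of the flattened image tokens
        + flat_image_tokens
        + ["<eom>"]                           # placeholder: end of the multimodal sequence
    )
```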
4.2.3. Diffusion Decoder
The final stage of image generation involves reconstructing high-resolution images from the discrete semantic tokens generated by the autoregressive model.
- Choice of Decoder: A well-pretrained diffusion model is used for this purpose; specifically, FLUX.1-dev [64] is chosen.
- Integration:
  - A linear layer is added to map the 2,048-dimensional semantic embedding tokens (output by the VQ) to the feature channel dimensions of the FLUX.1-dev model.
  - These mapped features are then integrated into the intermediate layer features of the diffusion model.
- Input and Objective: The diffusion decoder uses the semantic tokens extracted by the image tokenizer (and subsequently generated by the AR model) as input conditions. It is trained with an image reconstruction objective, learning to turn these semantic codes into a high-fidelity image. This component operates offline: its training is separate from the AR model's primary training, and it acts as a fixed "render engine" during the RL phase.
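The conditioning path can be pictured as a learned projection followed by feature injection. The sketch below assumes PyTorch; the decoder feature width (3072) and the additive injection are illustrative assumptions rather than the actual FLUX.1-dev integration.

```python
import torch
import torch.nn as nn

class SemanticCondition(nn.Module):
    """Project 2048-d semantic token embeddings to the decoder's feature width."""
    def __init__(self, semantic_dim=2048, decoder_dim=3072):  # decoder_dim is an assumed value
        super().__init__()
        self.proj = nn.Linear(semantic_dim, decoder_dim)

    def forward(self, semantic_embeddings, decoder_features):
        # semantic_embeddings: (batch, num_tokens, 2048) looked up from the VQ codebook
        # decoder_features:    (batch, num_tokens, decoder_dim) intermediate diffusion features
        return decoder_features + self.proj(semantic_embeddings)  # simple additive injection (assumption)
```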
4.2.4. Reinforcement Learning with GRPO
The critical innovation of X-Omni is the application of reinforcement learning to fine-tune the autoregressive model. This is done to address two main challenges:
- Bridging the Distribution Gap: Supervised training optimizes the AR model to predict ground truth tokens. However, these ground truth tokens might not be perfectly aligned with the semantic embeddings that the diffusion decoder expects for optimal image reconstruction. RL helps bridge this gap.
- Mitigating Error Propagation: The sequential nature of AR generation means early errors can accumulate. RL provides a holistic reward signal based on the final image quality, guiding the model to make better decisions throughout the entire sampling process.
The overall pipeline for the reinforcement learning process is illustrated in Figure 2(a).
This image is Figure 2, comprising a diagram of the reinforcement learning pipeline, training curves, and comparisons of generated results; it shows that as training steps increase, X-Omni's image generation reward quickly surpasses SFT BoN, while text rendering, aesthetic quality, and instruction-following ability improve progressively.
Figure 2: During the reinforcement learning process, X-Omni's image generation reward quickly surpassed the best-of-N results achieved by the SFT model (SFT BoN) and demonstrated steady improvement. This progress was reflected in the model's text rendering capabilities, the aesthetic quality of the generated images, and its ability to follow instructions, all of which improved gradually.
4.2.4.1. GRPO Algorithm
X-Omni adopts the Group Relative Policy Optimization (GRPO) [61] algorithm. This algorithm is chosen because it sidesteps the need for a separate critic network, which reduces computational overhead and simplifies the training process compared to some other RL algorithms (e.g., standard Proximal Policy Optimization (PPO)).
The steps for GRPO are:
- Trajectory Generation: For each text prompt $p$ sampled from the training distribution $\mathcal{D}$, the model generates a group of $G$ trajectories (sequences of autoregressive tokens), denoted as $\{o_i\}_{i=1}^{G}$. These trajectories are generated using the previous policy $\pi_{\theta_{\mathrm{old}}}$, which is a snapshot of the model's policy before the current update.
- Image Decoding: Each generated token trajectory $o_i$ is then passed through the fixed diffusion decoder to produce a corresponding image.
- Reward Scoring: Each image is evaluated by the reward function (described below) to yield a scalar reward $r_i$.
- Advantage Computation: The advantages $A_i$ for each trajectory are computed by normalizing this group of rewards, following the original GRPO procedure. This normalization helps stabilize training by providing a relative measure of goodness within the generated group.
- Policy Optimization: The current policy model $\pi_\theta$ is then optimized by maximizing the following objective function:
$ \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{p \sim \mathcal{D},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot|p)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)} A_i,\ \mathrm{clip}\left( \frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}}\right) \right) \right] $
Where:
- $\mathcal{J}_{GRPO}(\theta)$: The objective function for Group Relative Policy Optimization, which is to be maximized with respect to the current policy parameters $\theta$.
- $\mathbb{E}[\cdot]$: Expectation over prompts $p$ sampled from the distribution $\mathcal{D}$, and over groups of trajectories $\{o_i\}_{i=1}^{G}$ generated by the old policy $\pi_{\theta_{\mathrm{old}}}$ given prompt $p$.
- $G$: The number of trajectories generated for each prompt (group size).
- $\pi_\theta(o_i|p)$: The probability of generating trajectory $o_i$ given prompt $p$ under the current policy (which is being optimized).
- $\pi_{\theta_{\mathrm{old}}}(o_i|p)$: The probability of generating trajectory $o_i$ given prompt $p$ under the old policy (before the current update), used for importance sampling.
- $A_i$: The advantage score for trajectory $o_i$, derived from the normalized rewards. A positive advantage indicates that the trajectory was better than average for its group.
- $\min(\cdot,\cdot)$: This applies a clipping mechanism, similar to PPO. It takes the minimum of two terms:
  - $\frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)} A_i$: The standard policy gradient term, weighted by the ratio of new to old policy probabilities.
  - $\mathrm{clip}\left(\frac{\pi_\theta(o_i|p)}{\pi_{\theta_{\mathrm{old}}}(o_i|p)}, 1-\epsilon, 1+\epsilon\right) A_i$: A clipped version of the policy ratio. This term ensures that the policy updates do not deviate too far from the old policy in a single step, preventing large, unstable updates.
- $\epsilon$: A hyperparameter that defines the clipping range (e.g., 0.2).
- $\beta$: A hyperparameter that weights the KL divergence regularization term.
- $\mathbb{D}_{KL}(\pi_\theta \| \pi_{\theta_{\mathrm{ref}}})$: The Kullback-Leibler (KL) divergence between the current policy and a reference policy. This term acts as regularization, preventing the new policy from diverging too much from a stable reference policy, which could be the initial SFT model or an earlier version of the RL policy. It is estimated using an unbiased estimator [65].

This formulation helps efficiently fine-tune the policy by balancing reward maximization against divergence from a stable reference model, which is crucial for stable RL training.
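To ground these symbols, the sketch below computes group-relative advantages from a set of rewards and evaluates the clipped, KL-regularized objective on toy scalar values. It is a simplified illustration of the formula above (per-trajectory log-probabilities collapsed to scalars), not X-Omni's training code.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards (one per sampled trajectory) into advantages."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(logp_new, logp_old, advantages, kl, clip_eps=0.2, beta=0.01):
    """Clipped surrogate objective with a KL penalty, averaged over the group."""
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_old per trajectory
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return (surrogate - beta * kl).mean()            # quantity to maximize

# Example: one prompt, G = 4 decoded images scored by the reward system.
adv = group_advantages([0.71, 0.55, 0.62, 0.80])
obj = grpo_objective(np.array([-1.2, -0.9, -1.0, -0.8]),
                     np.array([-1.1, -1.0, -1.0, -0.9]),
                     adv, kl=np.array([0.02, 0.01, 0.015, 0.03]))
```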
4.2.4.2. Reward System
A comprehensive reward system is constructed to provide multifaceted supervision for the autoregressive model during RL. This system combines several specialized components, each targeting a distinct aspect of image generation quality. The individual reward signals are then combined through a weighted aggregation mechanism to form the final reward score.
- Human Preference Score (HPSv2) [66]:
  - Purpose: To assess the aesthetic quality and human preference alignment of generated images.
  - Mechanism: HPSv2 is a model specifically trained to predict human preferences for images. It generalizes well across various image distributions.
  - Limitation addressed: HPSv2 operates at a limited input resolution, whereas X-Omni specializes in high-resolution generation.
- Unified Reward Score [67]:
  - Purpose: To address the resolution limitation of HPSv2 and provide holistic feedback for human alignment, aggregating multiple quality dimensions into a unified score.
  - Mechanism: This model is designed for comprehensive human alignment assessment and is suitable for high-resolution images.
- Text-Image Alignment Score:
  - Purpose: To ensure semantic consistency between the input prompt and the generated image, and to minimize semantic hallucinations.
  - Mechanism: The Qwen2.5-VL-32B [68] vision-language model computes an alignment reward by assessing how accurately the generated image reflects the content described in the prompt, leveraging the VLM's sophisticated image understanding capabilities.
- OCR Accuracy Score:
  - Purpose: To address the critical challenge of text rendering accuracy within images, especially for prompts requiring textual generation.
  - Mechanism: For such prompts, an OCR-based reward quantifies the fidelity of rendered text against the ground truth. GOT-OCR2.0 [69] and PaddleOCR [70] are jointly employed to evaluate the generated images and calculate an accuracy score for text rendering.

This multi-faceted approach ensures that the model learns to balance various quality criteria simultaneously, leading to high-fidelity images that meet human expectations across multiple dimensions.
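As a schematic of the weighted aggregation step, the sketch below combines the component scores into one scalar. The equal weights and the rule of applying the OCR term only when text rendering is requested are illustrative assumptions; the paper's actual weighting is not reproduced here.

```python
def total_reward(hps, unified, alignment, ocr=None,
                 weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine individual reward signals into a single scalar score.

    hps, unified, alignment, ocr: component scores in [0, 1] from the respective models.
    ocr is only available for prompts that require text rendering.
    """
    w_hps, w_uni, w_align, w_ocr = weights
    score = w_hps * hps + w_uni * unified + w_align * alignment
    if ocr is not None:
        score += w_ocr * ocr
    return score
```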
5. Experimental Setup
5.1. Datasets
The training of X-Omni involves three main stages: pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL), each utilizing specific datasets.
5.1.1. Pre-training
The autoregressive model is pre-trained on a mixture of image generation and image understanding data.
- Image Generation Dataset:
  - Sources: Open-source datasets including COYO-700M [71], DataComp-1B [72], and LAION-2B [73].
  - Curation: The authors apply specific filters to these large datasets, collecting approximately 200M high-quality images.
  - Caption Enhancement: Recognizing the limitations of original captions (limited information, noise), the Qwen2.5-VL-72B model [68] is used to generate high-quality image-text pairs with dense captions. This ensures richer textual descriptions for training.
  - Image Resolution: All images are resized to a short edge of 384 pixels and a maximum long edge of 1152, maintaining their native aspect ratio.
  - Token Count: The final image generation dataset results in 600B multimodal tokens after tokenization.
- Image Understanding Dataset:
  - Sources: 59M data samples, including those used in LLaVA-OneVision [74], BLIP3-KALE [75], and Infinity-MM [76].
  - Image Resolution: The same resize strategy as for image generation data is applied.
  - Token Count: This dataset produces about 100B multimodal tokens for image understanding tasks.
5.1.2. Supervised Finetuning
After large-scale pre-training, a supervised fine-tuning (SFT) stage further refines the model.
- Data Sources:
  - 30K high-quality samples from BLIP3o-60k [27].
  - 30K synthetic text-to-image data.
  - High-quality pre-training tokens filtered from the pre-training dataset itself.
  - Mixed image understanding data from LLaVA-NeXt [77], Cauldron [78], and Cambrian-1 [79].
- Token Count: In total, 1.5B tokens are used for training during the SFT stage.
5.1.3. Reinforcement Learning
The reinforcement learning stage focuses on enhancing the model's capabilities through reward-based optimization.
- Input Requirement: The on-policy GRPO algorithm requires only text prompts as input, since the images are generated by the model itself during training and then evaluated.
- Dataset Curation: A diverse dataset of 180K prompt samples from three distinct categories is curated to comprehensively cover the desired capabilities:
  - 80K cleaned prompts from the Midjourney dataset [80]: randomly sampled to align training with real-world user preferences and expectations for creative content.
  - 50K text-rendering-focused prompts: to specifically enhance text rendering capabilities, a bucket-based sampling strategy is applied to text-rich image data. Prompts are sorted by text length into different buckets, and 10K samples are extracted from each, covering varied text lengths (a toy sketch of this bucketing appears at the end of this subsection).
  - 50K prompts from natural image data: randomly sampled to improve both aesthetic quality and instruction-following abilities.
- Rationale: This balanced composition provides a comprehensive training foundation for the RL phase, enabling the model to excel across diverse user interaction scenarios.
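As referenced above, the bucket-based sampling of text-rendering prompts can be pictured as follows. The bucket boundaries and the uniform draw are illustrative choices; only the idea of grouping prompts by rendered-text length and drawing 10K per bucket comes from the description.

```python
import random
from collections import defaultdict

def bucket_sample(prompts, text_lengths, boundaries=(20, 50, 100, 200), per_bucket=10_000):
    """Group prompts by the length of the text they ask to render, then sample evenly."""
    buckets = defaultdict(list)
    for prompt, length in zip(prompts, text_lengths):
        bucket_id = sum(length > b for b in boundaries)   # index of the length bucket
        buckets[bucket_id].append(prompt)
    sampled = []
    for bucket in buckets.values():
        sampled.extend(random.sample(bucket, min(per_bucket, len(bucket))))
    return sampled
```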
5.2. Evaluation Metrics
5.2.1. Text Rendering Metrics
For evaluating the model's ability to render text within images, two benchmarks are used: OneIG-Bench and the newly proposed LongText-Bench.
5.2.1.1. OneIG-Bench [81]
OneIG-Bench evaluates text rendering using a composite score derived from three metrics: Edit Distance, Completion Rate, and Word Accuracy.
- Edit Distance (Levenshtein Distance):
  - Conceptual Definition: Measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It quantifies the dissimilarity between two strings; a lower edit distance indicates higher accuracy (a reference implementation sketch appears after this metric list).
  - Mathematical Formula: For two strings $a$ and $b$ of lengths $|a|$ and $|b|$ respectively, the Levenshtein distance $lev(a, b)$ is given by: $ lev(a,b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ lev(a_{\text{tail}}, b_{\text{tail}}) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} lev(a_{\text{tail}}, b) \\ lev(a, b_{\text{tail}}) \\ lev(a_{\text{tail}}, b_{\text{tail}}) \end{cases} & \text{if } a[0] \neq b[0] \end{cases} $ (Note: this recursive definition is one common formulation; in practice it is implemented with dynamic programming for efficiency.)
  - Symbol Explanation:
    - $a$, $b$: The two strings being compared (e.g., ground truth text and OCR-recognized text).
    - $|a|$, $|b|$: The lengths of strings $a$ and $b$.
    - $a[0]$, $b[0]$: The first characters of strings $a$ and $b$.
    - $a_{\text{tail}}$, $b_{\text{tail}}$: The remainder of each string after its first character.
    - $lev(a, b)$: The Levenshtein distance between $a$ and $b$.
- Completion Rate:
- Conceptual Definition: Measures the proportion of target words from the prompt that are successfully recognized and rendered in the generated image. It quantifies how complete the text rendering is. A higher completion rate indicates more comprehensive text rendering.
- Word Accuracy:
- Conceptual Definition: Measures the percentage of correctly recognized words compared to the total number of words in the ground truth text. It is a direct measure of the correctness of individual words. A higher word accuracy indicates more precise text rendering.
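As referenced in the Edit Distance entry, a standard dynamic-programming implementation of the Levenshtein distance looks like this (a generic textbook version, not OneIG-Bench's evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))               # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # deleting i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```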
5.2.1.2. LongText-Bench
This is a novel benchmark proposed in the paper to specifically evaluate the model's ability to accurately render long texts.
- Conceptual Definition: It consists of 160 meticulously designed prompts covering 8 different scenarios aimed at evaluating the capacity to precisely render long Chinese and English texts. The benchmark uses a Text Accuracy score.
- Text Accuracy Score:
  - Conceptual Definition: For LongText-Bench, the Text Accuracy score is used as the final metric. This is because, for long texts where OCR recognition results might not preserve the relative order of different text segments, Edit Distance becomes less suitable. Text Accuracy likely refers to a modified word or phrase accuracy that accounts for correctness regardless of order, or a simple count of correctly identified words/phrases from the prompt.
  - OCR Model: Qwen2.5-VL-7B [68] is employed as the OCR model to parse the generated texts for evaluation.
  - Sampling: Four images are generated for each prompt to mitigate random errors in generation or OCR.
5.2.2. Text-to-Image Generation Metrics
The performance in general text-to-image generation is evaluated on two widely recognized benchmarks:
- DPG-Bench [90]:
  - Conceptual Definition: This benchmark evaluates the model's ability to follow complex generation instructions, often involving global attributes, entity-specific details, attributes of entities, and relations between entities. It assesses the faithfulness of the generated image to the prompt's detailed content.
- GenEval [91]:
  - Conceptual Definition: GenEval is an object-focused benchmark designed to evaluate text-to-image generation. It tests specific capabilities such as generating single objects, multiple objects, counting, color attributes, positional accuracy, and color attributes of specific objects, focusing on fine-grained control and understanding of compositional prompts.
5.2.3. Image Understanding Metrics
To assess image understanding capabilities, a comprehensive set of public benchmarks are used:
- POPE (Polling-based Object Probing Evaluation) [96]:
  - Conceptual Definition: Evaluates the tendency of Large Vision-Language Models (LVLMs) to hallucinate objects. It measures whether the model incorrectly claims the presence of common objects in an image. Higher POPE scores indicate less object hallucination.
- GQA (Visual Question Answering) [97]:
- Conceptual Definition: A dataset for real-world visual reasoning and compositional question answering. It requires models to perform complex reasoning over images to answer questions, including spatial understanding, object properties, and relationships. Scores typically reflect accuracy in answering questions.
- MMB (MMBench) [98]:
- Conceptual Definition: A comprehensive benchmark suite designed to evaluate the "all-around player" capability of multimodal models across various tasks and domains. It covers a broad range of multimodal understanding abilities. Scores are usually an aggregate performance measure.
- SEED (SEEDBench-Img) [99]:
  - Conceptual Definition: A benchmark specifically for Multimodal Large Language Models (MLLMs), focusing on diverse understanding tasks related to images, such as visual reasoning, fact-based Q&A, and fine-grained comprehension.
- DocVQA (Document Visual Question Answering) [100]:
- Conceptual Definition: A dataset and task focused on visual question answering over document images. It tests the model's ability to read and understand information presented in documents, often involving text extraction and reasoning on structured layouts.
- OCRB (OCRBench) [101]:
  - Conceptual Definition: A benchmark specifically designed to evaluate the Optical Character Recognition (OCR) capabilities of large multimodal models, assessing their ability to accurately extract and understand text from images. Higher scores indicate better OCR performance.
5.3. Baselines
The paper compares X-Omni against a range of models, categorized into "Gen. Only Models" (focused primarily on image generation) and "Unified Models" (aiming for both image generation and understanding).
- Generation Only Models:
  - FLUX.1-dev [64]
  - HiDream-I1-Full [82]
  - Kolors 2.0 [83]
  - Seedream 3.0 [84]
  - SDXL [86]
  - DALL-E [87]
  - SD3-medium [88]
- Unified Models:
  - Janus-Pro [21]
  - BLIP3-o [27]
  - BAGEL [30]
  - OmniGen2 [31]
  - Show-o2 [32]
  - GPT-4o [85]
  - Emu3 [12]
  - MetaQuery [28]
  - UniWorld-V1 [29]
  - Ovis-U1 [89]
  - Mogao [33]
  - LLaVA-1.5 [93] (for understanding-only comparison)
  - LLaVA-NeXT [77] (for understanding-only comparison)
  - VILA [94] (for understanding-only comparison)
  - MiniCPM-Llama3-V 2.5 [95] (for understanding-only comparison)
  - LLaVA-OneVision [74] (for understanding-only comparison)

These baselines represent a comprehensive selection of state-of-the-art models in both specialized image generation and unified multimodal AI, providing a strong context for X-Omni's performance claims.
5.4. Implementation Details
5.4.1. Pre-training
The autoregressive model undergoes a three-stage pre-training strategy:
- Stage 1 (Vision-Specific Blocks):
  - Only the randomly initialized vision-specific blocks and vision token embeddings are unfrozen and trained.
  - Data: Generation data.
  - Batch Size: 256.
  - Sequence Length: 16,384.
  - Steps: 10,000 steps.
  - Tokens Consumed: 42B tokens.
  - Learning Rate: held constant.
- Stage 2 (All Trainable Components):
  - All components in the autoregressive model are set to be trainable (unfrozen).
  - Data: A mixture of generation and understanding data.
  - Batch Size: 256.
  - Sequence Length: 16,384.
  - Steps: 150,000 steps.
  - Tokens Consumed: 629B tokens.
  - Learning Rate: held constant.
- Stage 3 (High-Quality Tokens with Annealing):
  - Data: 42B high-quality tokens filtered from the pre-training dataset.
  - Learning Rate: Learning rate annealing is applied, meaning the learning rate gradually decreases over time, helping the model converge better.
5.4.2. Supervised Finetuning
- Trainable Parameters: Only the learnable parameters of the autoregressive model are trained.
- Batch Size: 64.
- Sequence Length: 16,384.
- Tokens Trained: 1.5B tokens.
- Learning Rate Schedule: The learning rate is gradually reduced from its initial value to 0 during this stage.
- Consistency: Sequence packing (how sequences are combined to fill the context window) remains consistent with pre-training; only the data and the learning rate schedule differ.
5.4.3. Reinforcement Learning
- Learning Rate: .
- Training Steps: 200 steps.
- Batch Size: 512.
- Rollouts: Each prompt generates 16 rollouts (trajectories), ensuring adequate exploration of the action space.
- Loss Function: A combination of the standard policy gradient loss and a KL divergence term weighted at 0.01. This configuration balances reward maximization against stability by keeping the policy from deviating too far from the previous policy.
- Chinese Model Adaptation: The Chinese text rendering model is derived by incorporating training on Chinese data at an intermediate checkpoint during the reinforcement learning stage.
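Gathered into a configuration sketch, the reported reinforcement-learning hyperparameters look roughly as follows; the learning-rate value is not preserved in this write-up, so it is left unset.

```python
rl_config = {
    "training_steps": 200,       # total GRPO update steps
    "batch_size": 512,           # prompts per update
    "rollouts_per_prompt": 16,   # trajectories sampled per prompt (group size G)
    "kl_weight": 0.01,           # weight on the KL divergence term
    "learning_rate": None,       # value not stated in this write-up
}
```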
6. Results & Analysis
After the multi-stage training process (pre-training, SFT, and RL), X-Omni's performance is evaluated across text rendering, text-to-image generation, and image understanding benchmarks.
6.1. Core Results Analysis
6.1.1. Text Rendering
X-Omni demonstrates a significant enhancement in rendering long texts, a capability greatly improved through reinforcement learning.
The following are the results from Table 1 of the original paper:
| Method | OneIG-Bench [81] (English) | OneIG-Bench [81] (Chinese) | LongText-Bench (English) | LongText-Bench (Chinese) |
| --- | --- | --- | --- | --- |
| Gen. Only Models | ||||
| FLUX.1-dev [64] | 0.523 | - | 0.607 | 0.005 |
| HiDream-I1-Full [82] | 0.707 | 0.205 | 0.543 | 0.024 |
| Kolors 2.0 [83] | 0.427 | 0.502 | 0.258 | 0.329 |
| Seedream 3.0 [84] | 0.865 | 0.928 | 0.896 | 0.878 |
| Unified Models | ||||
| Janus-Pro [21] | 0.001 | 0.148 | 0.019 | 0.006 |
| BLIP3-o [27] | 0.013 | 0.092 | 0.021 | 0.018 |
| BAGEL [30] | 0.244 | 0.365 | 0.373 | 0.310 |
| OmniGen2 [31] | 0.680 | - | 0.561 | 0.059 |
| Show-o2 [32] | 0.002 | - | 0.006 | 0.002 |
| GPT-4o [85] | 0.857 | 0.650 | 0.956 | 0.619 |
| X-Omni | 0.901 | 0.895 | 0.900 | 0.814 |
Analysis of Table 1:
- OneIG-Bench (English): X-Omni achieves the highest score of 0.901, significantly outperforming recent open-source unified models like BAGEL (0.244), OmniGen2 (0.680), and Show-o2 (0.002). It also surpasses GPT-4o (0.857), indicating superior English text rendering in short to medium contexts.
- OneIG-Bench (Chinese): X-Omni scores 0.895, outperforming most models, including GPT-4o (0.650). It is comparable to the specialized commercial text-to-image system Seedream 3.0 (0.928), which is impressive for a unified model.
- LongText-Bench (English): X-Omni scores 0.900, significantly better than other unified models, though it falls slightly behind GPT-4o (0.956) in this specific category. This benchmark, designed for longer texts, further highlights X-Omni's robustness.
- LongText-Bench (Chinese): X-Omni achieves a leading score of 0.814, outperforming all other models by a large margin and demonstrating its strong capability in rendering complex, long Chinese texts. Many other models, including GPT-4o (0.619), struggle considerably with Chinese long text.

The qualitative comparison examples in Figure 4 further solidify these findings, showing X-Omni's ability to precisely render long text according to instructions, while many other unified models fail.
This image is Figure 4 of the paper, showing a comparison of unified multimodal models on text rendering. It contains several groups of poster-style layouts combining text and imagery, covering themes such as family, travel, and home-renovation tips, illustrating the models' ability to combine complex text generation with visual content.
Figure 4: Text rendering comparison with other unified multimodal models.
6.1.2. Text-to-Image Generation
X-Omni's text-to-image generation capabilities are evaluated on DPG-Bench and GenEval.
The following are the results from Table 2 of the original paper:
| Method | Global | Entity | Attribute | Relation | Other | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Gen. Only Models | ||||||
| SDXL [86] | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| DALL-E [87] | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SD3-medium [88] | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| FLUX.1-dev [64] | 82.10 | 89.50 | 88.70 | 91.10 | 89.40 | 84.00 |
| Unified Models | ||||||
| Emu3 [12] | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
| Janus-Pro [21] | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| MetaQuery [28] | - | - | - | - | - | 82.05 |
| BLIP3-o [27] | - | - | - | - | - | 81.60 |
| UniWorld-V1 [29] | 83.64 | 88.39 | 88.44 | 89.27 | 87.22 | 81.38 |
| Ovis-U1 [89] | 82.37 | 90.08 | 88.68 | 93.35 | 85.20 | 83.72 |
| Mogao [33] | 82.37 | 90.03 | 88.26 | 93.18 | 85.40 | 84.33 |
| BAGEL [30] | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
| OmniGen2 [31] | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 | 83.57 |
| Show-o2 [32] | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 | 86.14 |
| GPT-4o* [85] | 82.27 | 91.27 | 87.67 | 93.85 | 88.71 | 86.23 |
| X-Omni | 84.80 | 92.59 | 90.63 | 94.75 | 84.20 | 87.65 |
Analysis of Table 2 (DPG-Bench):
- X-Omni achieves an Overall score of 87.65, which is state-of-the-art compared to all other unified models listed, surpassing even GPT-4o (86.23) and Show-o2 (86.14).
- It particularly excels in the Entity (92.59) and Relation (94.75) categories, indicating strong capabilities in accurately rendering specific objects and their interactions as described in prompts. This highlights the effectiveness of RL in improving instruction following.
- The performance demonstrates X-Omni's precise adherence to complex generation instructions.
The following are the results from Table 3 of the original paper:
| Method | Single | Two | Counting | Colors | Position | Color Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only Models | | | | | | | |
| SDXL [86] | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| DALL-E [87] | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| SD3-medium [88] | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| FLUX.1-dev [64] | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Unified Models | | | | | | | |
| Emu3 [12] | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 |
| Janus-Pro [21] | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| MetaQuery [28] | - | - | - | - | - | - | 0.80 |
| BLIP3-o [27] | - | - | - | - | - | - | 0.84 |
| UniWorld-V1 [29] | 0.99 | 0.93 | 0.81 | 0.89 | 0.74 | 0.71 | 0.84 |
| Mogao [33] | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| BAGEL [30] | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| OmniGen2 [31] | 0.99 | 0.96 | 0.74 | 0.98 | 0.72 | 0.75 | 0.86 |
| Show-o2 [32] | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
| GPT-4o* [85] | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 | 0.84 |
| X-Omni | 0.98 | 0.95 | 0.75 | 0.91 | 0.71 | 0.68 | 0.83 |
Analysis of Table 3 (GenEval):
- X-Omni achieves an Overall score of 0.83, which is comparable to other strong unified models like UniWorld-V1 (0.84) and GPT-4o (0.84).
- While Mogao (0.89) and BAGEL (0.88) show slightly higher overall scores, X-Omni demonstrates solid performance across various fine-grained aspects of image generation, such as Single (0.98) and Two (0.95) objects, and Colors (0.91).
- This indicates that X-Omni maintains strong capabilities in generating images that align with compositional prompts.
Figure 5 presents additional qualitative examples, showcasing X-Omni's ability to generate aesthetically pleasing images at arbitrary resolutions. These images exhibit high detail, color fidelity, and thematic representation, further supporting the quantitative results.
This image is a montage of multiple high-quality pictures, presenting diverse visual content such as portraits, landscapes, artistic creations, and everyday scenes, showcasing X-Omni's fidelity and expressiveness in detail, color, and theme during image generation.
Figure 5: Qualitative cases of X-Omni.
6.1.3. Image Understanding
The image understanding performance is evaluated across various benchmarks.
The following are the results from Table 4 of the original paper:
| Method | #LLM | POPE | GQA | MMB | SEED | DocVQA | OCRB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Und. Only Models | |||||||
| LLaVA-1.5 [93] | 7B | 85.9 | 62.0 | 64.3 | 66.1 | - | 318 |
| LLaVA-NeXT [77] | 7B | 86.5 | 64.2 | 67.4 | 68.2 | 74.4 | 532 |
| VILA [94] | 7B | 85.5 | 62.3 | 68.9 | 61.1 | - | - |
| MiniCPM-Llama3-V 2.5 [95] | 8B | - | - | 77.2 | - | 84.8 | 725 |
| LLaVA-OneVision [74] | 7B | - | - | 80.8 | 75.4 | 87.5 | 622 |
| Unified Models | |||||||
| Emu3 [12] | 7B | 85.2 | 60.3 | 58.5 | 68.2 | 76.3 | 687 |
| Janus-Pro [21] | 7B | 87.4 | 62.0 | 79.2 | 72.1 | - | 595 |
| Mogao [33] | 7B | 88.9 | 60.9 | 75.0 | 74.6 | - | - |
| Show-o2 [32] | 7B | - | 63.1 | 79.3 | 69.8 | - | - |
| X-Omni | 7B | 89.3 | 62.8 | 74.8 | 74.1 | 88.6 | 704 |
Analysis of Table 4:
- Overall Performance: Despite X-Omni's primary focus on improving generative capabilities through RL, it achieves comparable results to other state-of-the-art unified models such as Show-o2 on various image understanding benchmarks.
- Outperforming Earlier Unified Models: X-Omni surpasses earlier unified works such as Emu3 and Janus-Pro in overall understanding.
- Strong OCR Performance: Notably, X-Omni's score on OCRBench (704) significantly exceeds that of Emu3 (687) and Janus-Pro (595), as well as the specialized image understanding model LLaVA-OneVision (622). This strong OCR capability aligns with its excellent text rendering results and suggests that the unified semantic tokenization and RL training contribute to robust text understanding in visual contexts.
- DocVQA: X-Omni also performs very well on DocVQA (88.6), outperforming LLaVA-OneVision (87.5) and MiniCPM-Llama3-V 2.5 (84.8), further demonstrating its ability to process and reason over text in images.
6.2. Ablation Studies / Parameter Analysis
6.2.1. X-Omni Does Not Rely on Classifier-Free Guidance (CFG)
A key finding is that X-Omni can generate high-quality images without relying on classifier-free guidance (CFG) in its autoregressive component.
- What is CFG? Classifier-Free Guidance (CFG) is a technique commonly used in generative models (especially diffusion models and some autoregressive models) to improve the alignment between generated content and the conditioning input (e.g., a text prompt). It works by combining predictions from a conditional model (guided by the prompt) and an unconditional model (not guided by the prompt) during sampling. By extrapolating between these two, CFG can strengthen the guidance, leading to higher quality and better prompt adherence, often at the cost of diversity. Autoregressive models such as Emu3 and Janus-Pro typically rely heavily on CFG to enhance the sampling of visual tokens (a minimal sketch of this logit-mixing idea appears after Figure 6 below).
- X-Omni's Independence: Figure 6 compares the CFG dependencies. It shows that X-Omni maintains consistently high generation quality regardless of whether CFG is present. In contrast, other autoregressive models (Emu3, Janus-Pro) suffer a significant drop in generation quality when CFG is absent (the cfg_scale=1 setting, which means no guidance).
This figure is a chart comparing images of a blue vase generated by three models (Emu3, Janus-Pro, and X-Omni) with and without classifier-free guidance (CFG), highlighting the richer detail and more natural colors of X-Omni's outputs.
Figure 6: Comparison of dependency on classifier-free guidance (CFG).
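To make the mechanism concrete, below is a minimal, hypothetical sketch of CFG-style logit mixing at a single autoregressive decoding step. The names (`model`, `cond_ids`, `uncond_ids`) are illustrative and do not correspond to X-Omni's actual API; setting `cfg_scale=1.0` reduces to plain conditional sampling, i.e., the no-guidance setting under which X-Omni is reported to remain stable.

```python
import torch

def cfg_next_token_logits(model, cond_ids, uncond_ids, cfg_scale=1.0):
    """One AR decoding step with classifier-free guidance (illustrative sketch).

    cond_ids:   token sequence that includes the text prompt
    uncond_ids: the same sequence with the prompt dropped or masked
    cfg_scale:  1.0 means no guidance (conditional prediction only)
    """
    with torch.no_grad():
        cond_logits = model(cond_ids).logits[:, -1, :]        # prompt-conditioned prediction
        if cfg_scale == 1.0:
            return cond_logits                                 # no-guidance setting
        uncond_logits = model(uncond_ids).logits[:, -1, :]     # unconditional prediction
    # Extrapolate away from the unconditional distribution to strengthen prompt guidance.
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```

Running only the conditional branch also halves the forward passes per generated token, which is exactly the computational saving noted in the implications below.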
Implications:
- Reduced Computational Cost: Operating without CFG means that X-Omni avoids running both conditional and unconditional inference paths during sampling, thereby lowering the computational cost of autoregressive inference.
- Higher Intrinsic Consistency: This independence suggests a higher level of consistency between the generation processes for visual and language tokens within X-Omni. The model inherently generates tokens that are well aligned with the prompt and with the diffusion decoder's expectations, without needing an external boost from CFG. This is a strong indicator of the effectiveness of the RL training in optimizing the intrinsic generation quality.

Unless specified otherwise, all qualitative examples in the paper were generated without CFG during autoregressive sampling, reinforcing this finding.
6.2.2. RL Outperforms SFT with Best-of-N Sampling
The paper highlights that Reinforcement Learning (RL) training provides significant advantages over Supervised Fine-Tuning (SFT) even when SFT is augmented with Best-of-N sampling.
- What is Best-of-N sampling? Best-of-N sampling is a technique where a generative model produces multiple (N) outputs for a given prompt, and the "best" one is then selected using some evaluation metric (e.g., a reward model or human preference). This can improve perceived quality by letting the model generate diverse options and cherry-picking the most favorable one (see the sketch after this list). In language modeling, SFT models combined with Best-of-N sampling often set a high bar that RL training struggles to surpass.
- RL's Advantage: As illustrated in Figure 2(b), the RL training in X-Omni yields a performance curve that quickly surpasses the best-of-N results achieved by the SFT model (SFT BoN). This indicates that RL is not just incrementally improving SFT but is fundamentally optimizing the generative process in a way that SFT cannot.
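As a reference point, here is a minimal sketch of Best-of-N selection. `generate_image` and `reward_fn` are placeholders for any stochastic generator (e.g., an SFT model) and any scoring model (e.g., a human-preference reward model); they are not functions from the paper.

```python
def best_of_n(prompt, generate_image, reward_fn, n=8):
    """Best-of-N sampling (illustrative): draw N candidates, keep the highest-reward one."""
    candidates = [generate_image(prompt) for _ in range(n)]      # N independent samples
    scores = [reward_fn(prompt, image) for image in candidates]  # score each candidate
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

Note that Best-of-N only reranks samples the SFT model can already produce; it never changes the model's own distribution, which is one reason an RL-trained policy can overtake it.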
Reasons for this disparity (different from language modeling):
- Alignment of AR and Diffusion Modules:
  - SFT: Trains the autoregressive (AR) module and the diffusion decoder separately. The AR model is optimized to predict ground-truth tokens, while the diffusion model reconstructs pixels from these tokens. This separation can create a distribution gap: AR-generated tokens, even when close to the ground truth, may not be optimal inputs for the diffusion decoder, resulting in performance degradation.
  - RL: Plays a crucial role in aligning these two modules. Because the reward is computed on the final image decoded by the diffusion model, the AR model learns to generate token sequences that are intrinsically compatible with the diffusion decoder's expectations, effectively bridging this gap.
- Holistic Optimization for Image Features:
  - Language has strong sequential dependencies, and errors are often localized. Image features, while generated sequentially, are inherently local and spatially complex; errors in early image tokens can have complex, non-local effects on the final image.
  - RL's holistic optimization harnesses rich, multifaceted information from various local regions within a single image (via the comprehensive reward system). This enables highly efficient learning of globally coherent, high-quality images by optimizing the entire token sequence with respect to the end goal (a sketch of this image-level credit assignment follows below).
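The image-level credit assignment described above can be sketched in a GRPO-like form: for each prompt, a group of rollouts is sampled, each rollout is decoded into an image and scored by the composite reward, and every rollout receives its reward standardized within the group as a shared advantage. The helper names (`decode_image`, `composite_reward`) are placeholders standing in for the offline diffusion decoder and the multi-faceted reward system, not the paper's actual code.

```python
import statistics

def group_relative_advantages(token_rollouts, prompt, decode_image, composite_reward):
    """GRPO-style advantages from image-level rewards (illustrative sketch).

    token_rollouts:   visual-token sequences sampled for the same prompt
    decode_image:     offline diffusion decoder, tokens -> image
    composite_reward: combined score (human preference, alignment, OCR accuracy, ...)
    """
    rewards = [composite_reward(prompt, decode_image(tokens)) for tokens in token_rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero for identical rewards
    # Every token in a rollout shares one advantage, so credit reflects the
    # global quality of the decoded image rather than a per-token target.
    return [(r - mean) / std for r in rewards]
```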
6.2.3. LongText-Bench Details
To fully evaluate X-Omni's ability to accurately render long texts, the authors propose a novel benchmark: LongText-Bench.
The following are the results from Figure 7 of the original paper:
This figure is a chart comparing the distributions of rendered-text lengths in LongText-Bench and OneIG-Bench for English (left) and Chinese (right); the bar charts make the difference between the two benchmarks across length ranges directly visible.
Figure 7: Comparison between our proposed LongText-Bench and OneIG-Bench with respect to the length of rendered texts in English (left) and Chinese (right).
Benchmark Statistics (Figure 7 Analysis):
- Focus: LongText-Bench focuses on evaluating text rendering performance for longer texts in both English and Chinese, addressing the limitation of OneIG-Bench, which primarily uses shorter prompts.
- English Text Length:
  - LongText-Bench "short" category: text lengths are concentrated within 10-30 words.
  - LongText-Bench "long" category: predominantly falls within 30-50 words.
  - In comparison, OneIG-Bench English texts are generally shorter than both categories of LongText-Bench.
- Chinese Text Length:
  - LongText-Bench "short" category: most prompts contain 20 to 40 characters.
  - LongText-Bench "long" category: typically exceeds 60 characters in length.
  - Similar to English, OneIG-Bench Chinese texts are shorter overall.
- Conclusion: The increased text length in LongText-Bench effectively highlights X-Omni's superior capability in rendering more extensive textual content, especially in Chinese, where it significantly outperforms other models.
Prompt Construction:
- Methodology: Prompts are meticulously curated using an automatic pipeline followed by manual post-review (a sketch of the automatic stage follows this list).
- Scenarios: 8 common text-rich scenarios are defined: signboards, objects with labels, printed materials, web pages, slides, posters, captions, and dialogues.
- GPT-4o Generation: For each category, GPT-4o is instructed to generate 20 prompts: 10 with short text contents and 10 with longer text contents.
- Manual Review: After initial generation, each prompt is manually reviewed, and text lengths are adjusted to achieve a balanced distribution across the "short" and "long" categories.
- Total Prompts: This process results in a total of 160 prompts covering 8 categories for evaluating long text rendering tasks.
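The paper does not reproduce the exact instructions given to GPT-4o, so the following is only a plausible sketch of the automatic stage, assuming the OpenAI Python client; the category list comes from the benchmark description, while the prompt wording and word-count hints are illustrative assumptions.

```python
from openai import OpenAI

CATEGORIES = ["signboards", "objects with labels", "printed materials", "web pages",
              "slides", "posters", "captions", "dialogues"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_prompts(category, length_hint, k=10):
    """Ask GPT-4o for k text-to-image prompts that quote the exact text to render."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (f"Write {k} text-to-image prompts for the scenario '{category}'. "
                        f"Each prompt must quote the exact text to appear in the image, "
                        f"containing {length_hint}. Output one prompt per line."),
        }],
    )
    return response.choices[0].message.content.strip().splitlines()

# 8 categories x (10 short + 10 long) = 160 draft prompts, followed by manual review.
drafts = {c: draft_prompts(c, "roughly 10-30 words") + draft_prompts(c, "roughly 30-50 words")
          for c in CATEGORIES}
```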
Evaluation Metric:
- OCR Model: Qwen2.5-VL-7B [68] is used as the OCR model to parse the generated texts.
- Metric: The Text Accuracy score is adopted as the final metric for LongText-Bench.
- Rationale for Text Accuracy: For long-text scenarios, the standard Edit Distance is deemed unsuitable because OCR results may not preserve the relative order of different text segments (e.g., multiple lines of text may be returned out of order even when the individual words are correct). Text Accuracy likely provides a more robust measure of overall correctness in such complex scenarios (an illustrative sketch of the metric follows this list).
- Sampling: Four images are generated for each prompt to minimize the impact of random errors during either image generation or OCR recognition.
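Since the exact formula for Text Accuracy is not spelled out here, the following is only a plausible sketch under the stated rationale: OCR-extracted words are matched against the target text without regard to order, and scores are averaged over the four images sampled per prompt. `run_ocr` stands in for the Qwen2.5-VL-7B OCR step.

```python
from collections import Counter

def text_accuracy(target_text, ocr_text):
    """Fraction of target words recovered by OCR, ignoring word order (illustrative)."""
    target = Counter(target_text.lower().split())
    found = Counter(ocr_text.lower().split())
    matched = sum(min(n, found[word]) for word, n in target.items())
    return matched / max(1, sum(target.values()))

def score_prompt(target_text, images, run_ocr):
    """Average over several generations (the paper samples 4 images per prompt)."""
    return sum(text_accuracy(target_text, run_ocr(img)) for img in images) / len(images)
```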
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully demonstrates that discrete autoregressive models for image generation can achieve state-of-the-art results by leveraging reinforcement learning (RL). The proposed X-Omni framework unifies image and language modeling within a single autoregressive architecture, composed of a semantic image tokenizer, a unified autoregressive model, and an offline diffusion decoder. Through a sophisticated GRPO-based RL training, guided by a multi-faceted reward system (including human preference, unified quality, text-image alignment, and OCR accuracy), X-Omni effectively mitigates the long-standing issues of low visual fidelity, distorted outputs, and poor instruction adherence that have plagued prior discrete AR models. Key achievements include producing high aesthetic quality images, robustly following complex instructions, and exceptionally accurate rendering of long texts in both English and Chinese. Furthermore, X-Omni uniquely operates without reliance on classifier-free guidance (CFG) during autoregressive sampling, indicating a higher intrinsic consistency and efficiency in its generative process.
7.2. Limitations & Future Work
The paper itself does not explicitly list a "Limitations" section. However, based on the challenges in the field and the solutions proposed, potential limitations of X-Omni or areas for future work can be inferred:
- Complexity of Reward System: While the multi-faceted reward system is a strength, it integrates several components (HPSv2, Unified Reward, Qwen2.5-VL-32B, GOT-OCR2.0, PaddleOCR) and is therefore substantial to develop, maintain, and scale. Future work might explore simpler or more generalized reward models.
- Computational Cost of RL: Although GRPO sidesteps a critic network, generating multiple rollouts for each prompt, decoding them with a diffusion model, and evaluating them with a comprehensive reward system can still be computationally intensive. Scaling RL training to even larger datasets or models might pose challenges.
- Reliance on External Models: X-Omni relies on several frozen, pre-trained external models (SigLIP2-g, Qwen2.5-1.5B, FLUX.1-dev, Qwen2.5-VL-32B, GOT-OCR2.0, PaddleOCR). While effective, this creates a dependency chain and assumes the capabilities of these external models are sufficient and perfectly aligned. End-to-end learning of all components could be a future direction, albeit a more challenging one.
- Generalizability to Novel Tasks: While X-Omni shows strong performance on generation and understanding, its generalizability to a wider array of multimodal tasks (e.g., video generation, 3D content generation, advanced interactive reasoning) might require further adaptation.
- Prompt Engineering for RL: The curation of diverse prompts for RL (Midjourney, text-rich, natural images) is crucial. Exploring more automated or adaptive ways to generate effective RL prompts could be valuable.
The paper's success in "making discrete autoregressive image generative models great again" inherently suggests future work will focus on:
- Further scaling the unified autoregressive model.
- Exploring the limits of RL in refining multimodal generation.
- Applying this unified AR framework to even more complex multimodal tasks beyond static image generation.
- Developing even more advanced and efficient semantic tokenizers.
7.3. Personal Insights & Critique
X-Omni is a highly impactful paper that successfully challenges a prevailing paradigm in multimodal AI. For a considerable period, the limitations of discrete autoregressive models in image generation seemed insurmountable, leading many researchers to pivot towards diffusion-based or hybrid approaches. This paper's core innovation lies in demonstrating that Reinforcement Learning is the critical missing piece to unlock the full potential of discrete autoregressive modeling for images.
My key insights are:
- The Power of Holistic Feedback: The paper effectively argues that supervised fine-tuning (SFT) is fundamentally limited when a distribution mismatch exists between intermediate representations (AR tokens) and the final desired output (a high-quality decoded image). RL, by providing holistic, end-to-end feedback based on the actual quality of the generated image, can directly optimize for the downstream task, effectively bridging the AR-diffusion gap. This insight is generalizable beyond image generation to any sequential generative task where intermediate steps have complex interactions with the final quality.
- Strategic Reward System Design: The success of the RL component hinges on a comprehensive and well-designed reward system. By combining human preference, unified quality, text-image alignment, and OCR accuracy signals, X-Omni effectively guides the model to optimize for multiple critical aspects simultaneously. This multi-objective reward engineering is a crucial takeaway for applying RL to complex creative tasks.
- Unified Architectures are Still Viable: The paper makes a compelling case for unified autoregressive architectures for vision and language. The ability to manage both modalities with a single next-token-prediction mechanism not only simplifies the architecture but also promotes knowledge transfer and seamless multi-turn interactions between generation and understanding. This contrasts with hybrid models that often treat vision and language as separate, albeit interacting, components.
- The Significance of CFG Independence: The finding that X-Omni operates effectively without classifier-free guidance (CFG) is particularly insightful. CFG is often seen as a necessary "hack" to boost quality in generative models. X-Omni's independence from it suggests that its internal representation learning and generative process are more intrinsically aligned and robust, a testament to the efficacy of its RL training in making the model's unconditional generation high-quality and consistent. This could pave the way for more efficient and fundamentally better-designed generative models.
- Long Text Rendering as a Benchmark: The introduction of LongText-Bench is an important contribution. Rendering long, accurate text within images has been a persistent challenge, and by creating a dedicated benchmark, the paper highlights a critical capability that many models (including GPT-4o) struggle with and X-Omni excels at. This is particularly valuable for applications requiring precise textual integration in visual content.
Potential issues or areas for improvement:
- Interpretability of RL Policy: While RL improves performance, understanding why the AR model makes certain token choices under the RL policy can be more opaque than with pure SFT. Further research into the interpretability of RL-trained generative policies could be beneficial.
- Generalization to Diverse Styles/Domains: While aesthetic quality is high, the reward system might implicitly favor certain aesthetic styles or content types seen during training. Testing generalization to highly diverse or niche artistic styles could be an interesting future exploration.
- Cold Start Problem for RL: The paper starts with a heavily pre-trained and SFT-tuned model before RL. The cold start problem for RL in generative models (i.e., training from scratch with RL) remains a significant challenge.

In conclusion, X-Omni provides a strong foundation for future research in unified multimodal generation, proving that with strategic application of reinforcement learning, discrete autoregressive models can indeed be "made great again" and push the boundaries of what is possible in AI.