NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
TL;DR Summary
NextStep-1 is an autoregressive model that effectively handles continuous image tokens without relying on heavy diffusion models or incurring quantization loss. It achieves state-of-the-art performance in text-to-image generation and excels in image editing, demonstrating its power and versatility as a unified multimodal approach.
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale". It focuses on advancing autoregressive models for text-to-image generation by utilizing continuous image tokens.
1.2. Authors
The paper lists "NextStep-Team" as the authors, without individual names in the main author list. However, a detailed "Contributors and Acknowledgments" section provides individual names.
- Researchers (Core Executors*, Project Leader†, listed alphabetically by first name): Chunrui Han*, Guopeng Li*, Jingwei Wu*, Quan Sun*†, Yan Cai*, Yuang Peng*, Zheng Ge*†, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu.
- Contributors (support in data, systems, platforms, early versions, part-time): Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yinging Wang, Yu Zhou, Yucheng Han, Ziyang Meng.
- Sponsors: Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu.
- Acknowledgments for insightful discussions: Tianhong Li and Yonglong Tian.
Individual affiliations are not explicitly stated in the paper, but the "NextStep-Team" name and the project homepage https://stepfun.ai/research/en/nextstep1 suggest an affiliation with StepFun AI.
1.3. Journal/Conference
The paper is an arXiv preprint (arXiv:2508.10711), submitted on August 14, 2025. As an arXiv preprint, it has not yet undergone formal peer review or been published in a specific journal or conference. arXiv is a widely used open-access repository for preprints, making research publicly available before, or sometimes instead of, formal publication.
1.4. Publication Year
The publication year is 2025 (specifically, August 14, 2025).
1.5. Abstract
The abstract introduces NextStep-1, a novel approach to autoregressive (AR) text-to-image generation that addresses limitations of existing AR models. Current AR models either rely on computationally intensive diffusion models for continuous image tokens or use vector quantization (VQ) for discrete tokens, which incurs quantization loss. NextStep-1 is a 14-billion-parameter AR model combined with a 157-million-parameter flow matching head. It trains on discrete text tokens and continuous image tokens using next-token prediction objectives. The model achieves state-of-the-art performance among AR models in text-to-image generation, demonstrating high-fidelity image synthesis and strong capabilities in image editing. The authors plan to release their code and models to facilitate open research.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2508.10711. This is a preprint, indicating it is publicly available but has not undergone formal peer review.
The PDF link is https://arxiv.org/pdf/2508.10711v2.pdf.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the challenge of building high-performing autoregressive (AR) models for text-to-image generation.
- Core Problem: Current autoregressive approaches for text-to-image generation face two main limitations:
  - Reliance on heavy diffusion models: Some AR models generate semantic embeddings, which then condition a separate, computationally intensive diffusion model to produce the final image. This makes the overall process less unified and efficient.
  - Use of vector quantization (VQ): Other AR models convert images into discrete visual tokens using VQ. This method suffers from quantization loss (information lost during the conversion from continuous to discrete representations) and issues like exposure bias (a discrepancy between training and inference distributions when predicting discrete tokens sequentially).
- Importance of the Problem: Autoregressive models, inspired by their success in large language models (LLMs), offer a scalable and flexible paradigm for unifying multimodal inputs into a single sequence. Overcoming their limitations in image generation is crucial for developing versatile and powerful general-purpose AI systems that can handle both text and image generation seamlessly. A significant performance gap has persisted between AR models and state-of-the-art diffusion methods, particularly in image quality and consistency.
- Paper's Entry Point/Innovative Idea: NextStep-1 pushes the AR paradigm forward by directly modeling continuous image tokens within an autoregressive framework, leveraging a flow matching head instead of diffusion models for continuous token processing or VQ for discrete tokens. This aims to combine the strengths of AR generation (scalability, flexibility, unified sequence modeling) with the quality benefits of continuous representations, without the computational overhead of full diffusion models or the loss from VQ.
2.2. Main Contributions / Findings
The paper makes several primary contributions and reports key findings:
- NextStep-1 Model: Introduction of NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, designed for next-token prediction on discrete text tokens and continuous image tokens. This unified approach provides a simple yet effective architecture for text-to-image generation.
- State-of-the-Art AR Performance: NextStep-1 achieves state-of-the-art performance among autoregressive models in text-to-image generation, demonstrating high-fidelity image synthesis. It shows competitive performance across challenging benchmarks, including WISE, GenAI-Bench, DPG-Bench, and OneIG-Bench, showcasing strong compositional understanding, linguistic understanding, and world knowledge.
- Versatility in Image Editing: The fine-tuned NextStep-1-Edit exhibits strong performance in instruction-based image editing, achieving competitive scores on GEdit-Bench and ImgEdit-Bench. This highlights the versatility of the unified AR approach.
- Robust Image Tokenizer Design: The paper introduces an image tokenizer fine-tuned from the Flux VAE that incorporates channel-wise normalization and stochastic perturbation. This design enhances the robustness of continuous image tokens, promotes a well-dispersed, normalized latent space, and ensures stable convergence even at higher dimensionalities (e.g., 16 channels), mitigating issues like variance collapse and visual artifacts under strong classifier-free guidance (CFG).
- Invariance to Flow Matching Head Size: A key finding is that generation quality is surprisingly insensitive to the size of the flow matching head, suggesting that the Transformer backbone performs the core generative modeling while the flow matching head acts as a lightweight sampler that translates contextual predictions into continuous tokens.
- Importance of a Regularized Latent Space: The authors demonstrate a counter-intuitive inverse correlation between generation loss and synthesis quality: higher noise intensity during tokenizer training (which increases generation loss) paradoxically improves image quality. This is attributed to noise regularization creating a well-conditioned, more dispersed latent space, which is critical for generation.
- Open Research Facilitation: The authors commit to releasing their code and models to the community.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand NextStep-1, several foundational concepts in machine learning, particularly in generative AI, are essential:
- Autoregressive Models (AR Models):
- Conceptual Definition: Autoregressive models are a class of statistical models that predict future values in a sequence based on past observations. In machine learning, especially in natural language processing (NLP) and generative AI, they model the probability distribution of a sequence of data by factorizing it into a product of conditional probabilities: each element in the sequence is predicted conditioned on all preceding elements.
- How it Works (for sequences): For a sequence $x = (x_1, x_2, \dots, x_n)$, an autoregressive model calculates the probability of the sequence as:
$ p(x) = p(x_1) \cdot p(x_2 \mid x_1) \cdot p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1}) $
Or, more compactly, using the product notation as seen in the paper:
$ p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) $
where $x_{<i}$ denotes all tokens preceding $x_i$. This sequential prediction is often referred to as next-token prediction (NTP).
- In NextStep-1's context: The model processes a unified sequence of discrete text tokens and continuous image tokens. When predicting a text token, it uses a language modeling head. When predicting an image token (or patch), it uses a flow matching head.
- Transformers:
- Conceptual Definition: Transformers are a neural network architecture introduced in 2017 that has become the backbone of most state-of-the-art large language models (LLMs) and is increasingly used in vision tasks. They handle sequential data without recurrent (RNN) or convolutional (CNN) layers, instead using attention mechanisms to weigh the importance of different parts of the input sequence.
- Causal Transformer: A causal transformer (often called a decoder-only transformer) is a variant in which the attention mechanism may only attend to past and current tokens, never future ones. This makes it suitable for autoregressive generation, ensuring that the prediction for $x_i$ depends only on $x_{<i}$.
- Positional Encoding (RoPE): Since transformers process sequences in parallel and do not inherently encode token order, positional encodings are added to the input embeddings to inject positional information. Rotary Position Embedding (RoPE) encodes absolute position with a rotation matrix and naturally incorporates relative position dependencies, which can be beneficial for longer sequences.
- Latent Space and Variational Autoencoders (VAEs):
- Conceptual Definition: A latent space is a lower-dimensional representation of data in which similar data points lie closer together. Variational Autoencoders (VAEs) are generative models that learn to encode input data into a latent space and decode from that latent space back into the original data space. They are often used as image tokenizers to compress images into a compact, continuous representation (the latents).
- Components: A VAE consists of an encoder (which maps the input to latent distribution parameters, usually mean and variance) and a decoder (which maps sampled latent vectors back to the input space).
- In NextStep-1: The image tokenizer is a VAE fine-tuned from the Flux VAE. It converts images into 16-channel latents at an 8x spatial downsampling factor. These latents form the continuous image tokens.
- Vector Quantization (VQ):
- Conceptual Definition: VQ is a data compression technique that maps a continuous vector space onto a discrete set of "codebook" vectors. In image generation, VQ-VAEs (or VQ-GANs) encode images into a latent space, and each latent vector is "quantized" by finding the closest vector in a learned discrete codebook. This yields discrete visual tokens, making images representable as sequences of integers, similar to text.
- Limitation (as noted by the paper): Quantization loss refers to the information lost during this discrete approximation.
- Diffusion Models:
- Conceptual Definition: Diffusion models are generative models that learn to reverse a gradual diffusion process (adding noise to data) in order to generate new data from random noise. They iteratively denoise a noisy input until it resembles real data.
- Computational Intensity: Diffusion models often require many denoising steps to generate a high-quality image, making them computationally intensive at inference time.
- In NextStep-1's context: The paper contrasts NextStep-1 with AR models that rely on heavy diffusion models to process continuous image tokens after the AR model generates semantic embeddings. NextStep-1 avoids this reliance by using flow matching for its continuous token generation.
- Flow Matching:
- Conceptual Definition: Flow matching is a recent family of generative models that directly learn a vector field (a flow) transporting a simple base distribution (e.g., Gaussian noise) to a complex data distribution. Unlike diffusion models, which learn to reverse a stochastic process, flow matching typically learns a deterministic ordinary differential equation (ODE) or stochastic differential equation (SDE) path, making sampling potentially faster and more stable.
- Velocity Vector: In flow matching, the model predicts a velocity vector that indicates the direction and magnitude of movement needed to transform a noisy sample at a given timestep into a clean target sample.
- In NextStep-1: A flow matching head predicts the continuous flow from a noise sample to the next target image patch, effectively sampling the continuous image tokens.
- Cross-Entropy Loss:
- Conceptual Definition: Cross-entropy loss is a common loss function for classification and for training models that output probability distributions (like language models). It measures the difference between the true distribution (one-hot encoded for discrete tokens) and the model's predicted distribution.
- Formula: For a single discrete token prediction, if $y_c$ is the true probability (1 for the correct class, 0 otherwise) and $\hat{y}_c$ is the predicted probability, the cross-entropy loss is:
$ \mathcal{L}_{\mathrm{CE}} = - \sum_{c=1}^{C} y_c \log(\hat{y}_c) $
where $C$ is the number of classes (the vocabulary size for text tokens).
- In NextStep-1: Used for discrete text tokens via the language modeling head.
- Mean Squared Error (MSE):
- Conceptual Definition: Mean Squared Error is a common loss function for regression tasks. It measures the average of the squared differences between estimated values and actual values.
- Formula: For two vectors $A$ and $B$ of length $N$:
$ \mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (A_i - B_i)^2 $
- In NextStep-1: Used as part of the flow matching loss to quantify the difference between the predicted and target velocity vectors for continuous image tokens.
- Classifier-Free Guidance (CFG):
- Conceptual Definition: Classifier-Free Guidance is a technique used in generative models (especially diffusion and flow-based generation) to improve alignment between generated samples and a conditioning signal (e.g., a text prompt). It linearly combines the model's conditional prediction (guided by the prompt) with its unconditional prediction (no prompt). A higher guidance scale yields stronger adherence to the prompt but can introduce artifacts or reduce diversity.
- Formula (as provided in the paper):
$ \tilde{\nu}(x \mid y) = (1 - w) \cdot \nu_{\theta}(x \mid \emptyset) + w \cdot \nu_{\theta}(x \mid y) $
where:
  - $\tilde{\nu}(x \mid y)$ is the guided prediction (e.g., the predicted velocity vector in flow matching or the denoised latent in diffusion).
  - $\nu_{\theta}(x \mid \emptyset)$ is the unconditional prediction (given no conditioning, $\emptyset$).
  - $\nu_{\theta}(x \mid y)$ is the conditional prediction (given conditioning $y$, e.g., a text prompt).
  - $w$ is the guidance scale, a hyperparameter controlling the strength of the guidance.
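To make the flow matching and CFG formulas above concrete, here is a minimal PyTorch-style sketch (an illustration under common rectified-flow assumptions, not the paper's code): it computes a velocity-matching loss on a straight noise-to-data path and blends unconditional and conditional velocity predictions with a guidance scale $w$, mirroring the formula above. The `model` callable and its signature are hypothetical.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Velocity-matching loss on a linear noise->data path (a common flow matching choice).

    x1:   clean continuous tokens, shape (B, D)
    cond: conditioning vectors (e.g. transformer hidden states), shape (B, C)
    """
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), 1)             # random timestep in [0, 1]
    xt = (1 - t) * x0 + t * x1                # point on the straight path
    v_target = x1 - x0                        # constant velocity of that path
    v_pred = model(xt, t, cond)               # model predicts the velocity
    return torch.nn.functional.mse_loss(v_pred, v_target)

def guided_velocity(model, xt, t, cond, null_cond, w):
    """Classifier-free guidance: blend unconditional and conditional velocities."""
    v_uncond = model(xt, t, null_cond)        # prediction with empty conditioning
    v_cond = model(xt, t, cond)               # prediction with the text prompt
    return (1 - w) * v_uncond + w * v_cond    # guided velocity, as in the CFG formula
```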
3.2. Previous Works
The paper frames its work against two main categories of existing autoregressive text-to-image models and state-of-the-art diffusion models:
- Autoregressive Models relying on Diffusion Models:
- (Chen et al., 2025a), (Dong et al., 2024), (Sun et al., 2023, 2024b), (Zhou et al., 2025) are cited.
- Mechanism: These models typically use an autoregressive Transformer to first generate a semantic embedding or a sequence of discrete tokens representing high-level image semantics. This embedding then serves as a condition for a separate, often heavy, diffusion model, which generates the actual image in a single denoising process.
- Limitation: This approach is computationally intensive because the diffusion model is separate and handles the entire image generation, requiring significant resources. The AR model acts more as an orchestrator than a direct image generator.
- Autoregressive Models employing Vector Quantization (VQ):
- (Eslami et al., 2021), (Yu et al., 2023), (Zheng et al., 2022), (Chen et al., 2025b), (Dong et al., 2024), (Sun et al., 2024a,b), (Tong et al., 2024), (Wang et al., 2024b) are cited.
- Mechanism: These models tokenize images into discrete visual tokens using Vector Quantization (VQ). The AR model then learns to predict these discrete tokens sequentially, similar to how LLMs predict text tokens.
- Limitations:
  - Quantization Loss: Information is lost when continuous image data is mapped to a finite set of discrete tokens, which can limit the fidelity of generated images, especially fine details and textures.
  - Exposure Bias: During training, the model sees "ground truth" previous tokens; during inference, it conditions on its own predictions, so errors may accumulate over time, leading to a mismatch between training and inference distributions.
  - Suboptimal Image Tokenization: The quality of the discrete tokens themselves can bottleneck generation quality.
- Recent Efforts with Continuous Latent Representations (prior AR work):
- (Fan et al., 2024), (Li et al., 2024c), (Sun et al., 2024c), (Tschannen et al., 2024, 2025) are mentioned as showing promise. These works attempt to use continuous latent variables directly within AR frameworks, moving away from VQ.
- Gap: Despite these efforts, a significant performance gap persisted between these AR models and state-of-the-art diffusion methods in image quality and consistency (e.g., Esser et al., 2024; Labs, 2024; Podell et al., 2024).
- Flux VAE (Labs, 2024): The NextStep-1 image tokenizer is fine-tuned from the Flux.1-dev VAE, highlighting the importance of a high-performance VAE for image reconstruction as a foundation for generative models.
- σ-VAE (Sun et al., 2024c): The stochastic perturbation technique used in NextStep-1's tokenizer is adapted from σ-VAE, where it was employed to prevent variance collapse in the latent space. Variance collapse is a known issue in VAEs in which the encoder learns to output a very narrow (low-variance) distribution, making the latent space less expressive.
3.3. Technological Evolution
The field of generative AI has rapidly evolved:
- Early Generative Models (GANs, basic VAEs): Focused on generating images but often struggled with diversity or stability.
- Autoregressive Models for Text: Breakthroughs with Transformers (e.g., the GPT series) demonstrated the power of next-token prediction for coherent and context-aware text generation.
- Extending AR to Images (VQ-VAE/GAN + AR Transformer): Images were represented as sequences of discrete tokens (similar to text) so that AR Transformers could be applied. This led to models like DALL-E and VQGAN, but quantization loss and exposure bias became limitations.
- Diffusion Models Rise: Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs, e.g., Stable Diffusion) achieved unprecedented image quality and diversity, becoming the state of the art for text-to-image generation.
- Hybrid AR-Diffusion Models: Some AR models began using Transformers to generate high-level semantic tokens that then conditioned powerful diffusion models for the final image. While high quality, this approach often involved heavy, separate diffusion components.
- Continuous Latent AR Models (Pre-NextStep-1): Attempts to use continuous image representations directly within AR models to avoid VQ limitations, but these often faced performance gaps compared to diffusion models.
- NextStep-1's Position: This paper represents an evolution toward a "pure autoregressive paradigm" that directly models continuous image tokens with a lightweight flow matching head, aiming to close the performance gap with diffusion models while retaining the architectural simplicity and scalability of AR Transformers. It builds on the idea of continuous latents but introduces specific tokenizer enhancements and leverages flow matching for patch-wise generation, which is distinct from using a heavy, full-image diffusion model.
3.4. Differentiation Analysis
Compared to the main methods in related work, NextStep-1 offers several core differences and innovations:
- Discrete vs. Continuous Tokens:
  - Differentiation: Unlike VQ-based autoregressive models that use discrete visual tokens and suffer from quantization loss and exposure bias, NextStep-1 directly processes continuous image tokens. This allows for higher fidelity and avoids the limitations of discrete representations.
- Flow Matching Head vs. Heavy Diffusion Models:
  - Differentiation: NextStep-1 employs a lightweight flow matching head (157M parameters) to model the distribution of each image patch autoregressively. This contrasts with AR models that rely on heavy, separate diffusion models (often hundreds of millions to billions of parameters) to generate an entire image, which is computationally intensive and less unified. NextStep-1's approach allows for a more integrated next-token prediction paradigm for continuous data.
- Tokenizer Design for Stability:
  - Innovation: The paper highlights a novel image tokenizer design (fine-tuned from the Flux VAE) that incorporates channel-wise normalization and stochastic perturbation. This is crucial for:
    - Robustness to CFG: It mitigates statistical drift in per-token mean and variance, which typically leads to visual artifacts under strong classifier-free guidance in VAE-based AR models. This allows NextStep-1 to leverage strong guidance without degrading image quality.
    - Well-conditioned Latent Space: The noise regularization cultivates a more dispersed and robust latent space, which is empirically shown to be critical for high-fidelity generation, even though it leads to a higher reconstruction loss during tokenizer training.
- Unified Autoregressive Generation:
  - Differentiation: NextStep-1 maintains a "pure autoregressive paradigm" in which the Transformer backbone performs the core generative modeling of conditional distributions for both text and image tokens. The flow matching head acts as a lightweight sampler, directly translating the transformer's contextual prediction into continuous visual tokens, rather than merely orchestrating a separate diffusion process. This unified approach provides architectural simplicity and flexibility.
- Performance Gap Bridging:
  - Innovation: By addressing the limitations of prior AR models and incorporating the tokenizer advancements, NextStep-1 significantly closes the performance gap between autoregressive models and state-of-the-art diffusion methods in terms of image quality and consistency, achieving competitive or superior results on various benchmarks for both text-to-image generation and image editing within the AR framework.
4. Methodology
4.1. Principles
The core idea behind NextStep-1 is to extend the successful autoregressive language modeling paradigm to image generation by treating both text and image data as a unified sequence of tokens. Instead of vector quantization (VQ) for discrete image tokens or relying on a separate, heavy diffusion model for continuous tokens, NextStep-1 directly models continuous image tokens using a lightweight flow matching head. The underlying principle is next-token prediction (NTP), where a causal transformer predicts the subsequent token in a multimodal sequence, whether it's a discrete text token or a continuous image patch. This design aims to achieve high-fidelity image synthesis with the scalability and flexibility inherent in autoregressive architectures.
The theoretical basis of the method centers on the factorization of a joint probability distribution over a multimodal sequence into a product of conditional probabilities, where each token's probability is conditioned on all preceding tokens. For discrete text tokens, this is a standard classification problem handled by a language modeling head with cross-entropy loss. For continuous image tokens, it transforms into a regression problem where a flow matching head learns to predict velocity vectors to generate the next image patch, optimized with flow matching loss. The intuition is that a powerful Transformer backbone can learn complex multimodal correlations and context, while modality-specific heads efficiently handle the final prediction in their respective data spaces.
4.2. Core Methodology In-depth (Layer by Layer)
The NextStep-1 framework is built upon a Causal Transformer backbone, an Image Tokenizer, a Language Modeling Head, and a Patch-wise Flow Matching Head. The overall architecture is illustrated in Figure 2.
Figure 2 is a diagram of the NextStep-1 framework workflow: it shows how the autoregressive model processes text and image tokens through a causal transformer, and how, during training, the Flow Matching Head predicts the continuous flow from a noise sample to the target image patch.
4.2.1. Unified Multimodal Generation with Continuous Visual Tokens
The framework unifies multimodal inputs (text and images) into a single sequential data stream. Images are first converted into continuous image tokens by an image tokenizer, and then combined with discrete text tokens.
Let $x = (x_1, x_2, \dots, x_n)$ represent the unified multimodal token sequence, where each $x_i$ can be either a discrete text token or a continuous visual token (image patch). The autoregressive objective under this unified sequence is formalized as:
$ p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) . $
Here, $p(x)$ is the joint probability of the entire sequence, and $p(x_i \mid x_{<i})$ is the conditional probability of the $i$-th token given all preceding tokens $x_{<i}$. The model learns to predict $x_i$ based on $x_{<i}$.
The generation task proceeds by iteratively sampling the next token $x_i$ from the conditional distribution $p(x_i \mid x_{<i})$.
- For discrete text tokens, sampling is performed via a language modeling head.
- For continuous image tokens, sampling is performed by a flow matching head.

The training objective for NextStep-1 combines two distinct losses:

- A standard cross-entropy loss ($\mathcal{L}_{\mathrm{text}}$) for discrete text tokens.
- A flow matching loss ($\mathcal{L}_{\mathrm{visual}}$) for continuous image tokens. Specifically, the flow matching loss is the mean squared error (MSE) between the predicted and target velocity vectors that map a noised image patch to its corresponding clean image patch.

The model is trained end-to-end by optimizing a weighted sum of these two losses:
$ \mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{text}} \mathcal{L}_{\mathrm{text}} + \lambda_{\mathrm{visual}} \mathcal{L}_{\mathrm{visual}} $
where:
- $\mathcal{L}_{\mathrm{total}}$ is the total loss to be minimized.
- $\mathcal{L}_{\mathrm{text}}$ is the loss for text tokens (cross-entropy).
- $\mathcal{L}_{\mathrm{visual}}$ is the loss for image tokens (flow matching loss, based on MSE).
- $\lambda_{\mathrm{text}}$ and $\lambda_{\mathrm{visual}}$ are hyperparameters that balance the contribution of the text and visual losses, respectively.
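As a minimal sketch of the weighted objective above (function and tensor names are assumptions, not the released training code), the two losses can be combined as follows:

```python
import torch
import torch.nn.functional as F

def total_loss(text_logits, text_targets, v_pred, v_target,
               lambda_text=1.0, lambda_visual=1.0):
    """Weighted sum of the two modality-specific objectives.

    text_logits:     (N_text, vocab) predictions at text positions
    text_targets:    (N_text,) ground-truth text token ids
    v_pred/v_target: (N_img, D) predicted / target velocity vectors at image positions
    """
    loss_text = F.cross_entropy(text_logits, text_targets)        # L_text
    loss_visual = F.mse_loss(v_pred, v_target)                     # L_visual (flow matching)
    return lambda_text * loss_text + lambda_visual * loss_visual   # L_total
```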
4.2.2. Model Architecture
Image Tokenizer
The image tokenizer is a crucial component that converts raw images into continuous latent representations suitable for the Causal Transformer.
- Initialization: It is fine-tuned from a pre-trained Flux VAE (Labs, 2024), chosen for its strong reconstruction performance, and adapted to the specific data distribution.
- Encoding Process: The tokenizer first encodes an image into 16-channel latents at an 8x spatial downsampling factor. For example, a 256x256 image is encoded into a 32x32x16 latent representation.
- Latent Space Stabilization and Normalization: To ensure stable training and a well-behaved latent space, two techniques are applied:
  - Channel-wise Normalization: Each channel of the latent representation is standardized to have zero mean and unit variance. This enforces per-token statistical stability, which is critical for mitigating issues under high Classifier-Free Guidance (CFG) scales, as discussed in Section 6.2.
  - Stochastic Perturbation: To further enhance robustness and encourage a more uniform latent distribution, Gaussian noise is added to the normalized latents. This technique is adapted from σ-VAE (Sun et al., 2024c) to prevent variance collapse. The perturbed latent is calculated as:
$ \tilde{z} = \mathrm{Normalize}(z) + \alpha \cdot \varepsilon, \quad \mathrm{where}\ \alpha \sim \mathcal{U}[0, \gamma]\ \mathrm{and}\ \varepsilon \sim \mathcal{N}(0, I) $
where:
    - $z$ represents the original latents encoded by the VAE.
    - $\mathrm{Normalize}(z)$ refers to the channel-wise normalization of $z$.
    - $\varepsilon$ is standard Gaussian noise, sampled from a normal distribution with mean 0 and standard deviation 1.
    - $\alpha$ is a scaling factor for the noise, sampled uniformly from the range $[0, \gamma]$.
    - $\mathcal{U}[0, \gamma]$ denotes a uniform distribution between 0 and $\gamma$.
    - $\gamma$ is a hyperparameter that controls the maximum intensity of the added noise; a higher $\gamma$ means stronger perturbation.
- Sequence Flattening: The 16-channel latents are then transformed into a compact 1D sequence for the Causal Transformer. This involves:
  - Pixel-shuffling / Space-to-Depth Transformation: A 2x2 kernel folds spatial regions of the latents into the channel dimension. For example, 32x32x16 latents are converted into a 16x16 grid of 64-channel tokens.
  - Flattening to 1D: This grid of 64-channel tokens is then flattened into a 1D sequence of 256 tokens (16 x 16 = 256), which serves as the input to the Causal Transformer.
Causal Transformer
- Initialization: The Causal Transformer is initialized from the decoder-only Qwen2.5-14B (Yang et al., 2024), a large language model (LLM) backbone suitable for autoregressive generation.
- Input Sequence Format: The multimodal input sequence is organized as `{text} <image_area>h*w <boi> {image} <eoi>`, where:
  - `{text}` represents discrete text tokens.
  - `<image_area>h*w` is a special metadata token indicating the spatial dimensions (height and width) of the 2D image tokens that follow.
  - `<boi>` (beginning-of-image) is a special token marking the start of the continuous image token sequence.
  - `{image}` represents the continuous image tokens produced by the image tokenizer.
  - `<eoi>` (end-of-image) is a special token marking the end of the image token sequence.
- Positional Encoding: Standard 1D Rotary Position Embedding (RoPE) (Su et al., 2024) is used for positional information. Despite the existence of more complex 2D or multimodal RoPE alternatives, the authors found 1D RoPE effective and retained it for simplicity and efficiency.
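For illustration only, the input layout above could be assembled as follows; the helper below and its handling of special tokens as plain strings are hypothetical (in practice, special tokens are mapped to token ids and embeddings):

```python
def build_sequence(text_tokens, image_tokens, h, w):
    """Assemble the unified multimodal sequence:
       {text} <image_area>h*w <boi> {image} <eoi>
    text_tokens:  list of discrete text token ids
    image_tokens: list of continuous 64-dim visual tokens (h*w of them, raster order)
    """
    assert len(image_tokens) == h * w
    return (
        list(text_tokens)
        + [f"<image_area>{h}*{w}", "<boi>"]   # metadata + beginning-of-image marker
        + list(image_tokens)                   # continuous tokens
        + ["<eoi>"]                            # end-of-image marker
    )
```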
Lightweight Heads for Modality-Specific Loss
The output hidden states from the Causal Transformer (LLM backbone) are passed to two lightweight heads, each responsible for computing modality-specific losses:
- Language Modeling Head:
  - Function: This head is responsible for predicting discrete text tokens.
  - Loss: It computes the cross-entropy loss over the hidden states corresponding to text tokens.
- Patch-wise Flow Matching Head:
  - Function: This head is responsible for generating continuous image tokens (patches). It follows the approach of (Li et al., 2024c).
  - Architecture: It is a 157M-parameter MLP (Multi-Layer Perceptron) with 12 layers and a hidden dimension of 1536.
  - Process: It uses each patch-wise image hidden state from the Causal Transformer as a condition. The head then denoises a target patch at various timesteps t and computes the patch-wise flow matching loss (Lipman et al., 2023a), which measures the difference between the predicted and target velocity vectors for transforming a noisy patch into a clean patch.
  - Sampling: During inference, this head iteratively guides noise toward the next image patch, building the image autoregressively.

This structured approach allows the powerful Causal Transformer to handle the complex multimodal context and next-token prediction logic, while the specialized, lightweight heads efficiently generate the final output in their respective modalities.
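The following is a hedged sketch of what a patch-wise flow matching head of this kind could look like: an MLP that predicts a velocity for one 64-dimensional patch conditioned on a transformer hidden state, plus a simple Euler sampling loop. The layer count and hidden width follow the description above; the conditioning dimension, timestep handling, and everything else are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small MLP that predicts a velocity for one image patch, conditioned on the LLM hidden state."""
    def __init__(self, patch_dim=64, cond_dim=4096, hidden_dim=1536, depth=12):
        # cond_dim: hidden size of the transformer backbone (assumed value).
        super().__init__()
        layers = [nn.Linear(patch_dim + cond_dim + 1, hidden_dim), nn.SiLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.SiLU()]
        layers += [nn.Linear(hidden_dim, patch_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x_t, t, h):
        # Concatenate noisy patch, timestep, and conditioning hidden state.
        return self.net(torch.cat([x_t, t, h], dim=-1))

@torch.no_grad()
def sample_patch(head, h, patch_dim=64, steps=20):
    """Euler integration of the learned velocity field from noise (t=0) to a clean patch (t=1)."""
    x = torch.randn(h.size(0), patch_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((h.size(0), 1), i * dt)
        x = x + head(x, t, h) * dt
    return x
```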
5. Experimental Setup
5.1. Datasets
To equip NextStep-1 with broad and versatile capabilities, a diverse training corpus comprising four main categories was constructed:
5.1.1. Text-only Corpus
- Source & Scale: 400B text-only tokens sampled from Step-3 (Wang et al., 2025a).
- Purpose: To preserve the extensive language capabilities inherent in the Qwen2.5-14B large language model (LLM) backbone used for initialization.
- Characteristic: High-quality, diverse textual data.
5.1.2. Image-Text Pair Data
This forms the foundation for text-to-image generation capabilities. A comprehensive pipeline was developed for curation:
- Data Sourcing: Collected from diverse sources, including web data, multi-task VQA (Visual Question Answering) data, and text-rich documents.
- Quality-Based Filtering: A rigorous filtering process was applied, evaluating images based on:
- Aesthetic quality
- Watermark presence
- Clarity
- OCR (Optical Character Recognition) detection
- Text-image semantic alignment
- Re-captioning: After deduplication, the filtered images were re-captioned using Step-1o-turbo to generate rich and detailed captions in both English and Chinese.
- Purpose: To provide a high-quality, large-scale dataset for training a model with a strong aesthetic sense and broad world knowledge, fundamental for text-to-image generation.
5.1.3. Instruction-Guided Image-to-Image Data
Curated to enable a wide range of practical applications beyond pure generation:
- Visual Perception & Controllable Image Generation:
  - Source & Scale: 1M samples synthesized by applying the annotator of ControlNet (Zhang et al., 2023b) to a portion of the high-quality image-text pair data.
  - Characteristic: Data includes explicit control signals (e.g., edge maps, segmentation maps) alongside images.
  - Example (ControlNet): If the input is an image of a cat and the instruction is "Turn this cat into a dog while preserving its pose," ControlNet might generate a pose map from the original cat image, which then guides the generation of the dog image.
- Image Restoration & General Image Editing:
  - Source & Scale: Initially collected 3.5M samples from GPT-Image-Edit (Wang et al., 2025d), Step1X-Edit (Liu et al., 2025), and a proprietary in-house dataset.
  - Filtering: All editing data were subjected to a rigorous VLM-based (Vision-Language Model) filtering pipeline assessing image-pair quality, rationality, consistency, and instruction alignment, resulting in approximately 1M high-quality instruction-guided image-to-image samples.
  - Purpose: To strengthen the model's capabilities in tasks like editing, inpainting, outpainting, and other instruction-guided image manipulations.
5.1.4. Interleaved Data
Integrates text and images seamlessly to foster rich and nuanced sequential associations:
- General Video-Interleaved Data:
  - Source & Scale: A large-scale, 80M-sample video-interleaved dataset constructed via a meticulous curation pipeline inspired by Step-Video (Ma et al., 2025a), involving frame extraction, deduplication, and captioning.
  - Purpose: To endow the model with extensive world knowledge from video content.
- Tutorials:
  - Source: Collected and processed tutorial videos using ASR (Automatic Speech Recognition) and OCR (Optical Character Recognition) tools, following the methodology of mmtextbook (Zhang et al., 2025).
  - Purpose: Specifically targets text-rich, real-world scenes, enhancing the model's textual understanding and generation in context.
- Character-Centric Scenes (NextStep-Video-Interleave-5M):
  - Source & Scale: A key contribution comprising 5M samples. Video frames centered on specific characters were extracted, and rich, storytelling-style captions were generated, akin to (Oliveira and de Matos, 2025).
  - Purpose: Significantly improves the model's capacity for multi-turn interaction and consistent character generation.
  - Data Sample Example: Figure 3 of the paper illustrates the processing of this data: a diagram of the character binding and multimodal captioning pipeline, including face detection, feature matching, and frame extraction, with characters matched via cosine similarity and a checklist to ensure consistency (a small matching sketch appears after this list).
- Multi-View Data:
  - Source: Curated from two open-source datasets: MV-ImageNet-v2 (Han et al., 2024) and Objaverse-XL (Deitke et al., 2023).
  - Purpose: To bolster geometric reasoning and enhance the model's ability to maintain multi-view consistency (i.e., generating consistent images of an object from different viewpoints).
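As referenced in the character-centric item above, here is a small illustrative sketch of cosine-similarity character matching; the threshold and data structures are assumptions, not the paper's pipeline:

```python
import numpy as np

def match_character(face_embedding, reference_embeddings, threshold=0.6):
    """Return the id of the reference character whose embedding is most similar,
    or None if no cosine similarity exceeds the (assumed) threshold."""
    best_id, best_sim = None, threshold
    for char_id, ref in reference_embeddings.items():
        sim = float(np.dot(face_embedding, ref) /
                    (np.linalg.norm(face_embedding) * np.linalg.norm(ref) + 1e-8))
        if sim > best_sim:
            best_id, best_sim = char_id, sim
    return best_id
```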
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided:
- WISE (World Knowledge-Informed Semantic Evaluation) (Niu et al., 2025):
  - Conceptual Definition: WISE is a benchmark designed to evaluate a text-to-image model's ability to integrate world knowledge and perform semantic understanding. It emphasizes factual grounding, reasoning, and the correct depiction of entities, events, and relationships across domains such as Cultural, Time, Space, Biology, Physics, and Chemistry. A higher score indicates better knowledge awareness and semantic alignment.
  - Mathematical Formula: The paper does not provide an explicit formula for WISE. Such benchmarks typically rely on human evaluation or automated metrics (e.g., CLIP-score variants, VQA models) over a set of knowledge-intensive prompts, with the overall score often computed as an average of sub-category scores:
$ \text{WISE Score} = \frac{1}{N_d} \sum_{d \in \text{Domains}} S_d $
  - Symbol Explanation:
    - $\text{WISE Score}$: The overall score quantifying world knowledge integration.
    - $S_d$: The score obtained for a specific knowledge domain $d$ (e.g., Cultural, Time, Space).
    - $N_d$: The total number of knowledge domains evaluated.
- GenAI-Bench (Lin et al., 2024):
  - Conceptual Definition: GenAI-Bench evaluates text-to-image models on their compositional and linguistic understanding, assessing how well they follow prompts, particularly basic and advanced compositional instructions. It measures the ability to generate images with multiple objects, attributes, and spatial relationships as described in the prompt. A higher score indicates better prompt following.
  - Mathematical Formula: The paper does not provide an explicit formula. GenAI-Bench typically uses a combination of automated metrics (such as CLIP-score for text-image alignment) and potentially human evaluation, with scores often normalized between 0 and 1 and reported separately for "Basic" and "Advanced" prompts:
$ \text{GenAI-Bench Score} = \text{Metric}(\text{Generated Images}, \text{Prompts}) $
where Metric could be an automated alignment score.
  - Symbol Explanation:
    - $\text{GenAI-Bench Score}$: The score reflecting compositional and linguistic understanding.
    - $\text{Generated Images}$: The images produced by the model.
    - $\text{Prompts}$: The textual conditioning provided to the model.
- DPG-Bench (Hu et al., 2024):
  - Conceptual Definition: DPG-Bench (likely "Detailed Prompt Generation Bench") assesses a model's compositional fidelity under complex and long prompts, especially those involving multiple objects and intricate scene descriptions. It measures how accurately the model realizes all elements and relationships specified in a detailed textual input. A higher score indicates better adherence to complex prompts.
  - Mathematical Formula: The paper does not provide an explicit formula. Like other alignment benchmarks, it involves a quantitative assessment of generated images against detailed prompts, with higher values indicating better performance.
  - Symbol Explanation:
    - $\text{DPG-Bench Score}$: The score reflecting compositional fidelity for long, complex prompts.
- OneIG-Bench (Omni-dimensional Nuanced Evaluation for Image Generation) (Chang et al., 2025):
  - Conceptual Definition: OneIG-Bench provides an omni-dimensional, nuanced evaluation of text-to-image generation across several fine-grained aspects:
    - Alignment: How well the image matches the prompt's overall content.
    - Text: Ability to render text accurately within the image.
    - Reasoning: Performance on prompts requiring logical reasoning or world knowledge.
    - Style: Control over the aesthetic style specified in the prompt.
    - Diversity: Variety in generated images for the same prompt.
    - Overall: A composite score across these dimensions.
  - Mathematical Formula: The paper does not provide explicit formulas for each sub-metric; they are typically computed using Vision-Language Models (VLMs) or human evaluation. The overall score is usually an average or weighted sum of the sub-scores:
$ \text{Overall Score} = \frac{1}{N_{sub}} \sum_{s \in \text{SubMetrics}} S_s $
  - Symbol Explanation:
    - $\text{Overall Score}$: The composite score across the fine-grained aspects.
    - $S_s$: The score for a specific sub-metric $s$ (e.g., Alignment, Text, Reasoning).
    - $N_{sub}$: The total number of sub-metrics evaluated.
- GenEval (Ghosh et al., 2023):
  - Conceptual Definition: GenEval is a benchmark for evaluating text-to-image alignment. It measures how accurately the generated image corresponds to the textual prompt, often focusing on fidelity to the attributes and objects mentioned.
  - Mathematical Formula: The paper does not provide an explicit formula. The metric often relies on internal scoring mechanisms, possibly using CLIP-based scores or human preference ratings, to quantify the degree of alignment.
  - Symbol Explanation: Not provided by the paper; GenEval is reported as a scalar score.
- GEdit-Bench (Liu et al., 2025):
  - Conceptual Definition: GEdit-Bench evaluates instruction-guided image editing models. It uses GPT-4.1 (an advanced large language model) to assess three key metrics:
    - G_SC (Instruction Following Score): Measures how well the model followed the given editing instructions.
    - G_PQ (Perceptual Quality Score): Measures the overall visual quality and realism of the edited image.
    - G_O (Overall Score): A composite score reflecting the overall effectiveness and quality of the editing.
  - Mathematical Formula: The paper does not provide explicit formulas, as these scores are derived from GPT-4.1 evaluations, likely on a defined scale (e.g., 1-10 or 1-7).
  - Symbol Explanation:
    - G_SC: GPT-4.1-based Instruction Following Score.
    - G_PQ: GPT-4.1-based Perceptual Quality Score.
    - G_O: GPT-4.1-based Overall Score.
- ImgEdit-Bench (Ye et al., 2025):
  - Conceptual Definition: ImgEdit-Bench is another benchmark specifically for instruction-guided image editing, assessing the model's ability to perform various editing tasks based on textual instructions. A higher score indicates better editing capability.
  - Mathematical Formula: The paper does not provide an explicit formula. The benchmark likely employs a quantitative metric or human evaluation to assess quality and adherence to editing instructions.
  - Symbol Explanation: Not provided by the paper; ImgEdit-Bench is reported as a scalar score.
- PSNR (Peak Signal-to-Noise Ratio):
  - Conceptual Definition: PSNR quantifies the reconstruction quality of lossy compression codecs or image restoration algorithms. It compares the maximum possible power of a signal to the power of the corrupting noise that affects the fidelity of its representation. Higher PSNR values generally indicate higher-quality reconstruction (a small numeric sketch appears after this metric list).
  - Mathematical Formula:
$ \text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_I^2}{\text{MSE}} \right) $
  - Symbol Explanation:
    - $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit image, or $2^B - 1$ for a $B$-bit image).
    - $\text{MSE}$: The Mean Squared Error between the original (ground truth) image $I$ and the reconstructed image $K$, for images of size $m \times n$:
$ \text{MSE} = \frac{1}{m n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
- SSIM (Structural Similarity Index Measure):
  - Conceptual Definition: SSIM measures the similarity between two images. Unlike PSNR, which measures absolute errors, SSIM models the perceived change in structural information, which aligns better with human visual perception. It considers three factors: luminance, contrast, and structure. SSIM ranges from -1 to 1, where 1 indicates perfect structural similarity.
  - Mathematical Formula: For two images $x$ and $y$:
$ \text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
  - Symbol Explanation:
    - $\mu_x$, $\mu_y$: The average (mean) of images $x$ and $y$.
    - $\sigma_x^2$, $\sigma_y^2$: The variance of images $x$ and $y$.
    - $\sigma_{xy}$: The covariance of images $x$ and $y$.
    - $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$: Two small constants included to prevent division by zero or near-zero values, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1$, $k_2$ are small constants.
- rFID (Reconstruction Fréchet Inception Distance):
  - Conceptual Definition: The Fréchet Inception Distance (FID) assesses the quality of images produced by generative models by measuring the "distance" between the distribution of generated (or reconstructed) images and the distribution of real images in the feature space of a pre-trained Inception-v3 network. Lower FID scores indicate higher quality and more diverse outputs. rFID commonly refers to reconstruction FID, i.e., FID computed between reconstructed images and their originals, used to assess the reconstruction quality of an image tokenizer or VAE.
  - Mathematical Formula: For two distributions, real ($X_r$) and generated ($X_g$), assumed to be multivariate Gaussian with means $\mu_r, \mu_g$ and covariance matrices $\Sigma_r, \Sigma_g$:
$ \text{FID}(X_r, X_g) = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}) $
  - Symbol Explanation:
    - $\mu_r$: Mean of the feature vectors for real images.
    - $\mu_g$: Mean of the feature vectors for generated images.
    - $\Sigma_r$: Covariance matrix of the feature vectors for real images.
    - $\Sigma_g$: Covariance matrix of the feature vectors for generated images.
    - $||\cdot||^2$: Squared Euclidean distance.
    - $\text{Tr}(\cdot)$: Trace of a matrix (sum of its diagonal elements).
    - $(\Sigma_r\Sigma_g)^{1/2}$: Matrix square root.
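As referenced in the PSNR item above, here are minimal NumPy/SciPy implementations of the PSNR and FID formulas from this list; they are generic sketches, not the benchmarks' official scoring code:

```python
import numpy as np
from scipy import linalg

def psnr(original, reconstructed, max_val=255.0):
    """PSNR = 10 * log10(MAX_I^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def fid(feats_real, feats_gen):
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):            # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```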
5.3. Baselines
The NextStep-1 model is compared against a comprehensive set of baseline models across three categories for text-to-image generation and image editing.
5.3.1. Proprietary Models
These are state-of-the-art closed-source models known for high performance:
- DALL-E 3 (Betker et al., 2023)
- Seedream 3.0 (Gao et al., 2025)
- GPT-4o (OpenAI, 2025b)
- Imagen3 (Baldridge et al., 2024)
- Recraft V3 (team, 2024)
- Kolors 2.0 (team, 2025)
- Imagen4 (deepmind Imagen4 team, 2025)
- Gemini 2.0 (Gemini2, 2025)
- Doubao (Shi et al., 2024)
- Flux.1-Kontext-pro (Labs et al., 2025)
5.3.2. Diffusion Models
These models represent the current state-of-the-art in open-source or publicly available diffusion-based image generation:
- Stable Diffusion 1.5 (Rombach et al., 2022)
- Stable Diffusion XL (Podell et al., 2024)
- Stable Diffusion 3 Medium (Esser et al., 2024)
- Stable Diffusion 3.5 Large (Stability-AI, 2024)
- PixArt-Alpha (Chen et al., 2024b)
- Flux.1-dev (Labs, 2024)
- Transfusion (Zhou et al., 2025)
- CogView4 (Z.ai, 2025)
- Lumina-Image 2.0 (Qin et al., 2025)
- HiDream-I1-Full (Cai et al., 2025)
- Mogao (Liao et al., 2025)
- BAGEL (Deng et al., 2025)
- Show-o2-7B (Xie et al., 2025b)
- OmniGen2 (Wu et al., 2025b)
- Qwen-Image (Wu et al., 2025a)
- Playground v2.5 (Li et al., 2024b)
- MetaQuery-XL (Pan et al., 2025)
- BLIP3-o (Chen et al., 2025a)
- SANA-1.5 1.6B (PAG) (Xie et al., 2025a)
- SANA-1.5 4.8B (PAG) (Xie et al., 2025a)
- Show-o2-1.5B (Xie et al., 2025b)
5.3.3. Autoregressive Models
These are other autoregressive models for text-to-image generation:
- SEED-X (Ge et al., 2024)
- Show-o (Xie et al., 2024)
- VILA-U (Wu et al., 2024)
- Emu3 (Wang et al., 2024b)
- SimpleAR (Wang et al., 2025c)
- Fluid (Fan et al., 2024)
- Infinity (Han et al., 2025)
- Janus-Pro-7B (Chen et al., 2025b)
- Token-Shuffle (Ma et al., 2025b)
- Show-o-512 (Xie et al., 2024)

The choice of baselines is comprehensive, covering leading models across different paradigms, allowing for a thorough evaluation of NextStep-1's performance relative to the current state-of-the-art.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance of Text-to-Image Generation
NextStep-1 is comprehensively evaluated on several benchmarks, demonstrating strong capabilities in text-to-image generation.
The following are the results from Table 2 of the original paper:
| Method | GenEval↑ | GenAI-Bench↑ (Basic) | GenAI-Bench↑ (Advanced) | DPG-Bench↑ |
|---|---|---|---|---|
| Proprietary | ||||
| DALL-E 3 (Betker et al., 2023) | 0.67 | 0.90 | 0.70 | 83.50 |
| Seedream 3.0 (Gao et al., 2025) | 0.84 | - | - | 88.27 |
| GPT-4o (OpenAI, 2025b) | 0.84 | - | - | 85.15 |
| Diffusion | ||||
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.43 | - | - | - |
| Stable Diffusion XL (Podell et al., 2024) | 0.55 | 0.83 | 0.63 | 74.65 |
| Stable Diffusion 3 Medium (Esser et al., 2024) | 0.74 | 0.88 | 0.65 | 84.08 |
| Stable Diffusion 3.5 Large (Esser et al., 2024) | 0.71 | 0.88 | 0.66 | 83.38 |
| PixArt-Alpha (Chen et al., 2024b) | 0.48 | - | - | 71.11 |
| Flux.1-dev (Labs, 2024) | 0.66 | 0.86 | 0.65 | 83.79 |
| Transfusion (Zhou et al., 2025) | 0.63 | - | - | - |
| CogView4 (Z.ai, 2025) | 0.73 | - | - | 85.13 |
| Lumina-Image 2.0 (Qin et al., 2025) | 0.73 | - | - | 87.20 |
| HiDream-I1-Full (Cai et al., 2025) | 0.83 | 0.91 | 0.66 | 85.89 |
| Mogao (Liao et al., 2025) | 0.89 | - | 0.68 | 84.33 |
| BAGEL (Deng et al., 2025) | 0.82/0.88‡ | 0.89/0.86‡ | 0.69/0.75† | 85.07 |
| Show-o2-7B (Xie et al., 2025b) | 0.76 | - | - | 86.14 |
| OmniGen2 (Wu et al., 2025b) | 0.80/0.86* | - | - | 83.57 |
| Qwen-Image (Wu et al., 2025a) | 0.87 | - | - | 88.32 |
| AutoRegressive | ||||
| SEED-X (Ge et al., 2024) | 0.49 | 0.86 | 0.70 | - |
| Show-o (Xie et al., 2024) | 0.53 | 0.70 | 0.60 | - |
| VILA-U (Wu et al., 2024) | - | 0.76 | 0.64 | - |
| Emu3 (Wang et al., 2024b) | 0.54/0.65* | 0.78 | 0.60 | 80.60 |
| SimpleAR (Wang et al., 2025c) | 0.63 | - | - | 81.97 |
| Fluid (Fan et al., 2024) | 0.69 | - | - | - |
| Infinity (Han et al., 2025) | 0.79 | - | - | 86.60 |
| Janus-Pro-7B (Chen et al., 2025b) | 0.80 | 0.86 | 0.66 | 84.19 |
| Token-Shuffle (Ma et al., 2025b) | 0.62 | 0.78 | 0.67 | - |
| NextStep-1 | 0.63/0.73† | 0.88/0.90* | 0.67/0.74* | 85.28 |
Note: * result is with rewriting. † result is with Self-CoT. ‡ results are not specified in the paper's table footnote but appear to denote Self-CoT or similar reasoning enhancement based on context.
Image-Text Alignment
- GenEval: NextStep-1 achieves 0.63, which increases to 0.73 with Self-CoT (Self-Chain-of-Thought), indicating strong prompt-following ability. This is comparable to diffusion models like Transfusion (0.63) and Flux.1-dev (0.66), and significantly outperforms older diffusion models like Stable Diffusion 1.5 (0.43) and Stable Diffusion XL (0.55). Among autoregressive models, it outperforms Emu3 (0.54) and Show-o (0.53).
- GenAI-Bench:
  - Basic Prompts: NextStep-1 scores 0.88 (0.90 with Self-CoT), demonstrating excellent compositional abilities. This is on par with Stable Diffusion 3 Medium (0.88) and 3.5 Large (0.88), and slightly better than Flux.1-dev (0.86).
  - Advanced Prompts: NextStep-1 achieves 0.67 (0.74 with Self-CoT), showcasing its capability to handle complex prompts. This is competitive with Stable Diffusion 3.5 Large (0.66) and BAGEL (0.69).
- DPG-Bench: For long-context, multi-object scenes, NextStep-1 achieves 85.28, confirming its reliable compositional fidelity under complex prompts. This score is higher than DALL-E 3 (83.50), GPT-4o (85.15), and Stable Diffusion 3 Medium (84.08), placing it among the top performers in this category.

The following are the results from Table 3 of the original paper:
| Model | Alignment | Text | Reasoning | Style | Diversity | Overall↑ |
|---|---|---|---|---|---|---|
| Proprietary | | | | | | |
| Imagen3 (Baldridge et al., 2024) | 0.843 | 0.343 | 0.313 | 0.359 | 0.188 | 0.409 |
| Recraft V3 (team, 2024) | 0.810 | 0.795 | 0.323 | 0.378 | 0.205 | 0.502 |
| Kolors 2.0 (team, 2025) | 0.820 | 0.427 | 0.262 | 0.360 | 0.300 | 0.434 |
| Seedream 3.0 (Gao et al., 2025) | 0.818 | 0.865 | 0.275 | 0.413 | 0.277 | 0.530 |
| Imagen4 (deepmind Imagen4 team, 2025) | 0.857 | 0.805 | 0.338 | 0.377 | 0.199 | 0.515 |
| GPT-4o (OpenAI, 2025b) | 0.851 | 0.857 | 0.345 | 0.462 | 0.151 | 0.533 |
| Diffusion | | | | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.565 | 0.010 | 0.207 | 0.383 | 0.429 | 0.319 |
| Stable Diffusion XL (Podell et al., 2024) | 0.688 | 0.029 | 0.237 | 0.332 | 0.296 | 0.316 |
| Stable Diffusion 3.5 Large (Stability-AI, 2024) | 0.809 | 0.629 | 0.294 | 0.353 | 0.225 | 0.462 |
| Flux.1-dev (Labs, 2024) | - | - | - | - | - | - |
| CogView4 (Z.ai, 2025) | 0.786 | 0.523 | 0.253 | 0.368 | 0.238 | 0.434 |
| SANA-1.5 1.6B (PAG) (Xie et al., 2025a) | 0.786 | 0.641 | 0.246 | 0.353 | 0.205 | 0.446 |
| SANA-1.5 4.8B (PAG) (Xie et al., 2025a) | 0.762 | 0.054 | 0.209 | 0.387 | 0.222 | 0.327 |
| Lumina-Image 2.0 (Qin et al., 2025) | 0.765 | 0.069 | 0.217 | 0.401 | 0.216 | 0.334 |
| HiDream-I1-Full (Cai et al., 2025) | 0.819 | 0.106 | 0.270 | 0.354 | 0.216 | 0.353 |
| BLIP3-o (Chen et al., 2025a) | 0.829 | 0.707 | 0.317 | 0.347 | 0.186 | 0.477 |
| | 0.711 | 0.013 | 0.223 | 0.361 | 0.229 | 0.307 |
| BAGEL (Deng et al., 2025) | 0.769 | 0.244 | 0.173 | 0.367 | 0.251 | 0.361 |
| Show-o2-1.5B (Xie et al., 2025b) | 0.798 | 0.002 | 0.219 | 0.317 | 0.186 | 0.304 |
| Show-o2-7B (Xie et al., 2025b) | 0.817 | 0.002 | 0.226 | 0.317 | 0.177 | 0.308 |
| OmniGen2 (Wu et al., 2025b) | 0.804 | 0.680 | 0.271 | 0.377 | 0.242 | 0.475 |
| Qwen-Image (Wu et al., 2025a) | 0.882 | 0.891 | 0.306 | 0.418 | 0.197 | 0.539 |
| AutoRegressive | | | | | | |
| Emu3 (Wang et al., 2024b) | 0.737 | 0.010 | 0.193 | 0.361 | 0.251 | 0.311 |
| Janus-Pro (Chen et al., 2025b) | 0.553 | 0.001 | 0.139 | 0.276 | 0.365 | 0.267 |
| NextStep-1 | 0.826 | 0.507 | 0.224 | 0.332 | 0.199 | 0.417 |
- OneIG-Bench (English Prompts): NextStep-1 achieves an overall score of 0.417. This result significantly outperforms its autoregressive peers, such as Emu3 (0.311) and Janus-Pro (0.267). Breaking down the metrics:
  - Alignment (0.826): Competitive with top diffusion models like Stable Diffusion 3.5 Large (0.809) and OmniGen2 (0.804).
  - Text (0.507): While not reaching the top proprietary models (GPT-4o at 0.857), it substantially outperforms most open-source diffusion models (many below 0.1) and autoregressive peers (Emu3 at 0.010, Janus-Pro at 0.001), indicating strong text rendering ability for an AR model.
  - Reasoning (0.224): Mid-range performance, indicating room for improvement, but still better than some diffusion and AR models.
  - Style (0.332): In line with many diffusion models.
  - Diversity (0.199): Within the typical range for high-fidelity models, which sometimes show less diversity than lower-fidelity ones.

The following are the results from Table 4 of the original paper:

| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall↑ | Overall (Rewrite)↑ |
|---|---|---|---|---|---|---|---|---|
| Proprietary | | | | | | | | |
| GPT-4o (OpenAI, 2025b) | 0.81 | 0.71 | 0.89 | 0.83 | 0.79 | 0.74 | 0.80 | - |
| Diffusion | | | | | | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.34 | 0.35 | 0.32 | 0.28 | 0.29 | 0.21 | 0.32 | 0.50 |
| Stable Diffusion XL (Podell et al., 2024) | 0.43 | 0.48 | 0.47 | 0.44 | 0.45 | 0.27 | 0.43 | 0.65 |
| Stable Diffusion 3.5 Large (Stability-AI, 2024) | 0.44 | 0.50 | 0.58 | 0.44 | 0.52 | 0.31 | 0.46 | 0.72 |
| PixArt-Alpha (Chen et al., 2024b) | 0.45 | 0.50 | 0.48 | 0.49 | 0.56 | 0.34 | 0.47 | 0.63 |
| Playground v2.5 (Li et al., 2024b) | 0.49 | 0.58 | 0.55 | 0.43 | 0.48 | 0.33 | 0.49 | 0.71 |
| Flux.1-dev (Labs, 2024) | 0.48 | 0.58 | 0.62 | 0.42 | 0.51 | 0.35 | 0.50 | 0.73 |
| MetaQuery-XL (Pan et al., 2025) | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 | - |
| BAGEL (Deng et al., 2025) | 0.44/0.76‡ | 0.55/0.69† | 0.68/0.75‡ | 0.44/0.65† | 0.60/0.75† | 0.39/0.58† | 0.52/0.70† | 0.71/0.77† |
| Qwen-Image (Wu et al., 2025a) | 0.62 | 0.63 | 0.77 | 0.57 | 0.75 | 0.40 | 0.62 | - |
| AutoRegressive | | | | | | | | |
| Show-o-512 (Xie et al., 2024) | 0.28 | 0.40 | 0.48 | 0.30 | 0.46 | 0.30 | 0.35 | 0.64 |
| VILA-U (Wu et al., 2024) | 0.26 | 0.33 | 0.37 | 0.35 | 0.39 | 0.23 | 0.31 | - |
| Emu3 (Wang et al., 2024b) | 0.34 | 0.45 | 0.48 | 0.41 | 0.45 | 0.27 | 0.39 | - |
| Janus-Pro-7B (Chen et al., 2025b) | 0.30 | 0.37 | 0.49 | 0.36 | 0.42 | 0.26 | 0.35 | 0.71 |
| NextStep-1 | 0.51/0.70‡ | 0.54/0.65‡ | 0.61/0.69‡ | 0.52/0.63† | 0.63/0.73‡ | 0.48/0.52† | 0.54/0.67* | 0.79/0.83* |
Note: * result is with Self-CoT. The ‡ and † markers are not defined in the paper's table footnote; from context (and the footnote of the previous table) they appear to denote Self-CoT or a similar reasoning enhancement, while the "Rewrite" column refers to the prompt rewrite protocol.
World Knowledge
- WISE Benchmark: NextStep-1 achieves an overall score of 0.54, which improves to 0.67 with Self-CoT (sketched in code below). This is the best performance among autoregressive models, significantly outperforming Emu3 (0.39) and Janus-Pro-7B (0.35). It also exceeds most diffusion models, including Stable Diffusion 3.5 Large (0.46) and Flux.1-dev (0.50).
- Prompt Rewrite Protocol: Under the prompt rewrite protocol, NextStep-1's score increases to 0.79 (0.83 with Self-CoT). This demonstrates robust knowledge-aware semantic alignment and cross-domain reasoning, approaching proprietary models like GPT-4o (0.80).
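As a concrete illustration of the Self-CoT pattern referenced above, here is a minimal sketch: the model first writes a short reasoning passage that unpacks the knowledge implied by the prompt, and the image is then generated from the prompt augmented with that reasoning. The function names are placeholders, not the released API.

```python
# Minimal Self-CoT sketch (illustrative; `generate_text` and `generate_image`
# are placeholder callables, not functions from the NextStep-1 codebase).
def self_cot_generate(prompt: str, generate_text, generate_image):
    # Step 1: produce a textual chain-of-thought about the prompt.
    reasoning = generate_text(
        f"Before drawing, explain the key visual facts implied by: {prompt}"
    )
    # Step 2: condition image generation on the CoT-enriched prompt.
    enriched_prompt = f"{prompt}\n\nReasoning: {reasoning}"
    return generate_image(enriched_prompt)

# Toy stand-ins so the sketch runs end to end.
fake_llm = lambda p: "the Eiffel Tower is an iron lattice tower in Paris"
fake_t2i = lambda p: f"<image conditioned on {len(p)} prompt characters>"
print(self_cot_generate("a postcard of the Eiffel Tower at dawn", fake_llm, fake_t2i))
```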
6.1.2. Performance of Image Editing
The following are the results from Table 5 of the original paper:
| Model | GEdit-Bench-EN G_SC↑ | GEdit-Bench-EN G_PQ↑ | GEdit-Bench-EN G_O↑ | GEdit-Bench-CN G_SC↑ | GEdit-Bench-CN G_PQ↑ | GEdit-Bench-CN G_O↑ | ImgEdit-Bench↑ |
|---|---|---|---|---|---|---|---|
| Proprietary | | | | | | | |
| Gemini 2.0 (Gemini2, 2025) | 6.87 | 7.44 | 6.51 | 5.26 | 7.60 | 5.14 | - |
| Doubao (Shi et al., 2024) | 7.22 | 7.89 | 6.98 | 7.17 | 7.79 | 6.84 | - |
| GPT-4o (OpenAI, 2025b) | 7.74 | 8.13 | 7.49 | 7.52 | 8.02 | 7.30 | 4.20 |
| Flux.1-Kontext-pro (Labs et al., 2025) | 7.02 | 7.60 | 6.56 | 1.11 | 7.36 | 1.23 | - |
| Open-source | | | | | | | |
| Instruct-Pix2Pix (Brooks et al., 2023) | 3.30 | 6.19 | 3.22 | - | - | - | 1.88 |
| MagicBrush (Zhang et al., 2023a) | 4.52 | 6.37 | 4.19 | - | - | - | 1.83 |
| AnyEdit (Yu et al., 2024a) | 3.05 | 5.88 | 2.85 | - | - | - | 2.45 |
| OmniGen (Xiao et al., 2024) | 5.88 | 5.87 | 5.01 | - | - | - | 2.96 |
| OmniGen2 (Wu et al., 2025b) | 7.16 | 6.77 | 6.41 | - | - | - | 3.44 |
| Step1X-Edit v1.0 (Liu et al., 2025) | 7.13 | 7.00 | 6.44 | 7.30 | 7.14 | 6.66 | 3.06 |
| Step1X-Edit v1.1 (Liu et al., 2025) | 7.66 | 7.35 | 6.97 | 7.65 | 7.40 | 6.98 | - |
| BAGEL (Deng et al., 2025) | 7.36 | 6.83 | 6.52 | 7.34 | 6.85 | 6.50 | 3.42 |
| Flux.1-Kontext-dev (Labs et al., 2025) | - | - | 6.26 | - | - | - | 3.71 |
| GPT-Image-Edit (Wang et al., 2025d) | - | - | 7.24 | - | - | - | 3.80 |
| NextStep-1 | 7.15 | 7.01 | 6.58 | 6.88 | 7.02 | 6.40 | 3.71 |
NextStep-1-Edit (fine-tuned on 1M edit-only samples) demonstrates competitive performance in image editing:

- GEdit-Bench-EN (Full Set): NextStep-1-Edit achieves an overall score (G_O) of 6.58. This is highly competitive with strong open-source models like OmniGen2 (6.41), Step1X-Edit v1.0 (6.44), and BAGEL (6.52). It also shows strong instruction following (G_SC = 7.15) and perceptual quality (G_PQ = 7.01).
- ImgEdit-Bench: NextStep-1-Edit scores 3.71, which is on par with Flux.1-Kontext-dev (3.71) and close to GPT-Image-Edit (3.80) and GPT-4o (4.20), while outperforming many other open-source methods such as OmniGen2 (3.44) and BAGEL (3.42).

These results highlight NextStep-1's versatility and capability to perform high-quality image editing, demonstrating the power of its unified autoregressive approach.
6.1.3. Qualitative Performance
The image is an illustration that showcases the applications of NextStep-1 in high-fidelity image generation, diverse image editing, and complex free-form manipulation. The upper section displays examples of image generation, the middle part shows the functionalities of image editing, and the lower section introduces scenarios of free-form manipulation.
Figure 1 provides qualitative examples, showcasing NextStep-1's capabilities in high-fidelity image generation, diverse image editing, and complex free-form manipulation. The generated images appear coherent, aesthetically pleasing, and align well with prompts. The editing examples show plausible and context-aware modifications.
6.2. Ablation Studies / Parameter Analysis
6.2.1. What Governs Image Generation: the AR Transformer or the FM Head?
The paper investigates the relative importance of the AR Transformer backbone and the Flow Matching (FM) head in image generation.
The following are the results from Table 6 of the original paper:

| | Layers | Hidden Size | # Parameters |
|---|---|---|---|
| FM Head Small | 6 | 1024 | 40M |
| FM Head Base | 12 | 1536 | 157M |
| FM Head Large | 24 | 2048 | 528M |
The following are the results from Table 7 of the original paper:

| | GenEval | GenAI-Bench | DPG-Bench |
|---|---|---|---|
| Baseline | 0.59 | 0.77 | 85.15 |
| w/ FM Head Small | 0.55 | 0.76 | 83.46 |
| w/ FM Head Base | 0.55 | 0.75 | 84.68 |
| w/ FM Head Large | 0.56 | 0.77 | 85.50 |
- Experimental Setup: The authors ablated the flow matching head by testing three different sizes: Small (40M parameters), Base (157M parameters), and Large (528M parameters), as detailed in Table 6. For each experiment, only the head was re-initialized and trained for 10K steps.
- Results (Table 7 & Figure 4): Despite significant variations in the FM head size, all three configurations yielded remarkably similar quantitative results across GenEval, GenAI-Bench, and DPG-Bench. Qualitatively, Figure 4 (images of animals, buildings, and dancers generated under the small, base, and large flow matching heads) shows that the outputs of the different head sizes are largely indistinguishable.
- Analysis: This finding suggests a surprising insensitivity to the flow matching head's size. The authors interpret it as strong evidence that the Transformer backbone is primarily responsible for the core generative modeling, learning the complex conditional distribution of each next token given the preceding multimodal context. The flow matching head acts more like a lightweight sampler (see the sketch after this list), translating the Transformer's high-level contextual predictions into continuous image tokens, much as a simple LM head converts Transformer outputs into discrete text tokens. The essential generative logic resides within the Transformer's autoregressive NTP process.
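To make this division of labor concrete, here is a minimal, self-contained sketch (not the released NextStep-1 implementation): a stand-in causal backbone produces a contextual hidden state for the next position, and a small flow-matching head integrates a learned velocity field from noise to sample the next continuous image token. The layer sizes, the GRU stand-in for the backbone, and the 20-step Euler sampler are illustrative assumptions.

```python
# Minimal sketch: AR backbone -> hidden state -> lightweight flow matching head.
import torch
import torch.nn as nn

TOKEN_DIM = 16      # assumed channel dim of a continuous image token
HIDDEN_DIM = 256    # assumed backbone hidden size (the real 14B model is far larger)

class FlowMatchingHead(nn.Module):
    """Tiny MLP velocity field v(x_t, t, h) conditioned on the backbone state h."""
    def __init__(self, token_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + hidden_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, token_dim),
        )

    def forward(self, x_t, t, h):
        return self.net(torch.cat([x_t, t, h], dim=-1))

    @torch.no_grad()
    def sample(self, h, steps: int = 20):
        """Euler integration from noise (t=0) toward a token sample (t=1)."""
        x = torch.randn(h.shape[0], TOKEN_DIM)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((h.shape[0], 1), i * dt)
            x = x + dt * self.forward(x, t, h)
        return x

# Stand-in for the autoregressive backbone: any causal model mapping the prefix
# to a hidden state for the next position plays this role in the sketch.
backbone = nn.GRU(input_size=TOKEN_DIM, hidden_size=HIDDEN_DIM, batch_first=True)
head = FlowMatchingHead(TOKEN_DIM, HIDDEN_DIM)

prefix = torch.randn(1, 8, TOKEN_DIM)        # 8 already-generated continuous tokens
_, h_last = backbone(prefix)                  # contextual state for the next position
next_token = head.sample(h_last.squeeze(0))   # head acts only as a lightweight sampler
print(next_token.shape)                       # torch.Size([1, 16])
```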
6.2.2. Tokenizer is the Key to Image Generation
Mitigating Instability under Strong Classifier-Free Guidance (CFG)
- Problem: VAE-based autoregressive models are prone to visual artifacts (e.g., gray patches) under strong classifier-free guidance (CFG) scales, even though CFG is used to enhance conditional fidelity.
- Root Cause Identified: Previous work hypothesized that 1D positional embeddings caused this instability. However, NextStep-1's analysis reveals the true cause to be the amplification of token-level distributional shifts under high CFG scales. In diffusion models, normalization of latent variables typically ensures consistent scaling of conditional and unconditional predictions. In token-level AR models, global normalization does not guarantee per-token statistical consistency, so small discrepancies between conditional and unconditional predictions are magnified by a large guidance scale, leading to drift in the statistics of generated tokens over the sequence.
- Empirical Demonstration (Figure 5): Figure 5 plots the evolution of the per-token mean and variance over sampling steps under different CFG settings. At a moderate CFG of 1.5, the per-token mean and variance remain stable; at a high CFG of 3.0, both statistics diverge significantly for later tokens, directly correlating with visual artifacts.
- Solution: The NextStep-1 tokenizer design incorporates channel-wise normalization (Equation 3 in the paper), which directly addresses this by enforcing per-token statistical stability. This prevents statistical drift and enables stable generation even with strong CFG (a small numerical illustration follows this list).
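The amplification argument can be illustrated numerically. The toy numpy sketch below is illustrative only: the `drift` constant and tensor shapes are assumptions, not measurements from the paper. It applies the standard CFG extrapolation and shows that a larger guidance scale inflates per-token variance whenever the conditional and unconditional predictions are slightly mismatched, while re-normalizing each token keeps its statistics fixed.

```python
# Toy demonstration of CFG-induced statistical drift and its normalization fix.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, channels = 1024, 16
drift = 0.05  # assumed small scale mismatch between cond. and uncond. predictions

def generate(guidance_scale: float, normalize: bool) -> np.ndarray:
    stats = []
    for _ in range(num_tokens):
        x_cond = rng.normal(0.0, 1.0 + drift, channels)
        x_uncond = rng.normal(0.0, 1.0, channels)
        # Classifier-free guidance: extrapolate away from the unconditional prediction.
        x = x_uncond + guidance_scale * (x_cond - x_uncond)
        if normalize:  # per-token (channel-wise) normalization, as in the tokenizer design
            x = (x - x.mean()) / (x.std() + 1e-6)
        stats.append((x.mean(), x.var()))
    return np.array(stats)

for w in (1.5, 3.0):
    raw = generate(w, normalize=False)
    norm = generate(w, normalize=True)
    print(f"CFG={w}: raw var ~ {raw[:, 1].mean():.2f}, normalized var ~ {norm[:, 1].mean():.2f}")
```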
A Regularized Latent Space is Critical for Generation
- Counter-intuitive Finding: The authors discovered an inverse correlation between generation loss (during tokenizer training) and final synthesis quality. Applying a higher noise intensity (the noise standard deviation in Equation 3) during tokenizer training increases the tokenizer's reconstruction loss but paradoxically improves the quality of images generated by the autoregressive model. The tokenizer used by NextStep-1 had the highest generation loss among those compared, yet it produced the highest-fidelity images; low-loss tokenizers, conversely, yielded noisy outputs from the AR model.
- Attribution: This phenomenon is attributed to noise regularization, which cultivates a well-conditioned latent space (a minimal sketch of this perturbation follows the figure discussion below). It enhances two key properties:
  - Robustness to Latent Perturbations: The tokenizer decoder becomes more robust to variations in the latent space (Figure 6).
  - More Dispersed Latent Distribution: The latent distribution becomes more uniform and closer to a standard normal distribution (Figure 7). This property has been found beneficial in prior work (Sun et al., 2024c; Yang et al., 2025; Yao et al., 2025).

(Figure 6: top, quantitative reconstruction metrics rFID, PSNR, and SSIM versus noise standard deviation; bottom, reconstruction examples at noise standard deviations of 0.2 and 0.5.)
Figure 6 shows that the reconstruction metrics (rFID, PSNR, SSIM) change only moderately as the noise standard deviation increases, and the reconstruction examples at standard deviations of 0.2 and 0.5 remain visually faithful, indicating that the noise-regularized decoder is robust to latent perturbations.
(Figures 7 and 8: per-dimension histograms, Dimension 0 through Dimension 15, comparing each empirical latent distribution, shown as blue bars, against a fitted normal distribution, shown as a red curve.)

Figures 7 and 8 (labeled in the original paper as the latent distribution of the Flux.1-dev VAE and of the NextStep-1 VAE without noise) visually compare latent distributions. The NextStep-1 VAE trained with noise regularization aligns best with the normal distribution, reflecting a more dispersed latent space than either the Flux.1-dev VAE or the NextStep-1 VAE without noise.
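As referenced above, here is a minimal sketch of the noise-regularization step, assuming the stochastic perturbation is applied to channel-normalized latents before they reach the decoder during tokenizer training. The normalization axis and function names are assumptions inferred from the description, not the released code.

```python
# Minimal sketch of channel-wise normalization + stochastic perturbation of VAE latents.
import torch

def regularize_latent(z: torch.Tensor, noise_std: float) -> torch.Tensor:
    """z: (batch, channels, height, width) latent from the VAE encoder."""
    mean = z.mean(dim=1, keepdim=True)           # per-position statistics over channels
    std = z.std(dim=1, keepdim=True) + 1e-6
    z_norm = (z - mean) / std                    # channel-wise normalization
    return z_norm + noise_std * torch.randn_like(z_norm)  # stochastic perturbation

z = torch.randn(2, 16, 32, 32) * 3.0 + 1.0       # toy un-normalized latents
z_train = regularize_latent(z, noise_std=0.5)    # the decoder is trained on this input
print(round(z_train.mean().item(), 3), round(z_train.std().item(), 3))
```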
Reconstruction Quality is the Upper Bound of Generation Quality
- Principle: The fidelity of an image tokenizer's reconstruction fundamentally limits the maximum achievable quality of the generated images, especially for fine details and textures.
- Validation: This principle is supported by recent studies (Dai et al., 2023; Esser et al., 2024; Labs, 2024) and has led to a trend in diffusion models of adopting VAEs with exceptional reconstruction performance.
- Gap Bridging (Table 8): Historically, VQ-based autoregressive models struggled to surpass this threshold. NextStep-1 successfully applies autoregressive modeling to a high-fidelity continuous VAE, bridging this gap.

The following are the results from Table 8 of the original paper:

| Tokenizer | Latent Shape | PSNR↑ | SSIM↑ |
|---|---|---|---|
| Discrete Tokenizer | | | |
| SBER-MoVQGAN (270M) (Zheng et al., 2022) | 32×32 | 27.04 | 0.74 |
| LlamaGen (Sun et al., 2024a) | 32×32 | 24.44 | 0.77 |
| VAR (Tian et al., 2024) | 680 | 22.12 | 0.62 |
| TiTok-S-128 (Yu et al., 2024b) | 128 | 17.52 | 0.44 |
| Selftok (Wang et al., 2025b) | 1024 | 26.30 | 0.81 |
| Continuous Tokenizer | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 32×32×4 | 25.18 | 0.73 |
| Stable Diffusion XL (Podell et al., 2024) | 32×32×4 | 26.22 | 0.77 |
| Stable Diffusion 3 Medium (Esser et al., 2024) | 32×32×16 | 30.00 | 0.88 |
| Flux.1-dev (Labs, 2024) | 32×32×16 | 31.64 | 0.91 |
| NextStep-1 | 32×32×16 | 30.60 | 0.89 |
Table 8 shows NextStep-1's tokenizer achieves PSNR of 30.60 and SSIM of 0.89, which are competitive with leading continuous tokenizers like Stable Diffusion 3 Medium (30.00 PSNR) and Flux.1-dev (31.64 PSNR), and significantly higher than most discrete tokenizers.
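For reference, PSNR (one of the two reconstruction metrics in Table 8) can be computed in a few lines of numpy. This is a generic helper, not code from the paper; it simply makes concrete what "reconstruction quality" means as the ceiling on generation quality.

```python
# Generic PSNR helper for comparing an original image against its reconstruction.
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256, 3))
noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255)  # mildly corrupted copy
print(f"PSNR: {psnr(img, noisy):.2f} dB")
```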
6.3. Training Recipe
6.3.1. Training Image Tokenizer
The image tokenizer is initialized from Flux.1-dev VAE (Labs, 2024) and fine-tuned on the image-text dataset (Section 3.2).

- Optimizer: AdamW (Loshchilov and Hutter, 2019).
- Training Steps: 50K steps.
- Batch Size: 512.
- Learning Rate: Constant, with a linear warm-up of 1,000 steps.

The following are the results from Table 1 of the original paper:

| | Stage1 (Pre-Training) | Stage2 (Pre-Training) | Annealing (Pre-Training) | SFT (Post-Training) | DPO (Post-Training) |
|---|---|---|---|---|---|
| Learning Rate (Min, Max) | 1 × 10⁻⁴ | 1 × 10⁻⁵ | (0, 1 × 10⁻⁵) | (0, 1 × 10⁻⁵) | 2 × 10⁻⁶ |
| LR Scheduler | Constant | Constant | Cosine | Cosine | Constant |
| Weight Decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Loss Weight (CE : MSE) | (0.01 : 1) | (0.01 : 1) | (0.01 : 1) | (0.01 : 1) | - |
| Training Steps | 200K | 100K | 20K | 10K | 300 |
| Warm-up Steps | 5K | 5K | 0 | 500 | 200 |
| Sequence Length per Rank | 16K | 16K | 16K | 8K | - |
| Image Area (Min, Max) | 256×256 | (256×256, 512×512) | (256×256, 512×512) | (256×256, 512×512) | (256×256, 512×512) |
| Image Tokens (Min, Max) | 256 | (256, 1024) | (256, 1024) | (256, 1024) | (256, 1024) |
| Training Tokens | 1.23T | 0.61T | 40B | 5B | - |
| Data Ratio: Text-only Corpus | 0.2 | 0.2 | 0.2 | 0 | - |
| Data Ratio: Image-Text Pair Data | 0.6 | 0.6 | 0.6 | 0.9 | - |
| Data Ratio: Image-to-Image Data | 0.0 | 0.0 | 0.1 | 0.1 | - |
| Data Ratio: Interleaved Data | 0.2 | 0.2 | 0.1 | 0 | - |
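For convenience, the schedule in Table 1 can be restated as a plain-Python configuration that could drive a staged training loop. The field names are my own; the values are copied from the table, and cells the table leaves blank are simply omitted.

```python
# Plain-Python restatement of Table 1 (a convenience sketch, not the paper's code).
TRAINING_STAGES = {
    "stage1":    {"lr": 1e-4, "scheduler": "constant", "steps": 200_000, "warmup": 5_000,
                  "tokens": "1.23T",
                  "data_ratio": {"text": 0.2, "image_text": 0.6, "image_to_image": 0.0, "interleaved": 0.2}},
    "stage2":    {"lr": 1e-5, "scheduler": "constant", "steps": 100_000, "warmup": 5_000,
                  "tokens": "0.61T",
                  "data_ratio": {"text": 0.2, "image_text": 0.6, "image_to_image": 0.0, "interleaved": 0.2}},
    "annealing": {"lr": (0.0, 1e-5), "scheduler": "cosine", "steps": 20_000, "warmup": 0,
                  "tokens": "40B",
                  "data_ratio": {"text": 0.2, "image_text": 0.6, "image_to_image": 0.1, "interleaved": 0.1}},
    "sft":       {"lr": (0.0, 1e-5), "scheduler": "cosine", "steps": 10_000, "warmup": 500,
                  "tokens": "5B",
                  "data_ratio": {"text": 0.0, "image_text": 0.9, "image_to_image": 0.1, "interleaved": 0.0}},
    "dpo":       {"lr": 2e-6, "scheduler": "constant", "steps": 300, "warmup": 200},
}

for name, cfg in TRAINING_STAGES.items():
    print(f"{name}: {cfg['steps']} steps, scheduler={cfg['scheduler']}")
```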
6.3.2. Pre-Training
Pre-training follows a three-stage curriculum. All model parameters (except the image tokenizer) are trained end-to-end.
- Optimizer (general): AdamW; beyond the settings in Table 1, the specific optimizer hyperparameters are not explicitly stated for pre-training.
- Loss Weight (CE : MSE): A consistent ratio of (0.01 : 1) across all stages, balancing the text (Cross-Entropy) and visual (Flow Matching) losses (see the sketch after this list).
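A minimal sketch of how such a weighted objective can be combined in code follows; the tensor shapes and helper names are illustrative assumptions, not the paper's implementation. Text positions contribute a cross-entropy term, image positions a flow-matching (MSE-style) regression term, and the two are merged into one next-token-prediction objective.

```python
# Combined next-token objective with (CE : MSE) = (0.01 : 1), as a sketch.
import torch
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, pred_velocity, target_velocity,
                  ce_weight: float = 0.01, mse_weight: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(text_logits, text_targets)      # discrete text tokens
    mse = F.mse_loss(pred_velocity, target_velocity)     # continuous image tokens (flow matching target)
    return ce_weight * ce + mse_weight * mse

loss = combined_loss(torch.randn(8, 32000), torch.randint(0, 32000, (8,)),
                     torch.randn(8, 16), torch.randn(8, 16))
print(loss.item())
```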
Stage1
- Purpose: Learn foundational understanding of image structure and composition.
- Image Resolution: Fixed at 256×256 (resized and randomly cropped) for computational efficiency.
- Data Mixture:
  - Text-only Corpus: 20%
  - Image-Text Pair Data: 60%
  - Interleaved Data: 20%
- Training Tokens: Approximately 1.23 trillion tokens.
- Training Steps: 200K steps.
- Learning Rate: Constant 1 × 10⁻⁴.
- Warm-up Steps: 5K steps.
Stage2
- Purpose: Train the model on higher resolutions and finer details.
- Image Resolution: Dynamic resolution strategy, targeting 256×256 and 512×512 base areas, utilizing different aspect-ratio buckets.
- Data Mixture: Same ratios as Stage1, but enriched with more text-rich and video-interleaved data:
  - Text-only Corpus: 20%
  - Image-Text Pair Data: 60%
  - Interleaved Data: 20%
- Training Tokens: Approximately 0.61 trillion tokens.
- Training Steps: 100K steps.
- Learning Rate: Constant 1 × 10⁻⁵.
- Warm-up Steps: 5K steps.
Annealing
- Purpose: Sharpen model capabilities on a highly curated dataset, enhancing overall image structure, composition, texture, and aesthetic appeal.
- Training Strategy: One epoch on a high-quality subset of 20M samples.
- Data Source: Selected from Section 3.2 (Image-Text Pair Data) by applying stricter filtering thresholds (aesthetic score, image clarity, semantic similarity, watermark).
- Data Mixture:
  - Text-only Corpus: 20%
  - Image-Text Pair Data: 60%
  - Image-to-Image Data: 10% (introduced here)
  - Interleaved Data: 10%
- Training Tokens: 40 billion tokens.
- Training Steps: 20K steps.
- Learning Rate: Cosine schedule with (min, max) = (0, 1 × 10⁻⁵).
- Warm-up Steps: 0.
6.3.3. Post-Training
Post-training aligns the model's output with human preferences and downstream tasks, via Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
Supervised Fine-Tuning (SFT)
- Purpose: Enhance instruction-following capabilities and align outputs with human preferences.
- SFT Dataset: Total of 5M samples, comprising three components:
- Human-selected image-text pairs: High semantic consistency and visual appeal, augmented by images from other generative models (distillation for complex/imaginative prompts).
- Chain-of-Thought (CoT) data: (Deng et al., 2025; Wei et al., 2022) to improve text-to-image generation by incorporating a language-based reasoning step.
- High-quality instruction-guided image-to-image data: From Section 3.3, to strengthen image editing capabilities.
- Data Ratios:
  - Image-Text Pair Data: 90%
  - Image-to-Image Data: 10%
- Training Tokens: 5 billion tokens.
- Training Steps: 10K steps.
- Learning Rate: Cosine schedule with (min, max) = (0, 1 × 10⁻⁵).
- Warm-up Steps: 500 steps.
Direct Preference Optimization (DPO)
- Purpose: Align the model with human preferences, inspired by Diffusion-DPO (Wallace et al., 2024).
- Preference Datasets: Constructed from approximately 20,000 diverse prompts (the pair-construction procedure is sketched in code after this list).
  - Standard DPO Dataset:
    - For each prompt, the SFT model generates 16 candidate images.
    - ImageReward (Xu et al., 2023) scores these images.
    - A preference pair is formed: the winning image is randomly sampled from the top 4 candidates, the losing image from the remaining 12.
  - Self-CoT DPO Dataset:
    - For each prompt, the model first generates a detailed textual Chain-of-Thought.
    - The CoT-enhanced prompt then follows the identical pipeline as above, forming a preference pair.
- Training Steps: 300 steps.
- Learning Rate: Constant 2 × 10⁻⁶.
- Warm-up Steps: 200 steps.
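As referenced above, here is a minimal sketch of the preference-pair construction. `generate_images` and `image_reward` are placeholder callables standing in for the SFT model and the ImageReward scorer; they are not APIs from the released code.

```python
# Sketch of building one DPO preference pair (winner from top-4, loser from the rest).
import random

def build_preference_pair(prompt, generate_images, image_reward, num_candidates=16):
    candidates = generate_images(prompt, n=num_candidates)        # 16 samples per prompt
    ranked = sorted(candidates, key=image_reward, reverse=True)   # score and rank with the reward model
    winner = random.choice(ranked[:4])                            # winning image from the top 4
    loser = random.choice(ranked[4:])                             # losing image from the remaining 12
    return prompt, winner, loser

# Toy stand-ins so the sketch runs end to end.
fake_generate = lambda prompt, n: [f"{prompt}-img{i}" for i in range(n)]
fake_reward = lambda img: hash(img) % 100
print(build_preference_pair("a red bicycle", fake_generate, fake_reward))
```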
6.4. Inference Latency Analysis
The following are the results from Table 9 of the original paper:

| Sequence Length | Last-token Latency: LLM Decoder (ms) | LM Head (ms) | FM Head (ms) | Accumulated Latency: Total (s) | Accumulated Latency: w/o FM Head (s) |
|---|---|---|---|---|---|
| 256 | 7.20 | 0.40 | 3.40 | 2.82 | 1.95 |
| 1024 | 7.23 | 0.40 | 3.40 | 11.31 | 7.83 |
| 4096 | 7.39 | 0.40 | 3.40 | 45.77 | 31.86 |
Table 9 presents an inference latency breakdown on an H100 GPU for a batch size of 1.
- Dominant Bottleneck: The LLM Decoder (the Causal Transformer backbone) is the dominant component of last-token latency (around 7.2-7.4 ms).
- Substantial Contribution: The multi-step sampling in the flow matching head also constitutes a substantial portion of the per-token generation cost (3.40 ms), while the LM Head is comparatively very fast (0.40 ms).
- Accumulated Latency: As sequence length increases (e.g., from 256 to 4096 tokens), the total accumulated latency scales linearly, reaching 45.77 seconds for a sequence of 4096 tokens, as the quick calculation after this list shows. Even without the FM Head's contribution, the LLM Decoder alone results in 31.86 seconds for 4096 tokens, highlighting the serial nature of autoregressive decoding as a bottleneck.
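The accumulated figures follow directly from the per-token numbers: accumulated latency is roughly sequence length times per-token latency. The back-of-the-envelope script below uses the 4096-token per-token latencies, so the shorter-sequence totals come out slightly high relative to Table 9.

```python
# Quick arithmetic check of Table 9: accumulated latency ~ seq_len * per-token latency.
per_token_ms = {"llm_decoder": 7.39, "lm_head": 0.40, "fm_head": 3.40}

for seq_len in (256, 1024, 4096):
    total_s = seq_len * sum(per_token_ms.values()) / 1000.0
    without_fm_s = seq_len * (per_token_ms["llm_decoder"] + per_token_ms["lm_head"]) / 1000.0
    print(f"{seq_len:>5} tokens: ~{total_s:.1f} s total, ~{without_fm_s:.1f} s without the FM head")
# 4096 tokens: ~45.8 s total and ~31.9 s without the FM head, matching the table.
```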
6.5. Failure Cases
The image is a diagram illustrating failure cases for high-dimensional continuous tokens. It showcases various instances of generated images with different styles and contents, highlighting the limitations of current image generation techniques in handling complex visual information.
Figure 8 illustrates some observed failure cases when transitioning to higher-dimensional latent spaces (e.g., 16 latent channels compared to 4). These artifacts include:
- Local noise or block-shaped artifacts: Appearing in later stages of generation, potentially indicating numerical instabilities.
- Global noise across the image: Could be a sign of under-convergence, suggesting that more training steps might mitigate the issue.
- Subtle grid-like artifacts: May reveal limitations of the 1D positional encoding in capturing complex 2D spatial relationships.
7. Conclusion & Reflections
7.1. Conclusion Summary
NextStep-1 successfully advances the autoregressive (AR) paradigm for text-to-image generation by effectively integrating continuous image tokens with a lightweight flow matching head. The 14B AR model, initialized from Qwen2.5, demonstrates state-of-the-art performance among AR models across diverse benchmarks for text-to-image generation (GenEval, GenAI-Bench, DPG-Bench, OneIG-Bench), achieving high-fidelity synthesis, strong compositional understanding, linguistic capabilities, and world knowledge. Furthermore, its fine-tuned version, NextStep-1-Edit, shows competitive performance in instruction-guided image editing. Key to its success is a robust image tokenizer design featuring channel-wise normalization and stochastic perturbation, which stabilizes continuous latent spaces and enables effective use of classifier-free guidance. The research also highlights that the Transformer backbone is the primary driver of generative modeling, with the flow matching head acting as a lightweight sampler. NextStep-1 bridges the performance gap between AR and diffusion models while maintaining the architectural flexibility and scalability of AR systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline future research directions:
- Artifacts:
  - Limitation: Despite tokenizer improvements, NextStep-1 can still exhibit generative artifacts (local noise, block artifacts, global noise, grid-like patterns) when scaling to higher-dimensional continuous latent spaces (e.g., 16 channels).
  - Future Work: Further investigation into the underlying causes (numerical instabilities, under-convergence, limitations of 1D positional encoding for 2D spatial relationships) is needed.
- Inference Latency of Sequential Decoding:
  - Limitation: The inherently sequential nature of autoregressive decoding leads to substantial inference latency, with the LLM Decoder and the multi-step Flow Matching Head being the dominant bottlenecks.
  - Future Work:
    - Flow Matching Head Acceleration: Reduce the parameter count, apply distillation for few-step generation (Meng et al., 2023), or use more advanced few-step samplers (Lu et al., 2022, 2025).
    - Autoregressive Backbone Acceleration: Adapt techniques from the LLM field, such as speculative decoding (Leviathan et al., 2023) or multi-token prediction (Gloeckle et al., 2024), to image token generation.
- Challenges in High-Resolution Training:
  - Limitation: Scaling to high-resolution image generation is challenging compared to diffusion models. AR models require significantly more training steps due to sequential generation, and timestep shift techniques used in diffusion models are difficult to adapt to the Flow Matching Head's role as a lightweight sampler.
  - Future Work: Designing high-resolution generation strategies specifically for patch-wise autoregressive models.
- Challenges in Supervised Fine-Tuning (SFT):
  - Limitation: SFT exhibits unstable dynamics, requiring datasets at the million-sample scale for substantial improvement. Smaller datasets lead to either marginal gains or abrupt overfitting, making it difficult to find intermediate checkpoints that balance alignment with general generative capability.
  - Future Work: Developing more robust SFT strategies for autoregressive multimodal models that can effectively leverage smaller, high-quality datasets without sacrificing generalization.
7.3. Personal Insights & Critique
This paper presents a compelling step forward for autoregressive models in multimodal generation, particularly by embracing continuous tokens and a flow matching head. The core innovation lies in demonstrating that a Transformer can effectively manage continuous image generation without resorting to computationally heavier diffusion models or losing fidelity with vector quantization.

- Inspirations and Transferability:
  - Unified Multimodal Architecture: The idea of a single causal transformer processing both text and continuous image tokens within a unified sequence is elegant and scalable. This approach could inspire future generalist AI models capable of handling diverse modalities seamlessly, beyond just text and images, potentially incorporating audio or 3D data as continuous token streams.
  - Robust Tokenizer Design: The findings on channel-wise normalization and stochastic perturbation in the image tokenizer are highly impactful. The counter-intuitive discovery that higher noise intensity during tokenizer training (leading to higher reconstruction loss) can improve overall generation quality due to a better-conditioned latent space is a crucial insight. This principle of "regularizing the latent space for downstream generation" could be broadly applicable to other generative tasks involving latent representations, not just AR models.
  - Lightweight Generative Head: The insensitivity to the flow matching head's size is a powerful result, suggesting that the primary intelligence resides in the Transformer itself. This could simplify future model designs, allowing researchers to focus on scaling the main Transformer backbone rather than optimizing complex modality-specific heads, potentially reducing model complexity and training costs across modalities.
- Potential Issues, Unverified Assumptions, or Areas for Improvement:
  - Inference Latency: While acknowledged as a limitation, the linear scaling of latency with sequence length is a fundamental challenge for AR models, especially for very high-resolution images. The proposed solutions (speculative decoding, multi-token prediction) are promising but require significant adaptation for continuous image tokens. The current latency might make real-time high-resolution image generation impractical.
  - Artifacts in High-Dimensional Latents: The presence of artifacts (local noise, grid-like patterns) when using 16-channel latents indicates that handling high-dimensional continuous latent spaces within an AR framework is still an unsolved problem. The suggestion that 1D positional encoding might struggle with 2D spatial relationships for image patches is plausible; exploring more sophisticated 2D or relative positional encodings might be necessary.
  - Training Data Dependence: The model's reliance on million-scale datasets for stable SFT is a practical hurdle, since smaller, high-quality datasets are often more accessible for specific tasks or styles. Developing SFT techniques that are more data-efficient for AR multimodal models would be highly beneficial.
  - Comparison with Proprietary Models: While NextStep-1 achieves state-of-the-art results among autoregressive models and is competitive with many open-source diffusion models, it still trails some top-tier proprietary models like GPT-4o and Seedream 3.0 on certain benchmarks (e.g., GenEval, WISE). Further scaling or architectural refinements might be needed to fully close this gap.
  - Generalizability of Flow Matching: While flow matching is efficient, its long-term generalizability and robustness across all possible continuous data distributions, compared to the more established diffusion process, still warrant extensive research. The "black box" nature of why noise regularization helps generation quality (robustness vs. dispersion) also points to areas for deeper theoretical understanding.

Overall, NextStep-1 makes a significant contribution by pushing the boundaries of autoregressive image generation. It demonstrates that the AR paradigm can achieve high-quality continuous image synthesis, offering a promising alternative to diffusion-based methods, especially if the current limitations can be addressed.