NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
TL;DR Summary
NextStep-1 is an autoregressive model that effectively handles continuous image tokens without relying on heavy diffusion models or incurring quantization loss. It achieves state-of-the-art performance in text-to-image generation and excels in image editing, demonstrating its power and versatility as a unified multimodal approach.
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale". It focuses on advancing autoregressive models for text-to-image generation by utilizing continuous image tokens.
1.2. Authors
The paper lists "NextStep-Team" as the authors, without individual names in the main author list. However, a detailed "Contributors and Acknowledgments" section provides individual names.
- Researchers (Core Executors*, Project Leader†, listed alphabetically by first name): Chunrui Han*, Guopeng Li*, Jingwei Wu*, Quan Sun*†, Yan Cai*, Yuang Peng*, Zheng Ge*†, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu.
- Contributors (support in data, systems, platforms, early versions, part-time): Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yinging Wang, Yu Zhou, Yucheng Han, Ziyang Meng.
- Sponsors: Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu.
- Acknowledgments for insightful discussions: Tianhong Li and Yonglong Tian.
Individual affiliations are not explicitly stated in the paper, but the "NextStep-Team" name and the project homepage https://stepfun.ai/research/en/nextstep1 suggest an affiliation with StepFun AI.
1.3. Journal/Conference
The paper is an arXiv preprint (arXiv:2508.10711), submitted on August 14, 2025. As an arXiv preprint, it has not yet undergone formal peer review or been published in a specific journal or conference. arXiv is a widely used open-access repository for preprints, making research publicly available before, or sometimes instead of, formal publication.
1.4. Publication Year
The publication year is 2025 (specifically, August 14, 2025).
1.5. Abstract
The abstract introduces NextStep-1, a novel approach to autoregressive (AR) text-to-image generation that addresses limitations of existing AR models. Current AR models either rely on computationally intensive diffusion models for continuous image tokens or use vector quantization (VQ) for discrete tokens, which incurs quantization loss. NextStep-1 is a 14-billion-parameter AR model combined with a 157-million-parameter flow matching head. It trains on discrete text tokens and continuous image tokens using next-token prediction objectives. The model achieves state-of-the-art performance among AR models in text-to-image generation, demonstrating high-fidelity image synthesis and strong capabilities in image editing. The authors plan to release their code and models to facilitate open research.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2508.10711. This is a preprint, indicating it is publicly available but has not undergone formal peer review.
The PDF link is https://arxiv.org/pdf/2508.10711v2.pdf.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the challenge of building high-performing autoregressive (AR) models for text-to-image generation.
- Core Problem: Current autoregressive approaches for text-to-image generation face two main limitations:
  - Reliance on heavy diffusion models: Some AR models generate semantic embeddings, which then condition a separate, computationally intensive diffusion model to produce the final image. This makes the overall process less unified and efficient.
  - Use of vector quantization (VQ): Other AR models convert images into discrete visual tokens using VQ. This method suffers from quantization loss (information lost during the conversion from continuous to discrete representations) and issues like exposure bias (a discrepancy between training and inference distributions when predicting discrete tokens sequentially).
- Importance of the Problem: Autoregressive models, inspired by their success in large language models (LLMs), offer a scalable and flexible paradigm for unifying multimodal inputs into a single sequence. Overcoming their limitations in image generation is crucial for developing versatile and powerful general-purpose AI systems that can handle both text and image generation seamlessly. A significant performance gap has persisted between AR models and state-of-the-art diffusion methods, particularly in image quality and consistency.
- Paper's Entry Point/Innovative Idea: NextStep-1 pushes the AR paradigm forward by directly modeling continuous image tokens within an autoregressive framework, leveraging a flow matching head instead of diffusion models for continuous token processing or VQ for discrete tokens. This aims to combine the strengths of AR generation (scalability, flexibility, unified sequence modeling) with the quality benefits of continuous representations, without the computational overhead of full diffusion models or the loss from VQ.
2.2. Main Contributions / Findings
The paper makes several primary contributions and reports key findings:
- NextStep-1 Model: Introduction of NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, designed for next-token prediction on discrete text tokens and continuous image tokens. This unified approach provides a simple yet effective architecture for text-to-image generation.
- State-of-the-Art AR Performance: NextStep-1 achieves state-of-the-art performance among autoregressive models in text-to-image generation, demonstrating high-fidelity image synthesis. It shows competitive performance across challenging benchmarks, including WISE, GenAI-Bench, DPG-Bench, and OneIG-Bench, showcasing strong compositional understanding, linguistic understanding, and world knowledge.
- Versatility in Image Editing: The fine-tuned NextStep-1-Edit exhibits strong performance in instruction-based image editing, achieving competitive scores on GEdit-Bench and ImgEdit-Bench. This highlights the versatility of the unified AR approach.
- Robust Image Tokenizer Design: The paper introduces an image tokenizer fine-tuned from the Flux VAE that incorporates channel-wise normalization and stochastic perturbation. This design enhances the robustness of continuous image tokens, promotes a well-dispersed, normalized latent space, and ensures stable convergence even at higher dimensionalities (e.g., 16 channels), mitigating issues like variance collapse and visual artifacts under strong classifier-free guidance (CFG).
- Invariance to Flow Matching Head Size: A key finding is that generation quality is surprisingly insensitive to the size of the flow matching head, suggesting that the Transformer backbone performs the core generative modeling while the flow matching head acts as a lightweight sampler that translates contextual predictions into continuous tokens.
- Importance of a Regularized Latent Space: The authors demonstrate a counter-intuitive inverse correlation between generation loss and synthesis quality: higher noise intensity during tokenizer training (which increases generation loss) paradoxically improves image quality. This is attributed to noise regularization creating a well-conditioned, more dispersed latent space, which is critical for generation.
- Open Research Facilitation: The authors commit to releasing their code and models to the community.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand NextStep-1, several foundational concepts in machine learning, particularly in generative AI, are essential:
- Autoregressive Models (AR Models):
- Conceptual Definition: Autoregressive models are a class of statistical models that predict future values in a sequence based on past observations. In machine learning, especially in natural language processing (NLP) and generative AI, they model the probability distribution of a sequence of data by factorizing it into a product of conditional probabilities: each element in the sequence is predicted conditioned on all preceding elements.
- How it Works (for sequences): For a sequence $x = (x_1, x_2, \dots, x_n)$, an autoregressive model calculates the probability of the sequence as:
$ p(x) = p(x_1) \cdot p(x_2 \mid x_1) \cdot p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1}) $
Or, more compactly, using the product notation as seen in the paper:
$ p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) $
where $x_{<i}$ denotes all tokens preceding $x_i$. This sequential prediction is often referred to as next-token prediction (NTP).
- In NextStep-1's context: The model processes a unified sequence of discrete text tokens and continuous image tokens. When predicting a text token, it uses a language modeling head. When predicting an image token (or patch), it uses a flow matching head.
- Transformers:
- Conceptual Definition: Transformers are a neural network architecture introduced in 2017 that has become the backbone of most state-of-the-art large language models (LLMs) and is increasingly used in vision tasks. They handle sequential data without recurrent (RNN) or convolutional (CNN) layers, instead using attention mechanisms to weigh the importance of different parts of the input sequence.
- Causal Transformer: A causal transformer (often called a decoder-only transformer) is a variant in which the attention mechanism may only attend to past and current tokens, never future ones. This makes it suitable for autoregressive generation, ensuring that the prediction for $x_i$ depends only on $x_{<i}$.
- Positional Encoding (RoPE): Since transformers process sequences in parallel and do not inherently encode token order, positional encodings are added to the input embeddings to inject positional information. Rotary Position Embedding (RoPE) encodes absolute position with a rotation matrix and naturally incorporates relative position dependencies, which can be beneficial for longer sequences.
- Latent Space and Variational Autoencoders (VAEs):
- Conceptual Definition: A latent space is a lower-dimensional representation of data in which similar data points lie closer together. Variational Autoencoders (VAEs) are generative models that learn to encode input data into a latent space and decode from that latent space back into the original data space. They are often used as image tokenizers to compress images into a compact, continuous representation (the latents).
- Components: A VAE consists of an encoder (which maps the input to latent distribution parameters, usually mean and variance) and a decoder (which maps sampled latent vectors back to the input space).
- In NextStep-1: The image tokenizer is a VAE fine-tuned from the Flux VAE. It converts images into 16-channel latents at an 8x spatial downsampling factor. These latents form the continuous image tokens.
- Vector Quantization (VQ):
- Conceptual Definition: VQ is a data compression technique that maps a continuous vector space onto a discrete set of "codebook" vectors. In image generation, VQ-VAEs (or VQ-GANs) encode images into a latent space, and each latent vector is "quantized" by finding the closest vector in a learned discrete codebook. This yields discrete visual tokens, making images representable as sequences of integers, similar to text.
- Limitation (as noted by the paper): Quantization loss refers to the information lost during this discrete approximation.
- Diffusion Models:
- Conceptual Definition: Diffusion models are generative models that learn to reverse a gradual diffusion process (adding noise to data) in order to generate new data from random noise. They iteratively denoise a noisy input until it resembles real data.
- Computational Intensity: Diffusion models often require many denoising steps to generate a high-quality image, making them computationally intensive at inference time.
- In NextStep-1's context: The paper contrasts NextStep-1 with AR models that rely on heavy diffusion models to process continuous image tokens after the AR model generates semantic embeddings. NextStep-1 avoids this reliance by using flow matching for its continuous token generation.
- Flow Matching:
- Conceptual Definition: Flow matching is a recent family of generative models that directly learn a vector field (a flow) transporting a simple base distribution (e.g., Gaussian noise) to a complex data distribution. Unlike diffusion models, which learn to reverse a stochastic process, flow matching typically learns a deterministic ordinary differential equation (ODE) or stochastic differential equation (SDE) path, making sampling potentially faster and more stable.
- Velocity Vector: In flow matching, the model predicts a velocity vector that indicates the direction and magnitude of movement needed to transform a noisy sample at a given timestep into a clean target sample.
- In NextStep-1: A flow matching head predicts the continuous flow from a noise sample to the next target image patch, effectively sampling the continuous image tokens.
- Cross-Entropy Loss:
- Conceptual Definition: Cross-entropy loss is a common loss function for classification and for training models that output probability distributions (like language models). It measures the difference between the true distribution (one-hot encoded for discrete tokens) and the model's predicted distribution.
- Formula: For a single discrete token prediction, if $y_c$ is the true probability (1 for the correct class, 0 otherwise) and $\hat{y}_c$ is the predicted probability, the cross-entropy loss is:
$ \mathcal{L}_{\mathrm{CE}} = - \sum_{c=1}^{C} y_c \log(\hat{y}_c) $
where $C$ is the number of classes (the vocabulary size for text tokens).
- In NextStep-1: Used for discrete text tokens via the language modeling head.
- Mean Squared Error (MSE):
- Conceptual Definition: Mean Squared Error is a common loss function for regression tasks. It measures the average of the squared differences between estimated values and actual values.
- Formula: For two vectors $A$ and $B$ of length $N$:
$ \mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (A_i - B_i)^2 $
- In NextStep-1: Used as part of the flow matching loss to quantify the difference between the predicted and target velocity vectors for continuous image tokens.
- Classifier-Free Guidance (CFG):
- Conceptual Definition: Classifier-Free Guidance is a technique used in generative models (especially diffusion and flow-based generation) to improve alignment between generated samples and a conditioning signal (e.g., a text prompt). It linearly combines the model's conditional prediction (guided by the prompt) with its unconditional prediction (no prompt). A higher guidance scale yields stronger adherence to the prompt but can introduce artifacts or reduce diversity.
- Formula (as provided in the paper):
$ \tilde{\nu}(x \mid y) = (1 - w) \cdot \nu_{\theta}(x \mid \emptyset) + w \cdot \nu_{\theta}(x \mid y) $
where:
  - $\tilde{\nu}(x \mid y)$ is the guided prediction (e.g., the predicted velocity vector in flow matching or the denoised latent in diffusion).
  - $\nu_{\theta}(x \mid \emptyset)$ is the unconditional prediction (given no conditioning, $\emptyset$).
  - $\nu_{\theta}(x \mid y)$ is the conditional prediction (given conditioning $y$, e.g., a text prompt).
  - $w$ is the guidance scale, a hyperparameter controlling the strength of the guidance.
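To make the flow matching and CFG formulas above concrete, here is a minimal PyTorch-style sketch (an illustration under common rectified-flow assumptions, not the paper's code): it computes a velocity-matching loss on a straight noise-to-data path and blends unconditional and conditional velocity predictions with a guidance scale $w$, mirroring the formula above. The `model` callable and its signature are hypothetical.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Velocity-matching loss on a linear noise->data path (a common flow matching choice).

    x1:   clean continuous tokens, shape (B, D)
    cond: conditioning vectors (e.g. transformer hidden states), shape (B, C)
    """
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), 1)             # random timestep in [0, 1]
    xt = (1 - t) * x0 + t * x1                # point on the straight path
    v_target = x1 - x0                        # constant velocity of that path
    v_pred = model(xt, t, cond)               # model predicts the velocity
    return torch.nn.functional.mse_loss(v_pred, v_target)

def guided_velocity(model, xt, t, cond, null_cond, w):
    """Classifier-free guidance: blend unconditional and conditional velocities."""
    v_uncond = model(xt, t, null_cond)        # prediction with empty conditioning
    v_cond = model(xt, t, cond)               # prediction with the text prompt
    return (1 - w) * v_uncond + w * v_cond    # guided velocity, as in the CFG formula
```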
3.2. Previous Works
The paper frames its work against two main categories of existing autoregressive text-to-image models and state-of-the-art diffusion models:
- Autoregressive Models relying on Diffusion Models:
- (Chen et al., 2025a), (Dong et al., 2024), (Sun et al., 2023, 2024b), (Zhou et al., 2025) are cited.
- Mechanism: These models typically use an autoregressive Transformer to first generate a semantic embedding or a sequence of discrete tokens representing high-level image semantics. This embedding then serves as a condition for a separate, often heavy, diffusion model, which generates the actual image in a single denoising process.
- Limitation: This approach is computationally intensive because the diffusion model is separate and handles the entire image generation, requiring significant resources. The AR model acts more as an orchestrator than a direct image generator.
- Autoregressive Models employing Vector Quantization (VQ):
- (Eslami et al., 2021), (Yu et al., 2023), (Zheng et al., 2022), (Chen et al., 2025b), (Dong et al., 2024), (Sun et al., 2024a,b), (Tong et al., 2024), (Wang et al., 2024b) are cited.
- Mechanism: These models tokenize images into discrete visual tokens using Vector Quantization (VQ). The AR model then learns to predict these discrete tokens sequentially, similar to how LLMs predict text tokens.
- Limitations:
  - Quantization Loss: Information is lost when continuous image data is mapped to a finite set of discrete tokens, which can limit the fidelity of generated images, especially fine details and textures.
  - Exposure Bias: During training, the model sees "ground truth" previous tokens; during inference, it conditions on its own predictions, so errors may accumulate over time, leading to a mismatch between training and inference distributions.
  - Suboptimal Image Tokenization: The quality of the discrete tokens themselves can bottleneck generation quality.
- Recent Efforts with Continuous Latent Representations (prior AR work):
- (Fan et al., 2024), (Li et al., 2024c), (Sun et al., 2024c), (Tschannen et al., 2024, 2025) are mentioned as showing promise. These works attempt to use continuous latent variables directly within AR frameworks, moving away from VQ.
- Gap: Despite these efforts, a significant performance gap persisted between these AR models and state-of-the-art diffusion methods in image quality and consistency (e.g., Esser et al., 2024; Labs, 2024; Podell et al., 2024).
- Flux VAE (Labs, 2024): The NextStep-1 image tokenizer is fine-tuned from the Flux.1-dev VAE, highlighting the importance of a high-performance VAE for image reconstruction as a foundation for generative models.
- σ-VAE (Sun et al., 2024c): The stochastic perturbation technique used in NextStep-1's tokenizer is adapted from σ-VAE, where it was employed to prevent variance collapse in the latent space. Variance collapse is a known issue in VAEs in which the encoder learns to output a very narrow (low-variance) distribution, making the latent space less expressive.
3.3. Technological Evolution
The field of generative AI has rapidly evolved:
- Early Generative Models (GANs, basic VAEs): Focused on generating images but often struggled with diversity or stability.
- Autoregressive Models for Text: Breakthroughs with Transformers (e.g., the GPT series) demonstrated the power of next-token prediction for coherent and context-aware text generation.
- Extending AR to Images (VQ-VAE/GAN + AR Transformer): Images were represented as sequences of discrete tokens (similar to text) so that AR Transformers could be applied. This led to models like DALL-E and VQGAN, but quantization loss and exposure bias became limitations.
- Diffusion Models Rise: Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs, e.g., Stable Diffusion) achieved unprecedented image quality and diversity, becoming the state of the art for text-to-image generation.
- Hybrid AR-Diffusion Models: Some AR models began using Transformers to generate high-level semantic tokens that then conditioned powerful diffusion models for the final image. While high quality, this approach often involved heavy, separate diffusion components.
- Continuous Latent AR Models (Pre-NextStep-1): Attempts to use continuous image representations directly within AR models to avoid VQ limitations, but these often faced performance gaps compared to diffusion models.
- NextStep-1's Position: This paper represents an evolution toward a "pure autoregressive paradigm" that directly models continuous image tokens with a lightweight flow matching head, aiming to close the performance gap with diffusion models while retaining the architectural simplicity and scalability of AR Transformers. It builds on the idea of continuous latents but introduces specific tokenizer enhancements and leverages flow matching for patch-wise generation, which is distinct from using a heavy, full-image diffusion model.
3.4. Differentiation Analysis
Compared to the main methods in related work, NextStep-1 offers several core differences and innovations:
- Discrete vs. Continuous Tokens:
  - Differentiation: Unlike VQ-based autoregressive models that use discrete visual tokens and suffer from quantization loss and exposure bias, NextStep-1 directly processes continuous image tokens. This allows for higher fidelity and avoids the limitations of discrete representations.
- Flow Matching Head vs. Heavy Diffusion Models:
  - Differentiation: NextStep-1 employs a lightweight flow matching head (157M parameters) to model the distribution of each image patch autoregressively. This contrasts with AR models that rely on heavy, separate diffusion models (often hundreds of millions to billions of parameters) to generate an entire image, which is computationally intensive and less unified. NextStep-1's approach allows for a more integrated next-token prediction paradigm for continuous data.
- Tokenizer Design for Stability:
  - Innovation: The paper highlights a novel image tokenizer design (fine-tuned from the Flux VAE) that incorporates channel-wise normalization and stochastic perturbation. This is crucial for:
    - Robustness to CFG: It mitigates statistical drift in per-token mean and variance, which typically leads to visual artifacts under strong classifier-free guidance in VAE-based AR models. This allows NextStep-1 to leverage strong guidance without degrading image quality.
    - Well-conditioned Latent Space: The noise regularization cultivates a more dispersed and robust latent space, which is empirically shown to be critical for high-fidelity generation, even though it leads to a higher reconstruction loss during tokenizer training.
- Unified Autoregressive Generation:
  - Differentiation: NextStep-1 maintains a "pure autoregressive paradigm" in which the Transformer backbone performs the core generative modeling of conditional distributions for both text and image tokens. The flow matching head acts as a lightweight sampler, directly translating the transformer's contextual prediction into continuous visual tokens, rather than merely orchestrating a separate diffusion process. This unified approach provides architectural simplicity and flexibility.
- Performance Gap Bridging:
  - Innovation: By addressing the limitations of prior AR models and incorporating the tokenizer advancements, NextStep-1 significantly closes the performance gap between autoregressive models and state-of-the-art diffusion methods in terms of image quality and consistency, achieving competitive or superior results on various benchmarks for both text-to-image generation and image editing within the AR framework.
4. Methodology
4.1. Principles
The core idea behind NextStep-1 is to extend the successful autoregressive language modeling paradigm to image generation by treating both text and image data as a unified sequence of tokens. Instead of vector quantization (VQ) for discrete image tokens or relying on a separate, heavy diffusion model for continuous tokens, NextStep-1 directly models continuous image tokens using a lightweight flow matching head. The underlying principle is next-token prediction (NTP), where a causal transformer predicts the subsequent token in a multimodal sequence, whether it's a discrete text token or a continuous image patch. This design aims to achieve high-fidelity image synthesis with the scalability and flexibility inherent in autoregressive architectures.
The theoretical basis of the method centers on the factorization of a joint probability distribution over a multimodal sequence into a product of conditional probabilities, where each token's probability is conditioned on all preceding tokens. For discrete text tokens, this is a standard classification problem handled by a language modeling head with cross-entropy loss. For continuous image tokens, it transforms into a regression problem where a flow matching head learns to predict velocity vectors to generate the next image patch, optimized with flow matching loss. The intuition is that a powerful Transformer backbone can learn complex multimodal correlations and context, while modality-specific heads efficiently handle the final prediction in their respective data spaces.
4.2. Core Methodology In-depth (Layer by Layer)
The NextStep-1 framework is built upon a Causal Transformer backbone, an Image Tokenizer, a Language Modeling Head, and a Patch-wise Flow Matching Head. The overall architecture is illustrated in Figure 2.
Figure 2 is a diagram of the NextStep-1 framework workflow: it shows how the autoregressive model processes text and image tokens through a causal transformer, and how, during training, the Flow Matching Head predicts the continuous flow from a noise sample to the target image patch.
4.2.1. Unified Multimodal Generation with Continuous Visual Tokens
The framework unifies multimodal inputs (text and images) into a single sequential data stream. Images are first converted into continuous image tokens by an image tokenizer, and then combined with discrete text tokens.
Let $x = (x_1, x_2, \dots, x_n)$ represent the unified multimodal token sequence, where each $x_i$ can be either a discrete text token or a continuous visual token (image patch). The autoregressive objective under this unified sequence is formalized as:
$ p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) . $
Here, $p(x)$ is the joint probability of the entire sequence, and $p(x_i \mid x_{<i})$ is the conditional probability of the $i$-th token given all preceding tokens $x_{<i}$. The model learns to predict $x_i$ based on $x_{<i}$.
The generation task proceeds by iteratively sampling the next token $x_i$ from the conditional distribution $p(x_i \mid x_{<i})$.
- For discrete text tokens, sampling is performed via a language modeling head.
- For continuous image tokens, sampling is performed by a flow matching head.

The training objective for NextStep-1 combines two distinct losses:

- A standard cross-entropy loss ($\mathcal{L}_{\mathrm{text}}$) for discrete text tokens.
- A flow matching loss ($\mathcal{L}_{\mathrm{visual}}$) for continuous image tokens. Specifically, the flow matching loss is the mean squared error (MSE) between the predicted and target velocity vectors that map a noised image patch to its corresponding clean image patch.

The model is trained end-to-end by optimizing a weighted sum of these two losses:
$ \mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{text}} \mathcal{L}_{\mathrm{text}} + \lambda_{\mathrm{visual}} \mathcal{L}_{\mathrm{visual}} $
where:
- $\mathcal{L}_{\mathrm{total}}$ is the total loss to be minimized.
- $\mathcal{L}_{\mathrm{text}}$ is the loss for text tokens (cross-entropy).
- $\mathcal{L}_{\mathrm{visual}}$ is the loss for image tokens (flow matching loss, based on MSE).
- $\lambda_{\mathrm{text}}$ and $\lambda_{\mathrm{visual}}$ are hyperparameters that balance the contribution of the text and visual losses, respectively.
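As a minimal sketch of the weighted objective above (function and tensor names are assumptions, not the released training code), the two losses can be combined as follows:

```python
import torch
import torch.nn.functional as F

def total_loss(text_logits, text_targets, v_pred, v_target,
               lambda_text=1.0, lambda_visual=1.0):
    """Weighted sum of the two modality-specific objectives.

    text_logits:     (N_text, vocab) predictions at text positions
    text_targets:    (N_text,) ground-truth text token ids
    v_pred/v_target: (N_img, D) predicted / target velocity vectors at image positions
    """
    loss_text = F.cross_entropy(text_logits, text_targets)        # L_text
    loss_visual = F.mse_loss(v_pred, v_target)                     # L_visual (flow matching)
    return lambda_text * loss_text + lambda_visual * loss_visual   # L_total
```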
4.2.2. Model Architecture
Image Tokenizer
The image tokenizer is a crucial component that converts raw images into continuous latent representations suitable for the Causal Transformer.
- Initialization: It is fine-tuned from a pre-trained Flux VAE (Labs, 2024), chosen for its strong reconstruction performance, and adapted to the specific data distribution.
- Encoding Process: The tokenizer first encodes an image into 16-channel latents at an 8x spatial downsampling factor. For example, a 256x256 image is encoded into a 32x32x16 latent representation.
- Latent Space Stabilization and Normalization: To ensure stable training and a well-behaved latent space, two techniques are applied:
  - Channel-wise Normalization: Each channel of the latent representation is standardized to have zero mean and unit variance. This enforces per-token statistical stability, which is critical for mitigating issues under high Classifier-Free Guidance (CFG) scales, as discussed in Section 6.2.
  - Stochastic Perturbation: To further enhance robustness and encourage a more uniform latent distribution, Gaussian noise is added to the normalized latents. This technique is adapted from σ-VAE (Sun et al., 2024c) to prevent variance collapse. The perturbed latent is calculated as:
$ \tilde{z} = \mathrm{Normalize}(z) + \alpha \cdot \varepsilon, \quad \mathrm{where}\ \alpha \sim \mathcal{U}[0, \gamma]\ \mathrm{and}\ \varepsilon \sim \mathcal{N}(0, I) $
where:
    - $z$ represents the original latents encoded by the VAE.
    - $\mathrm{Normalize}(z)$ refers to the channel-wise normalization of $z$.
    - $\varepsilon$ is standard Gaussian noise, sampled from a normal distribution with mean 0 and standard deviation 1.
    - $\alpha$ is a scaling factor for the noise, sampled uniformly from the range $[0, \gamma]$.
    - $\mathcal{U}[0, \gamma]$ denotes a uniform distribution between 0 and $\gamma$.
    - $\gamma$ is a hyperparameter that controls the maximum intensity of the added noise; a higher $\gamma$ means stronger perturbation.
- Sequence Flattening: The 16-channel latents are then transformed into a compact 1D sequence for the Causal Transformer. This involves:
  - Pixel-shuffling / Space-to-Depth Transformation: A 2x2 kernel folds spatial regions of the latents into the channel dimension. For example, 32x32x16 latents are converted into a 16x16 grid of 64-channel tokens.
  - Flattening to 1D: This grid of 64-channel tokens is then flattened into a 1D sequence of 256 tokens (16 x 16 = 256), which serves as the input to the Causal Transformer.
Causal Transformer
- Initialization: The Causal Transformer is initialized from the decoder-only Qwen2.5-14B (Yang et al., 2024), a large language model (LLM) backbone suitable for autoregressive generation.
- Input Sequence Format: The multimodal input sequence is organized as `{text} <image_area>h*w <boi> {image} <eoi>`, where:
  - `{text}` represents discrete text tokens.
  - `<image_area>h*w` is a special metadata token indicating the spatial dimensions (height and width) of the 2D image tokens that follow.
  - `<boi>` (beginning-of-image) is a special token marking the start of the continuous image token sequence.
  - `{image}` represents the continuous image tokens produced by the image tokenizer.
  - `<eoi>` (end-of-image) is a special token marking the end of the image token sequence.
- Positional Encoding: Standard 1D Rotary Position Embedding (RoPE) (Su et al., 2024) is used for positional information. Despite the existence of more complex 2D or multimodal RoPE alternatives, the authors found 1D RoPE effective and retained it for simplicity and efficiency.
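For illustration only, the input layout above could be assembled as follows; the helper below and its handling of special tokens as plain strings are hypothetical (in practice, special tokens are mapped to token ids and embeddings):

```python
def build_sequence(text_tokens, image_tokens, h, w):
    """Assemble the unified multimodal sequence:
       {text} <image_area>h*w <boi> {image} <eoi>
    text_tokens:  list of discrete text token ids
    image_tokens: list of continuous 64-dim visual tokens (h*w of them, raster order)
    """
    assert len(image_tokens) == h * w
    return (
        list(text_tokens)
        + [f"<image_area>{h}*{w}", "<boi>"]   # metadata + beginning-of-image marker
        + list(image_tokens)                   # continuous tokens
        + ["<eoi>"]                            # end-of-image marker
    )
```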
Lightweight Heads for Modality-Specific Loss
The output hidden states from the Causal Transformer (LLM backbone) are passed to two lightweight heads, each responsible for computing modality-specific losses:
- Language Modeling Head:
  - Function: This head is responsible for predicting discrete text tokens.
  - Loss: It computes the cross-entropy loss over the hidden states corresponding to text tokens.
- Patch-wise Flow Matching Head:
  - Function: This head is responsible for generating continuous image tokens (patches). It follows the approach of (Li et al., 2024c).
  - Architecture: It is a 157M-parameter MLP (Multi-Layer Perceptron) with 12 layers and a hidden dimension of 1536.
  - Process: It uses each patch-wise image hidden state from the Causal Transformer as a condition. The head then denoises a target patch at various timesteps t and computes the patch-wise flow matching loss (Lipman et al., 2023a), which measures the difference between the predicted and target velocity vectors for transforming a noisy patch into a clean patch.
  - Sampling: During inference, this head iteratively guides noise toward the next image patch, building the image autoregressively.

This structured approach allows the powerful Causal Transformer to handle the complex multimodal context and next-token prediction logic, while the specialized, lightweight heads efficiently generate the final output in their respective modalities.
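The following is a hedged sketch of what a patch-wise flow matching head of this kind could look like: an MLP that predicts a velocity for one 64-dimensional patch conditioned on a transformer hidden state, plus a simple Euler sampling loop. The layer count and hidden width follow the description above; the conditioning dimension, timestep handling, and everything else are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small MLP that predicts a velocity for one image patch, conditioned on the LLM hidden state."""
    def __init__(self, patch_dim=64, cond_dim=4096, hidden_dim=1536, depth=12):
        # cond_dim: hidden size of the transformer backbone (assumed value).
        super().__init__()
        layers = [nn.Linear(patch_dim + cond_dim + 1, hidden_dim), nn.SiLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.SiLU()]
        layers += [nn.Linear(hidden_dim, patch_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x_t, t, h):
        # Concatenate noisy patch, timestep, and conditioning hidden state.
        return self.net(torch.cat([x_t, t, h], dim=-1))

@torch.no_grad()
def sample_patch(head, h, patch_dim=64, steps=20):
    """Euler integration of the learned velocity field from noise (t=0) to a clean patch (t=1)."""
    x = torch.randn(h.size(0), patch_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((h.size(0), 1), i * dt)
        x = x + head(x, t, h) * dt
    return x
```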
5. Experimental Setup
5.1. Datasets
To equip NextStep-1 with broad and versatile capabilities, a diverse training corpus comprising four main categories was constructed:
5.1.1. Text-only Corpus
- Source & Scale: 400B text-only tokens sampled from Step-3 (Wang et al., 2025a).
- Purpose: To preserve the extensive language capabilities inherent in the Qwen2.5-14B large language model (LLM) backbone used for initialization.
- Characteristic: High-quality, diverse textual data.
5.1.2. Image-Text Pair Data
This forms the foundation for text-to-image generation capabilities. A comprehensive pipeline was developed for curation:
- Data Sourcing: Collected from diverse sources, including web data, multi-task VQA (Visual Question Answering) data, and text-rich documents.
- Quality-Based Filtering: A rigorous filtering process was applied, evaluating images based on:
- Aesthetic quality
- Watermark presence
- Clarity
- OCR (Optical Character Recognition) detection
- Text-image semantic alignment
- Re-captioning: After deduplication, the filtered images were re-captioned using Step-1o-turbo to generate rich and detailed captions in both English and Chinese.
- Purpose: To provide a high-quality, large-scale dataset for training a model with a strong aesthetic sense and broad world knowledge, fundamental for text-to-image generation.
5.1.3. Instruction-Guided Image-to-Image Data
Curated to enable a wide range of practical applications beyond pure generation:
- Visual Perception & Controllable Image Generation:
  - Source & Scale: 1M samples synthesized by applying the annotator of ControlNet (Zhang et al., 2023b) to a portion of the high-quality image-text pair data.
  - Characteristic: Data includes explicit control signals (e.g., edge maps, segmentation maps) alongside images.
  - Example (ControlNet): If the input is an image of a cat and the instruction is "Turn this cat into a dog while preserving its pose," ControlNet might generate a pose map from the original cat image, which then guides the generation of the dog image.
- Image Restoration & General Image Editing:
  - Source & Scale: Initially collected 3.5M samples from GPT-Image-Edit (Wang et al., 2025d), Step1X-Edit (Liu et al., 2025), and a proprietary in-house dataset.
  - Filtering: All editing data were subjected to a rigorous VLM-based (Vision-Language Model) filtering pipeline assessing image-pair quality, rationality, consistency, and instruction alignment, resulting in approximately 1M high-quality instruction-guided image-to-image samples.
  - Purpose: To strengthen the model's capabilities in tasks like editing, inpainting, outpainting, and other instruction-guided image manipulations.
5.1.4. Interleaved Data
Integrates text and images seamlessly to foster rich and nuanced sequential associations:
- General Video-Interleaved Data:
  - Source & Scale: A large-scale, 80M-sample video-interleaved dataset constructed via a meticulous curation pipeline inspired by Step-Video (Ma et al., 2025a), involving frame extraction, deduplication, and captioning.
  - Purpose: To endow the model with extensive world knowledge from video content.
- Tutorials:
  - Source: Collected and processed tutorial videos using ASR (Automatic Speech Recognition) and OCR (Optical Character Recognition) tools, following the methodology of mmtextbook (Zhang et al., 2025).
  - Purpose: Specifically targets text-rich, real-world scenes, enhancing the model's textual understanding and generation in context.
- Character-Centric Scenes (NextStep-Video-Interleave-5M):
  - Source & Scale: A key contribution comprising 5M samples. Video frames centered on specific characters were extracted, and rich, storytelling-style captions were generated, akin to (Oliveira and de Matos, 2025).
  - Purpose: Significantly improves the model's capacity for multi-turn interaction and consistent character generation.
  - Data Sample Example: Figure 3 of the paper illustrates the processing of this data: a diagram of the character binding and multimodal captioning pipeline, including face detection, feature matching, and frame extraction, with characters matched via cosine similarity and a checklist to ensure consistency (a small matching sketch appears after this list).
- Multi-View Data:
  - Source: Curated from two open-source datasets: MV-ImageNet-v2 (Han et al., 2024) and Objaverse-XL (Deitke et al., 2023).
  - Purpose: To bolster geometric reasoning and enhance the model's ability to maintain multi-view consistency (i.e., generating consistent images of an object from different viewpoints).
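As referenced in the character-centric item above, here is a small illustrative sketch of cosine-similarity character matching; the threshold and data structures are assumptions, not the paper's pipeline:

```python
import numpy as np

def match_character(face_embedding, reference_embeddings, threshold=0.6):
    """Return the id of the reference character whose embedding is most similar,
    or None if no cosine similarity exceeds the (assumed) threshold."""
    best_id, best_sim = None, threshold
    for char_id, ref in reference_embeddings.items():
        sim = float(np.dot(face_embedding, ref) /
                    (np.linalg.norm(face_embedding) * np.linalg.norm(ref) + 1e-8))
        if sim > best_sim:
            best_id, best_sim = char_id, sim
    return best_id
```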
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided:
- WISE (World Knowledge-Informed Semantic Evaluation) (Niu et al., 2025):
  - Conceptual Definition: WISE is a benchmark designed to evaluate a text-to-image model's ability to integrate world knowledge and perform semantic understanding. It emphasizes factual grounding, reasoning, and the correct depiction of entities, events, and relationships across domains such as Cultural, Time, Space, Biology, Physics, and Chemistry. A higher score indicates better knowledge awareness and semantic alignment.
  - Mathematical Formula: The paper does not provide an explicit formula for WISE. Such benchmarks typically rely on human evaluation or automated metrics (e.g., CLIP-score variants, VQA models) over a set of knowledge-intensive prompts, with the overall score often computed as an average of sub-category scores:
$ \text{WISE Score} = \frac{1}{N_d} \sum_{d \in \text{Domains}} S_d $
  - Symbol Explanation:
    - $\text{WISE Score}$: The overall score quantifying world knowledge integration.
    - $S_d$: The score obtained for a specific knowledge domain $d$ (e.g., Cultural, Time, Space).
    - $N_d$: The total number of knowledge domains evaluated.
- GenAI-Bench (Lin et al., 2024):
  - Conceptual Definition: GenAI-Bench evaluates text-to-image models on their compositional and linguistic understanding, assessing how well they follow prompts, particularly basic and advanced compositional instructions. It measures the ability to generate images with multiple objects, attributes, and spatial relationships as described in the prompt. A higher score indicates better prompt following.
  - Mathematical Formula: The paper does not provide an explicit formula. GenAI-Bench typically uses a combination of automated metrics (such as CLIP-score for text-image alignment) and potentially human evaluation, with scores often normalized between 0 and 1 and reported separately for "Basic" and "Advanced" prompts:
$ \text{GenAI-Bench Score} = \text{Metric}(\text{Generated Images}, \text{Prompts}) $
where Metric could be an automated alignment score.
  - Symbol Explanation:
    - $\text{GenAI-Bench Score}$: The score reflecting compositional and linguistic understanding.
    - $\text{Generated Images}$: The images produced by the model.
    - $\text{Prompts}$: The textual conditioning provided to the model.
- DPG-Bench (Hu et al., 2024):
  - Conceptual Definition: DPG-Bench (likely "Detailed Prompt Generation Bench") assesses a model's compositional fidelity under complex and long prompts, especially those involving multiple objects and intricate scene descriptions. It measures how accurately the model realizes all elements and relationships specified in a detailed textual input. A higher score indicates better adherence to complex prompts.
  - Mathematical Formula: The paper does not provide an explicit formula. Like other alignment benchmarks, it involves a quantitative assessment of generated images against detailed prompts, with higher values indicating better performance.
  - Symbol Explanation:
    - $\text{DPG-Bench Score}$: The score reflecting compositional fidelity for long, complex prompts.
- OneIG-Bench (Omni-dimensional Nuanced Evaluation for Image Generation) (Chang et al., 2025):
  - Conceptual Definition: OneIG-Bench provides an omni-dimensional, nuanced evaluation of text-to-image generation across several fine-grained aspects:
    - Alignment: How well the image matches the prompt's overall content.
    - Text: Ability to render text accurately within the image.
    - Reasoning: Performance on prompts requiring logical reasoning or world knowledge.
    - Style: Control over the aesthetic style specified in the prompt.
    - Diversity: Variety in generated images for the same prompt.
    - Overall: A composite score across these dimensions.
  - Mathematical Formula: The paper does not provide explicit formulas for each sub-metric; they are typically computed using Vision-Language Models (VLMs) or human evaluation. The overall score is usually an average or weighted sum of the sub-scores:
$ \text{Overall Score} = \frac{1}{N_{sub}} \sum_{s \in \text{SubMetrics}} S_s $
  - Symbol Explanation:
    - $\text{Overall Score}$: The composite score across the fine-grained aspects.
    - $S_s$: The score for a specific sub-metric $s$ (e.g., Alignment, Text, Reasoning).
    - $N_{sub}$: The total number of sub-metrics evaluated.
- GenEval (Ghosh et al., 2023):
  - Conceptual Definition: GenEval is a benchmark for evaluating text-to-image alignment. It measures how accurately the generated image corresponds to the textual prompt, often focusing on fidelity to the attributes and objects mentioned.
  - Mathematical Formula: The paper does not provide an explicit formula. The metric often relies on internal scoring mechanisms, possibly using CLIP-based scores or human preference ratings, to quantify the degree of alignment.
  - Symbol Explanation: Not provided by the paper; GenEval is reported as a scalar score.
- GEdit-Bench (Liu et al., 2025):
  - Conceptual Definition: GEdit-Bench evaluates instruction-guided image editing models. It uses GPT-4.1 (an advanced large language model) to assess three key metrics:
    - G_SC (Instruction Following Score): Measures how well the model followed the given editing instructions.
    - G_PQ (Perceptual Quality Score): Measures the overall visual quality and realism of the edited image.
    - G_O (Overall Score): A composite score reflecting the overall effectiveness and quality of the editing.
  - Mathematical Formula: The paper does not provide explicit formulas, as these scores are derived from GPT-4.1 evaluations, likely on a defined scale (e.g., 1-10 or 1-7).
  - Symbol Explanation:
    - G_SC: GPT-4.1-based Instruction Following Score.
    - G_PQ: GPT-4.1-based Perceptual Quality Score.
    - G_O: GPT-4.1-based Overall Score.
- ImgEdit-Bench (Ye et al., 2025):
  - Conceptual Definition: ImgEdit-Bench is another benchmark specifically for instruction-guided image editing, assessing the model's ability to perform various editing tasks based on textual instructions. A higher score indicates better editing capability.
  - Mathematical Formula: The paper does not provide an explicit formula. The benchmark likely employs a quantitative metric or human evaluation to assess quality and adherence to editing instructions.
  - Symbol Explanation: Not provided by the paper; ImgEdit-Bench is reported as a scalar score.
- PSNR (Peak Signal-to-Noise Ratio):
  - Conceptual Definition: PSNR quantifies the reconstruction quality of lossy compression codecs or image restoration algorithms. It compares the maximum possible power of a signal to the power of the corrupting noise that affects the fidelity of its representation. Higher PSNR values generally indicate higher-quality reconstruction (a small numeric sketch appears after this metric list).
  - Mathematical Formula:
$ \text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_I^2}{\text{MSE}} \right) $
  - Symbol Explanation:
    - $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit image, or $2^B - 1$ for a $B$-bit image).
    - $\text{MSE}$: The Mean Squared Error between the original (ground truth) image $I$ and the reconstructed image $K$, for images of size $m \times n$:
$ \text{MSE} = \frac{1}{m n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
- SSIM (Structural Similarity Index Measure):
  - Conceptual Definition: SSIM measures the similarity between two images. Unlike PSNR, which measures absolute errors, SSIM models the perceived change in structural information, which aligns better with human visual perception. It considers three factors: luminance, contrast, and structure. SSIM ranges from -1 to 1, where 1 indicates perfect structural similarity.
  - Mathematical Formula: For two images $x$ and $y$:
$ \text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
  - Symbol Explanation:
    - $\mu_x$, $\mu_y$: The average (mean) of images $x$ and $y$.
    - $\sigma_x^2$, $\sigma_y^2$: The variance of images $x$ and $y$.
    - $\sigma_{xy}$: The covariance of images $x$ and $y$.
    - $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$: Two small constants included to prevent division by zero or near-zero values, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1$, $k_2$ are small constants.
- rFID (Reconstruction Fréchet Inception Distance):
  - Conceptual Definition: The Fréchet Inception Distance (FID) assesses the quality of images produced by generative models by measuring the "distance" between the distribution of generated (or reconstructed) images and the distribution of real images in the feature space of a pre-trained Inception-v3 network. Lower FID scores indicate higher quality and more diverse outputs. rFID commonly refers to reconstruction FID, i.e., FID computed between reconstructed images and their originals, used to assess the reconstruction quality of an image tokenizer or VAE.
  - Mathematical Formula: For two distributions, real ($X_r$) and generated ($X_g$), assumed to be multivariate Gaussian with means $\mu_r, \mu_g$ and covariance matrices $\Sigma_r, \Sigma_g$:
$ \text{FID}(X_r, X_g) = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}) $
  - Symbol Explanation:
    - $\mu_r$: Mean of the feature vectors for real images.
    - $\mu_g$: Mean of the feature vectors for generated images.
    - $\Sigma_r$: Covariance matrix of the feature vectors for real images.
    - $\Sigma_g$: Covariance matrix of the feature vectors for generated images.
    - $||\cdot||^2$: Squared Euclidean distance.
    - $\text{Tr}(\cdot)$: Trace of a matrix (sum of its diagonal elements).
    - $(\Sigma_r\Sigma_g)^{1/2}$: Matrix square root.
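As referenced in the PSNR item above, here are minimal NumPy/SciPy implementations of the PSNR and FID formulas from this list; they are generic sketches, not the benchmarks' official scoring code:

```python
import numpy as np
from scipy import linalg

def psnr(original, reconstructed, max_val=255.0):
    """PSNR = 10 * log10(MAX_I^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def fid(feats_real, feats_gen):
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):            # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```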
5.3. Baselines
The NextStep-1 model is compared against a comprehensive set of baseline models across three categories for text-to-image generation and image editing.
5.3.1. Proprietary Models
These are state-of-the-art closed-source models known for high performance:
- DALL-E 3 (Betker et al., 2023)
- Seedream 3.0 (Gao et al., 2025)
- GPT-4o (OpenAI, 2025b)
- Imagen3 (Baldridge et al., 2024)
- Recraft V3 (team, 2024)
- Kolors 2.0 (team, 2025)
- Imagen4 (deepmind Imagen4 team, 2025)
- Gemini 2.0 (Gemini2, 2025)
- Doubao (Shi et al., 2024)
- Flux.1-Kontext-pro (Labs et al., 2025)
5.3.2. Diffusion Models
These models represent the current state-of-the-art in open-source or publicly available diffusion-based image generation:
- Stable Diffusion 1.5 (Rombach et al., 2022)
- Stable Diffusion XL (Podell et al., 2024)
- Stable Diffusion 3 Medium (Esser et al., 2024)
- Stable Diffusion 3.5 Large (Stability-AI, 2024)
- PixArt-Alpha (Chen et al., 2024b)
- Flux.1-dev (Labs, 2024)
- Transfusion (Zhou et al., 2025)
- CogView4 (Z.ai, 2025)
- Lumina-Image 2.0 (Qin et al., 2025)
- HiDream-I1-Full (Cai et al., 2025)
- Mogao (Liao et al., 2025)
- BAGEL (Deng et al., 2025)
- Show-o2-7B (Xie et al., 2025b)
- OmniGen2 (Wu et al., 2025b)
- Qwen-Image (Wu et al., 2025a)
- Playground v2.5 (Li et al., 2024b)
- MetaQuery-XL (Pan et al., 2025)
- BLIP3-o (Chen et al., 2025a)
- SANA-1.5 1.6B (PAG) (Xie et al., 2025a)
- SANA-1.5 4.8B (PAG) (Xie et al., 2025a)
- Show-o2-1.5B (Xie et al., 2025b)
5.3.3. Autoregressive Models
These are other autoregressive models for text-to-image generation:
- SEED-X (Ge et al., 2024)
- Show-o (Xie et al., 2024)
- VILA-U (Wu et al., 2024)
- Emu3 (Wang et al., 2024b)
- SimpleAR (Wang et al., 2025c)
- Fluid (Fan et al., 2024)
- Infinity (Han et al., 2025)
- Janus-Pro-7B (Chen et al., 2025b)
- Token-Shuffle (Ma et al., 2025b)
- Show-o-512 (Xie et al., 2024)

The choice of baselines is comprehensive, covering leading models across different paradigms, allowing for a thorough evaluation of NextStep-1's performance relative to the current state-of-the-art.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance of Text-to-Image Generation
NextStep-1 is comprehensively evaluated on several benchmarks, demonstrating strong capabilities in text-to-image generation.
The following are the results from Table 2 of the original paper:
| Method | GenEval↑ | GenAI-Bench↑ (Basic) | GenAI-Bench↑ (Advanced) | DPG-Bench↑ |
|---|---|---|---|---|
| Proprietary | ||||
| DALL-E 3 (Betker et al., 2023) | 0.67 | 0.90 | 0.70 | 83.50 |
| Seedream 3.0 (Gao et al., 2025) | 0.84 | - | - | 88.27 |
| GPT-4o (OpenAI, 2025b) | 0.84 | - | - | 85.15 |
| Diffusion | ||||
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.43 | - | - | - |
| Stable Diffusion XL (Podell et al., 2024) | 0.55 | 0.83 | 0.63 | 74.65 |
| Stable Diffusion 3 Medium (Esser et al., 2024) | 0.74 | 0.88 | 0.65 | 84.08 |
| Stable Diffusion 3.5 Large (Esser et al., 2024) | 0.71 | 0.88 | 0.66 | 83.38 |
| PixArt-Alpha (Chen et al., 2024b) | 0.48 | - | - | 71.11 |
| Flux.1-dev (Labs, 2024) | 0.66 | 0.86 | 0.65 | 83.79 |
| Transfusion (Zhou et al., 2025) | 0.63 | - | - | - |
| CogView4 (Z.ai, 2025) | 0.73 | - | - | 85.13 |
| Lumina-Image 2.0 (Qin et al., 2025) | 0.73 | - | - | 87.20 |
| HiDream-I1-Full (Cai et al., 2025) | 0.83 | 0.91 | 0.66 | 85.89 |
| Mogao (Liao et al., 2025) | 0.89 | - | 0.68 | 84.33 |
| BAGEL (Deng et al., 2025) | 0.82/0.88‡ | 0.89/0.86‡ | 0.69/0.75† | 85.07 |
| Show-o2-7B (Xie et al., 2025b) | 0.76 | - | - | 86.14 |
| OmniGen2 (Wu et al., 2025b) | 0.80/0.86* | - | - | 83.57 |
| Qwen-Image (Wu et al., 2025a) | 0.87 | - | - | 88.32 |
| AutoRegressive | ||||
| SEED-X (Ge et al., 2024) | 0.49 | 0.86 | 0.70 | - |
| Show-o (Xie et al., 2024) | 0.53 | 0.70 | 0.60 | - |
| VILA-U (Wu et al., 2024) | - | 0.76 | 0.64 | - |
| Emu3 (Wang et al., 2024b) | 0.54/0.65* | 0.78 | 0.60 | 80.60 |
| SimpleAR (Wang et al., 2025c) | 0.63 | - | - | 81.97 |
| Fluid (Fan et al., 2024) | 0.69 | - | - | - |
| Infinity (Han et al., 2025) | 0.79 | - | - | 86.60 |
| Janus-Pro-7B (Chen et al., 2025b) | 0.80 | 0.86 | 0.66 | 84.19 |
| Token-Shuffle (Ma et al., 2025b) | 0.62 | 0.78 | 0.67 | - |
| NextStep-1 | 0.63/0.73† | 0.88/0.90* | 0.67/0.74* | 85.28 |
Note: * result is with rewriting. † result is with Self-CoT. ‡ results are not specified in the paper's table footnote but appear to denote Self-CoT or similar reasoning enhancement based on context.
Image-Text Alignment
- GenEval: NextStep-1 achieves 0.63, which increases to 0.73 with Self-CoT (Self-Chain-of-Thought), indicating strong prompt-following ability. This is comparable to diffusion models like Transfusion (0.63) and Flux.1-dev (0.66), and significantly outperforms older diffusion models like Stable Diffusion 1.5 (0.43) and Stable Diffusion XL (0.55). Among autoregressive models, it outperforms Emu3 (0.54) and Show-o (0.53).
- GenAI-Bench:
  - Basic Prompts: NextStep-1 scores 0.88 (0.90 with Self-CoT), demonstrating excellent compositional abilities. This is on par with Stable Diffusion 3 Medium (0.88) and 3.5 Large (0.88), and slightly better than Flux.1-dev (0.86).
  - Advanced Prompts: NextStep-1 achieves 0.67 (0.74 with Self-CoT), showcasing its capability to handle complex prompts. This is competitive with Stable Diffusion 3.5 Large (0.66) and BAGEL (0.69).
- DPG-Bench: For long-context, multi-object scenes, NextStep-1 achieves 85.28, confirming its reliable compositional fidelity under complex prompts. This score is higher than DALL-E 3 (83.50), GPT-4o (85.15), and Stable Diffusion 3 Medium (84.08), placing it among the top performers in this category.

The following are the results from Table 3 of the original paper:
| Model | Alignment | Text | Reasoning | Style | Diversity | Overall↑ |
|---|---|---|---|---|---|---|
| Proprietary | | | | | | |
| Imagen3 (Baldridge et al., 2024) | 0.843 | 0.343 | 0.313 | 0.359 | 0.188 | 0.409 |
| Recraft V3 (team, 2024) | 0.810 | 0.795 | 0.323 | 0.378 | 0.205 | 0.502 |
| Kolors 2.0 (team, 2025) | 0.820 | 0.427 | 0.262 | 0.360 | 0.300 | 0.434 |
| Seedream 3.0 (Gao et al., 2025) | 0.818 | 0.865 | 0.275 | 0.413 | 0.277 | 0.530 |
| Imagen4 (deepmind Imagen4 team, 2025) | 0.857 | 0.805 | 0.338 | 0.377 | 0.199 | 0.515 |
| GPT-4o (OpenAI, 2025b) | 0.851 | 0.857 | 0.345 | 0.462 | 0.151 | 0.533 |
| Diffusion | | | | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.565 | 0.010 | 0.207 | 0.383 | 0.429 | 0.319 |
| Stable Diffusion XL (Podell et al., 2024) | 0.688 | 0.029 | 0.237 | 0.332 | 0.296 | 0.316 |
| Stable Diffusion 3.5 Large (Stability-AI, 2024) | 0.809 | 0.629 | 0.294 | 0.353 | 0.225 | 0.462 |
| Flux.1-dev (Labs, 2024) | - | - | - | - | - | - |
| CogView4 (Z.ai, 2025) | 0.786 | 0.523 | 0.253 | 0.368 | 0.238 | 0.434 |
| SANA-1.5 1.6B (PAG) (Xie et al., 2025a) | 0.786 | 0.641 | 0.246 | 0.353 | 0.205 | 0.446 |
| SANA-1.5 4.8B (PAG) (Xie et al., 2025a) | 0.762 | 0.054 | 0.209 | 0.387 | 0.222 | 0.327 |
| Lumina-Image 2.0 (Qin et al., 2025) | 0.765 | 0.069 | 0.217 | 0.401 | 0.216 | 0.334 |
| HiDream-I1-Full (Cai et al., 2025) | 0.819 | 0.106 | 0.270 | 0.354 | 0.216 | 0.353 |
| BLIP3-o (Chen et al., 2025a) | 0.829 | 0.707 | 0.317 | 0.347 | 0.186 | 0.477 |
| | 0.711 | 0.013 | 0.223 | 0.361 | 0.229 | 0.307 |
| BAGEL (Deng et al., 2025) | 0.769 | 0.244 | 0.173 | 0.367 | 0.251 | 0.361 |
| Show-o2-1.5B (Xie et al., 2025b) | 0.798 | 0.002 | 0.219 | 0.317 | 0.186 | 0.304 |
| Show-o2-7B (Xie et al., 2025b) | 0.817 | 0.002 | 0.226 | 0.317 | 0.177 | 0.308 |
| OmniGen2 (Wu et al., 2025b) | 0.804 | 0.680 | 0.271 | 0.377 | 0.242 | 0.475 |
| Qwen-Image (Wu et al., 2025a) | 0.882 | 0.891 | 0.306 | 0.418 | 0.197 | 0.539 |
| AutoRegressive | | | | | | |
| Emu3 (Wang et al., 2024b) | 0.737 | 0.010 | 0.193 | 0.361 | 0.251 | 0.311 |
| Janus-Pro (Chen et al., 2025b) | 0.553 | 0.001 | 0.139 | 0.276 | 0.365 | 0.267 |
| NextStep-1 | 0.826 | 0.507 | 0.224 | 0.332 | 0.199 | 0.417 |
- OneIG-Bench (English Prompts): NextStep-1 achieves an overall score of 0.417. This result significantly outperforms its autoregressive peers, such as Emu3 (0.311) and Janus-Pro (0.267). Breaking down the metrics:
  - Alignment (0.826): Competitive with top diffusion models like Stable Diffusion 3.5 Large (0.809) and OmniGen2 (0.804).
  - Text (0.507): While not reaching the top proprietary models (GPT-4o at 0.857), it substantially outperforms most open-source diffusion models (many below 0.1) and autoregressive peers (Emu3 at 0.010, Janus-Pro at 0.001), indicating strong text rendering ability for an AR model.
  - Reasoning (0.224): Mid-range performance, indicating room for improvement, but still better than some diffusion and AR models.
  - Style (0.332): In line with many diffusion models.
  - Diversity (0.199): Within the typical range for high-fidelity models, which sometimes show less diversity than lower-fidelity ones.

The following are the results from Table 4 of the original paper:

| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall↑ | Overall (Rewrite)↑ |
|---|---|---|---|---|---|---|---|---|
| Proprietary | | | | | | | | |
| GPT-4o (OpenAI, 2025b) | 0.81 | 0.71 | 0.89 | 0.83 | 0.79 | 0.74 | 0.80 | - |
| Diffusion | | | | | | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.34 | 0.35 | 0.32 | 0.28 | 0.29 | 0.21 | 0.32 | 0.50 |
| Stable Diffusion XL (Podell et al., 2024) | 0.43 | 0.48 | 0.47 | 0.44 | 0.45 | 0.27 | 0.43 | 0.65 |
| Stable Diffusion 3.5 Large (Stability-AI, 2024) | 0.44 | 0.50 | 0.58 | 0.44 | 0.52 | 0.31 | 0.46 | 0.72 |
| PixArt-Alpha (Chen et al., 2024b) | 0.45 | 0.50 | 0.48 | 0.49 | 0.56 | 0.34 | 0.47 | 0.63 |
| Playground v2.5 (Li et al., 2024b) | 0.49 | 0.58 | 0.55 | 0.43 | 0.48 | 0.33 | 0.49 | 0.71 |
| Flux.1-dev (Labs, 2024) | 0.48 | 0.58 | 0.62 | 0.42 | 0.51 | 0.35 | 0.50 | 0.73 |
| MetaQuery-XL (Pan et al., 2025) | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 | - |
| BAGEL (Deng et al., 2025) | 0.44/0.76‡ | 0.55/0.69† | 0.68/0.75‡ | 0.44/0.65† | 0.60/0.75† | 0.39/0.58† | 0.52/0.70† | 0.71/0.77† |
| Qwen-Image (Wu et al., 2025a) | 0.62 | 0.63 | 0.77 | 0.57 | 0.75 | 0.40 | 0.62 | - |
| AutoRegressive | | | | | | | | |
| Show-o-512 (Xie et al., 2024) | 0.28 | 0.40 | 0.48 | 0.30 | 0.46 | 0.30 | 0.35 | 0.64 |
| VILA-U (Wu et al., 2024) | 0.26 | 0.33 | 0.37 | 0.35 | 0.39 | 0.23 | 0.31 | - |
| Emu3 (Wang et al., 2024b) | 0.34 | 0.45 | 0.48 | 0.41 | 0.45 | 0.27 | 0.39 | - |
| Janus-Pro-7B (Chen et al., 2025b) | 0.30 | 0.37 | 0.49 | 0.36 | 0.42 | 0.26 | 0.35 | 0.71 |
| NextStep-1 | 0.51/0.70‡ | 0.54/0.65‡ | 0.61/0.69‡ | 0.52/0.63† | 0.63/0.73‡ | 0.48/0.52† | 0.54/0.67* | 0.79/0.83* |
Note: * result is with Self-CoT. The ‡ and † markers are not defined in the paper's table footnote; from context (and the footnote of the previous table) they appear to denote Self-CoT or a similar reasoning enhancement, while the "Rewrite" column refers to the prompt rewrite protocol.
World Knowledge
- WISE Benchmark: NextStep-1 achieves an overall score of 0.54, which improves to 0.67 with Self-CoT (sketched in code below). This is the best performance among autoregressive models, significantly outperforming Emu3 (0.39) and Janus-Pro-7B (0.35). It also exceeds most diffusion models, including Stable Diffusion 3.5 Large (0.46) and Flux.1-dev (0.50).
- Prompt Rewrite Protocol: Under the prompt rewrite protocol, NextStep-1's score increases to 0.79 (0.83 with Self-CoT). This demonstrates robust knowledge-aware semantic alignment and cross-domain reasoning, approaching proprietary models like GPT-4o (0.80).
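As a concrete illustration of the Self-CoT pattern referenced above, here is a minimal sketch: the model first writes a short reasoning passage that unpacks the knowledge implied by the prompt, and the image is then generated from the prompt augmented with that reasoning. The function names are placeholders, not the released API.

```python
# Minimal Self-CoT sketch (illustrative; `generate_text` and `generate_image`
# are placeholder callables, not functions from the NextStep-1 codebase).
def self_cot_generate(prompt: str, generate_text, generate_image):
    # Step 1: produce a textual chain-of-thought about the prompt.
    reasoning = generate_text(
        f"Before drawing, explain the key visual facts implied by: {prompt}"
    )
    # Step 2: condition image generation on the CoT-enriched prompt.
    enriched_prompt = f"{prompt}\n\nReasoning: {reasoning}"
    return generate_image(enriched_prompt)

# Toy stand-ins so the sketch runs end to end.
fake_llm = lambda p: "the Eiffel Tower is an iron lattice tower in Paris"
fake_t2i = lambda p: f"<image conditioned on {len(p)} prompt characters>"
print(self_cot_generate("a postcard of the Eiffel Tower at dawn", fake_llm, fake_t2i))
```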
6.1.2. Performance of Image Editing
The following are the results from Table 5 of the original paper:
| Model | GEdit-Bench-EN G_SC↑ | GEdit-Bench-EN G_PQ↑ | GEdit-Bench-EN G_O↑ | GEdit-Bench-CN G_SC↑ | GEdit-Bench-CN G_PQ↑ | GEdit-Bench-CN G_O↑ | ImgEdit-Bench↑ |
|---|---|---|---|---|---|---|---|
| Proprietary | | | | | | | |
| Gemini 2.0 (Gemini2, 2025) | 6.87 | 7.44 | 6.51 | 5.26 | 7.60 | 5.14 | - |
| Doubao (Shi et al., 2024) | 7.22 | 7.89 | 6.98 | 7.17 | 7.79 | 6.84 | - |
| GPT-4o (OpenAI, 2025b) | 7.74 | 8.13 | 7.49 | 7.52 | 8.02 | 7.30 | 4.20 |
| Flux.1-Kontext-pro (Labs et al., 2025) | 7.02 | 7.60 | 6.56 | 1.11 | 7.36 | 1.23 | - |
| Open-source | | | | | | | |
| Instruct-Pix2Pix (Brooks et al., 2023) | 3.30 | 6.19 | 3.22 | - | - | - | 1.88 |
| MagicBrush (Zhang et al., 2023a) | 4.52 | 6.37 | 4.19 | - | - | - | 1.83 |
| AnyEdit (Yu et al., 2024a) | 3.05 | 5.88 | 2.85 | - | - | - | 2.45 |
| OmniGen (Xiao et al., 2024) | 5.88 | 5.87 | 5.01 | - | - | - | 2.96 |
| OmniGen2 (Wu et al., 2025b) | 7.16 | 6.77 | 6.41 | - | - | - | 3.44 |
| Step1X-Edit v1.0 (Liu et al., 2025) | 7.13 | 7.00 | 6.44 | 7.30 | 7.14 | 6.66 | 3.06 |
| Step1X-Edit v1.1 (Liu et al., 2025) | 7.66 | 7.35 | 6.97 | 7.65 | 7.40 | 6.98 | - |
| BAGEL (Deng et al., 2025) | 7.36 | 6.83 | 6.52 | 7.34 | 6.85 | 6.50 | 3.42 |
| Flux.1-Kontext-dev (Labs et al., 2025) | - | - | 6.26 | - | - | - | 3.71 |
| GPT-Image-Edit (Wang et al., 2025d) | - | - | 7.24 | - | - | - | 3.80 |
| NextStep-1 | 7.15 | 7.01 | 6.58 | 6.88 | 7.02 | 6.40 | 3.71 |
NextStep-1-Edit (fine-tuned on 1M edit-only samples) demonstrates competitive performance in image editing:

- GEdit-Bench-EN (Full Set): NextStep-1-Edit achieves an overall score (G_O) of 6.58. This is highly competitive with strong open-source models like OmniGen2 (6.41), Step1X-Edit v1.0 (6.44), and BAGEL (6.52). It also shows strong instruction following (G_SC = 7.15) and perceptual quality (G_PQ = 7.01).
- ImgEdit-Bench: NextStep-1-Edit scores 3.71, which is on par with Flux.1-Kontext-dev (3.71) and close to GPT-Image-Edit (3.80) and GPT-4o (4.20), while outperforming many other open-source methods such as OmniGen2 (3.44) and BAGEL (3.42).

These results highlight NextStep-1's versatility and capability to perform high-quality image editing, demonstrating the power of its unified autoregressive approach.
6.1.3. Qualitative Performance
The image is an illustration that showcases the applications of NextStep-1 in high-fidelity image generation, diverse image editing, and complex free-form manipulation. The upper section displays examples of image generation, the middle part shows the functionalities of image editing, and the lower section introduces scenarios of free-form manipulation.
Figure 1 provides qualitative examples, showcasing NextStep-1's capabilities in high-fidelity image generation, diverse image editing, and complex free-form manipulation. The generated images appear coherent, aesthetically pleasing, and align well with prompts. The editing examples show plausible and context-aware modifications.
6.2. Ablation Studies / Parameter Analysis
6.2.1. What Governs Image Generation: the AR Transformer or the FM Head?
The paper investigates the relative importance of the AR Transformer backbone and the Flow Matching (FM) head in image generation.
The following are the results from Table 6 of the original paper:

| | Layers | Hidden Size | # Parameters |
|---|---|---|---|
| FM Head Small | 6 | 1024 | 40M |
| FM Head Base | 12 | 1536 | 157M |
| FM Head Large | 24 | 2048 | 528M |
The following are the results from Table 7 of the original paper:

| | GenEval | GenAI-Bench | DPG-Bench |
|---|---|---|---|
| Baseline | 0.59 | 0.77 | 85.15 |
| w/ FM Head Small | 0.55 | 0.76 | 83.46 |
| w/ FM Head Base | 0.55 | 0.75 | 84.68 |
| w/ FM Head Large | 0.56 | 0.77 | 85.50 |
- Experimental Setup: The authors ablated the flow matching head by testing three different sizes: Small (40M parameters), Base (157M parameters), and Large (528M parameters), as detailed in Table 6. For each experiment, only the head was re-initialized and trained for 10K steps.
- Results (Table 7 & Figure 4): Despite significant variations in the FM head size, all three configurations yielded remarkably similar quantitative results across GenEval, GenAI-Bench, and DPG-Bench. Qualitatively, Figure 4 (images of animals, buildings, and dancers generated under the small, base, and large flow matching heads) shows that the outputs of the different head sizes are largely indistinguishable.
- Analysis: This finding suggests a surprising insensitivity to the flow matching head's size. The authors interpret it as strong evidence that the Transformer backbone is primarily responsible for the core generative modeling, learning the complex conditional distribution of each next token given the preceding multimodal context. The flow matching head acts more like a lightweight sampler (see the sketch after this list), translating the Transformer's high-level contextual predictions into continuous image tokens, much as a simple LM head converts Transformer outputs into discrete text tokens. The essential generative logic resides within the Transformer's autoregressive NTP process.
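To make this division of labor concrete, here is a minimal, self-contained sketch (not the released NextStep-1 implementation): a stand-in causal backbone produces a contextual hidden state for the next position, and a small flow-matching head integrates a learned velocity field from noise to sample the next continuous image token. The layer sizes, the GRU stand-in for the backbone, and the 20-step Euler sampler are illustrative assumptions.

```python
# Minimal sketch: AR backbone -> hidden state -> lightweight flow matching head.
import torch
import torch.nn as nn

TOKEN_DIM = 16      # assumed channel dim of a continuous image token
HIDDEN_DIM = 256    # assumed backbone hidden size (the real 14B model is far larger)

class FlowMatchingHead(nn.Module):
    """Tiny MLP velocity field v(x_t, t, h) conditioned on the backbone state h."""
    def __init__(self, token_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + hidden_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, token_dim),
        )

    def forward(self, x_t, t, h):
        return self.net(torch.cat([x_t, t, h], dim=-1))

    @torch.no_grad()
    def sample(self, h, steps: int = 20):
        """Euler integration from noise (t=0) toward a token sample (t=1)."""
        x = torch.randn(h.shape[0], TOKEN_DIM)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((h.shape[0], 1), i * dt)
            x = x + dt * self.forward(x, t, h)
        return x

# Stand-in for the autoregressive backbone: any causal model mapping the prefix
# to a hidden state for the next position plays this role in the sketch.
backbone = nn.GRU(input_size=TOKEN_DIM, hidden_size=HIDDEN_DIM, batch_first=True)
head = FlowMatchingHead(TOKEN_DIM, HIDDEN_DIM)

prefix = torch.randn(1, 8, TOKEN_DIM)        # 8 already-generated continuous tokens
_, h_last = backbone(prefix)                  # contextual state for the next position
next_token = head.sample(h_last.squeeze(0))   # head acts only as a lightweight sampler
print(next_token.shape)                       # torch.Size([1, 16])
```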
6.2.2. Tokenizer is the Key to Image Generation
Mitigating Instability under Strong Classifier-Free Guidance (CFG)
- Problem: VAE-based autoregressive models are prone to visual artifacts (e.g., gray patches) under strong classifier-free guidance (CFG) scales, even though CFG is used to enhance conditional fidelity.
- Root Cause Identified: Previous work hypothesized that 1D positional embeddings caused this instability. However, NextStep-1's analysis reveals the true cause to be the amplification of token-level distributional shifts under high CFG scales. In diffusion models, normalization of latent variables typically ensures consistent scaling of conditional and unconditional predictions. In token-level AR models, global normalization does not guarantee per-token statistical consistency, so small discrepancies between conditional and unconditional predictions are magnified by a large guidance scale, leading to drift in the statistics of generated tokens over the sequence.
- Empirical Demonstration (Figure 5): Figure 5 plots the evolution of the per-token mean and variance over sampling steps under different CFG settings. At a moderate CFG of 1.5, the per-token mean and variance remain stable; at a high CFG of 3.0, both statistics diverge significantly for later tokens, directly correlating with visual artifacts.
- Solution: The NextStep-1 tokenizer design incorporates channel-wise normalization (Equation 3 in the paper), which directly addresses this by enforcing per-token statistical stability. This prevents statistical drift and enables stable generation even with strong CFG (a small numerical illustration follows this list).
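The amplification argument can be illustrated numerically. The toy numpy sketch below is illustrative only: the `drift` constant and tensor shapes are assumptions, not measurements from the paper. It applies the standard CFG extrapolation and shows that a larger guidance scale inflates per-token variance whenever the conditional and unconditional predictions are slightly mismatched, while re-normalizing each token keeps its statistics fixed.

```python
# Toy demonstration of CFG-induced statistical drift and its normalization fix.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, channels = 1024, 16
drift = 0.05  # assumed small scale mismatch between cond. and uncond. predictions

def generate(guidance_scale: float, normalize: bool) -> np.ndarray:
    stats = []
    for _ in range(num_tokens):
        x_cond = rng.normal(0.0, 1.0 + drift, channels)
        x_uncond = rng.normal(0.0, 1.0, channels)
        # Classifier-free guidance: extrapolate away from the unconditional prediction.
        x = x_uncond + guidance_scale * (x_cond - x_uncond)
        if normalize:  # per-token (channel-wise) normalization, as in the tokenizer design
            x = (x - x.mean()) / (x.std() + 1e-6)
        stats.append((x.mean(), x.var()))
    return np.array(stats)

for w in (1.5, 3.0):
    raw = generate(w, normalize=False)
    norm = generate(w, normalize=True)
    print(f"CFG={w}: raw var ~ {raw[:, 1].mean():.2f}, normalized var ~ {norm[:, 1].mean():.2f}")
```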
A Regularized Latent Space is Critical for Generation
- Counter-intuitive Finding: The authors discovered an inverse correlation between generation loss (during tokenizer training) and final synthesis quality. Applying a higher noise intensity (the noise standard deviation in Equation 3) during tokenizer training increases the tokenizer's reconstruction loss but paradoxically improves the quality of images generated by the autoregressive model. The tokenizer used by NextStep-1 had the highest generation loss among those compared, yet it produced the highest-fidelity images; low-loss tokenizers, conversely, yielded noisy outputs from the AR model.
- Attribution: This phenomenon is attributed to noise regularization, which cultivates a well-conditioned latent space (a minimal sketch of this perturbation follows the figure discussion below). It enhances two key properties:
  - Robustness to Latent Perturbations: The tokenizer decoder becomes more robust to variations in the latent space (Figure 6).
  - More Dispersed Latent Distribution: The latent distribution becomes more uniform and closer to a standard normal distribution (Figure 7). This property has been found beneficial in prior work (Sun et al., 2024c; Yang et al., 2025; Yao et al., 2025).

(Figure 6: top, quantitative reconstruction metrics rFID, PSNR, and SSIM versus noise standard deviation; bottom, reconstruction examples at noise standard deviations of 0.2 and 0.5.)
Figure 6 shows that the reconstruction metrics (rFID, PSNR, SSIM) change only moderately as the noise standard deviation increases, and the reconstruction examples at standard deviations of 0.2 and 0.5 remain visually faithful, indicating that the noise-regularized decoder is robust to latent perturbations.
(Figures 7 and 8: per-dimension histograms, Dimension 0 through Dimension 15, comparing each empirical latent distribution, shown as blue bars, against a fitted normal distribution, shown as a red curve.)

Figures 7 and 8 (labeled in the original paper as the latent distribution of the Flux.1-dev VAE and of the NextStep-1 VAE without noise) visually compare latent distributions. The NextStep-1 VAE trained with noise regularization aligns best with the normal distribution, reflecting a more dispersed latent space than either the Flux.1-dev VAE or the NextStep-1 VAE without noise.
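As referenced above, here is a minimal sketch of the noise-regularization step, assuming the stochastic perturbation is applied to channel-normalized latents before they reach the decoder during tokenizer training. The normalization axis and function names are assumptions inferred from the description, not the released code.

```python
# Minimal sketch of channel-wise normalization + stochastic perturbation of VAE latents.
import torch

def regularize_latent(z: torch.Tensor, noise_std: float) -> torch.Tensor:
    """z: (batch, channels, height, width) latent from the VAE encoder."""
    mean = z.mean(dim=1, keepdim=True)           # per-position statistics over channels
    std = z.std(dim=1, keepdim=True) + 1e-6
    z_norm = (z - mean) / std                    # channel-wise normalization
    return z_norm + noise_std * torch.randn_like(z_norm)  # stochastic perturbation

z = torch.randn(2, 16, 32, 32) * 3.0 + 1.0       # toy un-normalized latents
z_train = regularize_latent(z, noise_std=0.5)    # the decoder is trained on this input
print(round(z_train.mean().item(), 3), round(z_train.std().item(), 3))
```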
Reconstruction Quality is the Upper Bound of Generation Quality
- Principle: The fidelity of an image tokenizer's reconstruction fundamentally limits the maximum achievable quality of the generated images, especially for fine details and textures.
- Validation: This principle is supported by recent studies (Dai et al., 2023; Esser et al., 2024; Labs, 2024) and has led to a trend in diffusion models of adopting VAEs with exceptional reconstruction performance.
- Gap Bridging (Table 8): Historically, VQ-based autoregressive models struggled to surpass this threshold. NextStep-1 successfully applies autoregressive modeling to a high-fidelity continuous VAE, bridging this gap.

The following are the results from Table 8 of the original paper:

| Tokenizer | Latent Shape | PSNR↑ | SSIM↑ |
|---|---|---|---|
| Discrete Tokenizer | | | |
| SBER-MoVQGAN (270M) (Zheng et al., 2022) | 32×32 | 27.04 | 0.74 |
| LlamaGen (Sun et al., 2024a) | 32×32 | 24.44 | 0.77 |
| VAR (Tian et al., 2024) | 680 | 22.12 | 0.62 |
| TiTok-S-128 (Yu et al., 2024b) | 128 | 17.52 | 0.44 |
| Selftok (Wang et al., 2025b) | 1024 | 26.30 | 0.81 |
| Continuous Tokenizer | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 32×32×4 | 25.18 | 0.73 |
| Stable Diffusion XL (Podell et al., 2024) | 32×32×4 | 26.22 | 0.77 |
| Stable Diffusion 3 Medium (Esser et al., 2024) | 32×32×16 | 30.00 | 0.88 |
| Flux.1-dev (Labs, 2024) | 32×32×16 | 31.64 | 0.91 |
| NextStep-1 | 32×32×16 | 30.60 | 0.89 |
Table 8 shows NextStep-1's tokenizer achieves PSNR of 30.60 and SSIM of 0.89, which are competitive with leading continuous tokenizers like Stable Diffusion 3 Medium (30.00 PSNR) and Flux.1-dev (31.64 PSNR), and significantly higher than most discrete tokenizers.
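For reference, PSNR (one of the two reconstruction metrics in Table 8) can be computed in a few lines of numpy. This is a generic helper, not code from the paper; it simply makes concrete what "reconstruction quality" means as the ceiling on generation quality.

```python
# Generic PSNR helper for comparing an original image against its reconstruction.
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256, 3))
noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255)  # mildly corrupted copy
print(f"PSNR: {psnr(img, noisy):.2f} dB")
```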
6.3. Training Recipe
6.3.1. Training Image Tokenizer
The image tokenizer is initialized from Flux.1-dev VAE (Labs, 2024) and fine-tuned on the image-text dataset (Section 3.2).

- Optimizer: AdamW (Loshchilov and Hutter, 2019).
- Training Steps: 50K steps.
- Batch Size: 512.
- Learning Rate: Constant, with a linear warm-up of 1,000 steps.

The following are the results from Table 1 of the original paper:

| | Stage1 (Pre-Training) | Stage2 (Pre-Training) | Annealing (Pre-Training) | SFT (Post-Training) | DPO (Post-Training) |
|---|---|---|---|---|---|
| Learning Rate (Min, Max) | 1 × 10⁻⁴ | 1 × 10⁻⁵ | (0, 1 × 10⁻⁵) | (0, 1 × 10⁻⁵) | 2 × 10⁻⁶ |
| LR Scheduler | Constant | Constant | Cosine | Cosine | Constant |
| Weight Decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Loss Weight (CE : MSE) | (0.01 : 1) | (0.01 : 1) | (0.01 : 1) | (0.01 : 1) | - |
| Training Steps | 200K | 100K | 20K | 10K | 300 |
| Warm-up Steps | 5K | 5K | 0 | 500 | 200 |
| Sequence Length per Rank | 16K | 16K | 16K | 8K | - |
| Image Area (Min, Max) | 256×256 | (256×256, 512×512) | (256×256, 512×512) | (256×256, 512×512) | (256×256, 512×512) |
| Image Tokens (Min, Max) | 256 | (256, 1024) | (256, 1024) | (256, 1024) | (256, 1024) |
| Training Tokens | 1.23T | 0.61T | 40B | 5B | - |
| Data Ratio: Text-only Corpus | 0.2 | 0.2 | 0.2 | 0 | - |
| Data Ratio: Image-Text Pair Data | 0.6 | 0.6 | 0.6 | 0.9 | - |
| Data Ratio: Image-to-Image Data | 0.0 | 0.0 | 0.1 | 0.1 | - |
| Data Ratio: Interleaved Data | 0.2 | 0.2 | 0.1 | 0 | - |
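For convenience, the schedule in Table 1 can be restated as a plain-Python configuration that could drive a staged training loop. The field names are my own; the values are copied from the table, and cells the table leaves blank are simply omitted.

```python
# Plain-Python restatement of Table 1 (a convenience sketch, not the paper's code).
TRAINING_STAGES = {
    "stage1":    {"lr": 1e-4, "scheduler": "constant", "steps": 200_000, "warmup": 5_000,
                  "tokens": "1.23T",
                  "data_ratio": {"text": 0.2, "image_text": 0.6, "image_to_image": 0.0, "interleaved": 0.2}},
    "stage2":    {"lr": 1e-5, "scheduler": "constant", "steps": 100_000, "warmup": 5_000,
                  "tokens": "0.61T",
                  "data_ratio": {"text": 0.2, "image_text": 0.6, "image_to_image": 0.0, "interleaved": 0.2}},
    "annealing": {"lr": (0.0, 1e-5), "scheduler": "cosine", "steps": 20_000, "warmup": 0,
                  "tokens": "40B",
                  "data_ratio": {"text": 0.2, "image_text": 0.6, "image_to_image": 0.1, "interleaved": 0.1}},
    "sft":       {"lr": (0.0, 1e-5), "scheduler": "cosine", "steps": 10_000, "warmup": 500,
                  "tokens": "5B",
                  "data_ratio": {"text": 0.0, "image_text": 0.9, "image_to_image": 0.1, "interleaved": 0.0}},
    "dpo":       {"lr": 2e-6, "scheduler": "constant", "steps": 300, "warmup": 200},
}

for name, cfg in TRAINING_STAGES.items():
    print(f"{name}: {cfg['steps']} steps, scheduler={cfg['scheduler']}")
```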
6.3.2. Pre-Training
Pre-training follows a three-stage curriculum. All model parameters (except the image tokenizer) are trained end-to-end.
- Optimizer (general): AdamW; beyond the settings in Table 1, the specific optimizer hyperparameters are not explicitly stated for pre-training.
- Loss Weight (CE : MSE): A consistent ratio of (0.01 : 1) across all stages, balancing the text (Cross-Entropy) and visual (Flow Matching) losses (see the sketch after this list).
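A minimal sketch of how such a weighted objective can be combined in code follows; the tensor shapes and helper names are illustrative assumptions, not the paper's implementation. Text positions contribute a cross-entropy term, image positions a flow-matching (MSE-style) regression term, and the two are merged into one next-token-prediction objective.

```python
# Combined next-token objective with (CE : MSE) = (0.01 : 1), as a sketch.
import torch
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, pred_velocity, target_velocity,
                  ce_weight: float = 0.01, mse_weight: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(text_logits, text_targets)      # discrete text tokens
    mse = F.mse_loss(pred_velocity, target_velocity)     # continuous image tokens (flow matching target)
    return ce_weight * ce + mse_weight * mse

loss = combined_loss(torch.randn(8, 32000), torch.randint(0, 32000, (8,)),
                     torch.randn(8, 16), torch.randn(8, 16))
print(loss.item())
```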
Stage1
- Purpose: Learn foundational understanding of image structure and composition.
- Image Resolution: Fixed at 256×256 (resized and randomly cropped) for computational efficiency.
- Data Mixture:
  - Text-only Corpus: 20%
  - Image-Text Pair Data: 60%
  - Interleaved Data: 20%
- Training Tokens: Approximately 1.23 trillion tokens.
- Training Steps: 200K steps.
- Learning Rate: Constant 1 × 10⁻⁴.
- Warm-up Steps: 5K steps.
Stage2
- Purpose: Train the model on higher resolutions and finer details.
- Image Resolution: Dynamic resolution strategy, targeting 256×256 and 512×512 base areas, utilizing different aspect-ratio buckets.
- Data Mixture: Same ratios as Stage1, but enriched with more text-rich and video-interleaved data:
  - Text-only Corpus: 20%
  - Image-Text Pair Data: 60%
  - Interleaved Data: 20%
- Training Tokens: Approximately 0.61 trillion tokens.
- Training Steps: 100K steps.
- Learning Rate: Constant 1 × 10⁻⁵.
- Warm-up Steps: 5K steps.
Annealing
- Purpose: Sharpen model capabilities on a highly curated dataset, enhancing overall image structure, composition, texture, and aesthetic appeal.
- Training Strategy: One epoch on a high-quality subset of 20M samples.
- Data Source: Selected from Section 3.2 (Image-Text Pair Data) by applying stricter filtering thresholds (aesthetic score, image clarity, semantic similarity, watermark).
- Data Mixture:
  - Text-only Corpus: 20%
  - Image-Text Pair Data: 60%
  - Image-to-Image Data: 10% (introduced here)
  - Interleaved Data: 10%
- Training Tokens: 40 billion tokens.
- Training Steps: 20K steps.
- Learning Rate: Cosine schedule with (min, max) = (0, 1 × 10⁻⁵).
- Warm-up Steps: 0.
6.3.3. Post-Training
Post-training aligns the model's output with human preferences and downstream tasks, via Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
Supervised Fine-Tuning (SFT)
- Purpose: Enhance instruction-following capabilities and align outputs with human preferences.
- SFT Dataset: Total of 5M samples, comprising three components:
- Human-selected image-text pairs: High semantic consistency and visual appeal, augmented by images from other generative models (distillation for complex/imaginative prompts).
- Chain-of-Thought (CoT) data: (Deng et al., 2025; Wei et al., 2022) to improve text-to-image generation by incorporating a language-based reasoning step.
- High-quality instruction-guided image-to-image data: From Section 3.3, to strengthen image editing capabilities.
- Data Ratios:
  - Image-Text Pair Data: 90%
  - Image-to-Image Data: 10%
- Training Tokens: 5 billion tokens.
- Training Steps: 10K steps.
- Learning Rate: Cosine schedule with (min, max) = (0, 1 × 10⁻⁵).
- Warm-up Steps: 500 steps.
Direct Preference Optimization (DPO)
- Purpose: Align the model with human preferences, inspired by Diffusion-DPO (Wallace et al., 2024).
- Preference Datasets: Constructed from approximately 20,000 diverse prompts (the pair-construction procedure is sketched in code after this list).
  - Standard DPO Dataset:
    - For each prompt, the SFT model generates 16 candidate images.
    - ImageReward (Xu et al., 2023) scores these images.
    - A preference pair is formed: the winning image is randomly sampled from the top 4 candidates, the losing image from the remaining 12.
  - Self-CoT DPO Dataset:
    - For each prompt, the model first generates a detailed textual Chain-of-Thought.
    - The CoT-enhanced prompt then follows the identical pipeline as above, forming a preference pair.
- Training Steps: 300 steps.
- Learning Rate: Constant 2 × 10⁻⁶.
- Warm-up Steps: 200 steps.
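As referenced above, here is a minimal sketch of the preference-pair construction. `generate_images` and `image_reward` are placeholder callables standing in for the SFT model and the ImageReward scorer; they are not APIs from the released code.

```python
# Sketch of building one DPO preference pair (winner from top-4, loser from the rest).
import random

def build_preference_pair(prompt, generate_images, image_reward, num_candidates=16):
    candidates = generate_images(prompt, n=num_candidates)        # 16 samples per prompt
    ranked = sorted(candidates, key=image_reward, reverse=True)   # score and rank with the reward model
    winner = random.choice(ranked[:4])                            # winning image from the top 4
    loser = random.choice(ranked[4:])                             # losing image from the remaining 12
    return prompt, winner, loser

# Toy stand-ins so the sketch runs end to end.
fake_generate = lambda prompt, n: [f"{prompt}-img{i}" for i in range(n)]
fake_reward = lambda img: hash(img) % 100
print(build_preference_pair("a red bicycle", fake_generate, fake_reward))
```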
6.4. Inference Latency Analysis
The following are the results from Table 9 of the original paper:

| Sequence Length | Last-token Latency: LLM Decoder (ms) | LM Head (ms) | FM Head (ms) | Accumulated Latency: Total (s) | Accumulated Latency: w/o FM Head (s) |
|---|---|---|---|---|---|
| 256 | 7.20 | 0.40 | 3.40 | 2.82 | 1.95 |
| 1024 | 7.23 | 0.40 | 3.40 | 11.31 | 7.83 |
| 4096 | 7.39 | 0.40 | 3.40 | 45.77 | 31.86 |
Table 9 presents an inference latency breakdown on an H100 GPU for a batch size of 1.
- Dominant Bottleneck: The LLM Decoder (the Causal Transformer backbone) is the dominant component of last-token latency (around 7.2-7.4 ms).
- Substantial Contribution: The multi-step sampling in the flow matching head also constitutes a substantial portion of the per-token generation cost (3.40 ms), while the LM Head is comparatively very fast (0.40 ms).
- Accumulated Latency: As sequence length increases (e.g., from 256 to 4096 tokens), the total accumulated latency scales linearly, reaching 45.77 seconds for a sequence of 4096 tokens, as the quick calculation after this list shows. Even without the FM Head's contribution, the LLM Decoder alone results in 31.86 seconds for 4096 tokens, highlighting the serial nature of autoregressive decoding as a bottleneck.
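The accumulated figures follow directly from the per-token numbers: accumulated latency is roughly sequence length times per-token latency. The back-of-the-envelope script below uses the 4096-token per-token latencies, so the shorter-sequence totals come out slightly high relative to Table 9.

```python
# Quick arithmetic check of Table 9: accumulated latency ~ seq_len * per-token latency.
per_token_ms = {"llm_decoder": 7.39, "lm_head": 0.40, "fm_head": 3.40}

for seq_len in (256, 1024, 4096):
    total_s = seq_len * sum(per_token_ms.values()) / 1000.0
    without_fm_s = seq_len * (per_token_ms["llm_decoder"] + per_token_ms["lm_head"]) / 1000.0
    print(f"{seq_len:>5} tokens: ~{total_s:.1f} s total, ~{without_fm_s:.1f} s without the FM head")
# 4096 tokens: ~45.8 s total and ~31.9 s without the FM head, matching the table.
```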
6.5. Failure Cases
The image is a diagram illustrating failure cases for high-dimensional continuous tokens. It showcases various instances of generated images with different styles and contents, highlighting the limitations of current image generation techniques in handling complex visual information.
Figure 8 illustrates some observed failure cases when transitioning to higher-dimensional latent spaces (e.g., 16 latent channels compared to 4). These artifacts include:
- Local noise or block-shaped artifacts: Appearing in later stages of generation, potentially indicating numerical instabilities.
- Global noise across the image: Could be a sign of under-convergence, suggesting that more training steps might mitigate the issue.
- Subtle grid-like artifacts: May reveal limitations of the 1D positional encoding in capturing complex 2D spatial relationships.
7. Conclusion & Reflections
7.1. Conclusion Summary
NextStep-1 successfully advances the autoregressive (AR) paradigm for text-to-image generation by effectively integrating continuous image tokens with a lightweight flow matching head. The 14B AR model, initialized from Qwen2.5, demonstrates state-of-the-art performance among AR models across diverse benchmarks for text-to-image generation (GenEval, GenAI-Bench, DPG-Bench, OneIG-Bench), achieving high-fidelity synthesis, strong compositional understanding, linguistic capabilities, and world knowledge. Furthermore, its fine-tuned version, NextStep-1-Edit, shows competitive performance in instruction-guided image editing. Key to its success is a robust image tokenizer design featuring channel-wise normalization and stochastic perturbation, which stabilizes continuous latent spaces and enables effective use of classifier-free guidance. The research also highlights that the Transformer backbone is the primary driver of generative modeling, with the flow matching head acting as a lightweight sampler. NextStep-1 bridges the performance gap between AR and diffusion models while maintaining the architectural flexibility and scalability of AR systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline future research directions:
- Artifacts:
  - Limitation: Despite tokenizer improvements, NextStep-1 can still exhibit generative artifacts (local noise, block artifacts, global noise, grid-like patterns) when scaling to higher-dimensional continuous latent spaces (e.g., 16 channels).
  - Future Work: Further investigation into the underlying causes (numerical instabilities, under-convergence, limitations of 1D positional encoding for 2D spatial relationships) is needed.
- Inference Latency of Sequential Decoding:
  - Limitation: The inherently sequential nature of autoregressive decoding leads to substantial inference latency, with the LLM Decoder and the multi-step Flow Matching Head being the dominant bottlenecks.
  - Future Work:
    - Flow Matching Head Acceleration: Reduce the parameter count, apply distillation for few-step generation (Meng et al., 2023), or use more advanced few-step samplers (Lu et al., 2022, 2025).
    - Autoregressive Backbone Acceleration: Adapt techniques from the LLM field, such as speculative decoding (Leviathan et al., 2023) or multi-token prediction (Gloeckle et al., 2024), to image token generation.
- Challenges in High-Resolution Training:
  - Limitation: Scaling to high-resolution image generation is challenging compared to diffusion models. AR models require significantly more training steps due to sequential generation, and timestep shift techniques used in diffusion models are difficult to adapt to the Flow Matching Head's role as a lightweight sampler.
  - Future Work: Designing high-resolution generation strategies specifically for patch-wise autoregressive models.
- Challenges in Supervised Fine-Tuning (SFT):
  - Limitation: SFT exhibits unstable dynamics, requiring datasets at the million-sample scale for substantial improvement. Smaller datasets lead to either marginal gains or abrupt overfitting, making it difficult to find intermediate checkpoints that balance alignment with general generative capability.
  - Future Work: Developing more robust SFT strategies for autoregressive multimodal models that can effectively leverage smaller, high-quality datasets without sacrificing generalization.
7.3. Personal Insights & Critique
This paper presents a compelling step forward for autoregressive models in multimodal generation, particularly by embracing continuous tokens and a flow matching head. The core innovation lies in demonstrating that a Transformer can effectively manage continuous image generation without resorting to computationally heavier diffusion models or losing fidelity with vector quantization.

- Inspirations and Transferability:
  - Unified Multimodal Architecture: The idea of a single causal transformer processing both text and continuous image tokens within a unified sequence is elegant and scalable. This approach could inspire future generalist AI models capable of handling diverse modalities seamlessly, beyond just text and images, potentially incorporating audio or 3D data as continuous token streams.
  - Robust Tokenizer Design: The findings on channel-wise normalization and stochastic perturbation in the image tokenizer are highly impactful. The counter-intuitive discovery that higher noise intensity during tokenizer training (leading to higher reconstruction loss) can improve overall generation quality due to a better-conditioned latent space is a crucial insight. This principle of "regularizing the latent space for downstream generation" could be broadly applicable to other generative tasks involving latent representations, not just AR models.
  - Lightweight Generative Head: The insensitivity to the flow matching head's size is a powerful result, suggesting that the primary intelligence resides in the Transformer itself. This could simplify future model designs, allowing researchers to focus on scaling the main Transformer backbone rather than optimizing complex modality-specific heads, potentially reducing model complexity and training costs across modalities.
- Potential Issues, Unverified Assumptions, or Areas for Improvement:
  - Inference Latency: While acknowledged as a limitation, the linear scaling of latency with sequence length is a fundamental challenge for AR models, especially for very high-resolution images. The proposed solutions (speculative decoding, multi-token prediction) are promising but require significant adaptation for continuous image tokens. The current latency might make real-time high-resolution image generation impractical.
  - Artifacts in High-Dimensional Latents: The presence of artifacts (local noise, grid-like patterns) when using 16-channel latents indicates that handling high-dimensional continuous latent spaces within an AR framework is still an unsolved problem. The suggestion that 1D positional encoding might struggle with 2D spatial relationships for image patches is plausible; exploring more sophisticated 2D or relative positional encodings might be necessary.
  - Training Data Dependence: The model's reliance on million-scale datasets for stable SFT is a practical hurdle, since smaller, high-quality datasets are often more accessible for specific tasks or styles. Developing SFT techniques that are more data-efficient for AR multimodal models would be highly beneficial.
  - Comparison with Proprietary Models: While NextStep-1 achieves state-of-the-art results among autoregressive models and is competitive with many open-source diffusion models, it still trails some top-tier proprietary models like GPT-4o and Seedream 3.0 on certain benchmarks (e.g., GenEval, WISE). Further scaling or architectural refinements might be needed to fully close this gap.
  - Generalizability of Flow Matching: While flow matching is efficient, its long-term generalizability and robustness across all possible continuous data distributions, compared to the more established diffusion process, still warrant extensive research. The "black box" nature of why noise regularization helps generation quality (robustness vs. dispersion) also points to areas for deeper theoretical understanding.

Overall, NextStep-1 makes a significant contribution by pushing the boundaries of autoregressive image generation. It demonstrates that the AR paradigm can achieve high-quality continuous image synthesis, offering a promising alternative to diffusion-based methods, especially if the current limitations can be addressed.