Paper status: completed

Mean Flows for One-step Generative Modeling

Published: 05/20/2025

TL;DR Summary

MeanFlow introduces average velocity and an identity that guides training, enabling one-step generative modeling without pre-training. It achieves 3.43 FID on ImageNet 256×256 with a single function evaluation, outperforming prior one-step models and narrowing the gap to multi-step models.

Abstract

We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.


In-depth Reading


Bibliographic Information

  • Title: Mean Flows for One-step Generative Modeling
  • Authors: Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He
  • Affiliations: Carnegie Mellon University (CMU), Massachusetts Institute of Technology (MIT)
  • Journal/Conference: Not explicitly stated; the paper was submitted to arXiv, a preprint server widely used for rapid dissemination of research. Given the authors and the quality of the work, it is likely intended for a top-tier machine learning conference (e.g., ICLR, NeurIPS, ICML) or journal.
  • Publication Year: 2025
  • Abstract: This paper introduces a novel framework, MeanFlow, for one-step generative modeling. Unlike traditional Flow Matching methods that model instantaneous velocity, MeanFlow proposes to characterize flow fields using average velocity. The core innovation is a derived identity between average and instantaneous velocities, which directly guides the training of a neural network. The MeanFlow model is self-contained, meaning it does not require pre-training, distillation, or curriculum learning. Empirically, MeanFlow achieves an impressive Fréchet Inception Distance (FID) of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256, significantly surpassing previous state-of-the-art (SOTA) one-step diffusion/flow models. The authors conclude that this work substantially reduces the performance gap between one-step and multi-step generative models, encouraging further foundational research in this area.
  • Original Source Link: https://arxiv.org/abs/2505.13447
  • PDF Link: https://arxiv.org/pdf/2505.13447v1.pdf

Executive Summary

Background & Motivation (Why)

The fundamental goal of generative modeling is to learn how to transform a simple, known probability distribution (like a Gaussian noise distribution) into a complex, target data distribution (like images of cats). Diffusion models and Flow Matching have emerged as highly successful frameworks for this task. However, these models traditionally require many sampling steps (or function evaluations, NFE) during the inference (generation) phase to produce high-quality samples, making them computationally expensive and slow.

The core problem this paper addresses is the challenge of one-step generative modeling – generating high-quality samples in just a single function evaluation (1-NFE). Achieving one-step generation is crucial for practical applications where speed is paramount, such as real-time content creation or interactive systems. Prior one-step or few-step methods often rely on complex techniques like distillation from multi-step models, pre-training, or carefully designed curriculum learning strategies to stabilize training and improve performance. These approaches can be intricate to implement and may introduce dependencies on other models or training procedures. The specific gap identified is the lack of a principled, self-contained framework for one-step generation that can achieve SOTA quality without these additional complexities.

The paper's novel approach, MeanFlow, introduces the concept of average velocity to characterize flow fields. Instead of modeling the instantaneous velocity at each moment in time (as Flow Matching typically does), MeanFlow directly learns the average velocity over a time interval. This conceptual shift allows for direct one-step generation by effectively "skipping" the intermediate steps.

Main Contributions / Findings (What)

  1. Introduction of Average Velocity and MeanFlow Identity: The paper formally defines average velocity and, through a rigorous derivation, establishes a MeanFlow Identity. This identity describes a fundamental relationship between the average velocity and instantaneous velocity, providing a principled, self-contained mathematical basis for training a neural network for one-step generation.
  2. Principled One-Step Generative Framework: MeanFlow offers a self-contained framework that trains from scratch without needing pre-training, distillation, or curriculum learning. This simplifies the training process and makes the model more robust.
  3. Integration of Classifier-Free Guidance (CFG) for 1-NFE: The method naturally incorporates Classifier-Free Guidance (CFG) directly into its underlying ground-truth fields, enabling the benefits of CFG (improved sample quality) without incurring additional NFE during one-step sampling.
  4. State-of-the-Art Empirical Performance: MeanFlow achieves an FID of 3.43 with 1-NFE on ImageNet 256x256, significantly outperforming previous SOTA one-step diffusion/flow models by a relative margin of 50-70%. This performance substantially narrows the gap between one-step and multi-step generative models. It also achieves an FID of 2.20 with 2-NFE, matching or exceeding the performance of leading multi-step models (e.g., DiT, SiT) that use hundreds of NFEs.
  5. Scalability: The model demonstrates promising scalability with increasing model size, consistent with Transformer-based diffusion/flow models.

Prerequisite Knowledge & Related Work

To understand MeanFlow, it's essential to first grasp the foundational concepts of generative modeling, diffusion models, and Flow Matching.

Foundational Concepts

Generative Modeling

Generative modeling is a branch of machine learning focused on learning the underlying probability distribution of a dataset. Once learned, a generative model can produce new samples that are similar to the training data. For example, given a dataset of images of faces, a generative model could generate new, realistic faces that were not in the original dataset. The goal is to transform a simple, known prior distribution (e.g., standard Gaussian noise) into the complex data distribution.

Diffusion Models

Diffusion models (e.g., DDPM [19], Score-based models [44]) are a class of generative models that learn to reverse a gradual diffusion process. This process involves progressively adding noise to data until it becomes pure noise. The model then learns to denoise the data, effectively reversing the diffusion process to transform noise back into coherent data. This reversal is typically achieved by predicting the noise added at each step or the score function (gradient of the log-probability density). During sampling (generation), these models iteratively apply a denoising function over many time steps, starting from random noise.

Flow Matching

Flow Matching [28, 2, 30] is a framework closely related to diffusion models that focuses on learning velocity fields. Instead of directly predicting noise or score functions, Flow Matching learns a velocity field that defines a continuous flow path transforming a simple prior distribution (like Gaussian noise) into the target data distribution.

  • Flow Path: A flow path is a continuous trajectory $z_t$ that connects a sample $x$ from the data distribution to a sample $\epsilon$ from the prior distribution. It is often defined as a linear interpolation: $z_t = a_t x + b_t \epsilon$, where $t$ is time (typically from 0 to 1) and $a_t$, $b_t$ are predefined schedules (e.g., $a_t = 1-t$, $b_t = t$).
  • Instantaneous Velocity ($v_t$): The velocity at any point $z_t$ along this path is the time derivative of $z_t$. If $z_t = (1-t)x + t\epsilon$, the instantaneous velocity is $v_t = \frac{dz_t}{dt} = \epsilon - x$. This is often called the conditional velocity because it depends on the specific pair $(x, \epsilon)$ that generated $z_t$.
  • Conditional vs. Marginal Velocity:
    • Conditional Velocity: Given a specific data point $x$ and a noise vector $\epsilon$, the conditional velocity $v_t(z_t \mid x, \epsilon)$ describes the direction and speed of the flow path for that specific pair. As shown in Fig. 2 (left), a given point $z_t$ in the latent space can arise from different $(x, \epsilon)$ pairs, leading to different conditional velocities.
    • Marginal Velocity: Because the neural network needs to learn a consistent velocity field for any $z_t$ it encounters, Flow Matching aims to model the marginal velocity $v(z_t, t)$. This is the expected velocity at $z_t$ and time $t$, averaged over all $(x, \epsilon)$ pairs that could lead to $z_t$ (Fig. 2, right): $v(z_t, t) \triangleq \mathbb{E}_{p_t(v_t \mid z_t)}[v_t]$. Training directly with this expectation is often intractable; however, minimizing a conditional Flow Matching (CFM) loss, which matches the predicted velocity to the conditional velocity $v_t(z_t \mid x, \epsilon)$, is equivalent to minimizing the marginal Flow Matching loss.
  • Sampling: Once the neural network $v_\theta$ is trained to predict the velocity field, sampling new data points involves solving an ordinary differential equation (ODE), $\frac{d}{dt} z_t = v(z_t, t)$. This ODE is typically solved numerically, e.g., with Euler's method over many discrete time steps, starting from noise ($z_1 = \epsilon$) and integrating backwards to $z_0$. Each step in this numerical integration constitutes one function evaluation (NFE). A minimal sketch follows this list.
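To make the Flow Matching baseline concrete, here is a minimal JAX sketch of the CFM training loss and the Euler sampling loop described above. The `velocity_net` callable and its signature are illustrative assumptions, not the paper's implementation.

```python
import jax.numpy as jnp

def cfm_loss(params, velocity_net, x, eps, t):
    """Conditional Flow Matching loss on the linear path z_t = (1-t) x + t eps.
    t: scalar or array broadcastable against x."""
    z_t = (1.0 - t) * x + t * eps           # point on the flow path
    v_cond = eps - x                         # conditional velocity target
    v_pred = velocity_net(params, z_t, t)    # v_theta(z_t, t)
    return jnp.mean((v_pred - v_cond) ** 2)

def euler_sample(params, velocity_net, eps, num_steps=50):
    """Integrate dz/dt = v(z, t) from t=1 (noise) to t=0 (data); each step is 1 NFE."""
    z, dt = eps, 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_net(params, z, t)
    return z
```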

Number of Function Evaluations (NFE)

NFE refers to how many times the neural network (which predicts velocity or noise) must be called during the sampling process to generate a single output sample.

  • Multi-step Generation: Models requiring many NFE (e.g., 250-1000) are multi-step models. They are generally slower but often produce higher quality samples.
  • Few-step Generation: Models aiming to reduce NFE to a small number (e.g., 2-50).
  • One-step Generation (1-NFE): The ultimate goal for fast inference, where a single forward pass of the neural network generates a complete sample. This is the focus of MeanFlow.

Latent Space Models

Many generative models operate not directly on high-resolution pixels but on a compressed latent space representation. A Variational Autoencoder (VAE) [37] is commonly used as a tokenizer to convert images into a lower-dimensional latent representation and vice versa. The generative model then learns to transform noise in this latent space, and a decoder converts the generated latent representation back to pixel space. This reduces computational cost and allows Transformer-based architectures to operate more efficiently.

Positional Embedding

In Transformer architectures, positional embeddings [48] are used to inject information about the relative or absolute position of elements in a sequence (or, in this case, time variables) into the model. Since Transformers are inherently permutation-invariant, positional embeddings are crucial for the model to understand the temporal context.

Previous Works & Technological Evolution

The field has evolved towards faster generation, primarily through distillation or consistency-based methods.

  • Distillation-based Methods: These approaches first train a multi-step diffusion model and then distill its knowledge into a smaller, faster model that requires fewer sampling steps (e.g., [39, 14, 41, 32]). This often means the few-step model is dependent on the performance and training of the multi-step teacher model.
  • Consistency Models (CMs) [46, 43, 15, 31, 49]: These are standalone generative models designed for one-step generation without distillation. They introduce a consistency constraint during training: the network should produce consistent outputs (i.e., map to the same original data point) when evaluated at different time steps along the same flow path. CMs often use a discretization curriculum [46] where the time domain is progressively constrained, making training sensitive and requiring careful scheduling. In the notation of MeanFlow, CMs focus on paths anchored at the data side ($r \equiv 0$).
    • Example: iCT [43] and ECT [15] are variants of Consistency Models.
  • Shortcut Models [13]: These models introduce a self-consistency loss function in addition to Flow Matching. They focus on relationships between flows at different discrete time intervals, effectively trying to learn a shortcut from noise to data. They are conditioned on two time variables, similar to MeanFlow's use of $r$ and $t$.
  • Inductive Moment Matching (IMM) [52]: This method models the self-consistency of stochastic interpolants (similar to flow paths) at different time steps. It also uses two time variables.

Differentiation

MeanFlow differentiates itself from prior work in several key ways:

  • Principled Foundation: Unlike Consistency Models, which impose consistency constraints as properties of the neural network's behavior, MeanFlow derives its core MeanFlow Identity (Eq. 6) directly from the definition of average velocity (Eq. 3). This identity exists independently of any neural network and provides a ground-truth target field for learning, leading to more stable and robust training.
  • Self-Contained and From Scratch: MeanFlow does not rely on pre-training, distillation from multi-step models, or complex curriculum learning strategies often seen in Consistency Models. It is trained entirely from scratch, making it a simpler and more fundamental approach.
  • Average vs. Instantaneous Velocity: The central distinction is the target: MeanFlow models the average velocity $u(z_t, r, t)$ across an interval $[r, t]$, whereas traditional Flow Matching and its variants model the instantaneous velocity $v(z_t, t)$ at a single point in time. This conceptual shift directly facilitates one-step generation.
  • Native CFG Integration: MeanFlow integrates Classifier-Free Guidance (CFG) by defining guided ground-truth velocity fields ($v^{\mathrm{cfg}}$ and $u^{\mathrm{cfg}}$), allowing 1-NFE sampling with guidance, whereas other methods may require extra NFE for CFG or struggle to integrate it seamlessly into a one-step framework.

Methodology (Core Technology & Implementation Details)

The MeanFlow model introduces the concept of average velocity to achieve efficient one-step generative modeling. This section details the mathematical derivation and practical implementation.

4.1 Mean Flows

Core Idea: Average Velocity vs. Instantaneous Velocity

Traditional Flow Matching models learn the instantaneous velocity $v(z_t, t)$, which describes the direction and magnitude of movement at a specific point $z_t$ and time $t$. To generate a sample, this instantaneous velocity must be integrated over time, typically through numerical ODE solvers that require multiple function evaluations (NFE).

MeanFlow proposes to model the average velocity $u(z_t, r, t)$. This average velocity is defined over a time interval $[r, t]$ and represents the average rate of change of $z_t$ over that interval. The intuition is that if we can directly learn this average velocity, we can effectively "jump" from $z_t$ to $z_r$ in one step, without needing to compute all intermediate instantaneous velocities.

Definition of Average Velocity

The average velocity $u(z_t, r, t)$ is defined as the displacement between two time steps $t$ and $r$ divided by the time interval $(t-r)$. The displacement itself is the integral of the instantaneous velocity $v(z_\tau, \tau)$ over that interval:

$$u(z_t, r, t) \triangleq \frac{1}{t-r} \int_r^t v(z_\tau, \tau)\, d\tau$$

  • $u(z_t, r, t)$: The average velocity vector, a function of the current state $z_t$, the start time $r$ of the interval, and the end time $t$ of the interval.
  • $v(z_\tau, \tau)$: The instantaneous velocity vector at an intermediate time $\tau$.
  • $\int_r^t v(z_\tau, \tau)\, d\tau$: The total displacement of the trajectory $z_\tau$ from time $r$ to time $t$.
  • $(t-r)$: The duration of the time interval.

Key characteristics of average velocity uu:

  • It is a field that depends jointly on $(r, t)$.
  • As $r \to t$, the average velocity approaches the instantaneous velocity: $\lim_{r \to t} u = v$.
  • It satisfies a natural consistency property: taking one large step from $r$ to $t$ is consistent with taking two smaller consecutive steps from $r$ to $s$ and then from $s$ to $t$. This follows from the additivity of integrals: $(t-r)\,u(z_t, r, t) = \int_r^t v\, d\tau = \int_r^s v\, d\tau + \int_s^t v\, d\tau = (s-r)\,u(z_s, r, s) + (t-s)\,u(z_t, s, t)$. A neural network learning $u$ therefore inherits these consistency relations without explicit constraints.

Figure 3 visually contrasts the average velocity with the instantaneous velocity. The instantaneous velocity (tangent to the path) and average velocity (aligned with the displacement vector) are generally not aligned for curved paths.

The MeanFlow Identity

Directly using Eq. (3) for training is problematic because it requires evaluating an integral, which is computationally expensive during training. The core insight of MeanFlow is to derive a relationship between uu and vv that is amenable to training.

  1. Rewrite Eq. (3) as a displacement: $$(t-r)\, u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\, d\tau$$
  2. Differentiate both sides with respect to $t$, treating $r$ as independent of $t$:
    • Left-hand side (product rule): $$\frac{d}{dt}\big[(t-r)\, u(z_t, r, t)\big] = \frac{d}{dt}(t-r) \cdot u(z_t, r, t) + (t-r) \cdot \frac{d}{dt} u(z_t, r, t)$$ Since $\frac{d}{dt}(t-r) = 1$, the LHS becomes $u(z_t, r, t) + (t-r)\, \frac{d}{dt} u(z_t, r, t)$.
    • Right-hand side (fundamental theorem of calculus): $$\frac{d}{dt} \int_r^t v(z_\tau, \tau)\, d\tau = v(z_t, t)$$ (again assuming $r$ is constant with respect to $t$).
  3. Equate the two sides and rearrange to isolate $u(z_t, r, t)$: $$\boxed{\, u(z_t, r, t) = v(z_t, t) - (t-r)\, \frac{d}{dt} u(z_t, r, t) \,}$$ This is the MeanFlow Identity. It expresses the average velocity $u$ in terms of the instantaneous velocity $v$ and the total derivative of $u$ with respect to $t$. The identity is crucial because it provides a target for learning $u$ that involves only $v$ and derivatives of $u$, rather than an integral of $v$.

Decomposing the Total Derivative

The term $\frac{d}{dt} u(z_t, r, t)$ is a total derivative. Since $u$ is a function of $z_t$, $r$, and $t$, and $z_t$ itself is a function of $t$ (through the flow path), we apply the chain rule:

$$\frac{d}{dt} u(z_t, r, t) = \frac{\partial u}{\partial z} \frac{dz_t}{dt} + \frac{\partial u}{\partial r} \frac{dr}{dt} + \frac{\partial u}{\partial t} \frac{dt}{dt}$$

  • $\frac{dz_t}{dt} = v(z_t, t)$: the definition of instantaneous velocity from Eq. (2) in the paper's background.

  • $\frac{dr}{dt} = 0$: we treat $r$ as an independent time parameter of the interval $[r, t]$, so its value does not change as we differentiate with respect to $t$.

  • $\frac{dt}{dt} = 1$: the derivative of $t$ with respect to itself.

    Substituting these into the total derivative equation:

$$\boxed{\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u + \partial_t u \,}$$

  • $\partial_z u$: the Jacobian (matrix of partial derivatives) of $u$ with respect to $z_t$.
  • $v(z_t, t)\, \partial_z u$: a Jacobian-vector product (JVP), capturing how changes in $z_t$ (moving with velocity $v$) affect $u$; see the JVP sketch below.
  • $\partial_t u$: the partial derivative of $u$ with respect to $t$, holding $z_t$ and $r$ constant.
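The following toy example illustrates how `jax.jvp` computes exactly this total derivative: supplying the tangent $(v, 0, 1)$ yields $v\,\partial_z u + \partial_t u$ in a single call. The function `u` here is an arbitrary stand-in, not the paper's network.

```python
import jax
import jax.numpy as jnp

# Arbitrary differentiable stand-in for an average-velocity field u(z, r, t).
def u(z, r, t):
    return jnp.sin(t) * z + (t - r) * z ** 2

z = 0.5 * jnp.ones(4)
r, t = jnp.array(0.2), jnp.array(0.8)
v = jnp.ones(4)  # stand-in for the instantaneous velocity v(z_t, t)

# Tangent (v, 0, 1) encodes dz/dt = v, dr/dt = 0, dt/dt = 1, so the JVP output
# is the total derivative d/dt u = v * (du/dz) + du/dt.
u_val, du_dt = jax.jvp(u, (z, r, t), (v, jnp.zeros_like(r), jnp.ones_like(t)))
print(u_val, du_dt)
```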

Training with Average Velocity

The goal is to train a neural network $u_\theta$, parameterized by $\theta$, to approximate the average velocity $u$. The MeanFlow Identity (Eq. 6), combined with the total derivative decomposition (Eq. 8), provides the target for this training.

The loss function minimizes the squared difference between the network's prediction $u_\theta(z_t, r, t)$ and an effective regression target $u_{\mathrm{tgt}}$:

$$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u_\theta(z_t, r, t) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|_2^2$$ where the target $u_{\mathrm{tgt}}$ is derived from the MeanFlow Identity: $$u_{\mathrm{tgt}} = v(z_t, t) - (t-r)\big( v(z_t, t)\, \partial_z u_\theta + \partial_t u_\theta \big)$$

  • $u_\theta(z_t, r, t)$: The output of the neural network, predicting the average velocity.

  • $\mathrm{sg}(\cdot)$: Stop-gradient operation. During backpropagation, gradients are not passed through $u_{\mathrm{tgt}}$: the target is treated as a fixed value, preventing double backpropagation through the derivatives of $u_\theta$ inside $u_{\mathrm{tgt}}$, which would introduce higher-order derivatives and make training far more complex and computationally expensive.

  • $v(z_t, t)$: The instantaneous velocity. Following Flow Matching practice, this is replaced by the conditional velocity during training; for rectified flows (a common choice of $a_t, b_t$), this is $v_t = \epsilon - x$.

  • $\partial_z u_\theta$ and $\partial_t u_\theta$: The Jacobian and partial derivative of the network output $u_\theta$ with respect to $z_t$ and $t$, computed via automatic differentiation, specifically Jacobian-vector products (JVP).

    The MeanFlow Identity is a necessary and sufficient condition for Eq. (3) provided the boundary condition $S|_{t=r} = 0$ is met (where $S = (t-r)\,u$). The MeanFlow formulation inherently satisfies this.

Algorithm 1: Training MeanFlow

The pseudocode illustrates the training process:

Algorithm 1 MeanFlow: Training.
Input: a batch of data x, prior samples e, model u_theta
Output: loss value

1.  Sample t and r with r <= t (e.g., from a logit-normal sampler; for a fixed
    fraction of the batch, set r = t, which recovers the Flow Matching target)
2.  z_t = (1 - t) * x + t * e                 # noisy sample at time t
3.  v_t = e - x                               # conditional instantaneous velocity
4.  Define f(z, r_val, t_val) = u_theta(z, r_val, t_val)
5.  Compute the total derivative with a Jacobian-vector product:
    # d/dt u(z_t, r, t) = v_t (partial_z u_theta) + partial_t u_theta
    # The tangent (v_t, 0, 1) encodes dz_t/dt = v_t, dr/dt = 0, dt/dt = 1.
    u_val, du_dt = jvp(f, (z_t, r, t), (v_t, 0, 1))
    # u_val is u_theta(z_t, r, t); du_dt is the total derivative d/dt u_theta
6.  u_tgt = v_t - (t - r) * du_dt             # regression target from the identity
7.  loss = mean(square(u_val - stop_gradient(u_tgt)))
8.  Backpropagate and update theta
  • jvp (Jacobian-Vector Product): This operation efficiently computes the total derivative. It takes the function $f$ (our $u_\theta$), the arguments $(z_t, r, t)$, and a tangent vector $(v_t, 0, 1)$. The tangent vector specifies how each argument changes with respect to the differentiation variable (here, $t$). The jvp returns both the function's output u_val ($u_\theta(z_t, r, t)$) and the total derivative du_dt ($\frac{d}{dt} u_\theta(z_t, r, t)$). The jvp is a forward-mode differentiation pass, roughly as costly as one additional function evaluation.
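Putting the pieces together, here is a minimal runnable JAX sketch of the MeanFlow loss. The `u_net(params, z, r, t)` signature and the batch layout are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def meanflow_loss(params, u_net, x, eps, r, t):
    """MeanFlow training loss (sketch). x, eps: (B, D); r, t: (B,) with r <= t."""
    t_b, r_b = t[:, None], r[:, None]        # broadcast over the feature dim
    z_t = (1.0 - t_b) * x + t_b * eps        # noisy sample on the flow path
    v_t = eps - x                            # conditional instantaneous velocity

    u_fn = lambda z, r_, t_: u_net(params, z, r_, t_)
    # Tangent (v_t, 0, 1): dz_t/dt = v_t, dr/dt = 0, dt/dt = 1.
    u_val, du_dt = jax.jvp(u_fn, (z_t, r, t),
                           (v_t, jnp.zeros_like(r), jnp.ones_like(t)))

    u_tgt = v_t - (t_b - r_b) * du_dt        # MeanFlow Identity target
    u_tgt = jax.lax.stop_gradient(u_tgt)     # sg(.): no gradients through target
    return jnp.mean((u_val - u_tgt) ** 2)
```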

Sampling

The main advantage of MeanFlow is its one-step sampling capability. Once $u_\theta$ is trained, a sample $z_0$ is generated from noise $z_1 = \epsilon$ via:

$$z_r = z_t - (t-r)\, u(z_t, r, t)$$

For one-step sampling from prior noise to data, set $t=1$ (noise state) and $r=0$ (data state): $$z_0 = z_1 - (1-0)\, u(z_1, 0, 1) = \epsilon - u(\epsilon, 0, 1)$$

Algorithm 2: 1-step Sampling

Algorithm 2 MeanFlow: 1-step Sampling
Input: model u_theta
Output: generated data x_0

1.  Sample epsilon ~ p_prior
2.  x_0 = epsilon - u_theta(epsilon, r=0, t=1)   # single function evaluation

This single evaluation of $u_\theta$ generates the final sample, making inference highly efficient. Few-step sampling is also straightforward by applying Eq. (12) iteratively, as in the sketch below.
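A minimal sketch of this n-step sampler, under the same illustrative `u_net` assumption as above; `steps=1` reduces exactly to Algorithm 2.

```python
import jax.numpy as jnp

def meanflow_sample(params, u_net, eps, steps=1):
    """Iterate z_r = z_t - (t - r) u(z_t, r, t) from t=1 down to t=0 (sketch)."""
    times = jnp.linspace(1.0, 0.0, steps + 1)
    z = eps
    for i in range(steps):
        t, r = times[i], times[i + 1]
        z = z - (t - r) * u_net(params, z, r, t)   # one NFE per step
    return z
```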

4.2 Mean Flows with Guidance

Classifier-Free Guidance (CFG) [18] is a technique commonly used in diffusion models to improve sample quality by amplifying the effect of a given class condition (e.g., "generate a cat"). It typically involves running the model twice: once with the desired condition and once unconditionally, then linearly combining their outputs. This often doubles the NFE.

MeanFlow integrates CFG by constructing a guided ground-truth velocity field that the network learns directly, thus preserving 1-NFE sampling.

Ground-truth Fields with CFG

The paper defines a new ground-truth instantaneous velocity field $v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c})$ that incorporates guidance: $$v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c}) \triangleq \omega\, v(z_t, t \mid \mathbf{c}) + (1-\omega)\, v(z_t, t)$$

  • $v(z_t, t \mid \mathbf{c})$: The instantaneous velocity conditioned on a class condition $\mathbf{c}$.

  • $v(z_t, t)$: The class-unconditional instantaneous velocity, obtained by averaging over all possible conditions: $v(z_t, t) \triangleq \mathbb{E}_{\mathbf{c}}[v(z_t, t \mid \mathbf{c})]$.

  • $\omega$: The guidance scale, a hyperparameter controlling the strength of the guidance; higher $\omega$ means stronger guidance.

    Following the MeanFlow Identity, the corresponding average velocity field $u^{\mathrm{cfg}}$ must satisfy: $$u^{\mathrm{cfg}}(z_t, r, t \mid \mathbf{c}) = v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c}) - (t-r)\, \frac{d}{dt} u^{\mathrm{cfg}}(z_t, r, t \mid \mathbf{c})$$

Training with Guidance

The neural network $u_\theta^{\mathrm{cfg}}$ is parameterized to directly model $u^{\mathrm{cfg}}$. The loss function is similar to the unguided case: $$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u_\theta^{\mathrm{cfg}}(z_t, r, t \mid \mathbf{c}) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|_2^2$$ where $u_{\mathrm{tgt}}$ is now defined using a modified instantaneous velocity $\tilde{v}_t$: $$u_{\mathrm{tgt}} = \tilde{v}_t - (t-r)\big( \tilde{v}_t\, \partial_z u_\theta^{\mathrm{cfg}} + \partial_t u_\theta^{\mathrm{cfg}} \big)$$ The modified instantaneous velocity $\tilde{v}_t$ is the crucial part for CFG integration: $$\tilde{v}_t \triangleq \omega\, v_t + (1-\omega)\, u_\theta^{\mathrm{cfg}}(z_t, t, t)$$

  • $v_t$: The sample-conditional instantaneous velocity (e.g., $\epsilon - x$).

  • $u_\theta^{\mathrm{cfg}}(z_t, t, t)$: The network's prediction of average velocity over an infinitesimally small interval ($r = t$). In this limit, average velocity equals instantaneous velocity, so $u_\theta^{\mathrm{cfg}}(z_t, t, t)$ serves as the network's prediction of the unconditional instantaneous velocity $v(z_t, t)$. This is where the unconditional part of CFG comes from.

    To expose the network to class-unconditional information, a dropout mechanism is applied to the class condition $\mathbf{c}$ during training (e.g., setting $\mathbf{c}$ to a null token for a fraction of samples), as in standard CFG [18]. When $\mathbf{c}$ is dropped, the model learns to predict the unconditional velocity. A sketch of the resulting training target follows.
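A minimal sketch of the guided training target, extending the earlier `meanflow_loss`; the extra class argument, the null-token convention, and all names are illustrative assumptions, not the paper's code.

```python
import jax
import jax.numpy as jnp

def meanflow_cfg_loss(params, u_net, x, eps, r, t, c, null_c, omega):
    """MeanFlow loss with CFG baked into the target (sketch).
    u_net(params, z, r, t, c); null_c marks the dropped (unconditional) class."""
    t_b, r_b = t[:, None], r[:, None]
    z_t = (1.0 - t_b) * x + t_b * eps
    v_t = eps - x

    # Unconditional instantaneous velocity via u_theta(z_t, t, t) with null class.
    v_uncond = u_net(params, z_t, t, t, null_c)
    v_tilde = omega * v_t + (1.0 - omega) * v_uncond   # modified velocity tilde v_t

    u_fn = lambda z, r_, t_: u_net(params, z, r_, t_, c)
    u_val, du_dt = jax.jvp(u_fn, (z_t, r, t),
                           (v_tilde, jnp.zeros_like(r), jnp.ones_like(t)))
    u_tgt = jax.lax.stop_gradient(v_tilde - (t_b - r_b) * du_dt)
    return jnp.mean((u_val - u_tgt) ** 2)
```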

Single-NFE Sampling with CFG

Because $u_\theta^{\mathrm{cfg}}$ is trained to directly model the guided average velocity $u^{\mathrm{cfg}}$, sampling remains 1-NFE. There is no need for a separate unconditional run and subsequent linear combination at inference time; the network outputs the guided average velocity in a single forward pass: $$x_0 = \epsilon - u_\theta^{\mathrm{cfg}}(\epsilon, 0, 1 \mid \mathbf{c})$$

Improved CFG for MeanFlow (Appendix B.1)

The paper introduces an improvement to the CFG formulation by adding a mixing scale $\kappa$, allowing a more explicit mix of class-conditional and class-unconditional predictions within the target $\tilde{v}_t$. The ground-truth instantaneous velocity is redefined as: $$v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c}) = \omega\, v(z_t, t \mid \mathbf{c}) + \kappa\, u^{\mathrm{cfg}}(z_t, t, t \mid \mathbf{c}) + (1-\omega-\kappa)\, u^{\mathrm{cfg}}(z_t, t, t)$$ And the modified instantaneous velocity for the training target becomes: $$\tilde{v}_t \triangleq \omega \underbrace{(\epsilon - x)}_{\text{sample } v_t} + \kappa\, \underbrace{u_\theta^{\mathrm{cfg}}(z_t, t, t \mid \mathbf{c})}_{\text{cls-cond output}} + (1-\omega-\kappa)\, \underbrace{u_\theta^{\mathrm{cfg}}(z_t, t, t)}_{\text{cls-uncond output}}$$

  • $\kappa$: A mixing scale.

  • $u_\theta^{\mathrm{cfg}}(z_t, t, t \mid \mathbf{c})$: The network's prediction of the conditional instantaneous velocity.

  • $u_\theta^{\mathrm{cfg}}(z_t, t, t)$: The network's prediction of the unconditional instantaneous velocity.

    This formulation further improves generation quality by explicitly training the network to produce both conditional and unconditional outputs, mirroring the random dropout strategy of standard CFG.

4.3 Design Decisions

Loss Metrics

The paper investigates different forms of loss functions. The base loss is the squared L2 loss, $\mathcal{L} = \|\Delta\|_2^2$, where $\Delta = u_\theta - u_{\mathrm{tgt}}$. They use adaptive loss weighting [15], which effectively transforms the squared L2 loss into a powered L2 loss $\mathcal{L}_\gamma = \|\Delta\|_2^{2\gamma}$. This is achieved by weighting the squared L2 loss with $w = 1/(\|\Delta\|_2^2 + c)^p$, where $p = 1-\gamma$ and $c$ is a small constant.

  • If $p=0$ ($\gamma=1$), this is the standard squared L2 loss.
  • If $p=0.5$ ($\gamma=0.5$), it resembles the Pseudo-Huber loss [43], which is less sensitive to large errors. A sketch of the weighting follows.
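A minimal sketch of this adaptive weighting; applying a stop-gradient to the weight, as below, is an implementation assumption consistent with treating $w$ as a fixed per-sample coefficient.

```python
import jax
import jax.numpy as jnp

def adaptive_weighted_loss(u_pred, u_tgt, p=1.0, c=1e-3):
    """w = 1 / (||delta||^2 + c)^p applied to the squared L2 loss (sketch);
    p = 0 recovers plain squared L2, p = 0.5 approximates Pseudo-Huber."""
    delta_sq = jnp.sum((u_pred - u_tgt) ** 2, axis=-1)   # per-sample ||delta||^2
    w = 1.0 / (delta_sq + c) ** p
    return jnp.mean(jax.lax.stop_gradient(w) * delta_sq)
```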

Sampling Time Steps (r, t)

The choice of how $r$ and $t$ are sampled from the time domain (usually $[0, 1]$) affects training.

  • Uniform distribution $\mathcal{U}(0, 1)$: $r$ and $t$ are chosen uniformly.
  • Logit-normal (lognorm) distribution [11]: Samples are drawn from a normal distribution $\mathcal{N}(\mu, \sigma)$ and mapped to $(0, 1)$ via the logistic function. This concentrates samples in specific regions of the time domain, which can be beneficial. After sampling, $t$ is assigned the larger value and $r$ the smaller, ensuring $r < t$. For a certain fraction of samples, $r = t$ is used instead, which reduces the target to Flow Matching. A sampler sketch follows.
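A minimal JAX sketch of this $(r, t)$ sampler; the argument names and defaults (e.g., `mu=0.4, sigma=1.0`) mirror the lognorm(0.4, 1.0) setting reported later but are otherwise illustrative.

```python
import jax
import jax.numpy as jnp

def sample_r_t(key, batch, mu=0.4, sigma=1.0, ratio_r_neq_t=0.25):
    """Draw two logit-normal times, set t = max and r = min, then force r = t
    on a (1 - ratio) fraction of samples (sketch)."""
    k1, k2, k3 = jax.random.split(key, 3)
    a = jax.nn.sigmoid(mu + sigma * jax.random.normal(k1, (batch,)))
    b = jax.nn.sigmoid(mu + sigma * jax.random.normal(k2, (batch,)))
    t = jnp.maximum(a, b)
    r = jnp.minimum(a, b)
    keep = jax.random.uniform(k3, (batch,)) < ratio_r_neq_t
    r = jnp.where(keep, r, t)        # r = t reduces to the Flow Matching case
    return r, t
```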

Conditioning on (r, t)

The neural network needs to be conditioned on the time variables $r$ and $t$. Positional embedding [48] is used for this. The paper explores different ways to represent these time variables as inputs to the network:

  • Directly embedding $(t, r)$.

  • Embedding $(t, t-r)$, where $t-r$ is the interval length $\Delta t$.

  • Embedding $(t, r, -r)$, which is equivalent to $(t, r, \Delta t)$ up to a linear reparameterization when $\Delta t = t-r$.

  • Embedding only $t-r$.

    The choice of embedding can influence the network's ability to learn the relationships between $u$, $r$, and $t$. A sketch of the $(t, t-r)$ scheme follows.
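A minimal sketch of sinusoidal time embeddings combined in the $(t, t-r)$ scheme; summing the two embeddings directly (rather than passing each through the 2-layer MLP described later) is a simplification for illustration.

```python
import jax.numpy as jnp

def timestep_embedding(t, dim=256):
    """Sinusoidal positional embedding of a batch of scalar times (sketch)."""
    half = dim // 2
    freqs = jnp.exp(-jnp.log(10000.0) * jnp.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)

def condition_embedding(r, t, dim=256):
    """(t, t - r) conditioning: embed each time variable and combine (sketch)."""
    return timestep_embedding(t, dim) + timestep_embedding(t - r, dim)
```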

B.3 On the Sufficiency of the MeanFlow Identity (Appendix B.3)

The paper explicitly shows that the MeanFlow Identity (Eq. 6) is not only a necessary condition derived from the definition of average velocity (Eq. 3), but also a sufficient condition. This means that if a function uu satisfies Eq. (6), it also satisfies Eq. (3).

The proof defines a displacement field $S(z_t, r, t) = (t-r)\, u(z_t, r, t)$. If the MeanFlow Identity holds, then $\frac{d}{dt} S(z_t, r, t) = v(z_t, t)$. Integrating both sides with respect to $t$ yields $S(z_t, r, t) + C_1 = \int_r^t v(z_\tau, \tau)\, d\tau + C_2$. By definition, $S|_{t=r} = 0$ (since $t-r = 0$), and the integral $\int_r^t v\, d\tau$ is also 0 when $t = r$. These boundary conditions force $C_1 = C_2$, proving sufficiency. This is important because a network satisfying Eq. (6) then correctly models the average velocity as defined by the integral.

B.4 Analysis on Jacobian-Vector Product (JVP) Computation (Appendix B.4)

While JVP computations can sometimes be a bottleneck, the paper clarifies that in MeanFlow it is very lightweight. The key reason is that the JVP result (the total derivative term) is part of the regression target $u_{\mathrm{tgt}}$ and is subject to the stop-gradient (sg) operation, so gradients for optimizing the network parameters $\theta$ are not propagated through the JVP computation.

The JVP itself is a forward-mode differentiation pass, roughly as costly as one extra function evaluation; it computes derivatives only with respect to the inputs $(z_t, r, t)$, not the network parameters $\theta$, and is therefore cheaper than a full backpropagation to update $\theta$. The empirical benchmark shows an overhead of less than 20% of total training time, confirming its efficiency.

Experimental Setup

Datasets

  1. ImageNet 256×256 [7]:

    • Description: A large-scale dataset of natural images, widely used for image classification and generation tasks. It contains millions of images across thousands of object categories.
    • Characteristics: High resolution (256×256 pixels), diverse content (animals, objects, scenes), and class-conditional labels.
    • Usage: Used for the major experiments, evaluating class-conditional image generation.
    • Data Example: A picture of a golden retriever from the dataset.
    • Preprocessing: Images are first processed by a pre-trained VAE tokenizer [37], converting the 256×256×3 (height × width × channels) pixel input into a lower-dimensional latent representation of 32×32×4. The MeanFlow model then operates on this latent space, a common practice in latent diffusion models to reduce computational load.
  2. CIFAR-10 [25]:

    • Description: A smaller dataset of 60,000 32×32 color images in 10 classes, with 6,000 images per class.
    • Characteristics: Low resolution (32×32 pixels), 10 common object classes (e.g., airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
    • Usage: Used for unconditional image generation experiments, directly in pixel space.
    • Data Example: A small 32×32 pixel image of a frog.

Evaluation Metrics

Fréchet Inception Distance (FID) [17]:

  • Conceptual Definition: FID is a metric used to assess the quality of images generated by a generative model. It measures the "distance" between the distribution of real images and the distribution of generated images. A lower FID score indicates better quality and higher realism, meaning the generated images are more similar to real images and the diversity of generated images matches that of real images. It is commonly considered a more robust metric than Inception Score because it takes into account the statistical properties of both real and generated images.
  • Mathematical Formula: $$\mathrm{FID} = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big)$$
  • Symbol Explanation:
    • $\mu_x$: The mean feature vector of real images, extracted from an Inception-v3 activation layer.

    • $\mu_g$: The mean feature vector of generated images, extracted from the same layer.

    • $\|\cdot\|_2^2$: The squared Euclidean (L2) distance between the mean feature vectors.

    • $\Sigma_x$: The covariance matrix of feature vectors for real images.

    • $\Sigma_g$: The covariance matrix of feature vectors for generated images.

    • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).

    • $(\Sigma_x \Sigma_g)^{1/2}$: The matrix square root of the product of the covariance matrices.

      A set of 50,000 generated images (FID-50K) is typically used to compute the score, ensuring statistical robustness. A computation sketch follows.
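A minimal sketch of the FID computation from precomputed feature statistics; extracting the Inception-v3 features themselves is omitted.

```python
import numpy as np
from scipy import linalg

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID from the means and covariances of real/generated features (sketch)."""
    diff = mu_x - mu_g
    covmean = linalg.sqrtm(sigma_x @ sigma_g).real   # drop tiny imaginary parts
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```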

Baselines

The paper compares MeanFlow against a range of state-of-the-art (SOTA) generative models, categorized by their underlying approach:

  1. One-step Diffusion/Flow Models (trained from scratch):

    • iCT-XL/2 [43]: The improved Consistency Training model.
    • Shortcut-XL/2 [13]: A shortcut model approach for one-step diffusion.
    • iMM-XL/2 [52]: Inductive Moment Matching model.
  2. Other Families of Generative Models (for reference):

    • GANs (Generative Adversarial Networks):
      • BigGAN [5]
      • GigaGAN [21]
      • StyleGAN-XL [40]
    • Autoregressive/Masking Models:
      • AR w/ VQGAN [10]
      • MaskGIT [6]
      • VAR-d30 [47]
      • MAR-H [27]
    • Multi-step Diffusion/Flow Models:
      • ADM [8] (Ablated Diffusion Model)

      • LDM-4-G [37] (Latent Diffusion Models)

      • SimDiff [20] (Simple Diffusion)

      • DiT-XL/2 [34] (Diffusion Transformers)

      • SiT-XL/2 [33] (Scalable Interpolant Transformers)

      • SiT-XL/2+REPA [51] (with Representation Alignment)

        These baselines represent different paradigms in generative modeling, with DiT and SiT being particularly strong Transformer-based diffusion models that typically perform very well but require many NFEs.

Model Architecture and Training Details

ImageNet 256×256

  • Backbone: DiT [34] architecture, based on Vision Transformers (ViT) [9] with adaLN-Zero for conditioning. ViT divides an image into patches, treats them as tokens, and processes them with a Transformer encoder. adaLN-Zero is an adaptive layer normalization technique that effectively integrates conditioning information (like time and class labels).
  • Conditioning: Positional embedding [48] is applied to each time variable ($r$ and $t$), followed by a 2-layer Multi-Layer Perceptron (MLP); the outputs are summed to form the conditioning input for the Transformer. The primary conditioning scheme is $(t, t-r)$.
  • Model Sizes: Ablation studies use ViT-B/4 (Base size, patch size 4×4). For the main results, various sizes are used: B/2, M/2, L/2, XL/2 (Base, Medium, Large, Extra-Large, all with patch size 2×2).
  • Training Schedule: Models are trained from scratch for 80 to 1000 epochs.
  • Optimizer: Adam [24] with specific hyperparameters:
    • Learning Rate: 0.0001 (constant schedule)
    • $\beta_1 = 0.9$, $\beta_2 = 0.95$
    • Weight Decay: 0.0
  • Batch Size: Not specified in the main text for ImageNet, but typically large for DiT architectures.
  • Dropout: 0.0
  • EMA Decay: 0.9999 (exponential moving average of model weights, used to stabilize training and improve performance).
  • Time Step Sampling: Logit-normal distribution lognorm(0.4, 1.0) is typically used for sampling $r$ and $t$.
  • Ratio of $r \neq t$: 25% in the ablation (i.e., 75% of samples use $r = t$, which reduces to Flow Matching). For the main results, the configuration table implies 25% of samples use $r = t$.
  • Adaptive Weighting: Power $p = 1.0$ is used for the adaptive loss weighting.
  • CFG (Classifier-Free Guidance):
    • Effective guidance scale $\omega'$ typically 2.0 or 2.5.
    • Class-conditional dropout [18] rate of 0.1 (i.e., 10% of samples are trained unconditionally).
    • CFG is triggered for specific ranges of $t$ values depending on model size and configuration.
    • Mixing scale $\kappa$ (from Improved CFG) is also used for the XL/2+ configuration.

CIFAR-10

  • Backbone: U-net [38] architecture (approximately 55M parameters), a common choice for image generation, particularly in diffusion models [44].
  • Input: Direct pixel space, 32×32×3.
  • Conditioning: Positional embedding on $(t, t-r)$, concatenated for the network.
  • Preconditioner: No EDM preconditioner [22] is used, unlike many other SOTA CIFAR-10 models.
  • Optimizer: Adam
    • Learning Rate: 0.0006
    • Batch Size: 1024
    • $\beta_1 = 0.9$, $\beta_2 = 0.999$
    • Dropout: 0.2
    • Weight Decay: 0
    • EMA Decay: 0.99995
  • Training Schedule: 800K iterations (including 10K warm-up).
  • Time Step Sampling: Logit-normal distribution lognorm(2.0, 2.0).
  • Ratio of $r \neq t$: 75%.
  • Adaptive Weighting: Power $p = 0.75$.
  • Data Augmentation: Follows [22], but with vertical flipping and rotation disabled.

Results & Analysis

5.1 Ablation Study

The ablation study, conducted on ImageNet 256×256 with a ViT-B/4 backbone trained for 80 epochs, investigates the impact of various MeanFlow design choices on 1-NFE FID performance. Default settings (grayed out in the paper's table) correspond to the 61.06-FID entries below.

Table 1: Ablation study on 1-NFE ImageNet 256×256 generation. FID-50K is evaluated.

Manual Transcription of Table 1:

(a) Ratio of sampling $r \neq t$. The 0% entry reduces to the standard Flow Matching baseline.

% of r≠t FID, 1-NFE
0% (= FM) 328.91
25% 61.06
50% 63.14
100% 67.32

Analysis (Table 1a): This ablation highlights the necessity of the average velocity concept. When the ratio of sampling $r \neq t$ is 0%, the model reduces to standard Flow Matching ($r = t$ makes the second term in the MeanFlow Identity vanish), yielding a very poor FID of 328.91: standard Flow Matching is not suited to 1-NFE generation. Introducing a non-zero ratio of $r \neq t$ improves performance dramatically, with 25% achieving the best FID of 61.06. This confirms that explicitly training with intervals (where $r \neq t$) is crucial for MeanFlow's effectiveness in one-step generation: the model must learn to bridge large time gaps, not just instantaneous velocities.

(b) JVP computation. The correct jvp tangent is $(v, 0, 1)$ for the Jacobian $(\partial_z u, \partial_r u, \partial_t u)$.

jvp tangent FID, 1-NFE
(v, 0, 1) 61.06
(v, 0, 0) 268.06
(v, 1, 0) 329.22
(v, 1, 1) 137.96

Analysis (Table 1b): This destructive ablation underscores the importance of the correct mathematical formulation of the MeanFlow Identity, particularly the total derivative term. The correct jvp tangent $(v, 0, 1)$ (representing $\frac{dz_t}{dt}, \frac{dr}{dt}, \frac{dt}{dt}$, respectively) yields the best performance (61.06 FID). Incorrect tangents (e.g., $(v, 0, 0)$, which assumes $t$ is constant, or $(v, 1, 0)$, which incorrectly lets $r$ vary with $t$) lead to significantly worse FID scores (268.06, 329.22, 137.96). This strongly validates the theoretical derivation of the MeanFlow Identity and its total derivative decomposition.

(c) Positional embedding. The network is conditioned on the embeddings applied to the specified variables.

pos. embed FID, 1-NFE
(t, r) 61.75
(t, t−r) 61.06
(t, r, −r) 63.98
t−r only 63.13

Analysis (Table 1c): This explores different ways to encode the time information $(r, t)$ for the neural network. All tested variants produce meaningful 1-NFE results, demonstrating the robustness of MeanFlow as a framework. The best performance is achieved by conditioning on $(t, t-r)$ (61.06 FID), suggesting that providing both the current time $t$ and the interval duration $t-r$ is most informative; $(t, r)$ performs nearly as well (61.75 FID). Even using only the interval $t-r$ yields a reasonable result (63.13 FID), highlighting the importance of the interval itself.

(d) Time samplers. $t$ and $r$ are sampled from the specified sampler.

t, r sampler FID, 1-NFE
uniform(0, 1) 65.90
lognorm(0.2, 1.0) 63.83
lognorm(0.2, 1.2) 64.72
lognorm(0.4, 1.0) 61.06
lognorm(0.4, 1.2) 61.79

Analysis (Table 1d): The distribution from which $r$ and $t$ are sampled significantly impacts performance. Logit-normal (lognorm) samplers consistently outperform the uniform distribution (65.90 FID); among them, lognorm(0.4, 1.0) achieves the best FID (61.06). This aligns with observations in Flow Matching research, where non-uniform time sampling often leads to better results by focusing training on more critical regions of the time domain.

(e) Loss metrics. $p = 0$ is the squared L2 loss; $p = 0.5$ corresponds to the Pseudo-Huber loss.

p FID, 1-NFE
0.0 79.75
0.5 63.98
1.0 61.06
1.5 66.57
2.0 69.19

Analysis (Table 1e): This evaluates the adaptive loss weighting power $p$. The standard squared L2 loss ($p=0$) performs worst (79.75 FID), though it still yields meaningful results. Increasing $p$ (which effectively down-weights larger errors) improves performance, with $p=1.0$ achieving the best FID (61.06). This indicates that moderately down-weighting very large errors during training is beneficial, consistent with findings in other few-step generative models (e.g., Pseudo-Huber loss at $p=0.5$).

(f) CFG scale. Our method supports 1-NFE CFG sampling.

ω′ FID, 1-NFE
1.0 (w/o cfg) 61.06
1.5 33.33
2.0 20.15
3.0 15.53
5.0 20.75

Analysis (Table 1f): This shows the significant impact of Classifier-Free Guidance (CFG). Without CFG ($\omega' = 1.0$), the FID is 61.06. Introducing CFG dramatically improves the FID, reaching a minimum of 15.53 at an effective guidance scale of $\omega' = 3.0$. This demonstrates that MeanFlow successfully integrates CFG in a 1-NFE manner, yielding substantial quality gains, consistent with observations in multi-step diffusion models. Too high a guidance scale (e.g., $\omega' = 5.0$) can degrade quality (20.75 FID), a common phenomenon in CFG.

Scalability

Figure 4: Scalability of MeanFlow models on ImageNet 256×256. 1-NFE generation FID is reported; all models are trained from scratch, and CFG is applied while maintaining 1-NFE sampling. The figure plots 1-NFE FID versus training epochs for model sizes B/2, M/2, L/2, and XL/2: larger models achieve lower FID.

Analysis (Figure 4): This figure illustrates the scalability of MeanFlow on ImageNet 256×256 1-NFE generation. As the model size increases from B/2 to XL/2, the FID consistently decreases, indicating improved generation quality. This trend is consistent with other Transformer-based generative models like DiT and SiT, where larger models generally perform better given sufficient training. The plot also shows that CFG, applied while keeping 1-NFE, enables strong performance across scales, suggesting that MeanFlow can benefit from increased computational resources and model capacity.

5.2 Comparisons with Prior Work

Table 2: Class-conditional generation on ImageNet 256×256. All entries are reported with CFG, when applicable.

Manual Transcription of Table 2 (Left): Left: 1-NFE and 2-NFE diffusion/flow models trained from scratch.

method params (M) NFE FID
1-NFE diffusion/flow from scratch
iCT-XL/2 [43]† 675 1 34.24
Shortcut-XL/2 [13] 675 1 10.60
MeanFlow-B/2 131 1 6.17
MeanFlow-M/2 308 1 5.01
MeanFlow-L/2 459 1 3.84
MeanFlow-XL/2 676 1 3.43
2-NFE diffusion/flow from scratch
iCT-XL/2 [43]† 675 2 20.30
iMM-XL/2 [52] 675 1×2 7.77
MeanFlow-XL/2 676 2 2.93
MeanFlow-XL/2+ 676 2 2.20

Manual Transcription of Table 2 (Right): Right: Other families of generative models as a reference. In both tables, 1×2 indicates that CFG incurs an NFE of 2 per sampling step. Our MeanFlow models are all trained for 240 epochs, except that "MeanFlow-XL/2+" is trained for more epochs with configurations selected for longer training, as specified in the appendix. †: iCT [43] results are reported by [52].

method params (M) NFE FID
GANs
BigGAN [5] 112 1 6.95
GigaGAN [21] 569 1 3.45
StyleGAN-XL [40] 166 1 2.30
autoregressive/masking
AR w/ VQGAN [10] 227 1024 26.52
MaskGIT [6] 227 8 6.18
VAR-d30 [47] 2000 10×2 1.92
MAR-H [27] 943 256×2 1.55
diffusion/flow
ADM [8] 554 250×2 10.94
LDM-4-G [37] 400 250×2 3.60
SimDiff [20] 2000 512×2 2.77
DiT-XL/2 [34] 675 250×2 2.27
SiT-XL/2 [33] 675 250×2 2.06
SiT-XL/2+REPA [51] 675 250×2 1.42

Analysis (ImageNet 256×256 Comparisons):

  • 1-NFE Performance: MeanFlow-XL/2 achieves an FID of 3.43 with 1-NFE, a significant breakthrough over previous SOTA one-step models:
    • Shortcut-XL/2 (previous SOTA 1-NFE) had an FID of 10.60; MeanFlow improves on this by a relative margin of nearly 70%.
    • iCT-XL/2 (another 1-NFE model) had an FID of 34.24.
    • iMM-XL/2 reported 1-step results, but these involve 2 NFE due to CFG (1×2 in the table), achieving 7.77 FID; MeanFlow still significantly outperforms it.
  • Closing the Gap with Multi-step Models: The 3.43 FID of MeanFlow with 1-NFE is remarkably close to, and even surpasses, some multi-step diffusion models that require hundreds of NFEs, such as ADM (10.94 FID at 250×2 NFE) and LDM-4-G (3.60 FID at 250×2 NFE).
  • 2-NFE Performance: The MeanFlow-XL/2+ model achieves an FID of 2.20 with 2-NFE, highly competitive with leading multi-step Transformer-based models like DiT-XL/2 (2.27 FID at 250×2 NFE) and SiT-XL/2 (2.06 FID at 250×2 NFE). This strongly supports the claim that few-step diffusion/flow models can now rival their many-step predecessors.
  • Self-contained and From Scratch: MeanFlow achieves these results from scratch, without pre-training, distillation, or curriculum learning, a significant advantage in terms of simplicity and robustness.
  • Comparison to GANs: MeanFlow's 1-NFE performance (3.43 FID) is competitive with SOTA GANs like GigaGAN (3.45 FID), though StyleGAN-XL (2.30 FID) remains slightly better at 1-NFE. However, GANs have their own training stability challenges.

Table 3: Unconditional CIFAR-10.

Manual Transcription of Table 3:

method precond NFE FID
iCT [43] EDM 1 2.83
ECT [15] EDM 1 3.60
sCT [31] EDM 1 2.97
IMM [52] EDM 1 3.20
MeanFlow none 1 2.92

Analysis (CIFAR-10 Comparisons): On the CIFAR-10 dataset for unconditional generation, MeanFlow achieves an FID of 2.92 with 1-NFE. This is competitive with other 1-NFE Consistency Model variants like iCT (2.83 FID) and sCT (2.97 FID), and outperforms ECT (3.60 FID) and IMM (3.20 FID). A notable point is that MeanFlow achieves this performance without using an EDM preconditioner, which is typically employed by other SOTA models for CIFAR-10 to improve stability and performance. This suggests that MeanFlow's core mechanism is robust even without additional preconditioning techniques.

B.1 Improved CFG for MeanFlow (Appendix B.1)

Table 5: Improved CFG for MeanFlow.

Manual Transcription of Table 5:

κ FID, 1-NFE
0.0 20.15
0.5 19.15
0.8 19.10
0.9 18.63
0.95 19.17

Analysis (Table 5): This ablation explores the effect of the mixing scale $\kappa$ in the Improved CFG formulation (Eqs. 20 and 21). With the effective guidance scale $\omega'$ fixed at 2.0, varying $\kappa$ yields further FID improvements. At $\kappa = 0.0$ the FID is 20.15; mixing class-conditional and class-unconditional $u_\theta^{\mathrm{cfg}}$ terms in the target reduces the FID to a minimum of 18.63 at $\kappa = 0.9$. This demonstrates that explicitly blending conditional and unconditional aspects within the CFG target, analogous to the dropout mechanism in standard CFG, further enhances MeanFlow's generation quality.

Qualitative Results

Figure 5: 1-NFE Generation Results. Curated examples of class-conditional generation on ImageNet 256×256 using the 1-NFE model (MeanFlow-XL/2, 3.43 FID), covering a variety of natural scenes and animals.

Analysis (Figure 5): The figure showcases curated examples of class-conditional images generated by the MeanFlow-XL/2 model on ImageNet 256×256 with 1-NFE. The images exhibit high fidelity, sharpness, and diversity, closely resembling real photographs across various categories. This visual evidence supports the quantitative FID results, confirming the model's ability to produce high-quality samples in a single step.

Conclusion & Personal Thoughts

Conclusion Summary

This paper introduces MeanFlow, a novel and principled framework for one-step generative modeling. Its core innovation is the concept of average velocity to characterize flow fields, contrasting with the instantaneous velocity typically modeled by Flow Matching methods. By deriving and leveraging the MeanFlow Identity, which fundamentally relates average and instantaneous velocities, the method provides a robust and self-contained training objective. MeanFlow requires no pre-training, distillation, or curriculum learning, simplifying its application. It achieves state-of-the-art performance for 1-NFE generation on ImageNet 256×256 (FID 3.43), significantly outperforming previous one-step models and substantially narrowing the gap with multi-step diffusion/flow models. Furthermore, MeanFlow naturally integrates Classifier-Free Guidance (CFG) without increasing the number of function evaluations (NFE) during sampling.

Limitations & Future Work (Authors' Perspective)

The authors identify that MeanFlow's strong performance largely closes the gap between one-step and many-step diffusion/flow models, suggesting that future research could revisit the foundations of these models. They also explicitly state that orthogonal improvements, such as REPA [51] (Representation Alignment), are applicable and left for future work. The paper concludes by suggesting that its formulation, by describing quantities at coarsened levels of granularity, could bridge research in generative modeling, simulation, and dynamical systems in related fields (e.g., multi-scale simulation problems in physics).

Personal Insights & Critique

The MeanFlow paper presents a highly impactful contribution to the field of generative AI. The shift from instantaneous velocity to average velocity is conceptually elegant and mathematically sound. The rigorous derivation of the MeanFlow Identity is a significant strength, providing a principled foundation that many other few-step methods (which often rely on heuristics or empirically tuned consistency constraints) lack.

Strengths:

  • Principled Approach: The derivation of the MeanFlow Identity from first principles is commendable. It provides a strong theoretical underpinning for the model's effectiveness.
  • Self-Contained and From Scratch: The ability to achieve SOTA 1-NFE performance without distillation or curriculum learning is a major practical advantage. It makes MeanFlow easier to train, more robust, and less dependent on complex multi-stage pipelines.
  • Efficient CFG Integration: The native inclusion of CFG for 1-NFE sampling is a clever design choice that overcomes a common limitation of few-step models.
  • Exceptional Performance: The FID scores achieved are truly impressive, especially for 1-NFE. Closing the gap with multi-step models is a significant milestone for real-time generative applications.
  • Scalability: The demonstrated scalability is crucial for leveraging larger Transformer architectures and further improving quality.

Possible Improvements / Open Questions:

  • Computational Cost of JVP: While the authors state the JVP is lightweight (16% overhead), this still adds to training time. Exploring ways to approximate or further optimize this derivative calculation could be beneficial, especially for extremely large models or resource-constrained environments.
  • Generalization of Average Velocity: The concept of average velocity is inherently tied to a specific time interval [r, t]. Could this idea be extended to other forms of "averaging" or "coarsening" in different contexts (e.g., spatial averaging, multi-resolution flows)?
  • Robustness to Diverse Data: While ImageNet is a challenging dataset, further validation on more diverse or specialized datasets (e.g., medical images, complex scientific data) could assess the model's broader applicability.
  • Theoretical Guarantees: While the derivation is principled, further theoretical analysis regarding convergence properties, stability, and potential for mode collapse in MeanFlow compared to standard Flow Matching could provide deeper insights.
  • Hyperparameter Sensitivity: The ablation studies show sensitivity to pp in the loss function, time samplers, and CFG scale. Further work on adaptive or robust strategies for these hyperparameters could make the model even more user-friendly.

Broader Impact: MeanFlow has the potential to significantly accelerate the adoption of generative models in real-world applications where low latency is critical, such as interactive image editing, real-time content generation in games or virtual reality, and faster prototyping in design. The principled nature of its average velocity concept could also inspire new theoretical developments in the understanding and construction of continuous-time generative models, potentially leading to more stable and efficient algorithms across various domains. The connection to multi-scale simulation problems hinted at in the conclusion is particularly intriguing, suggesting that MeanFlow might offer a fresh perspective on complex scientific and engineering challenges.
