Paper status: completed

Mean Flows for One-step Generative Modeling

Published: 05/20/2025

TL;DR Summary

MeanFlow introduces average velocity and an identity that guides training, enabling one-step generative modeling without pre-training. It achieves 3.43 FID on ImageNet 256×256 with a single function evaluation, outperforming prior one-step models and narrowing the gap to multi-step models.

Abstract

We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.


In-depth Reading


Bibliographic Information

  • Title: Mean Flows for One-step Generative Modeling
  • Authors: Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He
  • Affiliations: Carnegie Mellon University (CMU), Massachusetts Institute of Technology (MIT)
  • Journal/Conference: Not explicitly stated; the paper was submitted to arXiv, a preprint server widely used for rapid dissemination of research. Given the authors and the quality of the work, it is likely intended for a top-tier machine learning conference (e.g., ICLR, NeurIPS, ICML) or journal.
  • Publication Year: 2025
  • Abstract: This paper introduces a novel framework, MeanFlow, for one-step generative modeling. Unlike traditional Flow Matching methods that model instantaneous velocity, MeanFlow proposes to characterize flow fields using average velocity. The core innovation is a derived identity between average and instantaneous velocities, which directly guides the training of a neural network. The MeanFlow model is self-contained, meaning it does not require pre-training, distillation, or curriculum learning. Empirically, MeanFlow achieves an impressive Fréchet Inception Distance (FID) of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256, significantly surpassing previous state-of-the-art (SOTA) one-step diffusion/flow models. The authors conclude that this work substantially reduces the performance gap between one-step and multi-step generative models, encouraging further foundational research in this area.
  • Original Source Link: https://arxiv.org/abs/2505.13447
  • PDF Link: https://arxiv.org/pdf/2505.13447v1.pdf

Executive Summary

Background & Motivation (Why)

The fundamental goal of generative modeling is to learn how to transform a simple, known probability distribution (like a Gaussian noise distribution) into a complex, target data distribution (like images of cats). Diffusion models and Flow Matching have emerged as highly successful frameworks for this task. However, these models traditionally require many sampling steps (or function evaluations, NFE) during the inference (generation) phase to produce high-quality samples, making them computationally expensive and slow.

The core problem this paper addresses is the challenge of one-step generative modeling – generating high-quality samples in just a single function evaluation (1-NFE). Achieving one-step generation is crucial for practical applications where speed is paramount, such as real-time content creation or interactive systems. Prior one-step or few-step methods often rely on complex techniques like distillation from multi-step models, pre-training, or carefully designed curriculum learning strategies to stabilize training and improve performance. These approaches can be intricate to implement and may introduce dependencies on other models or training procedures. The specific gap identified is the lack of a principled, self-contained framework for one-step generation that can achieve SOTA quality without these additional complexities.

The paper's novel approach, MeanFlow, introduces the concept of average velocity to characterize flow fields. Instead of modeling the instantaneous velocity at each moment in time (as Flow Matching typically does), MeanFlow directly learns the average velocity over a time interval. This conceptual shift allows for direct one-step generation by effectively "skipping" the intermediate steps.

Main Contributions / Findings (What)

  1. Introduction of Average Velocity and MeanFlow Identity: The paper formally defines average velocity and, through a rigorous derivation, establishes a MeanFlow Identity. This identity describes a fundamental relationship between the average velocity and instantaneous velocity, providing a principled, self-contained mathematical basis for training a neural network for one-step generation.
  2. Principled One-Step Generative Framework: MeanFlow offers a self-contained framework that trains from scratch without needing pre-training, distillation, or curriculum learning. This simplifies the training process and makes the model more robust.
  3. Integration of Classifier-Free Guidance (CFG) for 1-NFE: The method naturally incorporates Classifier-Free Guidance (CFG) directly into its underlying ground-truth fields, enabling the benefits of CFG (improved sample quality) without incurring additional NFE during one-step sampling.
  4. State-of-the-Art Empirical Performance: MeanFlow achieves an FID of 3.43 with 1-NFE on ImageNet 256x256, significantly outperforming previous SOTA one-step diffusion/flow models by a relative margin of 50-70%. This performance substantially narrows the gap between one-step and multi-step generative models. It also achieves an FID of 2.20 with 2-NFE, matching or exceeding the performance of leading multi-step models (e.g., DiT, SiT) that use hundreds of NFEs.
  5. Scalability: The model demonstrates promising scalability with increasing model size, consistent with Transformer-based diffusion/flow models.

Prerequisite Knowledge & Related Work

To understand MeanFlow, it's essential to first grasp the foundational concepts of generative modeling, diffusion models, and Flow Matching.

Foundational Concepts

Generative Modeling

Generative modeling is a branch of machine learning focused on learning the underlying probability distribution of a dataset. Once learned, a generative model can produce new samples that are similar to the training data. For example, given a dataset of images of faces, a generative model could generate new, realistic faces that were not in the original dataset. The goal is to transform a simple, known prior distribution (e.g., standard Gaussian noise) into the complex data distribution.

Diffusion Models

Diffusion models (e.g., DDPM [19], Score-based models [44]) are a class of generative models that learn to reverse a gradual diffusion process. This process involves progressively adding noise to data until it becomes pure noise. The model then learns to denoise the data, effectively reversing the diffusion process to transform noise back into coherent data. This reversal is typically achieved by predicting the noise added at each step or the score function (gradient of the log-probability density). During sampling (generation), these models iteratively apply a denoising function over many time steps, starting from random noise.

Flow Matching

Flow Matching [28, 2, 30] is a framework closely related to diffusion models that focuses on learning velocity fields. Instead of directly predicting noise or score functions, Flow Matching learns a velocity field that defines a continuous flow path transforming a simple prior distribution (like Gaussian noise) into the target data distribution.

  • Flow Path: A flow path is a continuous trajectory $z_t$ that connects a sample $x$ from the data distribution to a sample $\epsilon$ from the prior distribution. It is often defined as a linear interpolation: $z_t = a_t x + b_t \epsilon$, where $t$ is time (typically from 0 to 1) and $a_t$, $b_t$ are predefined schedules (e.g., $a_t = 1-t$, $b_t = t$).
  • Instantaneous Velocity ($v_t$): The velocity at any point $z_t$ along this path is the time derivative of $z_t$. If $z_t = (1-t)x + t\epsilon$, the instantaneous velocity is $v_t = \frac{dz_t}{dt} = \epsilon - x$. This is often called the conditional velocity because it depends on the specific pair $(x, \epsilon)$ that generated $z_t$.
  • Conditional vs. Marginal Velocity:
    • Conditional Velocity: Given a specific data point $x$ and a noise vector $\epsilon$, the conditional velocity $v_t(z_t \mid x, \epsilon)$ describes the direction and speed of the flow path for that specific pair. As shown in Fig. 2 (left), a given point $z_t$ in the latent space can arise from different $(x, \epsilon)$ pairs, leading to different conditional velocities.
    • Marginal Velocity: Because the neural network needs to learn a consistent velocity field for any $z_t$ it encounters, Flow Matching aims to model the marginal velocity $v(z_t, t)$. This is the expected velocity at $z_t$ and time $t$, averaged over all $(x, \epsilon)$ pairs that could lead to $z_t$ (Fig. 2, right): $v(z_t, t) \triangleq \mathbb{E}_{p_t(v_t \mid z_t)}[v_t]$. Training directly with this expectation is often intractable; however, minimizing a conditional Flow Matching (CFM) loss, which matches the predicted velocity to the conditional velocity $v_t(z_t \mid x, \epsilon)$, is equivalent to minimizing the marginal Flow Matching loss.
  • Sampling: Once the neural network $v_\theta$ is trained to predict the velocity field, sampling new data points involves solving an ordinary differential equation (ODE), $\frac{d}{dt} z_t = v(z_t, t)$. This ODE is typically solved numerically, e.g., with Euler's method over many discrete time steps, starting from noise ($z_1 = \epsilon$) and integrating backwards to $z_0$. Each step in this numerical integration constitutes one function evaluation (NFE). A minimal sketch follows this list.
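To make the Flow Matching baseline concrete, here is a minimal JAX sketch of the CFM training loss and the Euler sampling loop described above. The `velocity_net` callable and its signature are illustrative assumptions, not the paper's implementation.

```python
import jax.numpy as jnp

def cfm_loss(params, velocity_net, x, eps, t):
    """Conditional Flow Matching loss on the linear path z_t = (1-t) x + t eps.
    t: scalar or array broadcastable against x."""
    z_t = (1.0 - t) * x + t * eps           # point on the flow path
    v_cond = eps - x                         # conditional velocity target
    v_pred = velocity_net(params, z_t, t)    # v_theta(z_t, t)
    return jnp.mean((v_pred - v_cond) ** 2)

def euler_sample(params, velocity_net, eps, num_steps=50):
    """Integrate dz/dt = v(z, t) from t=1 (noise) to t=0 (data); each step is 1 NFE."""
    z, dt = eps, 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_net(params, z, t)
    return z
```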

Number of Function Evaluations (NFE)

NFE refers to how many times the neural network (which predicts velocity or noise) must be called during the sampling process to generate a single output sample.

  • Multi-step Generation: Models requiring many NFE (e.g., 250-1000) are multi-step models. They are generally slower but often produce higher quality samples.
  • Few-step Generation: Models aiming to reduce NFE to a small number (e.g., 2-50).
  • One-step Generation (1-NFE): The ultimate goal for fast inference, where a single forward pass of the neural network generates a complete sample. This is the focus of MeanFlow.

Latent Space Models

Many generative models operate not directly on high-resolution pixels but on a compressed latent space representation. A Variational Autoencoder (VAE) [37] is commonly used as a tokenizer to convert images into a lower-dimensional latent representation and vice versa. The generative model then learns to transform noise in this latent space, and a decoder converts the generated latent representation back to pixel space. This reduces computational cost and allows Transformer-based architectures to operate more efficiently.

Positional Embedding

In Transformer architectures, positional embeddings [48] are used to inject information about the relative or absolute position of elements in a sequence (or, in this case, time variables) into the model. Since Transformers are inherently permutation-invariant, positional embeddings are crucial for the model to understand the temporal context.

Previous Works & Technological Evolution

The field has evolved towards faster generation, primarily through distillation or consistency-based methods.

  • Distillation-based Methods: These approaches first train a multi-step diffusion model and then distill its knowledge into a smaller, faster model that requires fewer sampling steps (e.g., [39, 14, 41, 32]). This often means the few-step model is dependent on the performance and training of the multi-step teacher model.
  • Consistency Models (CMs) [46, 43, 15, 31, 49]: These are standalone generative models designed for one-step generation without distillation. They introduce a consistency constraint during training: the network should produce consistent outputs (i.e., map to the same original data point) when evaluated at different time steps along the same flow path. CMs often use a discretization curriculum [46] where the time domain is progressively constrained, making training sensitive and requiring careful scheduling. In the notation of MeanFlow, CMs focus on paths anchored at the data side ($r \equiv 0$).
    • Example: iCT [43] and ECT [15] are variants of Consistency Models.
  • Shortcut Models [13]: These models introduce a self-consistency loss function in addition to Flow Matching. They focus on relationships between flows at different discrete time intervals, effectively trying to learn a shortcut from noise to data. They are conditioned on two time variables, similar to MeanFlow's use of $r$ and $t$.
  • Inductive Moment Matching (IMM) [52]: This method models the self-consistency of stochastic interpolants (similar to flow paths) at different time steps. It also uses two time variables.

Differentiation

MeanFlow differentiates itself from prior work in several key ways:

  • Principled Foundation: Unlike Consistency Models, which impose consistency constraints as properties of the neural network's behavior, MeanFlow derives its core MeanFlow Identity (Eq. 6) directly from the definition of average velocity (Eq. 3). This identity exists independently of any neural network and provides a ground-truth target field for learning, leading to more stable and robust training.
  • Self-Contained and From Scratch: MeanFlow does not rely on pre-training, distillation from multi-step models, or complex curriculum learning strategies often seen in Consistency Models. It is trained entirely from scratch, making it a simpler and more fundamental approach.
  • Average vs. Instantaneous Velocity: The central distinction is the target: MeanFlow models the average velocity $u(z_t, r, t)$ across an interval $[r, t]$, whereas traditional Flow Matching and its variants model the instantaneous velocity $v(z_t, t)$ at a single point in time. This conceptual shift directly facilitates one-step generation.
  • Native CFG Integration: MeanFlow integrates Classifier-Free Guidance (CFG) by defining guided ground-truth velocity fields ($v^{\mathrm{cfg}}$ and $u^{\mathrm{cfg}}$), allowing 1-NFE sampling with guidance, whereas other methods may require extra NFE for CFG or struggle to integrate it seamlessly into a one-step framework.

Methodology (Core Technology & Implementation Details)

The MeanFlow model introduces the concept of average velocity to achieve efficient one-step generative modeling. This section details the mathematical derivation and practical implementation.

4.1 Mean Flows

Core Idea: Average Velocity vs. Instantaneous Velocity

Traditional Flow Matching models learn the instantaneous velocity $v(z_t, t)$, which describes the direction and magnitude of movement at a specific point $z_t$ and time $t$. To generate a sample, this instantaneous velocity must be integrated over time, typically through numerical ODE solvers that require multiple function evaluations (NFE).

MeanFlow proposes to model the average velocity $u(z_t, r, t)$. This average velocity is defined over a time interval $[r, t]$ and represents the average rate of change of $z_t$ over that interval. The intuition is that if we can directly learn this average velocity, we can effectively "jump" from $z_t$ to $z_r$ in one step, without needing to compute all intermediate instantaneous velocities.

Definition of Average Velocity

The average velocity $u(z_t, r, t)$ is defined as the displacement between two time steps $t$ and $r$ divided by the time interval $(t-r)$. The displacement itself is the integral of the instantaneous velocity $v(z_\tau, \tau)$ over that interval:

$$u(z_t, r, t) \triangleq \frac{1}{t-r} \int_r^t v(z_\tau, \tau)\, d\tau$$

  • $u(z_t, r, t)$: The average velocity vector, a function of the current state $z_t$, the start time $r$ of the interval, and the end time $t$ of the interval.
  • $v(z_\tau, \tau)$: The instantaneous velocity vector at an intermediate time $\tau$.
  • $\int_r^t v(z_\tau, \tau)\, d\tau$: The total displacement of the trajectory $z_\tau$ from time $r$ to time $t$.
  • $(t-r)$: The duration of the time interval.

Key characteristics of average velocity uu:

  • It is a field that depends jointly on $(r, t)$.
  • As $r \to t$, the average velocity approaches the instantaneous velocity: $\lim_{r \to t} u = v$.
  • It satisfies a natural consistency property: taking one large step from $r$ to $t$ is consistent with taking two smaller consecutive steps from $r$ to $s$ and then from $s$ to $t$. This follows from the additivity of integrals: $(t-r)\,u(z_t, r, t) = \int_r^t v\, d\tau = \int_r^s v\, d\tau + \int_s^t v\, d\tau = (s-r)\,u(z_s, r, s) + (t-s)\,u(z_t, s, t)$. A neural network learning $u$ therefore inherits these consistency relations without explicit constraints.

Figure 3 visually contrasts the average velocity with the instantaneous velocity. The instantaneous velocity (tangent to the path) and average velocity (aligned with the displacement vector) are generally not aligned for curved paths.

The MeanFlow Identity

Directly using Eq. (3) for training is problematic because it requires evaluating an integral, which is computationally expensive during training. The core insight of MeanFlow is to derive a relationship between uu and vv that is amenable to training.

  1. Rewrite Eq. (3) as a displacement: $$(t-r)\, u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\, d\tau$$
  2. Differentiate both sides with respect to $t$, treating $r$ as independent of $t$:
    • Left-hand side (product rule): $$\frac{d}{dt}\big[(t-r)\, u(z_t, r, t)\big] = \frac{d}{dt}(t-r) \cdot u(z_t, r, t) + (t-r) \cdot \frac{d}{dt} u(z_t, r, t)$$ Since $\frac{d}{dt}(t-r) = 1$, the LHS becomes $u(z_t, r, t) + (t-r)\, \frac{d}{dt} u(z_t, r, t)$.
    • Right-hand side (fundamental theorem of calculus): $$\frac{d}{dt} \int_r^t v(z_\tau, \tau)\, d\tau = v(z_t, t)$$ (again assuming $r$ is constant with respect to $t$).
  3. Equate the two sides and rearrange to isolate $u(z_t, r, t)$: $$\boxed{\, u(z_t, r, t) = v(z_t, t) - (t-r)\, \frac{d}{dt} u(z_t, r, t) \,}$$ This is the MeanFlow Identity. It expresses the average velocity $u$ in terms of the instantaneous velocity $v$ and the total derivative of $u$ with respect to $t$. The identity is crucial because it provides a target for learning $u$ that involves only $v$ and derivatives of $u$, rather than an integral of $v$.

Decomposing the Total Derivative

The term $\frac{d}{dt} u(z_t, r, t)$ is a total derivative. Since $u$ is a function of $z_t$, $r$, and $t$, and $z_t$ itself is a function of $t$ (through the flow path), we apply the chain rule:

$$\frac{d}{dt} u(z_t, r, t) = \frac{\partial u}{\partial z} \frac{dz_t}{dt} + \frac{\partial u}{\partial r} \frac{dr}{dt} + \frac{\partial u}{\partial t} \frac{dt}{dt}$$

  • $\frac{dz_t}{dt} = v(z_t, t)$: the definition of instantaneous velocity from Eq. (2) in the paper's background.

  • $\frac{dr}{dt} = 0$: we treat $r$ as an independent time parameter of the interval $[r, t]$, so its value does not change as we differentiate with respect to $t$.

  • $\frac{dt}{dt} = 1$: the derivative of $t$ with respect to itself.

    Substituting these into the total derivative equation:

$$\boxed{\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u + \partial_t u \,}$$

  • $\partial_z u$: the Jacobian (matrix of partial derivatives) of $u$ with respect to $z_t$.
  • $v(z_t, t)\, \partial_z u$: a Jacobian-vector product (JVP), capturing how changes in $z_t$ (moving with velocity $v$) affect $u$; see the JVP sketch below.
  • $\partial_t u$: the partial derivative of $u$ with respect to $t$, holding $z_t$ and $r$ constant.
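The following toy example illustrates how `jax.jvp` computes exactly this total derivative: supplying the tangent $(v, 0, 1)$ yields $v\,\partial_z u + \partial_t u$ in a single call. The function `u` here is an arbitrary stand-in, not the paper's network.

```python
import jax
import jax.numpy as jnp

# Arbitrary differentiable stand-in for an average-velocity field u(z, r, t).
def u(z, r, t):
    return jnp.sin(t) * z + (t - r) * z ** 2

z = 0.5 * jnp.ones(4)
r, t = jnp.array(0.2), jnp.array(0.8)
v = jnp.ones(4)  # stand-in for the instantaneous velocity v(z_t, t)

# Tangent (v, 0, 1) encodes dz/dt = v, dr/dt = 0, dt/dt = 1, so the JVP output
# is the total derivative d/dt u = v * (du/dz) + du/dt.
u_val, du_dt = jax.jvp(u, (z, r, t), (v, jnp.zeros_like(r), jnp.ones_like(t)))
print(u_val, du_dt)
```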

Training with Average Velocity

The goal is to train a neural network $u_\theta$, parameterized by $\theta$, to approximate the average velocity $u$. The MeanFlow Identity (Eq. 6), combined with the total derivative decomposition (Eq. 8), provides the target for this training.

The loss function minimizes the squared difference between the network's prediction $u_\theta(z_t, r, t)$ and an effective regression target $u_{\mathrm{tgt}}$:

$$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u_\theta(z_t, r, t) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|_2^2$$ where the target $u_{\mathrm{tgt}}$ is derived from the MeanFlow Identity: $$u_{\mathrm{tgt}} = v(z_t, t) - (t-r)\big( v(z_t, t)\, \partial_z u_\theta + \partial_t u_\theta \big)$$

  • $u_\theta(z_t, r, t)$: The output of the neural network, predicting the average velocity.

  • $\mathrm{sg}(\cdot)$: Stop-gradient operation. During backpropagation, gradients are not passed through $u_{\mathrm{tgt}}$: the target is treated as a fixed value, preventing double backpropagation through the derivatives of $u_\theta$ inside $u_{\mathrm{tgt}}$, which would introduce higher-order derivatives and make training far more complex and computationally expensive.

  • $v(z_t, t)$: The instantaneous velocity. Following Flow Matching practice, this is replaced by the conditional velocity during training; for rectified flows (a common choice of $a_t, b_t$), this is $v_t = \epsilon - x$.

  • $\partial_z u_\theta$ and $\partial_t u_\theta$: The Jacobian and partial derivative of the network output $u_\theta$ with respect to $z_t$ and $t$, computed via automatic differentiation, specifically Jacobian-vector products (JVP).

    The MeanFlow Identity is a necessary and sufficient condition for Eq. (3) provided the boundary condition $S|_{t=r} = 0$ is met (where $S = (t-r)\,u$). The MeanFlow formulation inherently satisfies this.

Algorithm 1: Training MeanFlow

The pseudocode illustrates the training process:

Algorithm 1 MeanFlow: Training.
Input: a batch of data x, prior samples e, model u_theta
Output: loss value

1.  Sample t and r with r <= t (e.g., from a logit-normal sampler; for a fixed
    fraction of the batch, set r = t, which recovers the Flow Matching target)
2.  z_t = (1 - t) * x + t * e                 # noisy sample at time t
3.  v_t = e - x                               # conditional instantaneous velocity
4.  Define f(z, r_val, t_val) = u_theta(z, r_val, t_val)
5.  Compute the total derivative with a Jacobian-vector product:
    # d/dt u(z_t, r, t) = v_t (partial_z u_theta) + partial_t u_theta
    # The tangent (v_t, 0, 1) encodes dz_t/dt = v_t, dr/dt = 0, dt/dt = 1.
    u_val, du_dt = jvp(f, (z_t, r, t), (v_t, 0, 1))
    # u_val is u_theta(z_t, r, t); du_dt is the total derivative d/dt u_theta
6.  u_tgt = v_t - (t - r) * du_dt             # regression target from the identity
7.  loss = mean(square(u_val - stop_gradient(u_tgt)))
8.  Backpropagate and update theta
  • jvp (Jacobian-Vector Product): This operation efficiently computes the total derivative. It takes the function $f$ (our $u_\theta$), the arguments $(z_t, r, t)$, and a tangent vector $(v_t, 0, 1)$. The tangent vector specifies how each argument changes with respect to the differentiation variable (here, $t$). The jvp returns both the function's output u_val ($u_\theta(z_t, r, t)$) and the total derivative du_dt ($\frac{d}{dt} u_\theta(z_t, r, t)$). The jvp is a forward-mode differentiation pass, roughly as costly as one additional function evaluation.
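Putting the pieces together, here is a minimal runnable JAX sketch of the MeanFlow loss. The `u_net(params, z, r, t)` signature and the batch layout are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def meanflow_loss(params, u_net, x, eps, r, t):
    """MeanFlow training loss (sketch). x, eps: (B, D); r, t: (B,) with r <= t."""
    t_b, r_b = t[:, None], r[:, None]        # broadcast over the feature dim
    z_t = (1.0 - t_b) * x + t_b * eps        # noisy sample on the flow path
    v_t = eps - x                            # conditional instantaneous velocity

    u_fn = lambda z, r_, t_: u_net(params, z, r_, t_)
    # Tangent (v_t, 0, 1): dz_t/dt = v_t, dr/dt = 0, dt/dt = 1.
    u_val, du_dt = jax.jvp(u_fn, (z_t, r, t),
                           (v_t, jnp.zeros_like(r), jnp.ones_like(t)))

    u_tgt = v_t - (t_b - r_b) * du_dt        # MeanFlow Identity target
    u_tgt = jax.lax.stop_gradient(u_tgt)     # sg(.): no gradients through target
    return jnp.mean((u_val - u_tgt) ** 2)
```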

Sampling

The main advantage of MeanFlow is its one-step sampling capability. Once $u_\theta$ is trained, a sample $z_0$ is generated from noise $z_1 = \epsilon$ via:

$$z_r = z_t - (t-r)\, u(z_t, r, t)$$

For one-step sampling from prior noise to data, set $t=1$ (noise state) and $r=0$ (data state): $$z_0 = z_1 - (1-0)\, u(z_1, 0, 1) = \epsilon - u(\epsilon, 0, 1)$$

Algorithm 2: 1-step Sampling

Algorithm 2 MeanFlow: 1-step Sampling
Input: model u_theta
Output: generated data x_0

1.  Sample epsilon ~ p_prior
2.  x_0 = epsilon - u_theta(epsilon, r=0, t=1)   # single function evaluation

This single evaluation of $u_\theta$ generates the final sample, making inference highly efficient. Few-step sampling is also straightforward by applying Eq. (12) iteratively, as in the sketch below.
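A minimal sketch of this n-step sampler, under the same illustrative `u_net` assumption as above; `steps=1` reduces exactly to Algorithm 2.

```python
import jax.numpy as jnp

def meanflow_sample(params, u_net, eps, steps=1):
    """Iterate z_r = z_t - (t - r) u(z_t, r, t) from t=1 down to t=0 (sketch)."""
    times = jnp.linspace(1.0, 0.0, steps + 1)
    z = eps
    for i in range(steps):
        t, r = times[i], times[i + 1]
        z = z - (t - r) * u_net(params, z, r, t)   # one NFE per step
    return z
```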

4.2 Mean Flows with Guidance

Classifier-Free Guidance (CFG) [18] is a technique commonly used in diffusion models to improve sample quality by amplifying the effect of a given class condition (e.g., "generate a cat"). It typically involves running the model twice: once with the desired condition and once unconditionally, then linearly combining their outputs. This often doubles the NFE.

MeanFlow integrates CFG by constructing a guided ground-truth velocity field that the network learns directly, thus preserving 1-NFE sampling.

Ground-truth Fields with CFG

The paper defines a new ground-truth instantaneous velocity field $v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c})$ that incorporates guidance: $$v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c}) \triangleq \omega\, v(z_t, t \mid \mathbf{c}) + (1-\omega)\, v(z_t, t)$$

  • $v(z_t, t \mid \mathbf{c})$: The instantaneous velocity conditioned on a class condition $\mathbf{c}$.

  • $v(z_t, t)$: The class-unconditional instantaneous velocity, obtained by averaging over all possible conditions: $v(z_t, t) \triangleq \mathbb{E}_{\mathbf{c}}[v(z_t, t \mid \mathbf{c})]$.

  • $\omega$: The guidance scale, a hyperparameter controlling the strength of the guidance; higher $\omega$ means stronger guidance.

    Following the MeanFlow Identity, the corresponding average velocity field $u^{\mathrm{cfg}}$ must satisfy: $$u^{\mathrm{cfg}}(z_t, r, t \mid \mathbf{c}) = v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c}) - (t-r)\, \frac{d}{dt} u^{\mathrm{cfg}}(z_t, r, t \mid \mathbf{c})$$

Training with Guidance

The neural network $u_\theta^{\mathrm{cfg}}$ is parameterized to directly model $u^{\mathrm{cfg}}$. The loss function is similar to the unguided case: $$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u_\theta^{\mathrm{cfg}}(z_t, r, t \mid \mathbf{c}) - \mathrm{sg}(u_{\mathrm{tgt}}) \big\|_2^2$$ where $u_{\mathrm{tgt}}$ is now defined using a modified instantaneous velocity $\tilde{v}_t$: $$u_{\mathrm{tgt}} = \tilde{v}_t - (t-r)\big( \tilde{v}_t\, \partial_z u_\theta^{\mathrm{cfg}} + \partial_t u_\theta^{\mathrm{cfg}} \big)$$ The modified instantaneous velocity $\tilde{v}_t$ is the crucial part for CFG integration: $$\tilde{v}_t \triangleq \omega\, v_t + (1-\omega)\, u_\theta^{\mathrm{cfg}}(z_t, t, t)$$

  • $v_t$: The sample-conditional instantaneous velocity (e.g., $\epsilon - x$).

  • $u_\theta^{\mathrm{cfg}}(z_t, t, t)$: The network's prediction of average velocity over an infinitesimally small interval ($r = t$). In this limit, average velocity equals instantaneous velocity, so $u_\theta^{\mathrm{cfg}}(z_t, t, t)$ serves as the network's prediction of the unconditional instantaneous velocity $v(z_t, t)$. This is where the unconditional part of CFG comes from.

    To expose the network to class-unconditional information, a dropout mechanism is applied to the class condition $\mathbf{c}$ during training (e.g., setting $\mathbf{c}$ to a null token for a fraction of samples), as in standard CFG [18]. When $\mathbf{c}$ is dropped, the model learns to predict the unconditional velocity. A sketch of the resulting training target follows.
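A minimal sketch of the guided training target, extending the earlier `meanflow_loss`; the extra class argument, the null-token convention, and all names are illustrative assumptions, not the paper's code.

```python
import jax
import jax.numpy as jnp

def meanflow_cfg_loss(params, u_net, x, eps, r, t, c, null_c, omega):
    """MeanFlow loss with CFG baked into the target (sketch).
    u_net(params, z, r, t, c); null_c marks the dropped (unconditional) class."""
    t_b, r_b = t[:, None], r[:, None]
    z_t = (1.0 - t_b) * x + t_b * eps
    v_t = eps - x

    # Unconditional instantaneous velocity via u_theta(z_t, t, t) with null class.
    v_uncond = u_net(params, z_t, t, t, null_c)
    v_tilde = omega * v_t + (1.0 - omega) * v_uncond   # modified velocity tilde v_t

    u_fn = lambda z, r_, t_: u_net(params, z, r_, t_, c)
    u_val, du_dt = jax.jvp(u_fn, (z_t, r, t),
                           (v_tilde, jnp.zeros_like(r), jnp.ones_like(t)))
    u_tgt = jax.lax.stop_gradient(v_tilde - (t_b - r_b) * du_dt)
    return jnp.mean((u_val - u_tgt) ** 2)
```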

Single-NFE Sampling with CFG

Because $u_\theta^{\mathrm{cfg}}$ is trained to directly model the guided average velocity $u^{\mathrm{cfg}}$, sampling remains 1-NFE. There is no need for a separate unconditional run and subsequent linear combination at inference time; the network outputs the guided average velocity in a single forward pass: $$x_0 = \epsilon - u_\theta^{\mathrm{cfg}}(\epsilon, 0, 1 \mid \mathbf{c})$$

Improved CFG for MeanFlow (Appendix B.1)

The paper introduces an improvement to the CFG formulation by adding a mixing scale $\kappa$, allowing a more explicit mix of class-conditional and class-unconditional predictions within the target $\tilde{v}_t$. The ground-truth instantaneous velocity is redefined as: $$v^{\mathrm{cfg}}(z_t, t \mid \mathbf{c}) = \omega\, v(z_t, t \mid \mathbf{c}) + \kappa\, u^{\mathrm{cfg}}(z_t, t, t \mid \mathbf{c}) + (1-\omega-\kappa)\, u^{\mathrm{cfg}}(z_t, t, t)$$ And the modified instantaneous velocity for the training target becomes: $$\tilde{v}_t \triangleq \omega \underbrace{(\epsilon - x)}_{\text{sample } v_t} + \kappa\, \underbrace{u_\theta^{\mathrm{cfg}}(z_t, t, t \mid \mathbf{c})}_{\text{cls-cond output}} + (1-\omega-\kappa)\, \underbrace{u_\theta^{\mathrm{cfg}}(z_t, t, t)}_{\text{cls-uncond output}}$$

  • $\kappa$: A mixing scale.

  • $u_\theta^{\mathrm{cfg}}(z_t, t, t \mid \mathbf{c})$: The network's prediction of the conditional instantaneous velocity.

  • $u_\theta^{\mathrm{cfg}}(z_t, t, t)$: The network's prediction of the unconditional instantaneous velocity.

    This formulation further improves generation quality by explicitly training the network to produce both conditional and unconditional outputs, mirroring the random dropout strategy of standard CFG.

4.3 Design Decisions

Loss Metrics

The paper investigates different forms of loss functions. The base loss is the squared L2 loss, $\mathcal{L} = \|\Delta\|_2^2$, where $\Delta = u_\theta - u_{\mathrm{tgt}}$. They use adaptive loss weighting [15], which effectively transforms the squared L2 loss into a powered L2 loss $\mathcal{L}_\gamma = \|\Delta\|_2^{2\gamma}$. This is achieved by weighting the squared L2 loss with $w = 1/(\|\Delta\|_2^2 + c)^p$, where $p = 1-\gamma$ and $c$ is a small constant.

  • If $p=0$ ($\gamma=1$), this is the standard squared L2 loss.
  • If $p=0.5$ ($\gamma=0.5$), it resembles the Pseudo-Huber loss [43], which is less sensitive to large errors. A sketch of the weighting follows.
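A minimal sketch of this adaptive weighting; applying a stop-gradient to the weight, as below, is an implementation assumption consistent with treating $w$ as a fixed per-sample coefficient.

```python
import jax
import jax.numpy as jnp

def adaptive_weighted_loss(u_pred, u_tgt, p=1.0, c=1e-3):
    """w = 1 / (||delta||^2 + c)^p applied to the squared L2 loss (sketch);
    p = 0 recovers plain squared L2, p = 0.5 approximates Pseudo-Huber."""
    delta_sq = jnp.sum((u_pred - u_tgt) ** 2, axis=-1)   # per-sample ||delta||^2
    w = 1.0 / (delta_sq + c) ** p
    return jnp.mean(jax.lax.stop_gradient(w) * delta_sq)
```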

Sampling Time Steps (r, t)

The choice of how $r$ and $t$ are sampled from the time domain (usually $[0, 1]$) affects training.

  • Uniform distribution $\mathcal{U}(0, 1)$: $r$ and $t$ are chosen uniformly.
  • Logit-normal (lognorm) distribution [11]: Samples are drawn from a normal distribution $\mathcal{N}(\mu, \sigma)$ and mapped to $(0, 1)$ via the logistic function. This concentrates samples in specific regions of the time domain, which can be beneficial. After sampling, $t$ is assigned the larger value and $r$ the smaller, ensuring $r < t$. For a certain fraction of samples, $r = t$ is used instead, which reduces the target to Flow Matching. A sampler sketch follows.
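A minimal JAX sketch of this $(r, t)$ sampler; the argument names and defaults (e.g., `mu=0.4, sigma=1.0`) mirror the lognorm(0.4, 1.0) setting reported later but are otherwise illustrative.

```python
import jax
import jax.numpy as jnp

def sample_r_t(key, batch, mu=0.4, sigma=1.0, ratio_r_neq_t=0.25):
    """Draw two logit-normal times, set t = max and r = min, then force r = t
    on a (1 - ratio) fraction of samples (sketch)."""
    k1, k2, k3 = jax.random.split(key, 3)
    a = jax.nn.sigmoid(mu + sigma * jax.random.normal(k1, (batch,)))
    b = jax.nn.sigmoid(mu + sigma * jax.random.normal(k2, (batch,)))
    t = jnp.maximum(a, b)
    r = jnp.minimum(a, b)
    keep = jax.random.uniform(k3, (batch,)) < ratio_r_neq_t
    r = jnp.where(keep, r, t)        # r = t reduces to the Flow Matching case
    return r, t
```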

Conditioning on (r, t)

The neural network needs to be conditioned on the time variables $r$ and $t$. Positional embedding [48] is used for this. The paper explores different ways to represent these time variables as inputs to the network:

  • Directly embedding $(t, r)$.

  • Embedding $(t, t-r)$, where $t-r$ is the interval length $\Delta t$.

  • Embedding $(t, r, -r)$, which is equivalent to $(t, r, \Delta t)$ up to a linear reparameterization when $\Delta t = t-r$.

  • Embedding only $t-r$.

    The choice of embedding can influence the network's ability to learn the relationships between $u$, $r$, and $t$. A sketch of the $(t, t-r)$ scheme follows.
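A minimal sketch of sinusoidal time embeddings combined in the $(t, t-r)$ scheme; summing the two embeddings directly (rather than passing each through the 2-layer MLP described later) is a simplification for illustration.

```python
import jax.numpy as jnp

def timestep_embedding(t, dim=256):
    """Sinusoidal positional embedding of a batch of scalar times (sketch)."""
    half = dim // 2
    freqs = jnp.exp(-jnp.log(10000.0) * jnp.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)

def condition_embedding(r, t, dim=256):
    """(t, t - r) conditioning: embed each time variable and combine (sketch)."""
    return timestep_embedding(t, dim) + timestep_embedding(t - r, dim)
```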

B.3 On the Sufficiency of the MeanFlow Identity (Appendix B.3)

The paper explicitly shows that the MeanFlow Identity (Eq. 6) is not only a necessary condition derived from the definition of average velocity (Eq. 3), but also a sufficient condition. This means that if a function uu satisfies Eq. (6), it also satisfies Eq. (3).

The proof defines a displacement field $S(z_t, r, t) = (t-r)\, u(z_t, r, t)$. If the MeanFlow Identity holds, then $\frac{d}{dt} S(z_t, r, t) = v(z_t, t)$. Integrating both sides with respect to $t$ yields $S(z_t, r, t) + C_1 = \int_r^t v(z_\tau, \tau)\, d\tau + C_2$. By definition, $S|_{t=r} = 0$ (since $t-r = 0$), and the integral $\int_r^t v\, d\tau$ is also 0 when $t = r$. These boundary conditions force $C_1 = C_2$, proving sufficiency. This is important because a network satisfying Eq. (6) then correctly models the average velocity as defined by the integral.

B.4 Analysis on Jacobian-Vector Product (JVP) Computation (Appendix B.4)

While JVP computations can sometimes be a bottleneck, the paper clarifies that in MeanFlow it is very lightweight. The key reason is that the JVP result (the total derivative term) is part of the regression target $u_{\mathrm{tgt}}$ and is subject to the stop-gradient (sg) operation, so gradients for optimizing the network parameters $\theta$ are not propagated through the JVP computation.

The JVP itself is a forward-mode differentiation pass, roughly as costly as one extra function evaluation; it computes derivatives only with respect to the inputs $(z_t, r, t)$, not the network parameters $\theta$, and is therefore cheaper than a full backpropagation to update $\theta$. The empirical benchmark shows an overhead of less than 20% of total training time, confirming its efficiency.

Experimental Setup

Datasets

  1. ImageNet 256×256 [7]:

    • Description: A large-scale dataset of natural images, widely used for image classification and generation tasks. It contains millions of images across thousands of object categories.
    • Characteristics: High resolution (256×256 pixels), diverse content (animals, objects, scenes), and class-conditional labels.
    • Usage: Used for the major experiments, evaluating class-conditional image generation.
    • Data Example: A picture of a golden retriever from the dataset.
    • Preprocessing: Images are first processed by a pre-trained VAE tokenizer [37], converting the 256×256×3 (height × width × channels) pixel input into a lower-dimensional latent representation of 32×32×4. The MeanFlow model then operates on this latent space, a common practice in latent diffusion models to reduce computational load.
  2. CIFAR-10 [25]:

    • Description: A smaller dataset of 60,000 32×32 color images in 10 classes, with 6,000 images per class.
    • Characteristics: Low resolution (32×32 pixels), 10 common object classes (e.g., airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
    • Usage: Used for unconditional image generation experiments, directly in pixel space.
    • Data Example: A small 32×32 pixel image of a frog.

Evaluation Metrics

Fréchet Inception Distance (FID) [17]:

  • Conceptual Definition: FID is a metric used to assess the quality of images generated by a generative model. It measures the "distance" between the distribution of real images and the distribution of generated images. A lower FID score indicates better quality and higher realism, meaning the generated images are more similar to real images and the diversity of generated images matches that of real images. It is commonly considered a more robust metric than Inception Score because it takes into account the statistical properties of both real and generated images.
  • Mathematical Formula: $$\mathrm{FID} = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big)$$
  • Symbol Explanation:
    • $\mu_x$: The mean feature vector of real images, extracted from an Inception-v3 activation layer.

    • $\mu_g$: The mean feature vector of generated images, extracted from the same layer.

    • $\|\cdot\|_2^2$: The squared Euclidean (L2) distance between the mean feature vectors.

    • $\Sigma_x$: The covariance matrix of feature vectors for real images.

    • $\Sigma_g$: The covariance matrix of feature vectors for generated images.

    • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).

    • $(\Sigma_x \Sigma_g)^{1/2}$: The matrix square root of the product of the covariance matrices.

      A set of 50,000 generated images (FID-50K) is typically used to compute the score, ensuring statistical robustness. A computation sketch follows.
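A minimal sketch of the FID computation from precomputed feature statistics; extracting the Inception-v3 features themselves is omitted.

```python
import numpy as np
from scipy import linalg

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID from the means and covariances of real/generated features (sketch)."""
    diff = mu_x - mu_g
    covmean = linalg.sqrtm(sigma_x @ sigma_g).real   # drop tiny imaginary parts
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```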

Baselines

The paper compares MeanFlow against a range of state-of-the-art (SOTA) generative models, categorized by their underlying approach:

  1. One-step Diffusion/Flow Models (trained from scratch):

    • iCT-XL/2 [43]: The improved Consistency Training model.
    • Shortcut-XL/2 [13]: A shortcut model approach for one-step diffusion.
    • iMM-XL/2 [52]: Inductive Moment Matching model.
  2. Other Families of Generative Models (for reference):

    • GANs (Generative Adversarial Networks):
      • BigGAN [5]
      • GigaGAN [21]
      • StyleGAN-XL [40]
    • Autoregressive/Masking Models:
      • AR w/ VQGAN [10]
      • MaskGIT [6]
      • VAR-d30 [47]
      • MAR-H [27]
    • Multi-step Diffusion/Flow Models:
      • ADM [8] (Ablated Diffusion Model)

      • LDM-4-G [37] (Latent Diffusion Models)

      • SimDiff [20] (Simple Diffusion)

      • DiT-XL/2 [34] (Diffusion Transformers)

      • SiT-XL/2 [33] (Scalable Interpolant Transformers)

      • SiT-XL/2+REPA [51] (with Representation Alignment)

        These baselines represent different paradigms in generative modeling, with DiT and SiT being particularly strong Transformer-based diffusion models that typically perform very well but require many NFEs.

Model Architecture and Training Details

ImageNet 256×256

  • Backbone: DiT [34] architecture, based on Vision Transformers (ViT) [9] with adaLN-Zero for conditioning. ViT divides an image into patches, treats them as tokens, and processes them with a Transformer encoder. adaLN-Zero is an adaptive layer normalization technique that effectively integrates conditioning information (like time and class labels).
  • Conditioning: Positional embedding [48] is applied to each time variable ($r$ and $t$), followed by a 2-layer Multi-Layer Perceptron (MLP); the outputs are summed to form the conditioning input for the Transformer. The primary conditioning scheme is $(t, t-r)$.
  • Model Sizes: Ablation studies use ViT-B/4 (Base size, patch size 4×4). For the main results, various sizes are used: B/2, M/2, L/2, XL/2 (Base, Medium, Large, Extra-Large, all with patch size 2×2).
  • Training Schedule: Models are trained from scratch for 80 to 1000 epochs.
  • Optimizer: Adam [24] with specific hyperparameters:
    • Learning Rate: 0.0001 (constant schedule)
    • $\beta_1 = 0.9$, $\beta_2 = 0.95$
    • Weight Decay: 0.0
  • Batch Size: Not specified in the main text for ImageNet, but typically large for DiT architectures.
  • Dropout: 0.0
  • EMA Decay: 0.9999 (exponential moving average of model weights, used to stabilize training and improve performance).
  • Time Step Sampling: Logit-normal distribution lognorm(0.4, 1.0) is typically used for sampling $r$ and $t$.
  • Ratio of $r \neq t$: 25% in the ablation (i.e., 75% of samples use $r = t$, which reduces to Flow Matching). For the main results, the configuration table implies 25% of samples use $r = t$.
  • Adaptive Weighting: Power $p = 1.0$ is used for the adaptive loss weighting.
  • CFG (Classifier-Free Guidance):
    • Effective guidance scale $\omega'$ typically 2.0 or 2.5.
    • Class-conditional dropout [18] rate of 0.1 (i.e., 10% of samples are trained unconditionally).
    • CFG is triggered for specific ranges of $t$ values depending on model size and configuration.
    • Mixing scale $\kappa$ (from Improved CFG) is also used for the XL/2+ configuration.

CIFAR-10

  • Backbone: U-net [38] architecture (approximately 55M parameters), a common choice for image generation, particularly in diffusion models [44].
  • Input: Direct pixel space, 32×32×3.
  • Conditioning: Positional embedding on $(t, t-r)$, concatenated for the network.
  • Preconditioner: No EDM preconditioner [22] is used, unlike many other SOTA CIFAR-10 models.
  • Optimizer: Adam
    • Learning Rate: 0.0006
    • Batch Size: 1024
    • $\beta_1 = 0.9$, $\beta_2 = 0.999$
    • Dropout: 0.2
    • Weight Decay: 0
    • EMA Decay: 0.99995
  • Training Schedule: 800K iterations (including 10K warm-up).
  • Time Step Sampling: Logit-normal distribution lognorm(2.0, 2.0).
  • Ratio of $r \neq t$: 75%.
  • Adaptive Weighting: Power $p = 0.75$.
  • Data Augmentation: Follows [22], but with vertical flipping and rotation disabled.

Results & Analysis

5.1 Ablation Study

The ablation study, conducted on ImageNet 256×256 with a ViT-B/4 backbone trained for 80 epochs, investigates the impact of various MeanFlow design choices on 1-NFE FID performance. Default settings (grayed out in the paper's table) correspond to the 61.06-FID entries below.

Table 1: Ablation study on 1-NFE ImageNet 256×256 generation. FID-50K is evaluated.

Manual Transcription of Table 1:

(a) Ratio of sampling $r \neq t$. The 0% entry reduces to the standard Flow Matching baseline.

% of r≠t FID, 1-NFE
0% (= FM) 328.91
25% 61.06
50% 63.14
100% 67.32

Analysis (Table 1a): This ablation highlights the necessity of the average velocity concept. When the ratio of sampling $r \neq t$ is 0%, the model reduces to standard Flow Matching ($r = t$ makes the second term in the MeanFlow Identity vanish), yielding a very poor FID of 328.91: standard Flow Matching is not suited to 1-NFE generation. Introducing a non-zero ratio of $r \neq t$ improves performance dramatically, with 25% achieving the best FID of 61.06. This confirms that explicitly training with intervals (where $r \neq t$) is crucial for MeanFlow's effectiveness in one-step generation: the model must learn to bridge large time gaps, not just instantaneous velocities.

(b) JVP computation. The correct jvp tangent is $(v, 0, 1)$ for the Jacobian $(\partial_z u, \partial_r u, \partial_t u)$.

jvp tangent FID, 1-NFE
(v, 0, 1) 61.06
(v, 0, 0) 268.06
(v, 1, 0) 329.22
(v, 1, 1) 137.96

Analysis (Table 1b): This destructive ablation underscores the importance of the correct mathematical formulation of the MeanFlow Identity, particularly the total derivative term. The correct jvp tangent $(v, 0, 1)$ (representing $\frac{dz_t}{dt}, \frac{dr}{dt}, \frac{dt}{dt}$, respectively) yields the best performance (61.06 FID). Incorrect tangents (e.g., $(v, 0, 0)$, which assumes $t$ is constant, or $(v, 1, 0)$, which incorrectly lets $r$ vary with $t$) lead to significantly worse FID scores (268.06, 329.22, 137.96). This strongly validates the theoretical derivation of the MeanFlow Identity and its total derivative decomposition.

(c) Positional embedding. The network is conditioned on the embeddings applied to the specified variables.

pos. embed FID, 1-NFE
(t, r) 61.75
(t, t−r) 61.06
(t, r, −r) 63.98
t−r only 63.13

Analysis (Table 1c): This explores different ways to encode the time information $(r, t)$ for the neural network. All tested variants produce meaningful 1-NFE results, demonstrating the robustness of MeanFlow as a framework. The best performance is achieved by conditioning on $(t, t-r)$ (61.06 FID), suggesting that providing both the current time $t$ and the interval duration $t-r$ is most informative; $(t, r)$ performs nearly as well (61.75 FID). Even using only the interval $t-r$ yields a reasonable result (63.13 FID), highlighting the importance of the interval itself.

(d) Time samplers. $t$ and $r$ are sampled from the specified sampler.

t, r sampler FID, 1-NFE
uniform(0, 1) 65.90
lognorm(0.2, 1.0) 63.83
lognorm(0.2, 1.2) 64.72
lognorm(0.4, 1.0) 61.06
lognorm(0.4, 1.2) 61.79

Analysis (Table 1d): The distribution from which $r$ and $t$ are sampled significantly impacts performance. Logit-normal (lognorm) samplers consistently outperform the uniform distribution (65.90 FID); among them, lognorm(0.4, 1.0) achieves the best FID (61.06). This aligns with observations in Flow Matching research, where non-uniform time sampling often leads to better results by focusing training on more critical regions of the time domain.

(e) Loss metrics. $p = 0$ is the squared L2 loss; $p = 0.5$ corresponds to the Pseudo-Huber loss.

p FID, 1-NFE
0.0 79.75
0.5 63.98
1.0 61.06
1.5 66.57
2.0 69.19

Analysis (Table 1e): This evaluates the adaptive loss weighting power $p$. The standard squared L2 loss ($p=0$) performs worst (79.75 FID), though it still yields meaningful results. Increasing $p$ (which effectively down-weights larger errors) improves performance, with $p=1.0$ achieving the best FID (61.06). This indicates that moderately down-weighting very large errors during training is beneficial, consistent with findings in other few-step generative models (e.g., Pseudo-Huber loss at $p=0.5$).

(f) CFG scale. Our method supports 1-NFE CFG sampling.

ω′ FID, 1-NFE
1.0 (w/o cfg) 61.06
1.5 33.33
2.0 20.15
3.0 15.53
5.0 20.75

Analysis (Table 1f): This shows the significant impact of Classifier-Free Guidance (CFG). Without CFG ($\omega' = 1.0$), the FID is 61.06. Introducing CFG dramatically improves the FID, reaching a minimum of 15.53 at an effective guidance scale of $\omega' = 3.0$. This demonstrates that MeanFlow successfully integrates CFG in a 1-NFE manner, yielding substantial quality gains, consistent with observations in multi-step diffusion models. Too high a guidance scale (e.g., $\omega' = 5.0$) can degrade quality (20.75 FID), a common phenomenon in CFG.

Scalability

Figure 4: Scalability of MeanFlow models on ImageNet 256×256. 1-NFE generation FID is reported; all models are trained from scratch, and CFG is applied while maintaining 1-NFE sampling. The figure plots 1-NFE FID versus training epochs for model sizes B/2, M/2, L/2, and XL/2: larger models achieve lower FID.

Analysis (Figure 4): This figure illustrates the scalability of MeanFlow on ImageNet 256×256 1-NFE generation. As the model size increases from B/2 to XL/2, the FID consistently decreases, indicating improved generation quality. This trend is consistent with other Transformer-based generative models like DiT and SiT, where larger models generally perform better given sufficient training. The plot also shows that CFG, applied while keeping 1-NFE, enables strong performance across scales, suggesting that MeanFlow can benefit from increased computational resources and model capacity.

5.2 Comparisons with Prior Work

Table 2: Class-conditional generation on ImageNet 256×256. All entries are reported with CFG, when applicable.

Manual Transcription of Table 2 (Left): Left: 1-NFE and 2-NFE diffusion/flow models trained from scratch.

method params (M) NFE FID
1-NFE diffusion/flow from scratch
iCT-XL/2 [43]† 675 1 34.24
Shortcut-XL/2 [13] 675 1 10.60
MeanFlow-B/2 131 1 6.17
MeanFlow-M/2 308 1 5.01
MeanFlow-L/2 459 1 3.84
MeanFlow-XL/2 676 1 3.43
2-NFE diffusion/flow from scratch
iCT-XL/2 [43]† 675 2 20.30
iMM-XL/2 [52] 675 1×2 7.77
MeanFlow-XL/2 676 2 2.93
MeanFlow-XL/2+ 676 2 2.20

Manual Transcription of Table 2 (Right): Right: Other families of generative models as a reference. In both tables, 1×2 indicates that CFG incurs an NFE of 2 per sampling step. Our MeanFlow models are all trained for 240 epochs, except that "MeanFlow-XL/2+" is trained for more epochs with configurations selected for longer training, as specified in the appendix. †: iCT [43] results are reported by [52].

method params (M) NFE FID
GANs
BigGAN [5] 112 1 6.95
GigaGAN [21] 569 1 3.45
StyleGAN-XL [40] 166 1 2.30
autoregressive/masking
AR w/ VQGAN [10] 227 1024 26.52
MaskGIT [6] 227 8 6.18
VAR-d30 [47] 2000 10×2 1.92
MAR-H [27] 943 256×2 1.55
diffusion/flow
ADM [8] 554 250×2 10.94
LDM-4-G [37] 400 250×2 3.60
SimDiff [20] 2000 512×2 2.77
DiT-XL/2 [34] 675 250×2 2.27
SiT-XL/2 [33] 675 250×2 2.06
SiT-XL/2+REPA [51] 675 250×2 1.42

Analysis (ImageNet 256×256 Comparisons):

  • 1-NFE Performance: MeanFlow-XL/2 achieves an FID of 3.43 with 1-NFE, a significant breakthrough over previous SOTA one-step models:
    • Shortcut-XL/2 (previous SOTA 1-NFE) had an FID of 10.60; MeanFlow improves on this by a relative margin of nearly 70%.
    • iCT-XL/2 (another 1-NFE model) had an FID of 34.24.
    • iMM-XL/2 reported 1-step results, but these involve 2 NFE due to CFG (1×2 in the table), achieving 7.77 FID; MeanFlow still significantly outperforms it.
  • Closing the Gap with Multi-step Models: The 3.43 FID of MeanFlow with 1-NFE is remarkably close to, and even surpasses, some multi-step diffusion models that require hundreds of NFEs, such as ADM (10.94 FID at 250×2 NFE) and LDM-4-G (3.60 FID at 250×2 NFE).
  • 2-NFE Performance: The MeanFlow-XL/2+ model achieves an FID of 2.20 with 2-NFE, highly competitive with leading multi-step Transformer-based models like DiT-XL/2 (2.27 FID at 250×2 NFE) and SiT-XL/2 (2.06 FID at 250×2 NFE). This strongly supports the claim that few-step diffusion/flow models can now rival their many-step predecessors.
  • Self-contained and From Scratch: MeanFlow achieves these results from scratch, without pre-training, distillation, or curriculum learning, a significant advantage in terms of simplicity and robustness.
  • Comparison to GANs: MeanFlow's 1-NFE performance (3.43 FID) is competitive with SOTA GANs like GigaGAN (3.45 FID), though StyleGAN-XL (2.30 FID) remains slightly better at 1-NFE. However, GANs have their own training stability challenges.

Table 3: Unconditional CIFAR-10.

Manual Transcription of Table 3:

method precond NFE FID
iCT [43] EDM 1 2.83
ECT [15] EDM 1 3.60
sCT [31] EDM 1 2.97
IMM [52] EDM 1 3.20
MeanFlow none 1 2.92

Analysis (CIFAR-10 Comparisons): On the CIFAR-10 dataset for unconditional generation, MeanFlow achieves an FID of 2.92 with 1-NFE. This is competitive with other 1-NFE Consistency Model variants like iCT (2.83 FID) and sCT (2.97 FID), and outperforms ECT (3.60 FID) and IMM (3.20 FID). A notable point is that MeanFlow achieves this performance without using an EDM preconditioner, which is typically employed by other SOTA models for CIFAR-10 to improve stability and performance. This suggests that MeanFlow's core mechanism is robust even without additional preconditioning techniques.

B.1 Improved CFG for MeanFlow (Appendix B.1)

Table 5: Improved CFG for MeanFlow.

Manual Transcription of Table 5:

κ FID, 1-NFE
0.0 20.15
0.5 19.15
0.8 19.10
0.9 18.63
0.95 19.17

Analysis (Table 5): This ablation explores the effect of the mixing scale $\kappa$ in the Improved CFG formulation (Eqs. 20 and 21). With the effective guidance scale $\omega'$ fixed at 2.0, varying $\kappa$ yields further FID improvements. At $\kappa = 0.0$ the FID is 20.15; mixing class-conditional and class-unconditional $u_\theta^{\mathrm{cfg}}$ terms in the target reduces the FID to a minimum of 18.63 at $\kappa = 0.9$. This demonstrates that explicitly blending conditional and unconditional aspects within the CFG target, analogous to the dropout mechanism in standard CFG, further enhances MeanFlow's generation quality.

Qualitative Results

Figure 5: 1-NFE Generation Results. Curated examples of class-conditional generation on ImageNet 256×256 using the 1-NFE model (MeanFlow-XL/2, 3.43 FID), covering a variety of natural scenes and animals.

Analysis (Figure 5): The figure showcases curated examples of class-conditional images generated by the MeanFlow-XL/2 model on ImageNet 256×256 with 1-NFE. The images exhibit high fidelity, sharpness, and diversity, closely resembling real photographs across various categories. This visual evidence supports the quantitative FID results, confirming the model's ability to produce high-quality samples in a single step.

Conclusion & Personal Thoughts

Conclusion Summary

This paper introduces MeanFlow, a novel and principled framework for one-step generative modeling. Its core innovation is the concept of average velocity to characterize flow fields, contrasting with the instantaneous velocity typically modeled by Flow Matching methods. By deriving and leveraging the MeanFlow Identity, which fundamentally relates average and instantaneous velocities, the method provides a robust and self-contained training objective. MeanFlow requires no pre-training, distillation, or curriculum learning, simplifying its application. It achieves state-of-the-art performance for 1-NFE generation on ImageNet 256×256 (FID 3.43), significantly outperforming previous one-step models and substantially narrowing the gap with multi-step diffusion/flow models. Furthermore, MeanFlow naturally integrates Classifier-Free Guidance (CFG) without increasing the number of function evaluations (NFE) during sampling.

Limitations & Future Work (Authors' Perspective)

The authors identify that MeanFlow's strong performance largely closes the gap between one-step and many-step diffusion/flow models, suggesting that future research could revisit the foundations of these models. They also explicitly state that orthogonal improvements, such as REPA [51] (Representation Alignment), are applicable and left for future work. The paper concludes by suggesting that its formulation, by describing quantities at coarsened levels of granularity, could bridge research in generative modeling, simulation, and dynamical systems in related fields (e.g., multi-scale simulation problems in physics).

Personal Insights & Critique

The MeanFlow paper presents a highly impactful contribution to the field of generative AI. The shift from instantaneous velocity to average velocity is conceptually elegant and mathematically sound. The rigorous derivation of the MeanFlow Identity is a significant strength, providing a principled foundation that many other few-step methods (which often rely on heuristics or empirically tuned consistency constraints) lack.

Strengths:

  • Principled Approach: The derivation of the MeanFlow Identity from first principles is commendable. It provides a strong theoretical underpinning for the model's effectiveness.
  • Self-Contained and From Scratch: The ability to achieve SOTA 1-NFE performance without distillation or curriculum learning is a major practical advantage. It makes MeanFlow easier to train, more robust, and less dependent on complex multi-stage pipelines.
  • Efficient CFG Integration: The native inclusion of CFG for 1-NFE sampling is a clever design choice that overcomes a common limitation of few-step models.
  • Exceptional Performance: The FID scores achieved are truly impressive, especially for 1-NFE. Closing the gap with multi-step models is a significant milestone for real-time generative applications.
  • Scalability: The demonstrated scalability is crucial for leveraging larger Transformer architectures and further improving quality.

Possible Improvements / Open Questions:

  • Computational Cost of JVP: While the authors state the JVP is lightweight (16% overhead), this still adds to training time. Exploring ways to approximate or further optimize this derivative calculation could be beneficial, especially for extremely large models or resource-constrained environments.
  • Generalization of Average Velocity: The concept of average velocity is inherently tied to a specific time interval [r, t]. Could this idea be extended to other forms of "averaging" or "coarsening" in different contexts (e.g., spatial averaging, multi-resolution flows)?
  • Robustness to Diverse Data: While ImageNet is a challenging dataset, further validation on more diverse or specialized datasets (e.g., medical images, complex scientific data) could assess the model's broader applicability.
  • Theoretical Guarantees: While the derivation is principled, further theoretical analysis regarding convergence properties, stability, and potential for mode collapse in MeanFlow compared to standard Flow Matching could provide deeper insights.
  • Hyperparameter Sensitivity: The ablation studies show sensitivity to pp in the loss function, time samplers, and CFG scale. Further work on adaptive or robust strategies for these hyperparameters could make the model even more user-friendly.

Broader Impact: MeanFlow has the potential to significantly accelerate the adoption of generative models in real-world applications where low latency is critical, such as interactive image editing, real-time content generation in games or virtual reality, and faster prototyping in design. The principled nature of its average velocity concept could also inspire new theoretical developments in the understanding and construction of continuous-time generative models, potentially leading to more stable and efficient algorithms across various domains. The connection to multi-scale simulation problems hinted at in the conclusion is particularly intriguing, suggesting that MeanFlow might offer a fresh perspective on complex scientific and engineering challenges.
