Mean Flows for One-step Generative Modeling
TL;DR Summary
MeanFlow introduces average velocity and an identity guiding training, enabling one-step generative modeling without pre-training. It achieves a 3.43 FID on ImageNet 256×256 with a single function evaluation, outperforming prior one-step models and narrowing the gap to multi-step
Abstract
We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
Bibliographic Information
- Title: Mean Flows for One-step Generative Modeling
- Authors: Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He
- Affiliations: Carnegie Mellon University (CMU), Massachusetts Institute of Technology (MIT)
- Journal/Conference: Not explicitly stated as published in a journal or conference; the paper was submitted to arXiv, a preprint server widely used in academia for rapid dissemination of research. Given the authors and the quality of the work, it is likely intended for a top-tier machine learning conference (e.g., ICLR, NeurIPS, ICML) or journal.
- Publication Year: 2025
- Abstract: This paper introduces a novel framework, MeanFlow, for one-step generative modeling. Unlike traditional Flow Matching methods that model instantaneous velocity, MeanFlow proposes to characterize flow fields using average velocity. The core innovation is a derived identity between average and instantaneous velocities, which directly guides the training of a neural network. The MeanFlow model is self-contained, meaning it does not require pre-training, distillation, or curriculum learning. Empirically, MeanFlow achieves a Fréchet Inception Distance (FID) of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256, significantly surpassing previous state-of-the-art (SOTA) one-step diffusion/flow models. The authors conclude that this work substantially reduces the performance gap between one-step and multi-step generative models, encouraging further foundational research in this area.
- Original Source Link: https://arxiv.org/abs/2505.13447
- PDF Link: https://arxiv.org/pdf/2505.13447v1.pdf
Executive Summary
Background & Motivation (Why)
The fundamental goal of generative modeling is to learn how to transform a simple, known probability distribution (like a Gaussian noise distribution) into a complex, target data distribution (like images of cats). Diffusion models and Flow Matching have emerged as highly successful frameworks for this task. However, these models traditionally require many sampling steps (or function evaluations, NFE) during the inference (generation) phase to produce high-quality samples, making them computationally expensive and slow.
The core problem this paper addresses is the challenge of one-step generative modeling – generating high-quality samples in just a single function evaluation (1-NFE). Achieving one-step generation is crucial for practical applications where speed is paramount, such as real-time content creation or interactive systems. Prior one-step or few-step methods often rely on complex techniques like distillation from multi-step models, pre-training, or carefully designed curriculum learning strategies to stabilise training and improve performance. These approaches can be intricate to implement and may introduce dependencies on other models or training procedures. The specific gap identified is the lack of a principled, self-contained framework for one-step generation that can achieve SOTA quality without these additional complexities.
The paper's novel approach, MeanFlow, introduces the concept of average velocity to characterise flow fields. Instead of modeling the instantaneous velocity at each moment in time (as Flow Matching typically does), MeanFlow directly learns the average velocity over a time interval. This conceptual shift allows for direct one-step generation by effectively "skipping" the intermediate steps.
Main Contributions / Findings (What)
- Introduction of Average Velocity and MeanFlow Identity: The paper formally defines average velocity and, through a rigorous derivation, establishes a MeanFlow Identity. This identity describes a fundamental relationship between the average velocity and the instantaneous velocity, providing a principled, self-contained mathematical basis for training a neural network for one-step generation.
- Principled One-Step Generative Framework: MeanFlow offers a self-contained framework that trains from scratch without needing pre-training, distillation, or curriculum learning. This simplifies the training process and makes the model more robust.
- Integration of Classifier-Free Guidance (CFG) for 1-NFE: The method naturally incorporates Classifier-Free Guidance (CFG) directly into its underlying ground-truth fields, enabling the benefits of CFG (improved sample quality) without incurring additional NFE during one-step sampling.
- State-of-the-Art Empirical Performance: MeanFlow achieves an FID of 3.43 with 1-NFE on ImageNet 256x256, significantly outperforming previous SOTA one-step diffusion/flow models by a relative margin of 50-70%. This performance substantially narrows the gap between one-step and multi-step generative models. It also achieves an FID of 2.20 with 2-NFE, matching or exceeding the performance of leading multi-step models (e.g., DiT, SiT) that use hundreds of NFEs.
- Scalability: The model demonstrates promising scalability with increasing model size, consistent with Transformer-based diffusion/flow models.
Prerequisite Knowledge & Related Work
To understand MeanFlow, it's essential to first grasp the foundational concepts of generative modeling, diffusion models, and Flow Matching.
Foundational Concepts
Generative Modeling
Generative modeling is a branch of machine learning focused on learning the underlying probability distribution of a dataset. Once learned, a generative model can produce new samples that are similar to the training data. For example, given a dataset of images of faces, a generative model could generate new, realistic faces that were not in the original dataset. The goal is to transform a simple, known prior distribution (e.g., standard Gaussian noise) into the complex data distribution.
Diffusion Models
Diffusion models (e.g., DDPM [19], Score-based models [44]) are a class of generative models that learn to reverse a gradual diffusion process. This process involves progressively adding noise to data until it becomes pure noise. The model then learns to denoise the data, effectively reversing the diffusion process to transform noise back into coherent data. This reversal is typically achieved by predicting the noise added at each step or the score function (gradient of the log-probability density). During sampling (generation), these models iteratively apply a denoising function over many time steps, starting from random noise.
Flow Matching
Flow Matching [28, 2, 30] is a framework closely related to diffusion models that focuses on learning velocity fields. Instead of directly predicting noise or score functions, Flow Matching learns a velocity field that defines a continuous flow path transforming a simple prior distribution (like Gaussian noise) into the target data distribution.
- Flow Path: A flow path is a continuous trajectory that connects a sample $x$ from the data distribution to a sample $\epsilon$ from the prior distribution. It is often defined as a linear interpolation:
  $$z_t = a_t x + b_t \epsilon$$
  where $t$ is time (typically from 0 to 1), and $a_t$, $b_t$ are predefined schedules (e.g., $a_t = 1 - t$, $b_t = t$).
- Instantaneous Velocity ($v_t$): The velocity at any point along this path is the time derivative of $z_t$. If $z_t = a_t x + b_t \epsilon$, then the instantaneous velocity is:
  $$v_t = \frac{d z_t}{dt} = a_t' x + b_t' \epsilon$$
  With $a_t = 1 - t$ and $b_t = t$ this reduces to $v_t = \epsilon - x$. This is often called the conditional velocity because it depends on the specific pair $(x, \epsilon)$ that generated $z_t$.
- Conditional vs. Marginal Velocity:
  - Conditional Velocity: Given a specific data point $x$ and a noise vector $\epsilon$, the conditional velocity $v_t$ describes the direction and speed of the flow path for that specific pair. As shown in Fig. 2 (left), a given point $z_t$ in the latent space can arise from different $(x, \epsilon)$ pairs, leading to different conditional velocities.
  - Marginal Velocity: Because the neural network needs to learn a consistent velocity field for any $z_t$ it encounters, Flow Matching aims to model the marginal velocity $v(z_t, t)$. This is the expected velocity at $z_t$ and time $t$, averaged over all possible $(x, \epsilon)$ pairs that could lead to $z_t$ (Fig. 2 right):
    $$v(z_t, t) = \mathbb{E}\left[ v_t \mid z_t \right]$$
    Training directly with this expectation is often intractable. However, it has been shown that minimizing a conditional Flow Matching (CFM) loss, which matches the predicted velocity to the conditional velocity, is equivalent to minimizing the marginal Flow Matching loss.
- Sampling: Once the neural network is trained to predict the velocity field, sampling new data points involves solving an Ordinary Differential Equation (ODE):
  $$\frac{d}{dt} z_t = v(z_t, t) \tag{2}$$
  This ODE is typically solved numerically using methods like Euler's method over many discrete time steps, starting from noise ($z_1 = \epsilon$ at $t = 1$) and integrating backwards to $t = 0$. Each step in this numerical integration constitutes a function evaluation (NFE).
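To make the cost of multi-step sampling concrete, here is a minimal sketch of Euler integration on a toy problem. It is not taken from the paper: it assumes 1-D Gaussian "data" with standard deviation `SIGMA_DATA` (a made-up value), for which the marginal velocity under the linear path $z_t = (1-t)x + t\epsilon$ has a closed form, so no network is needed. The names `scale`, `marginal_velocity` and `euler_sample` are illustrative.

```python
import math

SIGMA_DATA = 0.5  # assumed std of the toy 1-D Gaussian "data" distribution

def scale(t):
    """Std of z_t = (1 - t) * x + t * eps for x ~ N(0, SIGMA_DATA^2), eps ~ N(0, 1)."""
    return math.sqrt((1.0 - t) ** 2 * SIGMA_DATA ** 2 + t ** 2)

def marginal_velocity(z, t):
    """Closed-form E[eps - x | z_t = z] for this jointly Gaussian toy case."""
    return (t - (1.0 - t) * SIGMA_DATA ** 2) / scale(t) ** 2 * z

def euler_sample(z1, num_steps):
    """Integrate dz/dt = v(z, t) from t = 1 down to t = 0; one velocity call per step."""
    z, dt = z1, 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * marginal_velocity(z, t)
    return z

z1 = 1.0                     # a "noise" sample at t = 1
exact_z0 = z1 * scale(0.0)   # exact ODE solution, since v is linear in z here
for nfe in (1, 10, 1000):
    print(nfe, abs(euler_sample(z1, nfe) - exact_z0))
```

In this toy setting a single Euler step is far from the exact endpoint while many small steps approach it, which is the gap that one-step methods aim to close.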
Number of Function Evaluations (NFE)
NFE refers to how many times the neural network (which predicts velocity or noise) must be called during the sampling process to generate a single output sample.
- Multi-step Generation: Models requiring many NFE (e.g., 250-1000) are multi-step models. They are generally slower but often produce higher quality samples.
- Few-step Generation: Models aiming to reduce NFE to a small number (e.g., 2-50).
- One-step Generation (1-NFE): The ultimate goal for fast inference, where a single forward pass of the neural network generates a complete sample. This is the focus of MeanFlow.
Latent Space Models
Many generative models operate not directly on high-resolution pixels but on a compressed latent space representation. A Variational Autoencoder (VAE) [37] is commonly used as a tokenizer to convert images into a lower-dimensional latent representation and vice versa. The generative model then learns to transform noise in this latent space, and a decoder converts the generated latent representation back to pixel space. This reduces computational cost and allows Transformer-based architectures to operate more efficiently.
Positional Embedding
In Transformer architectures, positional embeddings [48] are used to inject information about the relative or absolute position of elements in a sequence (or, in this case, time variables) into the model. Since Transformers are inherently permutation-invariant, positional embeddings are crucial for the model to understand the temporal context.
Previous Works & Technological Evolution
The field has evolved towards faster generation, primarily through distillation or consistency-based methods.
- Distillation-based Methods: These approaches first train a multi-step diffusion model and then distill its knowledge into a smaller, faster model that requires fewer sampling steps (e.g., [39, 14, 41, 32]). This often means the few-step model is dependent on the performance and training of the multi-step teacher model.
- Consistency Models (CMs) [46, 43, 15, 31, 49]: These are standalone generative models designed for one-step generation without distillation. They introduce a consistency constraint during training: the network should produce consistent outputs (i.e., map to the same original data point) when evaluated at different time steps along the same flow path. CMs often use a discretization curriculum [46] where the time domain is progressively constrained, making training sensitive and requiring careful scheduling. In the notation of MeanFlow, CMs focus on paths anchored at the data side ($r \equiv 0$).
  - Example: iCT [43] and ECT [15] are variants of Consistency Models.
- Shortcut Models [13]: These models introduce a self-consistency loss function in addition to Flow Matching. They focus on relationships between flows at different discrete time intervals, effectively trying to learn a shortcut from noise to data. They are conditioned on two time variables, similar to MeanFlow's use of $r$ and $t$.
- Inductive Moment Matching (IMM) [52]: This method models the self-consistency of stochastic interpolants (similar to flow paths) at different time steps. It also uses two time variables.
Differentiation
MeanFlow differentiates itself from prior work in several key ways:
- Principled Foundation: Unlike Consistency Models, which impose consistency constraints as properties of the neural network's behavior, MeanFlow derives its core MeanFlow Identity (Eq. 6) directly from the definition of average velocity (Eq. 3). This identity exists independently of any neural network and provides a ground-truth target field for learning, leading to more stable and robust training.
- Self-Contained and From Scratch: MeanFlow does not rely on pre-training, distillation from multi-step models, or complex curriculum learning strategies often seen in Consistency Models. It is trained entirely from scratch, making it a simpler and more fundamental approach.
- Average vs. Instantaneous Velocity: The central distinction is the target: MeanFlow models average velocity across an interval $[r, t]$, whereas traditional Flow Matching and its variants model instantaneous velocity at a single point in time. This conceptual shift directly facilitates one-step generation.
- Native CFG Integration: MeanFlow integrates Classifier-Free Guidance (CFG) by defining guided ground-truth velocity fields ($v^{\text{cfg}}$ and $u^{\text{cfg}}$), allowing for 1-NFE sampling with guidance, whereas other methods might require extra NFE for CFG or struggle to integrate it seamlessly into a one-step framework.
Methodology (Core Technology & Implementation Details)
The MeanFlow model introduces the concept of average velocity to achieve efficient one-step generative modeling. This section details the mathematical derivation and practical implementation.
4.1 Mean Flows
Core Idea: Average Velocity vs. Instantaneous Velocity
Traditional Flow Matching models learn the instantaneous velocity $v(z_t, t)$, which describes the direction and magnitude of movement at a specific point $z_t$ and time $t$. To generate a sample, this instantaneous velocity must be integrated over time, typically through numerical ODE solvers that require multiple function evaluations (NFE).
MeanFlow proposes to model the average velocity $u(z_t, r, t)$ instead. This average velocity is defined over a time interval $[r, t]$ and represents the average rate of change of $z$ over that interval. The intuition is that if we can directly learn this average velocity, we can effectively "jump" from $z_t$ to $z_r$ in one step, without needing to compute all intermediate instantaneous velocities.
Definition of Average Velocity
The average velocity is defined as the displacement between two time steps $r$ and $t$ divided by the time interval $(t - r)$. The displacement itself is the integral of the instantaneous velocity over that interval:
$$u(z_t, r, t) \triangleq \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau \tag{3}$$
- $u(z_t, r, t)$: The average velocity vector. It is a function of the current state $z_t$, the start time $r$ of the interval, and the end time $t$ of the interval.
- $v(z_\tau, \tau)$: The instantaneous velocity vector at an intermediate time $\tau$.
- $\int_r^t v(z_\tau, \tau)\, d\tau$: This integral represents the total displacement of the trajectory from time $r$ to time $t$.
- $(t - r)$: The duration of the time interval.
Key characteristics of average velocity $u$:
- It is a field that depends jointly on $(r, t)$.
- As $r \to t$, the average velocity approaches the instantaneous velocity: $\lim_{r \to t} u(z_t, r, t) = v(z_t, t)$.
- It satisfies a natural consistency property: taking one large step from $r$ to $t$ is consistent with taking two smaller consecutive steps from $r$ to $s$ and then from $s$ to $t$. This is due to the additivity of integrals:
  $$(t - r)\, u(z_t, r, t) = (s - r)\, u(z_s, r, s) + (t - s)\, u(z_t, s, t)$$
  This means a neural network that accurately learns the true $u$ satisfies these consistency relations without explicit constraints.
Figure 3 visually contrasts the average velocity with the instantaneous velocity. The instantaneous velocity (tangent to the path) and average velocity (aligned with the displacement vector) are generally not aligned for curved paths.
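The two properties above (the $r \to t$ limit and additivity) can be checked numerically. The following is a minimal sketch on a toy flow, not an example from the paper: it assumes 1-D Gaussian data with a made-up standard deviation `SIGMA_DATA`, for which trajectories are $z_t = z_1 \cdot \text{scale}(t)$ and the average velocity has a closed form. All function names are illustrative.

```python
import math

SIGMA_DATA = 0.5  # assumed toy data std (hypothetical value)

def scale(t):
    """Std of z_t along the linear path; trajectories satisfy z_t = z_1 * scale(t)."""
    return math.sqrt((1.0 - t) ** 2 * SIGMA_DATA ** 2 + t ** 2)

def v(z, t):
    """Instantaneous (marginal) velocity of the toy flow."""
    return (t - (1.0 - t) * SIGMA_DATA ** 2) / scale(t) ** 2 * z

def u(z_t, r, t):
    """Average velocity: displacement (z_t - z_r) divided by (t - r)."""
    z_r = z_t * scale(r) / scale(t)  # exact position at time r on the same trajectory
    return (z_t - z_r) / (t - r)

r, s, t, z_t = 0.2, 0.5, 0.9, 1.3
z_s = z_t * scale(s) / scale(t)

# Additivity: one step r -> t covers the same displacement as r -> s plus s -> t.
one_big_step = (t - r) * u(z_t, r, t)
two_small_steps = (s - r) * u(z_s, r, s) + (t - s) * u(z_t, s, t)
print(one_big_step, two_small_steps)

# Limit: for a vanishing interval the average velocity approaches v(z_t, t).
print(u(z_t, t - 1e-6, t), v(z_t, t))
```

Because the toy trajectory is curved in $(t, z)$, `u` and `v` differ for wide intervals and only agree as the interval shrinks.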
The MeanFlow Identity
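Directly using Eq. (3) for training is problematic because it requires evaluating an integral, which is computationally expensive during training. The core insight of MeanFlow is to derive a relationship between $u$ and $v$ that is amenable to training.
- Rewrite Eq. (3) as a displacement:
  $$(t - r)\, u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\, d\tau$$
- Differentiate both sides with respect to $t$, treating $r$ as independent of $t$:
  $$\frac{d}{dt}\Big[(t - r)\, u(z_t, r, t)\Big] = \frac{d}{dt} \int_r^t v(z_\tau, \tau)\, d\tau$$
- Left Hand Side (LHS) - Product Rule: Since $\frac{d}{dt}(t - r) = 1$, the LHS becomes:
  $$u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t)$$
- Right Hand Side (RHS) - Fundamental Theorem of Calculus (assuming $r$ is constant with respect to $t$):
  $$\frac{d}{dt} \int_r^t v(z_\tau, \tau)\, d\tau = v(z_t, t)$$
- Equate LHS and RHS:
  $$u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)$$
Rearranging to isolate $u$:
$$u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t) \tag{6}$$
This is the MeanFlow Identity. It states that the average velocity ($u$) can be expressed in terms of the instantaneous velocity ($v$) and the total derivative of the average velocity with respect to $t$. This identity is crucial because it provides a "target" for learning that only involves $v$ and derivatives of $u$, rather than an integral of $v$.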
Decomposing the Total Derivative
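The term $\frac{d}{dt} u(z_t, r, t)$ is a total derivative. Since $u$ is a function of $z_t$, $r$, and $t$, and $z_t$ itself is a function of $t$ (i.e., $z_t$ depends on $t$ through the flow path), we must use the chain rule for total derivatives:
$$\frac{d}{dt} u(z_t, r, t) = \frac{d z_t}{dt}\, \partial_z u + \frac{dr}{dt}\, \partial_r u + \frac{dt}{dt}\, \partial_t u$$
- $\frac{d z_t}{dt} = v(z_t, t)$: This is the definition of instantaneous velocity from Eq. (2) in the paper's background.
- $\frac{dr}{dt} = 0$: We treat $r$ as an independent time parameter from the interval $[r, t]$, meaning its value doesn't change as we differentiate with respect to $t$.
- $\frac{dt}{dt} = 1$: The derivative of $t$ with respect to $t$ is 1.
Substituting these into the total derivative equation:
$$\frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u + \partial_t u \tag{8}$$
- $\partial_z u$: This represents the Jacobian (matrix of partial derivatives) of $u$ with respect to $z$.
- $v(z_t, t)\, \partial_z u$: This is a Jacobian-vector product (JVP). It represents how changes in $z_t$ (due to its motion with velocity $v$) affect $u$.
- $\partial_t u$: This is the partial derivative of $u$ with respect to $t$, holding $z$ and $r$ constant.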
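The identity and the total-derivative decomposition can be checked on a case where $u$ is known exactly. This is a minimal sketch, not the paper's code: it assumes a toy 1-D Gaussian flow (made-up `SIGMA_DATA`) with closed-form `v` and `u`, and uses `jax.jvp` with tangents $(v, 0, 1)$ to obtain $v\,\partial_z u + \partial_t u$ in one call.

```python
import jax
import jax.numpy as jnp

SIGMA_DATA = 0.5  # assumed toy data std (hypothetical value)

def scale(t):
    return jnp.sqrt((1.0 - t) ** 2 * SIGMA_DATA ** 2 + t ** 2)

def v(z, t):
    """Closed-form instantaneous velocity of the toy flow."""
    return (t - (1.0 - t) * SIGMA_DATA ** 2) / scale(t) ** 2 * z

def u(z, r, t):
    """Closed-form average velocity of the toy flow over [r, t]."""
    return z * (1.0 - scale(r) / scale(t)) / (t - r)

z_t, r, t = jnp.asarray(1.3), jnp.asarray(0.3), jnp.asarray(0.8)
v_t = v(z_t, t)

# Tangents (dz/dt, dr/dt, dt/dt) = (v_t, 0, 1) yield the total derivative d/dt u.
tangents = (v_t, jnp.zeros_like(r), jnp.ones_like(t))
u_val, dudt = jax.jvp(u, (z_t, r, t), tangents)

# MeanFlow Identity: u = v - (t - r) * d/dt u
identity_rhs = v_t - (t - r) * dudt
print(float(u_val), float(identity_rhs))
```

The two printed numbers should agree up to floating-point error, which is exactly what the network is trained to satisfy when $u$ is replaced by $u_\theta$.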
Training with Average Velocity
The goal is to train a neural network, $u_\theta$, parameterized by $\theta$, to approximate the average velocity $u$. The MeanFlow Identity (Eq. 6), combined with the total derivative decomposition (Eq. 8), provides the target for this training.
The loss function minimises the squared difference between the network's prediction and an effective regression target $u_{\text{tgt}}$:
$$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u_\theta(z_t, r, t) - \text{sg}(u_{\text{tgt}}) \big\|_2^2$$
where the target is derived from the MeanFlow Identity:
$$u_{\text{tgt}} = v(z_t, t) - (t - r)\big( v(z_t, t)\, \partial_z u_\theta + \partial_t u_\theta \big)$$
- $u_\theta(z_t, r, t)$: The output of the neural network, predicting the average velocity.
- $\text{sg}(\cdot)$: Stop-gradient operation. This means that during backpropagation, the gradients are not passed through $u_{\text{tgt}}$. The target is treated as a fixed value, preventing double backpropagation through the derivatives of $u_\theta$ within $u_{\text{tgt}}$, which would lead to higher-order derivatives and make training much more complex and computationally expensive.
- $v(z_t, t)$: The instantaneous velocity. Following Flow Matching practice, this is replaced by the conditional velocity $v_t$ for training. For the linear schedule $a_t = 1 - t$, $b_t = t$ (a common choice, as in rectified flows), this is $v_t = \epsilon - x$.
- $\partial_z u_\theta$ and $\partial_t u_\theta$: These are the Jacobian and partial derivative of the neural network's output with respect to $z$ and $t$, respectively. They are computed using automatic differentiation, specifically Jacobian-vector products (JVP).
The MeanFlow Identity is a necessary and sufficient condition for Eq. (3) if the boundary condition is met, namely that the displacement $(t - r)\, u(z_t, r, t)$ vanishes at $t = r$. The MeanFlow formulation inherently satisfies this.
Algorithm 1: Training MeanFlow
The pseudocode illustrates the training process:
Algorithm 1 MeanFlow: Training.
Input: A batch of data x, prior distribution samples e, time t, r, model u_theta
Output: Loss value
1. Sample t from U(0,1)
2. Sample r from U(0,t) // Or some other distribution, with r < t
3. Generate z_t = (1-t) * x + t * e // The noisy sample at time t
4. Compute conditional instantaneous velocity v_t = e - x
5. Define a function f(z, r_val, t_val) that computes u_theta(z, r_val, t_val)
6. Use jvp to compute the total derivative:
# d/dt u(z_t, r, t) = (v_t @ partial_z u_theta) + partial_t u_theta
# This combines the partial derivatives w.r.t. z and t, where z itself depends on t.
(u_val, grad_u_val) = jax.jvp(f, (z_t, r, t), (v_t, 0, 1))
# (u_val is u_theta(z_t, r, t), grad_u_val is the total derivative d/dt u_theta)
# The tangent vector for jvp is (v_t, 0, 1) because:
# d z_t / d t = v_t
# d r / d t = 0 (r is treated as independent of t for this derivative)
# d t / d t = 1
7. Compute the regression target:
u_tgt = v_t - (t - r) * grad_u_val
// Apply stop_gradient (sg) to u_tgt
8. Compute the loss:
loss = mean(square(u_val - sg(u_tgt)))
9. Backpropagate and update theta
jvp (Jacobian-Vector Product): This operation efficiently computes the total derivative. It takes the function (our $u_\theta$), the arguments $(z_t, r, t)$, and a tangent vector $(v_t, 0, 1)$. The tangent vector specifies how each argument changes with respect to the differentiation variable (here, $t$). The jvp returns both the function's output u_val ($u_\theta(z_t, r, t)$) and the total derivative grad_u_val ($\frac{d}{dt} u_\theta$). jax.jvp uses forward-mode automatic differentiation, so both values come out of one augmented evaluation of the network; the extra cost is roughly that of one additional pass, comparable to the backward pass of standard backpropagation.
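The pseudocode above can be turned into a small runnable JAX sketch. This is an illustration under stated assumptions, not the authors' implementation: the tiny MLP `u_theta`, the helper `init_params`, the 2-D stand-in data, and the choice to sum squared errors over features before averaging over the batch are all hypothetical. Note that in JAX the tangents passed to `jax.jvp` must match the shapes and dtypes of the primals, hence `zeros_like(r)` and `ones_like(t)` instead of the bare `0` and `1`.

```python
import jax
import jax.numpy as jnp

def init_params(key, dim, hidden=64):
    """Tiny MLP standing in for u_theta; its input is [z, r, t] concatenated."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (dim + 2, hidden)) / jnp.sqrt(dim + 2),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, dim)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(dim),
    }

def u_theta(params, z, r, t):
    h = jnp.tanh(jnp.concatenate([z, r, t], axis=-1) @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def meanflow_loss(params, key, x):
    k_t, k_r, k_e = jax.random.split(key, 3)
    batch = x.shape[0]
    t = jax.random.uniform(k_t, (batch, 1))          # t ~ U(0, 1)
    r = t * jax.random.uniform(k_r, (batch, 1))      # r ~ U(0, t), so r <= t
    e = jax.random.normal(k_e, x.shape)              # prior sample
    z_t = (1.0 - t) * x + t * e                      # point on the flow path
    v_t = e - x                                      # conditional velocity

    f = lambda z, r_, t_: u_theta(params, z, r_, t_)
    # Tangents (v_t, 0, 1) give the total derivative d/dt u_theta along the path.
    u_val, dudt = jax.jvp(f, (z_t, r, t), (v_t, jnp.zeros_like(r), jnp.ones_like(t)))

    # The regression target is held fixed: no gradient flows through the JVP term.
    u_tgt = jax.lax.stop_gradient(v_t - (t - r) * dudt)
    return jnp.mean(jnp.sum((u_val - u_tgt) ** 2, axis=-1))

params = init_params(jax.random.PRNGKey(0), dim=2)
x = jax.random.normal(jax.random.PRNGKey(1), (8, 2))   # stand-in "data" batch
loss, grads = jax.value_and_grad(meanflow_loss)(params, jax.random.PRNGKey(2), x)
print(float(loss))
```

The returned `grads` has the same structure as `params` and could be passed to any optimizer; the optimizer step itself is left out of this sketch.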
Sampling
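The main advantage of MeanFlow is its one-step sampling capability. Once $u_\theta$ is trained, a state $z_t$ is moved to an earlier time $r$ by replacing the time integral with the learned average velocity:
$$z_r = z_t - (t - r)\, u(z_t, r, t) \tag{12}$$
For one-step sampling from prior noise to data, we set $t = 1$ (noise state) and $r = 0$ (data state):
$$z_0 = z_1 - u(z_1, 0, 1), \qquad z_1 = \epsilon \sim p_{\text{prior}}$$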
Algorithm 2: 1-step Sampling
Algorithm 2 MeanFlow: 1-step Sampling
Input: Model u_theta, prior distribution sample e
Output: Generated data x_0
1. Sample epsilon ~ p_prior
2. Compute x_0 = epsilon - u_theta(epsilon, r=0, t=1) // Single function evaluation
This single evaluation of $u_\theta$ generates the final sample, making it highly efficient. Few-step sampling is also straightforward by applying Eq. (12) iteratively.
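One-step and few-step sampling share the same update, applied along a decreasing time grid. The sketch below is illustrative rather than the paper's code: `meanflow_sample` is a hypothetical helper, and `exact_u` is the closed-form average velocity of a toy 1-D Gaussian flow (made-up `SIGMA_DATA`) standing in for a trained $u_\theta$.

```python
import math

SIGMA_DATA = 0.5  # assumed toy data std (hypothetical value)

def scale(t):
    return math.sqrt((1.0 - t) ** 2 * SIGMA_DATA ** 2 + t ** 2)

def exact_u(z_t, r, t):
    """Closed-form average velocity of the toy flow; stands in for a trained u_theta."""
    return z_t * (1.0 - scale(r) / scale(t)) / (t - r)

def meanflow_sample(u_fn, eps, times):
    """Apply z_r = z_t - (t - r) * u(z_t, r, t) along a decreasing time grid."""
    z = eps
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u_fn(z, r, t)   # one function evaluation per interval
    return z

eps = 1.0
one_step = meanflow_sample(exact_u, eps, [1.0, 0.0])        # 1-NFE
two_step = meanflow_sample(exact_u, eps, [1.0, 0.5, 0.0])   # 2-NFE
print(one_step, two_step)
```

With the exact average velocity both calls land on the same endpoint; with a learned network the few-step variant trades extra evaluations for reduced approximation error.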
4.2 Mean Flows with Guidance
Classifier-Free Guidance (CFG) [18] is a technique commonly used in diffusion models to improve sample quality by amplifying the effect of a given class condition (e.g., "generate a cat"). It typically involves running the model twice: once with the desired condition and once unconditionally, then linearly combining their outputs. This often doubles the NFE.
MeanFlow integrates CFG by constructing a guided ground-truth velocity field that the network learns directly, thus preserving 1-NFE sampling.
Ground-truth Fields with CFG
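The paper defines a new ground-truth instantaneous velocity field that incorporates guidance:
$$v^{\text{cfg}}(z_t, t \mid c) \triangleq \omega\, v(z_t, t \mid c) + (1 - \omega)\, v(z_t, t)$$
- $v(z_t, t \mid c)$: The instantaneous velocity conditioned on a class condition $c$.
- $v(z_t, t)$: The class-unconditional instantaneous velocity. This is obtained by averaging over all possible conditions: $v(z_t, t) = \mathbb{E}_c\left[ v(z_t, t \mid c) \right]$.
- $\omega$: The guidance scale, a hyperparameter controlling the strength of the guidance. Higher $\omega$ means stronger guidance.
Following the MeanFlow Identity, the corresponding average velocity field $u^{\text{cfg}}$ must satisfy:
$$u^{\text{cfg}}(z_t, r, t \mid c) = v^{\text{cfg}}(z_t, t \mid c) - (t - r)\, \frac{d}{dt} u^{\text{cfg}}(z_t, r, t \mid c)$$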
Training with Guidance
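The neural network $u^{\text{cfg}}_\theta$ is parameterized to directly model $u^{\text{cfg}}$. The loss function is similar to the unguided case:
$$\mathcal{L}(\theta) = \mathbb{E}\, \big\| u^{\text{cfg}}_\theta(z_t, r, t \mid c) - \text{sg}(u_{\text{tgt}}) \big\|_2^2$$
where $u_{\text{tgt}}$ is now defined using a modified instantaneous velocity $\tilde{v}_t$:
$$u_{\text{tgt}} = \tilde{v}_t - (t - r)\big( \tilde{v}_t\, \partial_z u^{\text{cfg}}_\theta + \partial_t u^{\text{cfg}}_\theta \big)$$
The modified instantaneous velocity $\tilde{v}_t$ is the crucial part for CFG integration:
$$\tilde{v}_t \triangleq \omega\, v_t + (1 - \omega)\, u^{\text{cfg}}_\theta(z_t, t, t)$$
- $v_t$: The sample-conditional instantaneous velocity (e.g., $v_t = \epsilon - x$).
- $u^{\text{cfg}}_\theta(z_t, t, t)$: This term represents the network's prediction of average velocity when the interval is infinitesimally small (i.e., $r = t$). In this limit, average velocity equals instantaneous velocity, so $u^{\text{cfg}}_\theta(z_t, t, t)$ serves as the network's prediction for the unconditional instantaneous velocity. This is where the unconditional part of CFG comes from.
To expose the network to class-unconditional information, a dropout mechanism is applied to the class condition during training (e.g., setting $c$ to a null token for a certain percentage of samples) as in standard CFG [18]. When $c$ is dropped, the model learns to predict the unconditional velocity.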
Single-NFE Sampling with CFG
Because $u^{\text{cfg}}_\theta$ is trained to directly model the guided average velocity $u^{\text{cfg}}$, sampling is still 1-NFE. There is no need for a separate unconditional run and subsequent linear combination at inference time. The network directly outputs the guided average velocity using a single forward pass:
$$z_0 = \epsilon - u^{\text{cfg}}_\theta(\epsilon, 0, 1 \mid c)$$
Improved CFG for MeanFlow (Appendix B.1)
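The paper introduces an improvement to the CFG formulation by adding a mixing scale $\kappa$. This allows for a more explicit mix of class-conditional and class-unconditional predictions within the target $u_{\text{tgt}}$. The ground-truth instantaneous velocity is redefined as:
$$v^{\text{cfg}}(z_t, t \mid c) = \omega\, v(z_t, t \mid c) + \kappa\, v^{\text{cfg}}(z_t, t \mid c) + (1 - \omega - \kappa)\, v(z_t, t)$$
And the modified instantaneous velocity for the training target becomes:
$$\tilde{v}_t \triangleq \omega\, v_t + \kappa\, u^{\text{cfg}}_\theta(z_t, t, t \mid c) + (1 - \omega - \kappa)\, u^{\text{cfg}}_\theta(z_t, t, t)$$
- $\kappa$: A mixing scale.
- $u^{\text{cfg}}_\theta(z_t, t, t \mid c)$: The network's prediction of conditional instantaneous velocity.
- $u^{\text{cfg}}_\theta(z_t, t, t)$: The network's prediction of unconditional instantaneous velocity.
This formulation helps to further improve generation quality by explicitly training the network to produce both conditional and unconditional outputs, mirroring the random dropout strategy of standard CFG.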
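A short worked step shows how the mixing scale relates to an ordinary guidance scale. It assumes the redefined field is the three-term mix suggested by the symbols listed above (a weight $\omega$ on the class-conditional velocity, $\kappa$ on the guided field itself, and $1 - \omega - \kappa$ on the unconditional velocity); the symbol $\omega'$ for the resulting effective scale is introduced here for illustration. Solving for $v^{\text{cfg}}$:

$$
\begin{aligned}
v^{\text{cfg}}(z_t, t \mid c) &= \omega\, v(z_t, t \mid c) + \kappa\, v^{\text{cfg}}(z_t, t \mid c) + (1 - \omega - \kappa)\, v(z_t, t) \\
(1 - \kappa)\, v^{\text{cfg}}(z_t, t \mid c) &= \omega\, v(z_t, t \mid c) + (1 - \kappa - \omega)\, v(z_t, t) \\
v^{\text{cfg}}(z_t, t \mid c) &= \omega'\, v(z_t, t \mid c) + (1 - \omega')\, v(z_t, t), \qquad \omega' = \frac{\omega}{1 - \kappa}
\end{aligned}
$$

Under that assumption the mixed field is a standard CFG field with effective guidance scale $\omega'$. As a purely illustrative example, $\omega = 1$ and $\kappa = 0.5$ would give $\omega' = 2$.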
4.3 Design Decisions
Loss Metrics
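The paper investigates different forms of loss functions. The base loss is the squared L2 loss, $\mathcal{L} = \|\Delta\|_2^2$, where $\Delta = u_\theta - u_{\text{tgt}}$ is the regression error.
They use adaptive loss weighting [15], which effectively transforms the squared L2 loss into a powered L2 loss $\mathcal{L}_\gamma = \|\Delta\|_2^{2\gamma}$. This is achieved by weighting the squared L2 loss with $w = 1 / (\|\Delta\|_2^2 + c)^p$, where $p = 1 - \gamma$ and $c > 0$ is a small constant.
- If $p = 0$ ($\gamma = 1$), it's the standard squared L2 loss.
- If $p = 0.5$ ($\gamma = 0.5$), it's similar to the Pseudo-Huber loss [43], which is less sensitive to large errors.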
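The weighting scheme can be sketched in a few lines of JAX. This is an illustration, not the paper's code: the function name `adaptive_l2_loss` and the default `c=1e-3` are assumptions, and treating the weight as a constant via `stop_gradient` follows the usual practice for this kind of adaptive weighting rather than anything stated in the text above.

```python
import jax
import jax.numpy as jnp

def adaptive_l2_loss(pred, target, p, c=1e-3):
    """Squared L2 loss reweighted per sample by 1 / (||Delta||^2 + c)^p."""
    delta = pred - target
    sq_err = jnp.sum(delta ** 2, axis=-1)                 # ||Delta||_2^2 per sample
    # Assumption: the weight is held constant so it only rescales the gradient.
    w = jax.lax.stop_gradient(1.0 / (sq_err + c) ** p)
    return jnp.mean(w * sq_err)

pred = jnp.array([[3.0, 4.0], [0.0, 0.0]])
target = jnp.zeros((2, 2))
plain = adaptive_l2_loss(pred, target, p=0.0)      # p = 0: ordinary squared L2
weighted = adaptive_l2_loss(pred, target, p=1.0)   # p = 1: large errors are flattened
print(float(plain), float(weighted))
```

With $p = 0$ the weight is 1 and the loss is the plain squared error; larger $p$ shrinks the contribution of samples with large errors.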
Sampling Time Steps (r, t)
The choice of how and are sampled from the time domain (usually [0, 1]) affects training.
- Uniform distribution $\mathcal{U}(0, 1)$: $r$ and $t$ are chosen uniformly.
- Logit-Normal (lognorm) distribution [11]: Samples are drawn from a normal distribution and then mapped to $(0, 1)$ using the logistic function. This often concentrates samples in specific regions of the time domain, which can be beneficial.
After sampling, $t$ is assigned the larger value and $r$ the smaller, ensuring $t \ge r$. A certain portion of samples also have $r = t$, which essentially makes the target reduce to Flow Matching.
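The sampling procedure just described can be sketched as follows. This is a hypothetical helper, not the paper's code: the name `sample_r_t`, its argument names, and the values passed in the example call (`mu=0.0`, `sigma=1.0`, `ratio_r_neq_t=0.25`) are placeholders chosen for illustration only.

```python
import jax
import jax.numpy as jnp

def sample_r_t(key, batch, mu, sigma, ratio_r_neq_t):
    """Draw two logit-normal times, order them so r <= t, then set r = t for some samples."""
    k1, k2, k3 = jax.random.split(key, 3)
    a = jax.nn.sigmoid(mu + sigma * jax.random.normal(k1, (batch,)))
    b = jax.nn.sigmoid(mu + sigma * jax.random.normal(k2, (batch,)))
    t = jnp.maximum(a, b)   # larger value becomes t
    r = jnp.minimum(a, b)   # smaller value becomes r
    # Keep a genuine interval for a fraction of samples; elsewhere collapse to r = t,
    # where the MeanFlow target reduces to the Flow Matching target.
    keep_interval = jax.random.uniform(k3, (batch,)) < ratio_r_neq_t
    r = jnp.where(keep_interval, r, t)
    return r, t

r, t = sample_r_t(jax.random.PRNGKey(0), 1000, mu=0.0, sigma=1.0, ratio_r_neq_t=0.25)
print(float(jnp.mean(r < t)))   # empirical fraction of samples with r != t
```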
Conditioning on (r, t)
The neural network needs to be conditioned on the time variables $r$ and $t$. Positional embedding [48] is used for this. The paper explores different ways to represent these time variables as input to the network:
- Directly embedding $(t, r)$.
- Embedding $(t, t - r)$, where $t - r$ represents the time interval $\Delta t$.
- Embedding $(t, r, t - r)$, which carries the same information as $(t, r)$ since $t - r$ is determined by the other two.
- Embedding only $t - r$.
The choice of embedding can influence the network's ability to learn the relationships between $z_t$, $r$, and $t$.
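The four variants differ only in which scalars are embedded. The sketch below is an assumption-laden illustration, not the paper's architecture: it uses a standard sinusoidal embedding with made-up `dim` and `max_period`, sums the per-variable embeddings directly, and omits the MLP that the paper applies before summation. The variant keys are hypothetical labels.

```python
import jax.numpy as jnp

def sinusoidal_embedding(x, dim=64, max_period=10000.0):
    """Standard sinusoidal embedding of a batch of scalars, shape (batch,) -> (batch, dim)."""
    half = dim // 2
    freqs = jnp.exp(-jnp.log(max_period) * jnp.arange(half) / half)
    args = x[:, None] * freqs[None, :]
    return jnp.concatenate([jnp.cos(args), jnp.sin(args)], axis=-1)

def time_conditioning(r, t, variant):
    """Select which time quantities are embedded, then sum their embeddings."""
    inputs = {
        "t,r": (t, r),
        "t,t-r": (t, t - r),
        "t,r,t-r": (t, r, t - r),
        "t-r": (t - r,),
    }[variant]
    # Simplification: a per-variable MLP would normally precede this sum.
    return sum(sinusoidal_embedding(value) for value in inputs)

t = jnp.array([1.0, 0.8, 0.5, 0.5])
r = jnp.array([0.0, 0.3, 0.1, 0.5])
cond = time_conditioning(r, t, "t,t-r")
print(cond.shape)
```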
B.3 On the Sufficiency of the MeanFlow Identity (Appendix B.3)
The paper explicitly shows that the MeanFlow Identity (Eq. 6) is not only a necessary condition derived from the definition of average velocity (Eq. 3), but also a sufficient condition. This means that if a function satisfies Eq. (6), it also satisfies Eq. (3).
The proof considers the displacement field $(t - r)\, u(z_t, r, t)$. If the MeanFlow Identity holds, then $\frac{d}{dt}\big[(t - r)\, u(z_t, r, t)\big] = v(z_t, t)$. Integrating both sides with respect to $t$ yields $(t - r)\, u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\, d\tau + C$ for some $C$ that does not depend on $t$. However, at $t = r$ the left-hand side is 0 (since $t - r = 0$), and the integral is also 0. These boundary conditions force $C = 0$, thus proving sufficiency. This is important because it means learning a network that satisfies Eq. (6) implies it correctly models the average velocity as defined by the integral.
B.4 Analysis on Jacobian-Vector Product (JVP) Computation (Appendix B.4)
While JVP computations can sometimes be a bottleneck, the paper clarifies that in MeanFlow, it is very lightweight. The key reason is that the JVP result (the total derivative term) is part of the regression target and is subject to the stop-gradient (sg) operation. This means the gradients for optimizing the neural network parameters are not propagated through the JVP computation.
The extra cost of the JVP is comparable to a single backward pass (like backpropagation), but the derivatives are taken with respect to the network inputs $(z_t, r, t)$, not with respect to the network's parameters $\theta$. This is less costly than a full backpropagation to update $\theta$. The empirical benchmark shows an overhead of less than 20% of the total training time, confirming its efficiency.
Experimental Setup
Datasets
- ImageNet [7]:
  - Description: A large-scale dataset of natural images, widely used for image classification and generation tasks. It contains millions of images across thousands of object categories.
  - Characteristics: High resolution ($256 \times 256$ pixels), diverse content (animals, objects, scenes), and class-conditional labels.
  - Usage: Used for major experiments, evaluating class-conditional image generation.
  - Data Example: A picture of a golden retriever from the dataset.
  - Preprocessing: Images are first processed by a pre-trained VAE tokenizer [37]. This converts the $256 \times 256 \times 3$ (height $\times$ width $\times$ channels) pixel input into a lower-dimensional latent space representation of $32 \times 32 \times 4$. The MeanFlow model then operates on this latent space. This is a common practice in latent diffusion models to reduce computational load.
- CIFAR-10 [25]:
  - Description: A smaller dataset consisting of 60,000 color images in 10 classes, with 6,000 images per class.
  - Characteristics: Low resolution ($32 \times 32$ pixels), 10 common object classes (e.g., airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
  - Usage: Used for unconditional image generation experiments, directly on pixel space.
  - Data Example: A small $32 \times 32$ pixel image of a frog.
Evaluation Metrics
Fréchet Inception Distance (FID) [17]:
- Conceptual Definition: FID is a metric used to assess the quality of images generated by a generative model. It measures the "distance" between the distribution of real images and the distribution of generated images. A lower FID score indicates better quality and higher realism, meaning the generated images are more similar to real images and the diversity of generated images matches that of real images. It is commonly considered a more robust metric than Inception Score because it takes into account the statistical properties of both real and generated images.
- Mathematical Formula:
  $$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \text{Tr}\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)$$
- Symbol Explanation:
  - $\mu_r$: The mean feature vector of real images, extracted from an Inception v3 network's activation layer.
  - $\mu_g$: The mean feature vector of generated images, extracted from an Inception v3 network's activation layer.
  - $\|\mu_r - \mu_g\|_2^2$: The squared Euclidean distance (L2 norm) between the mean feature vectors.
  - $\Sigma_r$: The covariance matrix of feature vectors for real images.
  - $\Sigma_g$: The covariance matrix of feature vectors for generated images.
  - $\text{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).
  - $(\Sigma_r \Sigma_g)^{1/2}$: The matrix square root of the product of the covariance matrices.
A set of 50,000 generated images (FID-50K) is typically used to compute the FID score to ensure statistical robustness.
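Given the feature means and covariances, the formula can be evaluated directly. The sketch below is a minimal illustration, not a reference implementation: it skips Inception feature extraction entirely, and the helper names `sqrtm_psd` and `fid_from_stats` are hypothetical. It computes the trace of the matrix square root through a symmetric form, which has the same eigenvalues as $\Sigma_r \Sigma_g$ for positive semi-definite inputs.

```python
import jax.numpy as jnp

def sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = jnp.linalg.eigh(mat)
    return (vecs * jnp.sqrt(jnp.maximum(vals, 0.0))) @ vecs.T

def fid_from_stats(mu_r, sigma_r, mu_g, sigma_g):
    """FID from precomputed feature statistics of real (r) and generated (g) images."""
    root_r = sqrtm_psd(sigma_r)
    # Tr((S_r S_g)^{1/2}) equals Tr((S_r^{1/2} S_g S_r^{1/2})^{1/2}) for PSD matrices.
    cross = jnp.trace(sqrtm_psd(root_r @ sigma_g @ root_r))
    mean_term = jnp.sum((mu_r - mu_g) ** 2)
    return mean_term + jnp.trace(sigma_r) + jnp.trace(sigma_g) - 2.0 * cross

mu_r, sigma_r = jnp.array([0.0, 0.0]), jnp.diag(jnp.array([1.0, 4.0]))
mu_g, sigma_g = jnp.array([1.0, 2.0]), jnp.diag(jnp.array([9.0, 1.0]))
print(float(fid_from_stats(mu_r, sigma_r, mu_g, sigma_g)))
print(float(fid_from_stats(mu_r, sigma_r, mu_r, sigma_r)))
```

For these hand-picked 2-D statistics the mean term is 5 and the covariance term is 5; identical statistics give a score of zero up to floating-point error.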
Baselines
The paper compares MeanFlow against a range of state-of-the-art (SOTA) generative models, categorized by their underlying approach:
- One-step Diffusion/Flow Models (trained from scratch):
  - iCT-XL/2 [43]: An improved Consistency Training model.
  - Shortcut-XL/2 [13]: A shortcut model approach for one-step diffusion.
  - IMM-XL/2 [52]: Inductive Moment Matching model.
- Other Families of Generative Models (for reference):
  - GANs (Generative Adversarial Networks): BigGAN [5], GigaGAN [21], StyleGAN-XL [40]
  - Autoregressive/Masking Models: AR w/ VQGAN [10], MaskGIT [6], VAR-d30 [47], MAR-H [27]
  - Multi-step Diffusion/Flow Models:
    - ADM [8] (Ablated Diffusion Model)
    - LDM-4-G [37] (Latent Diffusion Models)
    - SimDiff [20] (Simple Diffusion)
    - DiT-XL/2 [34] (Diffusion Transformers)
    - SiT-XL/2 [33] (Scalable Interpolant Transformers)
    - SiT-XL/2+REPA [51] (with Representation Alignment)
These baselines represent different paradigms in generative modeling, with DiT and SiT being particularly strong Transformer-based diffusion models that typically perform very well but require many NFEs.
Model Architecture and Training Details
ImageNet
- Backbone: DiT [34] architecture, based on Vision Transformers (ViT) [9] with adaLN-Zero for conditioning. ViT divides an image into patches, treats them as tokens, and processes them with a Transformer encoder. adaLN-Zero is an adaptive layer normalization technique that effectively integrates conditioning information (like time and class labels).
- Conditioning:
Positional embedding[48] is applied to each time variable ( and ), followed by a 2-layerMulti-Layer Perceptron (MLP), and the outputs are summed to form the conditioning input for theTransformer. The primary conditioning scheme is . - Model Sizes: Ablation studies use
ViT-B/4(Base size, patch size 4x4). For main results, various sizes are used: , , , (Base, Medium, Large, Extra Large, all with patch size 2x2). - Training Schedule: Models are trained
from scratchfor 80 to 1000epochs. - Optimizer:
Adam[24] with specific hyperparameters:- Learning Rate: 0.0001 (constant schedule)
- ,
- Weight Decay: 0.0
- Batch Size: Not specified in the main text for ImageNet, but typically large for
DiTarchitectures. - Dropout: 0.0
- EMA Decay: 0.9999 (Exponential Moving Average, used for model weights to stabilise training and improve performance).
- Time Step Sampling:
Logit-normaldistributionlognorm(0.4, 1.0)is typically used for sampling and . - Ratio of : in ablation, implying that of the time (which reduces to
Flow Matching). For main results, the table implies of . - Adaptive Weighting: Power is used for the
adaptive loss weighting. - CFG (Classifier-Free Guidance):
- Effective guidance scale typically 2.0 or 2.5.
Class-conditional dropout[18] rate of 0.1 (i.e., 10% of samples are trained unconditionally).- CFG is triggered for specific ranges of values depending on the model size and configuration.
Mixing scale(from Improved CFG) is also used for configuration.
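The EMA entry in the list above can be illustrated with a short sketch. This is the generic exponential-moving-average update, not code from the paper; the function name `ema_update` and the flat list of floats standing in for model weights are assumptions made for illustration.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Return the EMA weights after one optimizer step (generic EMA rule)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [0.0, 0.0]
for _ in range(3):
    # decay=0.9 is used here only so the effect is visible after three steps;
    # the list above reports 0.9999 for ImageNet.
    ema = ema_update(ema, [1.0, -2.0], decay=0.9)
```

A decay close to 1 means the averaged weights move slowly, so sampling from the EMA copy smooths out step-to-step noise in the raw weights.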
CIFAR-10
- Backbone: U-net [38] architecture (approximately 55M parameters), a common choice for image generation, particularly in diffusion models [44].
- Input: Direct pixel space ($32 \times 32 \times 3$).
- Conditioning: Positional embedding on the two time variables, concatenated for the network.
- Preconditioner: No EDM preconditioner [22] is used, unlike many other SOTA CIFAR-10 models.
- Optimizer: Adam
  - Learning Rate: 0.0006
  - Batch Size: 1024
  - $\beta_1$, $\beta_2$: see the paper's appendix
  - Dropout: 0.2
  - Weight Decay: 0
  - EMA Decay: 0.99995
- Training Schedule: 800K iterations (including 10K warm-up).
- Time Step Sampling: Logit-normal distribution lognorm(2.0, 2.0).
- Ratio of $r \neq t$: see the paper's appendix.
- Adaptive Weighting: Power $p$, with the value given in the paper's appendix.
- Data Augmentation: Follows [22], but with vertical flipping and rotation disabled.
Results & Analysis
5.1 Ablation Study
The ablation study, conducted on ImageNet 256x256 with a ViT-B/4 backbone trained for 80 epochs, investigates the impact of various MeanFlow design choices on performance. Default settings are grayed out in the table below.
Table 1: Ablation study on 1-NFE ImageNet generation. FID-50K is evaluated.
Manual Transcription of Table 1:
(a) Ratio of sampling $r \neq t$. The 0% entry reduces to the standard Flow Matching baseline.
| % of r≠t | FID, 1-NFE |
|---|---|
| 0% (= FM) | 328.91 |
| 25% | 61.06 |
| 50% | 63.14 |
| 100% | 67.32 |
Analysis (Table 1a): This ablation highlights the necessity of the average velocity concept. When the ratio of sampling $r \neq t$ is 0%, the model effectively reduces to standard Flow Matching (as $r = t$ implies the second term in the MeanFlow Identity vanishes). This results in a very poor FID of 328.91, indicating that standard Flow Matching is not suitable for 1-NFE generation. Introducing a non-zero ratio of $r \neq t$ significantly improves performance, with 25% achieving the best FID of 61.06. This confirms that explicitly training with intervals (where $r \neq t$) is crucial for MeanFlow's effectiveness in one-step generation. The model needs to learn how to bridge larger time gaps, not just instantaneous velocities.
(b) JVP computation. The correct jvp tangent is $(v, 0, 1)$ for Jacobian $(\partial_z u, \partial_r u, \partial_t u)$.
| jvp tangent | FID, 1-NFE |
|---|---|
| (v, 0, 1) | 61.06 |
| (v, 0, 0) | 268.06 |
| (v, 1, 0) | 329.22 |
| (v, 1, 1) | 137.96 |
Analysis (Table 1b): This destructive ablation underscores the importance of the correct mathematical formulation of the MeanFlow Identity, particularly the total derivative term. The correct jvp tangent $(v, 0, 1)$ (representing $dz/dt$, $dr/dt$, $dt/dt$ respectively) yields the best performance (61.06 FID). Incorrect tangent specifications (e.g., $(v, 0, 0)$, which drops the $\partial_t u$ term, or $(v, 1, 1)$, which incorrectly implies $r$ changes with $t$) lead to significantly worse FID scores (268.06, 329.22, and 137.96 for the three incorrect rows). This strongly validates the theoretical derivation of the MeanFlow Identity and its total derivative breakdown.
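To make the role of the tangent concrete, here is a small sketch that evaluates a regression target of the form $v - (t - r)(v\,\partial_z u + \partial_t u)$ with a Jacobian-vector product. Everything in it is illustrative: `u_toy` is a hypothetical scalar stand-in for the network, and the hand-rolled `Dual` class plays the role that a framework primitive such as `jax.jvp` would play in practice.

```python
class Dual:
    """Number a + b*eps with eps**2 = 0; b carries the directional derivative."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.tan - other.tan)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)
    __rmul__ = __mul__

def jvp(fn, primals, tangents):
    """Forward-mode JVP: returns (fn(primals), directional derivative)."""
    out = fn(*[Dual(p, t) for p, t in zip(primals, tangents)])
    return out.val, out.tan

def u_toy(z, r, t):
    # Hypothetical stand-in for u_theta(z, r, t); not the paper's network.
    return z * (t - r) + 0.5 * t * t

def meanflow_target(v, z, r, t, tangent):
    # In real training the target would be wrapped in a stop-gradient.
    _, dudt = jvp(u_toy, (z, r, t), tangent)
    return v - (t - r) * dudt

target_correct = meanflow_target(1.5, 2.0, 0.2, 0.8, (1.5, 0.0, 1.0))  # tangent (v, 0, 1)
target_no_dt = meanflow_target(1.5, 2.0, 0.2, 0.8, (1.5, 0.0, 0.0))    # tangent (v, 0, 0)
target_r_eq_t = meanflow_target(1.5, 2.0, 0.8, 0.8, (1.5, 0.0, 1.0))   # r = t
```

For this toy function the two tangents give different targets, and with $r = t$ the correction term vanishes so the target is just $v$, matching the Flow Matching special case discussed for Table 1a.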
(c) Positional embedding. The network is conditioned on the embeddings applied to the specified variables.
| pos. embed | FID, 1-NFE |
|---|---|
| (t, r) | 61.75 |
| (t, t−r) | 61.06 |
| (t, r, t−r) | 63.98 |
| t−r only | 63.13 |
Analysis (Table 1c): This explores different ways to encode the time information $(r, t)$ for the neural network. All tested variants produce meaningful 1-NFE results, demonstrating the robustness of MeanFlow as a framework. The best performance is achieved by conditioning on $(t, t - r)$ (61.06 FID), suggesting that providing both the current time $t$ and the interval duration $t - r$ is most informative. However, $(t, r)$ performs almost as well (61.75 FID). Even using only the interval $t - r$ yields a reasonable result (63.13 FID), highlighting the importance of the interval itself.
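The conditioning variants in Table 1c can be sketched as follows. This is only an illustration under stated assumptions: the Transformer-style sinusoidal layout, the embedding width of 8, and the helper names are all hypothetical, and the per-variable MLPs are left out so that the summation step stays visible.

```python
import math

def sinusoidal_embedding(x, dim=8, max_period=10000.0):
    """Map a scalar to [sin(x*f_i)..., cos(x*f_i)...] with geometric frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(x * f) for f in freqs] + [math.cos(x * f) for f in freqs]

def condition(t, r, variant="(t, t-r)"):
    """Build the conditioning vector for one of the Table 1c variants."""
    inputs = {
        "(t, r)": (t, r),
        "(t, t-r)": (t, t - r),
        "(t, r, t-r)": (t, r, t - r),
        "t-r only": (t - r,),
    }[variant]
    embeddings = [sinusoidal_embedding(x) for x in inputs]
    # The learned MLPs are omitted in this sketch; embeddings are summed directly.
    return [sum(values) for values in zip(*embeddings)]

cond = condition(0.8, 0.3)
zero_interval = sinusoidal_embedding(0.0)   # what the t-r branch sees when r = t
```

One thing the sketch makes visible is that with $r = t$ the interval branch always receives the same constant embedding, so the network can tell the instantaneous case apart from genuine intervals.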
(d) Time samplers. $t$ and $r$ are sampled from the specific sampler.
| t, r sampler | FID, 1-NFE |
|---|---|
| uniform(0, 1) | 65.90 |
| lognorm(0.2, 1.0) | 63.83 |
| lognorm(0.2, 1.2) | 64.72 |
| lognorm(0.4, 1.0) | 61.06 |
| lognorm(0.4, 1.2) | 61.79 |
Analysis (Table 1d): The distribution from which $t$ and $r$ are sampled significantly impacts performance. Logit-normal (lognorm) samplers consistently outperform the uniform distribution (65.90 FID). Among logit-normal options, lognorm(0.4, 1.0) achieves the best FID (61.06). This aligns with observations in Flow Matching research, where non-uniform sampling strategies for time variables often lead to better results by focusing training on more critical regions of the time domain.
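A rough sketch of how a logit-normal sampler and the ratio from Table 1a might be combined is given below. Several details are assumptions rather than facts from the text: drawing two values and assigning the larger one to $t$, collapsing to $r = t$ with the complementary probability, and the function names. The location and scale are passed exactly as printed in the table; the sign convention of the location parameter should be checked against the paper.

```python
import math
import random

def logit_normal(rng, mu, sigma):
    """Sample sigmoid(n) with n ~ Normal(mu, sigma); the result lies in (0, 1)."""
    n = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-n))

def sample_t_r(rng, mu, sigma, ratio_r_neq_t=0.25):
    """Draw a (t, r) pair with r <= t; r != t for roughly the given ratio."""
    a, b = logit_normal(rng, mu, sigma), logit_normal(rng, mu, sigma)
    t, r = max(a, b), min(a, b)            # assumption: larger draw becomes t
    if rng.random() >= ratio_r_neq_t:      # remaining samples collapse to r = t
        r = t
    return t, r

rng = random.Random(0)
pairs = [sample_t_r(rng, 0.4, 1.0) for _ in range(10000)]
frac_neq = sum(r != t for t, r in pairs) / len(pairs)
```

Setting `ratio_r_neq_t=0.0` in this sketch yields $r = t$ for every sample, which corresponds to the Flow Matching row of Table 1a.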
(e) Loss metrics. $p = 0$ is squared L2 loss. $p = 0.5$ is Pseudo-Huber loss.
| p | FID, 1-NFE |
|---|---|
| 0.0 | 79.75 |
| 0.5 | 63.98 |
| 1.0 | 61.06 |
| 1.5 | 66.57 |
| 2.0 | 69.19 |
Analysis (Table 1e): This evaluates the adaptive loss weighting parameter $p$. A standard squared L2 loss ($p = 0$) performs the worst (79.75 FID), though it still yields meaningful results. Increasing $p$ (which effectively down-weights larger errors) improves performance up to $p = 1.0$, which achieves the best FID (61.06); larger values ($p = 1.5$ and $p = 2.0$) degrade it again (66.57 and 69.19). This indicates that moderately down-weighting very large errors during training is beneficial, consistent with findings in other few-step generative models (e.g., Pseudo-Huber loss at $p = 0.5$).
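The effect of the power $p$ can be seen in a few lines, assuming the weight takes the form $w = 1 / (\lVert\Delta\rVert_2^2 + c)^p$ and is treated as a constant during backpropagation. The small constant `c=1e-3` and the function name are assumptions for illustration.

```python
def adaptive_weighted_loss(sq_err, p=1.0, c=1e-3):
    """Weighted loss w * sq_err with w = 1 / (sq_err + c)**p.

    In training, w would be detached (stop-gradient), so it rescales the
    gradient of the squared error instead of being differentiated itself.
    """
    w = 1.0 / (sq_err + c) ** p
    return w * sq_err

errors = (0.01, 1.0, 100.0)
losses = {p: [adaptive_weighted_loss(e, p) for e in errors] for p in (0.0, 0.5, 1.0)}
```

With $p = 0$ the weight is 1 and the plain squared error is recovered, while with $p = 1$ every sample contributes a value below 1, so a single very large error can no longer dominate the batch.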
(f) CFG scale $\omega$. Our method supports 1-NFE CFG sampling.
| $\omega$ | FID, 1-NFE |
|---|---|
| 1.0 (w/o cfg) | 61.06 |
| 1.5 | 33.33 |
| 2.0 | 20.15 |
| 3.0 | 15.53 |
| 5.0 | 20.75 |
Analysis (Table 1f): This shows the significant impact of Classifier-Free Guidance (CFG). Without CFG ($\omega = 1.0$), the FID is 61.06. Introducing CFG dramatically improves the FID, reaching a minimum of 15.53 at a guidance scale of $\omega = 3.0$. This demonstrates that MeanFlow successfully integrates CFG in a 1-NFE manner, yielding substantial quality gains, consistent with observations in multi-step diffusion models. Too high a guidance scale (e.g., $\omega = 5.0$) degrades quality (20.75 FID), a common phenomenon in CFG.
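For reference, the standard CFG combination that the guidance scale (written $\omega$ here) controls can be sketched as below. This is the generic linear mix of conditional and unconditional velocities, with plain lists standing in for tensors; in a conventional multi-step sampler both inputs would come from separate network evaluations at every step, whereas the text above describes MeanFlow as folding guidance into training so that sampling stays at 1-NFE.

```python
def cfg_combine(v_cond, v_uncond, omega):
    """Standard CFG mix: omega * conditional + (1 - omega) * unconditional."""
    return [omega * c + (1.0 - omega) * u for c, u in zip(v_cond, v_uncond)]

v_cond, v_uncond = [1.0, -2.0], [0.5, 0.0]
no_guidance = cfg_combine(v_cond, v_uncond, 1.0)   # omega = 1 keeps the conditional field
guided = cfg_combine(v_cond, v_uncond, 3.0)        # extrapolates away from unconditional
```

A scale above 1 pushes the result past the conditional prediction, which is one way to read why very large scales in the table eventually hurt FID.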
Scalability
Figure 4: FID of MeanFlow with 1-NFE generation on ImageNet 256×256 as a function of training epochs, compared across model sizes (B/2, M/2, L/2, XL/2). Larger models reach lower FID.
Analysis (Figure 4): This figure illustrates the scalability of MeanFlow models on ImageNet 256x256 1-NFE generation. As the model size increases from Base/2 to Extra-Large/2 (B/2, M/2, L/2, XL/2), the FID score consistently decreases, indicating improved generation quality. This trend is consistent with other Transformer-based generative models like DiT and SiT, where larger models generally perform better given sufficient training. The plot also shows that CFG (implicitly applied here to achieve these FIDs) enables strong performance across scales while maintaining 1-NFE. This suggests that MeanFlow can benefit from increased computational resources and model capacity.
5.2 Comparisons with Prior Work
Table 2: Class-conditional generation on ImageNet 256×256. All entries are reported with CFG, when applicable.
Manual Transcription of Table 2 (Left): Left: 1-NFE and 2-NFE diffusion/flow models trained from scratch.
| method | params (M) | NFE | FID |
|---|---|---|---|
| 1-NFE diffusion/flow from scratch | |||
| iCT-XL/2 [43]† | 675 | 1 | 34.24 |
| Shortcut-XL/2 [13] | 675 | 1 | 10.60 |
| MeanFlow-B/2 | 131 | 1 | 6.17 |
| MeanFlow-M/2 | 308 | 1 | 5.01 |
| MeanFlow-L/2 | 459 | 1 | 3.84 |
| MeanFlow-XL/2 | 676 | 1 | 3.43 |
| 2-NFE diffusion/flow from scratch | |||
| iCT-XL/2 [43]† | 675 | 2 | 20.30 |
| iMM-XL/2 [52] | 675 | 1×2 | 7.77 |
| MeanFlow-XL/2 | 676 | 2 | 2.93 |
| MeanFlow-XL/2+ | 676 | 2 | 2.20 |
Manual Transcription of Table 2 (Right):
Right: Other families of generative models as a reference. In both tables, 1x2 indicates that CFG incurs an NFE of 2 per sampling step. Our MeanFlow models are all trained for 240 epochs, except that "MeanFlow-XL/2+" is trained for more epochs and with configurations selected for longer training specified in appendix. †: iCT [43] results are reported by [52].
| method | params (M) | NFE | FID |
|---|---|---|---|
| GANs | |||
| BigGAN [5] | 112 | 1 | 6.95 |
| GigaGAN [21] | 569 | 1 | 3.45 |
| StyleGAN-XL [40] | 166 | 1 | 2.30 |
| autoregressive/masking | |||
| AR w/ VQGAN [10] | 227 | 1024 | 26.52 |
| MaskGIT [6] | 227 | 8 | 6.18 |
| VAR-d30 [47] | 2000 | 10×2 | 1.92 |
| MAR-H [27] | 943 | 256×2 | 1.55 |
| diffusion/flow | |||
| ADM [8] | 554 | 250×2 | 10.94 |
| LDM-4-G [37] | 400 | 250×2 | 3.60 |
| SimDiff [20] | 2000 | 512×2 | 2.77 |
| DiT-XL/2 [34] | 675 | 250×2 | 2.27 |
| SiT-XL/2 [33] | 675 | 250×2 | 2.06 |
| SiT-XL/2+REPA [51] | 675 | 250×2 | 1.42 |
Analysis (ImageNet Comparisons):
- 1-NFE Performance: MeanFlow-XL/2 achieves an FID of 3.43 with 1-NFE. This is a large improvement over previous SOTA one-step models:
  - Shortcut-XL/2 (previous SOTA 1-NFE) had an FID of 10.60. MeanFlow improves upon this by a relative margin of about 68% ((10.60 − 3.43) / 10.60).
  - iCT-XL/2 (another 1-NFE model) had an FID of 34.24.
  - iMM-XL/2 reported 1-step results, but these involve 2-NFE due to CFG (1×2 in the table), achieving 7.77 FID. MeanFlow still significantly outperforms this.
- Closing the Gap with Multi-step Models: The 3.43 FID of MeanFlow with 1-NFE surpasses some multi-step diffusion models that require hundreds of NFEs, such as ADM (10.94 FID with 250×2 NFE) and LDM-4-G (3.60 FID with 250×2 NFE).
- 2-NFE Performance: The MeanFlow-XL/2+ model achieves an FID of 2.20 with 2-NFE. This is highly competitive with leading multi-step Transformer-based models like DiT-XL/2 (2.27 FID with 250×2 NFE) and SiT-XL/2 (2.06 FID with 250×2 NFE). This result strongly supports the claim that few-step diffusion/flow models can now rival their many-step predecessors.
- Self-contained and From Scratch: The fact that MeanFlow achieves these results from scratch, without pre-training, distillation, or curriculum learning, is a significant advantage in terms of simplicity and robustness.
- Comparison to GANs: MeanFlow's 1-NFE performance (3.43 FID) is competitive with SOTA GANs like GigaGAN (3.45 FID), though StyleGAN-XL (2.30 FID) remains better at 1-NFE. However, GANs have their own training stability challenges.
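The 1-NFE and 2-NFE settings compared above can be sketched with a generic sampler that applies the update $z_r = z_t - (t - r)\,u(z_t, r, t)$ once or twice. The conventions here are assumptions for illustration: $t = 1$ is noise and $t = 0$ is data, time is split uniformly, and `u_exact` is the closed-form average velocity of a toy flow whose data distribution is a single point `x_star`, standing in for a trained network.

```python
def sample(u, eps, num_steps=1):
    """Go from t=1 (noise) to t=0 (data) using num_steps evaluations of u."""
    z = eps
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        r = 1.0 - (i + 1) / num_steps
        z = z - (t - r) * u(z, r, t)    # displacement = interval * average velocity
    return z

x_star = 2.5   # toy "dataset" consisting of one point

def u_exact(z, r, t):
    # Average velocity along the straight path z_s = x_star + s * (eps - x_star);
    # it equals (z_t - x_star) / t for any r < t. Hypothetical stand-in for u_theta.
    return (z - x_star) / t

one_step = sample(u_exact, eps=-1.0, num_steps=1)   # 1-NFE
two_step = sample(u_exact, eps=-1.0, num_steps=2)   # 2-NFE
```

With an exact average velocity one step already lands on the data point, so extra steps only matter to the extent that the learned $u$ is imperfect, which is one way to read the gap between the 1-NFE and 2-NFE rows.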
Table 3: Unconditional CIFAR-10.
Manual Transcription of Table 3:
| method | precond | NFE | FID |
|---|---|---|---|
| iCT [43] | EDM | 1 | 2.83 |
| ECT [15] | EDM | 1 | 3.60 |
| sCT [31] | EDM | 1 | 2.97 |
| IMM [52] | EDM | 1 | 3.20 |
| MeanFlow | none | 1 | 2.92 |
Analysis (CIFAR-10 Comparisons):
On the CIFAR-10 dataset for unconditional generation, MeanFlow achieves an FID of 2.92 with 1-NFE. This is competitive with other 1-NFE Consistency Model variants like iCT (2.83 FID) and sCT (2.97 FID), and outperforms ECT (3.60 FID) and IMM (3.20 FID). A notable point is that MeanFlow achieves this performance without using an EDM preconditioner, which is typically employed by other SOTA models for CIFAR-10 to improve stability and performance. This suggests that MeanFlow's core mechanism is robust even without additional preconditioning techniques.
B.1 Improved CFG for MeanFlow (Appendix B.1)
Table 5: Improved CFG for MeanFlow.
Manual Transcription of Table 5:
| $\kappa$ | FID, 1-NFE |
|---|---|
| 0.0 | 20.15 |
| 0.5 | 19.15 |
| 0.8 | 19.10 |
| 0.9 | 18.63 |
| 0.95 | 19.17 |
Analysis (Table 5): This ablation explores the effect of the mixing scale $\kappa$ in the Improved CFG formulation (Eq. 20 and 21). With the effective guidance scale fixed at 2.0, varying $\kappa$ shows further improvements in FID. When $\kappa = 0.0$, the FID is 20.15. Introducing $\kappa > 0$ and mixing class-conditional and class-unconditional terms in the target reduces the FID to a minimum of 18.63 at $\kappa = 0.9$. This demonstrates that explicitly blending both conditional and unconditional aspects within the CFG target, similar to the dropout mechanism in standard CFG, further enhances MeanFlow's generation quality.
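One way to see how the effective guidance scale can stay fixed while the mixing scale (written $\kappa$ here) varies is sketched below. It assumes the guided field satisfies $v = \omega\, v_{\text{cond}} + \kappa\, v + (1 - \omega - \kappa)\, v_{\text{uncond}}$, which is my reading of the improved formulation and should be checked against Eq. 20 and 21 in the paper. Solving that relation for $v$ gives an effective scale $\omega' = \omega / (1 - \kappa)$; the function names are hypothetical.

```python
def omega_for(kappa, omega_eff=2.0):
    """Guidance scale omega that keeps the effective scale at omega_eff."""
    return (1.0 - kappa) * omega_eff

def guided_velocity(v_cond, v_uncond, omega, kappa):
    # Solve v = omega*v_cond + kappa*v + (1 - omega - kappa)*v_uncond for v.
    return (omega * v_cond + (1.0 - omega - kappa) * v_uncond) / (1.0 - kappa)

kappas = (0.0, 0.5, 0.8, 0.9, 0.95)
omegas = {k: omega_for(k) for k in kappas}
fields = [guided_velocity(1.0, 0.25, omegas[k], k) for k in kappas]
```

Under this assumption every row of Table 5 targets the same guided field (the standard CFG mix at scale 2.0), so the FID differences would come from how the target is assembled during training rather than from a change in guidance strength.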
Qualitative Results
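Figure 5: A collage of class-conditional samples generated by the paper's 1-NFE model (MeanFlow-XL/2, 3.43 FID) at ImageNet 256×256 resolution, covering a variety of natural scenes and animals.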
Analysis (Figure 5): The figure showcases curated examples of class-conditional images generated by the MeanFlow-XL/2 model on ImageNet 256x256 with 1-NFE. The images exhibit high fidelity, sharpness, and diversity, closely resembling real photographs across various categories. This visual evidence supports the quantitative FID results, confirming the model's ability to produce high-quality samples in a single step.
Conclusion & Personal Thoughts
Conclusion Summary
This paper introduces MeanFlow, a novel and principled framework for one-step generative modeling. Its core innovation is the concept of average velocity to characterise flow fields, contrasting with the instantaneous velocity typically modeled by Flow Matching methods. By deriving and leveraging the MeanFlow Identity, which fundamentally relates average and instantaneous velocities, the method provides a robust and self-contained training objective. MeanFlow requires no pre-training, distillation, or curriculum learning, simplifying its application. It achieves state-of-the-art performance for 1-NFE generation on ImageNet 256x256 (FID 3.43), significantly outperforming previous one-step models and substantially narrowing the gap with multi-step diffusion/flow models. Furthermore, MeanFlow naturally integrates Classifier-Free Guidance (CFG) without increasing the Number of Function Evaluations (NFE) during sampling.
Limitations & Future Work (Authors' Perspective)
The authors identify that MeanFlow's strong performance largely closes the gap between one-step and many-step diffusion/flow models, suggesting that future research could revisit the foundations of these models. They also explicitly state that orthogonal improvements, such as REPA [51] (Representation Alignment), are applicable and left for future work. The paper concludes by suggesting that its formulation, by describing quantities at coarsened levels of granularity, could bridge research in generative modeling, simulation, and dynamical systems in related fields (e.g., multi-scale simulation problems in physics).
Personal Insights & Critique
The MeanFlow paper presents a highly impactful contribution to the field of generative AI. The shift from instantaneous velocity to average velocity is conceptually elegant and mathematically sound. The rigorous derivation of the MeanFlow Identity is a significant strength, providing a principled foundation that many other few-step methods (which often rely on heuristics or empirically tuned consistency constraints) lack.
Strengths:
- Principled Approach: The derivation of the MeanFlow Identity from first principles is commendable. It provides a strong theoretical underpinning for the model's effectiveness.
- Self-Contained and From Scratch: The ability to achieve SOTA 1-NFE performance without distillation or curriculum learning is a major practical advantage. It makes MeanFlow easier to train, more robust, and less dependent on complex multi-stage pipelines.
- Efficient CFG Integration: The native inclusion of CFG for 1-NFE sampling is a clever design choice that overcomes a common limitation of few-step models.
- Exceptional Performance: The FID scores achieved are truly impressive, especially for 1-NFE. Closing the gap with multi-step models is a significant milestone for real-time generative applications.
- Scalability: The demonstrated scalability is crucial for leveraging larger Transformer architectures and further improving quality.
Possible Improvements / Open Questions:
- Computational Cost of JVP: While the authors state the JVP is lightweight (16% overhead), this still adds to training time. Exploring ways to approximate or further optimize this derivative calculation could be beneficial, especially for extremely large models or resource-constrained environments.
- Generalization of Average Velocity: The concept of average velocity is inherently tied to a specific time interval $[r, t]$. Could this idea be extended to other forms of "averaging" or "coarsening" in different contexts (e.g., spatial averaging, multi-resolution flows)?
- Robustness to Diverse Data: While ImageNet is a challenging dataset, further validation on more diverse or specialized datasets (e.g., medical images, complex scientific data) could assess the model's broader applicability.
- Theoretical Guarantees: While the derivation is principled, further theoretical analysis regarding convergence properties, stability, and potential for mode collapse in MeanFlow compared to standard Flow Matching could provide deeper insights.
- Hyperparameter Sensitivity: The ablation studies show sensitivity to $p$ in the loss function, time samplers, and CFG scale. Further work on adaptive or robust strategies for these hyperparameters could make the model even more user-friendly.
Broader Impact:
MeanFlow has the potential to significantly accelerate the adoption of generative models in real-world applications where low latency is critical, such as interactive image editing, real-time content generation in games or virtual reality, and faster prototyping in design. The principled nature of its average velocity concept could also inspire new theoretical developments in the understanding and construction of continuous-time generative models, potentially leading to more stable and efficient algorithms across various domains. The connection to multi-scale simulation problems hinted at in the conclusion is particularly intriguing, suggesting that MeanFlow might offer a fresh perspective on complex scientific and engineering challenges.