Flow-GRPO: Training Flow Matching Models via Online RL
TL;DR Summary
Flow-GRPO integrates online policy gradient RL into flow matching models for text-to-image generation by converting the deterministic ODE into an equivalent SDE and employing a denoising-reduction strategy, improving compositional accuracy, text rendering, and human-preference alignment with minimal reward hacking.
Abstract
We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps, significantly improving sampling efficiency without sacrificing performance. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For compositional generation, RL-tuned SD3.5-M generates nearly perfect object counts, spatial relations, and fine-grained attributes, increasing GenEval accuracy from 63% to 95%. In visual text rendering, accuracy improves from 59% to 92%, greatly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Flow-GRPO: Training Flow Matching Models via Online RL
1.2. Authors
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang.
Affiliations: The authors are affiliated with prominent academic institutions and technology companies:
- MMLab, CUHK (The Chinese University of Hong Kong)
- Tsinghua University
- Kling Team, Kuaishou Technology
- Nanjing University
- Shanghai AI Laboratory
Their backgrounds collectively suggest expertise in artificial intelligence, machine learning, computer vision, and deep learning, particularly in generative models and reinforcement learning.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, specifically arXiv:2505.05470. While not yet peer-reviewed in a formal journal or conference proceedings, arXiv is a highly influential platform for rapid dissemination of research in AI and related fields, allowing researchers to share their work before or concurrently with formal publication processes.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces Flow-GRPO, a novel method that integrates online policy gradient reinforcement learning (RL) into flow matching models for the first time. The approach relies on two key strategies: (1) an ODE-to-SDE conversion, which transforms the deterministic Ordinary Differential Equation (ODE) underlying flow matching into an equivalent Stochastic Differential Equation (SDE). This conversion preserves the model's marginal distribution across all timesteps and introduces the necessary stochasticity for RL exploration. (2) A Denoising Reduction strategy, which significantly reduces the number of denoising steps required during training while maintaining the original inference steps, thereby boosting sampling efficiency without compromising performance. Empirically, Flow-GRPO demonstrates strong effectiveness across various text-to-image tasks. For compositional generation, an RL-tuned SD3.5-M (Stable Diffusion 3.5 Medium) model improves GenEval accuracy from 63% to 95% on tasks involving object counts, spatial relations, and fine-grained attributes. In visual text rendering, accuracy improves from 59% to 92%. The method also shows substantial gains in aligning with human preferences. Notably, the improvements are achieved with minimal reward hacking, meaning that increases in reward did not lead to appreciable degradation in image quality or diversity.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2505.05470
- PDF Link: https://arxiv.org/pdf/2505.05470v5.pdf

The paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem that this paper aims to solve revolves around the limitations of current flow matching models in generating complex and precise images, particularly for tasks requiring compositional accuracy and text rendering. While flow matching models, like those used in advanced image generation (e.g., SD3.5-M), have strong theoretical foundations and produce high-quality images, they often struggle with:
- Composing complex scenes: accurately rendering multiple objects, their attributes, and spatial relationships (e.g., "a red ball on a blue box").
- Visual text rendering: generating accurate and coherent text within images.

This problem is important because text-to-image (T2I) generation models are increasingly expected to handle sophisticated prompts that demand fine-grained control and reasoning, moving beyond merely producing aesthetically pleasing but semantically inconsistent images. The gap in prior research is that while online reinforcement learning (RL) has proven highly effective in enhancing the reasoning capabilities of Large Language Models (LLMs), its potential for advancing flow matching generative models remains largely unexplored. Previous applications of RL to generative models have mainly focused on early diffusion-based models or offline RL techniques (like Direct Preference Optimization (DPO)) for flow-based models.
The paper's innovative idea is to leverage online RL, specifically the Group Relative Policy Optimization (GRPO) algorithm, to fine-tune flow matching models. This introduces two critical challenges:
- Deterministic Nature of Flow Models: Flow models rely on deterministic Ordinary Differential Equations (ODEs) for generation, which conflicts with RL's need for stochastic sampling to explore the environment.
- Sampling Efficiency: Online RL requires efficient data collection, but flow models typically need many iterative steps to generate each sample, making RL training costly and slow, especially for large models.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- First Online RL for Flow Matching: Proposing Flow-GRPO, the first method to successfully integrate online policy gradient RL (specifically GRPO) into flow matching models, demonstrating its effectiveness for T2I tasks. This addresses the challenge of extending RL's benefits from LLMs to T2I generation with flow models.
- ODE-to-SDE Conversion: Developing a novel ODE-to-SDE strategy that transforms the deterministic ODE-based flow into an equivalent Stochastic Differential Equation (SDE) framework. This crucial step introduces the necessary randomness for RL exploration while preserving the original model's marginal distributions, overcoming the fundamental conflict between deterministic generative processes and RL's stochastic requirements.
- Denoising Reduction Strategy: Introducing a practical Denoising Reduction strategy that significantly reduces the number of denoising steps during RL training (e.g., from 40 to 10 steps) while maintaining the original number of inference steps during testing. This dramatically improves sampling efficiency and accelerates the training process without sacrificing the quality of the final generated images.
- Effective KL Constraint for Reward Hacking Prevention: Demonstrating that the Kullback-Leibler (KL) constraint effectively prevents reward hacking, where models optimize for the reward metric at the expense of overall image quality or diversity. Properly tuned KL regularization allows matching high rewards while preserving image quality, albeit with longer training.
- Empirical Validation and Significant Performance Gains:
  - Compositional Generation: Flow-GRPO improves SD3.5-M accuracy on the GenEval benchmark from 0.63 to 0.95, even surpassing GPT-4o.
  - Visual Text Rendering: Accuracy increases from 59% to 92%.
  - Human Preference Alignment: Achieves substantial gains in aligning with human preferences (e.g., PickScore).
  - Minimal Reward Hacking: All improvements are achieved with very little degradation in image quality or diversity, as evidenced by stable DrawBench metrics.

These findings collectively address the core problem by significantly enhancing the reasoning and control capabilities of flow matching models, making them more robust and aligned with complex user intentions, without compromising the high-fidelity image generation they are known for.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Flow-GRPO, a reader should be familiar with several core concepts in machine learning and generative models:
- Generative Models: These are models that can learn the distribution of input data and then generate new samples that resemble the training data. Examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Diffusion Models, and Flow Matching Models.
- Flow Matching Models:
  - Continuous-Time Normalizing Flows: These models transform a simple probability distribution (e.g., Gaussian noise) into a complex data distribution (e.g., images) through a continuous, invertible transformation. This transformation is defined by an Ordinary Differential Equation (ODE).
  - Velocity Field: The ODE describes how a data point evolves over time (from noise $x_1$ to data $x_0$). The velocity field $v_t(x_t)$ dictates the direction and speed of this transformation at any given point $x_t$ and time $t$. Flow matching models are trained to directly regress this velocity field.
  - Deterministic Sampling: In standard flow matching, once the velocity field is learned, generating a sample involves numerically solving the ODE from a noise sample $x_1 \sim \mathcal{N}(0, I)$ to a data sample $x_0$. This process is deterministic, meaning the same initial noise will always produce the same output image.
- Reinforcement Learning (RL):
  - Agent, Environment, State, Action, Reward: RL involves an agent interacting with an environment. The environment is characterized by states. At each state, the agent chooses an action. This action leads to a new state and the agent receives a reward. The goal of RL is for the agent to learn a policy that maximizes the cumulative reward over time.
  - Policy: A policy is a function that maps states to a probability distribution over actions, indicating which action to take in a given state.
  - Online RL vs. Offline RL:
    - Online RL: The agent learns by directly interacting with the environment and collecting new data (trajectories) on the fly, updating its policy iteratively. This allows for exploration and adaptation.
    - Offline RL: The agent learns from a fixed dataset of previously collected interactions, without further interaction with the environment. This can be more sample-efficient but limits exploration.
  - Policy Gradient Methods: A class of RL algorithms that directly optimize the policy function (e.g., a neural network) by taking gradients of an objective function that represents the expected reward.
  - Exploration vs. Exploitation: A fundamental dilemma in RL. Exploration involves trying new actions to discover better outcomes, while exploitation involves choosing actions known to yield high rewards. Online RL inherently relies on exploration.
- Markov Decision Process (MDP): A mathematical framework for modeling sequential decision-making. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, \rho_0, P, R)$:
  - $\mathcal{S}$: A set of possible states.
  - $\mathcal{A}$: A set of possible actions.
  - $\rho_0$: The initial state distribution.
  - $P(s' \mid s, a)$: The transition probability function, defining the probability of reaching state $s'$ from state $s$ by taking action $a$.
  - $R(s, a)$: The reward function, specifying the immediate reward received after taking action $a$ in state $s$.
- Ordinary Differential Equations (ODEs) and Stochastic Differential Equations (SDEs):
  - ODE: An equation involving an unknown function of one independent variable and its derivatives. ODEs describe deterministic continuous-time processes. In generative models, they describe the path from noise to data, e.g., $\mathrm{d}x_t = v_t(x_t)\,\mathrm{d}t$.
  - SDE: An ODE extended with a stochastic (random) term, typically involving a Wiener process (Brownian motion). SDEs describe continuous-time processes that are subject to random fluctuations, e.g., $\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}w$, where $\mathrm{d}w$ represents increments of a Wiener process and $\sigma_t$ is a diffusion coefficient controlling the noise level. The key difference is the introduction of stochasticity.
- Kullback-Leibler (KL) Divergence: A measure of how one probability distribution $P$ diverges from a second, expected probability distribution $Q$. A low KL divergence means the two distributions are very similar. In RL, it's often used as a regularization term to keep the learned policy from deviating too much from a reference policy, preventing aggressive policy updates that could lead to instability or reward hacking.
  - Formula for two Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$: $ D_{\mathrm{KL}}(\mathcal{N}(\mu_1, \Sigma_1) || \mathcal{N}(\mu_2, \Sigma_2)) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T \Sigma_2^{-1}(\mu_2-\mu_1) - k + \ln\left(\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\right) \right) $ where $k$ is the dimensionality of the distributions. For isotropic Gaussians with $\Sigma_1 = \sigma_1^2 I$ and $\Sigma_2 = \sigma_2^2 I$: $ D_{\mathrm{KL}}(\mathcal{N}(\mu_1, \sigma_1^2 I) || \mathcal{N}(\mu_2, \sigma_2^2 I)) = \frac{1}{2} \left( k\frac{\sigma_1^2}{\sigma_2^2} + \frac{||\mu_2-\mu_1||^2}{\sigma_2^2} - k + k \ln\left(\frac{\sigma_2^2}{\sigma_1^2}\right) \right) $ (A short numerical sanity check of this formula appears at the end of this list.)
- GRPO (Group Relative Policy Optimization): A policy gradient method used as a lightweight alternative to PPO [20]. It is more memory-efficient because it does not require a value network, and it uses a group-relative advantage formulation.
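To make the isotropic-Gaussian KL expression above concrete, here is a small numerical sanity check (not from the paper; the function and variable names are illustrative). It compares the closed form against a Monte-Carlo estimate:

```python
import numpy as np

def kl_isotropic_gaussians(mu1, mu2, sigma1, sigma2):
    """Closed-form KL( N(mu1, sigma1^2 I) || N(mu2, sigma2^2 I) )."""
    k = mu1.shape[0]
    return 0.5 * (k * sigma1**2 / sigma2**2
                  + np.sum((mu2 - mu1) ** 2) / sigma2**2
                  - k
                  + k * np.log(sigma2**2 / sigma1**2))

# Monte-Carlo cross-check: KL = E_{x ~ p}[log p(x) - log q(x)]
# (the -k/2 * log(2*pi) constants cancel in the difference and are omitted)
rng = np.random.default_rng(0)
k, mu1, mu2, s1, s2 = 4, np.zeros(4), np.ones(4), 0.8, 1.2
x = mu1 + s1 * rng.standard_normal((100_000, k))
log_p = -0.5 * np.sum((x - mu1) ** 2, axis=-1) / s1**2 - k * np.log(s1)
log_q = -0.5 * np.sum((x - mu2) ** 2, axis=-1) / s2**2 - k * np.log(s2)
print(kl_isotropic_gaussians(mu1, mu2, s1, s2), (log_p - log_q).mean())
```

Both numbers should agree to within Monte-Carlo error, which is a useful check before relying on the closed form inside a training loop.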
3.2. Previous Works
The paper builds upon and differentiates itself from several lines of prior research:
- Flow Matching (FM) Models:
  - Rectified Flow [3]: A key foundational work that defines a straight path between data and noise, simplifying the ODE and enabling efficient deterministic sampling. This is the framework adopted by recent advanced models like SD3.5-M [4] and FLUX.1 Dev [5].
  - Flow Matching for Generative Modeling [2]: Introduced the concept of learning ODEs by directly matching the velocity field, providing a solid theoretical basis.
- Diffusion Models:
  - Denoising Diffusion Probabilistic Models (DDPM) [21]: A seminal work introducing the concept of adding Gaussian noise iteratively and learning to reverse the process.
  - Denoising Diffusion Implicit Models (DDIM) [22]: Improved sampling speed and determinism for diffusion models.
  - Score-based Generative Modeling through Stochastic Differential Equations (SDEs) [23]: Unified diffusion models under an SDE/ODE framework, providing a way to introduce stochasticity or determinism during sampling. This work is directly relevant to Flow-GRPO's ODE-to-SDE conversion.
  - Unified Diffusion and Flow Models [28, 29]: Recent theoretical work that unifies diffusion and flow models under a common SDE/ODE framework, supporting Flow-GRPO's theoretical foundations.
- RL for Generative Models:
  - Training Diffusion Models with Reinforcement Learning [12] (DDPO): Applied RL to diffusion models. Flow-GRPO extends this idea to the more efficient flow matching models and faces the challenge of their deterministic nature.
  - RL for LLMs [10, 11]: Demonstrated the power of online RL (PPO, GRPO) in enhancing LLM reasoning. Flow-GRPO seeks to transfer this success to T2I models.
  - Direct Preference Optimization (DPO) and variants [13, 14, 15, 38, 39]: Offline RL techniques that align models with human preferences. Flow-GRPO focuses on online RL, which allows for continuous interaction and exploration, and shows it outperforms online DPO in some settings.
- Alignment for T2I Models: A broad category of methods aimed at improving consistency with human preferences or specific criteria:
  - Differentiable Rewards [30, 31, 32, 33]: Fine-tuning with rewards whose gradients can be backpropagated directly. Flow-GRPO doesn't require differentiable rewards, allowing for broader applicability.
  - Reward Weighted Regression (RWR) [34, 35, 36, 37]: Techniques that weigh samples by their rewards during fine-tuning.
  - PPO-style Policy Gradients [47, 48, 49, 50, 51, 52]: Other applications of policy gradient RL to T2I or diffusion models.
  - Training-free Alignment Methods [53, 54, 55]: Methods that adjust generation without explicit training.
3.3. Technological Evolution
The field of generative imaging has rapidly evolved:
- Early Generative Models (GANs, VAEs): Capable of generating diverse images but often struggled with fidelity or mode collapse.
- Diffusion Models (DDPM, DDIM): Introduced a new paradigm of iterative denoising from noise to data, achieving unprecedented image quality and diversity. Their foundation in SDEs provided flexibility in sampling.
- Flow Matching Models (Rectified Flow, Flow Matching): Emerged as a more efficient alternative to diffusion models, directly learning velocity fields and enabling faster, deterministic ODE-based sampling while maintaining competitive quality. These models became the backbone of state-of-the-art T2I systems like SD3.5-M and FLUX.
- Alignment with Human Preferences and Instructions: As generative models improved, the focus shifted to aligning their outputs more precisely with user intentions, human preferences, and complex instructions. This led to the adoption of RL techniques, initially for LLMs and then increasingly for T2I models.

Flow-GRPO fits into this timeline by pushing the boundaries of alignment for the most advanced image generative models (flow matching models) by integrating the powerful online RL paradigm, which was previously challenging due to the deterministic nature of these models.
3.4. Differentiation Analysis
Compared to the main methods in related work, Flow-GRPO introduces several core innovations:
- Online RL for Flow Matching (First of its Kind): Previous works applied RL primarily to diffusion models (e.g., DDPO [12]) or used offline RL (e.g., DPO [14, 39]) for flow-based models. Flow-GRPO is the first to successfully integrate online policy gradient RL into the inherently deterministic flow matching framework. This is a significant distinction, as online RL offers continuous exploration and adaptation that offline RL lacks.
- ODE-to-SDE Conversion as Key for Stochasticity: Unlike prior work that reformulates velocity prediction to estimate Gaussian distributions (e.g., [56] for text-to-speech flow models, requiring retraining the pre-trained model) or focuses on SDE-based stochasticity only at inference time [57], Flow-GRPO proposes a direct ODE-to-SDE conversion that preserves marginal distributions. This allows injecting stochasticity for RL exploration into a pre-trained deterministic flow model without retraining its core components, making it a plug-and-play solution.
- Denoising Reduction for Training Efficiency: The Denoising Reduction strategy is novel in this context. While efficient sampling is generally a goal, this specific technique of using fewer steps for training data collection but full steps for inference is crucial for making online RL practical for computationally intensive generative models. This allows Flow-GRPO to gather low-quality but informative trajectories efficiently, a key enabling factor for online RL.
- Robust Reward Hacking Prevention via KL Regularization: The paper rigorously demonstrates the effectiveness of KL regularization in preventing reward hacking (quality degradation, diversity collapse), which is a common challenge in RL applications. This is explicitly shown to be superior to simply early stopping and is a critical component for stable, high-quality RL fine-tuning.
- Generalizability Across Reward Types: Flow-GRPO is shown to be effective across various reward types: verifiable rule-based rewards (GenEval, Visual Text Rendering) and model-based human preference rewards (PickScore). This suggests a broad applicability of the framework.

In essence, Flow-GRPO innovatively bridges the gap between the efficiency and quality of flow matching models and the reasoning/alignment power of online RL, overcoming the inherent incompatibilities through clever technical strategies.
4. Methodology
4.1. Principles
The core idea of Flow-GRPO is to enhance flow matching models for text-to-image (T2I) generation by leveraging the power of online reinforcement learning (RL). This integration is driven by the principle that RL can optimize models for complex, human-defined objectives (like compositional accuracy or human preferences) that are difficult to capture with traditional supervised learning loss functions.
The theoretical basis and intuition behind Flow-GRPO can be broken down into two main principles, addressing the key challenges of applying online RL to flow models:
- Introducing Stochasticity for RL Exploration: Online RL fundamentally relies on stochastic sampling to explore the environment and learn optimal policies. However, standard flow matching models are inherently deterministic, generating images by solving Ordinary Differential Equations (ODEs). The principle here is to convert this deterministic ODE-based generative process into an equivalent Stochastic Differential Equation (SDE) process that preserves the original model's marginal probability distribution at all timesteps. This ODE-to-SDE conversion injects the necessary randomness (exploration noise) into the generation process, allowing the RL agent (the flow model) to try different actions (denoising steps leading to different images) and learn from their rewards. The underlying intuition is that while the path from noise to data becomes stochastic, the overall distribution of generated images remains consistent with the pre-trained flow model, ensuring quality while enabling exploration.
- Improving Sampling Efficiency for Online RL Training: Online RL requires collecting many trajectories (sequences of states, actions, and rewards) to update the policy. Flow models typically require numerous iterative denoising steps to generate a single high-quality image, making data collection prohibitively slow and expensive for online RL. The principle of Denoising Reduction is that for the purpose of collecting training data for RL, high-fidelity images are not strictly necessary. Instead, "low-quality but still informative trajectories" generated with significantly fewer denoising steps can be sufficient to provide a useful reward signal. The intuition is that RL optimizes based on relative preferences (which sample is better than another), and this relative signal can still be extracted even from less refined samples. By drastically cutting the number of steps during training, the wall-clock time for data collection is reduced, making online RL practical. The full, high-step schedule is then reserved for inference to ensure top-quality final outputs.

By adhering to these principles, Flow-GRPO aims to bridge the gap between efficient, high-quality image generation and the powerful optimization capabilities of online RL.
4.2. Core Methodology In-depth (Layer by Layer)
Flow-GRPO adapts the GRPO algorithm for flow matching models by introducing two key strategies: ODE-to-SDE conversion for stochasticity and Denoising Reduction for efficiency.
4.2.1. GRPO on Flow Matching
The overall goal of RL is to learn a policy $\pi_\theta$ (parameterized by $\theta$, which represents the parameters of the flow model's velocity field predictor) that maximizes the expected cumulative reward. The paper formulates this with a regularized objective:
$
\operatorname*{max}_{\theta} \mathbb{E}_{(s_0, a_0, \ldots, s_T, a_T) \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \left( R(s_t, a_t) - \beta D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid s_t) \,||\, \pi_{\mathrm{ref}}(\cdot \mid s_t)\big) \right) \right]
$
Here:
- $\theta$: Parameters of the policy (the flow model).
- $(s_0, a_0, \ldots, s_T, a_T)$: A trajectory of states and actions sampled according to the policy $\pi_\theta$.
- $R(s_t, a_t)$: The reward received at timestep $t$. In this MDP, rewards are typically given only at the final step (when the image is generated).
- $\beta$: A hyperparameter controlling the strength of the KL divergence regularization.
- $D_{\mathrm{KL}}(\pi_{\theta}(\cdot \mid s_t) \,||\, \pi_{\mathrm{ref}}(\cdot \mid s_t))$: KL divergence between the current policy and a reference policy (typically the old policy or the initial pre-trained model) at state $s_t$. This regularization term prevents the policy from drifting too far from the reference, mitigating reward hacking and maintaining stability.
Denoising as an MDP:
As described in Section 3 of the paper, the iterative denoising process in flow matching models is framed as an MDP $(\mathcal{S}, \mathcal{A}, \rho_0, P, R)$.
- State $s_t$: At timestep $t$, the state is defined as $s_t = (\boldsymbol{c}, t, \boldsymbol{x}_t)$, where $\boldsymbol{c}$ is the text condition (prompt), $t$ is the current timestep, and $\boldsymbol{x}_t$ is the current noisy image representation.
- Action $a_t$: The action is the denoised sample $\boldsymbol{x}_{t-1}$ predicted by the model, representing the image at the previous timestep (closer to the clean image).
- Policy $\pi$: The policy is $\pi(a_t \mid s_t) = p_{\theta}(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{c})$, which describes the probability distribution over possible next image states given the current noisy image and the text condition.
- Transition $P$: This is deterministic, meaning applying action $a_t$ to state $s_t$ always leads to a specific next state $s_{t+1}$. The prompt remains constant, the timestep decreases by 1, and the image becomes $\boldsymbol{x}_{t-1}$.
- Initial State Distribution $\rho_0$: The process starts with a randomly sampled prompt $\boldsymbol{c}$, at the maximum timestep $T$, and with an initial noisy image sampled from a standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
- Reward $R$: The reward is sparse, given only at the final step: $R(s_t, a_t) = r(\boldsymbol{x}_0, \boldsymbol{c})$ if $t = 0$, and 0 otherwise. This is the task-specific reward (e.g., GenEval score, OCR accuracy, PickScore).
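The rollout below is a minimal sketch of this MDP, not the paper's implementation: `policy_step`, `reward_fn`, the latent shape, and the default group size are placeholder assumptions, and the stochastic transition itself (e.g., the SDE step of Section 4.2.2) is supplied by the caller.

```python
import torch

def collect_group(policy_step, reward_fn, prompt, G=24, T=10, shape=(16, 32, 32)):
    """Sample a group of G denoising trajectories for one prompt.

    policy_step(x, t, prompt) -> (x_next, logp): one stochastic transition and its log-prob.
    reward_fn(x0, prompt)     -> scalar: scores only the final image (sparse terminal reward).
    """
    dt = 1.0 / T
    trajectories = []
    for _ in range(G):
        x = torch.randn(shape)                 # initial state: x_1 ~ N(0, I)
        steps = []
        for i in range(T):
            t = 1.0 - i * dt                   # flow time runs from 1 (noise) toward 0 (data)
            x_next, logp = policy_step(x, t, prompt)
            steps.append({"t": t, "x": x, "x_next": x_next, "logp": logp})
            x = x_next
        trajectories.append({"steps": steps, "reward": reward_fn(x, prompt)})
    return trajectories
```

The terminal reward of each trajectory is what GRPO normalizes within the group, as described next.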
GRPO Advantage Estimation:
GRPO [16] uses a group-relative formulation for estimating the advantage. Given a prompt $\pmb{c}$, the flow model samples a group of $G$ individual images $\{\pmb{x}_0^i\}_{i=1}^G$ and their corresponding trajectories. The advantage for the $i$-th image at timestep $t$ is calculated by normalizing the group-level rewards:
$
\hat{A}_t^i = \frac{R(\pmb{x}_0^i, \pmb{c}) - \mathrm{mean}\big(\{R(\pmb{x}_0^i, \pmb{c})\}_{i=1}^G\big)}{\mathrm{std}\big(\{R(\pmb{x}_0^i, \pmb{c})\}_{i=1}^G\big)}
$
Here:
- $R(\pmb{x}_0^i, \pmb{c})$: The final reward for the $i$-th generated image given prompt $\pmb{c}$.
- $\mathrm{mean}(\cdot)$ and $\mathrm{std}(\cdot)$: The mean and standard deviation of the rewards across all $G$ images in the group for the same prompt. This normalization makes the advantage estimate robust to the absolute scale of rewards and focuses on relative performance within a group.
GRPO Objective:
GRPO optimizes the policy model by maximizing the following objective:
$
\mathcal{L}_{\text{Flow-GRPO}}(\theta) = \mathbb{E}_{c \sim \mathcal{C},\, \{\boldsymbol{x}^i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid c)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{T} \sum_{t=0}^{T-1} \left( \operatorname{min}\!\left( r_t^i(\theta)\, \hat{A}_t^i,\ \mathrm{clip}\big( r_t^i(\theta), 1-\varepsilon, 1+\varepsilon \big)\, \hat{A}_t^i \right) - \beta\, D_{\mathrm{KL}}(\pi_{\theta} \,||\, \pi_{\mathrm{ref}}) \right) \right]
$
where the probability ratio is:
$
r_t^i(\theta) = \frac{p_{\theta}(x_{t-1}^i \mid x_t^i, c)}{p_{\theta_{\mathrm{old}}}(x_{t-1}^i \mid x_t^i, c)}
$
And:
- $\mathcal{C}$: Distribution of prompts.
- $\theta_{\mathrm{old}}$: Parameters of the policy used to collect the current batch of samples (the old policy), which is periodically updated to $\theta$.
- $\varepsilon$: A small clipping parameter (similar to PPO [20]) that limits the magnitude of policy updates, ensuring stability.
- $\beta$: The KL regularization coefficient, as explained earlier.

This objective aims to increase the probability of actions that lead to higher-than-average rewards (positive advantage) and decrease the probability of actions leading to lower-than-average rewards (negative advantage), while keeping the policy close to the old policy and preventing excessive divergence.
4.2.2. From ODE to SDE
The deterministic nature of flow matching models (based on ODEs) presents two problems for GRPO:
- Computing the probability ratio is computationally expensive under deterministic dynamics due to divergence estimation.
- More critically, RL relies on exploration through stochastic sampling. Deterministic sampling lacks the randomness needed for RL to explore different outcomes and learn.

To address this, the paper converts the deterministic Flow-ODE into an equivalent SDE that matches the original model's marginal probability density function at all timesteps.
Original ODE:
The standard flow matching ODE is given by:
$
\mathrm{d}\pmb{x}_t = \pmb{v}_t \mathrm{d}t
$
where $\pmb{v}_t$ is the velocity field learned via the flow matching objective. This ODE implies a one-to-one mapping between successive timesteps.
Generic SDE and Fokker-Planck Equation:
A generic SDE has the form:
$
\mathrm{d}\pmb{x}_t = f_{\mathrm{SDE}}(\pmb{x}_t, t)\,\mathrm{d}t + \sigma_t\, \mathrm{d}\pmb{w}
$
where:
- $f_{\mathrm{SDE}}(\pmb{x}_t, t)$: The drift coefficient.
- $\sigma_t$: The diffusion coefficient controlling the level of stochasticity.
- $\mathrm{d}\pmb{w}$: Increments of a Wiener process (standard Brownian motion).

The marginal probability density of an SDE evolves according to the Fokker-Planck equation [74]:
$
\partial_t p_t(\pmb{x}) = - \nabla \cdot [ f_{\mathrm{SDE}}(\pmb{x}_t, t) p_t(\pmb{x}) ] + \frac{1}{2} \nabla^2 [ \sigma_t^2 p_t(\pmb{x}) ]
$
For the deterministic ODE (Eq. 10), its marginal probability density evolves as:
$
\partial_t p_t(\pmb{x}) = - \nabla \cdot [ \pmb{v}_t(\pmb{x}_t, t) p_t(\pmb{x}) ]
$
Equating Marginal Distributions:
To ensure the SDE shares the same marginal distribution as the ODE, their Fokker-Planck equations must be equal:
$
- \nabla \cdot [ f_{\mathrm{SDE}}\, p_t(\pmb{x}) ] + \frac{1}{2} \nabla^2 [ \sigma_t^2 p_t(\pmb{x}) ] = - \nabla \cdot [ \pmb{v}_t(\pmb{x}_t, t) p_t(\pmb{x}) ]
$
Using the identity $\nabla p_t(\pmb{x}) = p_t(\pmb{x}) \nabla \log p_t(\pmb{x})$, and after substituting and simplifying (detailed in Appendix A), the drift coefficient is derived as:
$
f_{\mathrm{SDE}} = \boldsymbol{v}_t(\boldsymbol{x}_t, t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\boldsymbol{x})
$
This leads to the forward SDE with the desired marginal distribution:
$
\mathrm{d}\pmb{x}_t = \bigg( \pmb{v}_t(\pmb{x}_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\pmb{x}_t) \bigg)\mathrm{d}t + \sigma_t\, \mathrm{d}\pmb{w}
$
Here, $\nabla \log p_t(\pmb{x}_t)$ is the score function.
Reverse-Time SDE for Sampling:
For practical sampling, a reverse-time SDE is needed, which runs from the final state back to the initial state. The relationship between forward and reverse-time SDEs is established by [75, 23]. If a forward SDE is $\mathrm{d}\pmb{x}_t = f(\pmb{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\pmb{w}$, its reverse-time SDE is:
$
\mathrm{d}\pmb{x}_t = \left[ f(\pmb{x}_t, t) - g^2(t) \nabla \log p_t(\pmb{x}_t) \right]\mathrm{d}t + g(t)\,\mathrm{d}\overline{\pmb{w}}
$
Setting $f(\pmb{x}_t, t) = f_{\mathrm{SDE}}(\pmb{x}_t, t)$ and $g(t) = \sigma_t$, and substituting from Eq. 17, we get the reverse-time SDE:
$
\mathrm{d}\pmb{x}_t = \bigg[ \pmb{v}_t(\pmb{x}_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(\pmb{x}_t) - \sigma_t^2 \nabla \log p_t(\pmb{x}_t) \bigg]\mathrm{d}t + \sigma_t\, \mathrm{d}\overline{\pmb{w}}
$
This simplifies to:
$
\mathrm{d}\pmb{x}_t = \left( \pmb{v}_t(\pmb{x}_t) - \frac{\sigma_t^2}{2} \nabla \log p_t(\pmb{x}_t) \right) \mathrm{d}t + \sigma_t\, \mathrm{d}\overline{\pmb{w}}
$
The score $\nabla \log p_t(\pmb{x}_t)$ is implicitly linked to the velocity field $\pmb{v}_t$. For the Rectified Flow framework used in the paper, the authors use the linear interpolation $\pmb{x}_t = (1-t)\pmb{x}_0 + t\pmb{x}_1$, where $\pmb{x}_0$ is a data sample and $\pmb{x}_1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
From this, the conditional score is $\nabla \log p_t(\pmb{x}_t \mid \pmb{x}_0) = -\frac{\pmb{x}_t - (1-t)\pmb{x}_0}{t^2}$.
The marginal score follows by taking the posterior expectation over $\pmb{x}_0$.
After a series of derivations (Equations 22-26 in Appendix A), the score function is expressed in terms of $\pmb{x}$ and $\pmb{v}_t(\pmb{x})$:
$
\nabla \log p_t(\pmb{x}) = - \frac{\pmb{x}}{t} - \frac{1-t}{t} \pmb{v}_t(\pmb{x})
$
Substituting this score function back into the reverse-time SDE (Eq. 21) yields the final SDE for Rectified Flow:
$
\mathrm{d}\pmb{x}_t = \left[ \pmb{v}_t(\pmb{x}_t) + \frac{\sigma_t^2}{2t} \big( \pmb{x}_t + (1-t)\, \pmb{v}_t(\pmb{x}_t) \big) \right] \mathrm{d}t + \sigma_t\, \mathrm{d}\overline{\pmb{w}}
$
This is the SDE that the Flow-GRPO model will sample from. To numerically solve this SDE, Euler-Maruyama discretization is applied, resulting in the following update rule:
$
\boxed{\ \pmb{x}_{t+\Delta t} = \pmb{x}_t + \left[ v_{\theta}(\pmb{x}_t, t) + \frac{\sigma_t^2}{2t} \big( \pmb{x}_t + (1-t)\, v_{\theta}(\pmb{x}_t, t) \big) \right] \Delta t + \sigma_t \sqrt{\Delta t}\; \epsilon\ }
$
Here:
- $\pmb{x}_t$: The image representation at timestep $t$.
- $v_{\theta}(\pmb{x}_t, t)$: The velocity field predicted by the model (parameterized by $\theta$) at state $\pmb{x}_t$ and timestep $t$.
- $\sigma_t$: The diffusion coefficient, which controls the level of stochasticity. The paper defines $\sigma_t$ through a scalar noise-level hyper-parameter $a$.
- $\Delta t$: The timestep size for discretization.
- $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$: A sample from a standard Gaussian distribution, explicitly injecting stochasticity into the sampling process.

This SDE update rule defines the policy $\pi_{\theta}(\pmb{x}_{t+\Delta t} \mid \pmb{x}_t)$, which is an isotropic Gaussian distribution. This allows for a closed-form computation of the KL divergence between $\pi_{\theta}$ and the reference policy $\pi_{\mathrm{ref}}$ (which would be based on $v_{\mathrm{ref}}$):
$
D_{\mathrm{KL}}(\pi_{\theta} || \pi_{\mathrm{ref}}) = \frac{||\overline{x}_{t+\Delta t, \theta} - \overline{x}_{t+\Delta t, \mathrm{ref}}||^2}{2\sigma_t^2 \Delta t} = \frac{\Delta t}{2} \left( \frac{\sigma_t (1-t)}{2t} + \frac{1}{\sigma_t} \right)^2 ||v_{\theta}(x_t, t) - v_{\mathrm{ref}}(x_t, t)||^2
$
Here:
- $\overline{x}_{t+\Delta t, \theta}$: The mean of the distribution for $x_{t+\Delta t}$ under $\pi_{\theta}$.
- $\overline{x}_{t+\Delta t, \mathrm{ref}}$: The mean of the distribution for $x_{t+\Delta t}$ under $\pi_{\mathrm{ref}}$.

This formula highlights that the KL divergence is proportional to the squared difference between the velocity fields of the current and reference policies, scaled by terms related to $\sigma_t$, $t$, and $\Delta t$. This makes the KL regularization directly influence the similarity of the learned velocity field to the reference.
4.2.3. Denoising Reduction
To address the high computational cost of data collection for online RL, the Denoising Reduction strategy is employed:
- Training Phase: During online RL training, the model uses significantly fewer denoising steps (e.g., 10) to generate samples. These samples, while of lower visual quality, are sufficient to provide a useful reward signal for GRPO's relative advantage estimation. This drastically reduces the time and resources needed for data collection.
- Inference Phase: For generating final, high-quality images during evaluation or deployment, the model reverts to its original, full denoising steps (e.g., 40 for SD3.5-M).

This strategy allows for faster RL training without compromising the quality of the final outputs, as the underlying flow model is still capable of high-fidelity generation when given enough steps. A minimal sketch of the two schedules follows.
5. Experimental Setup
5.1. Datasets
The experiments evaluate Flow-GRPO across three main tasks, each with specific prompt generation and reward definitions:
5.1.1. Compositional Image Generation
- Dataset Source: The GenEval [17] benchmark.
- Characteristics: This benchmark assesses
T2Imodels on complex compositional prompts that require precise understanding and generation of:- Object Counting: e.g., "three red apples."
- Spatial Relations: e.g., "a cat on the roof of a house."
- Attribute Binding: e.g., "a blue car and a red car."
- Prompt Generation: Training prompts are generated using official GenEval scripts, which employ templates and random combinations to create a diverse prompt dataset. The test set is strictly deduplicated to avoid overlap with training data, treating prompts differing only in object order as identical.
- Prompt Ratio: Based on the base model's initial accuracy, the ratio of prompt types used for training is: Position : Counting : Attribute Binding : Colors : Two Objects : Single Object =
7 : 5 : 3 : 1 : 1 : 0. This prioritizes more challenging compositional aspects. - Example Data Sample (GenEval-style prompt): "a photo of a blue pizza and a yellow baseball glove." (As seen in Figure 24 from the appendix).
5.1.2. Visual Text Rendering
- Dataset Source: Prompts generated by
GPT-4o. - Characteristics: This task evaluates the model's ability to accurately render specified text within an image.
- Prompt Generation: Each prompt follows the template
A sign that says "text". The placeholder"text"is the exact string the model should render. 20K training prompts and 1K test prompts were generated byGPT-4o. - Example Data Sample (Visual Text Rendering prompt):
A sign that says "caution: telepathic subjects"(As seen in Figure 25 from the appendix).
5.1.3. Human Preference Alignment
- Reward Model Source: PickScore [19].
- Characteristics: This task aims to align
T2Imodels with general human aesthetic and semantic preferences. - Prompt Generation: The paper uses prompts from various sources to train for human preference alignment.
- Example Data Sample (PickScore-style prompt): "a woman on top of a horse" (As seen in Figure 28 from the appendix).
5.2. Evaluation Metrics
Each evaluation metric below is described by its conceptual definition, its mathematical formula (where applicable), and an explanation of the symbols involved.
5.2.1. Task-Specific Metrics
- GenEval Accuracy (Compositional Image Generation):
- Conceptual Definition: Measures how accurately the generated image reflects complex compositional elements specified in the text prompt, such as correct object counts, colors, and spatial relationships. It's often assessed by detecting objects and analyzing their attributes and arrangements.
- Mathematical Formula: The reward function directly serves as the accuracy metric for GenEval tasks.
- Counting: $ r = 1 - \frac{|N_{\mathrm{gen}} - N_{\mathrm{ref}}|}{\bar{N}_{\mathrm{ref}}} $
- Position / Color: If object count is correct, a partial reward is given. The remaining reward is granted if the predicted position or color is also correct.
  - Symbol Explanation:
    - $N_{\mathrm{gen}}$: Number of objects generated by the model.
    - $N_{\mathrm{ref}}$: Number of objects referenced in the prompt.
    - $\bar{N}_{\mathrm{ref}}$: (Implied from the context) the reference count, or an average/expected reference count, used for normalization.
- OCR Accuracy (Visual Text Rendering):
- Conceptual Definition: Quantifies the accuracy of text rendered within the generated image compared to the target text specified in the prompt. It's based on the minimum changes needed to transform the rendered text into the target text.
- Mathematical Formula: $ r = \mathrm{max}\left(1 - \frac{N_{\mathrm{e}}}{N_{\mathrm{ref}}}, 0\right) $
  - Symbol Explanation:
    - $N_{\mathrm{e}}$: The minimum edit distance (e.g., Levenshtein distance) between the text rendered in the image and the target text from the prompt.
    - $N_{\mathrm{ref}}$: The number of characters in the target text (the string within quotation marks in the prompt). A code sketch of this reward, together with the counting reward above, appears at the end of this subsection.
- PickScore (Human Preference Alignment):
  - Conceptual Definition: A model-based reward that predicts human preferences for T2I generated images. It is trained on a large dataset of human-annotated pairwise comparisons of images from the same prompt and provides an overall score reflecting prompt alignment and visual quality.
  - Mathematical Formula: PickScore is a neural network model, so there is no simple closed-form formula. It outputs a scalar score $S_{\mathrm{PickScore}}(\mathrm{image}, \mathrm{prompt})$.
  - Symbol Explanation:
    - $\mathrm{image}$: The generated image.
    - $\mathrm{prompt}$: The text prompt used for generation.
    - $S_{\mathrm{PickScore}}$: A scalar score indicating the model's predicted human preference for the image-prompt pair.
5.2.2. Image Quality & Preference Metrics (for Reward Hacking Detection)
To detect reward hacking (where task-specific reward increases but general image quality or diversity declines), the paper uses several automatic image quality metrics, all computed on DrawBench [1], a comprehensive benchmark with diverse prompts.
- Aesthetic Score [59]:
- Conceptual Definition: A metric that predicts the perceived aesthetic quality of an image, typically trained on human aesthetic ratings. It aims to capture subjective beauty.
  - Mathematical Formula: It is a CLIP-based linear regressor. The formula is typically not published as a simple equation but represents the output of a trained model: $ S_{\mathrm{Aesthetic}} = \mathrm{Regressor}(\mathrm{CLIP\_Features}(\mathrm{image})) $
  - Symbol Explanation:
    - $\mathrm{image}$: The input image.
    - $\mathrm{CLIP\_Features}(\mathrm{image})$: Feature embeddings extracted from the image using a pre-trained CLIP model.
    - $\mathrm{Regressor}$: A linear regression model trained to map CLIP features to aesthetic scores.
    - $S_{\mathrm{Aesthetic}}$: The predicted aesthetic score.
- DeQA score [60]:
  - Conceptual Definition: A multimodal large language model (MLLM)-based image quality assessment (IQA) model. It quantifies how distortions, texture damage, and other low-level artifacts affect perceived quality, providing a more objective measure of image fidelity.
  - Mathematical Formula: Similar to PickScore, DeQA is a complex neural network. Its output is a scalar score $S_{\mathrm{DeQA}}(\mathrm{image})$.
  - Symbol Explanation:
    - $\mathrm{image}$: The input image.
    - $S_{\mathrm{DeQA}}$: A scalar score representing the image's quality in terms of distortions and artifacts.
- ImageReward [32]:
  - Conceptual Definition: A general-purpose T2I human preference reward model that evaluates multiple criteria, including text-image alignment, visual fidelity, and harmlessness.
  - Mathematical Formula: ImageReward is a deep neural network that outputs a scalar score $S_{\mathrm{ImageReward}}(\mathrm{image}, \mathrm{prompt})$.
  - Symbol Explanation:
    - $\mathrm{image}$: The generated image.
    - $\mathrm{prompt}$: The text prompt.
    - $S_{\mathrm{ImageReward}}$: A scalar score reflecting human preference based on alignment, fidelity, and harmlessness.
- UnifiedReward [61]:
  - Conceptual Definition: A recently proposed unified reward model designed for multimodal understanding and generation, aiming to achieve state-of-the-art performance in human preference assessment. It is intended to be a comprehensive measure of overall quality and alignment.
  - Mathematical Formula: UnifiedReward is also a complex neural network, producing a scalar score $S_{\mathrm{UnifiedReward}}(\mathrm{image}, \mathrm{prompt})$.
  - Symbol Explanation:
    - $\mathrm{image}$: The generated image.
    - $\mathrm{prompt}$: The text prompt.
    - $S_{\mathrm{UnifiedReward}}$: A scalar score representing a unified measure of multimodal understanding and generation quality.
- Diversity Score: (Implicitly measured through qualitative assessment and quantitative metrics such as the spread of image embeddings; no standalone formula is given.)
  - Conceptual Definition: Measures the variety and range of outputs generated by a model for a given prompt or set of prompts. A high diversity score indicates the model can produce distinct and varied images, while low diversity might suggest mode collapse.
  - Mathematical Formula: Not explicitly provided in the paper; diversity is typically assessed via metrics like FID (Fréchet Inception Distance), the width of the CLIP Score distribution, or qualitative observation of generated samples. In Table 6, CLIP Score is used as a proxy for text-image alignment, and a Diversity Score is explicitly reported, likely derived from the spread of embeddings.
  - Symbol Explanation: Not applicable for a generic formula; in Table 6, CLIP Score ↑ indicates that higher is better for text-image alignment, and Diversity Score ↑ indicates higher is better for diversity.
5.3. Baselines
Flow-GRPO was compared against several representative alignment methods, categorized by their approach:
- Supervised Fine-Tuning (SFT):
- Description: This baseline selects the highest-reward image within each group of generated images and fine-tunes the model on it using standard supervised learning objectives.
- Representativeness: Represents a straightforward, direct optimization approach based on explicit high-quality samples.
- Flow-DPO [14, 39] (Direct Preference Optimization):
- Description: An
offline RLtechnique that uses pairwise preferences. For each group of generated images, the highest-reward image is designated as the "chosen" sample, and the lowest-reward image as the "rejected" sample. TheDPO lossis then applied to these pairs. - Representativeness: A prominent
offline RLmethod widely used for alignment tasks, particularly inLLMsand increasingly ingenerative models.
- Description: An
- Flow-RWR [14, 76] (Reward Weighted Regression):
- Description: An
online reward-weighted regressionmethod that applies asoftmaxover rewards within each group and performsreward-weighted likelihood maximization. It guides the model to prioritize high-reward regions. - Representativeness: A class of
RLmethods that use rewards to weight training samples, common for fine-tuning.
- Description: An
- Online Variants (of SFT, Flow-DPO, Flow-RWR):
- Description: The "online" versions of the above methods update their data collection models (the policies generating samples for training) every 40 steps, reflecting an adaptive learning process, similar to
Flow-GRPO. - Representativeness: Crucial for a fair comparison against
Flow-GRPO, which is anonline RLmethod itself.
- Description: The "online" versions of the above methods update their data collection models (the policies generating samples for training) every 40 steps, reflecting an adaptive learning process, similar to
- DDPO [12] (Training Diffusion Models with Reinforcement Learning):
- Description: An
online RLmethod originally developed fordiffusion-based backbones. The paper adapted it toflow-matching modelsusing theODE-to-SDEconversion for comparison. - Representativeness: A direct
RLcompetitor for generative models, specifically diffusion, and thus relevant for showingFlow-GRPO's advantages onflow models.
- Description: An
- ReFL [32] (Reward-guided Fine-tuning of Latent Diffusion):
- Description: Directly fine-tunes
diffusion modelsby viewingreward model scoresashuman preference lossesandback-propagating gradientsto a randomly picked latetimestep. - Representativeness: Another
RL-likealignment method that uses differentiable rewards.
- Description: Directly fine-tunes
- ORW [35] (Online Reward-Weighted Regression):
-
Description: An
online reward-weighted regressionmethod that usesWasserstein-2 regularizationto preventpolicy collapseand maintain diversity, differing fromKL regularization. -
Representativeness: A distinct
online RLapproach that addressespolicy collapseusing a different regularization technique thanFlow-GRPO.These baselines collectively cover various strategies for aligning
T2Imodels, including supervised approaches,offline RL, and otheronline RLvariants, allowingFlow-GRPOto be evaluated comprehensively.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly validate Flow-GRPO's effectiveness across multiple text-to-image tasks, demonstrating significant improvements in compositional generation, text rendering, and human preference alignment, all while maintaining image quality and diversity.
Overall Performance and Reward Hacking Mitigation:
Figure 1 (from the original paper) provides a high-level overview:
- (a) GenEval performance rises steadily throughout Flow-GRPO's training and outperforms GPT-4o. This highlights the primary success in compositional tasks.
- (b) Image quality metrics on DrawBench [1] remain essentially unchanged. This is crucial, indicating that Flow-GRPO achieves its task-specific gains without sacrificing general image quality, effectively mitigating reward hacking.
- (c) Human Preference Scores on DrawBench improve after training. This shows the method can also align with broader aesthetic and preference objectives.

The following figure (Figure 1 from the original paper) summarizes Flow-GRPO's overall performance:

![Figure 1: (a) GenEval performance rises steadily throughout Flow-GRPO's training and outperforms GPT-4o. (b) Image quality metrics on DrawBench \[1\] remain essentially unchanged. (c) Human Preference Scores on DrawBench improves after training. Results show that Flow-GRPO enhances the desired capability while preserving image quality and exhibiting minimal reward-hacking.](/files/papers/6921af93d8097f0bc1d013a8/images/1.jpg)
Compositional Image Generation (GenEval):
Flow-GRPO significantly boosts SD3.5-M's ability to handle complex compositional prompts. The following are the results from Table 1 of the original paper:
| Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|---|---|---|---|---|---|---|---|
| Diffusion Models | |||||||
| LDM [62] | 0.37 | 0.92 | 0.29 | 0.23 | 0.70 | 0.02 | 0.05 |
| SD1.5 [62] | 0.43 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 |
| SD2.1 [62] | 0.50 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 |
| SD-XL [63] | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| DALLE-2 [64] | 0.52 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 |
| DALLE-3 [65] | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Autoregressive Models | |||||||
| Show-o [66] | 0.53 | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 |
| Emu3-Gen [67] | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| JanusFlow [68] | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| Janus-Pro-7B [69] | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| GPT-4o [18] | 0.84 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 |
| Flow Matching Models | |||||||
| FLUX.1 Dev [5] | 0.66 | 0.98 | 0.81 | 0.74 | 0.79 | 0.22 | 0.45 |
| SD3.5-L [4] | 0.71 | 0.98 | 0.89 | 0.73 | 0.83 | 0.34 | 0.47 |
| SANA-1.5 4.8B [70] | 0.81 | 0.99 | 0.93 | 0.86 | 0.84 | 0.59 | 0.65 |
| SD3.5-M [4] | 0.63 | 0.98 | 0.78 | 0.50 | 0.81 | 0.24 | 0.52 |
| SD3.5-M+Flow-GRPO | 0.95 | 1.00 | 0.99 | 0.95 | 0.92 | 0.99 | 0.86 |
As shown in Table 1, SD3.5-M with Flow-GRPO achieved an outstanding Overall GenEval score of 0.95, a substantial increase from the base SD3.5-M's 0.63. This score is not only the best among all models listed (including Diffusion Models, Autoregressive Models, and other Flow Matching Models), but it also significantly outperforms GPT-4o (0.84), which was previously a strong performer.
The improvements are consistent across all sub-tasks, particularly in Counting (0.50 → 0.95), Position (0.24 → 0.99), and Attribute Binding (0.52 → 0.86), which are known challenges for T2I models. This indicates Flow-GRPO's ability to learn fine-grained control and reasoning.
Figure 3 from the original paper provides qualitative comparisons on the GenEval benchmark, further illustrating Flow-GRPO's superior performance in Counting, Colors, Attribute Binding, and Position. For example, Flow-GRPO correctly generates the specified number of objects and their attributes, where the base SD3.5-M often fails.
The following figure (Figure 3 from the original paper) visually compares Flow-GRPO's qualitative performance on the GenEval benchmark:

Visual Text Rendering and Human Preference Alignment: The following are the results from Table 2 of the original paper:
(The first three columns are the task metrics; Aesthetic and DeQA are image-quality metrics; ImgRwd, PickScore, and UniRwd are preference scores.)

| Model | GenEval | OCR Acc. | PickScore (task) | Aesthetic | DeQA | ImgRwd | PickScore | UniRwd |
|---|---|---|---|---|---|---|---|---|
| SD3.5-M | 0.63 | 0.59 | 21.72 | 5.39 | 4.07 | 0.87 | 22.34 | 3.33 |
| Compositional Image Generation | | | | | | | | |
| Flow-GRPO (w/o KL) | 0.95 | — | — | 4.93 | 2.77 | 0.44 | 21.16 | 2.94 |
| Flow-GRPO (w/ KL) | 0.95 | — | — | 5.25 | 4.01 | 1.03 | 22.37 | 3.51 |
| Visual Text Rendering | | | | | | | | |
| Flow-GRPO (w/o KL) | — | 0.93 | — | 5.13 | 3.66 | 0.58 | 21.79 | 3.15 |
| Flow-GRPO (w/ KL) | — | 0.92 | — | 5.32 | 4.06 | 0.95 | 22.44 | 3.42 |
| Human Preference Alignment | | | | | | | | |
| Flow-GRPO (w/o KL) | — | — | 23.41 | 6.15 | 4.16 | 1.24 | 23.56 | 3.57 |
| Flow-GRPO (w/ KL) | — | — | 23.31 | 5.92 | 4.22 | 1.28 | 23.53 | 3.66 |
Table 2 confirms these gains and further highlights the role of KL regularization:
- Visual Text Rendering: Flow-GRPO (w/ KL) increases OCR Acc. from 0.59 to 0.92. Crucially, Aesthetic, DeQA, ImageReward, PickScore, and UnifiedReward metrics remain stable or slightly improve, demonstrating that Flow-GRPO enhances text rendering without compromising general image quality.
- Human Preference Alignment: Flow-GRPO (w/ KL) improves PickScore (the task metric) from 21.72 to 23.31. Again, general quality metrics are preserved.
- Impact of KL Regularization: Comparing Flow-GRPO (w/o KL) with Flow-GRPO (w/ KL) clearly shows the importance of KL. Without KL, image quality (e.g., DeQA drops from 4.07 to 2.77 for compositional generation) and preference scores (e.g., ImageReward drops from 0.87 to 0.44) significantly degrade, even if task metrics are high. This is a clear indication of reward hacking. The KL constraint effectively mitigates this.
Comparison with Other Alignment Methods:
Figure 4 from the original paper compares Flow-GRPO with various online and offline alignment methods on the Compositional Generation Task.
The following figure (Figure 4 from the original paper) shows the comparison with other alignment methods:
This figure is a chart comparing the GenEval scores of different alignment methods on the compositional generation task. As the number of training prompts increases, Flow-GRPO's GenEval score rises markedly, exceeding 0.9, while the other methods vary in performance.
Flow-GRPO consistently outperforms all baselines (SFT, Flow-DPO, Flow-RWR, and their online variants) by a significant margin in terms of GenEval score. For instance, Flow-GRPO reaches over 0.9, while the next best Online DPO struggles to pass 0.8. This indicates the superior effectiveness of online policy gradient with GRPO for flow matching models.
6.2. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies to understand the behavior and robustness of Flow-GRPO's key components.
6.2.1. Reward Hacking and KL Regularization
The impact of KL regularization is a critical finding:
- Observation: Without the KL constraint (Flow-GRPO (w/o KL)), models achieve high task-specific rewards but suffer from quality degradation (for GenEval and OCR) and diversity decline (for PickScore). For example, in Table 2, DeQA scores drop significantly when KL is removed. In the Human Preference Alignment task, KL prevents a collapse in visual diversity, where outputs converge to a single style.
- Conclusion: KL regularization is not merely an early stopping mechanism. A properly tuned KL coefficient allows Flow-GRPO to match the high task rewards of the KL-free version while preserving image quality and diversity, though it might require longer training. The following figure (Figure 6 from the original paper) visually demonstrates the effect of KL Regularization:
This figure illustrates the effect of KL regularization. The "Quality Degradation" panel on the left compares the quality of apple images generated by different models, while the "Diversity Decline" panel on the right shows the diversity of Lincoln-speech images generated by different models. Images produced with KL regularization maintain both quality and diversity.
The following figure (Figure 12 from the original paper) shows learning curves with and without KL for all three tasks:
This chart shows evaluation results over training steps, with and without KL, for the three tasks: (a) GenEval score for compositional image generation, (b) OCR accuracy for visual text rendering, and (c) PickScore for human preference alignment. The KL penalty effectively suppresses reward hacking.
This further emphasizes that KL penalty slows early training but effectively suppresses reward hacking, leading to more robust models.
6.2.2. Effect of Denoising Reduction
The Denoising Reduction strategy is crucial for training efficiency.
- Observation: Figure 7(a) shows that reducing denoising steps during training from 40 to 10 achieves a substantial speedup (convergence measured in GPU time) without impacting the final reward on the GenEval task. Further reduction to 5 steps does not consistently improve speed and can sometimes slow training or make it unstable.
- Conclusion: Using a moderate number of denoising steps (e.g., 10) during training is an effective trade-off, enabling faster convergence without sacrificing final performance at inference (where 40 steps are used). This confirms that low-quality but informative trajectories are sufficient for RL learning. The following figure (Figure 7 from the original paper) illustrates the effect of Denoising Reduction on GenEval:
This chart shows the effect of Denoising Reduction on the GenEval score and of the noise-level ablation on OCR accuracy. Panel (a) plots performance against GPU training time for different numbers of denoising steps; panel (b) shows OCR accuracy for different noise levels, with the best results around a moderate noise level.
The following figure (Figure 9 from the original paper) provides extended Denoising Reduction ablations for Visual Text Rendering and Human Preference Alignment:
This chart shows Flow-GRPO's training behavior on visual text rendering and human preference alignment. The left panel plots OCR accuracy against training time, and the right panel shows the PickScore trend; curves for different numbers of denoising steps are marked, illustrating the gain in training efficiency.
These graphs confirm similar trends across tasks: fewer denoising steps (e.g., 10) significantly accelerate training while achieving comparable final performance.
6.2.3. Effect of Noise Level ($a$)
The noise-level parameter $a$, which scales the diffusion coefficient $\sigma_t$, controls the level of stochasticity injected into the SDE.
- Observation: Figure 7(b) shows that a small noise level $a$ (e.g., 0.1) limits exploration and slows reward improvement. Increasing $a$ up to 0.7 boosts exploration and speeds up reward gains (maximizing OCR accuracy). Beyond 0.7 (e.g., 1.0), further increases provide no additional benefit, as exploration is already sufficient.
- Conclusion: A moderate noise level is optimal. Too much noise can degrade image quality, leading to zero reward and failed training, indicating a balance between exploration and maintaining image coherence is necessary. The following figure (Figure 7 from the original paper) illustrates the effect of Noise Level:
This chart shows the effect of Denoising Reduction on the GenEval score and of the noise-level ablation on OCR accuracy. Panel (a) plots performance against GPU training time for different numbers of denoising steps; panel (b) shows OCR accuracy for different noise levels, with the best results around a moderate noise level.
6.2.4. Effect of Group Size ($G$)
The group size $G$ is crucial for GRPO's advantage estimation.
- Observation: Figure 5 shows that reducing the group size (e.g., to 6) led to unstable training and eventual collapse when using PickScore as the reward function, whereas the larger group size of 24 remained stable.
- Conclusion: Smaller group sizes produce inaccurate advantage estimates, increasing variance and leading to training collapse. A sufficiently large group size (e.g., $G = 24$) is necessary for stable and effective GRPO training, consistent with findings in other RL literature [71, 72]. The following figure (Figure 5 from the original paper) shows ablation studies on different group sizes $G$:
The figure charts the effect of different group sizes G on Flow-GRPO over training steps. The evaluation score is highest with a group size of 24 and clearly degrades with a group size of 6, indicating that larger group sizes yield better performance.
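For reference, group-relative advantage estimation can be sketched as below: rewards for the G samples of each prompt are standardized within the group, which makes clear why a small G yields noisy mean/std estimates and high-variance advantages. The function name, shapes, and epsilon are illustrative assumptions.

```python
# Sketch of group-relative advantage estimation: G images are sampled per prompt
# and their rewards are standardized within the group. Small G makes the
# mean/std estimates noisy, consistent with the instability seen for small
# group sizes. Shapes and epsilon are illustrative choices.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, G) scalar rewards for G samples per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, group size G = 24.
rewards = torch.rand(2, 24)
adv = group_advantages(rewards)
print(adv.shape)   # torch.Size([2, 24])
```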
6.2.5. Generalization Analysis
Flow-GRPO demonstrates strong generalization capabilities.
- Unseen GenEval Scenarios: Table 4 shows Flow-GRPO generalizes well to unseen objects (trained on 60 object classes, evaluated on 20 unseen ones) and unseen counting (trained on 2-4 objects, evaluated on 5-6 or 12 objects). For instance, it increases Overall accuracy on unseen objects from 0.64 to 0.90 and Counting accuracy for 5-6 objects from 0.13 to 0.48.
- T2I-CompBench++ [6, 73]: Table 3 indicates significant gains on T2I-CompBench++, a benchmark for open-world compositional T2I generation with object classes and relationships substantially different from the GenEval-style training data. For example, SD3.5-M+Flow-GRPO improves 2D-Spatial from 0.2850 to 0.5447.
- Conclusion: The learned capabilities are not merely memorized; they generalize to novel compositional challenges, showcasing the model's enhanced compositional reasoning.
The following are the results from Table 3 of the original paper:
| Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-Spatial |
|---|---|---|---|---|---|---|---|
| Janus-Pro-7B [69] | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 |
| EMU3 [67] | 0.7913 | 0.5846 | 0.7422 | — | — | | |
| FLUX.1 Dev [5] | 0.7407 | 0.5718 | 0.6922 | 0.2863 | 0.3866 | 0.6185 | 0.3127 |
| SD3.5-M [4] | 0.7994 | 0.5669 | 0.7338 | 0.2850 | 0.3739 | 0.5927 | 0.3146 |
| SD3.5-M+Flow-GRPO | 0.8379 | 0.6130 | 0.7236 | 0.5447 | 0.4471 | 0.6752 | 0.3195 |
The following are the results from Table 4 of the original paper:
| Method | Unseen Objects | | | | | | | Unseen Counting | |
|---|---|---|---|---|---|---|---|---|---|
| | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding | 5-6 Objects | 12 Objects |
| SD3.5-M | 0.64 | 0.96 | 0.73 | 0.53 | 0.87 | 0.26 | 0.47 | 0.13 | 0.02 |
| SD3.5-M+Flow-GRPO | 0.90 | 1.00 | 0.94 | 0.86 | 0.97 | 0.84 | 0.77 | 0.48 | 0.12 |
6.2.6. Comparison with Other Alignment Methods (Extended)
- Online vs. Offline: Figure 8 illustrates Flow-GRPO's superior performance over SFT, Flow-RWR, Flow-DPO, and their online variants on the Human Preference Alignment task. The online variants (e.g., Online DPO) generally outperform their offline counterparts, confirming the benefits of online interaction.
- DDPO Comparison: DDPO, when adapted to flow matching models, showed slower reward increases and eventually collapsed in later stages, whereas Flow-GRPO trained stably and improved consistently.
- ReFL Comparison: Flow-GRPO also surpassed ReFL (which requires differentiable rewards), highlighting its robustness and generality, since Flow-GRPO does not impose this constraint.
- ORW Comparison: Table 5 and Table 6 compare Flow-GRPO with ORW. Flow-GRPO consistently achieves a higher PickScore over training steps (Table 5) and outperforms ORW in both CLIP Score (a proxy for text-image alignment) and Diversity Score (Table 6). This further solidifies Flow-GRPO's advantage in maintaining diversity while aligning with preferences.

The following are the results from Table 5 of the original paper:
| Method | Step 0 | Step 240 | Step 480 | Step 720 | Step 960 |
|---|---|---|---|---|---|
| SD3.5-M + ORW | 28.79 | 29.05 | 29.15 | 27.58 | 23.05 |
| SD3.5-M + Flow-GRPO | 28.79 | 29.10 | 29.17 | 29.51 | 29.89 |
The following are the results from Table 6 of the original paper:
| Method | CLIP Score ↑ | Diversity Score ↑ |
|---|---|---|
| SD3.5-M | 27.99 | 0.96 |
| SD3.5-M + ORW | 28.40 | 0.97 |
| SD3.5-M + Flow-GRPO | 30.18 | 1.02 |
6.2.7. Effect of Initial Noise
- Observation: Figure 10 shows that initializing each rollout with different random noise (to increase exploratory diversity) consistently achieved higher rewards during training compared to using the same initial noise for all rollouts.
- Conclusion: This supports the importance of diverse exploration during RL training for stable and effective learning.

The following figure (Figure 10 from the original paper) shows the effect of Initial Noise:
The figure plots PickScore evaluation against training steps, comparing Flow-GRPO runs that use different initial noise per rollout versus the same initial noise for all rollouts. Both curves rise clearly as training progresses.
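The two initialization schemes compared in Figure 10 can be sketched as follows; the latent shape and function name are illustrative assumptions.

```python
# Sketch of the two initialization schemes: each rollout starting from its own
# random latent versus all rollouts sharing one latent. Shape is illustrative.
import torch

def init_latents(group_size: int, shape=(16, 64, 64), share_noise: bool = False):
    if share_noise:
        z = torch.randn(1, *shape)                 # same starting noise for all
        return z.expand(group_size, *shape).clone()
    return torch.randn(group_size, *shape)         # independent noise per rollout

diverse = init_latents(24, share_noise=False)      # setting that trained better
shared = init_latents(24, share_noise=True)
```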
6.2.8. Additional Results on FLUX.1-Dev
- Observation: Flow-GRPO applied to FLUX.1-Dev (another flow matching model) with PickScore as the reward also showed a steady increase in reward throughout training without noticeable reward hacking (Figure 11). Table 7 confirms improvements in Aesthetic, ImageReward, PickScore, and UnifiedReward for FLUX.1-Dev + Flow-GRPO over the base FLUX.1-Dev.
- Conclusion: This demonstrates Flow-GRPO's generalizability beyond SD3.5-M to other flow matching model architectures.

The following figure (Figure 11 from the original paper) shows additional results on FLUX.1-Dev:
The figure charts PickScore evaluation against training steps for Flow-GRPO on FLUX.1-Dev. As training proceeds, the PickScore rises steadily, reaching 23.43, clearly higher than the 21.94 obtained without Flow-GRPO.
The following are the results from Table 7 of the original paper:
| Model | Aesthetic | DeQA | ImageReward | PickScore | UnifiedReward |
|---|---|---|---|---|---|
| FLUX.1-Dev | 5.71 | 4.31 | 0.85 | 22.62 | 3.65 |
| FLUX.1-Dev + Flow-GRPO | 6.02 | 4.24 | 1.32 | 23.97 | 3.81 |
6.2.9. Training Sample Visualization with Denoising Reduction
- Observation: Figure 19 visualizes samples under different inference settings: ODE (40 steps), SDE (40 steps), SDE (10 steps), and SDE (5 steps). ODE (40) and SDE (40) yield visually indistinguishable high-quality images, confirming that the ODE-to-SDE conversion preserves quality. However, SDE sampling with 10 or 5 steps introduces artifacts such as color drift and blur, resulting in lower-quality images.
- Conclusion: Despite the lower quality of samples generated with fewer steps, the Denoising Reduction strategy accelerates optimization because Flow-GRPO relies on relative preferences. The model still extracts a useful reward signal while significantly cutting wall-clock time, leading to faster convergence without sacrificing final performance.

The following figure (Figure 19 from the original paper) visualizes training samples under different inference settings:
The figure shows multiple 'Welcome to Las Vegas' road-sign illustrations in different styles, with a variety of lighting effects and designs. Against a night background, each sign is distinctive, reflecting the character of Las Vegas.
6.3. Qualitative Results
Figures 13, 14, 15, 16, 17, and 18 from the appendix provide extensive qualitative comparisons and insights into the model's behavior:
- GenEval, OCR, PickScore Rewards: These figures show that Flow-GRPO with KL regularization dramatically improves the target capability (e.g., correct object counts, legible text, preferred styles) while maintaining overall image quality. In contrast, removing KL often leads to visual degradation or loss of diversity.
- Evolution of Evaluation Images: Figures 16, 17, and 18 illustrate how the generated images for fixed prompts progressively improve and align with task objectives over successive training iterations, showcasing the online RL learning process.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Flow-GRPO, a pioneering method that successfully integrates online policy gradient reinforcement learning (RL) into flow matching models for text-to-image (T2I) generation. The core innovation lies in addressing the fundamental challenges of applying RL to these models: their deterministic nature and high sampling cost. Flow-GRPO achieves this through two key strategies:
- ODE-to-SDE Conversion: Transforms the deterministic Ordinary Differential Equation (ODE) sampling of flow matching models into an equivalent Stochastic Differential Equation (SDE) framework. This crucial step introduces the stochasticity needed for RL exploration while rigorously preserving the original model's marginal distributions.
- Denoising Reduction Strategy: Significantly reduces the number of denoising steps during RL training (for efficient data collection) while retaining the full number of steps at inference (to ensure high-quality outputs). This strategy drastically improves sampling efficiency and training speed.

Empirically, Flow-GRPO demonstrates state-of-the-art performance across diverse T2I tasks. It delivers a large jump in SD3.5-M's accuracy on the challenging GenEval compositional generation benchmark, outperforming even GPT-4o, and markedly improves visual text rendering accuracy, alongside substantial gains in human preference alignment. A critical finding is the effectiveness of KL regularization in preventing reward hacking, ensuring that performance gains do not come at the expense of overall image quality or diversity. Flow-GRPO offers a simple, general, and robust framework for applying online RL to flow-based generative models, opening new avenues for controllable and aligned image synthesis.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose directions for future research:
- Reward Design: While Flow-GRPO shows promise for video generation, current reward models (e.g., object detectors, trackers) are often simple heuristics. More advanced reward models are needed to capture complex attributes such as physical realism and temporal consistency in videos.
- Balancing Multiple Rewards: Video generation typically involves optimizing multiple, sometimes conflicting, objectives (e.g., realism, smoothness, coherence). Balancing these competing goals remains a challenge that requires careful tuning.
- Scalability: Video generation is significantly more resource-intensive than T2I. Applying Flow-GRPO at scale for video tasks will require more efficient data collection and training pipelines.
- Reward Hacking Prevention: Although KL regularization helps, it can lengthen training, and occasional reward hacking may still occur for specific prompts. Exploring more robust methods for preventing reward hacking is an ongoing area of research.
7.3. Personal Insights & Critique
This paper presents a highly impactful contribution by successfully integrating online RL into flow matching models, which represents a significant step towards more controllable and alignable T2I generation.
Innovations and Strengths:
- Elegant Solution to a Core Problem: The ODE-to-SDE conversion is a technically elegant answer to the fundamental incompatibility between deterministic flow models and stochastic RL exploration. It allows pre-trained, high-quality flow models to be fine-tuned with RL without extensive architectural changes or full retraining, which is highly practical.
- Practical Efficiency: The Denoising Reduction strategy is a clever practical innovation. Recognizing that RL does not always need pristine samples to learn relative preferences dramatically cuts training costs, making online RL feasible for large generative models. This highlights a pragmatic approach to RL data efficiency.
- Comprehensive Validation: The extensive experiments across compositional generation, text rendering, and human preference alignment, with varied baselines and ablation studies (especially on KL regularization, noise level, and group size), thoroughly demonstrate the method's effectiveness and robustness. The clear evidence against reward hacking (with KL) is particularly reassuring.
- Generalizability: The results on FLUX.1-Dev and T2I-CompBench++ showcase the method's potential applicability across different flow-based architectures and broader, more complex compositional settings.
Potential Issues & Areas for Improvement/Further Research:
- Hyperparameter Sensitivity: As the authors note, the KL regularization coefficient and noise level are crucial hyperparameters. Finding optimal values can be challenging and task-dependent. While the paper provides guidance, developing adaptive or less sensitive RL variants could further improve usability.
- Complexity of Reward Models: While Flow-GRPO can utilize non-differentiable reward models (a strength), the quality of RL fine-tuning is inherently tied to the quality of the reward signal. Current reward models (even advanced VLMs) still have limitations and might not fully capture nuanced human preferences or complex task requirements. Future work may need to jointly improve reward models and RL algorithms.
- Interpretability of the SDE Conversion: While mathematically sound, the SDE conversion introduces a score function term that modifies the velocity field. A deeper understanding or visualization of how this modified velocity field behaves, especially under different noise schedules, could offer more insight into the RL exploration mechanism.
- Scaling to Higher Resolutions and Video: The authors correctly identify scalability to video as a limitation. Denoising Reduction helps, but online RL on very high-resolution images or videos still faces immense computational hurdles related to memory and processing power. Exploring more sophisticated experience replay or off-policy RL techniques adapted for generative models might further improve data efficiency.
- Interaction with Pre-trained Weights: KL regularization keeps the model close to its pre-trained weights. While beneficial for quality preservation, there may be scenarios where more aggressive deviation from the pre-trained policy is desirable to unlock novel capabilities. Investigating dynamic KL weighting or alternative regularization schemes could be interesting.

Inspiration from this paper includes the realization that RL's power for reasoning and alignment can indeed be unlocked for efficient ODE-based generative models with clever theoretical and practical adjustments. The ODE-to-SDE conversion paradigm could be a powerful tool for injecting stochasticity into other deterministic processes for RL or other applications. The emphasis on carefully managing reward hacking through KL regularization is a valuable lesson for all RL applications in complex domains.