
FlowRL: Matching Reward Distributions for LLM Reasoning

Published: 09/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

FlowRL introduces a novel method that matches full reward distributions via flow balancing, enhancing diversity in reasoning. Experiments show FlowRL improves performance by 10.0% over GRPO and 5.1% over PPO in math tasks, demonstrating the significance of reward distribution matching.

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.


In-depth Reading


1. Bibliographic Information

1.1. Title

FlowRL: Matching Reward Distributions for LLM Reasoning

1.2. Authors

Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, C. Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin.

The authors are affiliated with several institutions, including Renmin University of China, Stanford University, Toyota Technological Institute at Chicago, and others (the OCR quality on affiliations is low, but key authors are from major academic and research institutions). The wide range of affiliations suggests a collaborative effort between academia and industry, bringing together expertise in reinforcement learning, large language models, and generative modeling.

1.3. Journal/Conference

The paper is available on arXiv, which is a preprint server for academic papers. This means it has not yet undergone formal peer review for publication in a journal or conference. However, arXiv is a standard platform for disseminating cutting-edge research in fields like machine learning, and many influential papers appear there first. The version analyzed is v3, indicating it has been revised since its initial posting.

1.4. Publication Year

The paper's metadata lists a timestamp of 2025-09-18T17:56:36Z, which corresponds to the arXiv posting rather than acceptance at a formal publication venue. The research reflects the state of the art as of 2025.

1.5. Abstract

The abstract introduces FlowRL, a novel reinforcement learning (RL) method for large language models (LLMs) focused on reasoning tasks. Instead of traditional reward maximization (used by methods like PPO and GRPO), FlowRL aims to match the entire reward distribution. The authors argue that reward-maximizing approaches tend to over-optimize for dominant, high-reward reasoning paths, which reduces the diversity of solutions and harms generalization. FlowRL addresses this by transforming scalar rewards into a target probability distribution using a learnable partition function. It then trains the LLM's policy to match this target distribution by minimizing the reverse KL divergence. This is implemented as a flow-balanced optimization objective inspired by GFlowNets. The method is shown to significantly improve performance on math and code reasoning benchmarks, outperforming GRPO by an average of 10.0% and PPO by 5.1% on math tasks. The authors conclude that matching reward distributions is a promising direction for improving exploration and diversity in LLM reasoning.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The primary problem addressed is the limitation of existing reinforcement learning (RL) methods used for fine-tuning Large Language Models (LLMs) on complex reasoning tasks (like math and coding). State-of-the-art methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are designed to maximize a scalar reward signal.
  • Challenges in Prior Research: This reward-maximization objective has a critical flaw: it often leads to mode collapse. The model learns to favor and repeatedly generate the most common or obvious high-reward reasoning paths, effectively "overfitting" to the dominant modes of the reward distribution. This behavior suppresses exploration of less frequent but equally valid or even more creative reasoning strategies. Consequently, the model's diversity decreases, and its ability to generalize to new, unseen problems is hampered.
  • Paper's Entry Point: The paper's innovative idea is to fundamentally change the training objective. Instead of asking the model "what is the single best action to maximize my reward?", it asks "what is the full distribution of good solutions?". The proposal is to shift from reward maximization to reward distribution matching. By doing so, the model is encouraged to learn a policy that can generate a wide variety of high-quality solutions, reflecting the entire landscape of possible correct reasoning paths.

2.2. Main Contributions / Findings

  • A New RL Algorithm (FlowRL): The main contribution is FlowRL, a policy optimization algorithm that implements the concept of reward distribution matching. Its core mechanism involves:
    1. Transforming scalar rewards into a target probability distribution using a learnable partition function, inspired by energy-based models.
    2. Training the policy to match this target distribution by minimizing the reverse Kullback-Leibler (KL) divergence, which is practically achieved through a trajectory balance loss derived from Generative Flow Networks (GFlowNets).
  • Practical Adaptations for LLM Training: The paper introduces two crucial technical innovations to make this theoretical framework work for long-form Chain-of-Thought (CoT) reasoning:
    1. Length Normalization: To prevent exploding gradients caused by summing log-probabilities over very long token sequences (up to 8K tokens).
    2. Importance Sampling: To enable efficient off-policy training by reusing data generated from an older version of the policy, which is standard practice in LLM RL but not native to the GFlowNet objective.
  • State-of-the-Art Performance: The key finding is that FlowRL significantly outperforms established RL baselines on challenging reasoning tasks.
    • On math benchmarks, FlowRL achieves an average improvement of 10.0% over GRPO and 5.1% over PPO with a 32B parameter model.
    • On code generation tasks, it also demonstrates consistently superior performance.
    • Analysis confirms that FlowRL generates substantially more diverse reasoning paths, validating the core hypothesis that distribution matching enhances exploration and generalization.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Reinforcement Learning (RL) for LLMs

Reinforcement Learning is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. In the context of LLMs, the setup is as follows:

  • Agent: The LLM itself, which acts as the policy $\pi_{\theta}$.
  • State: The current context, which includes the input prompt (e.g., a math problem) and the sequence of tokens generated so far.
  • Action: Generating the next token in the sequence.
  • Policy $\pi_{\theta}(\mathbf{y}|\mathbf{x})$: A probability distribution over possible output sequences $\mathbf{y}$ (the answer) given an input prompt $\mathbf{x}$. The policy is parameterized by the LLM's weights $\theta$.
  • Reward $r(\mathbf{x}, \mathbf{y})$: A scalar score assigned to a complete generated sequence $\mathbf{y}$. For reasoning tasks, this could be a binary reward (1 for a correct answer, 0 for incorrect) or a more graded score. The goal of traditional RL is to adjust the policy's parameters $\theta$ to maximize the expected reward $\mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})}[r(\mathbf{x}, \mathbf{y})]$, as estimated in the sketch below.
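
To make the objective concrete, the following minimal Python sketch estimates the reward-maximization objective by Monte Carlo sampling. It is illustrative only (not code from the paper); `generate` and `reward` are hypothetical stand-ins for an LLM sampler and a task-specific verifier (e.g., exact-match checking of the final answer).

```python
def expected_reward(generate, reward, prompt: str, num_samples: int = 16) -> float:
    """Monte Carlo estimate of E_{y ~ pi_theta(.|x)}[r(x, y)] for one prompt."""
    responses = [generate(prompt) for _ in range(num_samples)]
    return sum(reward(prompt, y) for y in responses) / num_samples
```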

3.1.2. Kullback-Leibler (KL) Divergence

KL Divergence is a measure of how one probability distribution differs from a second, reference probability distribution. For two distributions $P$ and $Q$, it is defined as: $ D_{KL}(P || Q) = \sum_{x} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $

  • Forward vs. Reverse KL: The order matters. In what follows, $P$ denotes the target distribution and $Q$ the model distribution that approximates it.
    • Forward KL ($D_{KL}(P || Q)$): Minimizing this forces $Q(x)$ to be non-negligible wherever $P(x)$ is high, so the model must spread its probability mass over all the modes of the target. This is known as mode-covering (or mass-covering).
    • Reverse KL ($D_{KL}(Q || P)$): Minimizing this forces $Q(x)$ to be low wherever $P(x)$ is low, so the model tends to concentrate its mass on one or a few modes of the target. This is known as mode-seeking.
  • Note on Paper's Terminology: The paper minimizes $D_{KL}(\pi_{\theta} || \tilde{\pi})$, the KL from the policy to the reward-induced target, which is indeed the reverse KL under this standard convention. A plain reverse-KL objective is mode-seeking; the diversity benefit in FlowRL instead comes from the target being the full reward distribution (rather than its argmax) and from optimizing the objective through a GFlowNet-style trajectory balance surrogate, which pushes the policy to be proportional to the reward and hence to place mass on all high-reward modes. A small numerical example of the two KL directions is sketched below.
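
The following toy example (not from the paper) computes both KL directions for a small discrete distribution, assuming a bimodal target and two candidate approximations; in this example, forward KL penalizes the approximation that misses a target mode far more heavily than reverse KL does.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A bimodal target and two candidate approximations.
target     = [0.45, 0.05, 0.45, 0.05]   # two high-probability modes
mode_seek  = [0.88, 0.04, 0.04, 0.04]   # concentrates on a single mode
mode_cover = [0.40, 0.10, 0.40, 0.10]   # spreads mass over both modes

for name, q in [("mode-seeking q", mode_seek), ("mode-covering q", mode_cover)]:
    print(f"{name}: forward KL(target||q) = {kl(target, q):.3f}, "
          f"reverse KL(q||target) = {kl(q, target):.3f}")
```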

3.1.3. Generative Flow Networks (GFlowNets)

GFlowNets are a class of generative models designed to sample complex, discrete objects (like sequences or graphs) from a distribution proportional to a given reward function, $P(\tau) \propto R(\tau)$. They achieve this without needing to compute an intractable normalization constant (partition function).

  • Core Idea: GFlowNets frame the generation process as a flow moving through a state graph. The total flow entering any state must equal the total flow leaving it.

  • Flow Balancing: As illustrated in Figure 2, an initial flow $Z$ is injected at the starting state $s_0$. The policy $\pi_{\theta}$ directs this flow through intermediate states. At terminal states (completed sequences), the flow exiting the network is equal to the reward for that sequence. By enforcing a flow balance constraint at every state, the network learns a policy that samples terminal states with a probability proportional to their reward. This inherently promotes diversity and covers all high-reward modes (a toy check of the balance condition follows the figure).

    The following figure (Figure 2 from the original paper) illustrates the GFlowNet concept.

    Figure 2 | GFlowNets [Bengio et al., 2023a], a flow-balance perspective on reinforcement learning. The initial flow $Z_{\phi}(s_0)$ injects probability mass into the environment, which is transported through intermediate states by the policy $\pi_{\theta}$ and accumulated at terminal states in proportion to the scalar rewards.
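
As a toy illustration of the flow-balance idea (not the paper's implementation), the sketch below evaluates the squared trajectory balance residual for a single completed trajectory, assuming a strictly positive reward. For autoregressive generation the state graph is a tree, so the backward policy is trivial and the balance condition reduces to $Z \cdot \prod_t \pi_{\theta}(y_t \mid y_{<t}, x) = R(y)$.

```python
import math

def tb_residual_sq(log_Z: float, step_log_probs: list[float], reward: float) -> float:
    """Squared trajectory-balance residual for one complete trajectory.

    At the optimum, log Z + sum_t log pi_theta(y_t | y_<t, x) - log R(y) = 0.
    """
    return (log_Z + sum(step_log_probs) - math.log(reward)) ** 2

# Toy usage: a 3-token "trajectory" with reward 0.8.
print(tb_residual_sq(log_Z=0.4, step_log_probs=[-0.2, -0.5, -0.1], reward=0.8))
```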

3.2. Previous Works

3.2.1. PPO (Proximal Policy Optimization)

PPO is a highly popular RL algorithm that improves on simpler policy gradient methods like REINFORCE. It provides more stable and reliable training by constraining the policy updates to not deviate too far from the previous policy. Its objective function is: $ \mathcal{L}^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right] $

  • Symbols Explained:
    • $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio between the current and old policies.
    • $\hat{A}_t$ is the estimated advantage function, which measures how much better an action is compared to the average action at a given state. It is often calculated using a critic (value function) model.
    • $\epsilon$ is a hyperparameter that defines the clipping range (e.g., 0.2). The clip function prevents the ratio $r_t(\theta)$ from becoming too large or small, which keeps policy updates within a trust region (see the sketch below).
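
For reference, a minimal PyTorch sketch of the clipped PPO surrogate is shown below; it is a generic textbook formulation rather than the exact setup used in the paper, and it assumes advantage estimates are provided externally (e.g., by a critic).

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate (negated so it can be minimized)."""
    ratio = torch.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```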

3.2.2. GRPO (Group Relative Policy Optimization)

GRPO is a simplification of PPO designed for LLM fine-tuning. It eliminates the need for a separate critic model to estimate the advantage. Instead, it generates a group of $G$ responses for a single prompt, calculates their rewards, and computes an "advantage" for each response by normalizing its reward relative to the group's mean and standard deviation. The paper presents the full GRPO objective, which is a variant of the PPO-clip objective; the key component is its advantage calculation: $ A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})} $ GRPO uses this group-normalized reward as a surrogate for the advantage, simplifying the training process but often requiring more rollouts (a larger group size $G$) to obtain a stable estimate. A minimal sketch of this normalization appears below.
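
The group normalization at the heart of GRPO is simple to express in code. The sketch below is illustrative only and assumes binary rewards for a single prompt's group of rollouts.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: rewards of G rollouts for one prompt, shape (G,),
    normalized by the group mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts where only the second one is correct (binary reward).
print(group_advantages(torch.tensor([0.0, 1.0, 0.0, 0.0])))
```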

3.3. Technological Evolution

The field of RL for LLM reasoning has evolved from simple to more complex and stable algorithms:

  1. REINFORCE: The foundational policy gradient algorithm. Simple but suffers from high variance and instability.
  2. PPO: Introduced a clipping mechanism and value function to stabilize training, becoming a standard for many RL applications, including LLM fine-tuning (e.g., InstructGPT).
  3. GRPO: Simplified PPO by removing the value function and using group-based reward normalization. This made it more tailored to LLM generation tasks where a group of responses can be easily generated.
  4. FlowRL (This Paper): Represents a paradigm shift. Instead of focusing on stabilizing reward maximization, it changes the objective entirely to distribution matching. This directly addresses the underlying problem of mode collapse that all previous reward-maximizing methods are susceptible to.

3.4. Differentiation Analysis

The core innovation of FlowRL compared to prior work is its objective function:

  • PPO/GRPO: Aim to maximize the expected (advantage-weighted) reward. Their goal is to find and exploit high-reward regions of the solution space. This is fundamentally an optimization problem.

  • FlowRL: Aims to match a target reward distribution. Its goal is to learn a policy that reflects the entire landscape of good solutions, in proportion to their quality. This is fundamentally a generative modeling or density estimation problem.

    This conceptual shift from optimization to distribution matching is what allows FlowRL to promote diversity and avoid the mode collapse pitfalls of its predecessors.

4. Methodology

4.1. Principles

The central principle of FlowRL is to reframe reinforcement learning from a reward-maximization problem into a distribution-matching problem. While methods like PPO and GRPO push the policy towards the single highest peak in the reward landscape, FlowRL encourages the policy to cover all peaks, with the probability of sampling a solution being proportional to its reward. This is achieved by defining a target distribution based on the rewards and training the policy to match it using an objective derived from GFlowNets.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology of FlowRL is developed in several steps, starting from a theoretical objective and progressively adding practical components to make it work for long-form text generation.

4.2.1. Step 1: From Reward Maximization to a KL Divergence Objective

The starting point is the recognition that reward-maximizing RL leads to mode collapse (as depicted in Figure 1). To counter this, the paper proposes to align the policy's output distribution, $\pi_{\theta}(\mathbf{y}|\mathbf{x})$, with a target distribution that is defined by the reward function.

However, a scalar reward $r(\mathbf{x}, \mathbf{y})$ is not a probability distribution. Inspired by energy-based models, the paper introduces a learnable partition function $Z_{\phi}(\mathbf{x})$ to normalize the rewards into a valid target distribution. This allows the formulation of a distribution-matching objective using KL divergence. The objective is to minimize the KL divergence between the policy and this reward-induced target distribution: $ \min_{\theta} \mathcal{D}_{\mathrm{KL}}\left(\pi_{\theta}(\mathbf{y} \mid \mathbf{x}) \,\Big\|\, \frac{\exp(\beta r(\mathbf{x}, \mathbf{y}))}{Z_{\phi}(\mathbf{x})}\right) \quad \Rightarrow \quad \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) \propto \exp(\beta r(\mathbf{x}, \mathbf{y})) $

  • Symbols Explained:
    • $\pi_{\theta}(\mathbf{y}|\mathbf{x})$: The policy model (the LLM) that generates response $\mathbf{y}$ given prompt $\mathbf{x}$.
    • $r(\mathbf{x}, \mathbf{y})$: The scalar reward for the generated response.
    • $\beta$: A hyperparameter that controls the "sharpness" of the target distribution (how strongly rewards influence probabilities).
    • $Z_{\phi}(\mathbf{x})$: The learnable partition function, parameterized by $\phi$. It normalizes the exponentiated rewards over all possible responses $\mathbf{y}$ for a given prompt $\mathbf{x}$, making the expression a valid probability distribution (an illustrative sketch follows).
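
To build intuition for the reward-induced target, the sketch below normalizes $\exp(\beta r)$ over a sampled group of responses. This is only an empirical stand-in: the true partition function sums over all possible responses and is intractable, which is exactly why FlowRL learns $Z_{\phi}$ instead of computing it.

```python
import torch

def empirical_target(rewards: torch.Tensor, beta: float = 15.0) -> torch.Tensor:
    """Illustration only: normalize exp(beta * r) over a sampled group of responses."""
    return torch.softmax(beta * rewards, dim=-1)

# A sharper beta concentrates the target on the highest-reward responses.
rewards = torch.tensor([0.1, 0.9, 0.8, 0.2])
print(empirical_target(rewards, beta=1.0))
print(empirical_target(rewards, beta=15.0))
```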

4.2.2. Step 2: Equivalence to GFlowNet's Trajectory Balance Loss

Directly minimizing the KL objective is difficult because computing the partition function $Z_{\phi}(\mathbf{x})$ is intractable (it requires summing over all possible trajectories). The paper leverages a key result from GFlowNet theory (Proposition 1) which states that, in terms of expected gradients, minimizing the KL objective is equivalent to minimizing a squared-error loss known as the trajectory balance loss: $ \min_{\theta} \mathcal{D}_{\mathrm{KL}}\left(\pi_{\theta}(\mathbf{y} \mid \mathbf{x}) \,\Big\|\, \frac{\exp(\beta r(\mathbf{x}, \mathbf{y}))}{Z_{\phi}(\mathbf{x})}\right) \quad \Longleftrightarrow \quad \min_{\theta} \left(\log Z_{\phi}(\mathbf{x}) + \log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \beta r(\mathbf{x}, \mathbf{y})\right)^2 $

  • Significance: This reformulation is crucial. It turns an intractable KL minimization problem into a tractable regression-like problem. The objective simply tries to make $\log Z_{\phi}(\mathbf{x}) + \log \pi_{\theta}(\mathbf{y}|\mathbf{x})$ equal to $\beta r(\mathbf{x}, \mathbf{y})$ for sampled trajectories. Both $\pi_{\theta}$ and $Z_{\phi}$ are trained jointly to satisfy this condition. This squared loss serves as a practical surrogate for the KL objective.

4.2.3. Step 3: Adapting the Objective for Long CoT Reasoning

Applying the trajectory balance loss directly to long Chain-of-Thought (CoT) generation with LLMs presents two major challenges:

  1. Exploding Gradients: The term $\log \pi_{\theta}(\mathbf{y}|\mathbf{x})$ is a sum of log-probabilities for each token in the sequence: $\sum_t \log \pi_{\theta}(\mathbf{y}_t \mid \mathbf{y}_{<t}, \mathbf{x})$. For long sequences (e.g., 8,000 tokens), this sum can become a very large negative number, leading to huge gradients and unstable training.

  2. Sampling Mismatch: The trajectory balance loss assumes on-policy sampling (trajectories are sampled from the current policy $\pi_{\theta}$). However, for efficiency, LLM RL pipelines like PPO and GRPO use off-policy data, sampling trajectories from a fixed older policy $\pi_{\theta_{old}}$ and performing multiple updates on this data.

    To solve these issues, the paper introduces several modifications. First, it redefines the target distribution to include a reference model $\pi_{\mathrm{ref}}$ (the initial pre-trained model) as a regularizer. This prevents the policy from drifting too far from its original capabilities. The reward term becomes: $ \exp(\beta r(\mathbf{x}, \mathbf{y})) \cdot \pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x}) $ Substituting this into the trajectory balance loss (Equation 3) yields a new objective: $ \min_{\theta} \left( \log Z_{\phi}(\mathbf{x}) + \log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \beta \hat{r}_{i}(\mathbf{x}, \mathbf{y}) - \log \pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x}) \right)^2 $

  • Here, $\hat{r}_i$ is also introduced, which is the group-normalized reward, similar to GRPO: $\hat{r}_{i} = (r_{i} - \mathrm{mean}(\mathbf{r})) / \mathrm{std}(\mathbf{r})$.

4.2.4. Step 4: The Final FlowRL Objective

Finally, the paper incorporates the solutions for exploding gradients and sampling mismatch to arrive at the complete FlowRL objective.

  • Length Normalization (Remark 3): To address exploding gradients, the sequence-level log-probability terms are normalized by the sequence length $|\mathbf{y}|$.

  • Importance Sampling (Remark 4): To handle the sampling mismatch, an importance weight $w$ is introduced, similar to PPO. This weight corrects for the discrepancy between the sampling policy ($\pi_{\mathrm{old}}$) and the current training policy ($\pi_{\theta}$). The weight is clipped and its gradient is detached to ensure stability.

    This leads to the final FlowRL objective function: $ \mathcal{L}_{\mathrm{FlowRL}} = w \cdot \Bigg( \log Z_{\phi}(\mathbf{x}) + \frac{1}{|\mathbf{y}|} \log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \beta \hat{r}(\mathbf{x}, \mathbf{y}) - \frac{1}{|\mathbf{y}|} \log \pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x}) \Bigg)^2 $ where the weight $w$ and normalized reward $\hat{r}$ are defined as: $ w = \mathrm{clip}\left(\frac{\pi_{\theta}(\mathbf{y} \mid \mathbf{x})}{\pi_{\mathrm{old}}(\mathbf{y} \mid \mathbf{x})},\ 1 - \epsilon,\ 1 + \epsilon\right)^{\mathrm{detach}}, \quad \hat{r}_{i} = \frac{r_{i} - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})} $

  • Symbols Explained:

    • $w$: The clipped and detached importance sampling weight. It re-weights the loss for each sample based on how likely it is under the current policy versus the policy that generated it.

    • $|\mathbf{y}|$: The length of the generated token sequence. Dividing by this term performs the length normalization.

    • $\pi_{\mathrm{ref}}$: The fixed, pre-trained base model, acting as a regularizer.

    • $\pi_{\mathrm{old}}$: The older version of the policy from which the training data (rollouts) was sampled.

    • $\epsilon$: The clipping hyperparameter for the importance weight, same as in PPO.

      This final objective function is used to update both the policy parameters $\theta$ and the partition function parameters $\phi$ during training. A minimal code sketch of this loss follows.
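
Putting the pieces together, here is a minimal PyTorch sketch of the FlowRL loss as described above. It is a simplified reconstruction, not the authors' code: sequence-level log-probabilities, the learnable $\log Z_{\phi}$ value (e.g., produced by a small MLP over the prompt representation), rewards, and sequence lengths are assumed to be supplied by the surrounding training loop.

```python
import torch

def flowrl_loss(logp_theta, logp_old, logp_ref, log_Z, rewards, seq_lens,
                beta: float = 15.0, eps: float = 0.2) -> torch.Tensor:
    """Simplified FlowRL objective for one group of B rollouts of a single prompt.

    logp_theta, logp_old, logp_ref : summed token log-probs of each response under
        the current, rollout (old), and frozen reference policies, shape (B,)
    log_Z    : learnable log-partition estimate for the prompt, scalar or shape (B,)
    rewards  : scalar rewards for each response, shape (B,)
    seq_lens : response lengths |y| in tokens (float), shape (B,)
    """
    # Group-normalized reward, as in GRPO.
    r_hat = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Clipped importance weight with detached gradient (off-policy correction).
    with torch.no_grad():
        w = torch.clamp(torch.exp(logp_theta - logp_old), 1.0 - eps, 1.0 + eps)

    # Length-normalized log-probabilities to avoid exploding gradients on long CoT.
    residual = (log_Z
                + logp_theta / seq_lens
                - beta * r_hat
                - logp_ref / seq_lens)
    return (w * residual.pow(2)).mean()
```

Both $\theta$ (through `logp_theta`) and $\phi$ (through `log_Z`) receive gradients from this loss, matching the joint update described above.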

5. Experimental Setup

5.1. Datasets

The experiments were conducted in two primary domains: math reasoning and code generation.

  • Math Reasoning:
    • Training Data: The training set was collected from DAPO [Yu et al., 2025b].
    • Evaluation Benchmarks: A set of six challenging benchmarks were used:
      • AIME 2024/2025: American Invitational Mathematics Examination problems.
      • AMC 2023: American Mathematics Competitions problems.
      • MATH-500: A subset of the MATH dataset.
      • Minerva: A dataset of quantitative reasoning problems.
      • Olympiad: Problems from math olympiads, representing a very high level of difficulty.
  • Code Reasoning:
    • Training Data: The training set from DeepCoder [Luo et al., 2025] was used.
    • Evaluation Benchmarks: Three benchmarks were used to assess coding ability:
      • LiveCodeBench: A benchmark with live, unseen programming problems.

      • CodeForces: A popular competitive programming platform; performance is measured by rating and percentile.

      • HumanEval+: An enhanced version of the popular HumanEval benchmark for code generation.

        These datasets were chosen because they represent complex reasoning tasks where multiple valid solution paths exist, making them ideal for testing the diversity and generalization benefits of FlowRL.

5.2. Evaluation Metrics

5.2.1. Avg@16 / Pass@16

  • Conceptual Definition: Avg@16 reports the mean accuracy over $k=16$ generated attempts (rollouts) per problem, while Pass@16 measures the probability that at least one of the 16 attempts is correct. Both are used for tasks with a verifiable correct answer (like math or coding). Pass@k in particular rewards models that can produce a correct answer even if it is not their single most likely output, thus capturing the model's ability to explore the solution space.
  • Mathematical Formula: The unbiased estimator for Pass@k is given by: $ \text{Pass}@k = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] $
  • Symbol Explanation:
    • $n$: The total number of samples generated for a single problem.
    • $c$: The number of correct samples among the $n$ generated samples.
    • $k$: The number of samples allowed for checking (in this case, 16).
    • $\binom{n}{k}$: The binomial coefficient, representing the number of ways to choose $k$ items from a set of $n$. The formula gives the probability of drawing at least one correct sample when $k$ samples are drawn without replacement from the $n$ generated samples. In practice, if $n=k$, this simplifies to checking whether any of the $k$ generated samples is correct. The paper generates 16 rollouts, so $n=k=16$; a small implementation of this estimator is sketched below.
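
For completeness, here is a small implementation of the unbiased Pass@k estimator defined above (a standard formulation, not code from the paper).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generations (c of them correct) is correct."""
    if n - c < k:          # too few incorrect samples to fill k draws
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 rollouts, 3 correct, evaluated at k = 16 -> 1.0.
print(pass_at_k(n=16, c=3, k=16))
```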

5.2.2. Codeforces Rating and Percentile

  • Conceptual Definition: In competitive programming platforms like Codeforces, a player's skill is measured by a rating system similar to Elo in chess. A higher rating indicates better performance. The percentile indicates the percentage of participants a player's rating is higher than. These metrics are used to evaluate how a model's generated code performs in a competitive setting against other programs and human-written solutions. They measure not just correctness but also efficiency and problem-solving ability in a ranked context.
  • Mathematical Formula: There is no single formula; it is based on the specific Elo-like rating system implemented by Codeforces, which updates ratings based on the outcome of contests and the ratings of opponents.

5.3. Baselines

The paper compares FlowRL against three representative reward-maximization RL algorithms:

  • REINFORCE++ (R++): An improved version of the basic REINFORCE algorithm. It represents a simple but fundamental policy gradient approach.

  • PPO (Proximal Policy Optimization): The de-facto standard for many RL applications, known for its stability and performance. It uses a critic model and clipped objective.

  • GRPO (Group Relative Policy Optimization): A recent simplification of PPO for LLMs that removes the critic and uses group-based reward normalization. It is a strong and highly relevant baseline.

    These baselines were chosen to represent the evolution of reward-maximizing RL algorithms and provide a comprehensive comparison against the state-of-the-art in LLM fine-tuning.

6. Results & Analysis

6.1. Core Results Analysis

FlowRL consistently outperforms all baseline methods across both math and code reasoning tasks, demonstrating the effectiveness of its distribution-matching approach.

6.1.1. Math Reasoning Results

The following are the results from Table 1 of the original paper:

Models AIME24 AIME25 AMC23 MATH500 Minerva Olympiad Avg
Qwen2.5-32B-Base, Max Response Len = 8K tokens (values in parentheses are changes relative to the Backbone)
Backbone 4.58 2.08 28.59 52.48 26.99 21.37 22.68
R++ 14.79 (+10.21) 9.17 (+7.08) 52.65 (+24.06) 44.35 (-8.13) 17.37 (-9.62) 24.52 (+3.15) 27.14
PPO 26.87 (+22.29) 20.41 (+18.33) 76.40 (+47.81) 69.17 (+16.69) 28.79 (+1.80) 37.90 (+16.53) 43.25
GRPO 23.12 (+18.54) 14.58 (+12.50) 76.87 (+48.28) 61.60 (+9.12) 18.95 (-8.04) 34.94 (+13.57) 38.34
FlowRL 23.95 (+19.37) 21.87 (+19.79) 73.75 (+45.16) 80.75 (+28.27) 38.21 (+11.22) 51.83 (+30.46) 48.39
Qwen2.5-7B-Base, Max Response Len = 8K tokens
Backbone 4.38 2.08 30.78 54.47 22.38 24.03 23.02
R++ 11.04 (+6.66) 5.41 (+3.33) 66.71 (+35.93) 54.25 (-0.22) 24.37 (+1.99) 27.33 (+3.30) 31.52
PPO 9.38 (+5.00) 7.29 (+5.21) 63.43 (+32.65) 57.98 (+3.51) 26.53 (+4.15) 27.25 (+3.22) 31.98
GRPO 13.54 (+9.16) 9.79 (+7.71) 64.53 (+33.75) 57.05 (+2.58) 23.06 (+0.68) 26.88 (+2.85) 32.48
FlowRL 15.41 (+11.03) 10.83 (+8.75) 54.53 (+23.75) 66.96 (+12.49) 31.41 (+9.03) 34.61 (+10.58) 35.63
  • Analysis: At both the 7B and 32B model scales, FlowRL achieves the highest average accuracy. For the 32B model, FlowRL reaches an average of 48.39%, which is 5.14 points higher than PPO (43.25%) and 10.05 points higher than GRPO (38.34%). The improvements are particularly strong on the most difficult benchmarks like MATH-500 and Olympiad, suggesting that FlowRL's diverse exploration is especially beneficial for problems requiring complex and non-obvious reasoning paths.

6.1.2. Code Reasoning Results

The following are the results from Table 2 of the original paper:

Models LiveCodeBench Avg@16 LiveCodeBench Pass@16 CodeForces Rating CodeForces Percentile HumanEval+ Avg@16
DeepSeek-R1-Distill-Qwen-7B, Max Response Len = 8K tokens (values in parentheses are changes relative to the Backbone)
Backbone 30.68 49.46 886.68 19.4% 80.90
R++ 30.46 (-0.22) 52.68 (+3.22) 1208.03 (+321.35) 56.8% (+37.4%) 76.61 (-4.29)
PPO 35.10 (+4.42) 54.48 (+5.02) 1403.07 (+516.39) 73.7% (+54.3%) 82.32 (+1.42)
GRPO 32.75 (+2.07) 52.32 (+2.86) 1313.82 (+427.14) 67.1% (+47.7%) 80.13 (-0.77)
FlowRL 37.43 (+6.75) 56.27 (+6.81) 1549.47 (+662.79) 83.3% (+63.9%) 83.28 (+2.38)
  • Analysis: FlowRL demonstrates the strongest performance across all three coding benchmarks. It achieves the highest Avg@16 on LiveCodeBench and HumanEval+, and a significantly higher Rating (1549.47) and Percentile (83.3%) on CodeForces. This indicates its strong generalization capabilities extend from mathematical to algorithmic reasoning.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Importance of Technical Components

The paper conducts ablation studies to validate the key components of FlowRL. The following are the results from Table 3 of the original paper:

Method AIME 2024 AIME 2025 AMC 2023 MATH-500 Minerva Olympiad Avg
FlowRL 15.41 10.83 54.53 66.96 31.41 34.61 35.63
w/o IS 6.25 7.91 41.40 56.97 22.19 25.52 26.71
Zhang et al. [2025a] 10.41 6.66 53.75 66.50 30.97 33.72 33.67
  • Analysis of Importance Sampling (IS): Removing importance sampling (w/o IS) causes a massive performance drop from 35.63% to 26.71% on average. This confirms that correcting for the distribution mismatch between the old policy used for generation and the current policy being trained is critical for stable and effective learning.
  • Comparison to Combined Loss: The FlowRL approach of using a trajectory-level importance ratio also outperforms the method from Zhang et al. [2025a], which uses a combined GFlowNet and PPO loss. This suggests that FlowRL's principled integration of off-policy correction into the trajectory balance objective is more effective for CoT reasoning tasks.

6.2.2. Parameter Analysis on $\beta$

The hyperparameter $\beta$ controls how sharply the reward distribution is peaked. The following figure (Figure 3 from the original paper) shows the results of an ablation study on $\beta$.

Figure 3 | Ablation study on $\beta$ in FlowRL. Among the tested values ($\beta \in \{5, 10, 15, 30\}$), $\beta = 15$ (highlighted in blue) achieves the best average score of 35.63%.

  • Analysis: The results show that performance is sensitive to the value of $\beta$. A value of $\beta = 15$ achieves the best performance. If $\beta$ is too low, the reward signal is too weak to guide the policy effectively. If it is too high, the target distribution may become too peaked, approaching the mode-collapse behavior of reward maximization.

6.3. Diversity Analysis

To empirically verify that FlowRL produces more diverse solutions, the authors had GPT-4o-mini evaluate the diversity of generated responses for AIME problems.

The following figure (Figure 4 from the original paper) shows the judged diversity scores.

Figure 4 | GPT-judged diversity scores on rollouts of AIME 24/25 problems. FlowRL (2.28) generates more diverse solutions than R++ (1.11), GRPO (1.23), and PPO (1.31).

  • Analysis: FlowRL achieves a diversity score of 2.28, which is substantially higher than PPO (1.31), GRPO (1.23), and R++ (1.11). This provides strong evidence for the central hypothesis of the paper: the flow-balanced, distribution-matching objective successfully encourages the model to explore and generate a wider variety of reasoning paths, whereas reward-maximizing baselines tend to produce more repetitive solutions.

6.4. Case Study

The following table, transcribed from Table 4 of the paper, provides a qualitative comparison between GRPO and FlowRL on a specific AIME problem.

Content (boxed = actions; "×k" = repeated; "." = omitted)
Question
Let B be the set of rectangular boxes with surface area 54 and volume 23. Let r be the radius of the smallest sphere that can contain each box in B. If r² = p/q with gcd(p, q) = 1, find p + q.
GRPO
denote a, b, c .. 2(ab+bc+ca) = 54, abc = 23 .. d = a²+b²+c², r = sqrt(d)/2
.. (a+b+c)² = a²+b²+c² + 2(ab+bc+ca) ... AM-GM ×3: AM-GM (1) .. AM-GM (2), loop (2) ...
AM-GM (3) .. (a+b+c)³ identity, loop ×2: loop (1) ...
a = b = c (contradiction) ... back to (a+b+c)² ... no factorization ..
FlowRL
let a, b, c with 2(ab+bc+ca) = 54, abc = 23 .. d = a²+b²+c², r = sqrt(d)/2
(a+b+c)² ⇒ a²+b²+c² = (a+b+c)² − 54 … a = b → a³ − 27a + 46 = 0
rational root a = 2 → factor (a − 2)(a² + 2a − 23) → branch a = −1 + 2√6
back-sub c = 23/a² .. a²+b²+c² = 657/4 → Answer 721
  • Analysis: This case study vividly illustrates the behavioral difference.
    • GRPO: Gets stuck in a loop, repeatedly applying a common strategy (AM-GM inequality). This leads to a contradiction and failure to solve the problem. This is a classic example of exploiting a familiar but incorrect path, indicative of mode collapse.
    • FlowRL: Explores a more diverse set of algebraic manipulations. It makes a strategic simplification ($a = b$), which transforms the problem into a solvable cubic equation. It then proceeds systematically through rational root testing and factorization to arrive at the correct answer. This demonstrates a more flexible and robust reasoning process, enabled by exploring less obvious but ultimately more fruitful solution paths.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces and validates FlowRL, a novel reinforcement learning algorithm for LLM reasoning. By shifting the training paradigm from reward maximization to reward distribution matching, FlowRL directly tackles the problem of mode collapse inherent in methods like PPO and GRPO. The core contributions are:

  1. A principled objective based on matching a reward-induced target distribution via a GFlowNet-inspired trajectory balance loss.
  2. Practical adaptations, namely length normalization and importance sampling, that make this framework effective for long-form Chain-of-Thought reasoning.
  3. Strong empirical results showing that FlowRL significantly outperforms state-of-the-art baselines on diverse and challenging math and code benchmarks. The work provides compelling evidence that promoting solution diversity through distribution matching is a key factor in enhancing the generalization and reasoning capabilities of LLMs.

7.2. Limitations & Future Work

The paper itself does not explicitly list limitations. However, based on the methodology, some potential areas for future exploration can be identified:

  • Complexity of the Partition Function: FlowRL introduces a separate learnable module, the partition function $Z_{\phi}$. The paper uses a simple 3-layer MLP, but the optimal architecture and its impact on performance are not deeply explored. The stability and efficiency of training this additional component could be a concern.
  • Hyperparameter Sensitivity: The method relies on key hyperparameters, particularly $\beta$ (reward temperature) and $\epsilon$ (clipping). The ablation study shows that performance is quite sensitive to $\beta$. A more robust method or an adaptive way to set these parameters would be a valuable future contribution.
  • Reward Function Quality: Like all RL methods, FlowRL is dependent on the quality of the reward signal. While it makes better use of the signal by considering its distribution, its effectiveness is still bounded by how well the reward function captures true solution quality. Exploring its performance with more nuanced or sparse rewards would be interesting.
  • Theoretical Guarantees: The equivalence between the KL objective and the trajectory balance loss holds in terms of expected gradients. The practical implications of using stochastic, off-policy updates with finite data could be further analyzed theoretically.

7.3. Personal Insights & Critique

  • Conceptual Shift is Powerful: The most impressive aspect of this paper is its elegant conceptual shift. Moving from "find the best path" to "learn the landscape of all good paths" is a powerful idea that resonates with how humans solve complex problems—by understanding multiple strategies, not just memorizing one. This connection to generative modeling and energy-based models provides a rich theoretical foundation for future RL research in LLMs.
  • Excellent Problem-Driven Engineering: The authors not only proposed a strong theoretical idea but also did the necessary engineering to make it work in a challenging real-world setting. The introduction of length normalization and importance sampling demonstrates a deep understanding of the practical pitfalls of training LLMs on long sequences. These contributions are just as important as the core algorithm itself.
  • Minor Critique on Exposition: The paper's use of "reverse KL divergence" for $D_{KL}(\pi_{\theta} || \tilde{\pi})$ follows standard convention (the KL from the model to the target). The subtle point worth spelling out is that reverse KL is usually associated with mode-seeking behavior; the diversity gains here stem from the target being the full reward distribution rather than its maximum, and from the trajectory balance surrogate used to optimize it. A short discussion of this distinction would strengthen the paper's theoretical exposition.
  • Broader Implications: This work opens the door to applying other principles from probabilistic generative modeling to RL for LLMs. If policies can be trained to match reward distributions, perhaps they can also be trained using objectives from diffusion models or other advanced frameworks, leading to even more robust and creative reasoning agents. FlowRL feels less like a final solution and more like a significant step in a promising new direction.
