Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models
TL;DR Summary
This paper introduces Risk-Sensitive RL (RS-GRPO) to alleviate LLMs' exploration dilemma in reasoning tasks. By using a risk-seeking objective that amplifies learning from difficult problems, RS-GRPO fosters deeper exploration. Experiments show it consistently improves `pass@k` while maintaining or enhancing `pass@1` accuracy.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of solutions, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. To overcome this, we introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, Risk-Sensitive GRPO (RS-GRPO), which drives deeper exploration by amplifying learning from challenging prompts. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications. On six mathematical reasoning benchmarks and with five different LLMs, RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy.
English Analysis
1. Bibliographic Information
- Title: Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models
- Authors: Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, Lin Yan
- Affiliations: Tsinghua University, ETH Zurich, ByteDance Seed
- Journal/Conference: This paper is a preprint on arXiv, dated September 30, 2025; it has not been published in a peer-reviewed venue.
- Abstract: The paper addresses a key issue in using Reinforcement Learning with Verifiable Rewards (RLVR) to improve Large Language Models (LLMs) on reasoning tasks. Standard RL methods tend to sharpen the LLM's initial, narrow policy, improving single-solution accuracy (`pass@1`) but hurting solution diversity and multi-solution accuracy (`pass@k`). This "exploration dilemma" means RL often just refines existing knowledge rather than discovering new strategies. To solve this, the authors propose a Risk-Sensitive Reinforcement Learning framework. Their algorithm, Risk-Sensitive GRPO (RS-GRPO), uses a risk-seeking objective that prioritizes learning from difficult problems where the model performs poorly. This encourages deeper exploration. The method is simple to implement, requiring only small code changes. Experiments on six math benchmarks with five different LLMs show that RS-GRPO consistently improves `pass@k` while maintaining or improving `pass@1`.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2509.24261
- PDF Link: http://arxiv.org/pdf/2509.24261v1
- Publication Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: When fine-tuning Large Language Models (LLMs) with Reinforcement Learning (RL) for complex reasoning tasks, a significant problem arises, termed the "exploration dilemma." Pre-trained LLMs start with a very confident but narrow policy (a "sharply peaked" distribution over possible solutions). Standard RL algorithms, designed to maximize average rewards, tend to get trapped by this initial bias. They reinforce the most probable existing solutions, which boosts accuracy for a single attempt (`pass@1`), but at the cost of reducing the variety of solutions the model can generate.
  - Importance: This dilemma is a major bottleneck. It causes the model's ability to find any correct solution among many attempts (`pass@k`) to stagnate or even worsen. Instead of discovering novel reasoning pathways, the LLM simply becomes more rigid in its existing, often suboptimal, strategies. The goal of using RL to truly expand a model's capabilities is therefore not fully realized.
  - Innovation: The paper introduces a novel perspective from Risk-Sensitive RL. Instead of optimizing for the average reward (a risk-neutral goal), the authors propose a risk-seeking objective. This objective mathematically shifts the focus from "what is the best average performance?" to "how can we learn from the best possible outcomes, even if they are rare?" This encourages the model to explore less likely but potentially more rewarding solution paths.
- Main Contributions / Findings (What):
  - A New Framework: The paper introduces a risk-sensitive RL framework to address the exploration dilemma in LLM fine-tuning and proposes a simple yet powerful new algorithm called Risk-Sensitive GRPO (RS-GRPO).
  - Theoretical and Empirical Justification: It provides evidence showing that standard RL can get stuck in local optima due to the LLM's initial sharp policy, while the proposed risk-sensitive formulation can successfully escape these traps and find better solutions.
  - Superior Performance: Through extensive experiments on six mathematical reasoning benchmarks with five different LLMs, RS-GRPO is shown to significantly improve `pass@k` performance (solution diversity and multi-attempt success) while also maintaining or even slightly improving `pass@1` accuracy (single-attempt success). This demonstrates a better trade-off between exploration and exploitation compared to existing methods.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data, capable of generating human-like text, answering questions, and performing reasoning tasks. Examples include models from the GPT, Llama, and Qwen series.
- Reinforcement Learning (RL): A machine learning paradigm where an "agent" learns to make decisions by performing actions in an "environment" to maximize a cumulative "reward." In the context of LLMs, the LLM is the agent, a user prompt is the environment, the generated text is the action, and a reward function scores the quality of the text.
- Reinforcement Learning with Verifiable Rewards (RLVR): A specific application of RL to LLMs where the reward is objective and verifiable. For example, in a math problem, the reward is 1 if the final answer is correct and 0 if it's wrong. This avoids the need for a subjective human preference model.
- Policy: In RL, the policy is the agent's strategy. For an LLM, the policy is the model itself, which defines the probability of generating a certain sequence of text (a solution $y$) given a prompt $x$.
- `pass@1` and `pass@k`: These are evaluation metrics. `pass@1` measures the probability that a single solution generated by the model is correct. `pass@k` measures the probability that at least one out of k generated solutions is correct. A high `pass@k` indicates the model is capable of generating diverse and correct solutions.
- Exploration vs. Exploitation: A fundamental dilemma in RL. Exploitation means using the currently known best strategy to get high rewards. Exploration means trying new, uncertain strategies with the hope of discovering even better ones. The "exploration dilemma" in this paper refers to standard RL over-exploiting the LLM's initial biases.
- Policy Gradient Methods: A class of RL algorithms that directly optimize the policy's parameters ($\theta$) by calculating the gradient (direction of steepest ascent) of the expected reward. `GRPO` is one such algorithm used for LLMs.
- Advantage Function ($A$): A core component in policy gradient methods. It measures how much better a particular action (generating solution $y$) is compared to the average action for a given prompt. A positive advantage encourages the policy to take that action more often, while a negative advantage discourages it (a standard GRPO-style formula for this quantity is sketched at the end of this section).
- Risk-Sensitive RL: A branch of RL that goes beyond maximizing the simple average (expected) reward. It allows the agent to be risk-averse (preferring stable, predictable rewards and avoiding low-reward outcomes) or risk-seeking (preferring to take chances to get very high rewards, even if it means risking low ones). This paper focuses on risk-seeking behavior to promote exploration.
- Previous Works & Differentiation:
  - Prior RLVR methods have successfully improved `pass@1` but often at the expense of `pass@k`. This paper identifies this as a failure to escape the initial policy's local optima.
  - Other methods have tried to directly optimize for `pass@k`. The paper compares itself to several of these (e.g., Tang et al. [56], Mahdavi et al. [39], Chen et al. [9]).
  - The proposed risk-sensitive framework is differentiated in two key ways:
    - Generality: It naturally handles both binary rewards (correct/incorrect) and continuous rewards (e.g., partial scores), whereas many `pass@k` optimization methods are restricted to binary signals.
    - Denser Signal: Other methods often stop providing a learning signal (the advantage becomes zero) for prompts where the model is already performing well. The risk-sensitive approach provides a non-zero gradient even for high-accuracy prompts, allowing it to continue refining its policy and achieving a better balance between `pass@1` and `pass@k`.
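For reference, a standard group-relative advantage of the kind used by `GRPO` normalizes each sampled solution's reward against the group of $G$ solutions sampled for the same prompt (this is the commonly cited formulation; the exact variant used in the paper may differ slightly):
$$A(x, y_i) = \frac{R(x, y_i) - \operatorname{mean}\left(\{R(x, y_j)\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R(x, y_j)\}_{j=1}^{G}\right)}$$
Section 4 describes how RS-GRPO replaces this quantity with a risk-sensitive counterpart.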
4. Methodology (Core Technology & Implementation)
The core of the paper's contribution is a new objective function for RL that encourages exploration by being "risk-seeking."
- Principles: The standard RL objective is to maximize the mean (or expected) reward. This is risk-neutral. The authors argue that to escape the narrow initial policy of an LLM, the objective should instead be risk-seeking, meaning it should be more sensitive to high-reward outcomes, effectively interpolating between optimizing for the mean reward and the maximum reward.
- The Risk-Sensitive Objective: The authors employ an objective based on the exponential utility function, which is standard in risk-sensitive control theory. For a given prompt $x$, the risk-sensitive objective is defined as:
  $$\mathcal{J}_\beta(\pi; x) = \frac{1}{\beta} \log \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[ e^{\beta R(x, y)} \right]$$
  - $\pi$ is the LLM's policy.
  - $R(x, y)$ is the reward for a generated solution $y$.
  - $\beta$ is the risk-sensitivity parameter, which controls the behavior (a short expansion illustrating the limiting cases follows this list):
    - If $\beta \to 0$, the objective becomes the standard expected reward (risk-neutral).
    - If $\beta > 0$ and increases, the objective becomes more sensitive to high-reward outcomes, approaching the maximum possible reward (risk-seeking). This is the setting used in the paper to drive exploration.
    - If $\beta < 0$, the objective becomes sensitive to low-reward outcomes, approaching the minimum reward (risk-averse).
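A short expansion, standard for exponential utility and written here for illustration rather than quoted from the paper, makes the interpolation explicit:
$$\frac{1}{\beta}\log \mathbb{E}\!\left[e^{\beta R}\right] = \mathbb{E}[R] + \frac{\beta}{2}\operatorname{Var}(R) + O(\beta^{2}), \qquad \lim_{\beta \to \infty} \frac{1}{\beta}\log \mathbb{E}\!\left[e^{\beta R}\right] = \sup_{y \in \operatorname{supp}\,\pi(\cdot\mid x)} R(x, y).$$
Small $\beta$ thus behaves like the mean reward plus a variance bonus, while large $\beta$ is dominated by the best reward reachable under the policy.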
- The Risk-Sensitive Policy Gradient and Advantage Function: The authors derive the policy gradient for this new objective (Theorem 1), which has a similar structure to the standard policy gradient but uses a new risk-sensitive advantage function, $A_\beta$, given by:
  $$A_\beta(x, y) = \frac{1}{\beta}\left( \frac{e^{\beta R(x, y)}}{\mathbb{E}_{y' \sim \pi(\cdot \mid x)}\left[ e^{\beta R(x, y')} \right]} - 1 \right)$$
  In practice, the expectation in the denominator is estimated using $G$ samples $y_1, \dots, y_G$ drawn from the policy for the given prompt:
  $$\hat{A}_\beta(x, y_i) = \frac{1}{\beta}\left( \frac{e^{\beta R(x, y_i)}}{\frac{1}{G}\sum_{j=1}^{G} e^{\beta R(x, y_j)}} - 1 \right)$$
  This practical form, `RS-GRPO`, is a simple "drop-in" replacement for the standard advantage calculation in algorithms like `GRPO`; a minimal sketch of the substitution appears below.
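A minimal sketch of the drop-in substitution, assuming the group-sampling setup above; the function names and the exact scaling (the $1/\beta$ factor and the max-subtraction for numerical stability) are illustrative choices, not the authors' released code:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO: normalize rewards within the group sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rs_grpo_advantages(rewards: np.ndarray, beta: float) -> np.ndarray:
    """Risk-sensitive drop-in: exponentiate rewards and normalize by the group mean
    of exp(beta * R); the 1/beta rescaling makes beta -> 0 recover R - mean(R)."""
    w = np.exp(beta * (rewards - rewards.max()))  # subtract max for numerical stability
    return (w / w.mean() - 1.0) / beta            # (e^{bR} / mean e^{bR} - 1) / b

# Toy check with binary rewards from 8 samples per prompt:
hard = np.array([1.0] + [0.0] * 7)        # hard prompt: 1/8 correct
medium = np.array([1.0] * 4 + [0.0] * 4)  # medium prompt: 4/8 correct
for beta in (0.01, 8.0):
    print(beta,
          np.abs(rs_grpo_advantages(hard, beta)).sum(),
          np.abs(rs_grpo_advantages(medium, beta)).sum())
# As beta grows, the total |advantage| on the medium prompt shrinks while the hard
# prompt keeps a strong signal, i.e. learning shifts toward harder prompts.
```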
- Analysis of the Advantage Function:
Figure 2 (chart): shows how the advantage varies with reward (continuous-reward setting, panel a) and with prompt accuracy (binary-reward setting, panel b) for four values of β (0, 2, 4, 8). Larger β places more learning weight on hard problems and thereby shapes exploration. In panel (a) the advantage rises with reward; in panel (b) the positive, negative, and cumulative advantages follow different curves as prompt accuracy changes, illustrating how risk-sensitive RL modulates exploration.
Image 1 (Figure 2) visualizes how this new advantage function behaves.
- (a) Continuous Reward Setting: In the standard case ($\beta = 0$), the advantage is linear in the reward. As $\beta$ increases, the curve becomes much steeper for high rewards and flatter for low rewards. This means high-reward solutions get a much stronger positive update signal, while low-reward solutions are penalized less.
- (b) Binary Reward Setting: This plot shows the advantage as a function of "prompt accuracy" (the fraction of correct solutions for a given prompt).
  - Positive/Negative: For hard prompts (low accuracy), a correct solution gets a very large positive advantage (strong encouragement), and an incorrect solution gets a small penalty. For easy prompts (high accuracy), the penalty for an incorrect solution is large, but the reward for a correct one is small.
  - Cumulative: The total magnitude of the learning signal (cumulative absolute advantage) is highest for hard prompts when $\beta$ is large. In contrast, for standard RL ($\beta = 0$), the learning signal peaks for prompts with 50% accuracy.
  - Takeaway: Increasing $\beta$ systematically shifts the learning focus towards harder prompts, where the model is struggling, thereby forcing it to explore new strategies to solve them. (A closed-form illustration for binary rewards follows.)
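To connect panel (b) with the advantage sketched above: for binary rewards on a prompt with accuracy $p$ (using the same illustrative form, so treat the exact constants as an assumption), a correct and an incorrect solution receive
$$A_\beta^{+}(p) = \frac{1}{\beta}\left(\frac{e^{\beta}}{p\,e^{\beta} + 1 - p} - 1\right), \qquad A_\beta^{-}(p) = \frac{1}{\beta}\left(\frac{1}{p\,e^{\beta} + 1 - p} - 1\right).$$
As $\beta \to 0$ these reduce to $1 - p$ and $-p$, so the expected per-sample magnitude $2p(1-p)$ peaks at 50% accuracy; for large $\beta$ it behaves like $2(1-p)/\beta$, which is largest for the hardest prompts, matching the cumulative curves described above.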
5. Why Risk-Sensitive RL is Better
The paper provides both empirical and theoretical arguments for why the risk-sensitive approach is superior for LLM fine-tuning.
- Empirical Perspective: The Bandit Experiment
Figure 3 (two-panel chart): the left panel shows the reward of each action (gray bars) together with the initial policy distribution (blue curve); the reward landscape is multi-modal and the initial policy is concentrated in a lower-reward region. The right panel plots reward against training steps for different β values; larger β converges faster and to a higher final reward, illustrating the benefit of risk-sensitive RL for exploration and reward optimization.
Image 2 (Figure 3) illustrates the core problem with a simple multi-armed bandit experiment.
- Left Plot: The reward landscape has a global optimum (reward=1.0) and a significant local optimum (reward=0.6). The initial policy (blue curve) is sharply peaked around the local optimum, mimicking a pre-trained LLM with strong biases.
- Right Plot: The learning curves show that the standard risk-neutral policy ($\beta = 0$) gets permanently stuck at the local optimum (reward 0.6). In contrast, risk-seeking policies ($\beta > 0$) successfully explore the landscape, escape the local trap, and converge to the global optimum (reward 1.0). A toy simulation in this spirit is sketched below.
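A toy reproduction in the spirit of this experiment; the reward values, initial peak, and hyperparameters are illustrative and not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.1, 0.6, 0.1, 1.0, 0.1])       # arm 1: local optimum, arm 3: global optimum
init_logits = np.array([0.0, 4.0, 0.0, 0.0, 0.0])   # policy sharply peaked on the local optimum

def train(beta, steps=2000, lr=0.5, batch=64):
    theta = init_logits.copy()
    for _ in range(steps):
        p = np.exp(theta - theta.max()); p /= p.sum()
        arms = rng.choice(len(p), size=batch, p=p)
        r = rewards[arms]
        if beta == 0.0:                              # risk-neutral advantage
            adv = r - r.mean()
        else:                                        # risk-seeking advantage
            w = np.exp(beta * (r - r.max()))
            adv = (w / w.mean() - 1.0) / beta
        grad = np.zeros_like(theta)
        for a, A in zip(arms, adv):                  # REINFORCE: grad log pi(a) = onehot(a) - p
            g = -p.copy(); g[a] += 1.0
            grad += A * g
        theta += lr * grad / batch
    p = np.exp(theta - theta.max()); p /= p.sum()
    return float(rewards @ p)                        # expected reward of the final policy

for beta in (0.0, 4.0, 8.0):
    print(f"beta={beta}: final expected reward ~ {train(beta):.2f}")
# Risk-neutral (beta=0) tends to stay near the 0.6 local optimum; larger beta escapes toward 1.0.
```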
- Theoretical Perspective: The paper presents three lemmas in a simplified bandit setting that provide theoretical intuition.
- Lemma 2: The standard policy gradient update can actually decrease the probability of selecting the optimal action. This can happen if a suboptimal action is still better than the average, which misguides the update away from the true best option.
- Lemma 3: The risk-sensitive policy gradient, for a sufficiently large $\beta$, is guaranteed to increase the probability of selecting the optimal action. This explains its ability to escape local optima.
- Lemma 4: Using an infinitely large $\beta$ is not ideal. Beyond a certain threshold, increasing $\beta$ further will slow down the convergence speed, even though the update direction remains correct. This implies there is a "sweet spot" for $\beta$ that balances aggressive exploration with efficient learning.
6. Experimental Setup
- Datasets:
  - Training: Three mathematical reasoning datasets of varying sizes were used: `math12k`, `dapo17k`, and `deepmath103k`.
  - Evaluation: Six challenging mathematical reasoning benchmarks were used: `MATH500`, `AIME24`, `AIME25`, `HMMT-Feb24`, `HMMT-Feb25`, and `CMIMC25`.
- Models: The experiments were conducted on a diverse set of five base LLMs: `Qwen2.5-Math-1.5B`, `Qwen2.5-Math-7B`, `Qwen2.5-7B`, `Qwen3-4B-Base`, and `Llama3.1-8B-Instruct`.
- Evaluation Metrics:
  - `pass@1`: Measures single-solution accuracy.
  - `pass@k`: Measures multi-solution accuracy, with k up to 1024. `pass@32` is reported in detail in the tables. (A standard estimator for this metric is sketched at the end of this section.)
- Baselines:
  - Base Model: The original pre-trained LLM without any RL fine-tuning.
  - GRPO: The standard risk-neutral RL algorithm.
  - Other `pass@k` Optimization Methods: Works by Chen et al. [9], Mahdavi et al. [39], and Walder and Karkhanis [57].
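For reference, the standard unbiased estimator commonly used for `pass@k` when n solutions are sampled and c of them are correct; whether the paper uses exactly this form is not stated here, so treat it as the usual convention rather than the authors' evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions of which c are correct."""
    if n - c < k:          # with fewer than k incorrect samples, every k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1024, c=3, k=32))   # probability that at least one of 32 draws is correct
```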
7. Results & Analysis
- Core Results: `pass@k` Performance
  Figure 4 (multi-panel line chart): compares the Pass@k of the Base, GRPO, and RS-GRPO algorithms on six mathematical reasoning benchmarks (AIME24, AIME25, HMMT_Feb24, HMMT_Feb25, CMIMC25, MATH500). The x-axis is k and the y-axis is Pass@k (%). The red curve (RS-GRPO) lies above the other two in most cases, showing a clear advantage in multi-solution performance.
  Image 3 (Figure 4) shows the `pass@k` curves for the base models, `GRPO`, and `RS-GRPO`.
  - `RS-GRPO` (red line) consistently and significantly outperforms both the base model (orange line) and `GRPO` (green line) across all models and benchmarks.
  - Crucially, for several models, `GRPO` actually performs worse than the base model for high values of k. This is direct evidence of the exploration dilemma: `GRPO` sharpens the policy so much that it loses the diversity present in the original model.
  - `RS-GRPO`, in contrast, surpasses the base model even at high k, demonstrating that it successfully expands the model's exploratory capabilities and discovers new, correct solutions.
- Ablation Study: Impact of Hyperparameter $\beta$
  Figure 5 (chart): shows how performance evolves over training steps for different values of β (0, 2, 4, 8). The figure has two rows of four subplots each: Cumulative Solve Rate, Training Rewards, Testing Pass@1 (%), and Testing Pass@32 (%). As β increases, the cumulative solve rate and test pass rates generally improve, with β = 8 performing best at larger step counts, illustrating the positive effect of risk-sensitive RL on diversity and performance.
  Image 4 (Figure 5) analyzes the effect of different $\beta$ values during training.
  - Exploration vs. Reward: As $\beta$ increases, the `Cumulative Solve Rate` on the training data improves (more problems are solved at least once), but the average `Training Reward` grows more slowly. This aligns with the theory that risk-seeking prioritizes solving hard problems over maximizing the average score.
  - Test Performance: Higher $\beta$ values lead to substantial gains in `Testing Pass@32`. While `pass@1` performance is generally maintained, a moderate $\beta$ can even lead to a slight improvement in `pass@1`.
  - Conclusion: The authors conclude that a moderate $\beta$ offers the best trade-off, achieving strong `pass@k` gains while also enhancing `pass@1`.
- Comparison to Other `pass@k` Baselines
  Figure 6 (chart): plots test pass rates (Pass@1 and Pass@32, %) over training steps for four methods: Chen et al., Mahdavi et al., Walder & Karkhanis, and the proposed Risk-Sensitive approach. The risk-sensitive method is overall better than or close to the others, and is especially strong on Pass@32, indicating more effective exploration of diverse solutions.
  Image 5 (Figure 6) and Table 2 in the paper compare `RS-GRPO` with other methods designed to optimize `pass@k`.
  - `RS-GRPO` generally achieves `pass@32` performance comparable to the best baselines but consistently outperforms them on `pass@1`. This supports the claim that its "denser advantage signal" allows for more balanced optimization.
  - The method by Walder and Karkhanis [57] performs poorly, which the authors attribute to its advantage estimates always being positive, leading to a quick collapse in policy diversity (entropy).
- Analysis of Pass@k Improvement
  Figure 7 (two-panel chart): the left panel is a bar chart comparing GRPO and RS-GRPO on the ratio of unique correct answers, with RS-GRPO clearly higher. The right panel is a heatmap of the joint distribution of per-problem accuracies under the two methods, with cell shading and printed values indicating proportions; RS-GRPO shifts mass toward higher-accuracy regions. Together they show RS-GRPO's improvements in both diversity and accuracy.
  Image 6 (Figure 7) provides insight into how `RS-GRPO` works.
  - Left (Bar Chart): `RS-GRPO` generates a significantly higher ratio of unique correct answers compared to `GRPO`. This directly confirms that it increases solution diversity.
  - Right (Heatmap): This transition map shows how prompt accuracies change from `GRPO` to `RS-GRPO`. The most significant change is in the bottom-left corner: 8% of problems that `GRPO` could not solve at all (accuracy 0) are now solved with some non-zero accuracy by `RS-GRPO`. This is the primary driver of the `pass@k` improvement: the model learns to solve new, hard problems.
8. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies and addresses the "exploration dilemma" in RL fine-tuning of LLMs. Standard RL methods often over-exploit an LLM's initial biases, improving `pass@1` but harming the more general `pass@k` metric. The proposed Risk-Sensitive GRPO (RS-GRPO) algorithm, based on a risk-seeking objective, effectively encourages exploration of the solution space. This leads to the discovery of novel reasoning paths, significantly improving `pass@k` performance while maintaining or enhancing `pass@1`, thus achieving a superior trade-off.
, thus achieving a superior trade-off. -
Limitations & Future Work: The authors note a key limitation: their experiments used a fixed risk-sensitivity parameter throughout training. They experimented with dynamically adjusting (e.g., decaying it over time) but found that none of these heuristics outperformed a single, well-chosen fixed value. Devising an optimal schedule for balancing exploration and exploitation by dynamically tuning remains a challenging open problem.
- Personal Insights & Critique:
  - The paper's strength lies in its clear diagnosis of a critical problem and the elegance of its solution. The `RS-GRPO` algorithm requires only a minor modification to existing RL pipelines but yields substantial and consistent improvements.
  - The connection drawn to risk-sensitive control theory provides a strong theoretical foundation for the approach. The simple bandit experiment is a very effective illustration of the core intuition.
  - The results strongly suggest that the choice of the RL objective function is as important as the algorithm itself, especially when fine-tuning powerful but biased pre-trained models.
  - The framework is general and could likely be applied to other generative domains beyond mathematical reasoning, such as code generation, scientific discovery, or creative writing, where discovering diverse, high-quality solutions is valuable.