QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
TL;DR Summary
QuestA enhances LLM reasoning by augmenting RL training with partial solutions, effectively reducing problem difficulty and improving learning signals where standard RL fails. This method achieved new SOTA results on math reasoning benchmarks (e.g., AIME, HMMT) with 1.5B models.
Abstract
Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Yet recent studies question RL's ability to incentivize reasoning capacity beyond the base model. This raises a key challenge: how can RL be adapted to solve harder reasoning problems more effectively? To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50% (+10.73%) on AIME24, 62.29% (+12.79%) on AIME25, and 41.67% (+10.11%) on HMMT25. Code, data and model are available at this https URL.
English Analysis
1. Bibliographic Information
- Title: QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
- Authors: Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, Jingzhao Zhang
- Affiliations: Tsinghua University, Shanghai Qi Zhi Institute, Amazon, Stanford University
- Journal/Conference: This paper is a preprint available on arXiv. As of its submission date, it has not yet been published in a peer-reviewed conference or journal. arXiv is a common platform for researchers to share their work early.
- Publication Year: 2025 (the arXiv identifier 2507.13266 corresponds to a July 2025 submission).
- Abstract: The paper addresses a key challenge in using Reinforcement Learning (RL) to train Large Language Models (LLMs) for reasoning tasks: RL often fails to improve a model's ability to solve problems that are initially beyond its capacity. The authors propose QuestA, a simple data augmentation strategy that introduces partial solutions (hints) into difficult problems during RL training. This technique reduces problem difficulty and provides more effective learning signals. When applied to math reasoning, QuestA significantly improves performance on both pass@1 (first-try success) and pass@k (success within k attempts), especially on problems where standard RL struggles. The method achieves new state-of-the-art results for 1.5B-parameter models on several challenging math benchmarks, including AIME and HMMT.
- Original Source Link: https://www.arxiv.org/abs/2507.13266
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: While Reinforcement Learning (RL) is a popular technique for improving LLM reasoning, recent studies show it often struggles to expand a model's fundamental reasoning capacity. Instead, it tends to exploit what the model already knows, leading to overfitting on easier problems and a decrease in performance on harder ones (a phenomenon known as "entropy collapse"). Training directly on hard problems is inefficient because the model rarely produces a correct answer, resulting in sparse rewards and slow learning.
- Importance: As LLMs are increasingly tasked with complex reasoning (e.g., mathematics, coding), finding methods to genuinely enhance their problem-solving abilities, rather than just refining existing knowledge, is critical for progress.
- Innovation: The paper introduces a data-centric solution instead of a complex algorithmic one. The core idea, QuestA (Question Augmentation), is to make hard problems tractable by embedding partial solutions as "hints" directly into the training prompts. This acts as a form of curriculum learning, scaffolding the model's learning process.
- Main Contributions / Findings (What):
- Identifies a Key Trade-off: The paper empirically demonstrates that training on easy problems hurts reasoning diversity (pass@k), while training on hard problems is too slow to be practical. This highlights a fundamental tension between learning efficiency and expanding reasoning capacity.
- Proposes QuestA: A simple and effective data augmentation method that injects partial solutions into hard problems during RL training. The method is modular and can be integrated with existing RL pipelines without changing the underlying algorithm.
- Achieves State-of-the-Art Results: QuestA significantly boosts the performance of 1.5B-parameter models on difficult math benchmarks, outperforming other models of similar size and even competing with models over 20 times larger. For example, QUESTA-Nemotron-1.5B achieves 72.50% on AIME24 and 62.29% on AIME25.
- Provides Theoretical Justification: The authors provide a theoretical framework explaining why QuestA works. By breaking down a problem, the hints increase the probability of sampling a correct solution, which provides the dense reward signals needed for the RL algorithm to make progress and improves sample efficiency.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are deep learning models with billions of parameters, trained on vast amounts of text data. They can generate human-like text, answer questions, and perform complex reasoning tasks.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In the context of LLMs, the "agent" is the model, the "action" is generating text (a solution), and the "reward" is a signal indicating whether the solution is correct.
- Reinforcement Learning with Verifiable Rewards (RLVR): A specialized form of RL for tasks where correctness can be automatically verified, such as math problems (checking the final answer) or code generation (running unit tests). This provides a clear, objective reward signal.
- pass@k: An evaluation metric used to measure the problem-solving ability of a model. It calculates the probability that at least one correct solution is generated out of k independent attempts. A higher pass@k indicates better performance and, for larger k, greater diversity in correct solutions (a reference formula is given after this list).
- Entropy Collapse: A problem in RL where the model becomes overconfident and repeatedly generates the same or very similar outputs. This reduces its ability to explore different solution paths, which can hurt its performance on problems that require novel reasoning.
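For reference, the standard definition (not a formula quoted from this paper) can be written as follows, where q is the model's per-sample success probability on a given problem and the k samples are drawn independently:

```latex
% pass@k: probability that at least one of k independent samples is correct,
% for a problem the model solves with per-sample probability q.
\text{pass@}k \;=\; 1 - (1 - q)^k
```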
- Previous Works & Differentiation:
- Prior work has focused on improving RL for reasoning by modifying the RL algorithm itself (e.g., GRPO, DAPO), adjusting the reward function (e.g., adding process-based rewards), or managing training dynamics to prevent entropy collapse.
- QuestA differentiates itself by being an orthogonal, data-centric approach. It does not change the RL algorithm or the reward model. Instead, it modifies the input data to make learning more efficient. This makes it a "plug-and-play" solution that can be combined with other advanced RL techniques. The idea is similar to curriculum learning, where a model is first trained on easier tasks before moving to harder ones, but QuestA implements this by dynamically adjusting the difficulty of individual problems.
4. Methodology (Core Technology & Implementation)
The core of the paper is the QuestA framework, which scaffolds RL training by augmenting difficult questions with partial solutions.
- Principles: The central intuition is that if a problem is too hard for a model to solve from scratch, the RL algorithm will rarely receive a positive reward, and thus learning will stall. By providing a "hint" (a part of the correct solution), the problem is simplified, increasing the chance of generating a correct completion. This provides the necessary reward signal for the model to learn the remaining reasoning steps. Over time, the model's ability to solve the original, un-hinted problem improves.
- Steps & Procedures:
- Question Augmentation Mechanism:
- For a given problem x with a known solution y, QuestA creates an augmented prompt x̃(p).
- This is done by taking the first p% of the solution y (based on token count) and appending it to the original question x as a hint (see the sketch below).
- The paper provides an example in Figure 4, where a complex math problem is augmented with a hint explaining a key insight: the function f is an involution. This hint guides the model toward the correct reasoning path.
- [Figure: bar chart of accuracy (Avg@32, %) on five math reasoning benchmarks (AIME24, AIME25, HMMT FEB 25, Olympiad Bench, BRUMO25) comparing Qwen3-1.7B, Nemotron-1.5B, QuestA-Nemotron-1.5B, and DeepSeek-R1-Distill-32B; QuestA-Nemotron-1.5B scores markedly higher across all benchmarks.]
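A minimal sketch of the augmentation step is shown below. The function name, prompt template, and tokenizer interface are illustrative assumptions; the paper specifies only that the first p% of the reference solution's tokens is appended to the question as a hint.

```python
def augment_question(question: str, solution: str, p: float, tokenizer) -> str:
    """Build a QuestA-style augmented prompt x~(p) by prefixing a fraction p of the solution.

    `tokenizer` is assumed to expose encode/decode (e.g., a Hugging Face tokenizer);
    the wording of the prompt template below is a guess, not the paper's exact format.
    """
    sol_tokens = tokenizer.encode(solution)
    n_hint = int(len(sol_tokens) * p)              # first p fraction of solution tokens
    hint = tokenizer.decode(sol_tokens[:n_hint])
    return (
        f"{question}\n\n"
        f"Partial solution (use it as a hint):\n{hint}\n\n"
        f"Continue from the partial solution and give the final answer."
    )

# Example (hypothetical): a phase-1 prompt with a 50% hint
# prompt = augment_question(x, y, p=0.50, tokenizer=tok)
```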
- Targeting High-Difficulty Problems:
- QuestA is not applied to all problems, only the hardest ones where the base model consistently fails.
- A two-stage filtering process is used (a rough sketch follows this list):
- Stage 1: A weak model is used to filter a large dataset (220K problems) down to a smaller set of hard candidates (26K).
- Stage 2: The current model being trained is used to sample solutions for these augmented prompts. Only prompts where the model still has a low success rate (i.e., high variance in correctness) are kept for training. This focuses the training effort on problems that are in the "zone of proximal development": challenging, but not impossible with a hint.
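The sketch below illustrates the stage-2 idea under simple assumptions: `sample_fn` and `verify_fn` stand in for the actual rollout and answer-checking code of an RLVR pipeline, and the 0-to-0.5 success-rate window is an illustrative threshold, not the paper's exact criterion.

```python
from typing import Callable, Dict, List

def estimate_pass_rate(sample_fn: Callable[[str], str],
                       verify_fn: Callable[[str, str], bool],
                       prompt: str, answer: str, n: int = 8) -> float:
    """Sample n completions for the prompt and return the fraction judged correct."""
    correct = sum(verify_fn(sample_fn(prompt), answer) for _ in range(n))
    return correct / n

def filter_hard_prompts(sample_fn: Callable[[str], str],
                        verify_fn: Callable[[str, str], bool],
                        dataset: List[Dict[str, str]],
                        lo: float = 0.0, hi: float = 0.5) -> List[Dict[str, str]]:
    """Keep augmented prompts the current model finds hard but not hopeless."""
    kept = []
    for item in dataset:  # each item: {"prompt": augmented prompt, "answer": gold answer}
        rate = estimate_pass_rate(sample_fn, verify_fn, item["prompt"], item["answer"])
        if lo < rate < hi:  # informative prompts: sometimes solved, mostly not
            kept.append(item)
    return kept
```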
- Integrating with RL Pipelines (Iterative Curriculum):
- QuestA is designed to be easily integrated into existing RL pipelines like GRPO: the original dataset is simply replaced with the QuestA-augmented dataset.
- The authors use an iterative curriculum to gradually reduce the model's reliance on hints (see the loop sketch after this list):
- Phase 1: Train the model on problems augmented with a large hint (e.g., p = 50%).
- Phase 2: Once performance saturates, reduce the hint size (e.g., to p = 25%) and continue training.
- This process encourages the model to transition from scaffolded reasoning to autonomous problem-solving.
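As a rough sketch of this two-phase schedule, the loop below wraps an unchanged RL trainer. `build_dataset` and `train_until_saturated` are placeholders for the user's own augmentation/filtering and RL (e.g., GRPO) code; the phase list mirrors the 50% → 25% schedule described above.

```python
def questa_curriculum(train_until_saturated, build_dataset, hint_fractions=(0.50, 0.25)):
    """Run RL in phases with progressively smaller hints.

    build_dataset(p)         -> hard prompts augmented with the first fraction p of each solution
    train_until_saturated(d) -> runs the existing RL pipeline on dataset d until performance
                                plateaus, returning the updated policy
    """
    policy = None
    for p in hint_fractions:             # phase 1: p = 0.50, phase 2: p = 0.25
        dataset = build_dataset(p)       # re-augment (and optionally re-filter) the prompts
        policy = train_until_saturated(dataset)
    return policy
```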
- Mathematical Formulas & Key Details (Theoretical Justification): The paper formalizes why hints improve RL efficiency.
- Solution Set: the set of all correct solution trajectories for a question x.
- Model Capacity Set: the smallest set of trajectories that the model is most likely to generate, covering at least a fixed fraction of the total probability mass.
- The Bottleneck (Theorem 4.4): If the model's capacity set does not overlap with the solution set (their intersection is empty), the model is highly unlikely to sample a correct solution. Under the assumption that RL algorithms don't update weights without a positive reward, the training process will stall.
- The Solution (Theorem 4.6): QuestA introduces a hint that decomposes the problem. The model only needs to be able to generate the hint and then the remaining solution separately, each with a reasonable probability (say p). The probability of generating the full solution from scratch can be far lower (on the order of p^2), but with the hint provided, the model only needs to solve the second part. This dramatically increases the chance of receiving a positive reward, requiring a much smaller sampling budget (on the order of 1/p instead of 1/p^2) to start learning effectively.
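A worked instance of this argument, with an assumed per-part success probability of p = 0.05 (the number is illustrative, not taken from the paper):

```latex
% Without a hint: both parts must be produced in a single rollout.
\Pr[\text{full solution from scratch}] \approx p^2 = 0.0025
  \quad\Rightarrow\quad \text{expected rollouts per positive reward} \approx 1/p^2 = 400.
% With the hint supplied in the prompt: only the remaining part is needed.
\Pr[\text{correct completion given the hint}] \approx p = 0.05
  \quad\Rightarrow\quad \text{expected rollouts per positive reward} \approx 1/p = 20.
```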
5. Experimental Setup
- Datasets:
- Training: The primary dataset is OpenR1-Math-220K. This was filtered down to 26K hard problems. For controlled experiments, this set was further divided into Easy Data (model gets 7-8/8 correct) and Hard Data (model gets 0-1/8 correct).
- Evaluation: A suite of challenging math competition benchmarks: AIME24, AIME25, HMMT FEB 25, Olympiad Bench, and BRUMO25.
- Evaluation Metrics:
- The primary metric is pass@1, which is the probability of getting the correct answer on the first try. Results are averaged over 32 samples per problem.
- pass@k is also used extensively to analyze the diversity and robustness of the model's reasoning. The paper uses an unbiased estimator for pass@k to ensure accuracy (the conventional form is sketched below).
- Crucially, no hints are provided during evaluation. This tests whether the model has truly learned to solve the problems independently.
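For reference, the unbiased pass@k estimator conventionally used for this kind of evaluation (the paper does not print its implementation, so this is the standard form rather than the authors' code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a uniformly random
    size-k subset of the n samples contains at least one correct solution.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with n = 32 samples and c = 4 correct, pass@1 = 0.125 and pass@8 ≈ 0.70.
```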
- Baselines:
- The primary base models are Nemotron-1.5B and DeepScaleR-1.5B.
- Comparisons are made against other strong open-source models of various sizes, including Qwen3-1.7B, Qwen3-8B, and DeepSeek-R1-Distill-32B.
6. Results & Analysis
- Core Results:
- As shown in Table 1 and the bar chart from Figure 1 in the paper, QUESTA-Nemotron-1.5B achieves new state-of-the-art results for 1.5B models across all math benchmarks.
- The average performance gain over the base Nemotron-1.5B is over 10%.
- Notably, the 1.5B QuestA model matches or exceeds the performance of the much larger DeepSeek-R1-Distill-32B model, demonstrating the remarkable efficiency of the training method.
- [Figure: Pass@k accuracy versus number of samples (log scale) on AIME25 and HMMT 2025 for Nemotron-1.5B, QuestA-Nemotron-1.5B, Easy-Nemotron-1.5B, and Hard-Nemotron-1.5B; QuestA-Nemotron-1.5B outperforms the other variants on both benchmarks.]
- pass@k Analysis:
- Figure 2 in the paper shows a critical finding. Standard RL on easy prompts (red line) degrades pass@k compared to the base model (blue line), confirming the issue of overfitting. RL on hard prompts (green line) is better but learns slowly.
- QuestA (orange line) provides the best outcome: it consistently improves over standard RL across all values of k and avoids the performance drop seen with easy-prompt training. This indicates that QuestA enhances both the quality and diversity of solutions.
- Generalization at Test Time (Without Hints):
- Figure 6 in the paper shows the distribution of pass rates on the training set before and after QuestA training (evaluated without hints). There is a clear shift from low-success bins (0/8, 1/8 correct) to high-success bins, showing that the model learns to solve the original hard problems autonomously.
- Table 2 further supports this by showing that QuestA reduces the number of completely unsolved problems in the AIME benchmarks at Pass@32. For example, on AIME24, the number of unsolved problems drops from 5 to 2.
- Ablations / Parameter Sensitivity:
- Curriculum Importance (Table 3): A two-stage curriculum (training first with 50% hints, then 25%) significantly outperforms training with only 50% hints for the same number of steps. This confirms the benefit of gradually reducing scaffolding.
- Effect of Hints (Table 5): An ablation study compares training on the hard data with and without the hint text. While training on hard problems alone provides a boost, adding the hint text gives a further significant improvement and reaches the same performance in nearly half the training steps.
- Model Generalization (Table 6): Applying QuestA to a different model, DeepScaleR-1.5B, also yields consistent improvements across all math benchmarks, showing the method is not specific to one model architecture. The pass@k curve for DeepScaleR (Figure 8) also shows a clear lift.
- [Figure: Pass@k accuracy versus sampling budget (log scale) on AIME24 and AIME25 for DeepScaleR-1.5B and QuestA-DeepScaleR-1.5B; QuestA outperforms the base model at all sampling budgets, with the largest gains at small sample counts.]
- Training Dynamics:
- Figure 5 in the paper shows that during QuestA training, the average reward and response length steadily increase.
- Importantly, the average entropy remains stable and does not collapse. This is a key advantage, suggesting that QuestA encourages exploration and diverse reasoning paths, unlike standard RL, which can lead to overconfidence.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces QuestA, a simple, lightweight, and highly effective data augmentation framework for improving LLM reasoning with RL. By scaffolding hard problems with partial-solution hints, QuestA provides denser reward signals, accelerates learning, and avoids common pitfalls of RL such as entropy collapse. It sets a new state-of-the-art for 1.5B models in mathematical reasoning and demonstrates that a data-centric approach can be a powerful alternative or complement to algorithmic innovations.
- Limitations & Future Work:
- The authors themselves suggest that the primary future direction is to generalize QuestA to other complex reasoning domains beyond mathematics, such as competitive programming, software engineering, and other agentic tasks.
- Designing the optimal "hint" structure and curriculum for these new domains will be an important area of research. For instance, in coding, a hint might be a function signature, a high-level algorithmic sketch, or a key data structure.
- Personal Insights & Critique:
- The elegance of QuestA lies in its simplicity. It addresses a fundamental bottleneck in RL (sparse rewards) with a straightforward data preprocessing step, making it highly practical and easy to adopt.
- The paper's strength is its rigorous empirical validation, including controlled experiments on data difficulty, extensive ablations, and analysis across different models and datasets.
- The theoretical justification, while based on a simplified tabular RL model, provides a clear and compelling intuition for why the method is so effective. It connects the empirical success to the core principles of sample efficiency in RL.
- This work underscores a growing trend in AI research: sometimes the most significant gains come not from more complex models or algorithms, but from being smarter about the data used for training. QuestA is a prime example of effective "data-centric AI."