- Title: QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
- Authors: Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, Jingzhao Zhang
- Affiliations: Tsinghua University, Shanghai Qi Zhi Institute, Amazon, Stanford University
- Journal/Conference: Preprint on arXiv; as of its submission it has not been published in a peer-reviewed conference or journal.
- Publication Year: 2025 (consistent with the arXiv identifier 2507.13266, which indicates a July 2025 submission).
- Abstract: The paper addresses a key challenge in using Reinforcement Learning (RL) to train Large Language Models (LLMs) for reasoning tasks: RL often fails to improve a model's ability to solve problems that are initially beyond its capacity. The authors propose QuestA, a simple data augmentation strategy that introduces partial solutions (hints) into difficult problems during RL training. This technique reduces problem difficulty and provides more effective learning signals. When applied to math reasoning, QuestA significantly improves performance on both pass@1 (first-try success) and pass@k (success within k attempts), especially on problems where standard RL struggles. The method achieves new state-of-the-art results for 1.5B-parameter models on several challenging math benchmarks, including AIME and HMMT.
- Original Source Link: https://www.arxiv.org/abs/2507.13266
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The core of the paper is the QuestA framework, which scaffolds RL training by augmenting difficult questions with partial solutions.
- Principles: The central intuition is that if a problem is too hard for a model to solve from scratch, the RL algorithm will rarely receive a positive reward, and thus learning will stall. By providing a "hint" (a part of the correct solution), the problem is simplified, increasing the chance of generating a correct completion. This provides the necessary reward signal for the model to learn the remaining reasoning steps. Over time, the model's ability to solve the original, un-hinted problem improves.
- Steps & Procedures:
  - Question Augmentation Mechanism (a code sketch follows this list):
    - For a given problem x with a known solution y, QuestA creates an augmented prompt x̃(p).
    - This is done by taking the first p% of the solution y (based on token count) and appending it to the original question x as a hint.
    - The paper provides an example in Figure 4, where a complex math problem is augmented with a hint explaining a key insight: the function f is an involution. This hint guides the model toward the correct reasoning path.
    Figure (bar chart): accuracy (Avg@32, %) on five math reasoning benchmarks, AIME24, AIME25, HMMT FEB 25, Olympiad Bench, and BRUMO25, comparing Qwen3-1.7B, Nemotron-1.5B, QuestA-Nemotron-1.5B, and DeepSeek-R1-Distill-32B; the QuestA-trained Nemotron-1.5B scores markedly higher than the other models on every benchmark.
  - Targeting High-Difficulty Problems:
    - QuestA is not applied to all problems, only to the hardest ones, where the base model consistently fails.
    - A two-stage filtering process is used:
      - Stage 1: A weak model is used to filter a large dataset (220K problems) down to a smaller set of hard candidates (26K).
      - Stage 2: The current model being trained is used to sample solutions for these augmented prompts. Only prompts where the model still has a low success rate (i.e., high variance in correctness) are kept for training. This focuses the training effort on problems in the "zone of proximal development": challenging, but not impossible with a hint.
  - Integrating with RL Pipelines (Iterative Curriculum):
    - QuestA is designed to be easily integrated into existing RL pipelines such as GRPO: the original dataset is simply replaced with the QuestA-augmented dataset.
    - The authors use an iterative curriculum to gradually reduce the model's reliance on hints:
      - Phase 1: Train the model on problems augmented with a large hint (e.g., p = 50%).
      - Phase 2: Once performance saturates, reduce the hint size (e.g., to p = 25%) and continue training.
    - This process encourages the model to transition from scaffolded reasoning to autonomous problem-solving.
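To make the three steps concrete, below is a minimal Python sketch of an augment-filter-train loop under the assumptions of this summary. The function names (`augment_question`, `pass_rate`, `build_training_set`), the whitespace tokenization, and the 8-sample filtering budget are illustrative choices, not the authors' implementation.

```python
# A minimal sketch of the pipeline described above (augment, filter, then train
# with a shrinking hint). All names are illustrative; whitespace tokenization
# stands in for real tokenizer-based truncation of the reference solution.

import random


def augment_question(question: str, solution: str, p: float) -> str:
    """Append the first p fraction of the reference solution as a hint."""
    tokens = solution.split()                      # simplified tokenization
    hint = " ".join(tokens[: int(p * len(tokens))])
    return (
        f"{question}\n\n"
        f"Partial solution (hint):\n{hint}\n\n"
        f"Continue from the hint and complete the solution."
    )


def pass_rate(sample_fn, prompt: str, answer: str, n: int = 8) -> float:
    """Stage-2 filter: estimate the current model's success rate on a prompt."""
    correct = sum(sample_fn(prompt) == answer for _ in range(n))
    return correct / n


def build_training_set(problems, sample_fn, p: float, max_rate: float = 0.5):
    """Keep augmented prompts the current model can sometimes, but rarely, solve."""
    kept = []
    for question, solution, answer in problems:
        prompt = augment_question(question, solution, p)
        rate = pass_rate(sample_fn, prompt, answer)
        if 0.0 < rate <= max_rate:                 # informative, non-saturated signal
            kept.append((prompt, answer))
    return kept


def dummy_model(prompt: str) -> str:
    """Placeholder policy that guesses an answer at random."""
    return random.choice(["4", "5"])


if __name__ == "__main__":
    problems = [("What is 2 + 2?", "We add 2 and 2 to obtain 4.", "4")]
    # Iterative curriculum: train with 50% hints first, then shrink to 25%.
    for p in (0.5, 0.25):
        data = build_training_set(problems, dummy_model, p)
        print(f"p = {p:.2f}: kept {len(data)} augmented prompts")
        # run_rl_training(data)  # e.g., a GRPO step on the augmented prompts
```

In a real pipeline, `dummy_model` would be replaced by sampling from the policy currently being trained, and the commented-out call would be the GRPO update on the augmented prompts; re-running the filter against the current model is what keeps the retained prompts hard but learnable.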
- Mathematical Formulas & Key Details (Theoretical Justification): The paper formalizes why hints improve RL efficiency.
  - Solution Set S(q): the set of all correct solution trajectories τ for a question q:
    $$S(q) = \{\, \tau \in V^{*} \mid R(q, \tau) = 1 \,\}$$
  - Model Capacity Set C(q, δ_p): the smallest set of trajectories that the model μ is most likely to generate, covering at least 1 − δ_p of the total probability mass:
    $$C(q, \delta_p) = \arg\min_{S \subseteq V^{*}} \Big\{\, |S| \;:\; \sum_{\tau \in S} P_{\mu}(q, \tau) \ge 1 - \delta_p \,\Big\}$$
  - The Bottleneck (Theorem 4.4): If the model's capacity set does not overlap with the solution set, i.e., $C(q, \delta_p) \cap S(q) = \emptyset$, the model is highly unlikely to sample a correct solution. Under the assumption that RL algorithms do not update weights without a positive reward, the training process will stall.
  - The Solution (Theorem 4.6): QuestA introduces a hint $h_q$ that decomposes the problem. The model only needs to generate the hint and then the remaining solution separately, each with a reasonable probability $\delta_p'$. The probability of generating the full solution from scratch may be very low (e.g., $(\delta_p')^2$), but with the hint provided, the model only needs to solve the second part. This dramatically increases the chance of receiving a positive reward, so a much smaller sampling budget, $O(1/\delta_p')$ instead of $O(1/\delta_p)$, suffices for learning to take off (see the illustrative display below).
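To spell out the sampling-budget intuition in the same notation, the display below restates it as a simple expected-number-of-trials calculation. This is an illustrative heuristic: the identification of $1/(\delta_p')^2$ with $O(1/\delta_p)$ follows the juxtaposition in the bullet above rather than the formal statements of Theorems 4.4 and 4.6.

```latex
% Illustrative geometric-trial heuristic, not the formal theorem statement.
\[
\text{without hint:}\quad
\Pr[\tau \in S(q)] \approx (\delta_p')^2
\;\Longrightarrow\;
\mathbb{E}[\text{samples until first reward}] \approx \frac{1}{(\delta_p')^2} = O\!\left(\frac{1}{\delta_p}\right),
\]
\[
\text{with hint } h_q:\quad
\Pr[\tau \in S(q) \mid h_q] \approx \delta_p'
\;\Longrightarrow\;
\mathbb{E}[\text{samples until first reward}] \approx \frac{1}{\delta_p'} = O\!\left(\frac{1}{\delta_p'}\right).
\]
```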
5. Experimental Setup
- Datasets:
  - Training: The primary dataset is OpenR1-Math-220K, filtered down to 26K hard problems. For controlled experiments, this set was further divided into Easy Data (the model gets 7-8/8 samples correct) and Hard Data (the model gets 0-1/8 correct).
  - Evaluation: A suite of challenging math competition benchmarks: AIME24, AIME25, HMMT FEB 25, Olympiad Bench, and BRUMO25.
- Evaluation Metrics:
  - The primary metric is pass@1, the probability of getting the correct answer on the first try; results are averaged over 32 samples per problem. pass@k is also used extensively to analyze the diversity and robustness of the model's reasoning, computed with an unbiased estimator to ensure accuracy (see the sketch at the end of this section).
  - Crucially, no hints are provided during evaluation. This tests whether the model has truly learned to solve the problems independently.
- Baselines:
  - The primary base models are Nemotron-1.5B and DeepScaleR-1.5B.
  - Comparisons are made against other strong open-source models of various sizes, including Qwen3-1.7B, Qwen3-8B, and DeepSeek-R1-Distill-32B.
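For reference, here is a short sketch of the standard unbiased pass@k estimator (the combinatorial form popularized by Chen et al., 2021), which is presumably what the summary above means by an unbiased estimator; the function name and the example counts are illustrative.

```python
# Sketch of the standard unbiased pass@k estimator (Chen et al., 2021 form),
# assumed here to be the estimator the paper refers to. For each problem,
# n samples are drawn and c of them are correct; pass@k is averaged over problems.

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    drawn (without replacement) from the n generated samples is correct."""
    if n - c < k:   # fewer than k incorrect samples: any k-draw contains a correct one
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


if __name__ == "__main__":
    # Example in the Avg@32 setting: 32 samples per problem, 4 of them correct.
    print(round(pass_at_k(n=32, c=4, k=1), 4))   # 0.125, i.e. the mean per-sample accuracy
    print(round(pass_at_k(n=32, c=4, k=8), 4))   # chance that at least one of 8 tries is correct
```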
6. Results & Analysis
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces QuestA, a simple, lightweight, and highly effective data augmentation framework for improving LLM reasoning with RL. By scaffolding hard problems with partial-solution hints, QuestA provides denser reward signals, accelerates learning, and avoids common RL pitfalls such as entropy collapse. It sets a new state of the art for 1.5B models in mathematical reasoning and demonstrates that a data-centric approach can be a powerful alternative or complement to algorithmic innovations.
- Limitations & Future Work:
  - The authors themselves suggest that the primary future direction is to generalize QuestA to other complex reasoning domains beyond mathematics, such as competitive programming, software engineering, and other agentic tasks.
  - Designing the optimal "hint" structure and curriculum for these new domains will be an important area of research. For instance, in coding, a hint might be a function signature, a high-level algorithmic sketch, or a key data structure.
- Personal Insights & Critique:
  - The elegance of QuestA lies in its simplicity: it addresses a fundamental bottleneck in RL (sparse rewards) with a straightforward data preprocessing step, making it highly practical and easy to adopt.
  - The paper's strength is its rigorous empirical validation, including controlled experiments on data difficulty, extensive ablations, and analysis across different models and datasets.
  - The theoretical justification, while based on a simplified tabular RL model, provides a clear and compelling intuition for why the method is so effective, connecting the empirical success to core principles of sample efficiency in RL.
  - This work underscores a growing trend in AI research: sometimes the most significant gains come not from more complex models or algorithms, but from being smarter about the data used for training. QuestA is a prime example of effective "data-centric AI."