- Title: QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
- Authors: Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, Jingzhao Zhang
- Affiliations: Tsinghua University, Shanghai Qi Zhi Institute, Amazon, Stanford University
- Journal/Conference: Preprint on arXiv; as of its submission it has not been published in a peer-reviewed conference or journal.
- Publication Year: 2025 (consistent with the arXiv identifier 2507.13266, which indicates a July 2025 submission).
- Abstract: The paper addresses a key challenge in using Reinforcement Learning (RL) to train Large Language Models (LLMs) for reasoning tasks: RL often fails to improve a model's ability to solve problems that are initially beyond its capacity. The authors propose QuestA, a simple data augmentation strategy that introduces partial solutions (hints) into difficult problems during RL training. This technique reduces problem difficulty and provides more effective learning signals. When applied to math reasoning, QuestA significantly improves performance on both pass@1 (first-try success) and pass@k (success within k attempts), especially on problems where standard RL struggles. The method achieves new state-of-the-art results for 1.5B-parameter models on several challenging math benchmarks, including AIME and HMMT.
- Original Source Link: https://www.arxiv.org/abs/2507.13266
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The core of the paper is the QuestA framework, which scaffolds RL training by augmenting difficult questions with partial solutions.
- Principles: The central intuition is that if a problem is too hard for a model to solve from scratch, the RL algorithm will rarely receive a positive reward, and thus learning will stall. By providing a "hint" (a part of the correct solution), the problem is simplified, increasing the chance of generating a correct completion. This provides the necessary reward signal for the model to learn the remaining reasoning steps. Over time, the model's ability to solve the original, un-hinted problem improves.
- Steps & Procedures:
  - Question Augmentation Mechanism (a code sketch follows this list):
    - For a given problem x with a known solution y, QuestA creates an augmented prompt x̃(p).
    - This is done by taking the first p% of the solution y (based on token count) and appending it to the original question x as a hint.
    - The paper provides an example in Figure 4, where a complex math problem is augmented with a hint explaining a key insight: the function f is an involution. This hint guides the model toward the correct reasoning path.
    Figure (bar chart): accuracy (Avg@32, %) on five math reasoning benchmarks, AIME24, AIME25, HMMT FEB 25, Olympiad Bench, and BRUMO25, comparing Qwen3-1.7B, Nemotron-1.5B, QuestA-Nemotron-1.5B, and DeepSeek-R1-Distill-32B; the QuestA-trained Nemotron-1.5B scores markedly higher than the other models on every benchmark.
  - Targeting High-Difficulty Problems:
    - QuestA is not applied to all problems, only to the hardest ones, where the base model consistently fails.
    - A two-stage filtering process is used:
      - Stage 1: A weak model is used to filter a large dataset (220K problems) down to a smaller set of hard candidates (26K).
      - Stage 2: The current model being trained is used to sample solutions for these augmented prompts. Only prompts where the model still has a low success rate (i.e., high variance in correctness) are kept for training. This focuses the training effort on problems in the "zone of proximal development": challenging, but not impossible with a hint.
  - Integrating with RL Pipelines (Iterative Curriculum):
    - QuestA is designed to be easily integrated into existing RL pipelines such as GRPO: the original dataset is simply replaced with the QuestA-augmented dataset.
    - The authors use an iterative curriculum to gradually reduce the model's reliance on hints:
      - Phase 1: Train the model on problems augmented with a large hint (e.g., p = 50%).
      - Phase 2: Once performance saturates, reduce the hint size (e.g., to p = 25%) and continue training.
    - This process encourages the model to transition from scaffolded reasoning to autonomous problem-solving.
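To make the three steps concrete, below is a minimal Python sketch of an augment-filter-train loop under the assumptions of this summary. The function names (`augment_question`, `pass_rate`, `build_training_set`), the whitespace tokenization, and the 8-sample filtering budget are illustrative choices, not the authors' implementation.

```python
# A minimal sketch of the pipeline described above (augment, filter, then train
# with a shrinking hint). All names are illustrative; whitespace tokenization
# stands in for real tokenizer-based truncation of the reference solution.

import random


def augment_question(question: str, solution: str, p: float) -> str:
    """Append the first p fraction of the reference solution as a hint."""
    tokens = solution.split()                      # simplified tokenization
    hint = " ".join(tokens[: int(p * len(tokens))])
    return (
        f"{question}\n\n"
        f"Partial solution (hint):\n{hint}\n\n"
        f"Continue from the hint and complete the solution."
    )


def pass_rate(sample_fn, prompt: str, answer: str, n: int = 8) -> float:
    """Stage-2 filter: estimate the current model's success rate on a prompt."""
    correct = sum(sample_fn(prompt) == answer for _ in range(n))
    return correct / n


def build_training_set(problems, sample_fn, p: float, max_rate: float = 0.5):
    """Keep augmented prompts the current model can sometimes, but rarely, solve."""
    kept = []
    for question, solution, answer in problems:
        prompt = augment_question(question, solution, p)
        rate = pass_rate(sample_fn, prompt, answer)
        if 0.0 < rate <= max_rate:                 # informative, non-saturated signal
            kept.append((prompt, answer))
    return kept


def dummy_model(prompt: str) -> str:
    """Placeholder policy that guesses an answer at random."""
    return random.choice(["4", "5"])


if __name__ == "__main__":
    problems = [("What is 2 + 2?", "We add 2 and 2 to obtain 4.", "4")]
    # Iterative curriculum: train with 50% hints first, then shrink to 25%.
    for p in (0.5, 0.25):
        data = build_training_set(problems, dummy_model, p)
        print(f"p = {p:.2f}: kept {len(data)} augmented prompts")
        # run_rl_training(data)  # e.g., a GRPO step on the augmented prompts
```

In a real pipeline, `dummy_model` would be replaced by sampling from the policy currently being trained, and the commented-out call would be the GRPO update on the augmented prompts; re-running the filter against the current model is what keeps the retained prompts hard but learnable.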
- Mathematical Formulas & Key Details (Theoretical Justification): The paper formalizes why hints improve RL efficiency.
  - Solution Set S(q): the set of all correct solution trajectories τ for a question q:
    $$S(q) = \{\, \tau \in V^{*} \mid R(q, \tau) = 1 \,\}$$
  - Model Capacity Set C(q, δ_p): the smallest set of trajectories that the model μ is most likely to generate, covering at least 1 − δ_p of the total probability mass:
    $$C(q, \delta_p) = \arg\min_{S \subseteq V^{*}} \Big\{\, |S| \;:\; \sum_{\tau \in S} P_{\mu}(q, \tau) \ge 1 - \delta_p \,\Big\}$$
  - The Bottleneck (Theorem 4.4): If the model's capacity set does not overlap with the solution set, i.e., $C(q, \delta_p) \cap S(q) = \emptyset$, the model is highly unlikely to sample a correct solution. Under the assumption that RL algorithms do not update weights without a positive reward, the training process will stall.
  - The Solution (Theorem 4.6): QuestA introduces a hint $h_q$ that decomposes the problem. The model only needs to generate the hint and then the remaining solution separately, each with a reasonable probability $\delta_p'$. The probability of generating the full solution from scratch may be very low (e.g., $(\delta_p')^2$), but with the hint provided, the model only needs to solve the second part. This dramatically increases the chance of receiving a positive reward, so a much smaller sampling budget, $O(1/\delta_p')$ instead of $O(1/\delta_p)$, suffices for learning to take off (see the illustrative display below).
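To spell out the sampling-budget intuition in the same notation, the display below restates it as a simple expected-number-of-trials calculation. This is an illustrative heuristic: the identification of $1/(\delta_p')^2$ with $O(1/\delta_p)$ follows the juxtaposition in the bullet above rather than the formal statements of Theorems 4.4 and 4.6.

```latex
% Illustrative geometric-trial heuristic, not the formal theorem statement.
\[
\text{without hint:}\quad
\Pr[\tau \in S(q)] \approx (\delta_p')^2
\;\Longrightarrow\;
\mathbb{E}[\text{samples until first reward}] \approx \frac{1}{(\delta_p')^2} = O\!\left(\frac{1}{\delta_p}\right),
\]
\[
\text{with hint } h_q:\quad
\Pr[\tau \in S(q) \mid h_q] \approx \delta_p'
\;\Longrightarrow\;
\mathbb{E}[\text{samples until first reward}] \approx \frac{1}{\delta_p'} = O\!\left(\frac{1}{\delta_p'}\right).
\]
```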
5. Experimental Setup
- Datasets:
  - Training: The primary dataset is OpenR1-Math-220K, filtered down to 26K hard problems. For controlled experiments, this set was further divided into Easy Data (the model gets 7-8/8 samples correct) and Hard Data (the model gets 0-1/8 correct).
  - Evaluation: A suite of challenging math competition benchmarks: AIME24, AIME25, HMMT FEB 25, Olympiad Bench, and BRUMO25.
- Evaluation Metrics:
  - The primary metric is pass@1, the probability of getting the correct answer on the first try; results are averaged over 32 samples per problem. pass@k is also used extensively to analyze the diversity and robustness of the model's reasoning, computed with an unbiased estimator to ensure accuracy (see the sketch at the end of this section).
  - Crucially, no hints are provided during evaluation. This tests whether the model has truly learned to solve the problems independently.
- Baselines:
  - The primary base models are Nemotron-1.5B and DeepScaleR-1.5B.
  - Comparisons are made against other strong open-source models of various sizes, including Qwen3-1.7B, Qwen3-8B, and DeepSeek-R1-Distill-32B.
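For reference, here is a short sketch of the standard unbiased pass@k estimator (the combinatorial form popularized by Chen et al., 2021), which is presumably what the summary above means by an unbiased estimator; the function name and the example counts are illustrative.

```python
# Sketch of the standard unbiased pass@k estimator (Chen et al., 2021 form),
# assumed here to be the estimator the paper refers to. For each problem,
# n samples are drawn and c of them are correct; pass@k is averaged over problems.

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    drawn (without replacement) from the n generated samples is correct."""
    if n - c < k:   # fewer than k incorrect samples: any k-draw contains a correct one
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


if __name__ == "__main__":
    # Example in the Avg@32 setting: 32 samples per problem, 4 of them correct.
    print(round(pass_at_k(n=32, c=4, k=1), 4))   # 0.125, i.e. the mean per-sample accuracy
    print(round(pass_at_k(n=32, c=4, k=8), 4))   # chance that at least one of 8 tries is correct
```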
6. Results & Analysis
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces QuestA, a simple, lightweight, and highly effective data augmentation framework for improving LLM reasoning with RL. By scaffolding hard problems with partial-solution hints, QuestA provides denser reward signals, accelerates learning, and avoids common RL pitfalls such as entropy collapse. It sets a new state of the art for 1.5B models in mathematical reasoning and demonstrates that a data-centric approach can be a powerful alternative or complement to algorithmic innovations.
- Limitations & Future Work:
  - The authors themselves suggest that the primary future direction is to generalize QuestA to other complex reasoning domains beyond mathematics, such as competitive programming, software engineering, and other agentic tasks.
  - Designing the optimal "hint" structure and curriculum for these new domains will be an important area of research. For instance, in coding, a hint might be a function signature, a high-level algorithmic sketch, or a key data structure.
- Personal Insights & Critique:
  - The elegance of QuestA lies in its simplicity: it addresses a fundamental bottleneck in RL (sparse rewards) with a straightforward data preprocessing step, making it highly practical and easy to adopt.
  - The paper's strength is its rigorous empirical validation, including controlled experiments on data difficulty, extensive ablations, and analysis across different models and datasets.
  - The theoretical justification, while based on a simplified tabular RL model, provides a clear and compelling intuition for why the method is so effective, connecting the empirical success to core principles of sample efficiency in RL.
  - This work underscores a growing trend in AI research: sometimes the most significant gains come not from more complex models or algorithms, but from being smarter about the data used for training. QuestA is a prime example of effective "data-centric AI."