- Title: Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
- Authors:
- Quy-Anh Dang (VNU University of Science, Vietnam & Knovel Engineering Lab, Singapore)
- Chris Ngo (Knovel Engineering Lab, Singapore)
- Journal/Conference: This paper is a preprint available on arXiv. Preprints are research articles shared before or during the formal peer-review process.
- Publication Year: 2025 (as indicated in the citations); the arXiv preprint reviewed here is dated March 2025.
- Abstract: The paper explores using reinforcement learning (RL) to boost the reasoning skills of small large language models (LLMs) in resource-limited settings. The authors fine-tune a 1.5-billion-parameter model (DeepSeek-R1-Distill-Qwen-1.5B) on just 4 NVIDIA A40 GPUs within 24 hours. They adapt the Group Relative Policy Optimization (GRPO) algorithm and use a small, high-quality math dataset. The results show significant and rapid performance improvements (e.g., AMC23 accuracy rising from 63% to 80%) with a total training cost of only $42. However, they also identify challenges such as training instability and output-length issues over prolonged training. The authors release their code and data to encourage further research into cost-effective, reasoning-capable small LLMs.
- Original Source Link:
2. Executive Summary
This section explains the foundational concepts needed to understand the paper, assuming the reader is a beginner.
4. Methodology (Core Technology & Implementation)
The authors' approach is a two-part strategy: first, create a small but potent dataset, and second, apply an efficient RL algorithm.
- High-Quality Dataset Curation: The goal was to create a compact mathematical-reasoning dataset that minimizes training time. The authors combined and filtered two existing datasets (a minimal sketch of the filtering logic follows this list):
  - s1 Dataset: A broad reasoning dataset, filtered down to math problems by:
    - Keeping only questions whose solutions contain the LaTeX command `\boxed{}`, which typically encloses a final mathematical answer.
    - Using a small model (DeepSeek-R1-Distill-Qwen-1.5B) to filter out problems that were too easy.
    - Using a slightly larger model (Qwen2.5-7B-Instruct) to remove noisy or poorly formatted questions.
    - This produced the open-s1 dataset with 18,615 high-quality math problems.
  - DeepScaleR Dataset: A pre-existing math-focused dataset, refined by:
    - Using a math-specialized model (Qwen2.5-Math-7B-Instruct) to remove easy questions.
    - This produced the open-deepscaler dataset with 21,044 problems.
  - Final Dataset: The two curated datasets were combined into a final training pool of 39,659 questions.
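A minimal sketch of the open-s1 filtering logic described above, assuming each example carries `question`, `solution`, and `answer` fields. The `weak_solver` and `is_well_formed` callables stand in for the DeepSeek-R1-Distill-Qwen-1.5B difficulty filter and the Qwen2.5-7B-Instruct quality judge, and the number of solve attempts is an illustrative choice, not a value reported in the paper.

```python
import re
from typing import Callable, Dict, Iterable, List

BOXED = re.compile(r"\\boxed\{")

def curate_open_s1(
    examples: Iterable[Dict[str, str]],
    weak_solver: Callable[[str], str],      # stand-in for DeepSeek-R1-Distill-Qwen-1.5B
    is_well_formed: Callable[[str], bool],  # stand-in for a Qwen2.5-7B-Instruct judge
    n_attempts: int = 4,                    # illustrative; not specified in the summary
) -> List[Dict[str, str]]:
    """Keep only hard, well-formed math problems whose solution has a \\boxed{} answer."""
    curated = []
    for ex in examples:
        # 1) The reference solution must contain a \boxed{...} final answer.
        if not BOXED.search(ex["solution"]):
            continue
        # 2) Drop problems the small model solves every time (too easy).
        if all(weak_solver(ex["question"]) == ex["answer"] for _ in range(n_attempts)):
            continue
        # 3) Drop noisy or poorly formatted questions.
        if not is_well_formed(ex["question"]):
            continue
        curated.append(ex)
    return curated

# Toy usage with stand-in callables (real filtering would query the LLMs).
toy = [{"question": "What is 1 + 1?", "solution": "\\boxed{2}", "answer": "2"}]
# The weak solver answers incorrectly, so the problem counts as "hard" and is kept.
print(curate_open_s1(toy, weak_solver=lambda q: "3", is_well_formed=lambda q: True))
```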
- Reinforcement Learning Algorithm (GRPO):
  - Principle: The paper uses Group Relative Policy Optimization (GRPO). The core idea is to train the model by comparing a group of its own answers: for a given question, the model generates multiple solutions, the solutions are graded, and the model is trained to increase the probability of generating the high-scoring solutions and decrease the probability of the low-scoring ones. This avoids the need for a separate, costly "critic" model.
  - Steps & Procedures:
    - Sample: For a question $q$ from the dataset, generate a group of $G$ different outputs $\{o_1, o_2, \dots, o_G\}$ using the current policy $\pi_{\theta_{\text{old}}}$.
    - Reward: Calculate a reward $r_i$ for each output $o_i$ using a rule-based system (explained below).
    - Advantage Calculation: For each output, calculate its advantage $A_i$, a normalized score indicating how much better or worse that output was compared to the group average.
    - Optimization: Update the model's parameters $\theta$ to maximize the GRPO objective, which pushes the model to favor outputs with high advantage.
  - Mathematical Formulas & Key Details: The policy $\pi_\theta$ is optimized by maximizing the GRPO objective $\mathcal{J}_{\text{GRPO}}(\theta)$:
    $$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big) \right) \right]$$
    - $\pi_\theta(o_i \mid q)$: the probability of the new policy generating output $o_i$ for question $q$.
    - $\pi_{\theta_{\text{old}}}(o_i \mid q)$: the probability of the old (pre-update) policy generating the same output. The ratio $\pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$ measures how much the policy update favors this output.
    - $A_i$: the advantage of output $o_i$, a normalized reward.
    - $\epsilon$: a small clipping hyperparameter. The $\operatorname{clip}$ function prevents the policy from changing too drastically in a single step, which aids stability.
    - $\beta$: a hyperparameter controlling the strength of the KL-divergence penalty.
    - $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$: the Kullback-Leibler (KL) divergence. This term acts as a regularizer, penalizing the new policy $\pi_\theta$ if it strays too far from a reference policy $\pi_{\text{ref}}$ (usually the original pre-trained model), which prevents the model from "forgetting" its general language abilities.

    The advantage $A_i$ is computed by normalizing the rewards within the group:
    $$A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \dots, r_G\})}{\operatorname{std}(\{r_1, r_2, \dots, r_G\})}$$
    - $r_i$: the raw reward score for output $o_i$.
    - $\operatorname{mean}(\cdot)$: the average reward of the $G$ outputs in the group.
    - $\operatorname{std}(\cdot)$: the standard deviation of the rewards in the group.

    An output therefore receives a positive advantage if its reward is above the group average and a negative one if it is below. A minimal code sketch of this update follows.
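The following is a minimal PyTorch sketch of the GRPO update for a single question, assuming sequence-level log-probabilities for the $G$ sampled outputs are already available. The values of `eps` and `beta` and the choice of KL estimator are illustrative assumptions, not the authors' exact settings.

```python
# Illustrative reconstruction of the GRPO loss for one question, not the authors' code.
import torch

def grpo_loss(logp_new: torch.Tensor,    # log pi_theta(o_i | q), shape (G,)
              logp_old: torch.Tensor,    # log pi_theta_old(o_i | q), shape (G,)
              logp_ref: torch.Tensor,    # log pi_ref(o_i | q), shape (G,)
              rewards: torch.Tensor,     # rule-based rewards r_i, shape (G,)
              eps: float = 0.2,          # illustrative clipping range
              beta: float = 0.04) -> torch.Tensor:  # illustrative KL weight
    # Group-relative advantage: normalize rewards within the group of G outputs.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate objective (PPO-style ratio, but no learned critic).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    surrogate = torch.minimum(unclipped, clipped)

    # KL penalty toward the frozen reference policy (k3-style estimator).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Maximizing the objective is implemented as minimizing its negation.
    return -(surrogate - beta * kl).mean()

# Toy usage with G = 6 sampled outputs for one question.
G = 6
loss = grpo_loss(torch.randn(G, requires_grad=True) - 5.0,
                 torch.randn(G) - 5.0,
                 torch.randn(G) - 5.0,
                 torch.tensor([1., 0., 0., 1., 0., 0.]))
loss.backward()  # computes gradients; an optimizer step would then ascend the objective
```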
- Reward Models: Instead of a complex, neural-network-based reward model, the authors use a simple and efficient rule-based system with three components (a minimal sketch follows this list):
  - Accuracy Reward: A binary reward, equal to 1 if the final answer inside `\boxed{}` is correct and 0 otherwise.
  - Cosine Reward: An enhancement of the accuracy reward that scales the reward with the length of the solution using a cosine schedule. This incentivizes the model to find shorter correct solutions and penalizes long incorrect solutions less severely than short incorrect ones.
  - Format Reward: A small positive reward given when the model correctly structures its reasoning within <think> and </think> tags, promoting organized outputs.
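A minimal sketch of the three rule-based reward components, assuming answers are compared by matching the last `\boxed{...}` expression and that length is measured in characters for simplicity. The reward magnitudes and cosine bounds are illustrative, not the paper's exact values.

```python
import math
import re

def accuracy_reward(completion: str, gold: str) -> float:
    """Binary reward: 1 if the last \\boxed{...} answer matches the gold answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == gold.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for wrapping the reasoning in <think> ... </think> tags."""
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def cosine_reward(is_correct: bool, length: int, max_len: int = 3584,
                  correct_range=(2.0, 1.0), wrong_range=(-1.0, 0.0)) -> float:
    """Length-aware reward: short correct solutions score highest, while long
    incorrect solutions are penalized less severely than short incorrect ones."""
    start, end = correct_range if is_correct else wrong_range
    t = min(length, max_len) / max_len          # normalized length in [0, 1]
    # Cosine schedule from `start` (length 0) to `end` (length max_len).
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))

# Toy usage (Experiments 1-2 used accuracy + format; Experiment 3 swapped in cosine).
completion = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
correct = accuracy_reward(completion, "4") == 1.0
print(cosine_reward(correct, len(completion)) + format_reward(completion))
```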
5. Experimental Setup
- Base Model: DeepSeek-R1-Distill-Qwen-1.5B, a 1.5-billion-parameter model. The authors notably skipped supervised fine-tuning (SFT), starting RL directly.
- Hardware & Constraints: Training ran on 4 NVIDIA A40 GPUs (48 GB VRAM each) and was limited to a 24-hour window. The model generated 6 outputs per prompt (G = 6), with a maximum completion length of 4096 tokens (Experiment 1) or 3584 tokens (Experiments 2 and 3).
- Benchmark Datasets:
  - AIME24: 30 challenging problems from the 2024 American Invitational Mathematics Examination.
  - AMC23: 40 problems from the 2023 American Mathematics Competition.
  - MATH-500: A 500-problem subset of the comprehensive MATH benchmark.
  - Minerva: A benchmark testing quantitative reasoning across multiple scientific fields.
  - OlympiadBench: A benchmark of extremely difficult, Olympiad-level problems.
- Evaluation Metric:
- Conceptual Definition: The primary metric is zero-shot pass@1. This measures the model's ability to solve a new problem correctly on its first attempt without seeing any examples. It's a strong test of a model's intrinsic reasoning power.
  - Mathematical Formula (a toy computation follows this list):
    $$\text{pass@1} = \frac{\text{Number of problems solved correctly on the first attempt}}{\text{Total number of problems in the test set}}$$
  - Symbol Explanation:
    - Number of problems solved correctly on the first attempt: a count of how many test questions the model answered correctly.
    - Total number of problems in the test set: the total size of the benchmark dataset.
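A toy computation of the pass@1 formula above; the per-problem correctness flags would come from comparing each model's single attempt against the gold answer (this helper is illustrative, not the paper's evaluation harness).

```python
def pass_at_1(first_attempt_correct: list) -> float:
    """Zero-shot pass@1: fraction of problems solved on the single first attempt."""
    return sum(bool(c) for c in first_attempt_correct) / len(first_attempt_correct)

# Toy usage: 32 of 40 AMC23-sized problems solved on the first try -> 0.8 (80%).
print(pass_at_1([True] * 32 + [False] * 8))
```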
- Baselines: The authors compared their Open-RS models against a diverse set of baselines, including:
  - General Models: Llama-3.1-70B-Instruct, o1-preview.
  - 7B Models: Qwen-2.5-Math-7B-Instruct, rStar-Math-7B, Eurus-2-7B-PRIME, Qwen2.5-7B-SimpleRL.
  - 1.5B Models: the base model DeepSeek-R1-Distill-Qwen-1.5B, Still-3-1.5B-Preview, DeepScaleR-1.5B-Preview.
6. Results & Analysis
The paper presents three experiments to understand how the small model behaves under different RL training conditions.
Figure 2: Model performance on AMC-2023 and MATH-500 as a function of training steps. Left: AMC-2023 accuracy fluctuates widely, with the red dashed line marking the pre-training baseline. Right: MATH-500 accuracy stays mostly above the baseline but declines in later training.
3.5.1 Experiment 1: Impact of High-Quality Data
- Setup: Trained on the open-s1 dataset (18,615 samples) with accuracy and format rewards. Maximum completion length was 4096 tokens.
- Results: As seen in Figure 2 (blue line), the model showed rapid initial improvement. On AMC23, accuracy jumped from 63% to 70% within 100 steps. After 200 steps, however, performance crashed dramatically.
- Analysis & Insight 1: Small LLMs can achieve rapid reasoning improvements with limited high-quality data, but performance degrades with prolonged training under strict length constraints. Figure 3 below reveals why: the model's completion length (right chart) initially fluctuates near the 4096-token limit, and the authors suggest the model often failed to finish its reasoning before hitting that limit. After 200 steps, the model started generating unreadable, non-English outputs, indicating reward misalignment or optimization instability.
Figure 3: Accuracy reward (left) and completion length (right) of Experiment 1 outputs versus local training steps. The accuracy reward fluctuates strongly, while the completion length drops sharply mid-training before recovering. Global steps are distributed across 4 GPUs; 100 global steps correspond to roughly 3,000 local steps.
3.5.2 Experiment 2: Balancing Easy and Hard Problems
- Setup: To combat the length issue, this experiment used a smaller, mixed-difficulty dataset (7,000 samples) and reduced the maximum completion length to 3584 tokens.
- Results: This strategy worked much better initially. Figure 2 (orange line) shows a significant performance boost: AMC23 accuracy soared from 63% to 80% within 50-100 steps. However, instability returned after 150-200 steps.
- Analysis & Insight 2: Incorporating a mix of easy and hard problems under reduced length constraints enhances early performance and stabilizes reasoning behavior, though long-term stability remains elusive. As shown in Figure 4, the KL divergence (left chart) became highly unstable after roughly 4,000 local steps, coinciding with the performance drop and the re-emergence of mixed-language outputs. The easier problems likely taught the model to be more concise, but the underlying instability persisted.
Figure 4: KL divergence (left) and completion length (right) of Experiment 2 outputs versus local training steps. The KL divergence rises rapidly and fluctuates after about 4,000 steps, while the completion length varies widely but stays within a bounded range.
3.5.3 Experiment 3: Controlling Length with Cosine Reward
- Setup: Used the same 7,000-sample dataset as Experiment 2 but replaced the plain accuracy reward with the cosine reward to explicitly incentivize shorter solutions. The prompt instruction "Reply in English only" was also added.
- Results: Performance gains were more modest than in Experiment 2 (e.g., AMC23 rose to 72.5%; see the green line in Figure 2), but training was more stable.
- Analysis & Insight 3: Cosine rewards stabilize completion lengths and improve training consistency, but extending length limits is necessary for extremely hard tasks, particularly with multilingual base models. Figure 5 shows that the completion length (right chart) became more stable, staying well below the 3584-token limit, which confirms the cosine reward's effectiveness. However, the mixed-language issue persisted, suggesting that a simple prompt instruction is not enough to constrain a multilingual base model.
Figure 5: KL divergence (left) and generated-text length (right) for Experiment 3 across local training steps. The KL divergence rises sharply after about 4,000 steps, while generation length fluctuates and trends slightly downward overall.
3.5.4 Overall Comparison
The best checkpoints from each experiment (Open-RS1, Open-RS2, Open-RS3) were evaluated against the baselines.
The following is a transcription of Table 1 from the paper.
| Model | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| **General Models** | | | | | | |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| o1-preview | 44.6 | 85.5 | - | - | - | - |
| **7B Models** | | | | | | |
| Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| **1.5B Models** | | | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-3-1.5B-Preview | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| **Our Models** | | | | | | |
| Open-RS1 (100 steps) | 30.0 | 83.8 | 70.0 | 29.0 | 52.4 | 53.0 |
| Open-RS2 (50 steps) | 30.0 | 85.4 | 80.0 | 30.5 | 52.4 | 55.7 |
| Open-RS3 (50 steps) | 46.7 | 84.4 | 72.5 | 26.8 | 51.3 | 56.3 |
- Key Result: The Open-RS models are highly competitive. Open-RS3 achieves the highest AIME24 score (46.7%), outperforming not only all other 1.5B models but also o1-preview (44.6%). Open-RS2 leads the AMC23 benchmark with 80.0%.
- Cost and Data Efficiency: This is the most striking part of the paper.
Figure 1: Zero-shot pass@1 versus model size and training cost. Left: the Open-RS model reaches 46.7% accuracy on AIME24, surpassing the other models shown. Right: its training cost is only about $42, far lower than the alternatives.
Figure 1 and the tables below (transcribed from Tables 2 and 3) highlight the massive efficiency gains.
The following is a transcription of Table 2.
| | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL | Open-RS |
| --- | --- | --- | --- | --- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT Data | 7.3M | 230k | 0 | 0 |
| RM Data | 7k | 0 | 0 | 0 |
| RM | None | Eurus-2-7B-SFT | None | None |
| RL Data | 3.647M × 16 | 150k × 4 | 8k × 8 | 7k × 6 |
| Hardware | 10x 8 H100 80GB, 15x 4 A100 40GB | 1x 8 A100 80GB | 4x 6 A100 80GB | 1x 4 A40 48GB |
| Time | - | 72h | 36h | 24h |
| Cost Est. | - | $1,088 | $1,633 | $42 |
The following is a transcription of Table 3.
| | DeepScaleR-1.5B-Preview | Still-3-1.5B-Preview | Open-RS |
| --- | --- | --- | --- |
| Base Model | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT Data | 0 | 0 | 0 |
| RM Data | 0 | 0 | 0 |
| RM | None | None | None |
| RL Data | 40k × 16 | 30k × 8 | 7k × 6 |
| Hardware | 8x A100 80GB | 1x 8 A100 80GB | 1x 4 A40 48GB |
| Time | 240h | 150h | 24h |
| Cost Est. | $3,629 | $2,268 | $42 |
The Open-RS models were trained for just **$42**, whereas the comparable 1.5B model DeepScaleR-1.5B-Preview cost **$3,629** and the 7B models cost over $1,000. This demonstrates that small LLMs can achieve elite reasoning performance with minimal data and cost.
7. Conclusion & Reflections
- Conclusion Summary: The study demonstrates that the reasoning capabilities of a small (1.5B) LLM can be significantly enhanced with reinforcement learning under severe resource constraints. By carefully curating a compact dataset and adapting the GRPO algorithm, the authors achieved results competitive with, and in some cases superior to, much larger and more expensive models. The work provides a practical framework for developing lightweight, efficient, and powerful reasoning models, while identifying challenges such as optimization instability that need further research.
- Limitations & Future Work:
- Limitations:
- Training Time: The 24-hour training limit prevented full exploration of the model's long-term behavior and a complete run through the dataset.
- Completion Length: The maximum token length was insufficient for some very complex problems, forcing the model to truncate its thoughts.
- Multilingual Drift: The base model was multilingual, and it started generating non-English text after extended training, a problem not fully solved by simple prompting.
- Domain Specificity: The evaluation was focused exclusively on mathematical reasoning.
- Future Work:
- Explore longer training times and dynamic length schedules.
- Incorporate a lightweight language-identification reward or use a monolingual base model to prevent language drift.
- Evaluate the approach on other reasoning domains like science and coding.
- Combine GRPO with search algorithms (like Monte Carlo Tree Search) to potentially improve reasoning depth without a huge resource cost.
- Personal Insights & Critique:
- This paper is a prime example of "doing more with less" and is incredibly valuable for the AI community. The $42 price tag for achieving SOTA-level reasoning is the headline result and a powerful statement against the "bigger is always better" narrative.
- The "fast gains, then collapse" training dynamic is a fascinating and common pattern in RL. It suggests the model is learning a "brittle" policy or a shallow heuristic that is effective early on but doesn't generalize, leading to over-optimization and collapse. The paper's insights on data mixing and reward shaping are practical first steps to combat this.
- The choice of a rule-based reward system is clever and pragmatic. While less flexible than a neural reward model, it completely removes a major computational bottleneck, which was central to the paper's goal.
- The multilingual drift issue is an important practical finding. It shows that the properties of the base model can have unexpected and hard-to-control side effects during fine-tuning, serving as a cautionary tale for practitioners.
- Overall, this is a high-impact paper that not only delivers impressive results but also provides a clear, reproducible, and open-source guide for others to build upon. It genuinely helps democratize access to advanced AI capabilities.