- Title: Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
- Authors:
- Quy-Anh Dang (VNU University of Science, Vietnam & Knovel Engineering Lab, Singapore)
- Chris Ngo (Knovel Engineering Lab, Singapore)
- Journal/Conference: This paper is a preprint available on arXiv. Preprints are research articles shared before or during the formal peer-review process.
- Publication Year: 2025 (as indicated in the citations); the arXiv preprint reviewed here is dated March 2025.
- Abstract: The paper explores using reinforcement learning (RL) to boost the reasoning skills of small large language models (LLMs) in resource-limited settings. The authors fine-tune a 1.5-billion-parameter model (DeepSeek-R1-Distill-Qwen-1.5B) on just 4 NVIDIA A40 GPUs within 24 hours. They adapt the Group Relative Policy Optimization (GRPO) algorithm and use a small, high-quality math dataset. The results show significant and rapid performance improvements (e.g., AMC23 accuracy rising from 63% to 80%) with a total training cost of only $42. However, they also identify challenges such as training instability and output-length issues over prolonged training. The authors release their code and data to encourage further research into cost-effective, reasoning-capable small LLMs.
- Original Source Link:
2. Executive Summary
This section explains the foundational concepts needed to understand the paper, assuming the reader is a beginner.
4. Methodology (Core Technology & Implementation)
The authors' approach is a two-part strategy: first, create a small but potent dataset, and second, apply an efficient RL algorithm.
- High-Quality Dataset Curation: The goal was to create a compact mathematical-reasoning dataset that minimizes training time. The authors combined and filtered two existing datasets (a minimal sketch of the filtering logic follows this list):
  - s1 Dataset: A broad reasoning dataset, filtered down to math problems by:
    - Keeping only questions whose solutions contain the LaTeX command `\boxed{}`, which typically encloses a final mathematical answer.
    - Using a small model (DeepSeek-R1-Distill-Qwen-1.5B) to filter out problems that were too easy.
    - Using a slightly larger model (Qwen2.5-7B-Instruct) to remove noisy or poorly formatted questions.
    - This produced the open-s1 dataset with 18,615 high-quality math problems.
  - DeepScaleR Dataset: A pre-existing math-focused dataset, refined by:
    - Using a math-specialized model (Qwen2.5-Math-7B-Instruct) to remove easy questions.
    - This produced the open-deepscaler dataset with 21,044 problems.
  - Final Dataset: The two curated datasets were combined into a final training pool of 39,659 questions.
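A minimal sketch of the open-s1 filtering logic described above, assuming each example carries `question`, `solution`, and `answer` fields. The `weak_solver` and `is_well_formed` callables stand in for the DeepSeek-R1-Distill-Qwen-1.5B difficulty filter and the Qwen2.5-7B-Instruct quality judge, and the number of solve attempts is an illustrative choice, not a value reported in the paper.

```python
import re
from typing import Callable, Dict, Iterable, List

BOXED = re.compile(r"\\boxed\{")

def curate_open_s1(
    examples: Iterable[Dict[str, str]],
    weak_solver: Callable[[str], str],      # stand-in for DeepSeek-R1-Distill-Qwen-1.5B
    is_well_formed: Callable[[str], bool],  # stand-in for a Qwen2.5-7B-Instruct judge
    n_attempts: int = 4,                    # illustrative; not specified in the summary
) -> List[Dict[str, str]]:
    """Keep only hard, well-formed math problems whose solution has a \\boxed{} answer."""
    curated = []
    for ex in examples:
        # 1) The reference solution must contain a \boxed{...} final answer.
        if not BOXED.search(ex["solution"]):
            continue
        # 2) Drop problems the small model solves every time (too easy).
        if all(weak_solver(ex["question"]) == ex["answer"] for _ in range(n_attempts)):
            continue
        # 3) Drop noisy or poorly formatted questions.
        if not is_well_formed(ex["question"]):
            continue
        curated.append(ex)
    return curated

# Toy usage with stand-in callables (real filtering would query the LLMs).
toy = [{"question": "What is 1 + 1?", "solution": "\\boxed{2}", "answer": "2"}]
# The weak solver answers incorrectly, so the problem counts as "hard" and is kept.
print(curate_open_s1(toy, weak_solver=lambda q: "3", is_well_formed=lambda q: True))
```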
- Reinforcement Learning Algorithm (GRPO):
  - Principle: The paper uses Group Relative Policy Optimization (GRPO). The core idea is to train the model by comparing a group of its own answers: for a given question, the model generates multiple solutions, the solutions are graded, and the model is trained to increase the probability of generating the high-scoring solutions and decrease the probability of the low-scoring ones. This avoids the need for a separate, costly "critic" model.
  - Steps & Procedures:
    - Sample: For a question $q$ from the dataset, generate a group of $G$ different outputs $\{o_1, o_2, \dots, o_G\}$ using the current policy $\pi_{\theta_{\text{old}}}$.
    - Reward: Calculate a reward $r_i$ for each output $o_i$ using a rule-based system (explained below).
    - Advantage Calculation: For each output, calculate its advantage $A_i$, a normalized score indicating how much better or worse that output was compared to the group average.
    - Optimization: Update the model's parameters $\theta$ to maximize the GRPO objective, which pushes the model to favor outputs with high advantage.
  - Mathematical Formulas & Key Details: The policy $\pi_\theta$ is optimized by maximizing the GRPO objective $\mathcal{J}_{\text{GRPO}}(\theta)$:
    $$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big) \right) \right]$$
    - $\pi_\theta(o_i \mid q)$: the probability of the new policy generating output $o_i$ for question $q$.
    - $\pi_{\theta_{\text{old}}}(o_i \mid q)$: the probability of the old (pre-update) policy generating the same output. The ratio $\pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$ measures how much the policy update favors this output.
    - $A_i$: the advantage of output $o_i$, a normalized reward.
    - $\epsilon$: a small clipping hyperparameter. The $\operatorname{clip}$ function prevents the policy from changing too drastically in a single step, which aids stability.
    - $\beta$: a hyperparameter controlling the strength of the KL-divergence penalty.
    - $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$: the Kullback-Leibler (KL) divergence. This term acts as a regularizer, penalizing the new policy $\pi_\theta$ if it strays too far from a reference policy $\pi_{\text{ref}}$ (usually the original pre-trained model), which prevents the model from "forgetting" its general language abilities.

    The advantage $A_i$ is computed by normalizing the rewards within the group:
    $$A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \dots, r_G\})}{\operatorname{std}(\{r_1, r_2, \dots, r_G\})}$$
    - $r_i$: the raw reward score for output $o_i$.
    - $\operatorname{mean}(\cdot)$: the average reward of the $G$ outputs in the group.
    - $\operatorname{std}(\cdot)$: the standard deviation of the rewards in the group.

    An output therefore receives a positive advantage if its reward is above the group average and a negative one if it is below. A minimal code sketch of this update follows.
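The following is a minimal PyTorch sketch of the GRPO update for a single question, assuming sequence-level log-probabilities for the $G$ sampled outputs are already available. The values of `eps` and `beta` and the choice of KL estimator are illustrative assumptions, not the authors' exact settings.

```python
# Illustrative reconstruction of the GRPO loss for one question, not the authors' code.
import torch

def grpo_loss(logp_new: torch.Tensor,    # log pi_theta(o_i | q), shape (G,)
              logp_old: torch.Tensor,    # log pi_theta_old(o_i | q), shape (G,)
              logp_ref: torch.Tensor,    # log pi_ref(o_i | q), shape (G,)
              rewards: torch.Tensor,     # rule-based rewards r_i, shape (G,)
              eps: float = 0.2,          # illustrative clipping range
              beta: float = 0.04) -> torch.Tensor:  # illustrative KL weight
    # Group-relative advantage: normalize rewards within the group of G outputs.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate objective (PPO-style ratio, but no learned critic).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    surrogate = torch.minimum(unclipped, clipped)

    # KL penalty toward the frozen reference policy (k3-style estimator).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Maximizing the objective is implemented as minimizing its negation.
    return -(surrogate - beta * kl).mean()

# Toy usage with G = 6 sampled outputs for one question.
G = 6
loss = grpo_loss(torch.randn(G, requires_grad=True) - 5.0,
                 torch.randn(G) - 5.0,
                 torch.randn(G) - 5.0,
                 torch.tensor([1., 0., 0., 1., 0., 0.]))
loss.backward()  # computes gradients; an optimizer step would then ascend the objective
```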
- Reward Models: Instead of a complex, neural-network-based reward model, the authors use a simple and efficient rule-based system with three components (a minimal sketch follows this list):
  - Accuracy Reward: A binary reward, equal to 1 if the final answer inside `\boxed{}` is correct and 0 otherwise.
  - Cosine Reward: An enhancement of the accuracy reward that scales the reward with the length of the solution using a cosine schedule. This incentivizes the model to find shorter correct solutions and penalizes long incorrect solutions less severely than short incorrect ones.
  - Format Reward: A small positive reward given when the model correctly structures its reasoning within <think> and </think> tags, promoting organized outputs.
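A minimal sketch of the three rule-based reward components, assuming answers are compared by matching the last `\boxed{...}` expression and that length is measured in characters for simplicity. The reward magnitudes and cosine bounds are illustrative, not the paper's exact values.

```python
import math
import re

def accuracy_reward(completion: str, gold: str) -> float:
    """Binary reward: 1 if the last \\boxed{...} answer matches the gold answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == gold.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for wrapping the reasoning in <think> ... </think> tags."""
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def cosine_reward(is_correct: bool, length: int, max_len: int = 3584,
                  correct_range=(2.0, 1.0), wrong_range=(-1.0, 0.0)) -> float:
    """Length-aware reward: short correct solutions score highest, while long
    incorrect solutions are penalized less severely than short incorrect ones."""
    start, end = correct_range if is_correct else wrong_range
    t = min(length, max_len) / max_len          # normalized length in [0, 1]
    # Cosine schedule from `start` (length 0) to `end` (length max_len).
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))

# Toy usage (Experiments 1-2 used accuracy + format; Experiment 3 swapped in cosine).
completion = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
correct = accuracy_reward(completion, "4") == 1.0
print(cosine_reward(correct, len(completion)) + format_reward(completion))
```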
5. Experimental Setup
- Base Model: DeepSeek-R1-Distill-Qwen-1.5B, a 1.5-billion-parameter model. The authors notably skipped supervised fine-tuning (SFT), starting RL directly.
- Hardware & Constraints: Training ran on 4 NVIDIA A40 GPUs (48 GB VRAM each) and was limited to a 24-hour window. The model generated 6 outputs per prompt (G = 6), with a maximum completion length of 4096 tokens (Experiment 1) or 3584 tokens (Experiments 2 and 3).
- Benchmark Datasets:
  - AIME24: 30 challenging problems from the 2024 American Invitational Mathematics Examination.
  - AMC23: 40 problems from the 2023 American Mathematics Competition.
  - MATH-500: A 500-problem subset of the comprehensive MATH benchmark.
  - Minerva: A benchmark testing quantitative reasoning across multiple scientific fields.
  - OlympiadBench: A benchmark of extremely difficult, Olympiad-level problems.
- Evaluation Metric:
- Conceptual Definition: The primary metric is zero-shot pass@1. This measures the model's ability to solve a new problem correctly on its first attempt without seeing any examples. It's a strong test of a model's intrinsic reasoning power.
  - Mathematical Formula (a toy computation follows this list):
    $$\text{pass@1} = \frac{\text{Number of problems solved correctly on the first attempt}}{\text{Total number of problems in the test set}}$$
  - Symbol Explanation:
    - Number of problems solved correctly on the first attempt: a count of how many test questions the model answered correctly.
    - Total number of problems in the test set: the total size of the benchmark dataset.
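A toy computation of the pass@1 formula above; the per-problem correctness flags would come from comparing each model's single attempt against the gold answer (this helper is illustrative, not the paper's evaluation harness).

```python
def pass_at_1(first_attempt_correct: list) -> float:
    """Zero-shot pass@1: fraction of problems solved on the single first attempt."""
    return sum(bool(c) for c in first_attempt_correct) / len(first_attempt_correct)

# Toy usage: 32 of 40 AMC23-sized problems solved on the first try -> 0.8 (80%).
print(pass_at_1([True] * 32 + [False] * 8))
```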
- Baselines: The authors compared their Open-RS models against a diverse set of baselines, including:
  - General Models: Llama-3.1-70B-Instruct, o1-preview.
  - 7B Models: Qwen-2.5-Math-7B-Instruct, rStar-Math-7B, Eurus-2-7B-PRIME, Qwen2.5-7B-SimpleRL.
  - 1.5B Models: the base model DeepSeek-R1-Distill-Qwen-1.5B, Still-3-1.5B-Preview, DeepScaleR-1.5B-Preview.
6. Results & Analysis
The paper presents three experiments to understand how the small model behaves under different RL training conditions.
Figure 2: Model performance on AMC-2023 and MATH-500 as a function of training steps. Left: AMC-2023 accuracy fluctuates widely, with the red dashed line marking the pre-training baseline. Right: MATH-500 accuracy stays mostly above the baseline but declines in later training.
3.5.1 Experiment 1: Impact of High-Quality Data
- Setup: Trained on the open-s1 dataset (18,615 samples) with accuracy and format rewards. Maximum completion length was 4096 tokens.
- Results: As seen in Figure 2 (blue line), the model showed rapid initial improvement. On AMC23, accuracy jumped from 63% to 70% within 100 steps. After 200 steps, however, performance crashed dramatically.
- Analysis & Insight 1: Small LLMs can achieve rapid reasoning improvements with limited high-quality data, but performance degrades with prolonged training under strict length constraints. Figure 3 below reveals why: the model's completion length (right chart) initially fluctuates near the 4096-token limit, and the authors suggest the model often failed to finish its reasoning before hitting that limit. After 200 steps, the model started generating unreadable, non-English outputs, indicating reward misalignment or optimization instability.
Figure 3: Accuracy reward (left) and completion length (right) of Experiment 1 outputs versus local training steps. The accuracy reward fluctuates strongly, while the completion length drops sharply mid-training before recovering. Global steps are distributed across 4 GPUs; 100 global steps correspond to roughly 3,000 local steps.
3.5.2 Experiment 2: Balancing Easy and Hard Problems
- Setup: To combat the length issue, this experiment used a smaller, mixed-difficulty dataset (7,000 samples) and reduced the maximum completion length to 3584 tokens.
- Results: This strategy worked much better initially. Figure 2 (orange line) shows a significant performance boost: AMC23 accuracy soared from 63% to 80% within 50-100 steps. However, instability returned after 150-200 steps.
- Analysis & Insight 2: Incorporating a mix of easy and hard problems under reduced length constraints enhances early performance and stabilizes reasoning behavior, though long-term stability remains elusive. As shown in Figure 4, the KL divergence (left chart) became highly unstable after roughly 4,000 local steps, coinciding with the performance drop and the re-emergence of mixed-language outputs. The easier problems likely taught the model to be more concise, but the underlying instability persisted.
Figure 4: KL divergence (left) and completion length (right) of Experiment 2 outputs versus local training steps. The KL divergence rises rapidly and fluctuates after about 4,000 steps, while the completion length varies widely but stays within a bounded range.
3.5.3 Experiment 3: Controlling Length with Cosine Reward
- Setup: Used the same 7,000-sample dataset as Experiment 2 but replaced the plain accuracy reward with the cosine reward to explicitly incentivize shorter solutions. The prompt instruction "Reply in English only" was also added.
- Results: Performance gains were more modest than in Experiment 2 (e.g., AMC23 rose to 72.5%; see the green line in Figure 2), but training was more stable.
- Analysis & Insight 3: Cosine rewards stabilize completion lengths and improve training consistency, but extending length limits is necessary for extremely hard tasks, particularly with multilingual base models. Figure 5 shows that the completion length (right chart) became more stable, staying well below the 3584-token limit, which confirms the cosine reward's effectiveness. However, the mixed-language issue persisted, suggesting that a simple prompt instruction is not enough to constrain a multilingual base model.
Figure 5: KL divergence (left) and generated-text length (right) for Experiment 3 across local training steps. The KL divergence rises sharply after about 4,000 steps, while generation length fluctuates and trends slightly downward overall.
3.5.4 Overall Comparison
The best checkpoints from each experiment (Open-RS1, Open-RS2, Open-RS3) were evaluated against the baselines.
The following is a transcription of Table 1 from the paper.
| Model | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| **General Models** | | | | | | |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| o1-preview | 44.6 | 85.5 | - | - | - | - |
| **7B Models** | | | | | | |
| Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| **1.5B Models** | | | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-3-1.5B-Preview | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| **Our Models** | | | | | | |
| Open-RS1 (100 steps) | 30.0 | 83.8 | 70.0 | 29.0 | 52.4 | 53.0 |
| Open-RS2 (50 steps) | 30.0 | 85.4 | 80.0 | 30.5 | 52.4 | 55.7 |
| Open-RS3 (50 steps) | 46.7 | 84.4 | 72.5 | 26.8 | 51.3 | 56.3 |
- Key Result: The Open-RS models are highly competitive. Open-RS3 achieves the highest AIME24 score (46.7%), outperforming not only all other 1.5B models but also o1-preview (44.6%). Open-RS2 leads the AMC23 benchmark with 80.0%.
- Cost and Data Efficiency: This is the most striking part of the paper.
Figure 1: Zero-shot pass@1 versus model size and training cost. Left: the Open-RS model reaches 46.7% accuracy on AIME24, surpassing the other models shown. Right: its training cost is only about $42, far lower than the alternatives.
Figure 1 and the tables below (transcribed from Tables 2 and 3) highlight the massive efficiency gains.
The following is a transcription of Table 2.
| | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL | Open-RS |
| --- | --- | --- | --- | --- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT Data | 7.3M | 230k | 0 | 0 |
| RM Data | 7k | 0 | 0 | 0 |
| RM | None | Eurus-2-7B-SFT | None | None |
| RL Data | 3.647M × 16 | 150k × 4 | 8k × 8 | 7k × 6 |
| Hardware | 10x 8 H100 80GB, 15x 4 A100 40GB | 1x 8 A100 80GB | 4x 6 A100 80GB | 1x 4 A40 48GB |
| Time | - | 72h | 36h | 24h |
| Cost Est. | - | $1,088 | $1,633 | $42 |
The following is a transcription of Table 3.
| | DeepScaleR-1.5B-Preview | Still-3-1.5B-Preview | Open-RS |
| --- | --- | --- | --- |
| Base Model | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT Data | 0 | 0 | 0 |
| RM Data | 0 | 0 | 0 |
| RM | None | None | None |
| RL Data | 40k × 16 | 30k × 8 | 7k × 6 |
| Hardware | 8x A100 80GB | 1x 8 A100 80GB | 1x 4 A40 48GB |
| Time | 240h | 150h | 24h |
| Cost Est. | $3,629 | $2,268 | $42 |
The Open-RS models were trained for just **$42**, whereas the comparable 1.5B model DeepScaleR-1.5B-Preview cost **$3,629** and the 7B models cost over $1,000. This demonstrates that small LLMs can achieve elite reasoning performance with minimal data and cost.
7. Conclusion & Reflections
- Conclusion Summary: The study demonstrates that the reasoning capabilities of a small (1.5B) LLM can be significantly enhanced with reinforcement learning under severe resource constraints. By carefully curating a compact dataset and adapting the GRPO algorithm, the authors achieved results competitive with, and in some cases superior to, much larger and more expensive models. The work provides a practical framework for developing lightweight, efficient, and powerful reasoning models, while identifying challenges such as optimization instability that need further research.
- Limitations & Future Work:
- Limitations:
- Training Time: The 24-hour training limit prevented full exploration of the model's long-term behavior and a complete run through the dataset.
- Completion Length: The maximum token length was insufficient for some very complex problems, forcing the model to truncate its thoughts.
- Multilingual Drift: The base model was multilingual, and it started generating non-English text after extended training, a problem not fully solved by simple prompting.
- Domain Specificity: The evaluation was focused exclusively on mathematical reasoning.
- Future Work:
- Explore longer training times and dynamic length schedules.
- Incorporate a lightweight language-identification reward or use a monolingual base model to prevent language drift.
- Evaluate the approach on other reasoning domains like science and coding.
- Combine GRPO with search algorithms (like Monte Carlo Tree Search) to potentially improve reasoning depth without a huge resource cost.
- Personal Insights & Critique:
- This paper is a prime example of "doing more with less" and is incredibly valuable for the AI community. The $42 price tag for achieving SOTA-level reasoning is the headline result and a powerful statement against the "bigger is always better" narrative.
- The "fast gains, then collapse" training dynamic is a fascinating and common pattern in RL. It suggests the model is learning a "brittle" policy or a shallow heuristic that is effective early on but doesn't generalize, leading to over-optimization and collapse. The paper's insights on data mixing and reward shaping are practical first steps to combat this.
- The choice of a rule-based reward system is clever and pragmatic. While less flexible than a neural reward model, it completely removes a major computational bottleneck, which was central to the paper's goal.
- The multilingual drift issue is an important practical finding. It shows that the properties of the base model can have unexpected and hard-to-control side effects during fine-tuning, serving as a cautionary tale for practitioners.
- Overall, this is a high-impact paper that not only delivers impressive results but also provides a clear, reproducible, and open-source guide for others to build upon. It genuinely helps democratize access to advanced AI capabilities.