- Title: Training-Free Group Relative Policy Optimization
- Authors: Youtu-Agent Team (Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun)
- Affiliations: Tencent Youtu Lab, Fudan University, Xiamen University
- Journal/Conference: This paper is a preprint submitted to arXiv. It has not yet undergone formal peer review at a major conference or journal.
- Publication Year: The preprint is dated October 9, 2025.
- Abstract: The paper addresses the performance degradation of Large Language Model (LLM) agents in specialized domains that require external tools and specific prompting. While existing methods like agentic Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) can adapt agents, they require costly parameter updates (fine-tuning). The authors argue that a similar effect can be achieved without training by learning "experiential knowledge" as a token prior. They propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a method that avoids all parameter updates. Instead of using numerical advantages to update model weights, it distills a "semantic advantage" from groups of agent outputs (rollouts) to iteratively build a high-quality library of experiential knowledge. This knowledge is then provided in the context of future LLM API calls to guide their behavior. Experiments on mathematical reasoning and web searching tasks show that their method, applied to a powerful frozen model (DeepSeek-V3.1-Terminus), significantly improves out-of-domain performance using only a few dozen training samples and at a fraction of the cost of traditional fine-tuning.
- Original Source Link:
2. Executive Summary
- Foundational Concepts:
- Large Language Model (LLM) Agent: An LLM that can do more than just generate text. It can interact with an environment by using external tools (e.g., running code, searching the web, using APIs). A common framework for this is ReAct (Reasoning and Acting), where the LLM alternates between thinking about a problem and taking an action.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It learns from trial and error.
- Policy Optimization: In RL, the "policy" is the agent's strategy for choosing actions. Policy optimization algorithms aim to improve this policy to achieve higher rewards. A well-known algorithm is Proximal Policy Optimization (PPO).
- Group Relative Policy Optimization (GRPO): An advanced policy optimization method. Instead of using a separate "critic" model to estimate the value of an action, GRPO generates a group of several possible outputs for a given prompt. It then compares the rewards of these outputs to each other to determine which ones are better or worse than the group average. This "relative advantage" is then used to update the model's parameters to favor better outputs in the future (a minimal numerical sketch follows this list).
- In-Context Learning (ICL): The remarkable ability of modern LLMs to learn a new task or adapt their behavior based solely on examples and instructions provided in the input prompt (the "context"), without any changes to the underlying weights. Training-Free GRPO heavily relies on this capability.
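For reference, here is a minimal sketch (not taken from the paper) of the numerical group-relative advantage that vanilla GRPO computes: each rollout's reward is normalized against the group's mean and standard deviation. Whether the sample or population standard deviation is used is an implementation detail here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward against the group mean and standard
    deviation; this is the numerical advantage signal that vanilla GRPO
    feeds into its gradient update."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All rewards identical: no relative signal within this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 rollouts with binary rewards (1.0 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Training-Free GRPO replaces this numerical quantity with a natural-language "semantic advantage", as described in the Methodology section below.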
- Previous Works:
- Agentic RL (Parameter Tuning): Methods like ReTool, AFM, ZeroTIR, and SimpleTIR use RL algorithms (such as PPO or GRPO) to fine-tune an LLM's parameters. They train the model on thousands of examples of successful task completion, teaching it to use tools more effectively. Their main drawbacks are the high compute cost and large data requirements.
- Training-Free Methods (Inference-Time Optimization): Methods like Self-Refine and Reflexion improve an LLM's output for a single problem instance. They typically involve a multi-step process: generate an answer, have the LLM critique its own answer, and then use that feedback to generate a revised answer. They focus on improving a single output iteratively, not on building a general knowledge base from a dataset.
- Knowledge Base Methods (Agent KB): This work is similar in that it builds a knowledge base from experience. However, Agent KB uses a more complex reason-retrieve-refine process, relies on hand-crafted examples, and performs learning in an off-policy manner (learning from pre-collected data rather than interactively).
- Differentiation:
- vs. Agentic RL: Training-Free GRPO is fundamentally different because it does not update any model parameters. It is orders of magnitude cheaper and requires far less data.
- vs. Inference-Time Refinement: Unlike methods that refine a single output, Training-Free GRPO learns from a dataset across multiple epochs to build a shared, reusable experience library. This library then benefits performance on new, unseen problems.
- vs. Vanilla GRPO: It replaces the numerical advantage and gradient updates of vanilla GRPO with semantic advantage (natural language feedback) and context updates (modifying the experience library).
4. Methodology (Core Technology & Implementation)
The core idea of Training-Free GRPO is to replicate the policy alignment effects of traditional GRPO but entirely at inference time by manipulating the LLM's context.
The process is illustrated in Figure 2, which contrasts it with vanilla GRPO.
Figure 2: Comparison of the system architectures of vanilla GRPO and Training-Free GRPO. Panel (a) shows vanilla GRPO performing group computation with a policy model, a reference model, and a reward model, then updating the policy model's parameters. Panel (b) depicts Training-Free GRPO, in which a frozen policy model is combined with experiential knowledge; a controller and LLMs carry out summarization and experience extraction during group computation and update an experience library. This approach avoids parameter updates and instead learns experiential knowledge iteratively.
- (a) Vanilla GRPO: A trainable policy model generates multiple outputs (o1, ..., oG). A reward model scores each output to get rewards (r1, ..., rG). A group computation step calculates a numerical advantage (Ai) for each output. This advantage is used to perform a gradient update on the policy model's parameters.
- (b) Training-Free GRPO: An experience library is fed into a frozen policy model. The model generates outputs, which are scored by a reward model. The key difference is the group computation block: here, an LLM is used to summarize each output and then extract experience by comparing the summaries. This produces a textual, semantic advantage (A_text). A controller model then uses this feedback to update the experience library with operations like ADD, DELETE, KEEP, or MODIFY. The policy model itself is never changed. A minimal sketch of such a library and its update operations follows.
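To make the library-update mechanics concrete, below is a minimal Python sketch, not the authors' code, of an experience library that supports the four controller operations. The operation dictionary format and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceLibrary:
    """A list of natural-language experiences plus the four update
    operations the controller LLM can request (ADD/DELETE/MODIFY/KEEP)."""
    experiences: list[str] = field(default_factory=list)

    def apply(self, op: dict) -> None:
        # `op` is a hypothetical controller output, e.g.
        # {"action": "ADD", "text": "..."} or
        # {"action": "MODIFY", "index": 2, "text": "..."}.
        action = op["action"].upper()
        if action == "ADD":
            self.experiences.append(op["text"])
        elif action == "DELETE":
            self.experiences.pop(op["index"])
        elif action == "MODIFY":
            self.experiences[op["index"]] = op["text"]
        elif action == "KEEP":
            pass  # no change to the library

    def as_prompt(self) -> str:
        """Render the library as a context block for the frozen policy LLM."""
        if not self.experiences:
            return "No prior experiences."
        return "\n".join(f"{i}. {e}" for i, e in enumerate(self.experiences))
```

Because only this text library changes between learning steps, the "policy update" amounts to editing the prompt context rather than the model weights.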
The step-by-step procedure is as follows:
- Initialization: The method starts with a powerful, frozen LLM policy (πθ) and an empty experience library, E = ∅.
- Rollout and Reward:
- For a given query q from the training set, the model generates a group of G different outputs: {o1,o2,...,oG}. The generation is conditioned on the current experience library: πθ(oi∣q,E).
- Each output oi is evaluated by a reward model R to get a scalar reward ri=R(q,oi). In the experiments, this is a simple binary reward (correct/incorrect) based on the ground-truth answer.
- Group Advantage Computation (Semantic Advantage): This is the most innovative step. Instead of calculating a numerical score representing how much better an output is than the group average, the method generates a natural language explanation.
- This step is only performed for groups that contain both successful and unsuccessful rollouts, ensuring a clear learning signal.
- First, the LLM is prompted to create a summary si for each output oi.
- Next, the LLM is prompted again, this time with all the summaries {s1, ..., sG} for the group. It is asked to introspect and articulate a generalizable piece of advice or strategy, Atext, that explains what led to success or failure. This Atext is the semantic advantage.
Figure 3 provides a concrete example of this process for a geometry problem.
Figure 3: An example learning step of Training-Free GRPO. Starting from a geometry problem, G rollouts are generated; trajectory 1 fails while trajectory 2 successfully finds the line segment y = 2x − 4. Each trajectory is then summarized in detail and an advantage computation compares the attempts. From this comparison the process learns experiential knowledge, e.g., that mathematical solutions to geometry problems must satisfy the geometric constraints, which then guides the LLM's behavior without any parameter update.
In this example, different solution attempts (rollouts) are generated. Trajectory 1 fails by finding an intersection point outside the required line segment, while trajectory G succeeds. After summarizing each trajectory, the advantage computation module compares them and distills a key insight: "When solving geometry problems involving intersections with bounded regions, always validate that mathematical solutions satisfy geometric constraints." This becomes the learned experience.
- Optimization (Experience Library Update):
- The semantic advantages (Atext) from all groups in a training batch are collected.
- An LLM is prompted to act as a "controller." It reviews the existing experience library E and the newly generated semantic advantages, and decides how to update the library. The possible operations are:
- Add: Add a new piece of advice to E.
- Delete: Remove a low-quality or redundant piece of advice from E.
- Revise: Modify an existing piece of advice based on new insights.
- Keep: Make no changes.
- Multi-Epoch Learning: This entire process is repeated for multiple epochs over the small training dataset. With each epoch, the experience library E becomes more refined and effective. During final evaluation on the test set, this curated library E is included in the prompt to guide the frozen LLM. A minimal end-to-end sketch of this loop is given below.
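The following Python sketch ties these steps together. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: `call_llm` is a hypothetical stand-in for calls to the frozen policy/judge LLM, the binary reward is simplified to a substring match against the ground-truth answer, and the controller step is reduced to appending the extracted lesson.

```python
from typing import Callable

def training_free_grpo(
    train_set: list[dict],            # items like {"question": ..., "answer": ...}
    call_llm: Callable[[str], str],   # hypothetical frozen-LLM API wrapper
    epochs: int = 3,
    group_size: int = 4,
) -> list[str]:
    """Illustrative reconstruction of the Training-Free GRPO learning loop."""
    experiences: list[str] = []       # experience library E, starts empty
    for _ in range(epochs):
        for item in train_set:
            ctx = "\n".join(experiences) or "None"
            # 1. Rollout: G outputs conditioned on the current library.
            outputs = [
                call_llm(f"Experiences:\n{ctx}\n\nQuestion: {item['question']}")
                for _ in range(group_size)
            ]
            # 2. Binary reward against the ground truth (simplified check).
            rewards = [1.0 if item["answer"] in o else 0.0 for o in outputs]
            # Only groups with both successes and failures give a clear signal.
            if len(set(rewards)) < 2:
                continue
            # 3. Semantic advantage: summarize each rollout, then contrast them.
            summaries = [call_llm(f"Summarize this solution attempt:\n{o}") for o in outputs]
            a_text = call_llm(
                "Compare these attempts and state one generalizable lesson "
                "explaining what led to success or failure:\n"
                + "\n".join(f"[reward={r}] {s}" for r, s in zip(rewards, summaries))
            )
            # 4. Controller update: the paper prompts an LLM to choose among
            #    ADD / DELETE / MODIFY / KEEP; here we simply append the lesson.
            experiences.append(a_text)
    return experiences
```

In the paper, a separate controller prompt decides among ADD, DELETE, MODIFY, and KEEP rather than always appending, and each rollout is an agentic trajectory with tool calls (e.g., a code interpreter) rather than a single completion.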
5. Experimental Setup
- Datasets:
- Mathematical Reasoning:
- Training: A small set of 100 problems (DAPO-100) randomly sampled from the DAPO-Math-17K dataset.
- Evaluation: The challenging, out-of-domain AIME 2024 and AIME 2025 competition benchmarks.
- Web Searching:
- Training: A set of 100 queries (AFM-100) randomly sampled from the AFM web interaction RL dataset.
- Evaluation: The WebWalkerQA benchmark, which requires agents to navigate web pages to find answers.
- Evaluation Metrics:
- Mean@k: Used for the AIME math benchmarks.
- Conceptual Definition: This metric is equivalent to Pass@k. It measures the success rate on a set of problems, where a problem is considered "solved" if at least one correct answer is found within k independent attempts (generations). It evaluates the model's ability to produce a correct solution within a limited budget of tries. The paper uses k = 32.
- Mathematical Formula:
Pass@k = E_problems[ 1 − ∏_{i=1}^{k} (1 − c_i) ]
- Symbol Explanation:
- ci is a binary variable that is 1 if the i-th generated solution is correct, and 0 otherwise.
- The formula calculates the probability that not all k attempts are incorrect.
- Pass@k: Used for the WebWalkerQA benchmark. The definition is the same as above. The paper reports pass@1 (the accuracy of the very first attempt) and pass@3. An empirical computation of pass@k is sketched below.
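As a concrete illustration (not transcribed from the paper), pass@k can be estimated empirically from per-attempt correctness flags; the helper below is a simple version of that computation.

```python
def pass_at_k(correct: list[list[bool]], k: int) -> float:
    """Fraction of problems with at least one correct answer among the
    first k attempts. `correct[p][i]` is True if attempt i on problem p
    was correct. This is the simple empirical estimate, not the unbiased
    combinatorial estimator sometimes used for code benchmarks."""
    solved = sum(any(attempts[:k]) for attempts in correct)
    return solved / len(correct)

# Example: 3 problems, 4 attempts each.
flags = [
    [False, True, False, False],   # solved on attempt 2
    [False, False, False, False],  # never solved
    [True, True, False, True],     # solved on attempt 1
]
print(pass_at_k(flags, k=1))  # 1/3 ≈ 0.333
print(pass_at_k(flags, k=4))  # 2/3 ≈ 0.667
```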
- Baselines:
- ReAct: A standard agentic prompting framework without any learned experiences. This serves as the zero-shot baseline.
- Fine-tuned Models: A suite of models trained with agentic RL methods on thousands of data points, including ReTool, AFM, ZeroTIR, and SimpleTIR, all based on a 32B-parameter Qwen model.
- MiroThinker: Another open-source agentic model fine-tuned for research and problem-solving.
- Base Models: DeepSeek-V3.1-Terminus (a very powerful proprietary model), Qwen3-32B, and Qwen2.5-72B-Instruct.
6. Results & Analysis
The experiments convincingly demonstrate the effectiveness and efficiency of Training-Free GRPO.
- Core Results:
Image 1 visually summarizes the main findings on the AIME benchmarks.
Figure: Three bar charts summarizing the advantages of Training-Free GRPO (Ours) in performance, training cost, and training data. On the AIME benchmarks, the method clearly improves performance both without and with tools. Compared with traditional RL training of a 32B model, the method (applied to a 671B model) cuts training cost from roughly $10,000 to roughly $18 and reduces training data from about 17,000 samples to 100, highlighting its efficiency and lightweight nature.
The charts show that the proposed method (+Ours) significantly boosts performance on AIME24 and AIME25, both with and without tools. Crucially, it achieves this with a training cost of only ~$18 and 100 data samples, whereas traditional RL training costs ~$10,000 and requires ~17,000 samples.
- Mathematical Reasoning:
The following is a transcription of Table 1 from the paper.
| Method | Learning Cost | Training Set | Model | Tool | AIME24 | AIME25 |
| --- | --- | --- | --- | --- | --- | --- |
| Direct | - | - | DeepSeek-V3.1-Terminus | - | 68.6 | 52.9 |
| + Training-Free GRPO | ≈ $8 | DAPO-100 | DeepSeek-V3.1-Terminus | - | 72.6 (↑4.0) | 54.0 (↑1.1) |
| ReAct [1] | - | - | DeepSeek-V3.1-Terminus | CI | 80.0 | 67.9 |
| + Training-Free GRPO | ≈ $18 | DAPO-100 | DeepSeek-V3.1-Terminus | CI | 82.7 (↑2.7) | 73.3 (↑5.4) |
| ReAct [1] | - | - | DeepSeek-V3.2-Exp | CI | 71.0 | 61.8 |
| + Training-Free GRPO | ≈ $8 | DAPO-100 | DeepSeek-V3.2-Exp | CI | 73.1 (↑2.1) | 63.2 (↑1.4) |
When applied to the powerful DeepSeek-V3.1-Terminus model with a code interpreter (CI) tool, Training-Free GRPO improves the Mean@32 score from 80.0% to 82.7% on AIME24 and from 67.9% to 73.3% on AIME25. This result, achieved with only 100 training samples, surpasses all the heavily fine-tuned 32B models shown in Table 3. This supports the paper's central argument: it is more effective to guide a powerful frozen model with context than to exhaustively train a weaker one.
- Web Searching:
The following is a transcription of Table 4.
| Method | Training Set | Model | pass@1 |
| --- | --- | --- | --- |
| ReAct [1] | - | DeepSeek-V3.1-Terminus | 63.2 |
| + Training-Free GRPO | AFM-100 | DeepSeek-V3.1-Terminus | 67.8 (↑4.6) |
On the WebWalkerQA benchmark, the method improves the pass@1 score by a significant 4.6 absolute points, demonstrating its applicability beyond mathematical reasoning.
- Ablations / Parameter Sensitivity:
The following is a transcription of the ablation study in Table 2.
| Method | Training Set | AIME24 | AIME25 |
| --- | --- | --- | --- |
| ReAct [1] | - | 80.0 | 67.9 |
| ReAct [1] + Directly Generated Experiences | - | 79.8 | 67.3 |
| ReAct [1] + Training-Free GRPO (w/o ground truths) | DAPO-100 | 80.7 | 68.9 |
| ReAct [1] + Training-Free GRPO (w/o group computation) | DAPO-100 | 80.4 | 69.3 |
| ReAct [1] + Training-Free GRPO | DAPO-100 | 82.7 | 73.3 |
- Importance of the Optimization Process: Simply adding "directly generated experiences" without the GRPO-style learning process slightly harms performance. This proves that the iterative refinement and distillation of high-quality experiences are crucial.
- Robustness without Ground Truth: When ground truths are removed (w/o ground truths), the method still provides a small performance boost. In this setting, the LLM relies on self-consistency and majority voting within the group to identify better trajectories, showing the method's robustness in scenarios where rewards are unavailable (a small sketch of this voting mechanism follows this list).
- Necessity of Group Computation: Removing the group comparison (setting group size to 1) severely diminishes the performance gains. This confirms that comparing multiple diverse trajectories is key to distilling effective semantic advantages.
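To illustrate the ground-truth-free setting mentioned above, here is a minimal sketch of majority-vote pseudo-rewards; this is an assumed mechanism for illustration, not the authors' exact procedure.

```python
from collections import Counter

def majority_vote_rewards(final_answers: list[str]) -> list[float]:
    """Pseudo-rewards without ground truth: rollouts agreeing with the
    group's most common final answer are treated as 'successful'."""
    counts = Counter(final_answers)
    majority_answer, _ = counts.most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in final_answers]

# Example: four rollouts, three agree on "42".
print(majority_vote_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```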
- Cross-domain Transferability:
The following is a transcription of Table 6.
| Method | Learned Domain | AIME24 (Math Reasoning) | AIME25 (Math Reasoning) | WebWalker (Web Searching) |
| --- | --- | --- | --- | --- |
| ReAct [1] (Qwen2.5-32B-Instruct) | - | 29.6 | 23.1 | 31.9 |
| ReTool [4] (Qwen2.5-32B-Instruct) | Math | 67.0 | 49.3 | 18.3 |
| MiroThinker [10] (Qwen3-32B) | Web | 43.5 | 36.8 | 53.6 |
| Training-Free GRPO (DeepSeek-V3.1-Terminus) | Math / Web | 82.7 | 73.3 | 67.8 |
This table highlights a major advantage. The fine-tuned models (ReTool, MiroThinker) are domain specialists. ReTool, trained on math, fails miserably on web searching (18.3%), while MiroThinker, trained on web tasks, is weak at math. In contrast, Training-Free GRPO achieves state-of-the-art performance in both domains simply by loading the appropriate experience library for the task at hand, demonstrating excellent modularity and generalization.
7. Conclusion & Reflections
- Conclusion Summary:
The paper introduces Training-Free GRPO, a novel and highly practical paradigm for enhancing LLM agent performance. By shifting RL-style policy optimization from the expensive parameter space to the lightweight context space, it effectively addresses the key limitations of traditional agent fine-tuning. The method leverages an LLM's own reasoning capabilities to distill semantic advantages from group rollouts, iteratively building an experiential knowledge base that guides a frozen model's behavior. The results show that this approach is not only orders of magnitude cheaper and more data-efficient but also leads to superior performance and better generalization compared to fine-tuning smaller models.
- Limitations & Future Work:
The paper does not explicitly list its limitations, but some can be inferred:
- Dependence on a Strong Base Model: The experiments show that the method's effectiveness is highly dependent on the capability of the underlying frozen LLM. As seen with QwQ-32B on web tasks, if the base model lacks fundamental reasoning or tool-use skills, providing it with experiences may not be sufficient to elicit good performance.
- Inference Overhead: Including the experiential knowledge library in every prompt increases the context length. This can lead to higher per-query API costs and increased latency, which might be a concern for high-throughput applications. However, the paper argues this is offset by the elimination of fixed GPU serving costs.
- Scalability of the Experience Library: The paper does not explore how the system behaves as the experience library grows very large. An overly large set of experiences could potentially become noisy or computationally burdensome for the LLM to process effectively in-context.
- Complexity of the Controller: The process of updating the experience library relies on another LLM call. The quality and consistency of this "controller" LLM are critical to the learning process and could be a potential point of failure.
- Personal Insights & Critique:
This paper presents a compelling and pragmatic solution to a significant real-world problem. The core idea of "optimizing the context, not the weights" is elegant and leverages the unique strengths of modern LLMs.
- Novelty: The concept of "semantic advantage" is a clever translation of a core RL principle into the natural language domain. It essentially turns the optimization process into a meta-reasoning task that LLMs are well-suited for.
- Practical Impact: This approach could democratize the development of specialized LLM agents. Instead of requiring massive GPU clusters for fine-tuning, organizations could adapt powerful API-based models for their specific needs using small, curated datasets and this lightweight learning process.
- Future Directions: This paradigm opens up many exciting avenues. The experience library could be structured hierarchically, or a retrieval mechanism could be used to select only the most relevant experiences for a given query, mitigating the context length issue. Furthermore, the process of distilling semantic advantage could itself be refined to produce even more potent and generalizable insights. Overall, Training-Free GRPO is a significant step towards making advanced agentic AI more accessible, efficient, and adaptable.