
Training-Free Group Relative Policy Optimization

Topics: LLM Reasoning Capacity Enhancement, Sequence Policy Optimization, RL Training for Large Language Models, Training-Free Acceleration Methods

TL;DR Summary

Existing LLM agent fine-tuning is costly and parameter-heavy for specialized tasks. This paper proposes Training-Free GRPO, which distills a "relative semantic advantage" from a few rollouts into a "token prior" that guides the LLM. The method requires no parameter updates, addresses practical data scarcity, and avoids the overfitting common to fine-tuning.

Abstract

Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.

English Analysis

1. Bibliographic Information

  • Title: Training-Free Group Relative Policy Optimization
  • Authors: Youtu-Agent Team (Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun)
  • Affiliations: Tencent Youtu Lab, Fudan University, Xiamen University
  • Journal/Conference: This paper is a preprint submitted to arXiv. It has not yet undergone formal peer review at a major conference or journal.
  • Publication Year: 2025 (the arXiv preprint is dated October 9, 2025).
  • Abstract: The paper addresses the performance degradation of Large Language Model (LLM) agents in specialized domains that require external tools and specific prompting. While existing methods like agentic Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) can adapt agents, they require costly parameter updates (fine-tuning). The authors argue that a similar effect can be achieved without training by learning "experiential knowledge" as a token prior. They propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a method that avoids all parameter updates. Instead of using numerical advantages to update model weights, it distills a "semantic advantage" from groups of agent outputs (rollouts) to iteratively build a high-quality library of experiential knowledge. This knowledge is then provided in the context of future LLM API calls to guide its behavior. Experiments on mathematical reasoning and web searching tasks show that their method, applied to a powerful frozen model (DeepSeek-V3.1-Terminus), significantly improves out-of-domain performance using only a few dozen training samples and at a fraction of the cost of traditional fine-tuning.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: General-purpose LLM agents, while powerful, often fail in specialized, real-world domains that require the use of specific external tools (like calculators or web browsers) and domain-specific strategies.
    • Gaps in Prior Work: Existing solutions, primarily under the umbrella of "agentic reinforcement learning," adapt LLMs by fine-tuning their parameters. This approach, while effective, is plagued by several major challenges:
      1. Prohibitive Cost: Fine-tuning even moderately sized LLMs requires immense computational resources, making it expensive and environmentally unsustainable.
      2. Data Scarcity: High-quality, annotated training data for specialized domains is often rare and costly to create.
      3. Poor Generalization: Models fine-tuned for one specific task often perform poorly on others (a phenomenon known as catastrophic forgetting or domain overfitting), requiring separate models for each task.
      4. Diminishing Returns: Due to high costs, researchers often fine-tune smaller, less capable models (e.g., 32B parameters), which may not be as effective as simply using a larger, more powerful, but un-tuned (frozen) model via an API.
    • Fresh Angle: The paper hypothesizes that the core capabilities needed for these tasks are already present in powerful LLMs. Instead of changing the model's parameters, one can guide its behavior by providing it with learned "experiential knowledge" directly in its context (prompt). This shifts the optimization problem from the parameter space to the context space, which is far more efficient.
  • Main Contributions / Findings (What):

    1. A New Training-Free RL Paradigm: The paper introduces Training-Free GRPO, a novel method that mimics the policy optimization of traditional RL without any gradient updates. It optimizes agent behavior by iteratively refining an external library of "experiential knowledge" that is used as a token prior during inference.
    2. Semantic Group Advantage: It replaces the numerical advantage signal used in traditional GRPO with a semantic group advantage. This involves using an LLM to analyze a group of different outputs (rollouts) for the same task, understand why some succeeded and others failed, and distill this understanding into natural language advice (the semantic advantage).
    3. Demonstrated Efficiency and Effectiveness: Experiments show that with only 100 training samples and minimal cost (around $18), Training-Free GRPO significantly boosts the performance of a top-tier model (DeepSeek-V3.1-Terminus) on challenging mathematical reasoning and web search tasks, outperforming smaller models that were fine-tuned with thousands of samples and costs exceeding $10,000.
    4. Superior Generalization: Since the base model's parameters are never changed, the system can adapt to different domains simply by "plugging in" the relevant, domain-specific experiential knowledge library, avoiding the poor cross-domain transferability of fine-tuned models.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Large Language Model (LLM) Agent: An LLM that can do more than just generate text. It can interact with an environment by using external tools (e.g., running code, searching the web, using APIs). A common framework for this is ReAct (Reasoning and Acting), where the LLM alternates between thinking about a problem and taking an action.
    • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It learns from trial and error.
    • Policy Optimization: In RL, the "policy" is the agent's strategy for choosing actions. Policy optimization algorithms aim to improve this policy to achieve higher rewards. A well-known algorithm is Proximal Policy Optimization (PPO).
    • Group Relative Policy Optimization (GRPO): An advanced policy optimization method. Instead of using a separate "critic" model to estimate the value of an action, GRPO generates a group of several possible outputs for a given prompt. It then compares the rewards of these outputs to each other to determine which ones are better or worse than the group average. This "relative advantage" is then used to update the model's parameters to favor better outputs in the future.
    • In-Context Learning (ICL): The remarkable ability of modern LLMs to learn a new task or adapt their behavior based solely on examples and instructions provided in the input prompt (the "context"), without any changes to the underlying weights. Training-Free GRPO heavily relies on this capability.
  • Previous Works:

    • Agentic RL (Parameter Tuning): Methods like ReTool, AFM, ZeroTIR, and SimpleTIR use RL algorithms (like PPO or GRPO) to fine-tune an LLM's parameters. They train the model on thousands of examples of successful task completion, teaching it to use tools more effectively. Their main drawback is the high cost and data requirements.
    • Training-Free Methods (Inference-Time Optimization): Methods like Self-Refine and Reflexion improve an LLM's output for a single problem instance. They typically involve a multi-step process: generate an answer, have the LLM critique its own answer, and then use that feedback to generate a revised answer. They focus on improving a single output iteratively, not on building a general knowledge base from a dataset.
    • Knowledge Base Methods (Agent KB): This work is similar in that it builds a knowledge base from experience. However, Agent KB uses a more complex reason-retrieve-refine process, relies on hand-crafted examples, and performs learning in an off-policy manner (learning from pre-collected data rather than interactively).
  • Differentiation:

    • vs. Agentic RL: Training-Free GRPO is fundamentally different because it does not update any model parameters. It is orders of magnitude cheaper and requires far less data.
    • vs. Inference-Time Refinement: Unlike methods that refine a single output, Training-Free GRPO learns from a dataset across multiple epochs to build a shared, reusable experience library. This library then benefits performance on new, unseen problems.
    • vs. Vanilla GRPO: It replaces the numerical advantage and gradient updates of vanilla GRPO with semantic advantage (natural language feedback) and context updates (modifying the experience library).
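To make the numerical group-relative advantage that Training-Free GRPO replaces more concrete, here is a minimal numeric sketch. The mean/std normalization is the commonly described GRPO formulation; treat the exact details (e.g., the epsilon term) as illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the numerical group-relative advantage in vanilla GRPO:
# each rollout's reward is compared against its own group's statistics.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group of G = 4 rollouts for one prompt, with binary correctness rewards.
print(group_relative_advantages([1, 0, 0, 1]))  # approx. [ 1., -1., -1.,  1.]
```

Training-Free GRPO keeps this group structure but swaps the scalar $A_i$ for a natural-language $A_{\text{text}}$, as described in the next section.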

4. Methodology (Core Technology & Implementation)

The core idea of Training-Free GRPO is to replicate the policy alignment effects of traditional GRPO but entirely at inference time by manipulating the LLM's context.

The process is illustrated in Figure 2, which contrasts it with vanilla GRPO.

Figure 2. Comparison of vanilla GRPO and Training-Free GRPO. Panel (a) shows vanilla GRPO performing group computation with a policy model, reference model, and reward model, then updating the policy model via gradients. Panel (b) depicts Training-Free GRPO, in which the frozen policy model is conditioned on experiential knowledge; a controller and auxiliary LLM calls perform summarization and experience extraction within the group computation and update the experience library. This approach avoids parameter updates and instead focuses on iterative learning of experiential knowledge.

  • (a) Vanilla GRPO: A trainable policy model generates multiple outputs ($O_1, \ldots, O_G$). A reward model scores each output to get rewards ($r_1, \ldots, r_G$). A Group Computation step calculates a numerical advantage ($A_i$) for each output. This advantage is used to perform a gradient Update on the policy model's parameters.

  • (b) Training-Free GRPO: An experience library is fed into a frozen policy model. The model generates outputs, which are scored by a reward model. The key difference is the Group Computation block. Here, an LLM is used to summarize each output and then extract experience by comparing them. This produces a textual, semantic advantage ($A_{\text{text}}$). A controller model then uses this feedback to Update the experience library with operations like ADD, DELETE, KEEP, or MODIFY. The policy model itself is never changed.

    The step-by-step procedure is as follows:

  1. Initialization: The method starts with a powerful, frozen LLM policy ($\pi_\theta$) and an empty experience library, $\mathcal{E} = \emptyset$.

  2. Rollout and Reward:

    • For a given query $q$ from the training set, the model generates a group of $G$ different outputs: $\{o_1, o_2, \ldots, o_G\}$. The generation is conditioned on the current experience library: $\pi_\theta(o_i \mid q, \mathcal{E})$.
    • Each output $o_i$ is evaluated by a reward model $\mathcal{R}$ to get a scalar reward $r_i = \mathcal{R}(q, o_i)$. In the experiments, this is a simple binary reward (correct/incorrect) based on the ground-truth answer.
  3. Group Advantage Computation (Semantic Advantage): This is the most innovative step. Instead of calculating a numerical score representing how much better an output is than the group average, the method generates a natural language explanation.

    • This step is only performed for groups that have both successful and unsuccessful rollouts, ensuring a clear learning signal.

    • First, the LLM is prompted to create a summary $s_i$ for each output $o_i$.

    • Next, the LLM is prompted again, this time with all the summaries $\{s_1, \ldots, s_G\}$ for the group. It is asked to introspect and articulate a generalizable piece of advice or strategy, $A_{\text{text}}$, that explains what led to success or failure. This $A_{\text{text}}$ is the semantic advantage.

      Figure 3 provides a concrete example of this process for a geometry problem.

      Figure 3. Example of a Training-Free GRPO learning step. The figure starts from a geometry problem and shows a group of G rollouts, in which trajectory 1 fails while another trajectory succeeds on the line segment $y = 2x - 4$. Each trajectory is then summarized in detail and the advantage computation is performed. By comparing the different attempts, the process distills experiential knowledge, e.g., that mathematical solutions to geometry problems must satisfy the geometric constraints, which then guides LLM behavior without any parameter updates.

    In this example, different solution attempts (rollouts) are generated. Trajectory 1 fails by finding an intersection point outside the required line segment, while trajectory G succeeds. After summarizing each trajectory, the advantage computation module compares them and distills a key insight: "When solving geometry problems involving intersections with bounded regions, always validate that mathematical solutions satisfy geometric constraints." This becomes the learned experience.

  4. Optimization (Experience Library Update):

    • The semantic advantages ($A_{\text{text}}$) from all groups in a training batch are collected.
    • An LLM is prompted to act as a "controller." It reviews the existing experience library $\mathcal{E}$ and the newly generated semantic advantages, and decides how to update the library. The possible operations are:
      • Add: Add a new piece of advice to $\mathcal{E}$.
      • Delete: Remove a low-quality or redundant piece of advice from $\mathcal{E}$.
      • Revise: Modify an existing piece of advice based on new insights.
      • Keep: Make no changes.
  5. Multi-Epoch Learning: This entire process is repeated for multiple epochs over the small training dataset. With each epoch, the experience library $\mathcal{E}$ becomes more refined and effective. During final evaluation on the test set, this curated library $\mathcal{E}$ is included in the prompt to guide the frozen LLM.
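To tie steps 1-5 together, the following is a minimal, illustrative Python sketch of the learning loop, not the authors' implementation. The function names (`call_llm`, `is_correct`), the prompt wording, and the per-query (rather than per-batch) library update are assumptions made for brevity.

```python
# Illustrative sketch of the Training-Free GRPO loop described above.
# `call_llm` and `is_correct` are placeholders the reader must supply
# (e.g., an API client for the frozen policy model and an answer checker).
from typing import Callable, List

def training_free_grpo(
    train_set: List[dict],                   # items like {"query": ..., "answer": ...}
    call_llm: Callable[[str], str],          # frozen policy / helper LLM (placeholder)
    is_correct: Callable[[str, str], bool],  # binary reward against the ground truth
    group_size: int = 4,
    epochs: int = 3,
) -> List[str]:
    experiences: List[str] = []              # the experience library E, initially empty

    for _ in range(epochs):
        for item in train_set:
            q, gt = item["query"], item["answer"]
            context = "Known experiences:\n" + "\n".join(experiences)

            # 1-2) Rollout a group of G outputs conditioned on E, then score them
            #      with a binary reward from the ground truth.
            rollouts = [call_llm(f"{context}\n\nSolve: {q}") for _ in range(group_size)]
            rewards = [is_correct(o, gt) for o in rollouts]

            # 3) Semantic advantage: only computed when the group mixes successes
            #    and failures, so there is a clear learning signal.
            if all(rewards) or not any(rewards):
                continue
            summaries = [call_llm(f"Summarize this solution attempt:\n{o}") for o in rollouts]
            labeled = "\n".join(
                f"[{'success' if r else 'failure'}] {s}" for r, s in zip(rewards, summaries)
            )
            semantic_advantage = call_llm(
                "Compare these attempts and state one generalizable piece of advice "
                f"explaining what led to success or failure:\n{labeled}"
            )

            # 4) Optimization: a controller LLM reviews the library plus the new advice
            #    and returns an updated library (add / delete / revise / keep).
            #    (Done per query here for brevity; the paper batches this step.)
            revised = call_llm(
                "You maintain a library of concise, reusable experiences.\n"
                f"Current library:\n{chr(10).join(experiences) or '(empty)'}\n"
                f"Candidate advice:\n{semantic_advantage}\n"
                "Return the updated library, one experience per line."
            )
            experiences = [line.strip() for line in revised.splitlines() if line.strip()]

    # 5) The refined library is later prepended to prompts at evaluation time.
    return experiences
```

At test time, the returned `experiences` list would simply be included in each query's prompt, matching step 5 above.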

5. Experimental Setup

  • Datasets:

    • Mathematical Reasoning:
      • Training: A small set of 100 problems (DAPO-100) randomly sampled from the DAPO-Math-17K dataset.
      • Evaluation: The challenging, out-of-domain AIME 2024 and AIME 2025 competition benchmarks.
    • Web Searching:
      • Training: A set of 100 queries (AFM-100) randomly sampled from the AFM web interaction RL dataset.
      • Evaluation: The WebWalkerQA benchmark, which requires agents to navigate web pages to find answers.
  • Evaluation Metrics:

    • Mean@k: Used for the AIME math benchmarks.
      1. Conceptual Definition: The average accuracy over $k$ independent attempts (generations) per problem: each attempt is scored as correct or incorrect, and the scores are averaged over attempts and problems. This differs from Pass@k, which only asks whether at least one attempt succeeds. The paper uses $k = 32$.
      2. Mathematical Formula: $\text{Mean@}k = \mathbb{E}_{\text{problems}} \left[ \frac{1}{k} \sum_{i=1}^{k} c_i \right]$
      3. Symbol Explanation:
        • $c_i$ is a binary variable that is 1 if the $i$-th generated solution is correct, and 0 otherwise.
    • Pass@k: Used for the WebWalkerQA benchmark. A problem counts as solved if at least one of the $k$ attempts is correct: $\text{Pass@}k = \mathbb{E}_{\text{problems}} \left[ 1 - \prod_{i=1}^{k} (1 - c_i) \right]$. The paper reports pass@1 (the accuracy of the very first attempt) and pass@3. A small numeric sketch of both metrics appears at the end of this section.
  • Baselines:

    • ReAct: A standard agentic prompting framework without any learned experiences. This serves as the zero-shot baseline.
    • Fine-tuned Models: A suite of models trained with agentic RL methods on thousands of data points, including ReTool, AFM, ZeroTIR, and SimpleTIR, all based on a 32B parameter Qwen model.
    • MiroThinker: Another open-source agentic model fine-tuned for research and problem-solving.
    • Base Models: DeepSeek-V3.1-Terminus (a very powerful proprietary model), Qwen3-32B, and Qwen2.5-72B-Instruct.
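As referenced in the metrics above, here is a small numeric sketch contrasting Mean@k and Pass@k on toy correctness flags (illustrative data only, not results from the paper).

```python
# Toy sketch distinguishing Mean@k from Pass@k on per-attempt correctness flags.
import numpy as np

def mean_at_k(correct: np.ndarray) -> float:
    """Average accuracy over all attempts and all problems."""
    return float(correct.mean())

def pass_at_k(correct: np.ndarray) -> float:
    """Fraction of problems with at least one correct attempt among the k tries."""
    return float(correct.any(axis=1).mean())

# 3 problems x 4 attempts; c_i = 1 if that attempt was correct.
c = np.array([
    [1, 0, 1, 1],   # solved in 3 of 4 attempts
    [0, 0, 0, 0],   # never solved
    [0, 1, 0, 0],   # solved once
])
print(mean_at_k(c))  # 0.333... (4 correct attempts out of 12)
print(pass_at_k(c))  # 0.666... (2 of 3 problems solved at least once)
```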

6. Results & Analysis

The experiments convincingly demonstrate the effectiveness and efficiency of Training-Free GRPO.

  • Core Results:

    Image 1 visually summarizes the main findings on the AIME benchmarks.

    The figure combines three bar charts that summarize the advantages of Training-Free GRPO (Ours) in performance, training cost, and training data. On the AIME benchmarks, the method substantially improves performance both with and without tools. Compared with traditional RL training of a 32B model, the proposed approach (applied to a 671B model) reduces training cost from roughly $10,000 to roughly $18 and the training data from roughly 17,000 samples to 100, highlighting its efficiency and lightweight nature.

    The charts show that the proposed method (+Ours) significantly boosts performance on AIME24 and AIME25, both with and without tools. Crucially, it achieves this with a training cost of only ~$18 and 100 data samples, compared to traditional RL training, which costs ~$10,000 and requires ~17,000 samples.

    • Mathematical Reasoning: The following is a transcription of Table 1 from the paper.

      Method Learning Cost Training Set Model Tool AIME24 AIME25
      Direct - - DeepSeek-V3.1-Terminus - 68.6 52.9
      + Training-Free GRPO ≈ $8 DAPO-100 DeepSeek-V3.1-Terminus - 72.6 (↑4.0) 54.0 (↑1.1)
      ReAct [1] - - DeepSeek-V3.1-Terminus CI 80.0 67.9
      + Training-Free GRPO ≈ $18 DAPO-100 DeepSeek-V3.1-Terminus CI 82.7 (↑2.7) 73.3 (↑5.4)
      ReAct [1] - - DeepSeek-V3.2-Exp CI 71.0 61.8
      + Training-Free GRPO ≈ $8 DAPO-100 DeepSeek-V3.2-Exp CI 73.1 (↑2.1) 63.2 (↑1.4)

      When applied to the powerful DeepSeek-V3.1-Terminus model with a code interpreter (CI) tool, Training-Free GRPO improves the Mean@32 score from 80.0% to 82.7% on AIME24 and from 67.9% to 73.3% on AIME25. This result, achieved with only 100 training samples, surpasses all the heavily fine-tuned 32B models shown in Table 3. This supports the paper's central argument: it's more effective to guide a powerful frozen model with context than to exhaustively train a weaker one.

    • Web Searching: The following is a transcription of Table 4.

      Method Training Set Model pass@1
      ReAct [1] - DeepSeek-V3.1-Terminus 63.2
      + Training-Free GRPO AFM-100 DeepSeek-V3.1-Terminus 67.8 (↑4.6)

      On the WebWalkerQA benchmark, the method improves the pass@1 score by a significant 4.6 absolute points, demonstrating its applicability beyond mathematical reasoning.

  • Ablations / Parameter Sensitivity:

    The following is a transcription of the ablation study in Table 2.

    Method Training Set AIME24 AIME25
    ReAct [1] 80.0 67.9
    ReAct [1] + Directly Generated Experiences 79.8 67.3
    ReAct [1] + Training-Free GRPO (w/o ground truths) DAPO-100 80.7 68.9
    ReAct [1] + Training-Free GRPO (w/o group computation) DAPO-100 80.4 69.3
    ReAct [1] + Training-Free GRPO DAPO-100 82.7 73.3
    1. Importance of the Optimization Process: Simply adding "directly generated experiences" without the GRPO-style learning process slightly harms performance. This proves that the iterative refinement and distillation of high-quality experiences are crucial.
    2. Robustness without Ground Truth: When ground truths are removed (w/o ground truths), the method still provides a small performance boost. In this setting, the LLM relies on self-consistency and majority voting within the group to identify better trajectories, showing the method's robustness in scenarios where rewards are unavailable (a minimal sketch of such a pseudo-reward appears at the end of this section).
    3. Necessity of Group Computation: Removing the group comparison (setting group size to 1) severely diminishes the performance gains. This confirms that comparing multiple diverse trajectories is key to distilling effective semantic advantages.
  • Cross-domain Transferability: The following is a transcription of Table 6.

    Method Learned Domain Math Reasoning Web Searching
    AIME24 AIME25 WebWalker
    ReAct [1] (Qwen2.5-32B-Instruct) - 29.6 23.1 31.9
    ReTool [4] (Qwen2.5-32B-Instruct) Math 67.0 49.3 18.3
    MiroThinker [10] (Qwen3-32B) Web 43.5 36.8 53.6
    Training-Free GRPO (DeepSeek-V3.1-Terminus) Math / Web 82.7 73.3 67.8

    This table highlights a major advantage. The fine-tuned models (ReTool, MiroThinker) are domain specialists. ReTool, trained on math, fails miserably on web searching (18.3%). MiroThinker, trained on web tasks, is weak at math. In contrast, Training-Free GRPO achieves state-of-the-art performance in both domains simply by loading the appropriate experience library for the task at hand, demonstrating excellent modularity and generalization.
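Returning to the ground-truth-free ablation referenced earlier: one plausible way to derive pseudo-rewards by majority voting over a group's final answers is sketched below. The answer-extraction and tie-breaking details are assumptions, not specifics from the paper.

```python
# Hypothetical majority-vote pseudo-reward for groups without ground truths:
# rollouts whose final answer matches the group's most common answer get reward 1.
from collections import Counter
from typing import List

def majority_vote_rewards(final_answers: List[str]) -> List[int]:
    """Reward 1 for rollouts agreeing with the group's most frequent final answer."""
    majority_answer, _ = Counter(final_answers).most_common(1)[0]
    return [int(a == majority_answer) for a in final_answers]

# Example: 4 rollouts for one problem, 3 of which agree on "42".
print(majority_vote_rewards(["42", "41", "42", "42"]))  # [1, 0, 1, 1]
```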

7. Conclusion & Reflections

  • Conclusion Summary: The paper introduces Training-Free GRPO, a novel and highly practical paradigm for enhancing LLM agent performance. By shifting RL-style policy optimization from the expensive parameter space to the lightweight context space, it effectively addresses the key limitations of traditional agent fine-tuning. The method leverages an LLM's own reasoning capabilities to distill semantic advantages from group rollouts, iteratively building an experiential knowledge base that guides a frozen model's behavior. The results show that this approach is not only orders of magnitude cheaper and more data-efficient but also leads to superior performance and better generalization compared to fine-tuning smaller models.

  • Limitations & Future Work: The paper does not explicitly list its limitations, but some can be inferred:

    1. Dependence on a Strong Base Model: The experiments show that the method's effectiveness is highly dependent on the capability of the underlying frozen LLM. As seen with QwQ-32B on web tasks, if the base model lacks fundamental reasoning or tool-use skills, providing it with experiences may not be sufficient to elicit good performance.
    2. Inference Overhead: Including the experiential knowledge library in every prompt increases the context length. This can lead to higher per-query API costs and increased latency, which might be a concern for high-throughput applications. However, the paper argues this is offset by the elimination of fixed GPU serving costs.
    3. Scalability of the Experience Library: The paper does not explore how the system behaves as the experience library grows very large. An overly large set of experiences could potentially become noisy or computationally burdensome for the LLM to process effectively in-context.
    4. Complexity of the Controller: The process of updating the experience library relies on another LLM call. The quality and consistency of this "controller" LLM are critical to the learning process and could be a potential point of failure.
  • Personal Insights & Critique: This paper presents a compelling and pragmatic solution to a significant real-world problem. The core idea of "optimizing the context, not the weights" is elegant and leverages the unique strengths of modern LLMs.

    • Novelty: The concept of "semantic advantage" is a clever translation of a core RL principle into the natural language domain. It essentially turns the optimization process into a meta-reasoning task that LLMs are well-suited for.
    • Practical Impact: This approach could democratize the development of specialized LLM agents. Instead of requiring massive GPU clusters for fine-tuning, organizations could adapt powerful API-based models for their specific needs using small, curated datasets and this lightweight learning process.
    • Future Directions: This paradigm opens up many exciting avenues. The experience library could be structured hierarchically, or a retrieval mechanism could be used to select only the most relevant experiences for a given query, mitigating the context length issue. Furthermore, the process of distilling semantic advantage could itself be refined to produce even more potent and generalizable insights. Overall, Training-Free GRPO is a significant step towards making advanced agentic AI more accessible, efficient, and adaptable.
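As one illustration of the retrieval idea raised above, the hypothetical sketch below keeps only the top-k experiences most similar to the current query; `embed` stands for any sentence-embedding function, and nothing here comes from the paper.

```python
# Hypothetical retrieval over the experience library: embed each experience,
# then keep only the top-k items most similar to the query to shorten prompts.
from typing import Callable, List
import numpy as np

def select_relevant_experiences(
    query: str,
    experiences: List[str],
    embed: Callable[[str], np.ndarray],   # placeholder sentence-embedding function
    top_k: int = 5,
) -> List[str]:
    if not experiences:
        return []
    q = embed(query)
    exp_vecs = np.stack([embed(e) for e in experiences])
    # Cosine similarity between the query and every stored experience.
    sims = exp_vecs @ q / (np.linalg.norm(exp_vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:top_k]
    return [experiences[i] for i in top]
```

Such filtering would bound the per-query prompt overhead discussed in the limitations while preserving the plug-in nature of the experience library.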
