Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning
TL;DR Summary
Tool-Light leverages self-evolved sampling and multi-stage fine-tuning to optimize large language models’ tool-integrated reasoning, reducing misuse and enhancing efficiency and accuracy via entropy-based analysis.
Abstract
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. However, models employing TIR often display suboptimal behaviors, such as insufficient or excessive tool usage and overthinking after tool calls. The challenge of incentivizing LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains an open question. In this paper, we start by exploring the impact of tool calls on model reasoning from the perspective of information entropy. Our findings indicate that tool call results lead to a distinct change in the information entropy of subsequent reasoning, with the overall entropy of the reasoning chain varying based on the number of tool calls. Building on these insights, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework includes dataset construction and multi-stage fine-tuning. For dataset construction, we employ continuous self-evolved sampling using the fine-tuned model, integrating both vanilla sampling and entropy-guided sampling. Besides, we establish strict criteria for selecting positive-negative pairs during sampling. The training process involves a two-stage approach, comprising Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, significantly improving the model's efficiency in executing TIR tasks.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning
- Authors: Yifei Chen, Guanting Dong, Zhicheng Dou. Their affiliation is with Renmin University of China.
- Journal/Conference: The paper is a preprint submitted to arXiv. The publication date is listed as September 27, 2025, suggesting it is intended for a future conference or journal submission. arXiv is a common platform for researchers to share their work before or during the formal peer-review process.
- Publication Year: 2025 (as listed on the preprint).
- Abstract: The abstract introduces Tool-Integrated Reasoning (TIR), a method for Large Language Models (LLMs) to use external tools to enhance their reasoning. The authors identify key problems with existing TIR models, such as inefficient tool use (too many or too few calls) and "overthinking." The paper first investigates TIR from an information entropy perspective, finding that tool calls alter the entropy of the model's subsequent text generation. Based on this, they propose Tool-Light, a framework to improve TIR efficiency and accuracy. Tool-Light consists of two main parts: a novel dataset construction method using self-evolved sampling (combining vanilla and entropy-guided approaches) and a two-stage fine-tuning process involving Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO). Experiments on 10 datasets show that Tool-Light significantly improves the model's efficiency in TIR tasks.
- Original Source Link: https://arxiv.org/abs/2509.23285
- PDF Link: https://arxiv.org/pdf/2509.23285v2.pdf
- Publication Status: This is a preprint, meaning it has not yet undergone formal peer review for a conference or journal.
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: Large Language Models (LLMs) are powerful but often fail at complex reasoning tasks that require up-to-date information or precise calculations. Tool-Integrated Reasoning (TIR)—allowing LLMs to use external tools like search engines or code interpreters—is a promising solution. However, current TIR models often use tools suboptimally: they might call tools too often (tool-overuse), not call them when needed (tool-underuse), or get stuck in "analysis paralysis" after a tool provides low-quality results.
  - Importance & Gaps: The challenge is to train LLMs to use tools efficiently and accurately. Previous work often focused only on reducing excessive tool use or was designed for a single tool, failing to generalize to scenarios with multiple tools. These methods did not comprehensively address the full spectrum of incorrect tool calls, including underuse and the reasoning process after a tool call.
  - Fresh Angle: This paper introduces a novel perspective by analyzing the TIR process through the lens of information entropy. The authors hypothesize that the uncertainty (entropy) in a model's output can signal key decision points in the reasoning chain. This insight is used to develop a more intelligent data sampling strategy to train the model. The core innovation is a self-evolving training loop where the model generates its own preference data to learn better tool-use strategies.
- Main Contributions / Findings (What):
- Entropy-based Analysis of TIR: The paper is the first to systematically analyze TIR using information entropy, showing a clear link between tool calls, the resulting information, and the entropy (uncertainty) of the model's subsequent reasoning steps. It finds that reasoning paths with fewer, more effective tool calls tend to have lower overall entropy.
- Tool-Light Framework: A comprehensive framework is proposed to improve TIR, which includes:
- Entropy-Guided Sampling: A novel data generation strategy that identifies high-entropy (most uncertain) points in a reasoning chain and generates alternative reasoning paths (branches) from there. This is more efficient than naively generating many full paths.
- Two-Stage Self-Evolved Training: A training pipeline that starts with standard Supervised Fine-Tuning (SFT) to teach basic tool use, followed by a multi-round Self-Evolved Direct Preference Optimization (DPO). In the DPO stage, the model continuously generates its own training data (pairs of "good" and "bad" reasoning paths) and learns from them, progressively improving its tool-use efficiency and accuracy.
- Demonstrated Effectiveness: The Tool-Light framework, when applied to a Qwen2.5-7B model, achieves state-of-the-art or highly competitive results across 10 challenging mathematical and knowledge-intensive reasoning datasets, outperforming previous methods in both correctness and efficiency.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT-4, Llama 3) trained on vast amounts of text to understand and generate human-like language. They have powerful internal knowledge but are limited to what they learned during training.
- Tool-Integrated Reasoning (TIR): This paradigm extends LLMs by allowing them to call external tools (e.g., a web search API, a calculator, a code interpreter) to perform tasks they cannot do internally. The LLM generates a "thought" process, decides to call a tool with specific inputs, receives the tool's output, and then continues its reasoning to arrive at a final answer.
- Information Entropy: In information theory, entropy measures the average level of "surprise" or "uncertainty" in a variable's possible outcomes. In the context of LLMs, the entropy of the next token prediction reflects the model's uncertainty. High entropy means the model finds many tokens plausible, while low entropy means it is confident about the next token. This paper uses entropy to find "critical" points in the reasoning process where the model is most uncertain. A short sketch of this computation appears after this list.
- Supervised Fine-Tuning (SFT): A common technique to adapt a pre-trained LLM for a specific task. It involves training the model on a dataset of high-quality examples (e.g., question-answer pairs with correct tool use).
- Direct Preference Optimization (DPO): A technique for aligning LLMs with human or AI preferences. Instead of a complex reinforcement learning (RL) pipeline, DPO directly optimizes the model to increase the probability of "preferred" (winning) responses and decrease the probability of "dispreferred" (losing) responses, given a dataset of preference pairs.
- Self-Evolved Learning: A training paradigm where a model improves by generating its own training data. The model acts as both the student and the teacher, creating progressively more challenging or higher-quality examples to learn from.
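To make the entropy signal concrete, here is a minimal Python sketch (not from the paper) that computes per-position next-token entropy from a Hugging Face causal LM; the checkpoint name is only a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def token_entropies(text: str) -> torch.Tensor:
    """Return the next-token entropy (in nats) at every position of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits                 # (1, seq_len, vocab_size)
    probs = torch.softmax(logits.float(), dim=-1)
    # H_t = -sum_j p_j * log p_j, computed independently at each position.
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1).squeeze(0)

ent = token_entropies("The capital of France is")
print(ent)  # the entropy at the final position reflects how confident the model is about the next token
```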
- Previous Works & Differentiation:
- Early TIR Methods: Many studies like IKEA and SMART focused on teaching models to recognize their knowledge boundaries to decide when to call a tool. Others like Self-DC used internal model signals for control. However, these often addressed single-tool scenarios or focused mainly on tool-overuse.
- RL-based Optimization: Works like Search Wisely, OTC, and CoRT used reinforcement learning to optimize tool usage. These methods can be complex to train and often require carefully designed reward functions. The paper notes they are hard to generalize to multiple tools.
- Multi-Tool Reasoning: Tool-Star explored reasoning with multiple tools but, like others, primarily focused on overuse, neglecting underuse and the quality of reasoning after a tool call.
- Entropy in Reasoning: Prior works (Cui et al., 2025; Wang et al., 2025c) have used entropy to analyze general reasoning chains, finding that high-entropy parts often dictate the reasoning direction. This paper is the first to apply this perspective specifically to TIR.
- Tool-Light's Differentiation: Tool-Light stands out by:
  - Addressing a broader problem: It tackles tool overuse, underuse, and "overthinking" simultaneously.
  - Using entropy for data sampling: Its entropy-guided sampling is a novel and computationally cheaper way to generate diverse and useful training data.
  - Employing a self-evolved DPO loop: This allows the model to dynamically adapt and improve its own capabilities without constant human supervision, adjusting data difficulty to its current skill level.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Tool-Light framework, which comprises two main components: Dataset Construction and a Two-Stage TIR Training Paradigm.
4.1. Dataset Construction
The goal is to create high-quality preference data (pairs of good and bad reasoning paths) for DPO training.
1. Source Data Construction:
- First, an initial SFT model, M_sft, is trained on an existing TIR dataset (D_sft).
- This M_sft model is then used to directly infer answers for questions in D_sft without access to any tools.
- The paper retains only the questions where the model's answer is incorrect. This filtered dataset is called D_source.
- Intuition: These are "hard" questions that the model cannot solve using its internal knowledge alone, making them ideal candidates for learning effective tool use.
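Before turning to the sampling strategies, here is a minimal sketch of this filtering step, assuming simple callables for tool-free generation and answer checking (the function names are illustrative, not from the paper).

```python
def build_d_source(d_sft, generate_without_tools, is_correct):
    """Keep only the questions the SFT model gets wrong without tool access.

    d_sft: iterable of (question, gold_answer) pairs.
    generate_without_tools: callable question -> model answer (no tools).
    is_correct: callable (prediction, gold) -> bool, e.g. exact match.
    """
    d_source = []
    for question, gold in d_sft:
        prediction = generate_without_tools(question)
        if not is_correct(prediction, gold):
            # Hard question: internal knowledge alone is insufficient,
            # so it is a good candidate for learning effective tool use.
            d_source.append((question, gold))
    return d_source
```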
2. Sampling Strategy Design:
The M_sft model is then used to generate multiple reasoning paths (trajectories) for each question in D_source, this time with access to tools. Two strategies are combined to create a diverse set of paths, denoted D_dpo.
- (1) Vanilla TIR Sampling: For each question, the model generates multiple complete reasoning paths from start to finish. This is a straightforward but computationally expensive way to explore different solutions. The resulting paths form one part of the candidate pool.
- (2) Entropy-Guided Sampling: This is the more innovative and efficient strategy, visualized in Figure 2.
  - Step 1: Generate a Main Chain: A single, primary reasoning path (C_main) is generated for a question.
  - Step 2: Calculate Entropy Distribution: The reasoning chain is divided into steps (segments of thought). For each step, the model's uncertainty is measured by calculating the information entropy at each token position. The entropy at position $t$ is
    $$H_t = -\sum_{j=1}^{|V|} p_j(x_{<t}) \log p_j(x_{<t}),$$
    where $|V|$ is the vocabulary size, $x_{<t}$ is the sequence of tokens before position $t$, and $p_j(x_{<t})$ is the model's predicted probability for the $j$-th token in the vocabulary at position $t$.
  - Step 3: Identify Branching Points: The authors calculate the average entropy over initial token subsequences (e.g., first 10, 20, ..., 50 tokens) within each reasoning step to find the points of highest uncertainty. The steps with the top-$k$ highest average entropy are selected as "branching points."
  - Step 4: Branch Sampling: Instead of restarting from scratch, the model resumes generation from these high-entropy branching points, creating multiple alternative continuations or "branches."
  - Intuition: High-entropy points are where the model is most undecided. Forcing it to explore alternatives from these points is more likely to yield diverse and meaningful variations in the reasoning path compared to random sampling. This reduces redundant computation, as the initial shared part of the path is generated only once.
The paths from both vanilla and entropy-guided sampling are combined to form the candidate pool D_dpo for creating positive-negative pairs.
Figure 2 (schematic): The overall flow of entropy-guided sampling. Gray and red nodes mark tool-call positions and branching positions, respectively, tracing multiple reasoning paths from the question to the answer.
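A rough Python sketch of Steps 2-4, assuming per-step token entropies are already available (e.g., via the earlier entropy sketch); the prefix-averaging rule and the helper generate_from_prefix are our reading of the description, not the authors' code.

```python
import numpy as np

def select_branch_points(step_entropies, prefix_lens=(10, 20, 30, 40, 50), top_k=3):
    """Score each reasoning step by its average entropy over initial token
    prefixes (first 10, 20, ..., 50 tokens) and return the indices of the
    top-k most uncertain steps, which serve as branching points."""
    scores = []
    for ent in step_entropies:
        ent = np.asarray(ent, dtype=float)
        if ent.size == 0:
            scores.append(0.0)
            continue
        prefix_avgs = [ent[:n].mean() for n in prefix_lens]
        scores.append(float(np.mean(prefix_avgs)))
    return sorted(np.argsort(scores)[-top_k:].tolist())

def branch_sample(generate_from_prefix, main_chain_steps, branch_points, n_branches=4):
    """Resume generation from each high-entropy step, reusing the shared prefix
    instead of regenerating the whole chain from scratch."""
    branches = []
    for idx in branch_points:
        prefix = "".join(main_chain_steps[:idx])
        for _ in range(n_branches):
            branches.append(prefix + generate_from_prefix(prefix))
    return branches
```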
4.2. Two-Stage TIR Training Paradigm
The training pipeline, shown in Figure 3, is designed to progressively enhance the model's TIR capabilities.

Stage 1: Supervised Fine-Tuning (SFT)
- The base LLM is fine-tuned on a standard dataset of correct TIR examples (D_sft).
- The goal is to provide the model with a foundational ability to understand the syntax of tool calls and follow a basic reasoning structure.
- The loss function is the standard cross-entropy loss:
  $$\mathcal{L}_{\text{SFT}} = -\,\mathbb{E}_{(x, y) \sim D_{\text{sft}}} \left[ \sum_{t} \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \right],$$
  where $x$ is the input question and $y$ is the target reasoning path with correct tool use. The resulting model is M_sft.
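A minimal PyTorch sketch of this objective (our illustration, not the authors' code); whether Tool-Light masks prompt or tool-output tokens is not stated here, so this version simply masks the question tokens.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, question: str, trajectory: str) -> torch.Tensor:
    """Next-token cross-entropy on the target trajectory only.

    The question tokens get label -100 so the loss matches
    L_SFT = -sum_t log pi_theta(y_t | x, y_<t).
    """
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    y_ids = tokenizer(trajectory, return_tensors="pt").input_ids
    input_ids = torch.cat([q_ids, y_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : q_ids.shape[1]] = -100          # ignore the prompt tokens
    logits = model(input_ids).logits
    # Shift so that token t is predicted from tokens < t.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```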
Stage 2: Self-Evolved DPO
This stage further refines M_sft using preference data. It is broken down into two phases. The core algorithm is DPO, with the loss function:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

- Symbol Explanation:
  - $\pi_\theta$: The policy (model) being trained.
  - $\pi_{\text{ref}}$: A reference policy, which is a frozen copy of the model before DPO training begins (in this case, M_sft). It helps prevent the trained model from deviating too far from its initial capabilities.
  - $y_w$: The "winning" or preferred response.
  - $y_l$: The "losing" or dispreferred response.
  - $\beta$: A hyperparameter that controls the strength of the preference.
  - $\sigma$: The sigmoid function, which maps the log-probability difference to a value between 0 and 1.

The goal of this loss is to maximize the relative probability of the winning response $y_w$ while minimizing that of the losing response $y_l$.
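A minimal sketch of the standard DPO objective written above; the inputs are summed trajectory log-probabilities, and beta = 0.1 is just a common default, not necessarily the paper's setting.

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is the summed log-probability of the winning (w) or losing (l)
    trajectory under the trained policy or the frozen reference model (M_sft).
    """
    policy_margin = policy_logps_w - policy_logps_l
    ref_margin = ref_logps_w - ref_logps_l
    # -log sigmoid(beta * (policy log-ratio margin minus reference margin))
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```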
The two phases of self-evolved DPO use different criteria for selecting pairs:
- (Phase I) Pre-Aligned DPO Training:
  - Goal: Teach the model to be efficient—reduce unnecessary tool calls and avoid overthinking—while maintaining correctness.
  - Data: The D_dpo set is generated using the M_sft model. Trajectories are classified as correct (F1 score = 1) or incorrect (F1 score = 0).
  - Pair Selection Criteria:
    - Positive Example ($y_w$): The correct trajectory with the fewest tool calls and lowest entropy. This path is considered the most efficient and direct solution.
    - Negative Example ($y_l$): An incorrect trajectory that uses more tool calls than the positive example. This teaches the model to avoid convoluted, erroneous paths.
  - This phase produces a model M_dpo1 that is biased towards shorter, correct reasoning paths.
- (Phase II) Self-Evolved DPO Alignment:
  - Goal: Teach the model to make necessary tool calls, correcting for any tendency towards tool-underuse that might have been learned in the previous phase.
  - Process: This phase is iterative. The model from the previous step (M_dpo1) is used to generate a new set of reasoning paths. The training data is dynamically adjusted based on the model's current performance on each question.
  - Pair Selection Criteria (illustrated in the sketch after this list):
    - For "Easy" questions (where the model now generates many correct paths):
      - Positive ($y_w$): A correct trajectory with fewer tool calls (reinforcing efficiency).
      - Negative ($y_l$): An incorrect trajectory with the most tool calls (punishing inefficiency and error).
    - For "Hard" questions (where the model still struggles):
      - Positive ($y_w$): The correct trajectory with the longest reasoning chain. This encourages the model to explore more complex reasoning if that is what is needed to find the correct answer, counteracting the bias for brevity.
      - Negative ($y_l$): An incorrect trajectory with the shortest reasoning chain. This teaches the model that being concise is not good if it leads to an error.
  - This iterative process of sampling and training continues for several loops, allowing the model to balance efficiency with the necessity of making enough tool calls to solve hard problems. The final model is M_dpo2.
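The sketch below illustrates how the two phases' selection criteria could be implemented over a pool of sampled trajectories; the trajectory fields and the easy/hard threshold are illustrative assumptions, since the summary does not give the paper's exact split.

```python
def select_pair_phase1(trajs):
    """Phase I criteria: efficient-correct positive vs. tool-heavy-incorrect negative.

    trajs: list of dicts with keys 'correct' (bool), 'tool_calls' (int),
           'entropy' (float), 'length' (int). Returns (positive, negative) or None.
    """
    correct = [t for t in trajs if t["correct"]]
    wrong = [t for t in trajs if not t["correct"]]
    if not correct or not wrong:
        return None
    pos = min(correct, key=lambda t: (t["tool_calls"], t["entropy"]))
    # Any incorrect trajectory with more tool calls than the positive qualifies.
    candidates = [t for t in wrong if t["tool_calls"] > pos["tool_calls"]]
    if not candidates:
        return None
    return pos, candidates[0]

def select_pair_phase2(trajs, easy_threshold=0.5):
    """Phase II criteria, switching on how often the model already succeeds."""
    correct = [t for t in trajs if t["correct"]]
    wrong = [t for t in trajs if not t["correct"]]
    if not correct or not wrong:
        return None
    if len(correct) / len(trajs) >= easy_threshold:     # "easy" question
        pos = min(correct, key=lambda t: t["tool_calls"])
        neg = max(wrong, key=lambda t: t["tool_calls"])
    else:                                               # "hard" question
        pos = max(correct, key=lambda t: t["length"])
        neg = min(wrong, key=lambda t: t["length"])
    return pos, neg
```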
5. Experimental Setup
- Datasets: 10 datasets were used, split into two categories:
  - Mathematical-Reasoning: AIME24, AIME25, AMC23, MATH, MATH500, GSM8K. These require logical deduction and precise calculations, making them good tests for the code interpreter tool.
  - Knowledge-Intensive: HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle. These are multi-hop question-answering tasks that require finding and synthesizing information from multiple sources, testing the search tool.
- Evaluation Metrics:
  - Correctness:
    - For math tasks, LLM-as-Judge was used, where a powerful model (Qwen2.5-72B-Instruct) evaluates the correctness of the final answer.
    - For knowledge tasks, F1 score was used, which measures the overlap between the predicted and ground-truth answers.
  - Efficiency (Effi): A custom metric to measure performance per tool call.
    - Conceptual Definition: It rewards models for achieving high performance (correctness) while using fewer tools. A higher Effi score means better efficiency.
    - Mathematical Formula: in essence, the per-sample score divided by the number of tool calls, averaged over the test set:
      $$\text{Effi} = \frac{1}{N} \sum_{i=1}^{N} \frac{s_i}{T_i}$$
    - Symbol Explanation:
      - $N$: Total number of samples in the test set.
      - $s_i$: The performance score (e.g., 1 for correct, 0 for incorrect) for the $i$-th sample.
      - $T_i$: The number of tool calls used for the $i$-th sample.
  - Necessity (Nece): A custom metric to evaluate if the model is avoiding tool underuse.
    - Conceptual Definition: This metric assesses the model's ability to make necessary tool calls. It increases when the model avoids situations where using more tools leads to incorrect answers, and decreases when it fails to find correct answers that other, shorter paths found. A higher Nece score suggests the model is better at making necessary calls without overusing tools.
    - Mathematical Formula: in essence, the min-max-scaled average of the difference between the two counts defined below:
      $$\text{Nece} = \text{MMScaling}\!\left( \frac{1}{N} \sum_{i=1}^{N} \left( n_i^{\text{more,wrong}} - n_i^{\text{fewer,right}} \right) \right)$$
    - Symbol Explanation:
      - $\text{MMScaling}$: Min-Max Scaling, which normalizes the final score to a standard range (e.g., [0, 1]).
      - $N$: Total number of samples.
      - $n_i^{\text{more,wrong}}$: For the $i$-th sample, the count of alternative reasoning paths that used more tool calls than the model's chosen path but ended up with an incorrect answer.
      - $n_i^{\text{fewer,right}}$: For the $i$-th sample, the count of alternative paths that used fewer tool calls but got the correct answer.
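A small sketch of how the two metrics could be computed under the reconstruction above; the guard against zero tool calls and the scope of the min-max scaling (across the methods being compared) are our assumptions.

```python
def efficiency(scores, tool_calls):
    """Effi: average correctness per tool call over the test set."""
    n = len(scores)
    # max(t, 1) guards against trajectories with zero tool calls (assumption).
    return sum(s / max(t, 1) for s, t in zip(scores, tool_calls)) / n

def necessity_raw(more_but_wrong, fewer_but_right):
    """Raw Nece before scaling: reward avoided over-calls, penalize missed
    shorter-but-correct paths."""
    n = len(more_but_wrong)
    return sum(m - f for m, f in zip(more_but_wrong, fewer_but_right)) / n

def min_max_scale(values):
    """Min-max scale the raw Nece scores of all compared methods into [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]
```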
- Baselines:
  - Single-Tool-Integrated Reasoning: Models trained specifically for one type of tool, such as Search-R1 (search) and ToRL (code).
  - Multi-Tool-Integrated Reasoning:
    - Prompting-Based: Using an LLM with carefully crafted prompts to guide tool use, without any fine-tuning.
    - ReCall: A method that also uses DPO but focuses on different aspects.
    - Tool-Star: A strong baseline that also uses reinforcement learning for multi-tool reasoning.
  - Direct Inference: The base LLM without any tools.
6. Results & Analysis
- Core Results: The main results are presented in Table 1.

(Manual transcription of Table 1 from the paper) Table 1: Results on 10 reasoning tasks, with the top two results highlighted in bold and underlined in the original. Unless noted, Qwen2.5-7B-Instruct is used as the backbone. For ToRL and ReTool, their RL models generate training data inferences for SFT, which serves as the baseline. Abbreviations: 2Wiki. (2WikiMultiHopQA), HQA. (HotpotQA), MSQ. (MuSiQue), Bamb. (Bamboogle).

| Method | AIME24 | AIME25 | AMC23 | MATH | MATH500 | GSM8K | HQA | 2Wiki. | MSQ | Bamb. | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Direct Inference* | | | | | | | | | | | |
| Qwen2.5-7B-Instruct | 0.0 | 6.7 | 30.0 | 68.6 | 57.2 | 71.4 | 26.1 | 25.6 | 7.9 | 36.5 | 33.0 |
| Llama3.1-8B-Instruct | 0.0 | 3.3 | 15.0 | 52.8 | 33.4 | 75.0 | 16.2 | 13.7 | 7.4 | 23.2 | 24.0 |
| *Single-TIR Methods* | | | | | | | | | | | |
| Search-o1 | 6.7 | 10.0 | 37.5 | 73.6 | 61.8 | 80.2 | 41.1 | 35.4 | 13.2 | 39.8 | 39.9 |
| Search-R1 | 16.7 | 6.7 | 45.0 | 81.2 | 63.8 | 82.4 | 48.7 | 40.0 | 24.1 | 47.4 | 45.6 |
| DotaMath | 16.7 | 10.0 | 50.0 | 74.6 | 62.2 | 82.6 | 26.2 | 21.7 | 6.5 | 28.6 | 37.9 |
| ToRL | 30.0 | 26.7 | 67.5 | 87.0 | 80.2 | 89.2 | 41.3 | 35.4 | 9.5 | 36.9 | 50.4 |
| ReTool | 23.3 | 30.0 | 62.5 | 84.8 | 78.4 | 86.2 | 31.5 | 29.0 | 11.1 | 35.8 | 47.3 |
| *Multi-TIR Methods* | | | | | | | | | | | |
| Prompting-Based | 6.7 | 13.3 | 47.5 | 73.8 | 62.2 | 69.4 | 21.1 | 23.8 | 9.9 | 25.5 | 35.3 |
| ReCall | 3.3 | 6.7 | 27.5 | 73.2 | 54.6 | 79.8 | 51.9 | 54.0 | 25.0 | 55.5 | 43.2 |
| Tool-Star | 30.0 | 26.7 | 65.0 | 85.6 | 77.2 | 89.4 | 54.7 | 55.7 | 22.8 | 58.8 | 56.6 |
| *Ours* | | | | | | | | | | | |
| Tool-Light (Llama) | 10.0 | 6.7 | 30.0 | 59.4 | 56.8 | 76.6 | 41.3 | 33.5 | 12.2 | 41.3 | 36.8 |
| Tool-Light (Qwen) | 33.3 | 23.3 | 67.5 | 87.4 | 79.0 | 92.0 | 57.7 | 56.1 | 25.0 | 58.7 | 58.0 |

  - Insight 1: Tool-use requires training. The Prompting-Based method performs poorly compared to fine-tuned models like Tool-Star and Tool-Light, especially on knowledge-intensive tasks. This shows that simply giving a model access to tools is not enough; it needs to be explicitly trained to use them effectively.
  - Insight 2: Multi-tool training is crucial for generalization. Single-tool models show biased performance. ToRL, trained for code, excels at math but is mediocre on knowledge tasks. Search-R1, trained for search, performs well on knowledge tasks but lags in math. In contrast, Tool-Star and Tool-Light, trained for multiple tools, achieve strong performance across both categories, with Tool-Light having the highest average score (58.0).
  - Insight 3: The Tool-Light framework is highly effective. Tool-Light (Qwen) achieves the best or second-best score on every single dataset. It surpasses the strong Tool-Star baseline on average, demonstrating the superiority of its self-evolved DPO training pipeline over other RL and SFT methods.
- Quantitative Analysis:
  - Tool-Use Effectiveness (Figure 4): This figure shows that Tool-Light is superior in both efficiency and necessity. It achieves the highest Effi score, indicating it gets more correct answers per tool call. It also has the highest Nece score, suggesting it strikes the best balance between calling tools when necessary and avoiding underuse. Furthermore, the right panel shows that Tool-Light produces shorter reasoning sequences than Tool-Star while achieving higher accuracy, confirming that it effectively reduces "overthinking."
  Figure 4 (chart): Comparison of Tool-Light and baseline methods on efficiency, necessity, and output sequence-length distribution. The left panel shows Tool-Light achieving the best efficiency score, the middle panel its higher necessity score, and the right panel compares the sequence-length distributions of Tool-Light and Tool-Star, with Tool-Light concentrated at shorter lengths.
  - Entropy Distribution Analysis (Figure 5): This analysis compares the entropy of output sequences from different models. Tool-Light consistently produces reasoning paths with lower information entropy compared to baselines like Search-R1 and ReCall. The authors attribute this to the training process, which explicitly learns from low-entropy (more certain, direct) paths selected as positive examples. By favoring these paths, the model becomes more "confident" and less prone to the kind of high-entropy uncertainty that can lead to overthinking.
  Figure 5 (chart): Comparison of the output-sequence entropy distributions of different methods, showing how entropy evolves over four steps for Tool-Light, Search-R1, and ReCall.
- Ablations / Parameter Sensitivity: Table 2 investigates the impact of different components of the Tool-Light framework.

(Manual transcription of Table 2 from the paper) Table 2: Ablation experiment results for various aspects. "1/1 strategy ratio" indicates a 1:1 data ratio for the two sampling strategies; "p-r." and "n-r." denote random positive and random negative example selection.

| Method | Performance | Efficiency | Necessity |
| --- | --- | --- | --- |
| Tool-Light (2 loop) | 58.0 | 0.44 | 0.75 |
| *Ablation for self-evolved loops* | | | |
| w. 1 loop | 57.9 (-0.1) | 0.42 (-0.02) | 0.71 (-0.04) |
| w. 3 loop | 56.1 (-1.9) | 0.39 (-0.05) | 0.73 (-0.02) |
| w. 4 loop | 56.4 (-1.6) | 0.37 (-0.07) | 0.71 (-0.04) |
| w. 5 loop | 54.1 (-3.9) | 0.36 (-0.08) | 0.72 (-0.03) |
| *Ablation for sampling criteria* | | | |
| w. 1/1 data ratio | 56.9 (-1.1) | 0.44 | 0.76 (+0.01) |
| w. p-r. | 53.6 (-4.4) | 0.42 (-0.02) | 0.63 (-0.12) |
| w. n-r. | 53.9 (-4.1) | 0.41 (-0.03) | 0.74 (-0.01) |
- Impact of Sampling Criteria: The most significant drop in performance occurs when the positive (
p-r) or negative (n-r) examples for DPO are chosen randomly instead of using the carefully designed criteria. This highlights that the quality of the preference pairs is crucial for the success of the DPO training. Simply having preference pairs is not enough; they must clearly distinguish between efficient, correct reasoning and inefficient, incorrect reasoning.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces Tool-Light, a novel and effective framework for training LLMs to use tools both accurately and efficiently. By analyzing TIR through the lens of information entropy, the authors developed an innovative entropy-guided sampling method for data construction. This data is then used in a two-stage training pipeline featuring a self-evolved DPO process that dynamically adjusts training to the model's evolving capabilities. The resulting model outperforms strong baselines across a wide range of reasoning tasks, demonstrating superior performance, efficiency, and a reduced tendency to "overthink."
- Limitations & Future Work:
- The paper does not explicitly state its limitations. However, potential limitations can be inferred. The self-evolutionary process could be at risk of mode collapse, where the model continually reinforces its own biases or fails to explore truly novel reasoning strategies. The effectiveness of the entropy-guided sampling also depends on the assumption that high entropy accurately pinpoints the most useful branching points, which may not always hold true. The framework's performance is also tied to the quality of the initial SFT model and the base LLM.
- Personal Insights & Critique:
- Novelty and Significance: The use of information entropy to guide data sampling for TIR is a clever and well-motivated innovation. It provides a principled way to make the expensive process of generating training data more efficient. The self-evolved DPO pipeline, with its dynamic criteria for "hard" and "easy" problems, is a sophisticated approach to curriculum learning that allows the model to balance exploitation (efficiency) and exploration (correctness on hard tasks).
- Practical Implications: The framework provides a clear and reproducible recipe for improving tool use in LLMs. The custom metrics, Effi and Nece, are valuable contributions for evaluating TIR models beyond simple accuracy.
- Potential Improvements:
- The criteria for selecting positive and negative examples, while effective, are heuristic. Future work could explore learning these criteria automatically.
- The evaluation relies on an LLM-as-a-Judge for math tasks, which, while practical, is not infallible and can have its own biases.
- Overall, this is a strong paper that makes a significant contribution to the field of tool-augmented LLMs. It combines a solid theoretical insight (entropy) with a robust and well-engineered training framework (Tool-Light) to address a critical and practical problem.