Self-Improving LLM Agents at Test-Time


Published: 2025/10/08


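TL;DR

This paper proposes a test-time self-improvement method that strengthens LLM agents on the fly in three steps: uncertainty detection, self-data augmentation, and test-time fine-tuning. TT-SI raises average accuracy by 5.36 absolute points while training on 68× fewer samples than standard fine-tuning, and TT-D further improves performance in more complex scenarios, pointing to a low-cost paradigm for self-evolving agents.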

Abstract

Under review as a conference paper at ICLR 2026. Anonymous authors. Paper under double-blind review.

One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this paper, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps:


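1. Paper Information

1.1. Title

Self-Improving LLM Agents at Test-Time

1.2. Authors

Anonymous authors (paper under double-blind review).

1.3. Venue

The paper was posted on OpenReview on 2025-10-08. Its header reads "Under review as a conference paper at ICLR 2026", so the target venue is ICLR 2026.

1.4. Publication Year

2025

1.5. Abstract

One paradigm of large language model (LLM) fine-tuning relies on building large training datasets, on the assumption that high quantity and diversity will let the model generalize to novel tasks after post-training. In practice, collecting large datasets is inefficient and training on them is expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Existing techniques also rarely assess whether a training sample provides new information or is redundant with what the model already knows, which incurs unnecessary cost. This paper explores a new test-time self-improvement method for creating more effective and more generalizable agentic LLMs on the fly. The algorithm has three steps: (i) it uses an uncertainty function to identify samples the model struggles with (self-awareness); (ii) it generates similar examples from the detected uncertain samples (self-data augmentation); and (iii) it learns from these newly generated samples through test-time fine-tuning (self-learning). Two variants are studied: Test-Time Self-Improvement (TT-SI), in which the same model generates additional training examples from its own uncertain cases and learns from them, and Test-Time Distillation (TT-D), in which a stronger model generates similar examples for the uncertain cases so that the student model can adapt with distilled supervision. Empirical evaluation across agent benchmarks shows that TT-SI improves average accuracy by an absolute $+5.36\%$ over other standard learning methods while training on 68× fewer samples, and that TT-D further improves performance in harder scenarios that require diverse training signals. The findings highlight both the promise of TT-SI and the cost and generalization limits of current learning frameworks, and point to test-time self-evolving LLMs as a new paradigm for building more capable agents in complex scenarios.

1.6. Original Link

https://openreview.net/forum?id=M1zSTXY1xr PDF link: https://openreview.net/pdf?id=M1zSTXY1xr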

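2. Overview

2.1. Background and Motivation

The prevailing LLM fine-tuning paradigm relies on building large, diverse training datasets in the hope that the model will generalize to new tasks. This post-training approach faces the following problems:

Inefficient and costly data collection and training: Collecting large, high-quality datasets takes time and labor, and training cost grows with data volume. Human annotation or LLM-based data synthesis, for example, typically takes days to weeks.
No guarantee of generalization: Even at great cost, there is no guarantee that the model will perform well in complex scenarios or on new tasks.
Redundancy and wasted computation: Existing techniques rarely assess whether a training sample gives the model new information or repeats what it already knows. The model may therefore spend compute on many samples it has already learned or that matter little, which raises computation cost and limits generalization, particularly on the long tail of the data distribution and on adversarial examples.
Distributional shift: The test distribution $\mathcal{P}_{\text{test}}$ often differs from the training distribution $\mathcal{P}_{\text{train}}$, so the empirical risk does not accurately reflect the true test risk, which hurts generalization in novel or complex scenarios.
Catastrophic forgetting and model churn: Fine-tuning an LLM on a new task often degrades skills it acquired earlier. In addition, the rapid release of new LLMs demands continual, expensive re-training cycles to exploit each new base model's improved knowledge and reasoning on downstream tasks.

These limitations motivate a new post-training paradigm, rooted in transductive learning and local learning, that adapts the model on the fly and trains only on the most informative samples drawn from the test distribution.

2.2. Core Contributions and Main Findings

The paper proposes a Test-Time Self-Improvement (TT-SI) framework to address the efficiency and generalization problems of current fine-tuning for agentic LLMs. Its contributions and findings are:

A three-stage test-time self-improvement algorithm: Drawing on human learning theory, the algorithm (i) identifies uncertain samples the model struggles with using a new Uncertainty Estimator (H), giving self-awareness; (ii) generates new training examples from those uncertain samples, giving self-data augmentation; and (iii) updates the model on the fly, giving self-learning.
A systematic empirical study of two variants: The paper studies TT-SI and Test-Time Distillation (TT-D) and analyzes key components such as the choice of uncertainty estimator, the learning method at test time, the number of generated samples, and other parameter effects.
Evidence that agentic LLMs can self-improve at test time: Experiments show that agentic LLMs can self-improve during inference even from a single training instance.
Gains over standard inductive learning: The framework outperforms standard inductive learning methods, through both test-time ICL and test-time fine-tuning, with orders-of-magnitude less compute.
Balance of efficiency and accuracy: TT-SI achieves a +5.36% average absolute accuracy gain on three agent benchmarks while using 68× fewer training samples than standard Supervised Fine-Tuning (SFT). Even minimal, uncertainty-guided adaptation can therefore improve performance noticeably during inference.
Further gains from distillation: TT-D further improves performance in complex, context-rich scenarios such as multi-turn dialogue, indicating that higher-quality training signals bring additional benefit.
A fast alternative based on In-Context Learning (ICL): When training is infeasible, TT-SI combined with ICL offers a fast, training-free alternative that outperforms other standard learning methods under similar conditions.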

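3. Preliminaries and Related Work

3.1. Basic Concepts

The following concepts are needed to understand the TT-SI framework:

Large Language Models (LLMs): Deep learning models with very large parameter counts, pre-trained on massive text corpora. They understand and generate natural language and perform a wide range of NLP tasks.
Fine-tuning: Further training a pre-trained model on a task-specific dataset so that it adapts to that task. This usually updates most or all of the model's parameters.
Post-training: The optimization and adjustment applied after large-scale pre-training to target specific tasks or capabilities; fine-tuning is one form of it.
Agent: In the LLM context, an LLM that perceives its environment, makes decisions, and takes actions to reach a goal. Agents typically extend their capabilities by calling external tools or APIs rather than only generating text.
Inductive learning: The traditional paradigm in which a model learns general rules from training data and applies them to unseen test data, with training and testing kept separate. The paper treats standard fine-tuning as inductive learning.
Transductive learning: Unlike inductive learning, it has access to the unlabeled test inputs during training and uses them to improve performance on those specific inputs. It does not aim to learn a universally general model; it predicts for particular test instances.
Local learning: A paradigm that fits local models around specific regions or data points instead of one global model over the whole input space. It is conceptually related to transductive learning.
Test-Time Training (TTT): Making small, temporary parameter updates at inference time for the current test input, so that the model adapts on the fly to distribution shift or to the difficulty of that input.
Parameter-Efficient Fine-Tuning (PEFT): Techniques that reduce the compute and the number of trainable parameters needed to fine-tune an LLM. They usually update only a small fraction of the parameters, for example by adding small trainable modules or low-rank updates to existing weight matrices.
Low-Rank Adaptation (LoRA): A popular PEFT method that injects a pair of small low-rank matrices into layers of the pre-trained Transformer. During fine-tuning the pre-trained weights are frozen and only the low-rank matrices are trained, which greatly reduces the number of trainable parameters.
In-Context Learning (ICL): The ability of an LLM to perform a task from a few examples (few-shot examples) placed in the prompt, without any parameter update. The model identifies the pattern in the examples and applies it to the new input.
Uncertainty function: A function that quantifies the model's confidence in its prediction. Low confidence indicates uncertainty about a sample, which may mean the sample is hard or informative for the model.
Self-awareness: The model's ability to recognize its own uncertainty or weakness on a given task or sample.
Self-data augmentation: The model generates new, semantically similar training data on the fly from its uncertain samples to expand the training set.
Self-learning: The model uses the data produced by self-data augmentation to improve on the current task through test-time fine-tuning.
Distilled supervision: Using the outputs of a stronger teacher model as the training signal for a student model, so that the teacher's knowledge is "distilled" into the student.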

3.2. Prior Work

The paper reviews existing techniques for LLM post-training and agent fine-tuning and points out their limitations.

3.2.1. Fundamental Issues in Inductive Fine-Tuning

Conventional LLM post-training follows the inductive learning principle, which separates training from testing. The model learns generalizable patterns from a large training dataset $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^N$ and then applies them to unseen test instances.

Each input $x_i \in \mathcal{X}$ (for example, a task query) has a corresponding desired output $y_i \in \mathcal{Y}$ (for example, the agent's sequence of actions).
The goal is to find parameters $\theta$ of a mapping $\mathcal{F}_{\theta}: \mathcal{X} \rightarrow \mathcal{Y}$ that minimize the empirical risk: $\hat{\mathcal{L}}_{\text{train}}(\theta)=\frac{1}{N} \sum_{i=1}^{N} \ell\left(\mathcal{F}_{\theta}\left(x_{i}\right), y_{i}\right)$.
Main limitations: As described in Section 2.1, this paradigm suffers from distributional shift, high computation cost, redundancy, catastrophic forgetting, and model churn.

3.2.2. Test-Time Training (TTT)

Test-Time Training performs small, temporary parameter updates during inference, which partially blurs the train-test boundary.

Origins: The idea goes back to local learning and transductive learning, where hypotheses are adapted after the test inputs are observed.
Vision: Sun et al. (2020) show that a simple self-supervised TTT objective improves the robustness of image classifiers under distribution shift.
LLMs: TTT is relatively new for LLMs.
- Hardt & Sun (2024) reduce perplexity by fine-tuning on retrieved nearest neighbors.
- SIFT (Hübotter et al., 2025) actively selects diverse, informative neighbors to limit redundancy.
- Akyürek et al. (2025) apply rule-based linear transformations to in-context test examples in ARC to obtain additional test-time training data.
Limitations of existing work: These methods either target perplexity rather than general reasoning tasks, or assume access to high-quality neighbors or in-context exemplars.

3.3. Differences from Prior Work

TT-SI differs from the inductive learning and TTT work above in several ways:

Active selection of informative instances: Instead of processing every training sample or relying only on retrieved neighbors, TT-SI uses the Uncertainty Estimator (H) to identify the test instances the model struggles with most, which are also the most informative, and focuses compute on them.
Training signal generated on the fly: TT-SI does not depend on a pre-collected large dataset or on high-quality retrieved neighbors. For each uncertain test instance, the Data Synthesis Function (G) generates and filters training signal on the fly, which greatly reduces the cost and difficulty of data collection.
Targeted at LLM agent tasks: The paper presents this as the first application of generation-based test-time fine-tuning to LLM agent tasks, aimed at generalization and efficiency in tool use and other complex agentic tasks.
Avoiding catastrophic forgetting: Test-time fine-tuning in TT-SI is temporary. After an uncertain sample is processed, the parameters are restored to their original values, which avoids the catastrophic forgetting common in inductive learning.
Efficiency together with effectiveness: Experiments show that TT-SI reaches higher accuracy than SFT while using far less data.

4. Methodology

The paper introduces a test-time self-improvement framework that lets an agent learn on the fly from challenging instances. The framework combines three components, Self-Awareness, Self-Augmentation, and Self-Learning; the overall procedure is given in Algorithm 1.

Algorithm 1 below, following the paper, lists the detailed steps of the Test-Time Self-Improvement framework:

Algorithm 1 Test-Time Self-Improvement Framework
Require: Test dataset  $\mathcal{D}_{\text{test}}$ , model  $\mathcal{M}$ , data generation prompt  $\mathcal{P}$ , temporary dataset size  $K$ , threshold  $\tau$ , initial
    model parameters  $\theta_{0}$ 
    for each  $x_{i} \in \mathcal{D}_{\text{test}}$  do
        Step 1: Uncertainty Estimator (H)
        Compute uncertainty (softmax-difference):
             $\ell_{n}=\log P_{\mathcal{M}}\left(a_{n} \mid x_{i}\right), \quad \forall a_{n} \quad \triangleright$  Log-likelihood (negated NLL) of each candidate action
             $p_{n}=\frac{\exp \left(\ell_{n}-\max _{j} \ell_{j}\right)}{\sum_{k} \exp \left(\ell_{k}-\max _{j} \ell_{j}\right)} \quad \triangleright$  Apply Relative Softmax Scoring (RSS) normalization
             $u\left(x_{i}\right)=p^{(1)}-p^{(2)} \quad \triangleright$  Highest minus second-highest RSS scores
        Step 2: Data Synthesis Function (G)
        if  $u\left(x_{i}\right)<\tau$  then  $\triangleright$  Check uncertainty
            Generate  $K$  synthetic samples using LLM:
                 $\mathcal{D}_{i} \leftarrow \mathcal{L}_{\text{gen}}\left(x_{i}, K\right) \quad \triangleright$  Equation (5) of the paper
        Step 3: Test-Time Fine-tuning (T)
            Learn temporary model parameters  $\theta_{i}^{*}$  via LoRA, starting from  $\theta_{0}$ :
                 $\theta_{i}^{*} \leftarrow \arg \min _{\theta} \sum_{\left(x^{\prime}, y^{\prime}\right) \in \mathcal{D}_{i}} \ell\left(\mathcal{M}\left(x^{\prime} ; \theta\right), y^{\prime}\right) \quad \triangleright$  Equation (7) of the paper
            Perform inference with adapted parameters  $\theta_{i}^{*}$  :
                 $\hat{y}_{i} \leftarrow \mathcal{M}\left(x_{i} ; \theta_{i}^{*}\right)$ 
            Reset model parameters:
                 $\theta \leftarrow \theta_{0} \quad \triangleright$  Restore original parameters
        else
            Perform inference directly:
                 $\hat{y}_{i} \leftarrow \mathcal{M}\left(x_{i} ; \theta_{0}\right)$ 
        end if
    end for
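
The control flow of Algorithm 1 can be sketched in Python as below. This is a minimal illustration and not the authors' implementation: the function name `tt_si_loop` and the callables `uncertainty`, `synthesize`, `adapt`, and `predict` are hypothetical placeholders for H, G, T, and the model, and the sketch assumes `adapt` returns fresh temporary parameters without mutating the base ones.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # one synthetic (input, output) training pair


def tt_si_loop(
    test_inputs: List[str],
    uncertainty: Callable[[str], float],           # H: returns u(x), the softmax-difference
    synthesize: Callable[[str, int], List[Pair]],  # G: returns K synthetic pairs for x
    adapt: Callable[[List[Pair]], object],         # T: returns temporary parameters
    predict: Callable[[str, object], str],         # M(x; params)
    base_params: object,
    tau: float,
    k: int,
) -> List[str]:
    """Run the detect / synthesize / adapt / predict / reset loop over test inputs."""
    outputs = []
    for x in test_inputs:
        if uncertainty(x) < tau:
            # Uncertain input: build a temporary dataset and adapt on it.
            pairs = synthesize(x, k)
            temp_params = adapt(pairs)
            outputs.append(predict(x, temp_params))
            # temp_params is dropped here, so the next input starts from base_params.
        else:
            # Confident input: answer directly with the base model.
            outputs.append(predict(x, base_params))
    return outputs
```

Under this framing, the two variants described later in the text differ only in which model backs `synthesize`: the agent itself for TT-SI, or a stronger teacher for TT-D.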

4.1. Method Principle

The core idea is to act at inference time: for test samples on which the model is uncertain, generate related training data on the fly and fine-tune briefly on it. This improves performance on that specific sample while avoiding permanent changes to the base model and the catastrophic forgetting they could cause. The process mirrors how a person who meets a hard problem looks for similar examples to practice on and so fills a gap in knowledge.

4.2. Method Details

4.2.1. Self-Awareness: Selecting Uncertain Samples at Test Time (Uncertainty Estimator, H)

The self-awareness stage identifies the data samples $x_i$ on which the model $\mathcal{M}$ shows high uncertainty during inference. These samples are more likely to be hard or error-prone, and are therefore the most informative for further learning.

Definition: Given a task with input $x_i$, the Uncertainty Estimator (H) estimates the model's confidence score ( $\mathcal{C}$ ) for each candidate action $a_1, \ldots, a_N \in \mathcal{A}$ available in its environment. For each input $x_i$ and candidate action $a_n$, the confidence is computed as $\mathcal{C}_{i}=\mathrm{H}\left(x_{i}, a_{n}, \mathcal{M}\right)$. The estimate is made without the ground-truth label $y_i$, which keeps it fair and applicable at inference time. If $\mathcal{C}_i < \tau$, where $\tau$ is a user-defined confidence threshold, the sample $x_i$ is treated as uncertain. By filtering out high-confidence (certain) instances, this step concentrates compute and learning on the most informative and most challenging problems, which improves both efficiency and quality.

Selecting uncertain samples: To identify uncertain samples systematically, the paper uses a margin-based confidence estimator built on the likelihood distribution that the model $\mathcal{M}$ produces for a given input $x_i$.

Compute the negative log-likelihood (NLL): Given the available actions $a_1, a_2, \ldots, a_N$, first compute the NLL of each action: $\operatorname{NLL}\left(a_{n} \mid x_{i}\right)=-\log P_{\mathcal{M}}\left(a_{n} \mid x_{i}\right), \quad \forall n \in \{1,2, \ldots, N\}$, where $P_{\mathcal{M}}(a_n \mid x_i)$ is the probability the model $\mathcal{M}$ assigns to action $a_n$ given input $x_i$. A larger NLL means lower confidence in that action.
Apply Relative Softmax Scoring (RSS): Raw NLL scores are unbounded, so they are hard to interpret directly as uncertainty. RSS converts them into a normalized, interpretable confidence distribution: $p_{n}=\frac{\exp \left(\ell_{n}-\max _{j} \ell_{j}\right)}{\sum_{k=1}^{N} \exp \left(\ell_{k}-\max _{j} \ell_{j}\right)}, \quad \text { where } \quad \ell_{n}=-\operatorname{NLL}\left(a_{n} \mid x_{i}\right)$. Here:
- $p_n$ is the RSS confidence score of action $a_n$.
- $\ell_n$ is the log-likelihood of $a_n$, that is, its negated NLL.
- $\max_j \ell_j$ is the largest log-likelihood among the candidate actions and serves as a numerical stabilizer. Subtracting it shifts all log-likelihoods so that the maximum is 0 before the softmax, which leaves the result unchanged. Equivalently, $p_n$ is $P_{\mathcal{M}}(a_n \mid x_i)$ renormalized over the candidate set.
Quantify uncertainty with the softmax-difference: Prediction uncertainty is measured as the gap between the highest and second-highest RSS scores, called the softmax-difference. Formally, the uncertainty of input $x_i$ is $u\left(x_{i}\right)=p^{(1)}-p^{(2)}$, where $p^{(1)}$ and $p^{(2)}$ are the highest and second-highest RSS scores. The smaller the softmax-difference, the closer the model's confidence in the best action is to its confidence in the runner-up, and the more uncertain the model is.
Select uncertain samples: Finally, with the user-defined threshold $\tau$, select the samples that show high uncertainty ( $u(x_i) < \tau$ ). This focuses subsequent adaptation or analysis on the most ambiguous instances, where the model is most likely to benefit from further information or improvement.
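
The RSS and softmax-difference computations above can be illustrated with a short Python sketch. The function names, the example log-likelihoods, and the threshold value `tau = 0.5` are assumptions made for illustration; the text does not state the threshold the paper uses.

```python
import math
from typing import List


def rss_scores(log_likelihoods: List[float]) -> List[float]:
    """Softmax over log-likelihoods, shifted by the maximum for numerical stability."""
    m = max(log_likelihoods)
    exps = [math.exp(l - m) for l in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_difference(log_likelihoods: List[float]) -> float:
    """u(x) = p^(1) - p^(2): gap between the two highest RSS scores."""
    if len(log_likelihoods) < 2:
        raise ValueError("need at least two candidate actions")
    ranked = sorted(rss_scores(log_likelihoods), reverse=True)
    return ranked[0] - ranked[1]


def is_uncertain(log_likelihoods: List[float], tau: float) -> bool:
    """A sample is flagged as uncertain when the margin falls below tau."""
    return softmax_difference(log_likelihoods) < tau


# Hypothetical log-likelihoods log P(a_n | x) for three candidate actions.
confident_case = [-0.1, -5.0, -6.0]   # one action clearly dominates
ambiguous_case = [-1.0, -1.1, -4.0]   # top two actions are nearly tied
tau = 0.5                             # assumed threshold, for illustration only
```

With these numbers the first case has a large margin and is answered directly, while the second has a small margin and would trigger data synthesis and test-time fine-tuning.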

4.2.2. Self-Augmentation: Data Generation Strategy (Data Synthesis Function, G)

Once the Uncertainty Estimator (H) flags an input sample $x_i$ as highly uncertain, the Data Synthesis Function (G) immediately starts data synthesis. This stage generates new, relevant training data for the current uncertain instance. The idea is to create a focused, temporary dataset on the fly so that the model can adapt quickly and locally to the specific query it finds hard.

Definition: When an input sample $x_i$ (without its ground-truth label) is processed during inference and flagged as uncertain by H, the synthesis of $K$ new training examples with their labels is triggered. For this uncertain input $x_i$, G is called to generate $K$ new input-output pairs according to the provided instructions: $\mathbf{G}: x_{i} \rightarrow\left\{\left(x_{i j}^{\prime}, y_{i j}^{\prime}\right)\right\}_{j=1}^{K}$, where:

$K$ is a user-defined hyperparameter that sets how many synthetic samples are generated for the current uncertain instance $x_i$.
Each generated pair $(x'_{ij}, y'_{ij})$ is meant to be semantically similar to the original uncertain sample $x_i$ while introducing slight variation.
In practice, $x_i$ serves as the seed example in the prompt and guides the generation of $K$ new synthetic training pairs $(x', y')$ that resemble the original input but broaden the training signal.

These $K$ generated pairs immediately form a temporary, query-specific dataset $\mathcal{D}_i$: $\mathcal{D}_{i}=\left\{\left(x_{i j}^{\prime}, y_{i j}^{\prime}\right)\right\}_{j=1}^{K}$. The dataset $\mathcal{D}_i$ is then used for local adaptation of the model parameters $\theta$ through Test-Time Fine-tuning (T), before the model moves on to subsequent inputs. This detect, synthesize, and adapt cycle is repeated for every sample identified as uncertain.

Generating samples: The synthesis function G (Equation (5) of the paper) is implemented by having the agent itself perform the data synthesis (denoted $\mathcal{L}_{\text{gen}}$) each time an uncertain sample $x_i$ triggers it. This is the self-augmentation.

For each generation, $\mathcal{L}_{\text{gen}}$ receives a hand-crafted prompt $\mathcal{P}$ (see Figure 6 of the paper), the uncertain input $x_i$ as the direct seed (crucially, without its label $y_i$), and the number of samples $K$ to generate.
The model then produces $K$ new input-output pairs, denoted $\{(x'_{ij}, y'_{ij})\}_{j=1}^{K}$.
This seed-based generation, inspired by self-instruct methods, steers $\mathcal{L}_{\text{gen}}$ toward variants that keep the core semantic meaning and task relevance of $x_i$ while introducing controlled surface-level variation.
Synthesizing data on the fly for each uncertain instance enables targeted, timely adaptation aimed precisely at the kinds of queries the model finds hard.
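
A seed-based synthesis step of this kind might be wired up as in the sketch below. Everything here is hypothetical: the prompt wording stands in for the paper's hand-crafted prompt (its Figure 6, which is not reproduced in this text), `generate` is any caller-supplied function that sends a prompt to an LLM and returns its text, and the JSON output format and the format-only filter are assumptions.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical prompt; the paper's actual prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "Here is a seed task query:\n{seed}\n\n"
    "Write {k} new queries that are similar in meaning but vary on the surface, "
    "each with its correct tool call. "
    'Return a JSON list of objects with the keys "input" and "output".'
)


def build_prompt(seed_input: str, k: int) -> str:
    """Fill the template with the unlabeled seed input and the sample count K."""
    return PROMPT_TEMPLATE.format(seed=seed_input, k=k)


def synthesize_pairs(
    seed_input: str, k: int, generate: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Ask the generator for K pairs and keep only well-formed ones."""
    raw = generate(build_prompt(seed_input, k))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable generation: no temporary dataset
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if (
            isinstance(item, dict)
            and isinstance(item.get("input"), str)
            and isinstance(item.get("output"), str)
        ):
            pairs.append((item["input"], item["output"]))
    return pairs[:k]
```

Passing the agent's own generation function gives the self-augmentation setting; passing a function backed by a stronger model gives a distillation-style setting.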

4.2.3. Self-Learning: Test-Time Fine-tuning (T)

Test-time fine-tuning lets a parametric model temporarily update its weights during inference, but it remains underexplored for LLMs, especially on agentic tasks. Building on uncertainty detection (Section 4.2.1) and targeted data synthesis (Section 4.2.2), the paper uses Test-Time Fine-tuning (T) to adapt the model $\mathcal{M}$ during inference with the samples $\mathcal{D}_i$ generated for each uncertain test query $x_i$.

Definition: Once $\mathcal{D}_i$ is obtained through Equation (5) of the paper, the parameters are optimized, starting from the initial values $\theta_0$, to minimize the loss $\mathcal{L}(\mathcal{D}_i; \theta_0)$, which yields temporarily updated parameters $\theta_i^*$ for predicting the target task. Importantly, after the prediction is produced, the model is restored to the original parameters $\theta_0$ for the next iteration with sample $x_{i+1}$. A dedicated prediction model is thus created for each out-of-distribution sample without permanently changing the base model.

Test-time fine-tuning: The main goal of inference-time adaptation is to temporarily adjust the parameters $\theta$ of the model $\mathcal{M}$ to handle the current uncertain sample $x_i$ better. This is done by fine-tuning the model on the newly synthesized dataset $\mathcal{D}_i = \{(x'_{ij}, y'_{ij})\}_{j=1}^{K}$. Adaptation minimizes a task-specific loss $\mathcal{L}_{\text{task}}$ over the samples in $\mathcal{D}_i$.

For a given self-generated sample $(x'_{ij}, y'_{ij}) \in \mathcal{D}_i$, the loss is $\ell(\mathcal{M}(x'_{ij}; \theta), y'_{ij})$. The objective for adapting $\theta$ on $\mathcal{D}_i$ is $\theta_{i}^{*}=\arg \min _{\theta^{\prime}} \sum_{\left(x_{i j}^{\prime}, y_{i j}^{\prime}\right) \in \mathcal{D}_{i}} \ell\left(\mathcal{M}\left(x_{i j}^{\prime} ; \theta^{\prime}\right), y_{i j}^{\prime}\right)$, where:

$\theta_i^*$ denotes the adapted parameters for the context of $x_i$.
$\ell(\cdot, \cdot)$ is the loss function.
$\mathcal{M}(\cdot; \theta')$ is the model with parameters $\theta'$.

To keep inference-time updates computationally efficient, the paper uses Low-Rank Adaptation (LoRA).
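
The "adapt temporarily, then restore" pattern with a low-rank update can be shown on a toy problem. The sketch below is not the paper's training setup: it replaces the LLM with a linear scorer, uses a rank-1 adapter `b * a` added to frozen weights `w`, a squared loss, and plain full-batch gradient descent, and all names, data, and hyperparameters are assumptions chosen so the example runs with the standard library only.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (input vector, target)


def predict(w: List[float], b: float, a: List[float], x: List[float]) -> float:
    """Score with effective weights w + b * a (frozen base plus rank-1 update)."""
    return sum((wj + b * aj) * xj for wj, aj, xj in zip(w, a, x))


def mse(w: List[float], b: float, a: List[float], data: List[Sample]) -> float:
    return sum((predict(w, b, a, x) - y) ** 2 for x, y in data) / len(data)


def fit_temporary_adapter(
    w: List[float], data: List[Sample], steps: int = 300, lr: float = 0.1
) -> Tuple[float, List[float]]:
    """Train only the adapter (b, a); the base weights w are never modified."""
    a = [0.1] * len(w)  # small fixed init for determinism (LoRA normally uses random init)
    b = 0.0             # zero init, so the adapter starts as a no-op
    n = len(data)
    for _ in range(steps):
        grad_b = 0.0
        grad_a = [0.0] * len(w)
        for x, y in data:
            err = predict(w, b, a, x) - y
            # Gradients of the squared loss (constant factor 2 folded into lr).
            grad_b += err * sum(aj * xj for aj, xj in zip(a, x)) / n
            for j, xj in enumerate(x):
                grad_a[j] += err * b * xj / n
        b -= lr * grad_b
        a = [aj - lr * gj for aj, gj in zip(a, grad_a)]
    return b, a


# Toy temporary dataset D_i for one "uncertain" query.
base_w = [1.0, 0.0]
synthetic = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 2.0)]

loss_before = mse(base_w, 0.0, [0.0, 0.0], synthetic)  # base model only
b, a = fit_temporary_adapter(base_w, synthetic)
loss_after = mse(base_w, b, a, synthetic)              # base plus temporary adapter
# "Restoring" the model is just discarding (b, a): base_w was never changed.
```

Because only the adapter is trained, resetting to $\theta_0$ costs nothing beyond dropping the adapter, which is one reason a LoRA-style update suits per-sample adaptation.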

4.3. Two Variants

The paper studies two variants of the framework:

TT-SI (Test-Time Self-Improvement): The model trains on samples that it generates itself in the self-data augmentation stage (G). After H identifies an uncertain sample $x_i$, G uses the model $\mathcal{M}$ itself to generate $K$ new training samples, and $\mathcal{M}$ is then fine-tuned on them at test time (T).
TT-D (Test-Time Distillation): Adaptation is guided by supervision from a stronger teacher model. After H identifies an uncertain sample $x_i$, G does not use $\mathcal{M}$ itself; a stronger teacher model (for example, GPT-5-mini) generates similar examples for the same uncertain cases. The student model, $\mathcal{M}$, then adapts using this supervision distilled from the teacher.

5. Experimental Setup

5.1. Datasets

The method is evaluated on three agent benchmarks: NexusRaven, SealTool, and API-Bank.

5.1.1. NexusRaven

Description: NexusRaven (Srinivasan et al., 2023) is a function-calling benchmark that tests an agent's ability to execute single, nested, and parallel function calls of varying complexity, with an emphasis on real-world software operation tasks in domains such as cybersecurity and enterprise applications.
Characteristics: It is designed to test high-fidelity function execution in business scenarios, and covers 65 distinct APIs with 318 samples in total.

Example: A sample from the NexusRaven test data, from Figure 7 of the paper:

## NexusRaven Test Sample Example (ID: 317)

You are an advanced assistant capable of using tools to help the user. You may call one or more functions to assist with the user query. For any user request that requires a function, respond by returning a function call inside <tool_call>...</tool_call> XML tags, with a JSON object specifying the "name" of the function and the "arguments".

## Task Instruction

In order to complete the user's request, you need to select one or more appropriate tools from the following tools and fill in the correct values for the tool parameters. Your specific tasks are:

1. Make one or more function/tool calls to meet the request based on the question.
2. If none of the functions can be used, point it out as an empty list and refuse to answer.
3. If the given question lacks the parameters required by the function, also point it out.

## Output Format

For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call>[ {"name": "<function-name>", "arguments": {"arg1": "value1", "arg2": "value2", ...} , ...] </tool_call> If no function call is needed, please directly output an empty list '[]' as <tool_call>[]</tool_call>.

## Available Tools:

In your response, you can use the following tools:
<tools>

1. Name: verifyUSAddress

   Description: Verify a given US address to ensure it meets USPS standards and is deliverable.
Parameters: {"addressLine1": {"type": "str", "description": "The primary address line, including street number and name.", "required": True}, "addressLine2": {"type": "str", "description": "The secondary address line, such as apartment or suite number.", "required": True}, "city": {"type": "str", "description": "The city of the address.", "required": True}, "state": {"type": "str", "description": "The state or territory of the address.", "required": True}, "zipCode": {"type": "str", "description": "The 5-digit ZIP code of the address.", "required": True}}
2. Name: standardizeUSAddress

   Description: Standardize a given US address to create consistency and accuracy in addressing.
Parameters: {"addressLine1": {"type": "str", "description": "The primary address line, including street number and name.", "required": True}, "addressLine2": {"type": "str", "description": "The secondary address line, such as apartment or suite number.", "required": True}, "city": {"type": "str", "description": "The city of the address.", "required": True}, "state": {"type": "str", "description": "The state or territory of the address.", "required": True}, "zipCode": {"type": "str", "description": "The 5-digit ZIP code of the address.", "required": True}}
</tools>

## Question

User: I'm organizing a mailing list for my business, and I want to make sure all the addresses are standardized. Can you help me standardize this address? 456 Street, Suite 7891, Los Angeles, CA, 90011.

Your Response: <tool_call>[ {"name": "standardizeUSAddress", "arguments": {"addressLine1": "456 Street", "addressLine2": "Suite 7891", "city": "Los Angeles", "state": "CA", "zipCode": "90011"} } ] </tool_call>

5.1.2. SealTool

Description: SealTool (Wu et al., 2024) is a self-instruct dataset for tool learning that measures precision in tool selection, adherence to output formats, and adaptability across scenarios.
Characteristics: It is one of the most extensive and recent benchmarks, with 4,076 APIs across many domains. Its latest version is designed to minimize potential data leakage. The experiments use a curated test set of 294 samples.

Example: A sample from the SealTool test data, from Figure 8 of the paper:

# SealTool Test Sample Example (ID: 4) 

You are an advanced assistant capable of using tools to help the user. You are given a conversation between a user and an assistant, together with the available tools.
You may call one or more functions to assist with the user query.
You will be provided with a set of Available Functions inside <tools>...</tools> tags.
For any user request that requires a function, respond by returning a function call inside <tool_call>...</tool_call> XML tags, with a JSON object specifying the "name" of the function and the "arguments".

## Task

1. Think and recall relevant context, analyze the current user goal.
2. Refer to the previous dialogue records in the conversations, including the user's queries.
3. Decide on which tool to use from Available Tools and specify the tool name.
4. At the end, you need to output the JSON object of the function call inside the <tool_call> and </tool_call> tags.
5. Output format of the function calls must be EXACTLY like in the Output Format section, the function calls must be a list of JSON objects, each object must have a "name" key and an "arguments" key.
6. This year is 2023 .

## Output Format

For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call>[ {"name": "<function-name>", "arguments": {"arg1": "value1", "arg2": "value2", ...} , ...] </tool_call>

## Available Tools

<tools>

1. Name: analyzeSample

   Description: Analyze a given sample using analytical chemistry techniques
Field: Chemistry/Analytical chemistry
Parameters: {'sample': {'type': 'str', 'description': 'The sample to be analyzed'}, 'method': {'type': 'str', 'description': 'The analytical method to be used for analysis (e.g., chromatography, spectroscopy)'}, 'instrument': {'type': 'str', 'description': 'The instrument or equipment to be used for analysis (e.g., gas chromatograph, mass spectrometer)'}, 'conditions': {'type': 'str', 'description': 'Any specific conditions required for the analysis (e.g., temperature, pressure)'} }
Required: [sample, method]
Responses: {'results': {'type': 'str', 'description': 'The analysis results containing information about the sample'}}
2. Name: analyzeEvidence

   Description: Analyze the chemical evidence collected from a crime scene
Field: Chemical Engineering/Forensic engineering
Parameters: {'evidence_type': {'type': 'str', 'description': 'The type of evidence to be analyzed (e.g., DNA, fingerprints, blood, fibers)'}, 'method': {'type': 'str', 'description': 'The method or technique to be used for analysis (e.g., spectroscopy, chromatography, microscopy)'}, 'sample': {'type': 'str', 'description': 'The sample or specimen to be analyzed (e.g., crime scene swab, hair strand, fabric sample)'}}
Required: [evidence_type, method, sample]
Responses: {'analysis_results': {'type': 'str', 'description': 'The results of the chemical analysis of the evidence'}, 'conclusion': {'type': 'str', 'description': 'The conclusion drawn from the analysis'}}
3. Name: getSampleSize

   Description: Retrieve the sample size of a mixed methods research study
Field: Research/Mixed Methods Research
Parameters: {'study_id': {'type': 'str', 'description': 'The unique identifier of the research study'}}
Required: [study_id]
Responses: {'sample_size': {'type': 'int', 'description': 'The sample size of the research study'}}
4. Name: getFabricComposition

   Description: Retrieve fabric composition information for a specific clothing item
Field: Fashion/Fashion Technology
Parameters: {'clothing_item': {'type': 'str', 'description': 'The type of clothing item for which you want fabric composition (e.g., t-shirt, jeans, dress)'}, 'brand': {'type': 'str', 'description': 'The brand of the clothing item (e.g., Nike, Zara, Gucci)'}}
Required: [clothing_item]
Responses: {'composition': {'type': 'str', 'description': 'The fabric composition of the specified clothing item'}, 'brand': {'type': 'str', 'description': 'The brand of the clothing item'}}
5. Name: evaluateDataBias

   Description: Evaluate data bias in a dataset
Field: Data Analysis/Data Ethics
Parameters: {'dataset': {'type': 'str', 'description': 'The dataset to evaluate for bias (e.g., hiring records, loan applications)'}, 'protected_attributes': {'type': 'str', 'description': 'The protected attributes to consider for bias assessment (e.g., gender, race)'}, 'measures': {'type': 'str', 'description': 'The bias assessment measures to be used (e.g., disparate impact, statistical parity index)'}, 'reference_group': {'type': 'str', 'description': 'The reference group to compare with for bias assessment'}}
Required: [dataset, protected_attributes]
Responses: {'bias_score': {'type': 'float', 'description': 'The overall bias score of the dataset'}, 'protected_attributes_bias': {'type': 'str', 'description': 'Detailed bias assessment for each protected attribute'}}
</tools>

## Input

User: Provide the statistics for the Real Madrid team.
Your Response: <tool_call>[ {"name": "getTeamStats", "arguments": {"team": "Real Madrid"} } ] </tool_call>

5.1.3. API-Bank

Description: API-Bank (Li et al., 2023) evaluates multi-turn user-agent dialogues. The agent must track conversational state, make informed tool calls at each turn, and cope with realistic conditions such as noisy or incomplete inputs.
Characteristics: It contains 314 multi-turn conversations involving 753 API calls. Following prior work, the paper focuses on the 316 samples from Level 1 and Level 2, which balance task complexity and data availability.

Example: A sample from the API-Bank test data, from Figure 9 of the paper:

# API-Bank Test Sample Example (ID: 0) 

You are an advanced assistant capable of using tools to help the user. You are given a conversation between a user and an assistant, together with the available tools.
You may call one or more functions to assist with the user query.
For any user request that requires a function, respond by returning a function call inside <tool_call>...</tool_call> XML tags, with a JSON object specifying the "name" of the function and the "arguments".

## Task

1. Think and recall relevant context, analyze the current user goal.
2. Refer to the previous dialogue records in the conversations, including the user's queries.
3. Decide on which tool to use from Available Tools and specify the tool name.
4. At the end, you need to output the JSON object of the function call inside the <tool_call> and </tool_call> tags.
5. Output format of the function calls must be EXACTLY like in the Output Format section, the function calls must be a list of JSON objects, each object must have a "name" key and an "arguments" key.
6. This year is 2023 .

## Output Format

For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call>[ {"name": "<function-name>", "arguments": {"arg1": "value1", "arg2": "value2", ...} , ...] </tool_call>

## Available Tools:

In your response, you can use the following tools:

<tools>

1. Name: QueryHealthData

   Description: This API queries the recorded health data in database of a given user and time span.
Parameters: {'user_id': {'type': 'str', 'description': 'The user id of the given user. Cases are ignored.'}, 'start_time': {'type': 'str', 'description': 'The start time of the time span. Format: %Y-%m-%d %H:%M:%S'}, 'end_time': {'type': 'str', 'description': 'The end time of the time span. Format: %Y-%m-%d %H:%M:%S'}}
2. Name: CancelRegistration

   Description: This API cancels the registration of a patient given appointment ID.
Parameters: {'appointment_id': {'type': 'str', 'description': 'The ID of appointment.'}}
3. Name: ModifyRegistration

   Description: This API modifies the registration of a patient given appointment ID.
Parameters: {'appointment_id': {'type': 'str', 'description': 'The ID of appointment.'}, 'new_appointment_date': {'type': 'str', 'description': 'The new appointment date. Format: %Y-%m-%d.'}, 'new_appointment_doctor': {'type': 'str', 'description': 'The new appointment doctor.'}}
</tools>

## Conversation

User: Can you please modify my appointment scheduled for March 25th with Dr. Kim to March 26th with Dr. Lee?
Assistant: Sure, I can help you with that. Please provide me with the appointment ID and the new appointment date and doctor's name.
User: The appointment ID is 34567890 and the new date is March 26th with Dr. Lee.
Assistant: Alright. I'll modify your appointment now.
User: Based on our conversation above, please only make one tool call to solve my need.
Output: <tool_call>[{"name": "ModifyRegistration", "arguments": {"appointment_id": "34567890", "new_appointment_date": "2023-03-26", "new_appointment_doctor": "Dr. Lee"}}]</tool_call>

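5.1.4. Rationale for dataset selection

These benchmarks were chosen because they cover complementary aspects of agent tasks:

NexusRaven focuses on the complexity and diversity of function calling.
SealTool evaluates the precision of tool selection and adherence to the output format.
API-Bank targets tool use in multi-turn conversations.

Together, these datasets provide a comprehensive evaluation of TT-SI across different agentic scenarios.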

5.2. Evaluation metrics

The paper uses accuracy as its core performance metric. In the uncertainty-estimation experiments it additionally reports TPR, FPR, F1, and Youden's J to assess the uncertainty estimator.

5.2.1. Accuracy

Definition: Accuracy is the fraction of samples that are predicted correctly. For the agent tasks here, a prediction counts as correct only when the function name, the arguments, and their values/types in the model's function call all match the ground truth.
Formula: $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$
Symbols:
- Number of correct predictions: the number of samples predicted correctly.
- Total number of predictions: the total number of samples evaluated.

5.2.2. True Positive Rate (TPR) / Recall / Sensitivity

Definition: TPR is the fraction of actual positives that are identified as positive. In the uncertainty-estimation setting, a positive is a test sample on which the model's prediction is wrong, so TPR is the fraction of erroneous predictions that the uncertainty estimator flags as uncertain.
Formula: $\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
Symbols:
- TP (true positives): erroneous predictions that are correctly flagged as uncertain.
- FN (false negatives): erroneous predictions that are not flagged and are therefore treated as certain.

5.2.3. False Positive Rate (FPR)

Definition: FPR is the fraction of actual negatives that are wrongly identified as positive. In the uncertainty-estimation setting, it is the fraction of correct predictions that the uncertainty estimator wrongly flags as uncertain.
Formula: $\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$
Symbols:
- FP (false positives): correct predictions that are wrongly flagged as uncertain.
- TN (true negatives): correct predictions that are left unflagged.

5.2.4. F1-Score

Definition: The F1-score is the harmonic mean of precision and recall. It accounts for both false positives and false negatives, which makes it particularly useful on class-imbalanced data. In the uncertainty-estimation setting, it summarizes how well the uncertainty estimator identifies erroneous predictions.
Formula: $F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, where $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$ and $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$.
Symbols:
- Precision: the fraction of flagged samples that are truly erroneous.
- Recall: identical to TPR.
- TP, FP, FN: as defined above.

5.2.5. Youden's J statistic

Definition: Youden's J statistic summarizes the overall performance of a diagnostic test by combining its sensitivity (TPR) and its specificity ($1 - \text{FPR}$): $J = \text{sensitivity} + \text{specificity} - 1$. A higher $J$ indicates better discriminative ability.
Formula: $J = \text{TPR} - \text{FPR}$
Symbols:
- TPR: true positive rate.
- FPR: false positive rate.
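The four estimator metrics above can be computed from a single confusion matrix. The sketch below is illustrative only: the function name `detection_metrics` and the example counts are assumptions, not taken from the paper. "Positive" here means "the model's prediction was wrong".

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Metrics for an uncertainty estimator.

    tp: wrong predictions flagged as uncertain
    fn: wrong predictions left unflagged
    fp: correct predictions flagged as uncertain
    tn: correct predictions left unflagged
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tpr
    f1 = 2 * precision * tpr / denom if denom else 0.0
    return {"TPR": tpr, "FPR": fpr, "F1": f1, "J": tpr - fpr}


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = detection_metrics(tp=88, fp=17, fn=12, tn=83)
print(metrics)
```

With these made-up counts, TPR is 0.88 and FPR is 0.17, so $J = 0.71$.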

5.3. Baselines

The paper compares TT-SI against the following baselines:

Standard prompting / base model (w/o TT-SI (Base)): direct zero-shot inference with no test-time adaptation.
Supervised fine-tuning (SFT): the conventional inductive approach, in which the model is fine-tuned on the full SealTool training split (about 13k samples).
In-context learning (ICL): a single (1-shot) example is placed in the prompt to guide inference, without updating model parameters.
TT-SI without H (TT-SI w/o H): a TT-SI variant with the uncertainty estimator H disabled, so every test input is treated as uncertain and adapted.

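5.4. Models and hardware

Main model: Qwen2.5-1.5B-Instruct. It was chosen for its strong performance at a small size, which lets it run efficiently on limited hardware and demonstrates the potential of compact agentic models.
Scaling and architectural variants: Qwen2.5-7B-Instruct is additionally included in ablations to examine the effect of scaling and architectural variations.
Hardware: All models are trained on a single NVIDIA A40 GPU.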

5.5. Training and generation parameters

Parameter-efficient fine-tuning (PEFT): LoRA (Hu et al., 2022).
- LoRA parameters: $rank = 8$, $\alpha = 16$, applied to all linear layers.
Training details:
- 3 epochs.
- Learning rate fixed at $1.0 \times 10^{-4}$.
- Warm-up ratio: 0.03.
- Batch size: 1.
- Cosine scheduler.
- Alpaca-style data format: the instruction and input fields are zero-padded, and the loss is computed on the output field only.
Data generation (G) / inference settings:
- vLLM is used for both data synthesis and inference.
- Temperature: 0.7.
- Repetition penalty: 1.1.
- top-k: 20.
- top-p: 0.8.
- Maximum length: 32768.
- Parsing errors: up to 5 retries are allowed.
Teacher model in TT-D: GPT-5-mini generates the samples through the standard OpenAI API, with no temperature adjustment or additional decoding strategies.
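To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule with linear warm-up, written in plain Python. It assumes the listed $1.0 \times 10^{-4}$ is the peak rate reached at the end of warm-up and that the rate decays to zero; the dictionary keys and the function name `cosine_lr` are hypothetical, and the paper's actual training code may behave differently.

```python
import math

# Hyperparameters listed in this section; the key names are illustrative.
TRAIN_CONFIG = {
    "lora_rank": 8,
    "lora_alpha": 16,
    "epochs": 3,
    "batch_size": 1,
    "peak_lr": 1.0e-4,
    "warmup_ratio": 0.03,
}


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = TRAIN_CONFIG["peak_lr"],
              warmup_ratio: float = TRAIN_CONFIG["warmup_ratio"]) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero (an assumption)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative run length; real test-time runs on a single sample are far shorter.
schedule = [cosine_lr(s, total_steps=1000) for s in range(1001)]
```

With only a handful of optimizer steps per test sample, the warm-up phase covers at most a step or so, which means the schedule is effectively a short cosine decay from the peak rate.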

5.6. Experimental protocol

All experiments, including the baselines, are repeated 5 times with different per-sample training runs and random seeds, and the averaged results are reported.
During evaluation, a function call is judged correct only if the function name, the arguments, and the corresponding values/types all match the ground truth at the same time.
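The exact-match criterion can be sketched as follows, using the `<tool_call>` output format from the API-Bank sample shown earlier. The helper names `parse_tool_calls` and `is_correct` are hypothetical and this is not the paper's evaluation script; it simply compares the parsed JSON against the reference.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Extract the JSON list of calls from the first <tool_call> block."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return []
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # unparsable output counts as no call
    return calls if isinstance(calls, list) else [calls]


def is_correct(prediction: str, reference: list) -> bool:
    """True only if names, argument keys, and argument values all match."""
    return parse_tool_calls(prediction) == reference


reference = [{
    "name": "ModifyRegistration",
    "arguments": {
        "appointment_id": "34567890",
        "new_appointment_date": "2023-03-26",
        "new_appointment_doctor": "Dr. Lee",
    },
}]
good = ('<tool_call>[{"name": "ModifyRegistration", "arguments": '
        '{"appointment_id": "34567890", "new_appointment_date": "2023-03-26", '
        '"new_appointment_doctor": "Dr. Lee"}}]</tool_call>')
bad = good.replace('"34567890"', '34567890')  # integer instead of string
print(is_correct(good, reference), is_correct(bad, reference))
```

Python's `==` treats `1`, `1.0`, and `True` as equal, so a stricter type check would need an explicit comparison of value types.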

6. Experimental results and analysis

6.1. Core results

6.1.1. Main TT-SI results

The following reproduces Table 1 of the original paper:

| Inference | Method | NexusRaven | SealTool | API-Bank | Avg. | $\Delta\%$ |
|---|---|---|---|---|---|---|
| Input/Output | w/o TT-SI (Base) | 44.03 | 66.67 | 70.08 | 60.26 | - |
| Input/Output | w. TT-SI | 50.08 | 72.43 | 74.34 | 65.62 | $\uparrow 5.36$ |
| Input/Output | w. TT-D | 52.52 | 75.17 | 77.29 | 68.33 | $\uparrow 8.07$ |
| Majority Vote | w/o TT-SI (Base) | 46.56 | 69.73 | 73.96 | 63.42 | - |
| Majority Vote | w. TT-SI | 52.20 | 72.93 | 75.68 | 66.94 | $\uparrow 3.52$ |
| Majority Vote | w. TT-D | 54.53 | 72.25 | 77.56 | 68.11 | $\uparrow 4.69$ |

Insight 1: An agent can self-improve at test time even when trained on a single sample.

In the direct-inference setting, TT-SI yields an absolute average gain of $5.36\%$ over the baseline (w/o TT-SI (Base)), from $60.26\%$ to $65.62\%$.
In the majority-vote (self-consistency) setting, TT-SI yields a gain of $3.52\%$.
This shows that TT-SI lets the agent self-improve during inference by training on just one generated sample per uncertain case.
Test-time distillation (TT-D), in which GPT-5-mini generates the training sample in place of the model itself, improves performance further: it exceeds TT-SI by $2.71\%$ under direct inference and by $1.17\%$ under majority vote. A higher-quality training signal therefore brings additional, consistent gains.
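The averages and deltas quoted above can be reproduced from the per-benchmark scores in Table 1. The short check below assumes, as the reported numbers suggest, that each delta is the difference between averages already rounded to two decimals.

```python
# Per-benchmark accuracy (NexusRaven, SealTool, API-Bank) from Table 1.
scores = {
    ("Input/Output", "Base"): [44.03, 66.67, 70.08],
    ("Input/Output", "TT-SI"): [50.08, 72.43, 74.34],
    ("Input/Output", "TT-D"): [52.52, 75.17, 77.29],
    ("Majority Vote", "Base"): [46.56, 69.73, 73.96],
    ("Majority Vote", "TT-SI"): [52.20, 72.93, 75.68],
    ("Majority Vote", "TT-D"): [54.53, 72.25, 77.56],
}

# Average over the three benchmarks, rounded as in the table.
avg = {key: round(sum(vals) / len(vals), 2) for key, vals in scores.items()}

# Gain of each method over the base model within the same inference mode.
delta = {
    (mode, method): round(avg[(mode, method)] - avg[(mode, "Base")], 2)
    for (mode, method) in scores
    if method != "Base"
}

for key, value in delta.items():
    print(key, avg[key], f"+{value}")
```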

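6.1.2. TT-SI compared with SFT and ICL

Figure 2 (left) of the original paper shows the results on the SealTool benchmark.

Insight 2: TT-SI outperforms inductive SFT with orders of magnitude less data.

On SealTool, TT-SI (72.43%) exceeds all three baselines (Base, ICL, and SFT), and beats standard inductive SFT (70.20%) by $2.23\%$.
Notably, TT-SI uses only 190 uncertain cases, each paired with one synthetic example, rather than the full 13k training set. That is roughly 68 times fewer samples for better accuracy, which highlights TT-SI as an effective alternative to conventional learning.

Insight 3: When training is not feasible, test-time self-improvement with ICL offers a fast alternative.

TT-SI is extended to the in-context learning setting (Figure 2, left), where the generated example is inserted directly into the prompt instead of being used for fine-tuning.
Combined with ICL, TT-SI improves slightly over the base model ($66.37\% \rightarrow 66.38\%$) and even outperforms the standard ICL baseline that draws on the SealTool training set (67.74%).
This suggests that the ICL variant is a training-free, low-overhead alternative to inductive methods. The improvement may stem from increased model certainty: the demonstrations generated by TT-SI raise the model's confidence in the correct output format and reasoning process, which makes accurate predictions more likely without relying on external training data.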

6.1.3. Effect of uncertainty filtering on accuracy and efficiency

The following reproduces Table 2 of the original paper:

| $\tau$ / Setting | TPR ($\uparrow$) | FPR ($\downarrow$) | Unc. ($\Delta\%$) | Acc. |
|---|---|---|---|---|
| Base | - | - | - | 66.37 |
| 0.35 | 0.42 | 0.09 | 51 (17%) | 68.10 |
| 0.95 | 0.96 | 0.53 | 190 (65%) | 72.43 |
| No Unc. (all) | 1.00 | 1.00 | 294 (100%) | 73.47 |

Insight 4: Uncertainty filtering balances accuracy and efficiency.

TT-SI runs at inference time, so efficiency is critical. The uncertainty estimator H selects uncertain samples for targeted adaptation, while certain samples are handled directly by the base model.
When every test input is treated as uncertain (TT-SI w/o H), accuracy rises only marginally, by $+1.04\%$ over $\tau = 0.95$, yet all 294 test samples must be adapted (104 additional LoRA updates), which raises the cost.
The small accuracy gain is thus offset by the loss in efficiency, which underlines the importance of uncertainty filtering for practical test-time adaptation.

6.1.4. Effect of data scale on OOD data

Figure 2 (right) of the original paper shows how the different adaptation strategies scale with the number of samples.

Insight 5: Data scaling on OOD data exposes the limits of SFT and the advantage of TT-SI.

The paper runs a data-scaling experiment in the out-of-distribution (OOD) setting, using the state-of-the-art xLAM function-calling dataset for SFT.
TT-SI consistently outperforms SFT at every scale, and its gains grow as the number of uncertain examples increases. This underlines the importance of uncertainty-guided data and the value of targeted test-time learning.
Moreover, the training-free ICL variant of TT-SI also surpasses standard SFT trained on the same amount of data from this strong dataset. Even without a dedicated training split or any fine-tuning, test-time approaches can therefore beat SFT under the same OOD data conditions.

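6.1.5. Effect of the optimal $\tau$ on efficiency and accuracy

Insight 6: The optimal $\tau$ preserves efficiency while minimizing the loss in accuracy.

As Table 2 shows, TT-SI improves accuracy for every value of $\tau$ and always exceeds the baseline of $66.37\%$.
When $\tau$ is high (close to 1), all samples are selected and accuracy reaches its maximum (73.47%). This requires updates on all 294 instances, so a marginal gain comes with a large computational overhead. By contrast, $\tau = 0.95$ reaches $72.43\%$ with only 190 updates ($35\%$ fewer), which keeps performance close to the best.
A lower $\tau = 0.35$ minimizes false positives (FPR = 0.09) but misses many errors ($TPR = 0.42$), which limits accuracy to $68.10\%$.
$\tau = 0.95$ therefore offers an effective balance: it captures most errors, avoids unnecessary updates, and optimizes the cost-performance trade-off, much as human learners concentrate on the cases they are unsure about.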
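The trade-off can be tabulated directly from the Table 2 numbers. The sketch below derives, for each setting, the share of test samples adapted, the accuracy gain over the base model, and Youden's $J$; the field names are illustrative.

```python
BASE_ACC = 66.37   # base-model accuracy on SealTool (Table 2)
TOTAL = 294        # number of test samples

rows = [
    {"setting": "tau=0.35", "tpr": 0.42, "fpr": 0.09, "updates": 51, "acc": 68.10},
    {"setting": "tau=0.95", "tpr": 0.96, "fpr": 0.53, "updates": 190, "acc": 72.43},
    {"setting": "all", "tpr": 1.00, "fpr": 1.00, "updates": 294, "acc": 73.47},
]


def summarize(row: dict) -> dict:
    """Derive cost and benefit figures for one threshold setting."""
    return {
        "setting": row["setting"],
        "share_adapted_pct": round(100 * row["updates"] / TOTAL, 1),
        "gain_over_base": round(row["acc"] - BASE_ACC, 2),
        "youden_j": round(row["tpr"] - row["fpr"], 2),
    }


summary = [summarize(r) for r in rows]
best_by_j = max(summary, key=lambda s: s["youden_j"])["setting"]

for s in summary:
    print(s)
print("highest Youden's J:", best_by_j)
```

Among the three settings listed in Table 2, $\tau = 0.95$ has the highest $J$ (0.43 versus 0.33 and 0.00), which matches the choice made in the text.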

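6.1.6. TT-SI across model sizes

Figure 3 of the original paper shows the uncertainty-estimation performance of Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct on the SealTool benchmark.

Insight 7: TT-SI improves both small and large Qwen models, with larger relative gains for the small one.

On the smaller Qwen2.5-1.5B-Instruct, TT-SI brings a substantial absolute gain of $+5.76\%$ (from 66.67 to 72.43).
On the larger Qwen2.5-7B-Instruct, TT-SI brings a gain of $+3.02\%$ (from 80.95 to 83.97).
These improvements indicate that TT-SI raises performance consistently regardless of model size, which supports its architectural generality.
The relative improvement is larger for the small model, which highlights the potential of small agents as an efficiency-oriented strategy for practical deployment.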

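6.1.7. Performance of the uncertainty estimator (H)

Figure 4 shows the performance of the uncertainty estimator H for Qwen-2.5-1.5B-Instruct on Nexus-Raven.

Left panel: As the softmax-difference threshold $\tau$ increases, the definition of "uncertain" becomes less strict, so more samples are flagged as uncertain and passed on for adaptation. TPR rises steadily while FPR stays low. At $\tau = 0.95$, Youden's J peaks at $71.0\%$, a good balance between sensitivity and specificity: nearly $88\%$ of the model's errors are captured while only $17\%$ of correct answers are misclassified.
Right panel: The paper's uncertainty estimator H outperforms the baselines (Random, Trivial, Perplexity) on every metric, with the highest F1 (61.19) and Youden's J (63.29). This validates the effectiveness of H.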
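The behavior of the threshold described for the left panel can be illustrated with a small sketch. It assumes that H produces one raw score per candidate action, normalizes the scores with a softmax, and flags a sample as uncertain when the gap between the two largest probabilities falls below $\tau$. The function names and example scores are hypothetical, and the scoring actually used in the paper may differ.

```python
import math


def softmax(scores: list) -> list:
    """Numerically stable softmax over raw candidate scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_uncertain(scores: list, tau: float = 0.95) -> bool:
    """Flag a sample when the top-2 probability gap is below tau (an assumption)."""
    probs = sorted(softmax(scores), reverse=True)
    if len(probs) < 2:
        return False  # a single candidate leaves nothing to compare
    return (probs[0] - probs[1]) < tau


# Hypothetical candidate scores, for illustration only.
confident = [10.0, 0.0, 0.0]   # one candidate dominates
moderate = [2.0, 0.0, 0.0]     # clear winner, but with less margin
print(is_uncertain(confident), is_uncertain(moderate), is_uncertain(moderate, tau=0.35))
```

Under this rule a larger $\tau$ demands a wider margin before a prediction is trusted, so more samples are sent to adaptation, consistent with the trend in Figure 4.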

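6.1.8. Visualization of self-generated data

Figure 5 of the original paper visualizes SealTool test samples and the corresponding self-generated queries in a two-dimensional semantic space.

The figure is a scatter plot of test samples and generated samples in two dimensions. Colors and marker shapes distinguish the categories of original samples and their generated counterparts, and the generated samples cluster around their source test samples, reflecting the effect of self-data augmentation.

Figure 5 shows that the generated samples form tight clusters in embedding space that align closely with the uncertain inputs. This indicates that the data synthesis function G produces semantically consistent and faithful examples, which helps bridge the gap when adapting to challenging queries.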

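6.2. Runtime overhead and resource usage

Uncertainty estimator (H): 0.88 seconds per sample on average.
Data synthesis function (G): 3.55 seconds to generate the synthetic variant for an uncertain sample.
Test-time fine-tuning (T): 4.55 seconds of training.
Inference: 1.26 seconds.
Total: 10.24 seconds for an uncertain sample and 2.14 seconds otherwise. Over the 294 test samples with 190 updates, the total is 2,168.2 seconds (about 36 minutes).
Comparison with SFT: SFT on SealTool's 13K-sample training set takes 7,966.6 seconds (about 2 hours 12 minutes).
Efficiency: TT-SI trains on roughly 68 times fewer samples, yet the wall-clock speed-up is 3.7x. Most of the extra time goes to merging the model after training and to file-saving operations.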
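The timing figures can be checked with simple arithmetic. The sketch below assumes that the reported total covers all 294 SealTool test samples, with 190 going through the full pipeline and the remaining 104 needing only H plus inference; this split is an inference from the numbers, and it does reproduce the reported total.

```python
# Per-sample timings in seconds, as reported in this section.
H_TIME, G_TIME, T_TIME, INFER_TIME = 0.88, 3.55, 4.55, 1.26

UNCERTAIN = 190        # samples that trigger an update
CERTAIN = 294 - 190    # samples answered directly by the base model
SFT_SECONDS = 7966.6   # full SFT run on the SealTool training split

uncertain_cost = H_TIME + G_TIME + T_TIME + INFER_TIME   # about 10.24 s
certain_cost = H_TIME + INFER_TIME                       # about 2.14 s

total = UNCERTAIN * uncertain_cost + CERTAIN * certain_cost
speedup = SFT_SECONDS / total

print(f"total: {total:.1f} s ({total / 60:.0f} min), speed-up: {speedup:.1f}x")
```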

7. Summary and reflections

7.1. Conclusions

The paper proposes a test-time self-improvement (TT-SI) framework that lets LLM agents learn and improve on the fly during inference. The framework has three core components: an uncertainty estimator (H) for self-awareness, a data synthesis function (G) for self-data augmentation, and test-time fine-tuning (T) for self-learning.

Through extensive experiments on several agent benchmarks (NexusRaven, SealTool, and API-Bank), the paper shows that TT-SI clearly improves LLM agents even when it trains on only one synthetic example per uncertain case. TT-SI achieves an absolute gain of $+5.36\%$ in average accuracy over the base model, and on SealTool it outperforms standard SFT while using 68 times fewer training samples, which highlights its efficiency and effectiveness. The test-time distillation (TT-D) variant, which introduces a teacher model, improves performance further in more complex scenarios. The study also shows that uncertainty filtering is essential for balancing accuracy and computational efficiency.

These findings challenge the reliance of the conventional inductive learning paradigm on large, diverse datasets and offer a test-time learning paradigm that lays groundwork for LLM agents able to self-evolve in complex scenarios.

7.2. Limitations and future work

7.2.1. Limitations

Choice of the uncertainty threshold $\tau$: Identifying uncertain samples requires setting a threshold $\tau$. Although the ablations show that gains hold across values of $\tau$, the best performance is sensitive to this choice. Learning the threshold automatically remains an open challenge in uncertainty calibration.
Bounded by the capacity of the base model: TT-SI is inherently limited by what the base model parameters $\theta$ contain. If the knowledge needed for a task is absent from the pretrained model (for example, a newly introduced medical concept), self-improvement alone cannot compensate. Such cases call for external knowledge integration through retrieval or search mechanisms.

7.2.2. Future work

Continual and transferable learning for self-evolving agents: In TT-SI, the temporary, sample-specific LoRA update is discarded after inference. An important next step is to study whether the improvement from one uncertain sample transfers to others, and how an agent can self-assess the likelihood of such transfer.
Adaptive data generation: The model itself could decide how many synthetic examples a given uncertain case needs, instead of relying on fixed hyperparameters.
Co-evolutionary settings: The current framework optimizes only the agent, not the data generator. A co-evolutionary setup such as dual learning, in which the agent and the generator adapt to each other, could improve performance further.
Extension to other domains: Extending TT-SI to domains such as mathematics or medicine would show how domain-specific uncertainty and knowledge structures interact with self-improvement.

7.3. Personal takeaways and critique

7.3.1. Personal takeaways

The paper's central message is that LLM learning can shift from conventional inductive learning toward test-time local adaptation, which is closer to how humans learn. The shift brings a large gain in data efficiency (68 times fewer samples with higher accuracy) and, more importantly, a mechanism for continued self-improvement after deployment.

A shift toward learning for the problem at hand: Conventional SFT tries to teach the model everything at once and assumes that the knowledge generalizes to all unseen cases. TT-SI accepts that the model will always meet uncertain cases and learns locally and specifically at those pain points. This on-demand learning can greatly extend a model's usefulness and lifetime.
Strong potential for small models: The paper shows that TT-SI helps the small model (Qwen2.5-1.5B-Instruct) more, which matters for resource-constrained deployment. It suggests that even a small LLM, with well-targeted test-time adaptation, may match or exceed what conventional SFT delivers.
Training-free potential with ICL: TT-SI combined with ICL improves results without any parameter update, which gives edge devices and latency-sensitive applications an adaptive option with zero training cost.
The strength of the "self" mechanisms: Self-awareness, self-data augmentation, and self-learning together form a closed "self" loop. Beyond the technical contribution, this offers a direction for more advanced autonomous learning in AI.

7.3.2. Critique and possible improvements

Despite its potential, TT-SI leaves several points open to criticism or improvement:

Robustness of the threshold $\tau$: The paper notes that performance is sensitive to $\tau$. Although 0.95 is presented as the best balance, different tasks and datasets may need different values in practice. Adapting or learning $\tau$ will be key, for example through meta-learning or reinforcement learning, so that it adjusts dynamically to the current task and model state.
Quality of the data synthesis function G: The capability of $G$ directly determines how well TT-SI works. The paper reports that generated samples are semantically consistent, but the diversity and novelty of synthetic data remain a challenge. If $G$ only produces samples that are too similar to the original uncertain sample, it may not extend the model's decision boundary effectively. Future work could explore stronger generative models or diversity-promoting mechanisms.
Transferability of LoRA updates: The paper states that LoRA updates are temporary and sample-specific. This avoids catastrophic forgetting but also limits the accumulation and transfer of knowledge. Future research should explore how to persist or generalize locally learned knowledge to support longer-term self-evolution, for instance by distilling it into parts of the base model or by maintaining a dynamic knowledge base.
Further exploration of the ICL variant: TT-SI with ICL beats the ICL baseline but still trails test-time fine-tuning. Better prompt engineering, or a more effective way of combining self-generated data with ICL, could maximize performance in the training-free mode.
Bias and hallucination in $G$: An LLM used as a data generator may introduce bias or hallucinations, especially when generating labeled data. The stronger teacher model in TT-D mitigates part of this, but the quality and reliability of self-generated data remain a concern. Robust mechanisms for filtering or validating synthetic data are needed.
Engineering challenges in deployment: Although TT-SI delivers a wall-clock speed-up, loading, fine-tuning, running, and unloading LoRA models in real time is still a complex engineering problem, particularly for low-latency applications. The overhead of model merging and I/O operations needs finer optimization.
