AiPaper
论文状态:已完成

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

发表:2024/06/01
本分析由 AI 生成,可能不完全准确,请以原文为准。

TL;DR 精炼摘要

本文提出了基于行动对比自训练的多轮对话模型,旨在提升大语言模型在模糊性处理中的能力。通过引入准在线偏好优化算法,ACT 在数据稀疏场景下有效学习对话策略,验证了其在表格问答、机器阅读理解及 AmbigSQL 等任务中的出色表现,显示出显著优于传统微调方法的效果。

摘要

Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.

思维导图

论文精读

中文精读

1. 论文基本信息

1.1. 标题

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training (学习澄清:基于行动对比自训练的多轮对话)

1.2. 作者

  • Maximillian Chen*, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik
  • 所属机构:Google 和 Columbia University (Maximillian Chen 隶属于两家机构)

1.3. 发表期刊/会议

本文于 2024 年 5 月 31 日作为预印本发表在 ArXiv。ArXiv 是一个广受欢迎的预印本服务器,在人工智能、机器学习和自然语言处理等领域,研究者通常会先将论文上传至此,以便快速分享研究成果并接受社区反馈。

1.4. 发表年份

2024 年

1.5. 摘要

大语言模型 (LLM) 通过人类反馈进行优化,已迅速成为开发智能对话助手的主流范式。然而,尽管它们在许多基准测试中表现出色,但基于 LLM 的智能体可能仍然缺乏对话技能,例如歧义消除 (disambiguation)——当它们面临模糊性时,通常会过度规避或隐式猜测用户的真实意图,而不是提出澄清问题 (clarification questions)。在任务特定设置中,高质量的对话样本往往有限,这构成了 LLM 学习最佳对话动作策略 (dialogue action policies) 的瓶颈。

本文提出了 Action-Based Contrastive Self-Training (ACT),一种基于 Direct Preference Optimization (DPO) 的准在线偏好优化算法 (quasi-online preference optimization algorithm),旨在实现多轮对话建模中的数据高效对话策略学习。研究人员在多个真实世界的对话任务中(包括表格型问答、机器阅读理解以及 AmbigSQL——一个用于向数据分析智能体澄清复杂 SQL 生成的信息寻求请求的新任务)验证了 ACT 的有效性,即使在没有行动标签的数据高效微调 (data-efficient tuning) 场景下也表现出色。此外,本文提出通过检查 LLM 是否能隐式识别和推理对话中的模糊性来评估其作为对话代理 (conversational agents) 的能力。ACT 相较于监督微调 (SFT) 和 DPO 等标准微调方法,展现出显著的对话建模改进。

1.6. 原文链接

2. 整体概括

2.1. 研究背景与动机

  • 核心问题: 尽管大语言模型 (LLM) 在许多任务中表现出了强大的能力,但它们在多轮对话中,尤其是在处理用户请求的模糊性时,仍然存在显著缺陷。当前 LLM 倾向于过度规避 (overhedging)(例如,给出过于笼统或模棱两可的回答)或隐式猜测 (implicitly guess) 用户的真实意图,而非主动提出澄清问题 (clarification questions) 来消除歧义,这导致对话效率和用户满意度降低。Figure 1 提供了一个直观的例子。

    Figure 1 | Simplified example of ambiguity present at tabular-grounded conversational question answering based on Deng et al. (2022). A conversational agent should recognize when there is ambiguity and ask a clarifying question towards a more accurate final answer. 该图像是示意图,展示了在基于表格的对话问答中过度推测及意图澄清的示例。对话代理需要识别出用户问题中的模糊性,并提出澄清问题,以便得到更准确的答案。

    Figure 1 | 简化版的表格型对话问答中模糊性示例,一个对话代理应能识别模糊性并提出澄清问题以获得更准确的最终答案。

  • 为什么重要: 在复杂的任务场景中,用户往往会欠详尽 (underspecify) 他们的请求,导致信息模糊。有效的歧义消除 (disambiguation) 是实现共同基础 (common ground) 和完成任务的关键。例如,在生成复杂 SQL 查询或进行机器阅读理解时,一个微小的歧义都可能导致最终结果的错误。然而,为 LLM 学习这些实用技能 (pragmatic skills) 的高质量对话数据往往稀缺,这成为一个主要瓶颈。

  • 现有研究的挑战或空白:

    1. LLM 的预训练或监督微调 (SFT) 目标通常不直接对齐对话中的实用技能(如澄清)。
    2. 虽然 RLHF 等方法可以对齐 LLM,但现有模型在处理多轮对话任务时仍存在困难,部分原因是它们没有直接优化实用技能。
    3. 收集高质量的多轮对话数据集成本高昂且受隐私问题限制,使得在数据稀疏场景下学习最佳对话策略 (dialogue policies) 极具挑战性。
    4. 现有 DPO 类算法在对话任务中的应用多集中于单轮响应优化,未能充分考虑多轮轨迹 (multi-turn trajectories)对话行动 (conversational actions) 的优化。
  • 本文的切入点或创新思路:

    • 聚焦行动规划: 提出直接优化 LLM 在模糊上下文中隐式选择对话策略 (conversational strategies) 的能力,尤其关注澄清问题 (clarifying questions)
    • 数据高效: 设计一种数据高效的算法,使其在高质量对话样本有限的情况下也能有效学习。
    • 准在线偏好优化: 引入 Action-Based Contrastive Self-Training (ACT),结合了 DPO 的易用性和在线学习的探索能力,通过对比不同实用对话行动 (pragmatic conversational actions) 的差异来优化模型。
    • 多任务验证: 在多个真实世界对话任务上验证方法,包括一个新提出的 AmbigSQL 任务,突出其通用性。
    • 新型评估: 提出通过检查 LLM 是否能隐式识别和推理对话中的模糊性来评估其作为对话代理 (conversational agents) 的能力。

2.2. 核心贡献/主要发现

  • 提出了 Action-Based Contrastive Self-Training (ACT) 算法: ACT 是一种样本高效 (sample-efficient)、基于 DPO准在线 (quasi-online) 偏好优化算法。它专注于通过对比智能体可能的实用对话行动 (pragmatic conversational actions) 之间的差异来改进 LLM 的多轮对话能力。

  • 在数据稀疏场景下表现出色: ACT数据高效微调 (data-efficient tuning) 场景下展示了其有效性,即使在没有明确行动标签 (action labels) 的情况下,也能在多个真实世界的对话任务中实现显著改进。这些任务包括:

    • 表格型问答 (tabular-grounded question-answering)
    • 机器阅读理解 (machine reading comprehension)
    • AmbigSQL 任务: 一个新颖的、用于歧义消除 (disambiguating) 针对数据分析智能体生成复杂 SQL 的信息寻求请求的任务。
  • 多轮对话建模的显著改进: ACT多轮任务完成 (multi-turn task completion) 能力上显著优于 Supervised Fine-tuning (SFT)Direct Preference Optimization (DPO) 等标准微调方法。Figure 2 直观展示了 ACT 的性能优势。

    Figure 2 | ACT greatly outperforms standard tuning approaches in data-efficient settings for conversational modeling, as exemplified here on PACIFIC. 该图像是一个图表,展示了 Action-Based Contrastive Self-Training (ACT) 在数据效率设置中相较于其他标准调优方法的表现。随着对话数量的增加,ACT 的 Clarification Reasoning (DROP F1) 指标显著提升,标记为 24% 的推理提升和 19% 的动作优化提升,并显示出 5 倍的数据效率。

    Figure 2 | ACT 在数据高效的对话建模设置中,大大优于标准微调方法,如在 PACIFIC 任务上的表现。

  • 评估框架的创新: 论文提出了一种评估 LLM 作为对话代理 (conversational agents) 能力的新工作流程,即通过检查模型是否能隐式识别 (implicitly recognize)推理 (reason about) 对话中的模糊性。

  • 模型无关性 (Model Agnostic): ACT 对基础模型具有良好的适应性,即使是对未经过人类反馈对齐的基础模型,也能带来性能提升。

3. 预备知识与相关工作

本节旨在为读者提供理解论文核心内容所需的背景知识,并阐述本文工作与现有研究的联系与区别。

3.1. 基础概念

  • 大语言模型 (Large Language Models, LLMs): 指的是参数量巨大、在海量文本数据上进行过预训练的深度学习模型,如 GPT 系列、GeminiMistral 等。它们通过学习语言的统计规律,能够生成连贯、有意义的文本,并执行各种自然语言处理任务,如问答、翻译、摘要和对话。本文中,LLM 是构建对话代理 (conversational agents) 的基础。
  • 人类反馈 (Human Feedback): 优化 LLM 性能的关键机制。通常指在模型生成响应后,人类标注者对其质量、有用性、安全性等方面进行评价或排序。这种反馈用于进一步微调模型,使其行为更符合人类偏好。
  • 对话代理 (Conversational Agents): 能够与人类进行自然语言交互,并协助完成特定任务或提供信息的智能系统。它们需要具备理解用户意图、生成恰当响应、管理对话流程等能力。本文关注的是如何提升 LLM 作为对话代理的能力。
  • 歧义消除 (Disambiguation): 指在对话中识别并解决用户请求或表达中的模糊性。例如,用户说“展示详情”,代理需要澄清“详情”指的是哪些具体信息。这是混合主导对话 (Mixed-Initiative Conversation) 的一个核心技能。
  • 对话动作策略 (Dialogue Action Policies): 描述对话代理 (dialogue agent) 在特定对话状态 (dialogue state) 下应采取的行动。例如,当用户请求模糊时,策略可能是“提出澄清问题”;当请求清晰时,策略可能是“提供答案”。本文的目标是优化 LLM 学习这些策略的能力。
  • 直接偏好优化 (Direct Preference Optimization, DPO): 一种强化学习 (Reinforcement Learning) 的替代方法,用于从人类偏好数据中对齐 LLMDPO 不需要训练一个单独的奖励模型 (reward model),而是直接将偏好数据转换为一个分类损失函数,通过最大化获胜响应相对于失败响应的概率比来优化策略模型。这简化了 RLHF 的训练过程,通常更稳定且计算效率更高。
  • 监督微调 (Supervised Fine-tuning, SFT): 在预训练 LLM 的基础上,使用带有任务特定标注数据(例如,问答对、对话轮次)进一步训练模型的过程。SFT 的目标是使模型更好地遵循指令并适应特定任务。

3.2. 前人工作

3.2.1. 混合主导对话代理 (Mixed-Initiative Conversational Agents)

混合主导 (Mixed-Initiative) 是指在人机交互中,人与机器都可以在对话中承担主导角色,而不是由一方完全控制。这种代理通常包含两个核心组件:

  • 理解与规划模块 (Understanding and Planning Module): 负责分析对话状态(如用户意图、上下文),并决定接下来要执行的对话行动 (dialogue action),例如是问澄清问题 (clarifying question) 还是直接提供答案 (answer)。这通常被视为一个马尔可夫决策过程 (Markov Decision Process, MDP),其中对话状态 (dialogue state) 是从一个潜在未知分布中抽取的,而行动 (action) 是对话话语中携带的实用意图 (pragmatic intent) 的低维表示(即对话行为 (dialogue act))。
  • 生成模块 (Generation Module): 根据规划模块的输出,生成具体的话语 (utterance)。早期的研究可能使用多目标 SFT 或专门的控制代码嵌入 (control codes embeddings) 来实现受控生成。LLM 极大地改进了实用受控生成 (pragmatically-controlled generation) 的性能。

挑战:

  • 规划 (Planning): 这是一个困难的任务,因为它需要长期规划 (long-horizon planning)推理 (reasoning) 用户的响应和意图。传统的模块化方法(如结合神经网络模型 (neural models)搜索算法 (search algorithms)模拟 (simulation))计算开销大,可能导致错误传播 (error propagation),并且不直接优化响应质量。
  • 本文提出直接将对话行动规划 (dialogue action planning) 作为混合主导对话 (mixed-initiative conversation)响应生成 (response generation) 的一个隐式子任务来优化。

3.2.2. LLM 对齐学习 (Learning for LLM Alignment)

当前 LLM 训练范式通常包含三个阶段:

  1. 预训练 (Pretraining): 在大规模数据上学习通用语言模式。

  2. 监督微调 (Supervised Fine-tuning, SFT): 在指令遵循数据上进行微调。

  3. 人类偏好对齐 (Tuning for alignment with human preferences): 使模型行为与人类偏好保持一致。

    本文主要关注第三个阶段。常用的对齐方法包括:

  • 人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF): 见 Ouyang et al. (2022)。首先根据人类偏好数据训练一个奖励模型 (reward model),然后使用近端策略优化 (Proximal Policy Optimization, PPO) 等在线算法 (online algorithms) 优化 LLM。RLHF 具有灵活的奖励函数和更广阔的策略探索 (policy exploration) 空间,但 PPO 以难以调优著称。
  • 离线偏好学习算法 (Offline Preference Learning Algorithms): 针对 RLHF 的复杂性,DPO (Rafailov et al., 2024)SLiC (Zhao et al., 2023)IPO (Azar et al., 2024) 等算法被广泛采用。它们绕过了显式奖励建模 (reward modeling),只需一组超参数即可优化,同时能达到相似的经验效果。
  • 在线策略 DPO (On-Policy DPO): 鉴于纯离线方法的局限性,一些研究探索了 DPO在线变体 (online variants)。例如,Yuan et al. (2024) 提出了迭代 DPO (iterative DPO)Chen et al. (2024) 提出了将真实响应视为“获胜”响应,从前一迭代策略模型中采样的响应视为“失败”响应的变体。Pang et al. (2024) 将迭代 DPO 应用于优化外部化推理链 (reasoning chains)

3.3. 技术演进

对话系统从早期的基于规则和知识的系统,发展到以统计模型为基础(如隐马尔可夫模型 (HMM)条件随机场 (CRF)),再到基于深度学习 (deep learning)端到端 (end-to-end) 系统。LLM 的出现极大地推动了这一领域的发展,使得通用对话代理 (conversational agents) 成为可能。

LLM 的对齐方面,从最初的监督微调 (SFT),到通过人类反馈 (human feedback) 进行强化学习 (RLHF),再到近期兴起的直接偏好优化 (DPO) 系列算法,对齐技术不断简化和优化,旨在更高效、稳定地使 LLM 的行为符合人类预期。本文的 ACT 算法正是基于 DPO 框架,并针对多轮对话 (multi-turn conversations) 中的行动规划 (action planning) 这一特定挑战进行了创新性扩展。

3.4. 差异化分析

  • 聚焦多轮对话行动: 现有许多 DPO 相关工作主要关注单轮响应优化 (single-turn response optimization),例如生成更优的、更符合偏好的答案。本文的 ACT 则将 DPO 扩展到多轮轨迹 (multi-turn trajectories),并特别关注 LLM 在对话中隐式选择和执行对话行动 (conversational actions) 的能力,尤其是澄清问题 (clarification questions)。这是本文的核心创新点之一。
  • 行动作为对比基础: 据作者所知,ACT 是首个以对话行动 (conversational actions) 为基础进行对比学习 (contrastive learning) 的方法。通过明确对比“获胜行动”和“失败行动”所导致的响应和轨迹,模型能够更有效地学习何时应该澄清、何时应该回答。
  • 准在线学习范式: ACT 结合了离线 DPO 的易用性和在线学习的探索能力。它在训练过程中在线采样 (on-policy sampling) 模型自身的响应,并通过用户模拟器 (User Simulator) 评估这些响应所产生的多轮对话轨迹 (multi-turn dialogue trajectories),从而动态更新偏好对 (preference pairs)。这种准在线机制使其比纯离线方法更能适应动态的对话环境,并学习到更鲁棒的策略。
  • 新任务 AmbigSQL: 论文引入了一个新的文本到 SQL 生成 (text-to-SQL generation) 任务 AmbigSQL,专门用于评估和训练 LLM 在复杂数据分析请求中的歧义消除 (disambiguation) 能力,填补了该领域多轮交互任务资源的空白。

4. 方法论

4.1. 方法原理

ACTAction-Based Contrastive Self-Training,基于行动对比自训练)的核心思想在于,将 LLM 微调为混合主导对话代理 (mixed-initiative conversational agent) 的关键在于使其能够自动生成能够最大化对话成功概率 (conversational success) 的响应。ACT 通过以下几个直觉实现这一目标:

  1. 对比偏好 (Contrastive Preferences): 对比“获胜”和“失败”的对话响应 (dialogue responses) 之间的实用差异 (pragmatic differences),是直观地教导模型如何选择正确对话行动 (dialogue actions) 的有效方式。

  2. 多轮优化 (Multi-turn Optimization): 对话能力的改进需要多轮优化 (multi-turn optimization),而不仅仅是单轮的对比。ACT 通过模拟多轮对话轨迹 (conversation trajectories) 来实现这一点。

  3. 在线策略采样 (On-policy Sampling): DPO 类算法的梯度更新基于获胜和失败响应的对数概率。在线策略采样 (on-policy response sampling) 可以生成高概率的词元序列,从而更好地利用 DPO 的优化机制。

    ACT 算法分为两个主要阶段:行动基于对比数据集构建 (action-based contrast dataset construction)对比自训练 (contrastive self-training)。整个过程如 Figure 3 所示。

    该图像是示意图,展示了基于行动的对比自我训练 (ACT) 方法在多轮对话中的策略更新过程。图中包含了响应采样、轨迹模拟与评估的场景,分别示例了错误和正确的隐式行动检测,以及如何更新策略以实现收敛的步骤。

Figure 3 | ACT 微调阶段概述。对于 $D_{pref}$ 中的每个初始对比配对(如 3.2.1 节所述构建),我们从正在微调的模型中采样一个在线策略响应。在评估了采样响应的轨迹后,我们通过替换现有的获胜或失败响应来更新对比配对。模型策略使用公式 1 的目标进行更新。

4.2. 核心方法详解

4.2.1. 问题设置 (Problem Setup)

本文将任务定义为将 LLM 微调为一个混合主导对话代理 (mixed-initiative conversational agent)。该代理应通过一系列对话与用户互动,最终为用户的请求提供正确响应。与用户完全控制交互流的常见代理设置不同,混合主导代理 (mixed-initiative agents) 应通过执行对话行动 (conversational actions)策略 (strategies)(如澄清问题 (clarifying questions))来理解如何重定向交互流。

符号定义:

  • $\pi_{\theta_i}$:在时间步 $i \geq 0$ 时,LLM 的策略 (policy),由参数 $\theta$ 参数化。
  • $\pi_{ref}$:参考策略模型 (reference policy model),即 $\pi_{\theta_0}$(初始模型)。
  • $D$:包含多个对话 (conversations) 的数据集。
  • $c$:$D$ 中的一个对话,包含 $n$ 个对话轮次 (dialogue turns)。
  • $t_i$:时间步 $i$ 的对话轮次状态 (turn state),包括每个交互方观察到的话语 (utterances) 和行动 (actions)。每个 $t_i$ 都属于一个轨迹,该轨迹在用户于更早的时间步 $j \le i$ 提出的问题得到回答时结束。
  • $p_i$:在时间步 $i$ 的提示 (prompt),包含任何任务特定信息 (task-specific information)(如 SQL 数据库 schema、表格数据或检索到的段落)以及任何现有的对话上下文 (dialogue context)。
  • $r_i$:在时间步 $i$ 的系统端真实响应 (ground truth system-side response)。
  • $g_i$:解决 $t_i$ 隐式轨迹的目标响应 (goal response),即在任何可能的澄清轮次后,用户原始问题的答案。在单轮轨迹情况下,$g_i \gets r_i$。
  • $a_i$:$r_i$ 隐式表达的行动 (action)。$a_i$ 存在于特定任务的潜在行动空间 (latent Action Space) $S$ 中,可由行动标注代理 (Action Annotation Agent) $G$ 推断。
  • 行动空间 (Action Space) $S$:在实验中,$S =$ [CLARIFY, ANSWER](澄清,回答)。
  • 辅助模块:
    • MM可控生成模型 (controllable generation model),用于偏好数据 (preference data) 的创建。

    • AA行动分类器 (Action Classifier),用于微调 (tuning)评估 (evaluation) 期间。

    • UU用户模拟器 (User Simulator),可以被控制以模拟用户行为,用于微调和评估期间。

      Figure A4 展示了如何在 Abg-CoQA 中构建一个对比配对的例子。

      Figure A4 | Example of a contrastive pairing constructed for RL tuning with Abg-CoQA (Guo et al., 2021). The notation used is as described in Section 3.1. 该图像是一个关于物质主义与哲学分类的对话示例。图中展示了选中的响应、拒绝的响应及其对应的操作,体现了在处理模糊问题时所需的澄清过程。对谈的核心是物质主义如何与其他哲学理论对比。

Figure A4 | RL 微调中 Abg-CoQA (Guo et al., 2021) 的对比配对示例。所用符号如 3.1 节所述。

用户模拟器 (User Simulators, $U$): $U$ 的实现灵感来自 Deng et al. (2023c) 和 Yu et al. (2023) 的工作,他们直接提示 LLM 根据对话上下文 (dialogue context) 和任务目标 (task objectives) 执行目标导向任务 (goal-oriented tasks)。本文首先提示 LLM 总结用户的信息寻求目标 (information-seeking goal)。然后,使用此摘要和当前对话上下文形成另一个提示,以模拟用户响应。这种带有目标摘要的提示方式比直接提供用户模拟器 (user simulator) 真实信息目标更具灵活性。

行动分类器 (Action Classifiers, $A$): 在本文考虑的数据集中,可能的行动是“澄清”或“直接回答”问题。本文直接使用少样本上下文学习 (few-shot in-context learning) 作为行动分类器 $A$。
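
下面给出一个最小的提示词实现示意(非论文官方实现),展示如何用占位的 generate 函数串起上述用户模拟器 $U$(先总结信息目标,再模拟用户回复)与少样本行动分类器 $A$;其中的提示词措辞、少样本示例与 generate 接口均为示例性假设:

```python
# 最小示意(非官方实现):用提示词实现用户模拟器 U 与行动分类器 A。
# generate 是调用任意 LLM 的占位函数,由使用者自行提供。
from typing import Callable

def summarize_user_goal(generate: Callable[[str], str], dialogue_context: str) -> str:
    """第一步:提示 LLM 总结用户的信息寻求目标。"""
    prompt = (
        "以下是用户与助手的对话。\n"
        f"{dialogue_context}\n"
        "请用一句话总结用户想要获取的信息目标:"
    )
    return generate(prompt)

def simulate_user_reply(generate: Callable[[str], str],
                        goal_summary: str,
                        dialogue_context: str,
                        clarifying_question: str) -> str:
    """第二步:基于目标摘要与当前上下文,模拟用户对澄清问题的回答。"""
    prompt = (
        f"用户的信息目标:{goal_summary}\n"
        f"当前对话:\n{dialogue_context}\n"
        f"助手刚刚问:{clarifying_question}\n"
        "请以用户的口吻,给出一句能够消除歧义的回答:"
    )
    return generate(prompt)

def classify_action(generate: Callable[[str], str], response: str) -> str:
    """少样本上下文学习的行动分类器 A:返回 CLARIFY 或 ANSWER。"""
    few_shot = (
        "判断下列助手回复的行动类型(CLARIFY 或 ANSWER)。\n"
        "回复:Which year are you asking about?\n行动:CLARIFY\n"
        "回复:SELECT Name FROM Singer WHERE Country != 'China'\n行动:ANSWER\n"
    )
    prompt = few_shot + f"回复:{response}\n行动:"
    label = generate(prompt).strip().upper()
    return "CLARIFY" if "CLARIFY" in label else "ANSWER"

if __name__ == "__main__":
    def dummy(prompt: str) -> str:      # 用常量输出代替真实 LLM,仅演示接口
        return "CLARIFY"
    print(classify_action(dummy, "Which region are you asking about?"))
```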

4.2.2. ACT: 行动基于对比自训练 (Action-Based Contrastive Self-training)

ACT 是一种准在线 (quasi-online) 扩展的 DPO 算法,它保持了离线方法的易用性,同时融入了在线学习 (online learning) 中的灵活探索。ACT 包含两个阶段:行动基于对比数据集构建 (action-based contrast dataset construction)对比自训练 (contrastive self-training)

4.2.2.1. 构建偏好数据 (Construction of Preference Data)

算法 1: 构建对比行动对 (Building Contrastive Action Pairs)

  • 输入:
    • $D$:数据集
    • $M$:条件生成模型 (Conditional generation model)
    • $S$:行动空间 (Action Space)
    • $G$:行动标注代理 (Action Annotation Agent)
  • 步骤:
    1. 初始化空数据集 $D_{pref}$。
    2. 对于 $D$ 中每个对话轮次 (conversation turn) $t_i$:
    3. 令 $a_i = G(p_i, r_i)$。 ▷ 推断上下文行动 (Contextual Action)
    4. 令 $a_i' = S \setminus a_i$。 ▷ 确定被拒绝行动 (Rejected Action)
    5. 令 $y_{wi} = r_i$。 ▷ 将真实响应设为获胜响应 (winning response)
    6. 从 $M$ 中采样 $y_{li} \sim P_M(\cdot \mid p_i, a_i')$。 ▷ 采样失败响应 (losing response),该响应基于被拒绝的行动 $a_i'$
    7. 令 $t_i' = (p_i, r_i, g_i, a_i, a_i', y_{wi}, y_{li})$。 ▷ 构建增强的元组
    8. 将 $t_i'$ 添加到 $D_{pref}$。

解释: 这个算法构建了一个包含对比“获胜-失败”行动对 (action pairs) 的偏好数据集 (preference dataset)。对于数据集 $D$ 中的每个对话轮次 $t_i$,它会:

  • 推断真实行动: 使用行动标注代理 (Action Annotation Agent) $G$(例如一个 LLM 或人类标注者)来推断真实响应 $r_i$ 所对应的隐式行动 $a_i$。
  • 确定被拒绝行动: 从行动空间 (Action Space) $S$ 中选择与 $a_i$ 不同的行动 $a_i'$ 作为“被拒绝行动”。
  • 生成失败响应: 使用高能力 LLM (high capacity LLM) 作为条件生成模型 (conditional generation model) $M$,根据原始提示 $p_i$ 和被拒绝行动 $a_i'$ 生成一个“失败响应” $y_{li}$。这样,$y_{li}$ 代表了如果智能体采取了错误的行动会产生的响应。
  • 构建对比元组: 最终形成一个包含原始信息、真实行动、被拒绝行动、获胜响应和失败响应的增强元组 $t_i'$。

野外无标签对话的行动优化 (Action optimization for unlabeled conversations "in-the-wild"): 在某些情况下,可能无法获得黄金标准 (gold-standard) 的歧义标注。在这种设置下,可以使用一个分类器 (classifier) 作为行动标注代理 (Action Annotation Agent) $G$ 来获取伪标签监督 (pseudo-label supervision),而不是依赖人工标注。
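
作为补充,下面是算法 1 的一个最小 Python 示意(非官方实现):annotate_action 与 generate_with_action 为假设的占位函数,分别对应 $G$ 与 $M$,输入的 turns 以字典形式组织:

```python
# 算法 1 的最小示意(非官方实现):为每个对话轮次构建"获胜-失败"行动对。
from typing import Callable, Dict, List

ACTION_SPACE = ["CLARIFY", "ANSWER"]

def build_contrastive_pairs(
    turns: List[Dict[str, str]],                      # 每项含 "prompt"、"response"、"goal"
    annotate_action: Callable[[str, str], str],       # G(p_i, r_i) -> a_i
    generate_with_action: Callable[[str, str], str],  # M(p_i, a_i') -> y_l
) -> List[Dict[str, str]]:
    pairs = []
    for t in turns:
        a_w = annotate_action(t["prompt"], t["response"])   # 推断真实响应隐含的行动 a_i
        a_l = next(a for a in ACTION_SPACE if a != a_w)     # 行动空间中剩下的行动 a_i'
        y_l = generate_with_action(t["prompt"], a_l)        # 按被拒绝行动采样失败响应
        pairs.append({
            "prompt": t["prompt"],        # p_i
            "goal": t["goal"],            # g_i
            "action_w": a_w,              # a_i
            "action_l": a_l,              # a_i'
            "response_w": t["response"],  # y_w = r_i
            "response_l": y_l,            # y_l
        })
    return pairs
```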

4.2.2.2. 使用在线策略对话轨迹模拟进行自训练 (Self-Training Using On-policy Conversation Trajectory Simulation)

算法 2: ACT: 行动基于对比自训练 (Action-Based Contrastive Self-Training)

  • 输入:
    • πθ0\pi_{\theta_0}初始策略模型 (Initial Policy Model)
    • DprefD_{pref}行动对比数据集 (Action Contrast Dataset)
    • BB批次数量 (Number of Batches)
    • AA行动分类器 (Action Classifier)
    • UU用户模拟器 (User Simulator)
    • HH任务启发式 (Task Heuristic)
    • ϵ\epsilon启发式容忍度 (Heuristic Tolerance)
  • 步骤:
    1. 对于从 DprefD_{pref} 中采样的批次 bb 中每个对话轮次 (conversation turn) tjt_j (其中 0jB0 \leq j \leq B): 2. 从当前模型策略中采样一个响应:yjπθj(pj)y_j \sim \pi_{\theta_j}(\cdot | p_j)。 3. 如果行动 A(yj)ajA(y_j) \neq a_j\vartriangleright 隐式实用行动 (pragmatic action) 与真实情况不匹配 4. 设置 ylj=yjy_{lj} = y_j\vartriangleright 将当前采样响应 yjy_j 作为失败响应 (losing response) 5. 否则(行动 A(yj)=ajA(y_j) = a_j): 6. 初始化轨迹 (Trajectory),将 yjy_j 添加到轨迹 (Trajectory) 中。 7. 当 A(yk)ANSWERA(y_k) \neq \text{ANSWER} 时:\vartriangleright 模拟用户澄清 8. 澄清答案 (Clarification Answer) =U(pk,yk)= U(p_k, y_k)。 9. 将澄清答案 (Clarification Answer) 添加到轨迹 (Trajectory) 中。 10. yk+1=πθj(pk,Clarification Answer)y_{k+1} = \pi_{\theta_j}(\cdot | p_k, \text{Clarification Answer})\vartriangleright 模拟下一个策略响应 11. 将 yk+1y_{k+1} 添加到轨迹 (Trajectory) 中。 12. 如果 H(Trajectory outcome,gj)>ϵH(\text{Trajectory outcome}, g_j) > \epsilon: 13. 令 ywj=Trajectoryy_{wj} = \text{Trajectory}\vartriangleright 奖励可接受的轨迹结果,将其设为获胜响应 (winning response) 14. 否则: 15. 令 ylj=Trajectoryy_{lj} = \text{Trajectory}\vartriangleright 惩罚不好的轨迹结果,将其设为失败响应 (losing response) 16. θUpdate(θ)\theta \leftarrow \text{Update}(\theta) 直到收敛(使用 DPO 目标函数)。
    2. 输出: πθB\pi_{\theta_B}

解释: 这个算法是 ACT 的核心,它是一个准在线 (quasi-online) 的训练过程。

  • 在线策略采样: 在每个训练批次中,模型 $\pi_{\theta_j}$ 会根据当前提示 (prompt) $p_j$ 采样一个响应 $y_j$。

  • 行动匹配检查: 使用行动分类器 (Action Classifier) $A$ 来判断采样响应 $y_j$ 的隐式行动 (implicit action) 是否与真实行动 $a_j$ 匹配。

    • 行动不匹配: 如果不匹配,则说明模型采取了错误的实用行动 (pragmatic action)(例如,应该澄清却直接回答),此时将采样响应 $y_j$ 直接作为失败响应 (losing response),以惩罚这种错误。
    • 行动匹配: 如果匹配,则会进入多轮轨迹模拟 (multi-turn trajectory simulation) 阶段。
  • 轨迹模拟:

    • 使用用户模拟器 (User Simulator) $U$ 模拟用户对模型响应的后续回复(特别是当模型提出澄清问题 (clarification question) 时)。
    • 模型 $\pi_{\theta_j}$ 会继续生成后续响应,直到其行动变为 ANSWER(即尝试回答用户原始问题)。
    • 整个交互序列构成了从模型自身策略生成的对话轨迹 (conversation trajectory)。
  • 轨迹结果评估: 使用任务启发式 (Task Heuristic) $H$ 评估模拟轨迹的最终结果(例如,答案的语义相似度、SQL 执行结果的正确性)与用户原始目标 $g_j$ 的匹配程度。

    • 轨迹成功: 如果结果满足任务启发式,则整个模拟轨迹被视为获胜响应 (winning response)。
    • 轨迹失败: 否则,整个模拟轨迹被视为失败响应 (losing response)。
  • DPO 更新: 根据动态生成的获胜/失败响应 (winning/losing response) 对,使用 DPO 目标函数更新模型参数 $\theta$。

    Figure A5 展示了一个轨迹级内容评估的示例。

    Figure A5 | Trajectory-level content evaluation using the example scenario from Figure 1. Trajectory-level evaluation seeks to measure the extent to which a candidate LLM can interact with a "User" to reach a target information goal. The "interactive" evaluation of a given instance continues until the candidate LLM attempts to resolve the User's request by providing a direct answer. The candidate trajectory resolution is scored using downstream task metrics. In this example, DROP F1 is used following the task metrics for PACIFIC. 该图像是示意图,展示了评估模型如何通过用户模拟器进行多轮对话的轨迹级内容评估。图中描述了用户提问、模型的澄清问题及其最终答案的过程,并通过任务指标(如 DROP F1)来评分轨迹分数。

Figure A5 | 使用 Figure 1 示例场景进行轨迹级内容评估。轨迹级评估旨在衡量候选 LLM 与“用户”交互以达到目标信息目标的能力。给定实例的“交互式”评估持续进行,直到候选 LLM 尝试通过提供直接答案来解决用户的请求。候选轨迹的最终解答使用下游任务指标进行评分。在此示例中,根据 PACIFIC 的任务指标,使用 DROP F1。
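
下面给出算法 2 主循环的一个最小 Python 示意(非官方实现),沿用上一段示意中字典形式的对比配对;各回调函数(在线采样、行动分类、用户模拟、任务启发式、DPO 更新)以及轮数上限 max_clarify_turns 均为示例性假设,原文并未给出这些具体接口:

```python
# 算法 2 的最小示意(非官方实现):ACT 的准在线自训练循环。
# pair 为字典,含 "prompt"、"goal"、"action_w"、"response_w"、"response_l" 等键。
from typing import Callable, Dict, List

def act_training_epoch(
    pairs: List[Dict[str, str]],
    sample_response: Callable[[str], str],        # 从当前策略 pi_theta 在线采样
    classify_action: Callable[[str], str],        # 行动分类器 A,返回 "CLARIFY" 或 "ANSWER"
    simulate_user: Callable[[str, str], str],     # 用户模拟器 U(上下文, 澄清问题) -> 用户回答
    heuristic: Callable[[str, str], float],       # 任务启发式 H(轨迹结果, 目标)
    dpo_update: Callable[[str, str, str], None],  # 用 (prompt, y_w, y_l) 做一次 DPO 更新
    epsilon: float = 0.5,
    max_clarify_turns: int = 3,                   # 假设的澄清轮数上限,原文未给出具体数值
) -> None:
    for pair in pairs:
        y = sample_response(pair["prompt"])
        if classify_action(y) != pair["action_w"]:
            # 隐式行动与真实行动不匹配:直接把在线采样结果作为失败响应
            pair["response_l"] = y
        else:
            # 行动匹配:模拟多轮轨迹,直到模型尝试给出最终回答
            trajectory = [y]
            context = pair["prompt"] + "\n" + y
            for _ in range(max_clarify_turns):
                if classify_action(trajectory[-1]) == "ANSWER":
                    break
                user_reply = simulate_user(context, trajectory[-1])  # 模拟用户对澄清问题的回答
                next_turn = sample_response(context + "\n" + user_reply)
                trajectory += [user_reply, next_turn]
                context += "\n" + user_reply + "\n" + next_turn
            outcome = "\n".join(trajectory)
            if heuristic(outcome, pair["goal"]) > epsilon:
                pair["response_w"] = outcome   # 可接受的轨迹结果设为获胜响应
            else:
                pair["response_l"] = outcome   # 不好的轨迹结果设为失败响应
        dpo_update(pair["prompt"], pair["response_w"], pair["response_l"])
```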

4.2.2.3. 对比强化学习对齐微调 (Contrastive RL Tuning for Alignment)

在通过模拟构建了最新的获胜响应 (winning response) $y_{wi}$ 和失败响应 (losing response) $y_{li}$ 对之后,本文使用 DPO (Direct Preference Optimization) 训练目标来更新策略模型 $\pi_\theta$ (Rafailov et al., 2024)。

DPO 损失函数 (DPO Training Objective)(为简化表示,此处忽略迭代器 $i$):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid p)}{\pi_{ref}(y_w \mid p)} - \beta \log \frac{\pi_\theta(y_l \mid p)}{\pi_{ref}(y_l \mid p)}\right)\right]$$

其中:

  • $p$ 是一个提示 (prompt),由任务信息 (task info) 和对话历史 (conversation history) 拼接而成,形式为 $\{x_1, y_1, ..., x_{i-1}, y_{i-1}, x_i\}$。
    • $x_i$ 代表在第 $i$ 轮观察到的用户端话语 (user-side utterance)。
    • $y_i$ 代表在第 $i$ 轮观察到的系统端话语 (system-side utterance)。
  • $y_w$ 和 $y_l$ 分别是在 4.2.2.2 节中确定的“获胜”和“失败”响应或轨迹 (trajectories)。
  • $\pi_{ref}$ 是初始参考策略模型 (initial reference policy model)。
  • $\beta$ 是一个超参数 (hyperparameter),用于正则化 $\pi_\theta$ 与 $\pi_{ref}$ 之间的比率。
  • $\sigma(\cdot)$ 是 sigmoid 函数,$\sigma(x) = \frac{1}{1 + e^{-x}}$。
  • $\mathcal{D}$ 是偏好数据集 (preference dataset),包含了 $(p, y_w, y_l)$ 对。
  • $\mathbb{E}$ 表示期望。

DPO 损失函数的梯度 (Gradient of the DPO Loss Function):

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref}) = -\beta\, \mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w)\big)\big[\nabla_\theta \log \pi_\theta(y_w \mid p) - \nabla_\theta \log \pi_\theta(y_l \mid p)\big]\Big]$$

其中,隐式奖励模型 (implicitly defined reward model) 定义为 $\widehat{R}_\theta(p, y) = \beta \log \frac{\pi_\theta(y \mid p)}{\pi_{ref}(y \mid p)}$,正如 Rafailov et al. (2024) 中所证明的。

直觉 (Intuition): DPO 目标函数的直觉是,损失函数的梯度会增加获胜响应 $y_w$ 的似然 (likelihood),并减少失败响应 $y_l$ 的似然 (likelihood)。每个样本的权重由隐式奖励模型对配对响应排名错误的程度决定。如果模型错误地认为失败响应比获胜响应更好(即 $\widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w)$ 为正且较大),那么梯度会更大,从而更强烈地调整模型,使其偏好 $y_w$ 而非 $y_l$。
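
按照公式 1,下面是 DPO 损失的一个最小 PyTorch 示意(假设已分别算出获胜/失败响应在策略模型与参考模型下的序列对数概率;并非论文官方训练代码):

```python
# DPO 损失的最小示意:输入为四个形状相同的张量(各序列的对数概率之和)。
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,   # log pi_theta(y_w | p)
             policy_logp_l: torch.Tensor,   # log pi_theta(y_l | p)
             ref_logp_w: torch.Tensor,      # log pi_ref(y_w | p)
             ref_logp_l: torch.Tensor,      # log pi_ref(y_l | p)
             beta: float = 0.1) -> torch.Tensor:
    # 隐式奖励:R(p, y) = beta * log(pi_theta(y|p) / pi_ref(y|p))
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # -log sigmoid(R_w - R_l):提升获胜响应、压低失败响应的似然
    return -F.logsigmoid(reward_w - reward_l).mean()

# 用法示例:四个张量形状均为 [batch],由对 (p, y_w)、(p, y_l) 各做一次前向得到
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-15.0]))
```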

5. 实验设置

ACT 是一种样本高效 (sample-efficient) 的方法,用于使 LLM 适应对话行动策略 (conversational action policy)。本文主要关注提升 LLM 隐式选择 (implicit selection) 代理端澄清问题 (clarification question) 的能力。因此,研究人员在三个复杂的对话信息寻求任务 (conversational information-seeking tasks) 中评估 ACT 作为一种微调方法。实验使用的基础模型是 Zephyr 7B-β,它是 Mistral 7B (Jiang et al., 2023) 的一个版本,经过 UltraChat 的指令微调,并与 UltraFeedback (Cui et al., 2023; Ding et al., 2023; Tunstall et al., 2023) 的人类偏好对齐。

5.1. 数据集

本文调查了三个混合主导对话任务 (mixed-initiative conversation tasks),用户在其中与助手互动以检索信息。在每个任务设置中,用户提出一个可能模糊的查询。助手的任务是提供一个响应,该响应可以是澄清问题 (clarifying question),也可以是直接回答用户查询的尝试。对于每个任务,初始的被拒绝响应 (rejected responses) 是通过提示 Gemini Ultra 作为条件生成模型 (conditional generation model) $M$ 来合成的。

ACT 在各种领域的多样化数据集上进行评估:表格型对话问答 (tabular conversational QA)、机器阅读理解的对话问答 (conversational QA for machine reading comprehension) 和对话式文本到 SQL 生成 (conversational text-to-SQL generation)。

5.1.1. PACIFIC: 表格数据的对话问答

PACIFIC 是一个主动式对话问答 (proactive conversational question answering) 任务,其基础是表格 (tabular) 和文本 (textual) 金融数据的混合 (Deng et al., 2022)。这可能涉及从给定范围、多个范围生成正确单词,或提供正确的算术表达式。PACIFIC 的官方评估使用以数字为中心的词元重叠度量 (numeracy-focused token overlap metric),称为 DROP F1。

Figure 1 是 PACIFIC 任务的一个简化示例,展示了模糊性如何导致代理需要提出澄清问题。表格 A17、A18 和 A19 分别展示了在 PACIFIC 任务中使用标准提示 (Standard Prompting)、思维链提示 (Chain-of-Thought Prompting) 和主动混合主导提示 (Proactive Mixed-Initiative Prompting) 的上下文示例。

5.1.2. Abg-CoQA: 机器阅读理解的对话问答

Abg-CoQA 是一个用于机器阅读理解 (machine reading comprehension) 中歧义消除 (disambiguation) 的对话问答数据集 (conversational question answering dataset) (Guo et al., 2021)。由于没有算术表达式,本文使用基于 SentenceBERT (Reimers & Gurevych, 2019) 的嵌入语义距离 (embedding-based semantic distance) 作为评估指标,这已被用于更灵活地衡量问答性能 (Risch et al., 2021)。

Figure A4 展示了 Abg-CoQA 中一个对比配对的例子。

5.1.3. AmbigSQL: 模糊对话式文本到 SQL 生成

AmbigSQL 是本文提出的一个新任务,用于 SQL 基础的对话式歧义消除 (SQL-grounded conversational disambiguation)。研究人员系统地扰动了 Spider(一个流行的单轮文本到 SQL 基准 (text-to-SQL benchmark),Yu et al., 2018)中的无歧义查询,得到了可以轻松纳入对比 RL 微调 (contrastive RL tuning) 的配对训练示例。每个轨迹 (trajectory) 都通过最终提出的 SQL 查询是否与真实查询的执行结果 (execution result) 匹配来评估。

AmbigSQL 的动机: 歧义消除可以提高任务性能。研究人员提示 LLM 引入三种类型的模糊信息请求:

  1. 请求信息模糊 (requested information is ambiguous)(例如:"Show details about singers ordered by age from the oldest to the youngest")。
  2. 请求群体模糊 (requested population is ambiguous)(例如:"Which ones who live in the state of Indiana?";参见 Table A12)。
  3. 请求结果展示模糊 (requested presentation of results is ambiguous)(例如:"Show name, country, age for all singers ordered by age";参见 Table A13)。

研究发现,在欠详尽请求 (underspecified request) 下构建 SQL 查询,有无澄清的性能差距高达 45.8%(参见 Table A14),这表明澄清问题的必要性。

Table A10 | AmbigSQL 概览,一个从 Spider 合成的模糊文本到 SQL 数据集 (Text-to-SQL dataset)。

Train Dev Test
Num. Unambiguous Requests 7,000 1,034 1,034
Num. Ambiguous Requests 7,000 1,034 1,034
Num. Unique Schemas 1,056 145 145
Types of Ambiguity 3 3 3

Table A12 | 上下文示例,作为提示的一部分,用于创建目标群体模糊的信息请求。黑色文本的格式表示如何使用真实请求来形成目标示例的提示。蓝色文本表示将从 LLM 合成的内容。论文中省略了数据库 schema(完整内容见 6.2 节)。

Table A13 | 上下文示例,作为提示的一部分,用于创建目标列模糊的信息请求。黑色文本的格式表示如何使用真实请求来形成目标示例的提示。蓝色文本表示将从 LLM 合成的内容。论文中省略了数据库 schema(完整内容见 6.2 节)。

5.2. 评估指标

本文从两个维度评估 ACT 在对话中推理歧义 (reason about ambiguity in conversation) 以更好地实现对话目标 (conversational goals) 的能力:

5.2.1. 代理任务性能 (Agent task performance)

评估 ACT 是否能改善多轮任务完成 (multi-turn task completion) 能力。

  • 轮次级评估 (Turn-level evaluation): 将模型响应与用户查询的真实话语进行比较,使用 4.1 节中定义的任务特定启发式 (task-specific heuristics)。
  • 多轮评估方案 (Multi-turn evaluation scheme) / 轨迹级评估 (Trajectory-level evaluation):
    • 当 LLM 采样响应是澄清问题 (clarifying question) 时,模拟用户响应,并再次从 LLM 采样另一个响应,直到其尝试回答原始查询。
    • 将此结果与用户的真实信息寻求目标 (ground truth information-seeking goal) 进行评估。
    • 使用 $A$(行动分类器)和 $U$(用户模拟器)进行模拟,并使用 4.1 节中定义的启发式。Figure A5 展示了一个示例。
  • 澄清后性能 (Post-Clarification Performance): 在 PACIFIC 和 AmbigSQL 中,还计算模型在之前提出澄清问题 (clarifying questions) 的模拟响应上的任务性能,以更精细地衡量模型推理自身澄清问题 (clarification questions) 的能力。

具体内容层面评估指标:

  • Turn-level DROP F1: PACIFIC 任务的平均即时响应 DROP F1。
    • 概念定义: DROP F1 (Deng et al., 2022; Dua et al., 2019) 是一个用于评估问答系统抽取式答案准确性的指标,衡量模型生成的答案与真实答案之间的词元 (token) 重叠度。它对数字和实体等关键信息尤其敏感。
    • 数学公式:
      $$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{\text{正确预测的词元数量}}{\text{模型生成的所有词元数量}}, \quad \mathrm{Recall} = \frac{\text{正确预测的词元数量}}{\text{真实答案中所有词元数量}}$$
    • 符号解释:
      • $\mathrm{F1}$:F1 分数,精确率和召回率的调和平均值。
      • $\mathrm{Precision}$:精确率,模型预测正确部分的比例。
      • $\mathrm{Recall}$:召回率,真实答案中被模型正确预测的比例。
  • Trajectory-level DROP F1: PACIFIC 任务的平均轨迹结果 DROP F1。
  • Post-Clarification DROP F1: PACIFIC 任务的澄清后 DROP F1,即仅针对包含代理澄清轮次的轨迹的 Trajectory-level DROP F1。
  • Turn Similarity: Abg-CoQA 任务的即时响应嵌入相似度 (Immediate response embedding similarity)。
    • 概念定义: Embedding-based Semantic Distance (Risch et al., 2021) 使用句子的嵌入向量来衡量模型生成答案与真实答案之间的语义相似度。它比简单的词元重叠度量更能捕捉文本的意义,允许生成更多样化但语义等效的答案。
    • 数学公式: 通常使用余弦相似度 (Cosine Similarity),对于两个嵌入向量 $A$ 和 $B$:
      $$\mathrm{Similarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$
    • 符号解释:
      • $A, B$:两个待比较句子的嵌入向量。
      • $\cdot$:向量的点积。
      • $\|\cdot\|$:向量的欧几里得范数(L2 范数)。
  • Trajectory Similarity: Abg-CoQA 任务的轨迹结果嵌入相似度 (Trajectory outcome embedding similarity)。
  • Trajectory-level Execution Match: AmbigSQL 任务的轨迹结果执行匹配百分比 (Percentage of trajectory outcomes with correct execution results)。
    • 概念定义: Execution Match 是衡量文本到 SQL (Text-to-SQL) 任务中模型生成 SQL 查询准确性的客观指标。如果模型生成的 SQL 查询在数据库中执行后,其结果与真实 SQL (ground truth SQL) 查询的执行结果完全一致,则认为匹配成功。
  • Post-Clarification Execution Match: AmbigSQL 任务的澄清后执行匹配百分比,即仅统计包含澄清轮次的轨迹中执行结果正确的比例。

5.2.2. 隐式模糊性识别 (Implicit ambiguity recognition)

为了进一步理解代理的多轮任务完成能力 (multi-turn task completion ability),研究人员考虑了“对话行为准确率 (dialogue act accuracy)” (Chen et al., 2023b)。假设可以访问真实模糊性标签 (ground-truth ambiguity label),如果一个请求是真正模糊的,模型应该生成一个澄清问题 (clarification question);否则,它应该尝试提供请求的信息。由于 PACIFIC 和 Abg-CoQA 的类别高度不平衡,主要考虑 Macro F1 作为指标。

具体行动层面评估指标:

  • Accuracy: 正确隐式行动 (correct implicit actions) 的百分比。
    • 概念定义: 衡量模型在给定对话上下文下,其隐式选择的对话行动(如“澄清”或“回答”)与真实标注行动一致的比例。
    • 数学公式:
      $$\mathrm{Accuracy} = \frac{\text{正确预测的行动数量}}{\text{总行动数量}}$$
  • Macro F1: 各行动 F1 的未加权平均 (Unweighted Average)。
    • 概念定义: 当类别不平衡时,Macro F1 比加权平均更公平地反映模型在每个类别上的性能,它计算每个类别的 F1 分数,然后取所有类别 F1 分数的平均值。
    • 数学公式:
      $$\mathrm{Macro\ F1} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{F1}_c$$
      其中,$\mathrm{F1}_c$ 是类别 $c$ 的 F1 分数,$N$ 是类别的总数。

5.3. 对比基线

5.3.1. 提示基线 (Prompting baselines)

本文将 ACT 微调的小模型与多种前沿 LLM (frontier LLMs) 的基于提示的方法 (prompt-based approaches) 进行比较:

  • Gemini 1.5 Pro
  • Gemini 1.5 Flash
  • Claude 3.5 Sonnet
  • Claude 3.0 Haiku

所有提示基线使用 10 个对话作为上下文示例 (in-context examples),采用三种不同的提示框架:

  1. Standard (标准提示): 使用与微调相同的指令格式。Table A17 是 PACIFIC 的一个示例。
  2. Chain-of-Thought (思维链): 结合 Wei et al. (2022) 提出的思维链推理 (chain-of-thought reasoning)。Table A18 是 PACIFIC 的一个示例。
  3. Proactive MIPrompt (主动混合主导提示): Deng et al. (2023c) 中的提示基线,结合了 Chen et al. (2023b) 的混合主导提示方法 (mixed-initiative prompting approach) 和 Deng et al. (2023b) 的主动提示 (Proactive Prompting)。Table A19 是 PACIFIC 的一个示例。

5.3.2. 微调基线 (Tuning baselines)

  • Supervised Fine-tuning (SFT): 使用每个数据集训练集的真实响应 (ground truth responses) 进行微调。
  • Iterative Reasoning Preference Optimization (IRPO): 一种最近提出的在线策略 DPO (on-policy DPO) 变体,在算术等推理任务中受到关注。本文在 PACIFIC 和 AmbigSQL 这两个定量推理任务 (quantitative reasoning tasks) 上评估 IRPO。
  • DPO-Dist (DPO Distillation): 一种流行的离线 DPO (off-policy DPO) 方法,其中获胜响应 $y_w$ 来自能力更强的模型,而失败响应 $y_l$ 来自能力较弱的模型 (Mitra et al., 2023; Mukherjee et al., 2023; Xu et al., 2024a)。相关结果在附录 B 中提供。

6. 实验结果与分析

为了模拟数据有限的真实世界场景,本文在不同数据量的对话样本设置下,对 ACT 作为一种微调方法进行了评估。基础模型为 Zephyr 7B-β。

6.1. 核心结果分析

6.1.1. 表格型问答 (PACIFIC 任务)

以下是原文 Table 1 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Gemini Pro Standard ICL 10 81.4 59.7 58.7 49.7
Claude Sonnet Standard ICL 10 71.9 43.7 42.0 28.5
Gemini Pro SFT 50 71.2 51.8 45.7 9.9
Gemini Pro SFT 100 75.2 64.3 54.6 8.5
Gemini Pro SFT 250 88.0 67.4 59.3 10.2
Zephyr 7B-β SFT 50 69.0 57.8 61.3 43.5
Zephyr 7B-β IRPO 50 67.7 59.1 56.7 34.4
Zephyr 7B-β ACT (ours) 50 82.2 62.8 61.9 57.2
Zephyr 7B-β SFT 100 82.3 58.6 60.3 49.9
Zephyr 7B-β IRPO 100 84.5 60.4 55.2 38.2
Zephyr 7B-β ACT (ours) 100 86.0 65.0 62.0 57.4
Zephyr 7B-β SFT 250 86.9 65.1 63.3 56.7
Zephyr 7B-β IRPO 250 85.4 64.9 58.4 40.3
Zephyr 7B-β ACT (ours) 250 89.6 68.1 65.7 62.0
  • ACT 性能领先: Table 1 显示,在所有三种数据效率设置 (data-efficient settings) 下,ACT 在所有指标上均优于 SFTIRPO。值得注意的是,IRPO 具有额外的测试时计算 (test-time computation) 优势 (Snell et al., 2024; Pang et al., 2024),但 ACT 仍能超越。
  • 模糊性识别能力显著提升: 在仅有 50 个对话作为微调数据的情况下,ACT 在衡量模型隐式识别模糊性 (implicitly recognize ambiguity) 的能力方面,相比 SFT 实现了高达 19.1% 的相对提升(从 69.0 Macro F1 提高到 82.2)。
  • 数据效率: ACT 展现出比基于适配器 (adapter-based) SFTGemini Pro 更高的数据效率。在多轮任务性能方面(轨迹级 DROP F1),相对提升高达 35.7%(从 45.6 提高到 61.9)。
  • 媲美或超越前沿 LLM 在数据有限的设置下,通过 ACT 微调的模型在推理时即使没有上下文示例 (in-context examples),也能达到或超越使用上下文学习 (in-context learning, ICL) 的前沿 LLM 性能。这强调了在线策略学习 (on-policy learning)多轮轨迹模拟 (multi-turn trajectory simulation) 对改进多轮任务完成的关键作用。

6.1.2. 机器阅读理解 (Abg-CoQA 任务)

以下是原文 Table 2 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro Standard ICL 10 55.5 67.0 72.2
Claude Sonnet Standard ICL 10 66.0 50.1 54.3
Zephyr 7B-β SFT 50 44.6 53.3 64.2
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Zephyr 7B-β SFT 100 52.6 63.1 69.4
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Zephyr 7B-β SFT 250 53.5 64.0 66.2
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1
  • 任务特定指标表现最佳: Table 2 显示,在所有三种数据设置下,ACT 在任务特定指标(特别是轨迹级嵌入相似度 (trajectory-level embedding similarity))方面表现最佳,这表明其在多轮推理方面的改进。
  • 行动层面与内容层面的权衡: 在 100 和 250 个对话的设置中,SFT 微调的 Zephyr隐式行动识别 (implicit action recognition) 方面略优于 ACTMacro F1),但这并不意味着整体性能更优。作者在附录 A 中进一步讨论了这一点,强调行动层面的性能主要有助于理解澄清推理能力 (clarification reasoning ability)
  • 多轮推理能力改进: ACT 在所有条件下均带来了最强的轮次级 (turn-level)轨迹级任务性能 (trajectory-level task performance),这表明其改进了多轮推理能力。

6.1.3. 对话式文本到 SQL 生成 (AmbigSQL 任务)

以下是原文 Table 3 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Accuracy ↑ Macro F1 ↑ Execution Match ↑ PC Execution Match ↑
Gemini Pro Standard ICL 10 72.1 70.9 63.5 75.2
Claude Sonnet Standard ICL 10 68.5 63.8 66.5 72.4
Zephyr 7B-β SFT 50 77.4 77.4 21.9 13.9
Zephyr 7B-β IRPO 50 91.0 91.0 27.8 30.8
Zephyr 7B-β ACT (ours) 50 80.8 80.7 43.6 38.1
Zephyr 7B-β SFT 100 97.2 97.2 43.3 34.3
Zephyr 7B-β IRPO 100 96.2 96.1 45.0 37.0
Zephyr 7B-β ACT (ours) 100 99.2 99.3 48.0 49.6
Zephyr 7B-β SFT 250 99.8 99.7 51.0 50.7
Zephyr 7B-β IRPO 250 97.0 97.1 49.7 45.6
Zephyr 7B-β ACT (ours) 250 99.9 99.8 52.3 53.0
Zephyr 7B-β SFT 14,000 (All) 99.8 99.8 63.1 60.4
  • ACT 在任务性能上最强: Table 3 显示,使用 ACT 微调的 Zephyr 在每个数据设置中都能实现最强的任务性能。
  • 澄清后 SQL 执行匹配的显著提升: 当数据资源稀缺时,Post-Clarification SQL Execution Match 的性能提升尤为显著。例如,在 50 个对话的设置中,ACTPC Execution Match 为 38.1%,远高于 SFT 的 13.9% 和 IRPO 的 30.8%。
  • 行动准确率与 SQL 性能的差异: 尽管提示基线在行动准确率 (Action Accuracy) 上可能不高,但被基准测试的前沿 LLM 在执行匹配 (execution match) 方面表现相对较强。相反,SFT 和 ACT 微调的 Zephyr 具有较高的行动准确率 (Action Accuracy),但在文本到 SQL 性能方面低于前沿 LLM。这主要归因于 SQL 生成任务能从更大的模型规模中显著获益 (Sun et al., 2023b)。
  • ACT 在多轮任务性能上的相对优势: 总体而言,ACT多轮任务性能 (multi-turn task performance) 方面实现了最大的相对性能提升,甚至超过了 IRPO 等用于定量推理的基线方法。这表明,如果将 ACT 应用于更大的模型,可能会进一步提升其多轮性能。

6.1.4. ACT 在野外:无对话行动监督的学习

以下是原文 Table 4 的结果:

Task Adaptation Environment Action-level Content-level
Base Model Framework Action Supervision Tuning Ex. Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Zephyr 7B-β SFT NA 50 69.0 57.8 61.3 43.5
Zephyr 7B-β ACT Crowdsourced 50 82.2 62.8 61.9 57.2
Zephyr 7B-β ACT Pseudo-labeled 50 80.1 62.4 61.1 54.7
Zephyr 7B-β SFT NA 100 82.3 58.6 60.3 49.9
Zephyr 7B-β ACT Crowdsourced 100 86.0 65.0 62.0 57.4
Zephyr 7B-β ACT Pseudo-labeled 100 84.8 63.5 61.5 56.1
Zephyr 7B-β SFT NA 250 86.9 65.1 63.3 56.7
Zephyr 7B-β ACT Crowdsourced 250 89.6 68.1 65.7 62.0
Zephyr 7B-β ACT Pseudo-labeled 250 89.0 68.1 64.9 61.0
  • 伪标签监督的有效性: 尽管在 Table 1-3 中使用了人工标注的 (crowdsourced) 模糊性标签,本文还证明了在没有行动标签监督的情况下执行行动基于微调 (action-based tuning) 的可能性。
  • 高一致性: 使用 Gemini 1.5 Pro 作为零样本行动标注器 (zero-shot action annotator) 重新标记 PACIFIC 语料库中的真实助手端轮次 (Assistant-side turns) 时,与真实行动标签达到了惊人的高一致性(98.5%)。
  • 性能无显著差异: Table 4 显示,无论是行动层面 (Action-level) 还是内容层面 (Content-level) 的指标,使用伪标签 (Pseudo-labeled) 的 ACT 与使用人工标注 (Crowdsourced) 的 ACT 之间几乎没有经验上的性能差异。
  • “野外”场景的潜力: 这凸显了 ACT“野外”设置 (in-the-wild settings) 中,即使只有少量无标签对话数据 (unlabeled conversational data),也可能非常有效。

6.2. 数据呈现 (表格)

以下是原文 Table A6 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro ICL 50 56.4 64.5 68.9
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Gemini Pro ICL 100 59.2 67.0 72.0
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Gemini Pro ICL 250 58.8 66.0 71.1
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

以下是原文 Table A7 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Gemini Pro Standard Prompt 10 81.4 59.7 58.7 49.7
Gemini Pro Chain-of-Thought 10 86.3 66.3 17.1 19.2
Gemini Pro Proactive MIPrompt 10 78.9 63.4 61.1 18.9
Gemini Flash Standard Prompt 10 67.4 58.8 58.7 17.9
Gemini Flash Chain-of-Thought 10 77.1 62.0 16.9 20.0
Gemini Flash Proactive MIPrompt 10 76.8 64.0 62.0 24.4
Claude Sonnet Standard Prompt 10 71.9 43.7 42.0 28.5
Claude Sonnet Chain-of-Thought 10 80.0 37.2 13.0 6.8
Claude Sonnet Proactive MIPrompt 10 74.9 47.2 45.9 7.6
Claude Haiku Standard Prompt 10 46.9 26.4 26.2
Claude Haiku Chain-of-Thought 10 48.6 23.7 12.0 2.9
Claude Haiku Proactive MIPrompt 10 48.3 18.6 18.2 7.3
Gemini Pro SFT 50 71.2 51.8 45.7 9.9
Gemini Pro SFT 100 75.2 64.3 54.6 8.5
Gemini Pro SFT 250 88.0 67.4 59.3 10.2
Zephyr 7B-β SFT 50 69.0 57.8 61.3 43.5
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 75.5 61.7 55.7 30.8
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 74.8 62.0 56.3 31.9
Zephyr 7B-β IRPO 50 67.7 59.1 56.7 34.4
Zephyr 7B-β ACT (ours) 50 82.2 62.8 61.9 57.2
Zephyr 7B-β SFT 100 82.3 58.6 60.3 49.9
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 68.8 53.3 53.3 31.7
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 83.0 59.0 53.7 29.3
Zephyr 7B-β IRPO 100 84.5 60.4 55.2 38.2
Zephyr 7B-β ACT (ours) 100 86.0 65.0 62.0 57.4
Zephyr 7B-β SFT 250 86.9 65.1 63.3 56.7
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 65.6 53.6 54.1 30.9
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 82.8 43.3 38.6 19.6
Zephyr 7B-β IRPO 250 85.4 64.9 58.4 40.3
Zephyr 7B-β ACT (ours) 250 89.6 68.1 65.7 62.0

以下是原文 Table A8 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro Standard Prompt 10 55.5 67.0 72.2
Gemini Pro Chain-of-Thought 10 61.2 63.4 39.1
Gemini Pro Proactive MIPrompt 10 55.5 63.3 33.3
Gemini Flash Standard Prompt 10 52.6 62.5 67.4
Gemini Flash Chain-of-Thought 10 61.2 56.5 36.6
Gemini Flash Proactive MIPrompt 10 58.1 61.7 36.1
Claude Sonnet Standard Prompt 10 66.0 50.1 54.3
Claude Sonnet Chain-of-Thought 10 63.7 46.2 36.8
Claude Sonnet Proactive MIPrompt 10 57.2 60.8 32.9
Claude Haiku Standard Prompt 10 49.3 40.9 41.7
Claude Haiku Chain-of-Thought 10 46.2 30.7 28.0
Claude Haiku Proactive MIPrompt 10 45.2 34.5 31.4
Zephyr 7B-β SFT 50 44.6 53.3 64.2
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 46.9 57.2 61.2
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 44.7 57.9 61.5
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Zephyr 7B-β SFT 100 52.6 63.1 69.4
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 47.8 61.9 67.1
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 44.8 62.0 66.4
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Zephyr 7B-β SFT 250 53.5 64.0 66.2
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 46.0 61.9 66.3
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 46.3 62.6 67.0
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

以下是原文 Table A9 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Accuracy ↑ Execution Match ↑ PC Execution Match ↑
Gemini Pro Standard Prompt 10 72.1 63.5 75.2
Gemini Flash Standard Prompt 10 75.6 64.2 66.2
Claude Sonnet Standard Prompt 10 68.5 66.5 72.4
Claude Haiku Standard Prompt 10 73.8 57.3 65.3
Zephyr 7B-β SFT 50 77.4 21.9 13.9
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 77.7 42.6 31.5
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 78.0 40.9 41.2
Zephyr 7B-β IRPO 50 91.0 27.8 30.8
Zephyr 7B-β ACT (ours) 50 80.8 43.6 38.1
Zephyr 7B-β SFT 100 97.2 43.3 34.3
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 98.7 45.1 45.3
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 99.8 47.8 44.8
Zephyr 7B-β IRPO 100 96.2 45.0 37.0
Zephyr 7B-β ACT (ours) 100 99.2 48.0 49.6
Zephyr 7B-β SFT 250 99.8 51.0 50.7
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 97.3 49.7 44.2
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 99.7 50.7 50.3
Zephyr 7B-β IRPO 250 97.0 49.7 45.6
Zephyr 7B-β ACT (ours) 250 99.9 52.3 53.0
Zephyr 7B-β SFT 14,000 (All) 99.8 63.1 60.4

以下是原文 Table A10 的结果:

Train Dev Test
Num. Unambiguous Requests 7,000 1,034 1,034
Num. Ambiguous Requests 7,000 1,034 1,034
Num. Unique Schemas 1,056 145 145
Types of Ambiguity 3 3 3

以下是原文 Table A11 的结果:

No. Interacting Party Utterance
User Assistant Can you list all the singer ids that aren't present in the song table? SELECT Name FROM singer WHERE Singer_ID NOT IN ...
1 User Assistant Thanks! You should ask at least 3 questions
2 Assistant Did you want the full name of makers and the number?
3 Assistant Do you mean the address of the customer with first name Luis?

以下是原文 Table A12 的结果:

[Database Schema Omitted] The target SQL query is the following:
SELECT professional_id , last_name , cell_number FROM Professionals
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON
T1.professional_id = T2.professional_id
GROUP BY T1.professional_id HAVING count(*) > 2
Here is a clear request that would correspond to this SQL query:
"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List
his or her id, last name and cell phone."
Here is the same request converted into an ambiguous format by underspecifying the target columns:
"Which ones who live in the state of Indiana or have done treatment on more than 2 treatments?"
Here is an appropriate clarifying question to recover the clear request from the ambiguous request:
"Are you asking about the Professionals?"

以下是原文 Table A13 的结果:

[Database Schema Omitted] The target SQL query is the following:
SELECT professional_id , last_name , cell_number FROM Professionals
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON
T1.professional_id = T2.professional_id
GROUP BY T1.professional_id HAVING count(*) > 2
Here is a clear request that would correspond to this SQL query:
"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List
his or her id, last name and cell phone."
Here is the same request converted into an ambiguous format by underspecifying the target columns:
"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?"
Here is an appropriate clarifying question to recover the clear request from the ambiguous request:
"Which information of the professionals do you want to know?"

以下是原文 Table A14 的结果:

Model Ambiguous Request Execution Match Execution Match with Clarification Turns
Gemini Pro 28.5 68.7
Gemini Ultra 31.2 77.0

以下是原文 Table A15 的结果:

Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: Show all singers who are not from China or USA.
Assistant: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA'
Trajectory Goal: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA'
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: Show all singers who are not from China.
Assistant: Which country other than China are you asking about?
Trajectory Goal: User: Show all singers who are not from China and USA. Assistant: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA'

以下是原文 Table A16 的结果:

Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year and Stadium with columns Stadium_ID, Location, Name, Capacity. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: What are the names of the stadiums where concerts were held in 2014?
Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014
Trajectory Goal: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year and Stadium with columns Stadium_ID, Location, Name, Capacity. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: What are the names of the stadiums where concerts were held?
Assistant: Which year are you asking about?
Trajectory Goal: User: What are the names of the stadiums where concerts were held in 2014? Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014

以下是原文 Table A17 的结果:

You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted]
User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
Assistant: ['What kind of defined plans are you asking about?']
User: The defined benefit plans and the defined contribution plan respectively.
Assistant: ['5.1 million', '0.6 million']
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
Assistant: ['What kind of change are you asking about?']
User: The pension.
Assistant: ['(39,145),(39,145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Assistant: ['Which rate of increase and decrease are you talking about?']
User: The future salary growth.
Assistant: ['7,572', '(6,919)']
User: What is the difference between them?
Assistant: 7,572 - (6,919) = 14491

以下是原文 Table A18 的结果:

or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted] You are an Assistant answering questions from a User. You should either attempt to answer the question
User What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
Insrction: I the user' question is biguous, ask an apropriate clarifyi question.Otherwise, dily answer the user's question using the information from the passage context and the table. Let's think step
by step. Reasoning: The user's question is not ambiguous. Assistant: 0.6/5.1 = 11.76
rate? User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount
Insrction: I the user' question is biguous, ask an apropriate clarifyi question.Otherwise, dily
answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question was ambiguous. Assistant: ['What kind of change are you asking about?]
User: The pension.
Instion: Ithe user' question is mbiguous, ask an apropriate clarifyi question.Otherwi, drly
answer the user's question using the information from the passage context and the table. Let's think step
by step.
Reasoning: The user's question is not ambiguous. Assistant: ['(39,145),(39,145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
Insction: I he user' question is mbiguous, askan appropriateclarifyig questio. Otherwis, rey
answer the user's question using the information from the passage context and the table. Let's think step
by step.
Reasoning: The user's question is not ambiguous. Assistant: 49,361 - (39,145) = 88506 User: How much
would the pension change if there is a 1% increase and a 1% decrease respectively?
Instion: Ithe user' question is mbiguous, ask an apropriate clarifyi question.Otherwi, drly
answer the user's question using the information from the passage context and the table. Let's think step
by step.
Reasoning: The user's question was ambiguous.
Assistant: ['Which rate of increase and decrease are you talking about?']

以下是原文 Table A19 的结果:

You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity.
[Table and Passage Omitted]
User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['What kind of defined plans are you asking about?']
User: The defined benefit plans and the defined contribution plan respectively.
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: ['5.1 million', '0.6 million']
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['What kind of change are you asking about?']
User: The pension.
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: ['(39, 145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Actions: ["Directly Answer", "Ask a Clarification Question"]
Prompt: Given the task background and the conversation history, please use appropriate actions to generate the response.
Response: The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['Which rate of increase and decrease are you talking about?']

以下是原文 Table A20 的结果:

[Table and Passage Omitted]
User: What was the total number of acquisitions in FY19?
Assistant: 35 acquisitions
User: What are the benefits of active growth strategy?
Assistant: strengthens and scales the National Storage operating platform which drives efficiencies across
the business.
User: What was the total NLA?
Assistant: Which region are you asking about?
The last Assistant utterance is a clarifying question.

以下是原文 Table A21 的结果:

[Passage Omitted]
User: And who in the group was anxious? Assistant: Peppe
User: Was she well-rested?
Assistant: no
The last Assistant utterance is a direct answer.

以下是原文 Table A22 的结果:

[Database Schema Omitted]
User: "Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?
List his or her id, last name and cell phone."
Here is the same request converted into an ambiguous format by underspecifying the target columns:
Assistant:
SELECT professional_id , last_name , cell_number FROM Professionals
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON
T1.professional_id = T2.professional_id
GROUP BY T1.professional_id HAVING count(*) > 2
The last Assistant utterance is a direct answer.

以下是原文 Table A23 的结果:

The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: What are the names of the stadiums where concerts were held?
Assistant: Which year are you asking about?
User: What are the names of the stadiums where concerts were held in 2014?
Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014
The following is a summary of the information the User is looking for: 1. Names of stadiums where concerts were held in 2014.

以下是原文 Table A24 的结果:

The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you want to know who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.

以下是原文 Table A25 的结果:

Prompt: The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you want to know who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.
The following is a summary of the information the User is looking for: 1. What was his ranking? 2. Whether someone else had horse fighters? 3. Who had a considerable force of horse fighters west of him?

以下是原文 Table A26 的结果:

The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his
ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
[Passage Omitted]
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you mean who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.

以下是原文 Table A27 的结果:

A user is asking an assistant to retrieve some information from a SQL database. The command that the assistant should ultimately return is as follows:
SELECT county FROM campuses where campus = 'California State University-Chico'
The assistant will ask some questions to clarify the user's intent. The user should respond with a rephrased request that reflects their desired query.
User: what is the county?
Assistant: Are you asking for a list of all of the counties in the database?
User: I'm looking for the county of the campus 'California State University-Chico'.
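
Table A27 展示了 AmbigSQL 用户模拟器 (user simulator) 的提示格式:先告知模拟器最终应返回的目标 SQL,再要求它对助手的澄清问题给出改写后的请求。下面是按此格式拼装模拟器提示并调用生成接口的最小示意;`llm_complete` 只是一个假设的占位接口,并非原文实现:

```python
# 最小示意:拼装 AmbigSQL 用户模拟器提示并生成下一轮用户话语(假设性实现)。
from typing import Callable, List, Tuple

def build_user_simulator_prompt(target_sql: str, history: List[Tuple[str, str]]) -> str:
    """history 为 (speaker, utterance) 列表,speaker ∈ {"User", "Assistant"}。"""
    lines = [
        "A user is asking an assistant to retrieve some information from a SQL database. "
        "The command that the assistant should ultimately return is as follows:",
        target_sql,
        "The assistant will ask some questions to clarify the user's intent. "
        "The user should respond with a rephrased request that reflects their desired query.",
    ]
    for speaker, utterance in history:
        lines.append(f"{speaker}: {utterance}")
    lines.append("User:")  # 让模型续写模拟用户的下一轮话语
    return "\n".join(lines)

def simulate_user_turn(target_sql: str, history: List[Tuple[str, str]],
                       llm_complete: Callable[[str], str]) -> str:
    """调用假设的 llm_complete 接口,生成模拟用户的下一轮话语。"""
    return llm_complete(build_user_simulator_prompt(target_sql, history)).strip()
```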

以下是原文 Table A28 的结果:

Relevant Table Data:
  Year: 2019 || 2018
  Total current liabilities: 851 || 167
  Other noncurrent liabilities: 5 || 74
  Total Liabilities: 909 || 1,305
SFT:
  User: What were the total liabilities of IMFT?
  Assistant: \$909 ×
ACT:
  User: What were the total liabilities of IMFT?
  Assistant: Which year are you asking about?
  User: 2018
  Assistant: \$1,305

以下是原文 Table A29 的结果:

Relevant Data:
  Net profit/(loss) after tax was \$(9.8) million
  Net profit/(loss) after tax: (9,819) || 6,639 || (248%)
SFT:
  User: What was the net profit/(loss) after tax?
  Assistant: \$(9,819) ×
ACT:
  User: What was the net profit/(loss) after tax?
  Assistant: Which year are you asking the amount of?
  User: FY19
  Assistant: \$(9,819) ×

以下是原文 Table A30 的结果:

Relevant Table Data:
  Year: 2019 || 2018
  Investments: 1,216.0 || 1,212.9
SFT:
  User: In which year was the amount of Investments higher?
  Assistant: [2018] ×
ACT:
  User: In which year was the amount of Investments higher?
  Assistant: [2018, 2019] ×

以下是原文 Table A31 的结果:

Relevant Passage:
  She yawned, stretching out like a lazy cat. Her sister was also awake. "Lizzie", Meghan whispered, "are you awake?"
SFT:
  User: What did Meghan ask?
  Assistant: Do you mean where did Lizzie see her father?
  User: No
ACT:
  User: What did Meghan ask?
  Assistant: Do you mean that morning or the night before?
  User: The night before.

6.3. 消融实验/参数分析

以下是原文 Table 5 的结果:

| | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- |
| Action Importance | | | | |
| ACT w/ Random Actions | 63.2 | 55.3 | 58.7 | 32.8 |
| Ablation of ACT subcomponents | | | | |
| ACT w/o on-policy sampling | 74.8 | 61.5 | 59.1 | 40.5 |
| ACT w/ sampling but w/o simulation | 81.4 | 60.8 | 60.2 | 50.1 |
| ACT (full) | 82.2 | 62.8 | 61.9 | 57.2 |
| ACT with unaligned foundation models | | | | |
| Gemma 2B SFT | 57.7 | 38.0 | 40.5 | 17.0 |
| Gemma 2B ACT | 62.7 | 42.6 | 44.0 | 24.8 |
| Mistral 7B SFT | 57.7 | 53.8 | 51.4 | 27.7 |
| Mistral 7B ACT | 75.7 | 58.1 | 57.6 | 31.9 |

Table 5 展示了在 PACIFIC 的 50 条对话设置下进行的消融研究,以理解 ACT 各组件的重要性。

  • 行动基于偏好是否必要? (Are action-based preferences necessary?)

    • 分析: ACT 的关键因素之一是对比对 (contrastive pairs) 突出了对话行动 (conversational actions) 之间的差异。当构建偏好对时,如果随机采样获胜行动 (winning action)失败行动 (losing action)(如 Table 5 中的 “ACT w/ Random Actions”),性能会显著下降。例如,Macro F1ACT 的 82.2% 降至 63.2%,Post-Clarify F1 从 57.2% 降至 32.8%。
    • 结论: 这表明明确的行动选择 (action selection) 对于 ACT 的有效性至关重要(行动对比偏好对与轨迹模拟的构建示意见本节列表之后的代码)。
  • 是否需要在线策略采样? (Do we need on-policy sampling?)

    • 分析: Table 5 中的 “ACT w/o on-policy sampling” 实验评估了在线策略采样 (on-policy sampling) 的重要性,它本质上是在 4.2.2.1 节构建的数据集上进行普通的离线 DPO (off-policy DPO) 训练。
    • 结果: 尽管相比 SFT 有一些改进(例如 Macro F1 从 69.0 提高到 74.8),但与完整的 ACT 相比,整体改进幅度要小得多。
    • 结论: 这可能是因为离线负响应 (off-policy negative responses) 不保证位于策略模型 (policy model)语言流形 (language manifold) 中,导致分布漂移 (distribution shift) 难以通过离线学习克服 (Guo et al., 2024)。在线策略采样ACT 中起着关键作用。
  • 轨迹模拟是否必要? (Is trajectory simulation necessary?)

    • 分析: ACT 通过其在线策略轨迹模拟 (on-policy trajectory simulation) 更好地与多轮对话 (multi-turn conversations) 对齐。如果移除多轮模拟(如 Table 5 中的 “ACT w/ sampling but w/o simulation”),该方法类似于在线策略 DPO 变体 (on-policy DPO variants)(如 Pang et al., 2024),但带有考虑对话行动 (conversation actions)任务启发式 (task heuristics)对话特定奖励信号 (conversation-specific reward signal)
    • 结果: 发现轨迹级模拟 (trajectory-level simulation) 对于改进多轮性能 (multi-turn performance) 至关重要,尤其是策略模型 (policy model) 对自身澄清问题 (clarification questions) 的推理能力。例如,Post-Clarify F1 从 57.2% 降至 50.1%。
    • 结论: 这强调了多轮交互 (multi-turn interaction)结果评估 (outcome evaluation)ACT 中的核心地位。
  • ACT 是否与模型无关? (Is ACT model agnostic?)

    • 分析: 主要实验中的基础模型 Zephyr 是通过对 Mistral 进行对齐得到的。 Table 5 中的 “ACT with unaligned foundation models” 部分展示了使用未对齐的基础模型 Gemma 2BMistral 7B 的结果。
    • 结果: 尽管这两个模型在经过 ACT 微调后,其行动 F1 (Action F1)轨迹 F1 (Trajectory F1) 仍存在性能差距(例如 Gemma 2B 提高了 5.0 Action F1Mistral 7B 提高了 18.0 Action F1),但结果表明 ACT 无论是否存在人类反馈的预对齐,都能提升性能。
    • 结论: 这意味着 ACT 具有模型无关性 (model agnostic),可以改进任何基础模型的性能,尽管更好的模型初始化(预对齐)可能会带来进一步的益处。
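
为更直观地说明上述消融中涉及的两个核心组件——行动对比偏好对 (action-based contrastive pairs) 与在线策略轨迹模拟 (on-policy trajectory simulation)——下面给出一个最小的 Python 示意。其中 `policy_generate`、`simulate_user`、`task_heuristic_success` 均为假设接口,整个流程只是对 ACT 思路的粗略勾勒,并非原文实现;`random_actions=True` 粗略对应表中 “ACT w/ Random Actions” 的设定,`simulate_trajectory=False` 则对应 “w/ sampling but w/o simulation”:

```python
# 最小示意:构建行动对比偏好对,并用多轮轨迹模拟筛选 losing response(假设性实现)。
import random
from typing import Callable, Dict

ACTIONS = ["CLARIFY", "ANSWER"]

def opposite(action: str) -> str:
    return "ANSWER" if action == "CLARIFY" else "CLARIFY"

def build_preference_pair(
    context: str,
    gold_action: str,                               # 数据集(或伪标签)给出的参考行动
    gold_response: str,                             # 参考回复,作为 winning response
    policy_generate: Callable[[str, str], str],     # (context, action) -> response,on-policy 采样
    simulate_user: Callable[[str, str], str],       # (context, response) -> 模拟用户的下一轮话语
    task_heuristic_success: Callable[[str], bool],  # 任务特定启发式:轨迹是否达成目标
    random_actions: bool = False,                   # 粗略对应 "ACT w/ Random Actions" 消融
    simulate_trajectory: bool = True,               # 置 False 则对应 "w/ sampling but w/o simulation"
) -> Dict[str, str]:
    lose_action = random.choice(ACTIONS) if random_actions else opposite(gold_action)

    # on-policy 采样:用当前策略在对比行动下生成候选回复
    candidate = policy_generate(context, lose_action)

    rollout = context + "\nAssistant: " + candidate
    if simulate_trajectory:
        # 轨迹模拟:若候选是澄清问题,让用户模拟器接话,再用任务启发式评估最终结果
        if lose_action == "CLARIFY":
            rollout += "\nUser: " + simulate_user(context, candidate)
        if task_heuristic_success(rollout):
            return {}  # 候选实际也能达成目标,此处从简直接跳过该样本

    # 返回 DPO 风格的偏好三元组:参考回复为 chosen,对比行动下的采样回复为 rejected
    return {"prompt": context, "chosen": gold_response, "rejected": candidate}
```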

7. 总结与思考

7.1. 结论总结

本文提出了 ACT (Action-Based Contrastive Self-Training),一种模型无关 (model agnostic)准在线对比微调方法 (quasi-online contrastive tuning approach),旨在实现样本高效 (sample-efficient)对话任务适应 (conversational task adaptation)。同时,还提出了一种用于评估对话代理 (conversational agents) 的工作流程。

主要发现和贡献总结如下:

  • 有效提升澄清能力: ACT 能够显著提升 LLM 在多轮对话中隐式识别 (implicitly recognize)推理模糊性 (reason about ambiguity) 的能力,使其能够更好地提出澄清问题 (clarification questions),而不是过度猜测用户意图。
  • 数据高效性:有限数据 (limited data regime) 的场景下,ACT 表现出高度有效性,甚至在没有行动标签 (action labels) 的情况下,也能通过伪标签 (pseudo-labels) 实现成功的微调。
  • 超越基线方法: ACT 在多个真实世界对话任务(包括表格型问答 PACIFIC、机器阅读理解 Abg-CoQA 和新任务 AmbigSQL)上,相较于 SFTDPO 等标准微调方法,展现出显著的对话建模改进 (conversation modeling improvements)
  • 关键组件的验证: 消融实验证实了行动基于偏好 (action-based preferences)在线策略采样 (on-policy sampling)多轮轨迹模拟 (multi-turn trajectory simulation) 作为 ACT 核心组件的必要性。

7.2. 局限性与未来工作

7.2.1. 局限性 (Limitations)

  • 对澄清问题时机的假设: ACT 假设澄清问题 (clarification questions) 的提出是适时的。然而,众包对话数据集 (crowdsourced conversation datasets) 可能存在噪声,导致模型学到次优策略(例如,提出不必要的澄清问题或生成不流畅的语言)。未来的工作可能需要额外的预处理阶段 (preprocessing stage) 来推断某个行动是否有效。
  • 标签噪声的影响: 隐式行动识别评估 (implicit action recognition evaluation) 假设基准任务中的行动是“最优”的。在标注者间一致性 (inter-annotator agreement) 低的数据集(如 Abg-CoQA)中,这可能导致评估结果的不一致性。
  • 对任务特定启发式的依赖: ACT 依赖于任务特定启发式 (task-specific heuristics) 来评估轨迹结果。虽然这允许根据不同领域定制成功标准,但也增加了定制化 (customization)工程专业知识 (engineering expertise) 的需求。
  • 对现有 LLM 的依赖: ACT 的实现在很大程度上依赖现有 LLM(如 Gemini)进行偏好数据集 (preference dataset) 构建、行动分类 (Action Classification) 和 用户模拟 (User Simulation)。这可能受限于商业 LLM 的可访问性(成本、隐私)和潜在的偏差。
  • 研究范围: 本研究主要关注有限数据 (limited data regime) 场景。在训练数据充足且分布与目标分布高度匹配的情况下,ACT 相对于 SFT 等方法的优势可能不会那么显著。
  • “准在线”的界定: ACT 被定义为准在线对比 RL 微调 (quasi-online contrastive RL tuning) 方法,因为它结合了离线方法的固定数据集 (fixed dataset) 和在线方法的在线策略采样 (on-policy sampling)。但其在线探索的程度仍受限于对话轨迹 (dialogue trajectories) 的有限性质(尤其是在有明确正确答案的任务中)。

7.2.2. 未来工作 (Future Work)

  • 与其他复杂微调方法的结合: 考虑将 ACT 与现有针对文本到 SQL 生成 (text-to-SQL generation) 等复杂任务的复杂微调方法 (sophisticated tuning approaches) 相结合,以进一步提升性能。
  • 大规模数据和多任务环境的泛化: 研究 ACT大规模数据 (large-scale data)多任务环境 (multi-task environments) 中的泛化能力。
  • 与检索增强生成 (RAG) 的结合: 鉴于 LLM幻觉 (hallucinations) 问题,可以将 ACT检索增强生成 (Retrieval-Augmented Generation, RAG) 方法结合,以提升事实准确性。

7.3. 个人启发与批判

7.3.1. 个人启发 (Personal Insights)

  • 行动规划的价值: 这篇论文强调了在多轮对话中,单纯生成流畅的文本是不够的,智能体 (agent)行动规划 (action planning) 能力(何时澄清、何时回答)是实现对话成功的核心。这为 LLM 的对齐和微调提供了新的视角,超越了传统上关注的文本质量。
  • DPO 的灵活性与在线扩展: DPO 作为 RLHF 的轻量级替代品,其潜力不仅在于简化训练,更在于其通过灵活的损失函数设计来融入更复杂的偏好。ACT在线策略采样 (on-policy sampling)轨迹模拟 (trajectory simulation) 引入 DPO 框架,为数据稀疏 (data-efficient) 场景下的策略学习 (policy learning) 提供了有效途径。
  • 弱监督和伪标签的实用性: ACT 在没有行动标签 (action labels) 的情况下,通过伪标签 (pseudo-labels) 也能取得良好性能,这极大地降低了对昂贵人工标注 (human annotation) 的依赖,使其在实际应用中更具可行性。对于新兴或低资源任务,这种方法能够快速启动模型开发。
  • 多维度评估的重要性: 论文提出的行动层面 (action-level)内容层面 (content-level) 的评估指标,以及轨迹级评估 (trajectory-level evaluation),为全面衡量对话代理 (conversational agents) 性能提供了更丰富的视角,避免了仅关注表面生成质量的局限性。

7.3.2. 批判 (Critique)

  • 用户模拟器的可靠性: ACT 严重依赖用户模拟器 (User Simulator, U) 来生成多轮对话轨迹 (multi-turn dialogue trajectories)。如果 U 自身的模拟能力不足或存在偏差,它可能无法准确反映真实用户的行为和意图,从而可能引导 ACT 学习到次优的策略。U 的鲁棒性和泛化能力(特别是对未见过的策略 (policies) 和 错误 (errors) 的反应)是 ACT 成功的关键瓶颈。
  • 任务特定启发式的通用性: 论文明确指出 ACT 依赖任务特定启发式 (task-specific heuristics) 进行轨迹评估。虽然这允许在不同领域进行定制,但这也意味着将 ACT 应用到新任务时,需要投入额外的工程专业知识 (engineering expertise) 来设计和验证这些启发式规则。这可能限制了 ACT 在完全开放域或快速变化的复杂任务中的应用。
  • “准在线”的理论严格性: 尽管 ACT 被描述为准在线 (quasi-online),但其在线探索 (online exploration) 的性质与纯粹的在线强化学习 (online reinforcement learning) 仍有区别。论文对这种“准在线”模式的理论性质(例如,收敛性、探索-利用权衡)的讨论相对较少。更严格的理论分析可以进一步支持其有效性。
  • 对商业 LLM 的依赖与复现性: 实验中大量使用 Gemini Ultra 等商业 LLM 进行偏好数据构建 (preference data construction)行动分类 (action classification)用户模拟 (user simulation)。这对于学术界和资源有限的研究者来说,可能是一个挑战,影响了研究的复现性和透明度。如果这些功能由开源模型替代,其性能可能会有所下降。
  • 行动空间和复杂对话: 当前的行动空间 (action space) 主要限于 CLARIFYANSWER。在更复杂的对话中,可能存在更多元的对话行动 (dialogue actions)(如确认、建议、拒绝、提供额外信息等)。如何将 ACT 扩展到更丰富、层级化的行动空间,以及如何处理这些行动之间的依赖关系,是未来的挑战。
