AiPaper
论文状态:已完成

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

发表:2024/06/01
本分析由 AI 生成,可能不完全准确,请以原文为准。

TL;DR 精炼摘要

本文提出了基于行动对比自训练的多轮对话模型,旨在提升大语言模型在模糊性处理中的能力。通过引入准在线偏好优化算法,ACT 在数据稀疏场景下有效学习对话策略,验证了其在表格问答、机器阅读理解及 AmbigSQL 等任务中的出色表现,显示出显著优于传统微调方法的效果。

摘要

Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.

思维导图

论文精读

中文精读

1. 论文基本信息

1.1. 标题

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training (学习澄清:基于行动对比自训练的多轮对话)

1.2. 作者

  • Maximillian Chen*, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik
  • 所属机构:Google 和 Columbia University (Maximillian Chen 隶属于两家机构)

1.3. 发表期刊/会议

本文于 2024 年 5 月 31 日作为预印本发表在 ArXiv。ArXiv 是一个广受欢迎的预印本服务器,在人工智能、机器学习和自然语言处理等领域,研究者通常会先将论文上传至此,以便快速分享研究成果并接受社区反馈。

1.4. 发表年份

2024 年

1.5. 摘要

大语言模型 (LLM) 通过人类反馈进行优化,已迅速成为开发智能对话助手的主流范式。然而,尽管它们在许多基准测试中表现出色,但基于 LLM 的智能体可能仍然缺乏对话技能,例如歧义消除 (disambiguation)——当它们面临模糊性时,通常会过度规避或隐式猜测用户的真实意图,而不是提出澄清问题 (clarification questions)。在任务特定设置中,高质量的对话样本往往有限,这构成了 LLM 学习最佳对话动作策略 (dialogue action policies) 的瓶颈。

本文提出了 Action-Based Contrastive Self-Training (ACT),一种基于 Direct Preference Optimization (DPO) 的准在线偏好优化算法 (quasi-online preference optimization algorithm),旨在实现多轮对话建模中的数据高效对话策略学习。研究人员在多个真实世界的对话任务中(包括表格型问答、机器阅读理解以及 AmbigSQL——一个用于向数据分析智能体澄清复杂 SQL 生成的信息寻求请求的新任务)验证了 ACT 的有效性,即使在没有行动标签的数据高效微调 (data-efficient tuning) 场景下也表现出色。此外,本文提出通过检查 LLM 是否能隐式识别和推理对话中的模糊性来评估其作为对话代理 (conversational agents) 的能力。ACT 相较于监督微调 (SFT) 和 DPO 等标准微调方法,展现出显著的对话建模改进。

1.6. 原文链接

2. 整体概括

2.1. 研究背景与动机

  • 核心问题: 尽管大语言模型 (LLM) 在许多任务中表现出了强大的能力,但它们在多轮对话中,尤其是在处理用户请求的模糊性时,仍然存在显著缺陷。当前 LLM 倾向于过度规避 (overhedging)(例如,给出过于笼统或模棱两可的回答)或隐式猜测 (implicitly guess) 用户的真实意图,而非主动提出澄清问题 (clarification questions) 来消除歧义,这导致对话效率和用户满意度降低。Figure 1 提供了一个直观的例子。

    Figure 1 | Simplified example of ambiguity present at tabular-grounded conversational question answering based on Deng et al. (2022). A conversational agent should recognize when there is ambiguity and ask a clarifying question towards a more accurate final answer. 该图像是示意图,展示了在基于表格的对话问答中过度推测及意图澄清的示例。对话代理需要识别出用户问题中的模糊性,并提出澄清问题,以便得到更准确的答案。

    Figure 1 | 简化版的表格型对话问答中模糊性示例,一个对话代理应能识别模糊性并提出澄清问题以获得更准确的最终答案。

  • 为什么重要: 在复杂的任务场景中,用户往往会欠详尽 (underspecify) 他们的请求,导致信息模糊。有效的歧义消除 (disambiguation) 是实现共同基础 (common ground) 和完成任务的关键。例如,在生成复杂 SQL 查询或进行机器阅读理解时,一个微小的歧义都可能导致最终结果的错误。然而,为 LLM 学习这些实用技能 (pragmatic skills) 的高质量对话数据往往稀缺,这成为一个主要瓶颈。

  • 现有研究的挑战或空白:

    1. LLM 的预训练或监督微调 (SFT) 目标通常不直接对齐对话中的实用技能(如澄清)。
    2. 虽然 RLHF 等方法可以对齐 LLM,但现有模型在处理多轮对话任务时仍存在困难,部分原因是它们没有直接优化实用技能。
    3. 收集高质量的多轮对话数据集成本高昂且受隐私问题限制,使得在数据稀疏场景下学习最佳对话策略 (dialogue policies) 极具挑战性。
    4. 现有 DPO 类算法在对话任务中的应用多集中于单轮响应优化,未能充分考虑多轮轨迹 (multi-turn trajectories)对话行动 (conversational actions) 的优化。
  • 本文的切入点或创新思路:

    • 聚焦行动规划: 提出直接优化 LLM 在模糊上下文中隐式选择对话策略 (conversational strategies) 的能力,尤其关注澄清问题 (clarifying questions)
    • 数据高效: 设计一种数据高效的算法,使其在高质量对话样本有限的情况下也能有效学习。
    • 准在线偏好优化: 引入 Action-Based Contrastive Self-Training (ACT),结合了 DPO 的易用性和在线学习的探索能力,通过对比不同实用对话行动 (pragmatic conversational actions) 的差异来优化模型。
    • 多任务验证: 在多个真实世界对话任务上验证方法,包括一个新提出的 AmbigSQL 任务,突出其通用性。
    • 新型评估: 提出通过检查 LLM 是否能隐式识别和推理对话中的模糊性来评估其作为对话代理 (conversational agents) 的能力。

2.2. 核心贡献/主要发现

  • 提出了 Action-Based Contrastive Self-Training (ACT) 算法: ACT 是一种样本高效 (sample-efficient)、基于 DPO准在线 (quasi-online) 偏好优化算法。它专注于通过对比智能体可能的实用对话行动 (pragmatic conversational actions) 之间的差异来改进 LLM 的多轮对话能力。

  • 在数据稀疏场景下表现出色: ACT数据高效微调 (data-efficient tuning) 场景下展示了其有效性,即使在没有明确行动标签 (action labels) 的情况下,也能在多个真实世界的对话任务中实现显著改进。这些任务包括:

    • 表格型问答 (tabular-grounded question-answering)
    • 机器阅读理解 (machine reading comprehension)
    • AmbigSQL 任务: 一个新颖的、用于歧义消除 (disambiguating) 针对数据分析智能体生成复杂 SQL 的信息寻求请求的任务。
  • 多轮对话建模的显著改进: ACT多轮任务完成 (multi-turn task completion) 能力上显著优于 Supervised Fine-tuning (SFT)Direct Preference Optimization (DPO) 等标准微调方法。Figure 2 直观展示了 ACT 的性能优势。

    Figure 2 | ACT greatly outperforms standard tuning approaches in data-efficient settings for conversational modeling, as exemplified here on PACIFIC. 该图像是一个图表,展示了 Action-Based Contrastive Self-Training (ACT) 在数据效率设置中相较于其他标准调优方法的表现。随着对话数量的增加,ACT 的 Clarification Reasoning (DROP F1) 指标显著提升,标记为 24% 的推理提升和 19% 的动作优化提升,并显示出 5 倍的数据效率。

    Figure 2 | ACT 在数据高效的对话建模设置中,大大优于标准微调方法,如在 PACIFIC 任务上的表现。

  • 评估框架的创新: 论文提出了一种评估 LLM 作为对话代理 (conversational agents) 能力的新工作流程,即通过检查模型是否能隐式识别 (implicitly recognize)推理 (reason about) 对话中的模糊性。

  • 模型无关性 (Model Agnostic): ACT 对基础模型具有良好的适应性,即使是对未经过人类反馈对齐的基础模型,也能带来性能提升。

3. 预备知识与相关工作

本节旨在为读者提供理解论文核心内容所需的背景知识,并阐述本文工作与现有研究的联系与区别。

3.1. 基础概念

  • 大语言模型 (Large Language Models, LLMs): 指的是参数量巨大、在海量文本数据上进行过预训练的深度学习模型,如 GPT 系列、GeminiMistral 等。它们通过学习语言的统计规律,能够生成连贯、有意义的文本,并执行各种自然语言处理任务,如问答、翻译、摘要和对话。本文中,LLM 是构建对话代理 (conversational agents) 的基础。
  • 人类反馈 (Human Feedback): 优化 LLM 性能的关键机制。通常指在模型生成响应后,人类标注者对其质量、有用性、安全性等方面进行评价或排序。这种反馈用于进一步微调模型,使其行为更符合人类偏好。
  • 对话代理 (Conversational Agents): 能够与人类进行自然语言交互,并协助完成特定任务或提供信息的智能系统。它们需要具备理解用户意图、生成恰当响应、管理对话流程等能力。本文关注的是如何提升 LLM 作为对话代理的能力。
  • 歧义消除 (Disambiguation): 指在对话中识别并解决用户请求或表达中的模糊性。例如,用户说“展示详情”,代理需要澄清“详情”指的是哪些具体信息。这是混合主导对话 (Mixed-Initiative Conversation) 的一个核心技能。
  • 对话动作策略 (Dialogue Action Policies): 描述对话代理 (dialogue agent) 在特定对话状态 (dialogue state) 下应采取的行动。例如,当用户请求模糊时,策略可能是“提出澄清问题”;当请求清晰时,策略可能是“提供答案”。本文的目标是优化 LLM 学习这些策略的能力。
  • 直接偏好优化 (Direct Preference Optimization, DPO): 一种强化学习 (Reinforcement Learning) 的替代方法,用于从人类偏好数据中对齐 LLMDPO 不需要训练一个单独的奖励模型 (reward model),而是直接将偏好数据转换为一个分类损失函数,通过最大化获胜响应相对于失败响应的概率比来优化策略模型。这简化了 RLHF 的训练过程,通常更稳定且计算效率更高。
  • 监督微调 (Supervised Fine-tuning, SFT): 在预训练 LLM 的基础上,使用带有任务特定标注数据(例如,问答对、对话轮次)进一步训练模型的过程。SFT 的目标是使模型更好地遵循指令并适应特定任务。

3.2. 前人工作

3.2.1. 混合主导对话代理 (Mixed-Initiative Conversational Agents)

混合主导 (Mixed-Initiative) 是指在人机交互中,人与机器都可以在对话中承担主导角色,而不是由一方完全控制。这种代理通常包含两个核心组件:

  • 理解与规划模块 (Understanding and Planning Module): 负责分析对话状态(如用户意图、上下文),并决定接下来要执行的对话行动 (dialogue action),例如是问澄清问题 (clarifying question) 还是直接提供答案 (answer)。这通常被视为一个马尔可夫决策过程 (Markov Decision Process, MDP),其中对话状态 (dialogue state) 是从一个潜在未知分布中抽取的,而行动 (action) 是对话话语中携带的实用意图 (pragmatic intent) 的低维表示(即对话行为 (dialogue act))。
  • 生成模块 (Generation Module): 根据规划模块的输出,生成具体的话语 (utterance)。早期的研究可能使用多目标 SFT 或专门的控制代码嵌入 (control codes embeddings) 来实现受控生成。LLM 极大地改进了实用受控生成 (pragmatically-controlled generation) 的性能。

挑战:

  • 规划 (Planning): 这是一个困难的任务,因为它需要长期规划 (long-horizon planning)推理 (reasoning) 用户的响应和意图。传统的模块化方法(如结合神经网络模型 (neural models)搜索算法 (search algorithms)模拟 (simulation))计算开销大,可能导致错误传播 (error propagation),并且不直接优化响应质量。
  • 本文提出直接将对话行动规划 (dialogue action planning) 作为混合主导对话 (mixed-initiative conversation)响应生成 (response generation) 的一个隐式子任务来优化。

3.2.2. LLM 对齐学习 (Learning for LLM Alignment)

当前 LLM 训练范式通常包含三个阶段:

  1. 预训练 (Pretraining): 在大规模数据上学习通用语言模式。

  2. 监督微调 (Supervised Fine-tuning, SFT): 在指令遵循数据上进行微调。

  3. 人类偏好对齐 (Tuning for alignment with human preferences): 使模型行为与人类偏好保持一致。

    本文主要关注第三个阶段。常用的对齐方法包括:

  • 人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF): 见 Ouyang et al. (2022)。首先根据人类偏好数据训练一个奖励模型 (reward model),然后使用近端策略优化 (Proximal Policy Optimization, PPO) 等在线算法 (online algorithms) 优化 LLM。RLHF 具有灵活的奖励函数和更广阔的策略探索 (policy exploration) 空间,但 PPO 以难以调优著称。
  • 离线偏好学习算法 (Offline Preference Learning Algorithms): 针对 RLHF 的复杂性,DPO (Rafailov et al., 2024)SLiC (Zhao et al., 2023)IPO (Azar et al., 2024) 等算法被广泛采用。它们绕过了显式奖励建模 (reward modeling),只需一组超参数即可优化,同时能达到相似的经验效果。
  • 在线策略 DPO (On-Policy DPO): 鉴于纯离线方法的局限性,一些研究探索了 DPO在线变体 (online variants)。例如,Yuan et al. (2024) 提出了迭代 DPO (iterative DPO)Chen et al. (2024) 提出了将真实响应视为“获胜”响应,从前一迭代策略模型中采样的响应视为“失败”响应的变体。Pang et al. (2024) 将迭代 DPO 应用于优化外部化推理链 (reasoning chains)

3.3. 技术演进

对话系统从早期的基于规则和知识的系统,发展到以统计模型为基础(如隐马尔可夫模型 (HMM)条件随机场 (CRF)),再到基于深度学习 (deep learning)端到端 (end-to-end) 系统。LLM 的出现极大地推动了这一领域的发展,使得通用对话代理 (conversational agents) 成为可能。

LLM 的对齐方面,从最初的监督微调 (SFT),到通过人类反馈 (human feedback) 进行强化学习 (RLHF),再到近期兴起的直接偏好优化 (DPO) 系列算法,对齐技术不断简化和优化,旨在更高效、稳定地使 LLM 的行为符合人类预期。本文的 ACT 算法正是基于 DPO 框架,并针对多轮对话 (multi-turn conversations) 中的行动规划 (action planning) 这一特定挑战进行了创新性扩展。

3.4. 差异化分析

  • 聚焦多轮对话行动: 现有许多 DPO 相关工作主要关注单轮响应优化 (single-turn response optimization),例如生成更优的、更符合偏好的答案。本文的 ACT 则将 DPO 扩展到多轮轨迹 (multi-turn trajectories),并特别关注 LLM 在对话中隐式选择和执行对话行动 (conversational actions) 的能力,尤其是澄清问题 (clarification questions)。这是本文的核心创新点之一。
  • 行动作为对比基础: 据作者所知,ACT 是首个以对话行动 (conversational actions) 为基础进行对比学习 (contrastive learning) 的方法。通过明确对比“获胜行动”和“失败行动”所导致的响应和轨迹,模型能够更有效地学习何时应该澄清、何时应该回答。
  • 准在线学习范式: ACT 结合了离线 DPO 的易用性和在线学习的探索能力。它在训练过程中在线采样 (on-policy sampling) 模型自身的响应,并通过用户模拟器 (User Simulator) 评估这些响应所产生的多轮对话轨迹 (multi-turn dialogue trajectories),从而动态更新偏好对 (preference pairs)。这种准在线机制使其比纯离线方法更能适应动态的对话环境,并学习到更鲁棒的策略。
  • 新任务 AmbigSQL: 论文引入了一个新的文本到 SQL 生成 (text-to-SQL generation) 任务 AmbigSQL,专门用于评估和训练 LLM 在复杂数据分析请求中的歧义消除 (disambiguation) 能力,填补了该领域多轮交互任务资源的空白。

4. 方法论

4.1. 方法原理

ACTAction-Based Contrastive Self-Training,基于行动对比自训练)的核心思想在于,将 LLM 微调为混合主导对话代理 (mixed-initiative conversational agent) 的关键在于使其能够自动生成能够最大化对话成功概率 (conversational success) 的响应。ACT 通过以下几个直觉实现这一目标:

  1. 对比偏好 (Contrastive Preferences): 对比“获胜”和“失败”的对话响应 (dialogue responses) 之间的实用差异 (pragmatic differences),是直观地教导模型如何选择正确对话行动 (dialogue actions) 的有效方式。

  2. 多轮优化 (Multi-turn Optimization): 对话能力的改进需要多轮优化 (multi-turn optimization),而不仅仅是单轮的对比。ACT 通过模拟多轮对话轨迹 (conversation trajectories) 来实现这一点。

  3. 在线策略采样 (On-policy Sampling): DPO 类算法的梯度更新基于获胜和失败响应的对数概率。在线策略采样 (on-policy response sampling) 可以生成高概率的词元序列,从而更好地利用 DPO 的优化机制。

    ACT 算法分为两个主要阶段:行动基于对比数据集构建 (action-based contrast dataset construction)对比自训练 (contrastive self-training)。整个过程如 Figure 3 所示。

    该图像是示意图,展示了基于行动的对比自我训练 (ACT) 方法在多轮对话中的策略更新过程。图中包含了响应采样、轨迹模拟与评估的场景,分别示例了错误和正确的隐式行动检测,以及如何更新策略以实现收敛的步骤。

Figure 3 | ACT 微调阶段概述。对于 $D_{pref}$ 中的每个初始对比配对(如 3.2.1 节所述构建),我们从正在微调的模型中采样一个在线策略响应。在评估了采样响应的轨迹后,我们通过替换现有的获胜或失败响应来更新对比配对。模型策略使用公式 1 的目标进行更新。

4.2. 核心方法详解

4.2.1. 问题设置 (Problem Setup)

本文将任务定义为将 LLM 微调为一个混合主导对话代理 (mixed-initiative conversational agent)。该代理应通过一系列对话与用户互动,最终为用户的请求提供正确响应。与用户完全控制交互流的常见代理设置不同,混合主导代理 (mixed-initiative agents) 应通过执行对话行动 (conversational actions)策略 (strategies)(如澄清问题 (clarifying questions))来理解如何重定向交互流。

符号定义:

  • $\pi_{\theta_i}$:在时间步 $i \geq 0$ 时,LLM 的策略 (policy),由参数 $\theta$ 参数化。
  • $\pi_{ref}$:参考策略模型 (reference policy model),即 $\pi_{\theta_0}$(初始模型)。
  • $D$:包含多个对话 (conversations) 的数据集。
  • $c$:$D$ 中的一个对话,包含 $n$ 个对话轮次 (dialogue turns)。
  • $t_i$:时间步 $i$ 的对话轮次状态 (turn state),包括每个交互方观察到的话语 (utterances) 和行动 (actions)。每个 $t_i$ 都属于一个轨迹,该轨迹在用户于更早的时间步 $j \le i$ 提出的问题得到回答时结束。
  • $p_i$:在时间步 $i$ 的提示 (prompt),包含任何任务特定信息 (task-specific information)(如 SQL 数据库 schema、表格数据或检索到的段落)以及任何现有的对话上下文 (dialogue context)。
  • $r_i$:在时间步 $i$ 的系统端真实响应 (ground truth system-side response)。
  • $g_i$:解决 $t_i$ 隐式轨迹的目标响应 (goal response),即在任何可能的澄清轮次后,用户原始问题的答案。在单轮轨迹情况下,$g_i \gets r_i$。
  • $a_i$:$r_i$ 隐式表达的行动 (action)。$a_i$ 存在于特定任务的潜在行动空间 (latent Action Space) $S$ 中,可由行动标注代理 (Action Annotation Agent) $G$ 推断。
  • 行动空间 (Action Space) $S$:在实验中,$S =$ [CLARIFY, ANSWER](澄清,回答)。
  • 辅助模块:
    • MM可控生成模型 (controllable generation model),用于偏好数据 (preference data) 的创建。

    • AA行动分类器 (Action Classifier),用于微调 (tuning)评估 (evaluation) 期间。

    • UU用户模拟器 (User Simulator),可以被控制以模拟用户行为,用于微调和评估期间。

      Figure A4 展示了如何在 Abg-CoQA 中构建一个对比配对的例子。

      Figure A4 | Example of a contrastive pairing constructed for RL tuning with Abg-CoQA (Guo et al., 2021). The notation used is as described in Section 3.1. 该图像是一个关于物质主义与哲学分类的对话示例。图中展示了选中的响应、拒绝的响应及其对应的操作,体现了在处理模糊问题时所需的澄清过程。对谈的核心是物质主义如何与其他哲学理论对比。

Figure A4 | RL 微调中 Abg-CoQA (Guo et al., 2021) 的对比配对示例。所用符号如 3.1 节所述。

用户模拟器 (User Simulators, $U$): $U$ 的实现灵感来自 Deng et al. (2023c) 和 Yu et al. (2023) 的工作,他们直接提示 LLM 根据对话上下文 (dialogue context) 和任务目标 (task objectives) 执行目标导向任务 (goal-oriented tasks)。本文首先提示 LLM 总结用户的信息寻求目标 (information-seeking goal)。然后,使用此摘要和当前对话上下文形成另一个提示,以模拟用户响应。这种带有目标摘要的提示方式比直接提供用户模拟器 (user simulator) 真实信息目标更具灵活性。

行动分类器 (Action Classifiers, $A$): 在本文考虑的数据集中,可能的行动是“澄清”或“直接回答”问题。本文直接使用少样本上下文学习 (few-shot in-context learning) 作为行动分类器 $A$。
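
下面给出一个最小的提示词实现示意(非论文官方实现),展示如何用占位的 generate 函数串起上述用户模拟器 $U$(先总结信息目标,再模拟用户回复)与少样本行动分类器 $A$;其中的提示词措辞、少样本示例与 generate 接口均为示例性假设:

```python
# 最小示意(非官方实现):用提示词实现用户模拟器 U 与行动分类器 A。
# generate 是调用任意 LLM 的占位函数,由使用者自行提供。
from typing import Callable

def summarize_user_goal(generate: Callable[[str], str], dialogue_context: str) -> str:
    """第一步:提示 LLM 总结用户的信息寻求目标。"""
    prompt = (
        "以下是用户与助手的对话。\n"
        f"{dialogue_context}\n"
        "请用一句话总结用户想要获取的信息目标:"
    )
    return generate(prompt)

def simulate_user_reply(generate: Callable[[str], str],
                        goal_summary: str,
                        dialogue_context: str,
                        clarifying_question: str) -> str:
    """第二步:基于目标摘要与当前上下文,模拟用户对澄清问题的回答。"""
    prompt = (
        f"用户的信息目标:{goal_summary}\n"
        f"当前对话:\n{dialogue_context}\n"
        f"助手刚刚问:{clarifying_question}\n"
        "请以用户的口吻,给出一句能够消除歧义的回答:"
    )
    return generate(prompt)

def classify_action(generate: Callable[[str], str], response: str) -> str:
    """少样本上下文学习的行动分类器 A:返回 CLARIFY 或 ANSWER。"""
    few_shot = (
        "判断下列助手回复的行动类型(CLARIFY 或 ANSWER)。\n"
        "回复:Which year are you asking about?\n行动:CLARIFY\n"
        "回复:SELECT Name FROM Singer WHERE Country != 'China'\n行动:ANSWER\n"
    )
    prompt = few_shot + f"回复:{response}\n行动:"
    label = generate(prompt).strip().upper()
    return "CLARIFY" if "CLARIFY" in label else "ANSWER"

if __name__ == "__main__":
    def dummy(prompt: str) -> str:      # 用常量输出代替真实 LLM,仅演示接口
        return "CLARIFY"
    print(classify_action(dummy, "Which region are you asking about?"))
```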

4.2.2. ACT: 行动基于对比自训练 (Action-Based Contrastive Self-training)

ACT 是一种准在线 (quasi-online) 扩展的 DPO 算法,它保持了离线方法的易用性,同时融入了在线学习 (online learning) 中的灵活探索。ACT 包含两个阶段:行动基于对比数据集构建 (action-based contrast dataset construction)对比自训练 (contrastive self-training)

4.2.2.1. 构建偏好数据 (Construction of Preference Data)

算法 1: 构建对比行动对 (Building Contrastive Action Pairs)

  • 输入:
    • $D$:数据集
    • $M$:条件生成模型 (Conditional generation model)
    • $S$:行动空间 (Action Space)
    • $G$:行动标注代理 (Action Annotation Agent)
  • 步骤:
    1. 初始化空数据集 $D_{pref}$。
    2. 对于 $D$ 中每个对话轮次 (conversation turn) $t_i$:
    3. 令 $a_i = G(p_i, r_i)$。 ▷ 推断上下文行动 (Contextual Action)
    4. 令 $a_i' = S \setminus a_i$。 ▷ 确定被拒绝行动 (Rejected Action)
    5. 令 $y_{wi} = r_i$。 ▷ 将真实响应设为获胜响应 (winning response)
    6. 从 $M$ 中采样 $y_{li} \sim P_M(\cdot \mid p_i, a_i')$。 ▷ 采样失败响应 (losing response),该响应基于被拒绝的行动 $a_i'$
    7. 令 $t_i' = (p_i, r_i, g_i, a_i, a_i', y_{wi}, y_{li})$。 ▷ 构建增强的元组
    8. 将 $t_i'$ 添加到 $D_{pref}$。

解释: 这个算法构建了一个包含对比“获胜-失败”行动对 (action pairs) 的偏好数据集 (preference dataset)。对于数据集 $D$ 中的每个对话轮次 $t_i$,它会:

  • 推断真实行动: 使用行动标注代理 (Action Annotation Agent) $G$(例如一个 LLM 或人类标注者)来推断真实响应 $r_i$ 所对应的隐式行动 $a_i$。
  • 确定被拒绝行动: 从行动空间 (Action Space) $S$ 中选择与 $a_i$ 不同的行动 $a_i'$ 作为“被拒绝行动”。
  • 生成失败响应: 使用高能力 LLM (high capacity LLM) 作为条件生成模型 (conditional generation model) $M$,根据原始提示 $p_i$ 和被拒绝行动 $a_i'$ 生成一个“失败响应” $y_{li}$。这样,$y_{li}$ 代表了如果智能体采取了错误的行动会产生的响应。
  • 构建对比元组: 最终形成一个包含原始信息、真实行动、被拒绝行动、获胜响应和失败响应的增强元组 $t_i'$。

野外无标签对话的行动优化 (Action optimization for unlabeled conversations "in-the-wild"): 在某些情况下,可能无法获得黄金标准 (gold-standard) 的歧义标注。在这种设置下,可以使用一个分类器 (classifier) 作为行动标注代理 (Action Annotation Agent) $G$ 来获取伪标签监督 (pseudo-label supervision),而不是依赖人工标注。
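
作为补充,下面是算法 1 的一个最小 Python 示意(非官方实现):annotate_action 与 generate_with_action 为假设的占位函数,分别对应 $G$ 与 $M$,输入的 turns 以字典形式组织:

```python
# 算法 1 的最小示意(非官方实现):为每个对话轮次构建"获胜-失败"行动对。
from typing import Callable, Dict, List

ACTION_SPACE = ["CLARIFY", "ANSWER"]

def build_contrastive_pairs(
    turns: List[Dict[str, str]],                      # 每项含 "prompt"、"response"、"goal"
    annotate_action: Callable[[str, str], str],       # G(p_i, r_i) -> a_i
    generate_with_action: Callable[[str, str], str],  # M(p_i, a_i') -> y_l
) -> List[Dict[str, str]]:
    pairs = []
    for t in turns:
        a_w = annotate_action(t["prompt"], t["response"])   # 推断真实响应隐含的行动 a_i
        a_l = next(a for a in ACTION_SPACE if a != a_w)     # 行动空间中剩下的行动 a_i'
        y_l = generate_with_action(t["prompt"], a_l)        # 按被拒绝行动采样失败响应
        pairs.append({
            "prompt": t["prompt"],        # p_i
            "goal": t["goal"],            # g_i
            "action_w": a_w,              # a_i
            "action_l": a_l,              # a_i'
            "response_w": t["response"],  # y_w = r_i
            "response_l": y_l,            # y_l
        })
    return pairs
```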

4.2.2.2. 使用在线策略对话轨迹模拟进行自训练 (Self-Training Using On-policy Conversation Trajectory Simulation)

算法 2: ACT: 行动基于对比自训练 (Action-Based Contrastive Self-Training)

  • 输入:
    • πθ0\pi_{\theta_0}初始策略模型 (Initial Policy Model)
    • DprefD_{pref}行动对比数据集 (Action Contrast Dataset)
    • BB批次数量 (Number of Batches)
    • AA行动分类器 (Action Classifier)
    • UU用户模拟器 (User Simulator)
    • HH任务启发式 (Task Heuristic)
    • ϵ\epsilon启发式容忍度 (Heuristic Tolerance)
  • 步骤:
    1. 对于从 DprefD_{pref} 中采样的批次 bb 中每个对话轮次 (conversation turn) tjt_j (其中 0jB0 \leq j \leq B): 2. 从当前模型策略中采样一个响应:yjπθj(pj)y_j \sim \pi_{\theta_j}(\cdot | p_j)。 3. 如果行动 A(yj)ajA(y_j) \neq a_j\vartriangleright 隐式实用行动 (pragmatic action) 与真实情况不匹配 4. 设置 ylj=yjy_{lj} = y_j\vartriangleright 将当前采样响应 yjy_j 作为失败响应 (losing response) 5. 否则(行动 A(yj)=ajA(y_j) = a_j): 6. 初始化轨迹 (Trajectory),将 yjy_j 添加到轨迹 (Trajectory) 中。 7. 当 A(yk)ANSWERA(y_k) \neq \text{ANSWER} 时:\vartriangleright 模拟用户澄清 8. 澄清答案 (Clarification Answer) =U(pk,yk)= U(p_k, y_k)。 9. 将澄清答案 (Clarification Answer) 添加到轨迹 (Trajectory) 中。 10. yk+1=πθj(pk,Clarification Answer)y_{k+1} = \pi_{\theta_j}(\cdot | p_k, \text{Clarification Answer})\vartriangleright 模拟下一个策略响应 11. 将 yk+1y_{k+1} 添加到轨迹 (Trajectory) 中。 12. 如果 H(Trajectory outcome,gj)>ϵH(\text{Trajectory outcome}, g_j) > \epsilon: 13. 令 ywj=Trajectoryy_{wj} = \text{Trajectory}\vartriangleright 奖励可接受的轨迹结果,将其设为获胜响应 (winning response) 14. 否则: 15. 令 ylj=Trajectoryy_{lj} = \text{Trajectory}\vartriangleright 惩罚不好的轨迹结果,将其设为失败响应 (losing response) 16. θUpdate(θ)\theta \leftarrow \text{Update}(\theta) 直到收敛(使用 DPO 目标函数)。
    2. 输出: πθB\pi_{\theta_B}

解释: 这个算法是 ACT 的核心,它是一个准在线 (quasi-online) 的训练过程。

  • 在线策略采样: 在每个训练批次中,模型 $\pi_{\theta_j}$ 会根据当前提示 (prompt) $p_j$ 采样一个响应 $y_j$。

  • 行动匹配检查: 使用行动分类器 (Action Classifier) $A$ 来判断采样响应 $y_j$ 的隐式行动 (implicit action) 是否与真实行动 $a_j$ 匹配。

    • 行动不匹配: 如果不匹配,则说明模型采取了错误的实用行动 (pragmatic action)(例如,应该澄清却直接回答),此时将采样响应 $y_j$ 直接作为失败响应 (losing response),以惩罚这种错误。
    • 行动匹配: 如果匹配,则会进入多轮轨迹模拟 (multi-turn trajectory simulation) 阶段。
  • 轨迹模拟:

    • 使用用户模拟器 (User Simulator) $U$ 模拟用户对模型响应的后续回复(特别是当模型提出澄清问题 (clarification question) 时)。
    • 模型 $\pi_{\theta_j}$ 会继续生成后续响应,直到其行动变为 ANSWER(即尝试回答用户原始问题)。
    • 整个交互序列构成了从模型自身策略生成的对话轨迹 (conversation trajectory)。
  • 轨迹结果评估: 使用任务启发式 (Task Heuristic) $H$ 评估模拟轨迹的最终结果(例如,答案的语义相似度、SQL 执行结果的正确性)与用户原始目标 $g_j$ 的匹配程度。

    • 轨迹成功: 如果结果满足任务启发式,则整个模拟轨迹被视为获胜响应 (winning response)。
    • 轨迹失败: 否则,整个模拟轨迹被视为失败响应 (losing response)。
  • DPO 更新: 根据动态生成的获胜/失败响应 (winning/losing response) 对,使用 DPO 目标函数更新模型参数 $\theta$。

    Figure A5 展示了一个轨迹级内容评估的示例。

    Figure A5 | Trajectory-level content evaluation using the example scenario from Figure 1. Trajectory-level evaluation seeks to measure the extent to which a candidate LLM can interact with a "User" to reach a target information goal. The "interactive" evaluation of a given instance continues until the candidate LLM attempts to resolve the User's request by providing a direct answer. The candidate trajectory resolution is scored using downstream task metrics. In this example, DROP F1 is used following the task metrics for PACIFIC. 该图像是示意图,展示了评估模型如何通过用户模拟器进行多轮对话的轨迹级内容评估。图中描述了用户提问、模型的澄清问题及其最终答案的过程,并通过任务指标(如 DROP F1)来评分轨迹分数。

Figure A5 | 使用 Figure 1 示例场景进行轨迹级内容评估。轨迹级评估旨在衡量候选 LLM 与“用户”交互以达到目标信息目标的能力。给定实例的“交互式”评估持续进行,直到候选 LLM 尝试通过提供直接答案来解决用户的请求。候选轨迹的最终解答使用下游任务指标进行评分。在此示例中,根据 PACIFIC 的任务指标,使用 DROP F1。
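
下面给出算法 2 主循环的一个最小 Python 示意(非官方实现),沿用上一段示意中字典形式的对比配对;各回调函数(在线采样、行动分类、用户模拟、任务启发式、DPO 更新)以及轮数上限 max_clarify_turns 均为示例性假设,原文并未给出这些具体接口:

```python
# 算法 2 的最小示意(非官方实现):ACT 的准在线自训练循环。
# pair 为字典,含 "prompt"、"goal"、"action_w"、"response_w"、"response_l" 等键。
from typing import Callable, Dict, List

def act_training_epoch(
    pairs: List[Dict[str, str]],
    sample_response: Callable[[str], str],        # 从当前策略 pi_theta 在线采样
    classify_action: Callable[[str], str],        # 行动分类器 A,返回 "CLARIFY" 或 "ANSWER"
    simulate_user: Callable[[str, str], str],     # 用户模拟器 U(上下文, 澄清问题) -> 用户回答
    heuristic: Callable[[str, str], float],       # 任务启发式 H(轨迹结果, 目标)
    dpo_update: Callable[[str, str, str], None],  # 用 (prompt, y_w, y_l) 做一次 DPO 更新
    epsilon: float = 0.5,
    max_clarify_turns: int = 3,                   # 假设的澄清轮数上限,原文未给出具体数值
) -> None:
    for pair in pairs:
        y = sample_response(pair["prompt"])
        if classify_action(y) != pair["action_w"]:
            # 隐式行动与真实行动不匹配:直接把在线采样结果作为失败响应
            pair["response_l"] = y
        else:
            # 行动匹配:模拟多轮轨迹,直到模型尝试给出最终回答
            trajectory = [y]
            context = pair["prompt"] + "\n" + y
            for _ in range(max_clarify_turns):
                if classify_action(trajectory[-1]) == "ANSWER":
                    break
                user_reply = simulate_user(context, trajectory[-1])  # 模拟用户对澄清问题的回答
                next_turn = sample_response(context + "\n" + user_reply)
                trajectory += [user_reply, next_turn]
                context += "\n" + user_reply + "\n" + next_turn
            outcome = "\n".join(trajectory)
            if heuristic(outcome, pair["goal"]) > epsilon:
                pair["response_w"] = outcome   # 可接受的轨迹结果设为获胜响应
            else:
                pair["response_l"] = outcome   # 不好的轨迹结果设为失败响应
        dpo_update(pair["prompt"], pair["response_w"], pair["response_l"])
```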

4.2.2.3. 对比强化学习对齐微调 (Contrastive RL Tuning for Alignment)

在通过模拟构建了最新的获胜响应 (winning response) $y_{wi}$ 和失败响应 (losing response) $y_{li}$ 对之后,本文使用 DPO (Direct Preference Optimization) 训练目标来更新策略模型 $\pi_\theta$ (Rafailov et al., 2024)。

DPO 损失函数 (DPO Training Objective)(为简化表示,此处忽略迭代器 $i$):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid p)}{\pi_{ref}(y_w \mid p)} - \beta \log \frac{\pi_\theta(y_l \mid p)}{\pi_{ref}(y_l \mid p)}\right)\right]$$

其中:

  • $p$ 是一个提示 (prompt),由任务信息 (task info) 和对话历史 (conversation history) 拼接而成,形式为 $\{x_1, y_1, ..., x_{i-1}, y_{i-1}, x_i\}$。
    • $x_i$ 代表在第 $i$ 轮观察到的用户端话语 (user-side utterance)。
    • $y_i$ 代表在第 $i$ 轮观察到的系统端话语 (system-side utterance)。
  • $y_w$ 和 $y_l$ 分别是在 4.2.2.2 节中确定的“获胜”和“失败”响应或轨迹 (trajectories)。
  • $\pi_{ref}$ 是初始参考策略模型 (initial reference policy model)。
  • $\beta$ 是一个超参数 (hyperparameter),用于正则化 $\pi_\theta$ 与 $\pi_{ref}$ 之间的比率。
  • $\sigma(\cdot)$ 是 sigmoid 函数,$\sigma(x) = \frac{1}{1 + e^{-x}}$。
  • $\mathcal{D}$ 是偏好数据集 (preference dataset),包含了 $(p, y_w, y_l)$ 对。
  • $\mathbb{E}$ 表示期望。

DPO 损失函数的梯度 (Gradient of the DPO Loss Function):

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref}) = -\beta\, \mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w)\big)\big[\nabla_\theta \log \pi_\theta(y_w \mid p) - \nabla_\theta \log \pi_\theta(y_l \mid p)\big]\Big]$$

其中,隐式奖励模型 (implicitly defined reward model) 定义为 $\widehat{R}_\theta(p, y) = \beta \log \frac{\pi_\theta(y \mid p)}{\pi_{ref}(y \mid p)}$,正如 Rafailov et al. (2024) 中所证明的。

直觉 (Intuition): DPO 目标函数的直觉是,损失函数的梯度会增加获胜响应 $y_w$ 的似然 (likelihood),并减少失败响应 $y_l$ 的似然 (likelihood)。每个样本的权重由隐式奖励模型对配对响应排名错误的程度决定。如果模型错误地认为失败响应比获胜响应更好(即 $\widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w)$ 为正且较大),那么梯度会更大,从而更强烈地调整模型,使其偏好 $y_w$ 而非 $y_l$。
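
按照公式 1,下面是 DPO 损失的一个最小 PyTorch 示意(假设已分别算出获胜/失败响应在策略模型与参考模型下的序列对数概率;并非论文官方训练代码):

```python
# DPO 损失的最小示意:输入为四个形状相同的张量(各序列的对数概率之和)。
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,   # log pi_theta(y_w | p)
             policy_logp_l: torch.Tensor,   # log pi_theta(y_l | p)
             ref_logp_w: torch.Tensor,      # log pi_ref(y_w | p)
             ref_logp_l: torch.Tensor,      # log pi_ref(y_l | p)
             beta: float = 0.1) -> torch.Tensor:
    # 隐式奖励:R(p, y) = beta * log(pi_theta(y|p) / pi_ref(y|p))
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # -log sigmoid(R_w - R_l):提升获胜响应、压低失败响应的似然
    return -F.logsigmoid(reward_w - reward_l).mean()

# 用法示例:四个张量形状均为 [batch],由对 (p, y_w)、(p, y_l) 各做一次前向得到
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-15.0]))
```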

5. 实验设置

ACT 是一种样本高效 (sample-efficient) 的方法,用于使 LLM 适应对话行动策略 (conversational action policy)。本文主要关注提升 LLM 隐式选择 (implicit selection) 代理端澄清问题 (clarification question) 的能力。因此,研究人员在三个复杂的对话信息寻求任务 (conversational information-seeking tasks) 中评估 ACT 作为一种微调方法。实验使用的基础模型是 Zephyr 7B-β,它是 Mistral 7B (Jiang et al., 2023) 的一个版本,经过 UltraChat 的指令微调,并与 UltraFeedback (Cui et al., 2023; Ding et al., 2023; Tunstall et al., 2023) 的人类偏好对齐。

5.1. 数据集

本文调查了三个混合主导对话任务 (mixed-initiative conversation tasks),用户在其中与助手互动以检索信息。在每个任务设置中,用户提出一个可能模糊的查询。助手的任务是提供一个响应,该响应可以是澄清问题 (clarifying question),也可以是直接回答用户查询的尝试。对于每个任务,初始的被拒绝响应 (rejected responses) 是通过提示 Gemini Ultra 作为条件生成模型 (conditional generation model) $M$ 来合成的。

ACT 在各种领域的多样化数据集上进行评估:表格型对话问答 (tabular conversational QA)、机器阅读理解的对话问答 (conversational QA for machine reading comprehension) 和对话式文本到 SQL 生成 (conversational text-to-SQL generation)。

5.1.1. PACIFIC: 表格数据的对话问答

PACIFIC 是一个主动式对话问答 (proactive conversational question answering) 任务,其基础是表格 (tabular) 和文本 (textual) 金融数据的混合 (Deng et al., 2022)。这可能涉及从给定范围、多个范围生成正确单词,或提供正确的算术表达式。PACIFIC 的官方评估使用以数字为中心的词元重叠度量 (numeracy-focused token overlap metric),称为 DROP F1。

Figure 1 是 PACIFIC 任务的一个简化示例,展示了模糊性如何导致代理需要提出澄清问题。表格 A17、A18 和 A19 分别展示了在 PACIFIC 任务中使用标准提示 (Standard Prompting)、思维链提示 (Chain-of-Thought Prompting) 和主动混合主导提示 (Proactive Mixed-Initiative Prompting) 的上下文示例。

5.1.2. Abg-CoQA: 机器阅读理解的对话问答

Abg-CoQA 是一个用于机器阅读理解 (machine reading comprehension) 中歧义消除 (disambiguation) 的对话问答数据集 (conversational question answering dataset) (Guo et al., 2021)。由于没有算术表达式,本文使用基于 SentenceBERT (Reimers & Gurevych, 2019) 的嵌入语义距离 (embedding-based semantic distance) 作为评估指标,这已被用于更灵活地衡量问答性能 (Risch et al., 2021)。

Figure A4 展示了 Abg-CoQA 中一个对比配对的例子。

5.1.3. AmbigSQL: 模糊对话式文本到 SQL 生成

AmbigSQL 是本文提出的一个新任务,用于 SQL 基础的对话式歧义消除 (SQL-grounded conversational disambiguation)。研究人员系统地扰动了 Spider(一个流行的单轮文本到 SQL 基准 (text-to-SQL benchmark),Yu et al., 2018)中的无歧义查询,得到了可以轻松纳入对比 RL 微调 (contrastive RL tuning) 的配对训练示例。每个轨迹 (trajectory) 都通过最终提出的 SQL 查询是否与真实查询的执行结果 (execution result) 匹配来评估。

AmbigSQL 的动机: 歧义消除可以提高任务性能。研究人员提示 LLM 引入三种类型的模糊信息请求:

  1. 请求信息模糊 (requested information is ambiguous)(例如:"Show details about singers ordered by age from the oldest to the youngest")。
  2. 请求群体模糊 (requested population is ambiguous)(例如:"Which ones who live in the state of Indiana?";参见 Table A12)。
  3. 请求结果展示模糊 (requested presentation of results is ambiguous)(例如:"Show name, country, age for all singers ordered by age";参见 Table A13)。

研究发现,在欠详尽请求 (underspecified request) 下构建 SQL 查询,有无澄清的性能差距高达 45.8%(参见 Table A14),这表明澄清问题的必要性。

Table A10 | AmbigSQL 概览,一个从 Spider 合成的模糊文本到 SQL 数据集 (Text-to-SQL dataset)。

Train Dev Test
Num. Unambiguous Requests 7,000 1,034 1,034
Num. Ambiguous Requests 7,000 1,034 1,034
Num. Unique Schemas 1,056 145 145
Types of Ambiguity 3 3 3

Table A12 | 上下文示例,作为提示的一部分,用于创建目标群体模糊的信息请求。黑色文本的格式表示如何使用真实请求来形成目标示例的提示。蓝色文本表示将从 LLM 合成的内容。论文中省略了数据库 schema(完整内容见 6.2 节)。

Table A13 | 上下文示例,作为提示的一部分,用于创建目标列模糊的信息请求。黑色文本的格式表示如何使用真实请求来形成目标示例的提示。蓝色文本表示将从 LLM 合成的内容。论文中省略了数据库 schema(完整内容见 6.2 节)。

5.2. 评估指标

本文从两个维度评估 ACT 在对话中推理歧义 (reason about ambiguity in conversation) 以更好地实现对话目标 (conversational goals) 的能力:

5.2.1. 代理任务性能 (Agent task performance)

评估 ACT 是否能改善多轮任务完成 (multi-turn task completion) 能力。

  • 轮次级评估 (Turn-level evaluation): 将模型响应与用户查询的真实话语进行比较,使用 4.1 节中定义的任务特定启发式 (task-specific heuristics)。
  • 多轮评估方案 (Multi-turn evaluation scheme) / 轨迹级评估 (Trajectory-level evaluation):
    • 当 LLM 采样响应是澄清问题 (clarifying question) 时,模拟用户响应,并再次从 LLM 采样另一个响应,直到其尝试回答原始查询。
    • 将此结果与用户的真实信息寻求目标 (ground truth information-seeking goal) 进行评估。
    • 使用 $A$(行动分类器)和 $U$(用户模拟器)进行模拟,并使用 4.1 节中定义的启发式。Figure A5 展示了一个示例。
  • 澄清后性能 (Post-Clarification Performance): 在 PACIFIC 和 AmbigSQL 中,还计算模型在之前提出澄清问题 (clarifying questions) 的模拟响应上的任务性能,以更精细地衡量模型推理自身澄清问题 (clarification questions) 的能力。

具体内容层面评估指标:

  • Turn-level DROP F1: PACIFIC 任务的平均即时响应 DROP F1。
    • 概念定义: DROP F1 (Deng et al., 2022; Dua et al., 2019) 是一个用于评估问答系统抽取式答案准确性的指标,衡量模型生成的答案与真实答案之间的词元 (token) 重叠度。它对数字和实体等关键信息尤其敏感。
    • 数学公式:
      $$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{\text{正确预测的词元数量}}{\text{模型生成的所有词元数量}}, \quad \mathrm{Recall} = \frac{\text{正确预测的词元数量}}{\text{真实答案中所有词元数量}}$$
    • 符号解释:
      • $\mathrm{F1}$:F1 分数,精确率和召回率的调和平均值。
      • $\mathrm{Precision}$:精确率,模型预测正确部分的比例。
      • $\mathrm{Recall}$:召回率,真实答案中被模型正确预测的比例。
  • Trajectory-level DROP F1: PACIFIC 任务的平均轨迹结果 DROP F1。
  • Post-Clarification DROP F1: PACIFIC 任务的澄清后 DROP F1,即仅针对包含代理澄清轮次的轨迹的 Trajectory-level DROP F1。
  • Turn Similarity: Abg-CoQA 任务的即时响应嵌入相似度 (Immediate response embedding similarity)。
    • 概念定义: Embedding-based Semantic Distance (Risch et al., 2021) 使用句子的嵌入向量来衡量模型生成答案与真实答案之间的语义相似度。它比简单的词元重叠度量更能捕捉文本的意义,允许生成更多样化但语义等效的答案。
    • 数学公式: 通常使用余弦相似度 (Cosine Similarity),对于两个嵌入向量 $A$ 和 $B$:
      $$\mathrm{Similarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$
    • 符号解释:
      • $A, B$:两个待比较句子的嵌入向量。
      • $\cdot$:向量的点积。
      • $\|\cdot\|$:向量的欧几里得范数(L2 范数)。
  • Trajectory Similarity: Abg-CoQA 任务的轨迹结果嵌入相似度 (Trajectory outcome embedding similarity)。
  • Trajectory-level Execution Match: AmbigSQL 任务的轨迹结果执行匹配百分比 (Percentage of trajectory outcomes with correct execution results)。
    • 概念定义: Execution Match 是衡量文本到 SQL (Text-to-SQL) 任务中模型生成 SQL 查询准确性的客观指标。如果模型生成的 SQL 查询在数据库中执行后,其结果与真实 SQL (ground truth SQL) 查询的执行结果完全一致,则认为匹配成功。
  • Post-Clarification Execution Match: AmbigSQL 任务的澄清后执行匹配百分比,即仅统计包含澄清轮次的轨迹中执行结果正确的比例。

5.2.2. 隐式模糊性识别 (Implicit ambiguity recognition)

为了进一步理解代理的多轮任务完成能力 (multi-turn task completion ability),研究人员考虑了“对话行为准确率 (dialogue act accuracy)” (Chen et al., 2023b)。假设可以访问真实模糊性标签 (ground-truth ambiguity label),如果一个请求是真正模糊的,模型应该生成一个澄清问题 (clarification question);否则,它应该尝试提供请求的信息。由于 PACIFIC 和 Abg-CoQA 的类别高度不平衡,主要考虑 Macro F1 作为指标。

具体行动层面评估指标:

  • Accuracy: 正确隐式行动 (correct implicit actions) 的百分比。
    • 概念定义: 衡量模型在给定对话上下文下,其隐式选择的对话行动(如“澄清”或“回答”)与真实标注行动一致的比例。
    • 数学公式:
      $$\mathrm{Accuracy} = \frac{\text{正确预测的行动数量}}{\text{总行动数量}}$$
  • Macro F1: 各行动 F1 的未加权平均 (Unweighted Average)。
    • 概念定义: 当类别不平衡时,Macro F1 比加权平均更公平地反映模型在每个类别上的性能,它计算每个类别的 F1 分数,然后取所有类别 F1 分数的平均值。
    • 数学公式:
      $$\mathrm{Macro\ F1} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{F1}_c$$
      其中,$\mathrm{F1}_c$ 是类别 $c$ 的 F1 分数,$N$ 是类别的总数。

5.3. 对比基线

5.3.1. 提示基线 (Prompting baselines)

本文将 ACT 微调的小模型与多种前沿 LLM (frontier LLMs) 的基于提示的方法 (prompt-based approaches) 进行比较:

  • Gemini 1.5 Pro
  • Gemini 1.5 Flash
  • Claude 3.5 Sonnet
  • Claude 3.0 Haiku

所有提示基线使用 10 个对话作为上下文示例 (in-context examples),采用三种不同的提示框架:

  1. Standard (标准提示): 使用与微调相同的指令格式。Table A17 是 PACIFIC 的一个示例。
  2. Chain-of-Thought (思维链): 结合 Wei et al. (2022) 提出的思维链推理 (chain-of-thought reasoning)。Table A18 是 PACIFIC 的一个示例。
  3. Proactive MIPrompt (主动混合主导提示): Deng et al. (2023c) 中的提示基线,结合了 Chen et al. (2023b) 的混合主导提示方法 (mixed-initiative prompting approach) 和 Deng et al. (2023b) 的主动提示 (Proactive Prompting)。Table A19 是 PACIFIC 的一个示例。

5.3.2. 微调基线 (Tuning baselines)

  • Supervised Fine-tuning (SFT): 使用每个数据集训练集的真实响应 (ground truth responses) 进行微调。
  • Iterative Reasoning Preference Optimization (IRPO): 一种最近提出的在线策略 DPO (on-policy DPO) 变体,在算术等推理任务中受到关注。本文在 PACIFIC 和 AmbigSQL 这两个定量推理任务 (quantitative reasoning tasks) 上评估 IRPO。
  • DPO-Dist (DPO Distillation): 一种流行的离线 DPO (off-policy DPO) 方法,其中获胜响应 $y_w$ 来自能力更强的模型,而失败响应 $y_l$ 来自能力较弱的模型 (Mitra et al., 2023; Mukherjee et al., 2023; Xu et al., 2024a)。相关结果在附录 B 中提供。

6. 实验结果与分析

为了模拟数据有限的真实世界场景,本文在不同数据量的对话样本设置下,对 ACT 作为一种微调方法进行了评估。基础模型为 Zephyr 7B-β。

6.1. 核心结果分析

6.1.1. 表格型问答 (PACIFIC 任务)

以下是原文 Table 1 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Gemini Pro Standard ICL 10 81.4 59.7 58.7 49.7
Claude Sonnet Standard ICL 10 71.9 43.7 42.0 28.5
Gemini Pro SFT 50 71.2 51.8 45.7 9.9
Gemini Pro SFT 100 75.2 64.3 54.6 8.5
Gemini Pro SFT 250 88.0 67.4 59.3 10.2
Zephyr 7B-β SFT 50 69.0 57.8 61.3 43.5
Zephyr 7B-β IRPO 50 67.7 59.1 56.7 34.4
Zephyr 7B-β ACT (ours) 50 82.2 62.8 61.9 57.2
Zephyr 7B-β SFT 100 82.3 58.6 60.3 49.9
Zephyr 7B-β IRPO 100 84.5 60.4 55.2 38.2
Zephyr 7B-β ACT (ours) 100 86.0 65.0 62.0 57.4
Zephyr 7B-β SFT 250 86.9 65.1 63.3 56.7
Zephyr 7B-β IRPO 250 85.4 64.9 58.4 40.3
Zephyr 7B-β ACT (ours) 250 89.6 68.1 65.7 62.0
  • ACT 性能领先: Table 1 显示,在所有三种数据效率设置 (data-efficient settings) 下,ACT 在所有指标上均优于 SFTIRPO。值得注意的是,IRPO 具有额外的测试时计算 (test-time computation) 优势 (Snell et al., 2024; Pang et al., 2024),但 ACT 仍能超越。
  • 模糊性识别能力显著提升: 在仅有 50 个对话作为微调数据的情况下,ACT 在衡量模型隐式识别模糊性 (implicitly recognize ambiguity) 的能力方面,相比 SFT 实现了高达 19.1% 的相对提升(从 69.0 Macro F1 提高到 82.2)。
  • 数据效率: ACT 展现出比基于适配器 (adapter-based) SFTGemini Pro 更高的数据效率。在多轮任务性能方面(轨迹级 DROP F1),相对提升高达 35.7%(从 45.6 提高到 61.9)。
  • 媲美或超越前沿 LLM 在数据有限的设置下,通过 ACT 微调的模型在推理时即使没有上下文示例 (in-context examples),也能达到或超越使用上下文学习 (in-context learning, ICL) 的前沿 LLM 性能。这强调了在线策略学习 (on-policy learning)多轮轨迹模拟 (multi-turn trajectory simulation) 对改进多轮任务完成的关键作用。

6.1.2. 机器阅读理解 (Abg-CoQA 任务)

以下是原文 Table 2 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro Standard ICL 10 55.5 67.0 72.2
Claude Sonnet Standard ICL 10 66.0 50.1 54.3
Zephyr 7B-β SFT 50 44.6 53.3 64.2
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Zephyr 7B-β SFT 100 52.6 63.1 69.4
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Zephyr 7B-β SFT 250 53.5 64.0 66.2
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1
  • 任务特定指标表现最佳: Table 2 显示,在所有三种数据设置下,ACT 在任务特定指标(特别是轨迹级嵌入相似度 (trajectory-level embedding similarity))方面表现最佳,这表明其在多轮推理方面的改进。
  • 行动层面与内容层面的权衡: 在 100 和 250 个对话的设置中,SFT 微调的 Zephyr隐式行动识别 (implicit action recognition) 方面略优于 ACTMacro F1),但这并不意味着整体性能更优。作者在附录 A 中进一步讨论了这一点,强调行动层面的性能主要有助于理解澄清推理能力 (clarification reasoning ability)
  • 多轮推理能力改进: ACT 在所有条件下均带来了最强的轮次级 (turn-level)轨迹级任务性能 (trajectory-level task performance),这表明其改进了多轮推理能力。

6.1.3. 对话式文本到 SQL 生成 (AmbigSQL 任务)

以下是原文 Table 3 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Accuracy ↑ Macro F1 ↑ Execution Match ↑ PC Execution Match ↑
Gemini Pro Standard ICL 10 72.1 70.9 63.5 75.2
Claude Sonnet Standard ICL 10 68.5 63.8 66.5 72.4
Zephyr 7B-β SFT 50 77.4 77.4 21.9 13.9
Zephyr 7B-β IRPO 50 91.0 91.0 27.8 30.8
Zephyr 7B-β ACT (ours) 50 80.8 80.7 43.6 38.1
Zephyr 7B-β SFT 100 97.2 97.2 43.3 34.3
Zephyr 7B-β IRPO 100 96.2 96.1 45.0 37.0
Zephyr 7B-β ACT (ours) 100 99.2 99.3 48.0 49.6
Zephyr 7B-β SFT 250 99.8 99.7 51.0 50.7
Zephyr 7B-β IRPO 250 97.0 97.1 49.7 45.6
Zephyr 7B-β ACT (ours) 250 99.9 99.8 52.3 53.0
Zephyr 7B-β SFT 14,000 (All) 99.8 99.8 63.1 60.4
  • ACT 在任务性能上最强: Table 3 显示,使用 ACT 微调的 Zephyr 在每个数据设置中都能实现最强的任务性能。
  • 澄清后 SQL 执行匹配的显著提升: 当数据资源稀缺时,Post-Clarification SQL Execution Match 的性能提升尤为显著。例如,在 50 个对话的设置中,ACTPC Execution Match 为 38.1%,远高于 SFT 的 13.9% 和 IRPO 的 30.8%。
  • 行动准确率与 SQL 性能的差异: 尽管提示基线在行动准确率 (Action Accuracy) 上可能不高,但被基准测试的前沿 LLM 在执行匹配 (execution match) 方面表现相对较强。相反,SFT 和 ACT 微调的 Zephyr 具有较高的行动准确率 (Action Accuracy),但在文本到 SQL 性能方面低于前沿 LLM。这主要归因于 SQL 生成任务能从更大的模型规模中显著获益 (Sun et al., 2023b)。
  • ACT 在多轮任务性能上的相对优势: 总体而言,ACT多轮任务性能 (multi-turn task performance) 方面实现了最大的相对性能提升,甚至超过了 IRPO 等用于定量推理的基线方法。这表明,如果将 ACT 应用于更大的模型,可能会进一步提升其多轮性能。

6.1.4. ACT 在野外:无对话行动监督的学习

以下是原文 Table 4 的结果:

Task Adaptation Environment Action-level Content-level
Base Model Framework Action Supervision Tuning Ex. Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Zephyr 7B-β SFT NA 50 69.0 57.8 61.3 43.5
Zephyr 7B-β ACT Crowdsourced 50 82.2 62.8 61.9 57.2
Zephyr 7B-β ACT Pseudo-labeled 50 80.1 62.4 61.1 54.7
Zephyr 7B-β SFT NA 100 82.3 58.6 60.3 49.9
Zephyr 7B-β ACT Crowdsourced 100 86.0 65.0 62.0 57.4
Zephyr 7B-β ACT Pseudo-labeled 100 84.8 63.5 61.5 56.1
Zephyr 7B-β SFT NA 250 86.9 65.1 63.3 56.7
Zephyr 7B-β ACT Crowdsourced 250 89.6 68.1 65.7 62.0
Zephyr 7B-β ACT Pseudo-labeled 250 89.0 68.1 64.9 61.0
  • 伪标签监督的有效性: 尽管在 Table 1-3 中使用了人工标注的 (crowdsourced) 模糊性标签,本文还证明了在没有行动标签监督的情况下执行行动基于微调 (action-based tuning) 的可能性。
  • 高一致性: 使用 Gemini 1.5 Pro 作为零样本行动标注器 (zero-shot action annotator) 重新标记 PACIFIC 语料库中的真实助手端轮次 (Assistant-side turns) 时,与真实行动标签达到了惊人的高一致性(98.5%)。
  • 性能无显著差异: Table 4 显示,无论是行动层面 (Action-level) 还是内容层面 (Content-level) 的指标,使用伪标签 (Pseudo-labeled) 的 ACT 与使用人工标注 (Crowdsourced) 的 ACT 之间几乎没有经验上的性能差异。
  • “野外”场景的潜力: 这凸显了 ACT“野外”设置 (in-the-wild settings) 中,即使只有少量无标签对话数据 (unlabeled conversational data),也可能非常有效。

6.2. 数据呈现 (表格)

以下是原文 Table A6 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro ICL 50 56.4 64.5 68.9
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Gemini Pro ICL 100 59.2 67.0 72.0
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Gemini Pro ICL 250 58.8 66.0 71.1
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

以下是原文 Table A7 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Gemini Pro Standard Prompt 10 81.4 59.7 58.7 49.7
Gemini Pro Chain-of-Thought 10 86.3 66.3 17.1 19.2
Gemini Pro Proactive MIPrompt 10 78.9 63.4 61.1 18.9
Gemini Flash Standard Prompt 10 67.4 58.8 58.7 17.9
Gemini Flash Chain-of-Thought 10 77.1 62.0 16.9 20.0
Gemini Flash Proactive MIPrompt 10 76.8 64.0 62.0 24.4
Claude Sonnet Standard Prompt 10 71.9 43.7 42.0 28.5
Claude Sonnet Chain-of-Thought 10 80.0 37.2 13.0 6.8
Claude Sonnet Proactive MIPrompt 10 74.9 47.2 45.9 7.6
Claude Haiku Standard Prompt 10 46.9 26.4 26.2
Claude Haiku Chain-of-Thought 10 48.6 23.7 12.0 2.9
Claude Haiku Proactive MIPrompt 10 48.3 18.6 18.2 7.3
Gemini Pro SFT 50 71.2 51.8 45.7 9.9
Gemini Pro SFT 100 75.2 64.3 54.6 8.5
Gemini Pro SFT 250 88.0 67.4 59.3 10.2
Zephyr 7B-β SFT 50 69.0 57.8 61.3 43.5
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 75.5 61.7 55.7 30.8
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 74.8 62.0 56.3 31.9
Zephyr 7B-β IRPO 50 67.7 59.1 56.7 34.4
Zephyr 7B-β ACT (ours) 50 82.2 62.8 61.9 57.2
Zephyr 7B-β SFT 100 82.3 58.6 60.3 49.9
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 68.8 53.3 53.3 31.7
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 83.0 59.0 53.7 29.3
Zephyr 7B-β IRPO 100 84.5 60.4 55.2 38.2
Zephyr 7B-β ACT (ours) 100 86.0 65.0 62.0 57.4
Zephyr 7B-β SFT 250 86.9 65.1 63.3 56.7
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 65.6 53.6 54.1 30.9
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 82.8 43.3 38.6 19.6
Zephyr 7B-β IRPO 250 85.4 64.9 58.4 40.3
Zephyr 7B-β ACT (ours) 250 89.6 68.1 65.7 62.0

以下是原文 Table A8 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro Standard Prompt 10 55.5 67.0 72.2
Gemini Pro Chain-of-Thought 10 61.2 63.4 39.1
Gemini Pro Proactive MIPrompt 10 55.5 63.3 33.3
Gemini Flash Standard Prompt 10 52.6 62.5 67.4
Gemini Flash Chain-of-Thought 10 61.2 56.5 36.6
Gemini Flash Proactive MIPrompt 10 58.1 61.7 36.1
Claude Sonnet Standard Prompt 10 66.0 50.1 54.3
Claude Sonnet Chain-of-Thought 10 63.7 46.2 36.8
Claude Sonnet Proactive MIPrompt 10 57.2 60.8 32.9
Claude Haiku Standard Prompt 10 49.3 40.9 41.7
Claude Haiku Chain-of-Thought 10 46.2 30.7 28.0
Claude Haiku Proactive MIPrompt 10 45.2 34.5 31.4
Zephyr 7B-β SFT 50 44.6 53.3 64.2
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 46.9 57.2 61.2
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 44.7 57.9 61.5
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Zephyr 7B-β SFT 100 52.6 63.1 69.4
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 47.8 61.9 67.1
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 44.8 62.0 66.4
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Zephyr 7B-β SFT 250 53.5 64.0 66.2
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 46.0 61.9 66.3
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 46.3 62.6 67.0
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

以下是原文 Table A9 的结果:

Adaptation Setting | Action-level | Content-level
Base Model Approach Conversations Accuracy ↑ Execution Match ↑ PC Execution Match ↑
Gemini Pro Standard Prompt 10 72.1 63.5 75.2
Gemini Flash Standard Prompt 10 75.6 64.2 66.2
Claude Sonnet Standard Prompt 10 68.5 66.5 72.4
Claude Haiku Standard Prompt 10 73.8 57.3 65.3
Zephyr 7B-β SFT 50 77.4 21.9 13.9
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 77.7 42.6 31.5
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 78.0 40.9 41.2
Zephyr 7B-β IRPO 50 91.0 27.8 30.8
Zephyr 7B-β ACT (ours) 50 80.8 43.6 38.1
Zephyr 7B-β SFT 100 97.2 43.3 34.3
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 98.7 45.1 45.3
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 99.8 47.8 44.8
Zephyr 7B-β IRPO 100 96.2 45.0 37.0
Zephyr 7B-β ACT (ours) 100 99.2 48.0 49.6
Zephyr 7B-β SFT 250 99.8 51.0 50.7
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 97.3 49.7 44.2
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 99.7 50.7 50.3
Zephyr 7B-β IRPO 250 97.0 49.7 45.6
Zephyr 7B-β ACT (ours) 250 99.9 52.3 53.0
Zephyr 7B-β SFT 14,000 (All) 99.8 63.1 60.4

以下是原文 Table A10 的结果:

Train Dev Test
Num. Unambiguous Requests 7,000 1,034 1,034
Num. Ambiguous Requests 7,000 1,034 1,034
Num. Unique Schemas 1,056 145 145
Types of Ambiguity 3 3 3

以下是原文 Table A11 的结果:

No. Interacting Party Utterance
User Assistant Can you list all the singer ids that aren't present in the song table? SELECT Name FROM singer WHERE Singer_ID NOT IN ...
1 User Assistant Thanks! You should ask at least 3 questions
2 Assistant Did you want the full name of makers and the number?
3 Assistant Do you mean the address of the customer with first name Luis?

以下是原文 Table A12 的结果:

[Database Schema Omitted] The target SQL query is the following:
SELECT professional_id , last_name , cell_number FROM Professionals
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON
T1.professional_id = T2.professional_id
GROUP BY T1.professional_id HAVING count(*) > 2
Here is a clear request that would correspond to this SQL query:
"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List
his or her id, last name and cell phone."
Here is the same request converted into an ambiguous format by underspecifying the target columns:
"Which ones who live in the state of Indiana or have done treatment on more than 2 treatments?"
Here is an appropriate clarifying question to recover the clear request from the ambiguous request:
"Are you asking about the Professionals?"

以下是原文 Table A13 的结果:

[Database Schema Omitted] The target SQL query is the following:
SELECT professional_id , last_name , cell_number FROM Professionals
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON
T1.professional_id = T2.professional_id
GROUP BY T1.professional_id HAVING count(*) > 2
Here is a clear request that would correspond to this SQL query:
"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List
his or her id, last name and cell phone."
Here is the same request converted into an ambiguous format by underspecifying the target columns:
"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?"
Here is an appropriate clarifying question to recover the clear request from the ambiguous request:
"Which information of the professionals do you want to know?"

以下是原文 Table A14 的结果:

Model Ambiguous Request Execution Match Execution Match with Clarification Turns
Gemini Pro 28.5 68.7
Gemini Ultra 31.2 77.0

以下是原文 Table A15 的结果:

Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: Show all singers who are not from China or USA.
Assistant: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA'
Trajectory Goal: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA'
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: Show all singers who are not from China.
Assistant: Which country other than China are you asking about?
Trajectory Goal: User: Show all singers who are not from China and USA. Assistant: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA'

以下是原文 Table A16 的结果:

Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year and Stadium with columns Stadium_ID, Location, Name, Capacity. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: What are the names of the stadiums where concerts were held in 2014?
Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014
Trajectory Goal: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year and Stadium with columns Stadium_ID, Location, Name, Capacity. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: What are the names of the stadiums where concerts were held?
Assistant: Which year are you asking about?
Trajectory Goal: User: What are the names of the stadiums where concerts were held in 2014? Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014

以下是原文 Table A17 的结果:

You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted]
User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
Assistant: ['What kind of defined plans are you asking about?']
User: The defined benefit plans and the defined contribution plan respectively.
Assistant: ['5.1 million', '0.6 million']
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
Assistant: ['What kind of change are you asking about?']
User: The pension.
Assistant: ['(39,145),(39,145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Assistant: ['Which rate of increase and decrease are you talking about?']
User: The future salary growth.
Assistant: ['7,572', '(6,919)']
User: What is the difference between them?
Assistant: 7,572 - (6,919) = 14491

以下是原文 Table A18 的结果:

or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted] You are an Assistant answering questions from a User. You should either attempt to answer the question
User What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
Insrction: I the user' question is biguous, ask an apropriate clarifyi question.Otherwise, dily answer the user's question using the information from the passage context and the table. Let's think step
by step. Reasoning: The user's question is not ambiguous. Assistant: 0.6/5.1 = 11.76
rate? User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount
Insrction: I the user' question is biguous, ask an apropriate clarifyi question.Otherwise, dily
answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question was ambiguous. Assistant: ['What kind of change are you asking about?]
User: The pension.
Instion: Ithe user' question is mbiguous, ask an apropriate clarifyi question.Otherwi, drly
answer the user's question using the information from the passage context and the table. Let's think step
by step.
Reasoning: The user's question is not ambiguous. Assistant: ['(39,145),(39,145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
Insction: I he user' question is mbiguous, askan appropriateclarifyig questio. Otherwis, rey
answer the user's question using the information from the passage context and the table. Let's think step
by step.
Reasoning: The user's question is not ambiguous. Assistant: 49,361 - (39,145) = 88506 User: How much
would the pension change if there is a 1% increase and a 1% decrease respectively?
Instion: Ithe user' question is mbiguous, ask an apropriate clarifyi question.Otherwi, drly
answer the user's question using the information from the passage context and the table. Let's think step
by step.
Reasoning: The user's question was ambiguous.
Assistant: ['Which rate of increase and decrease are you talking about?']

以下是原文 Table A19 的结果:

You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity.
[Table and Passage Omitted]
User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['What kind of defined plans are you asking about?']
User: The defined benefit plans and the defined contribution plan respectively.
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: ['5.1 million', '0.6 million']
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['What kind of change are you asking about?']
User: The pension.
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: ['(39, 145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Actions: ["Directly Answer", "Ask a Clarification Question"]
Prompt: Given the task background and the conversation history, please use appropriate actions to generate the response.
Response: The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['Which rate of increase and decrease are you talking about?']

以下是原文 Table A20 的结果:

[Table and Passage Omitted]
User: What was the total number of acquisitions in FY19?
Assistant: 35 acquisitions
User: What are the benefits of active growth strategy?
Assistant: strengthens and scales the National Storage operating platform which drives efficiencies across
the business.
User: What was the total NLA?
Assistant: Which region are you asking about?
The last Assistant utterance is a clarifying question.

以下是原文 Table A21 的结果:

[Passage Omitted]
User: And who in the group was anxious? Assistant: Peppe
User: Was she well-rested?
Assistant: no
The last Assistant utterance is a direct answer.

以下是原文 Table A22 的结果:

[Database Schema Omitted]
User: "Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?
List his or her id, last name and cell phone."
Here is the same request converted into an ambiguous format by underspecifying the target columns:
Assistant:
SELECT professional_id , last_name , cell_number FROM Professionals
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON
T1.professional_id = T2.professional_id
GROUP BY T1.professional_id HAVING count(*) > 2
The last Assistant utterance is a direct answer.

以下是原文 Table A23 的结果:

The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request.
User: What are the names of the stadiums where concerts were held?
Assistant: Which year are you asking about?
User: What are the names of the stadiums where concerts were held in 2014?
Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014
The following is a summary of the information the User is looking for: 1. Names of stadiums where concerts were held in 2014.

以下是原文 Table A24 的结果:

The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you want to know who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.

以下是原文 Table A25 的结果:

Prompt: The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you want to know who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.
The following is a summary of the information the User is looking for: 1. What was his ranking? 2. Whether someone else had horse fighters? 3. Who had a considerable force of horse fighters west of him?

以下是原文 Table A26 的结果:

The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his
ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
[Passage Omitted]
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you mean who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.

以下是原文 Table A27 的结果:

A user is asking an assistant to retrieve some information from a SQL database. The command that the assistant should ultimately return is as follows:
SELECT county FROM campuses where campus = 'California State University-Chico'
The assistant will ask some questions to clarify the user's intent. The user should respond with a rephrased request that reflects their desired query.
User: what is the county?
Assistant: Are you asking for a list of all of the counties in the database?
User: I'm looking for the county of the campus 'California State University-Chico'.
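
Table A27 展示了 AmbigSQL 用户模拟器 (user simulator) 的提示格式:先告知模拟器最终应返回的目标 SQL,再要求它对助手的澄清问题给出改写后的请求。下面是按此格式拼装模拟器提示并调用生成接口的最小示意;`llm_complete` 只是一个假设的占位接口,并非原文实现:

```python
# 最小示意:拼装 AmbigSQL 用户模拟器提示并生成下一轮用户话语(假设性实现)。
from typing import Callable, List, Tuple

def build_user_simulator_prompt(target_sql: str, history: List[Tuple[str, str]]) -> str:
    """history 为 (speaker, utterance) 列表,speaker ∈ {"User", "Assistant"}。"""
    lines = [
        "A user is asking an assistant to retrieve some information from a SQL database. "
        "The command that the assistant should ultimately return is as follows:",
        target_sql,
        "The assistant will ask some questions to clarify the user's intent. "
        "The user should respond with a rephrased request that reflects their desired query.",
    ]
    for speaker, utterance in history:
        lines.append(f"{speaker}: {utterance}")
    lines.append("User:")  # 让模型续写模拟用户的下一轮话语
    return "\n".join(lines)

def simulate_user_turn(target_sql: str, history: List[Tuple[str, str]],
                       llm_complete: Callable[[str], str]) -> str:
    """调用假设的 llm_complete 接口,生成模拟用户的下一轮话语。"""
    return llm_complete(build_user_simulator_prompt(target_sql, history)).strip()
```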

以下是原文 Table A28 的结果:

Relevant Table Data:
  Year: 2019 || 2018
  Total current liabilities: 851 || 167
  Other noncurrent liabilities: 5 || 74
  Total Liabilities: 909 || 1,305
SFT:
  User: What were the total liabilities of IMFT?
  Assistant: \$909 ×
ACT:
  User: What were the total liabilities of IMFT?
  Assistant: Which year are you asking about?
  User: 2018
  Assistant: \$1,305

以下是原文 Table A29 的结果:

Relevant Data:
  Net profit/(loss) after tax was \$(9.8) million
  Net profit/(loss) after tax: (9,819) || 6,639 || (248%)
SFT:
  User: What was the net profit/(loss) after tax?
  Assistant: \$(9,819) ×
ACT:
  User: What was the net profit/(loss) after tax?
  Assistant: Which year are you asking the amount of?
  User: FY19
  Assistant: \$(9,819) ×

以下是原文 Table A30 的结果:

Relevant Table Data:
  Year: 2019 || 2018
  Investments: 1,216.0 || 1,212.9
SFT:
  User: In which year was the amount of Investments higher?
  Assistant: [2018] ×
ACT:
  User: In which year was the amount of Investments higher?
  Assistant: [2018, 2019] ×

以下是原文 Table A31 的结果:

Relevant Passage:
  She yawned, stretching out like a lazy cat. Her sister was also awake. "Lizzie", Meghan whispered, "are you awake?"
SFT:
  User: What did Meghan ask?
  Assistant: Do you mean where did Lizzie see her father?
  User: No
ACT:
  User: What did Meghan ask?
  Assistant: Do you mean that morning or the night before?
  User: The night before.

6.3. 消融实验/参数分析

以下是原文 Table 5 的结果:

| | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- |
| Action Importance | | | | |
| ACT w/ Random Actions | 63.2 | 55.3 | 58.7 | 32.8 |
| Ablation of ACT subcomponents | | | | |
| ACT w/o on-policy sampling | 74.8 | 61.5 | 59.1 | 40.5 |
| ACT w/ sampling but w/o simulation | 81.4 | 60.8 | 60.2 | 50.1 |
| ACT (full) | 82.2 | 62.8 | 61.9 | 57.2 |
| ACT with unaligned foundation models | | | | |
| Gemma 2B SFT | 57.7 | 38.0 | 40.5 | 17.0 |
| Gemma 2B ACT | 62.7 | 42.6 | 44.0 | 24.8 |
| Mistral 7B SFT | 57.7 | 53.8 | 51.4 | 27.7 |
| Mistral 7B ACT | 75.7 | 58.1 | 57.6 | 31.9 |

Table 5 展示了在 PACIFIC 的 50 条对话设置下进行的消融研究,以理解 ACT 各组件的重要性。

  • 行动基于偏好是否必要? (Are action-based preferences necessary?)

    • 分析: ACT 的关键因素之一是对比对 (contrastive pairs) 突出了对话行动 (conversational actions) 之间的差异。当构建偏好对时,如果随机采样获胜行动 (winning action)失败行动 (losing action)(如 Table 5 中的 “ACT w/ Random Actions”),性能会显著下降。例如,Macro F1ACT 的 82.2% 降至 63.2%,Post-Clarify F1 从 57.2% 降至 32.8%。
    • 结论: 这表明明确的行动选择 (action selection) 对于 ACT 的有效性至关重要(行动对比偏好对与轨迹模拟的构建示意见本节列表之后的代码)。
  • 是否需要在线策略采样? (Do we need on-policy sampling?)

    • 分析: Table 5 中的 “ACT w/o on-policy sampling” 实验评估了在线策略采样 (on-policy sampling) 的重要性,它本质上是在 4.2.2.1 节构建的数据集上进行普通的离线 DPO (off-policy DPO) 训练。
    • 结果: 尽管相比 SFT 有一些改进(例如 Macro F1 从 69.0 提高到 74.8),但与完整的 ACT 相比,整体改进幅度要小得多。
    • 结论: 这可能是因为离线负响应 (off-policy negative responses) 不保证位于策略模型 (policy model)语言流形 (language manifold) 中,导致分布漂移 (distribution shift) 难以通过离线学习克服 (Guo et al., 2024)。在线策略采样ACT 中起着关键作用。
  • 轨迹模拟是否必要? (Is trajectory simulation necessary?)

    • 分析: ACT 通过其在线策略轨迹模拟 (on-policy trajectory simulation) 更好地与多轮对话 (multi-turn conversations) 对齐。如果移除多轮模拟(如 Table 5 中的 “ACT w/ sampling but w/o simulation”),该方法类似于在线策略 DPO 变体 (on-policy DPO variants)(如 Pang et al., 2024),但带有考虑对话行动 (conversation actions)任务启发式 (task heuristics)对话特定奖励信号 (conversation-specific reward signal)
    • 结果: 发现轨迹级模拟 (trajectory-level simulation) 对于改进多轮性能 (multi-turn performance) 至关重要,尤其是策略模型 (policy model) 对自身澄清问题 (clarification questions) 的推理能力。例如,Post-Clarify F1 从 57.2% 降至 50.1%。
    • 结论: 这强调了多轮交互 (multi-turn interaction)结果评估 (outcome evaluation)ACT 中的核心地位。
  • ACT 是否与模型无关? (Is ACT model agnostic?)

    • 分析: 主要实验中的基础模型 Zephyr 是通过对 Mistral 进行对齐得到的。 Table 5 中的 “ACT with unaligned foundation models” 部分展示了使用未对齐的基础模型 Gemma 2BMistral 7B 的结果。
    • 结果: 尽管这两个模型在经过 ACT 微调后,其行动 F1 (Action F1)轨迹 F1 (Trajectory F1) 仍存在性能差距(例如 Gemma 2B 提高了 5.0 Action F1Mistral 7B 提高了 18.0 Action F1),但结果表明 ACT 无论是否存在人类反馈的预对齐,都能提升性能。
    • 结论: 这意味着 ACT 具有模型无关性 (model agnostic),可以改进任何基础模型的性能,尽管更好的模型初始化(预对齐)可能会带来进一步的益处。
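
为更直观地说明上述消融中涉及的两个核心组件——行动对比偏好对 (action-based contrastive pairs) 与在线策略轨迹模拟 (on-policy trajectory simulation)——下面给出一个最小的 Python 示意。其中 `policy_generate`、`simulate_user`、`task_heuristic_success` 均为假设接口,整个流程只是对 ACT 思路的粗略勾勒,并非原文实现;`random_actions=True` 粗略对应表中 “ACT w/ Random Actions” 的设定,`simulate_trajectory=False` 则对应 “w/ sampling but w/o simulation”:

```python
# 最小示意:构建行动对比偏好对,并用多轮轨迹模拟筛选 losing response(假设性实现)。
import random
from typing import Callable, Dict

ACTIONS = ["CLARIFY", "ANSWER"]

def opposite(action: str) -> str:
    return "ANSWER" if action == "CLARIFY" else "CLARIFY"

def build_preference_pair(
    context: str,
    gold_action: str,                               # 数据集(或伪标签)给出的参考行动
    gold_response: str,                             # 参考回复,作为 winning response
    policy_generate: Callable[[str, str], str],     # (context, action) -> response,on-policy 采样
    simulate_user: Callable[[str, str], str],       # (context, response) -> 模拟用户的下一轮话语
    task_heuristic_success: Callable[[str], bool],  # 任务特定启发式:轨迹是否达成目标
    random_actions: bool = False,                   # 粗略对应 "ACT w/ Random Actions" 消融
    simulate_trajectory: bool = True,               # 置 False 则对应 "w/ sampling but w/o simulation"
) -> Dict[str, str]:
    lose_action = random.choice(ACTIONS) if random_actions else opposite(gold_action)

    # on-policy 采样:用当前策略在对比行动下生成候选回复
    candidate = policy_generate(context, lose_action)

    rollout = context + "\nAssistant: " + candidate
    if simulate_trajectory:
        # 轨迹模拟:若候选是澄清问题,让用户模拟器接话,再用任务启发式评估最终结果
        if lose_action == "CLARIFY":
            rollout += "\nUser: " + simulate_user(context, candidate)
        if task_heuristic_success(rollout):
            return {}  # 候选实际也能达成目标,此处从简直接跳过该样本

    # 返回 DPO 风格的偏好三元组:参考回复为 chosen,对比行动下的采样回复为 rejected
    return {"prompt": context, "chosen": gold_response, "rejected": candidate}
```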

7. 总结与思考

7.1. 结论总结

本文提出了 ACT (Action-Based Contrastive Self-Training),一种模型无关 (model agnostic)准在线对比微调方法 (quasi-online contrastive tuning approach),旨在实现样本高效 (sample-efficient)对话任务适应 (conversational task adaptation)。同时,还提出了一种用于评估对话代理 (conversational agents) 的工作流程。

主要发现和贡献总结如下:

  • 有效提升澄清能力: ACT 能够显著提升 LLM 在多轮对话中隐式识别 (implicitly recognize)推理模糊性 (reason about ambiguity) 的能力,使其能够更好地提出澄清问题 (clarification questions),而不是过度猜测用户意图。
  • 数据高效性:有限数据 (limited data regime) 的场景下,ACT 表现出高度有效性,甚至在没有行动标签 (action labels) 的情况下,也能通过伪标签 (pseudo-labels) 实现成功的微调。
  • 超越基线方法: ACT 在多个真实世界对话任务(包括表格型问答 PACIFIC、机器阅读理解 Abg-CoQA 和新任务 AmbigSQL)上,相较于 SFTDPO 等标准微调方法,展现出显著的对话建模改进 (conversation modeling improvements)
  • 关键组件的验证: 消融实验证实了行动基于偏好 (action-based preferences)在线策略采样 (on-policy sampling)多轮轨迹模拟 (multi-turn trajectory simulation) 作为 ACT 核心组件的必要性。

7.2. 局限性与未来工作

7.2.1. 局限性 (Limitations)

  • 对澄清问题时机的假设: ACT 假设澄清问题 (clarification questions) 的提出是适时的。然而,众包对话数据集 (crowdsourced conversation datasets) 可能存在噪声,导致模型学到次优策略(例如,提出不必要的澄清问题或生成不流畅的语言)。未来的工作可能需要额外的预处理阶段 (preprocessing stage) 来推断某个行动是否有效。
  • 标签噪声的影响: 隐式行动识别评估 (implicit action recognition evaluation) 假设基准任务中的行动是“最优”的。在标注者间一致性 (inter-annotator agreement) 低的数据集(如 Abg-CoQA)中,这可能导致评估结果的不一致性。
  • 对任务特定启发式的依赖: ACT 依赖于任务特定启发式 (task-specific heuristics) 来评估轨迹结果。虽然这允许根据不同领域定制成功标准,但也增加了定制化 (customization)工程专业知识 (engineering expertise) 的需求。
  • 对现有 LLM 的依赖: ACT 的实现在很大程度上依赖现有 LLM(如 Gemini)进行偏好数据集 (preference dataset) 构建、行动分类 (Action Classification) 和 用户模拟 (User Simulation)。这可能受限于商业 LLM 的可访问性(成本、隐私)和潜在的偏差。
  • 研究范围: 本研究主要关注有限数据 (limited data regime) 场景。在训练数据充足且分布与目标分布高度匹配的情况下,ACT 相对于 SFT 等方法的优势可能不会那么显著。
  • “准在线”的界定: ACT 被定义为准在线对比 RL 微调 (quasi-online contrastive RL tuning) 方法,因为它结合了离线方法的固定数据集 (fixed dataset) 和在线方法的在线策略采样 (on-policy sampling)。但其在线探索的程度仍受限于对话轨迹 (dialogue trajectories) 的有限性质(尤其是在有明确正确答案的任务中)。

7.2.2. 未来工作 (Future Work)

  • 与其他复杂微调方法的结合: 考虑将 ACT 与现有针对文本到 SQL 生成 (text-to-SQL generation) 等复杂任务的复杂微调方法 (sophisticated tuning approaches) 相结合,以进一步提升性能。
  • 大规模数据和多任务环境的泛化: 研究 ACT大规模数据 (large-scale data)多任务环境 (multi-task environments) 中的泛化能力。
  • 与检索增强生成 (RAG) 的结合: 鉴于 LLM幻觉 (hallucinations) 问题,可以将 ACT检索增强生成 (Retrieval-Augmented Generation, RAG) 方法结合,以提升事实准确性。

7.3. 个人启发与批判

7.3.1. 个人启发 (Personal Insights)

  • 行动规划的价值: 这篇论文强调了在多轮对话中,单纯生成流畅的文本是不够的,智能体 (agent)行动规划 (action planning) 能力(何时澄清、何时回答)是实现对话成功的核心。这为 LLM 的对齐和微调提供了新的视角,超越了传统上关注的文本质量。
  • DPO 的灵活性与在线扩展: DPO 作为 RLHF 的轻量级替代品,其潜力不仅在于简化训练,更在于其通过灵活的损失函数设计来融入更复杂的偏好。ACT在线策略采样 (on-policy sampling)轨迹模拟 (trajectory simulation) 引入 DPO 框架,为数据稀疏 (data-efficient) 场景下的策略学习 (policy learning) 提供了有效途径。
  • 弱监督和伪标签的实用性: ACT 在没有行动标签 (action labels) 的情况下,通过伪标签 (pseudo-labels) 也能取得良好性能,这极大地降低了对昂贵人工标注 (human annotation) 的依赖,使其在实际应用中更具可行性。对于新兴或低资源任务,这种方法能够快速启动模型开发。
  • 多维度评估的重要性: 论文提出的行动层面 (action-level)内容层面 (content-level) 的评估指标,以及轨迹级评估 (trajectory-level evaluation),为全面衡量对话代理 (conversational agents) 性能提供了更丰富的视角,避免了仅关注表面生成质量的局限性。

7.3.2. 批判 (Critique)

  • 用户模拟器的可靠性: ACT 严重依赖用户模拟器 (User Simulator, U) 来生成多轮对话轨迹 (multi-turn dialogue trajectories)。如果 U 自身的模拟能力不足或存在偏差,它可能无法准确反映真实用户的行为和意图,从而可能引导 ACT 学习到次优的策略。U 的鲁棒性和泛化能力(特别是对未见过的策略 (policies) 和 错误 (errors) 的反应)是 ACT 成功的关键瓶颈。
  • 任务特定启发式的通用性: 论文明确指出 ACT 依赖任务特定启发式 (task-specific heuristics) 进行轨迹评估。虽然这允许在不同领域进行定制,但这也意味着将 ACT 应用到新任务时,需要投入额外的工程专业知识 (engineering expertise) 来设计和验证这些启发式规则。这可能限制了 ACT 在完全开放域或快速变化的复杂任务中的应用。
  • “准在线”的理论严格性: 尽管 ACT 被描述为准在线 (quasi-online),但其在线探索 (online exploration) 的性质与纯粹的在线强化学习 (online reinforcement learning) 仍有区别。论文对这种“准在线”模式的理论性质(例如,收敛性、探索-利用权衡)的讨论相对较少。更严格的理论分析可以进一步支持其有效性。
  • 对商业 LLM 的依赖与复现性: 实验中大量使用 Gemini Ultra 等商业 LLM 进行偏好数据构建 (preference data construction)行动分类 (action classification)用户模拟 (user simulation)。这对于学术界和资源有限的研究者来说,可能是一个挑战,影响了研究的复现性和透明度。如果这些功能由开源模型替代,其性能可能会有所下降。
  • 行动空间和复杂对话: 当前的行动空间 (action space) 主要限于 CLARIFYANSWER。在更复杂的对话中,可能存在更多元的对话行动 (dialogue actions)(如确认、建议、拒绝、提供额外信息等)。如何将 ACT 扩展到更丰富、层级化的行动空间,以及如何处理这些行动之间的依赖关系,是未来的挑战。
