Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
TL;DR 精炼摘要
本文提出了基于行动对比自训练的多轮对话模型,旨在提升大语言模型在模糊性处理中的能力。通过引入准在线偏好优化算法,ACT 在数据稀疏场景下有效学习对话策略,验证了其在表格问答、机器阅读理解及 AmbigSQL 等任务中的出色表现,显示出显著优于传统微调方法的效果。
摘要
Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
思维导图
论文精读
中文精读
1. 论文基本信息
1.1. 标题
Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training (学习澄清:基于行动对比自训练的多轮对话)
1.2. 作者
- Maximillian Chen*, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik
- 所属机构:Google 和 Columbia University (Maximillian Chen 隶属于两家机构)
1.3. 发表期刊/会议
本文于 2024 年 5 月 31 日作为预印本发表在 ArXiv。ArXiv 是一个广受欢迎的预印本服务器,在人工智能、机器学习和自然语言处理等领域,研究者通常会先将论文上传至此,以便快速分享研究成果并接受社区反馈。
1.4. 发表年份
2024 年
1.5. 摘要
大语言模型 (LLM) 通过人类反馈进行优化,已迅速成为开发智能对话助手的主流范式。然而,尽管它们在许多基准测试中表现出色,但基于 LLM 的智能体可能仍然缺乏对话技能,例如歧义消除 (disambiguation)——当它们面临模糊性时,通常会过度规避或隐式猜测用户的真实意图,而不是提出澄清问题 (clarification questions)。在任务特定设置中,高质量的对话样本往往有限,这构成了 LLM 学习最佳对话动作策略 (dialogue action policies) 的瓶颈。
本文提出了 Action-Based Contrastive Self-Training (ACT),一种基于 Direct Preference Optimization (DPO) 的准在线偏好优化算法 (quasi-online preference optimization algorithm),旨在实现多轮对话建模中的数据高效对话策略学习。研究人员在多个真实世界的对话任务中(包括表格型问答、机器阅读理解以及 AmbigSQL——一个用于向数据分析智能体澄清复杂 SQL 生成的信息寻求请求的新任务)验证了 ACT 的有效性,即使在没有行动标签的数据稀疏微调 (data-efficient tuning) 场景下也表现出色。此外,本文提出通过检查 LLM 是否能隐式识别和推理对话中的模糊性来评估其作为对话代理 (conversational agents) 的能力。ACT 相较于监督微调 (SFT) 和 DPO 等标准微调方法,展现出显著的对话建模改进。
1.6. 原文链接
- 原文链接: https://arxiv.org/abs/2406.00222
- PDF 链接: https://arxiv.org/pdf/2406.00222v2.pdf
- 发布状态:预印本 (Preprint)
2. 整体概括
2.1. 研究背景与动机
- 核心问题: 尽管大语言模型 (LLM) 在许多任务中表现出了强大的能力,但它们在多轮对话中,尤其是在处理用户请求的模糊性时,仍然存在显著缺陷。当前 LLM 倾向于过度规避 (overhedging)(例如,给出过于笼统或模棱两可的回答)或隐式猜测 (implicitly guess) 用户的真实意图,而非主动提出澄清问题 (clarification questions) 来消除歧义,这导致对话效率和用户满意度降低。Figure 1 提供了一个直观的例子。

  Figure 1 | 简化版的表格型对话问答中模糊性示例:对话代理应能识别用户问题中的模糊性并提出澄清问题,以获得更准确的最终答案。

- 为什么重要: 在复杂的任务场景中,用户往往会欠详尽 (underspecify) 他们的请求,导致信息模糊。有效的歧义消除 (disambiguation) 是实现共同基础 (common ground) 和完成任务的关键。例如,在生成复杂 SQL 查询或进行机器阅读理解时,一个微小的歧义都可能导致最终结果的错误。然而,可供 LLM 学习这些实用技能 (pragmatic skills) 的高质量对话数据往往稀缺,这成为一个主要瓶颈。

- 现有研究的挑战或空白:
  - LLM 的预训练或监督微调 (SFT) 目标通常不直接对齐对话中的实用技能(如澄清)。
  - 虽然 RLHF 等方法可以对齐 LLM,但现有模型在处理多轮对话任务时仍存在困难,部分原因是它们没有直接优化实用技能。
  - 收集高质量的多轮对话数据集成本高昂且受隐私问题限制,使得在数据稀疏场景下学习最佳对话策略 (dialogue policies) 极具挑战性。
  - 现有 DPO 类算法在对话任务中的应用多集中于单轮响应优化,未能充分考虑多轮轨迹 (multi-turn trajectories) 和对话行动 (conversational actions) 的优化。

- 本文的切入点或创新思路:
  - 聚焦行动规划: 提出直接优化 LLM 在模糊上下文中隐式选择对话策略 (conversational strategies) 的能力,尤其关注澄清问题 (clarifying questions)。
  - 数据高效: 设计一种数据高效的算法,使其在高质量对话样本有限的情况下也能有效学习。
  - 准在线偏好优化: 引入 Action-Based Contrastive Self-Training (ACT),结合 DPO 的易用性和在线学习的探索能力,通过对比不同实用对话行动 (pragmatic conversational actions) 的差异来优化模型。
  - 多任务验证: 在多个真实世界对话任务上验证方法,包括一个新提出的 AmbigSQL 任务,突出其通用性。
  - 新型评估: 提出通过检查 LLM 是否能隐式识别和推理对话中的模糊性来评估其作为对话代理 (conversational agents) 的能力。
2.2. 核心贡献/主要发现
- 提出了 Action-Based Contrastive Self-Training (ACT) 算法: ACT 是一种样本高效 (sample-efficient)、基于 DPO 的准在线 (quasi-online) 偏好优化算法。它专注于通过对比智能体可能的实用对话行动 (pragmatic conversational actions) 之间的差异来改进 LLM 的多轮对话能力。

- 在数据稀疏场景下表现出色: ACT 在数据高效微调 (data-efficient tuning) 场景下展示了其有效性,即使在没有明确行动标签 (action labels) 的情况下,也能在多个真实世界的对话任务中实现显著改进。这些任务包括:
  - 表格型问答 (tabular-grounded question-answering)
  - 机器阅读理解 (machine reading comprehension)
  - AmbigSQL 任务: 一个新颖的、用于歧义消除 (disambiguating) 针对数据分析智能体生成复杂 SQL 的信息寻求请求的任务。

- 多轮对话建模的显著改进: ACT 在多轮任务完成 (multi-turn task completion) 能力上显著优于 Supervised Fine-tuning (SFT) 和 Direct Preference Optimization (DPO) 等标准微调方法。Figure 2 直观展示了 ACT 的性能优势:随着对话数量的增加,ACT 的澄清推理指标 (Clarification Reasoning, DROP F1) 显著提升,相比标准方法带来约 24% 的推理提升、19% 的行动优化提升,以及约 5 倍的数据效率。

  Figure 2 | 在数据高效的对话建模设置中,ACT 大大优于标准微调方法(如在 PACIFIC 任务上的表现)。

- 评估框架的创新: 论文提出了一种评估 LLM 作为对话代理 (conversational agents) 能力的新工作流程,即通过检查模型是否能隐式识别 (implicitly recognize) 和推理 (reason about) 对话中的模糊性。

- 模型无关性 (Model Agnostic): ACT 对基础模型具有良好的适应性,即使是对未经过人类反馈对齐的基础模型,也能带来性能提升。
3. 预备知识与相关工作
本节旨在为读者提供理解论文核心内容所需的背景知识,并阐述本文工作与现有研究的联系与区别。
3.1. 基础概念
- 大语言模型 (Large Language Models, LLMs): 指的是参数量巨大、在海量文本数据上进行过预训练的深度学习模型,如 GPT 系列、Gemini、Mistral 等。它们通过学习语言的统计规律,能够生成连贯、有意义的文本,并执行各种自然语言处理任务,如问答、翻译、摘要和对话。本文中,LLM 是构建对话代理 (conversational agents) 的基础。
- 人类反馈 (Human Feedback): 优化 LLM 性能的关键机制。通常指在模型生成响应后,人类标注者对其质量、有用性、安全性等方面进行评价或排序。这种反馈用于进一步微调模型,使其行为更符合人类偏好。
- 对话代理 (Conversational Agents): 能够与人类进行自然语言交互,并协助完成特定任务或提供信息的智能系统。它们需要具备理解用户意图、生成恰当响应、管理对话流程等能力。本文关注的是如何提升 LLM 作为对话代理的能力。
- 歧义消除 (Disambiguation): 指在对话中识别并解决用户请求或表达中的模糊性。例如,用户说"展示详情",代理需要澄清"详情"指的是哪些具体信息。这是混合主导对话 (Mixed-Initiative Conversation) 的一个核心技能。
- 对话动作策略 (Dialogue Action Policies): 描述对话代理 (dialogue agent) 在特定对话状态 (dialogue state) 下应采取的行动。例如,当用户请求模糊时,策略可能是"提出澄清问题";当请求清晰时,策略可能是"提供答案"。本文的目标是优化 LLM 学习这些策略的能力。
- 直接偏好优化 (Direct Preference Optimization, DPO): 一种强化学习 (Reinforcement Learning) 的替代方法,用于从人类偏好数据中对齐 LLM。DPO 不需要训练一个单独的奖励模型 (reward model),而是直接将偏好数据转换为一个分类损失函数,通过最大化获胜响应相对于失败响应的概率比来优化策略模型。这简化了 RLHF 的训练过程,通常更稳定且计算效率更高。
- 监督微调 (Supervised Fine-tuning, SFT): 在预训练 LLM 的基础上,使用带有任务特定标注数据(例如,问答对、对话轮次)进一步训练模型的过程。SFT 的目标是使模型更好地遵循指令并适应特定任务。
3.2. 前人工作
3.2.1. 混合主导对话代理 (Mixed-Initiative Conversational Agents)
混合主导 (Mixed-Initiative) 是指在人机交互中,人与机器都可以在对话中承担主导角色,而不是由一方完全控制。这种代理通常包含两个核心组件:
- 理解与规划模块 (Understanding and Planning Module): 负责分析对话状态(如用户意图、上下文),并决定接下来要执行的对话行动 (dialogue action),例如是问澄清问题 (clarifying question) 还是直接提供答案 (answer)。这通常被视为一个马尔可夫决策过程 (Markov Decision Process, MDP),其中对话状态 (dialogue state) 是从一个潜在未知分布中抽取的,而行动 (action) 是对话话语中携带的实用意图 (pragmatic intent) 的低维表示(即对话行为 (dialogue act))。
- 生成模块 (Generation Module): 根据规划模块的输出,生成具体的话语 (utterance)。早期的研究可能使用多目标 SFT 或专门的控制代码嵌入 (control codes embeddings) 来实现受控生成。LLM 极大地改进了实用受控生成 (pragmatically-controlled generation) 的性能。
挑战:
- 规划 (Planning): 这是一个困难的任务,因为它需要长期规划 (long-horizon planning) 和推理 (reasoning) 用户的响应和意图。传统的模块化方法(如结合神经网络模型 (neural models) 和搜索算法 (search algorithms) 或模拟 (simulation))计算开销大,可能导致错误传播 (error propagation),并且不直接优化响应质量。
- 本文提出直接将对话行动规划 (dialogue action planning) 作为混合主导对话 (mixed-initiative conversation) 中响应生成 (response generation) 的一个隐式子任务来优化。
3.2.2. LLM 对齐学习 (Learning for LLM Alignment)
当前 LLM 训练范式通常包含三个阶段:
- 预训练 (Pretraining): 在大规模数据上学习通用语言模式。
- 监督微调 (Supervised Fine-tuning, SFT): 在指令遵循数据上进行微调。
- 人类偏好对齐 (Tuning for alignment with human preferences): 使模型行为与人类偏好保持一致。
本文主要关注第三个阶段。常用的对齐方法包括:
- 人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF): 如 Ouyang et al. (2022)。首先根据人类偏好数据训练一个奖励模型 (reward model),然后使用近端策略优化 (Proximal Policy Optimization, PPO) 等在线算法 (online algorithms) 优化 LLM。RLHF 具有灵活的奖励函数和更广阔的策略探索 (policy exploration) 空间,但 PPO 以难以调优著称。
- 离线偏好学习算法 (Offline Preference Learning Algorithms): 针对 RLHF 的复杂性,DPO (Rafailov et al., 2024)、SLiC (Zhao et al., 2023) 和 IPO (Azar et al., 2024) 等算法被广泛采用。它们绕过了显式奖励建模 (reward modeling),只需一组超参数即可优化,同时能达到相似的经验效果。
- 在线策略 DPO (On-Policy DPO): 鉴于纯离线方法的局限性,一些研究探索了 DPO 的在线变体 (online variants)。例如,Yuan et al. (2024) 提出了迭代 DPO (iterative DPO),Chen et al. (2024) 提出了将真实响应视为"获胜"响应、将从前一迭代策略模型中采样的响应视为"失败"响应的变体。Pang et al. (2024) 将迭代 DPO 应用于优化外部化推理链 (reasoning chains)。
3.3. 技术演进
对话系统从早期的基于规则和知识的系统,发展到以统计模型为基础(如隐马尔可夫模型 (HMM)、条件随机场 (CRF)),再到基于深度学习 (deep learning) 的端到端 (end-to-end) 系统。LLM 的出现极大地推动了这一领域的发展,使得通用对话代理 (conversational agents) 成为可能。
在 LLM 的对齐方面,从最初的监督微调 (SFT),到通过人类反馈 (human feedback) 进行强化学习 (RLHF),再到近期兴起的直接偏好优化 (DPO) 系列算法,对齐技术不断简化和优化,旨在更高效、稳定地使 LLM 的行为符合人类预期。本文的 ACT 算法正是基于 DPO 框架,并针对多轮对话 (multi-turn conversations) 中的行动规划 (action planning) 这一特定挑战进行了创新性扩展。
3.4. 差异化分析
- 聚焦多轮对话行动: 现有许多 DPO 相关工作主要关注单轮响应优化 (single-turn response optimization),例如生成更优的、更符合偏好的答案。本文的 ACT 则将 DPO 扩展到多轮轨迹 (multi-turn trajectories),并特别关注 LLM 在对话中隐式选择和执行对话行动 (conversational actions) 的能力,尤其是澄清问题 (clarification questions)。这是本文的核心创新点之一。
- 行动作为对比基础: 据作者所知,ACT 是首个以对话行动 (conversational actions) 为基础进行对比学习 (contrastive learning) 的方法。通过明确对比"获胜行动"和"失败行动"所导致的响应和轨迹,模型能够更有效地学习何时应该澄清、何时应该回答。
- 准在线学习范式: ACT 结合了离线 DPO 的易用性和在线学习的探索能力。它在训练过程中在线采样 (on-policy sampling) 模型自身的响应,并通过用户模拟器 (User Simulator) 评估这些响应所产生的多轮对话轨迹 (multi-turn dialogue trajectories),从而动态更新偏好对 (preference pairs)。这种准在线机制使其比纯离线方法更能适应动态的对话环境,并学习到更鲁棒的策略。
- 新任务 AmbigSQL: 论文引入了一个新的文本到 SQL 生成 (text-to-SQL generation) 任务 AmbigSQL,专门用于评估和训练 LLM 在复杂数据分析请求中的歧义消除 (disambiguation) 能力,填补了该领域多轮交互任务资源的空白。
4. 方法论
4.1. 方法原理
ACT(Action-Based Contrastive Self-Training,基于行动对比自训练)的核心思想在于,将 LLM 微调为混合主导对话代理 (mixed-initiative conversational agent) 的关键在于使其能够自动生成能够最大化对话成功概率 (conversational success) 的响应。ACT 通过以下几个直觉实现这一目标:
- 对比偏好 (Contrastive Preferences): 对比"获胜"和"失败"的对话响应 (dialogue responses) 之间的实用差异 (pragmatic differences),是直观地教导模型如何选择正确对话行动 (dialogue actions) 的有效方式。
- 多轮优化 (Multi-turn Optimization): 对话能力的改进需要多轮优化 (multi-turn optimization),而不仅仅是单轮的对比。ACT 通过模拟多轮对话轨迹 (conversation trajectories) 来实现这一点。
- 在线策略采样 (On-policy Sampling): DPO 类算法的梯度更新基于获胜和失败响应的对数概率。在线策略采样 (on-policy response sampling) 可以生成高概率的词元序列,从而更好地利用 DPO 的优化机制。

ACT 算法分为两个主要阶段:基于行动的对比数据集构建 (action-based contrast dataset construction) 和对比自训练 (contrastive self-training)。整个过程如 Figure 3 所示。
Figure 3 | ACT 微调阶段概述(示意图包含响应采样、轨迹模拟与评估,以及错误/正确隐式行动检测下的策略更新步骤)。对于 $D_{ACT}$ 中的每个初始对比配对(如 3.2.1 节所述构建),我们从正在微调的模型中采样一个在线策略响应。在评估了采样响应的轨迹后,我们通过替换现有的获胜或失败响应来更新对比配对。模型策略使用公式 1 的目标进行更新。
4.2. 核心方法详解
4.2.1. 问题设置 (Problem Setup)
本文将任务定义为将 LLM 微调为一个混合主导对话代理 (mixed-initiative conversational agent)。该代理应通过一系列对话与用户互动,最终为用户的请求提供正确响应。与用户完全控制交互流的常见代理设置不同,混合主导代理 (mixed-initiative agents) 应通过执行对话行动 (conversational actions) 或策略 (strategies)(如澄清问题 (clarifying questions))来理解如何重定向交互流。
符号定义:
- $\pi_{\theta_t}$:在时间步 $t$ 时,LLM 的策略 (policy),由参数 $\theta$ 参数化。
- $\pi_{ref}$:参考策略模型 (reference policy model),即 $\pi_{\theta_0}$(初始模型)。
- $D$:包含多个对话 (conversations) 的数据集。
- $c$:$D$ 中的一个对话,包含 $T$ 个对话轮次 (dialogue turns)。
- $s_t$:时间步 $t$ 的对话轮次状态 (turn state),包括每个交互方观察到的话语 (utterances) 和行动 (actions)。每个 $s_t$ 都属于一个轨迹,该轨迹在用户在更早的时间步 $t' \leq t$ 提出的问题得到回答时结束。
- $x_t$:在时间步 $t$ 的提示 (prompt),包含任何任务特定信息 (task-specific information)(如 SQL 数据库 schema、表格数据或检索到的段落)以及任何现有的对话上下文 (dialogue context)。
- $y_t$:在时间步 $t$ 的系统端真实响应 (ground truth system-side response)。
- $y_{goal}$:解决 $s_t$ 隐式轨迹的目标响应 (goal response),即在任何可能的澄清轮次后,用户原始问题的答案。在单轮轨迹情况下,$y_{goal} = y_t$。
- $a_t$:$y_t$ 隐式表达的行动 (action)。$a_t$ 存在于特定任务的潜在行动空间 (latent Action Space) $\mathcal{A}$ 中,可由行动标注代理 (Action Annotation Agent) 推断。
- 行动空间 (Action Space) $\mathcal{A}$:在实验中,$\mathcal{A}$ = [CLARIFY, ANSWER](澄清,回答)。
- 辅助模块:
  - $M$:可控生成模型 (controllable generation model),用于偏好数据 (preference data) 的创建。
  - $A$:行动分类器 (Action Classifier),用于微调 (tuning) 和评估 (evaluation) 期间。
  - $U$:用户模拟器 (User Simulator),可以被控制以模拟用户行为,用于微调和评估期间。

Figure A4 展示了如何在 Abg-CoQA 中构建一个对比配对的例子。
Figure A4 | RL 微调中 Abg-CoQA (Guo et al., 2021) 的对比配对示例(图中展示了选中的响应、被拒绝的响应及其对应的行动,体现了处理模糊问题时所需的澄清过程)。所用符号如 3.1 节所述。
用户模拟器 (User Simulators, $U$): $U$ 的实现灵感来自 Deng et al. (2023c) 等工作,他们直接提示 LLM 根据对话上下文 (dialogue context) 和任务目标 (task objectives) 执行目标导向任务 (goal-oriented tasks)。本文首先提示 LLM 总结用户的信息寻求目标 (information-seeking goal)。然后,使用此摘要和当前对话上下文形成另一个提示,以模拟用户响应。这种带有目标摘要的提示方式比直接提供用户模拟器 (user simulator) 真实信息目标更具灵活性。
行动分类器 (Action Classifiers, $A$): 在本文考虑的数据集中,可能的行动是"澄清"或"直接回答"问题。本文直接使用少样本上下文学习 (few-shot in-context learning) 作为行动分类器 $A$。
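为直观说明 $U$ 和 $A$ 的提示式实现思路,下面给出一个最小化的 Python 草图(仅为示意,并非论文的原始实现;`call_llm` 是假设的 LLM 调用接口,提示词内容也只是示意性的):

```python
from typing import List


def call_llm(prompt: str) -> str:
    """假设的 LLM 调用接口:实际使用时替换为任意对话模型的推理调用。"""
    raise NotImplementedError("此处接入具体的 LLM 推理服务")


# 与论文一致的二元行动空间
ACTIONS = ["CLARIFY", "ANSWER"]

FEW_SHOT_EXAMPLES = """\
助手回复: "请问您指的是哪一年的数据?" -> CLARIFY
助手回复: "2019 年共完成 35 次收购。" -> ANSWER
"""


def classify_action(utterance: str) -> str:
    """行动分类器 A:用少样本上下文学习判断助手回复是澄清还是直接回答。"""
    prompt = (
        "请判断下面这条助手回复的对话行动,只输出 CLARIFY 或 ANSWER。\n"
        + FEW_SHOT_EXAMPLES
        + f'\n助手回复: "{utterance}" ->'
    )
    label = call_llm(prompt).strip().upper()
    return "CLARIFY" if "CLARIFY" in label else "ANSWER"


def simulate_user(task_info: str, history: List[str]) -> str:
    """用户模拟器 U:先总结用户的信息寻求目标,再结合目标与上下文生成用户回复。"""
    goal_summary = call_llm(
        "阅读以下对话,概括用户想要获取的信息:\n" + task_info + "\n" + "\n".join(history)
    )
    return call_llm(
        "你扮演对话中的用户。你的信息目标是:" + goal_summary
        + "\n当前对话:\n" + "\n".join(history)
        + "\n请给出用户的下一句回复:"
    )
```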
4.2.2. ACT: 基于行动的对比自训练 (Action-Based Contrastive Self-Training)
ACT 是一种准在线 (quasi-online) 扩展的 DPO 算法,它保持了离线方法的易用性,同时融入了在线学习 (online learning) 中的灵活探索。ACT 包含两个阶段:基于行动的对比数据集构建 (action-based contrast dataset construction) 和对比自训练 (contrastive self-training)。
4.2.2.1. 构建偏好数据 (Construction of Preference Data)
算法 1: 构建对比行动对 (Building Contrastive Action Pairs)
- 输入:
  - 数据集 $D$
  - 条件生成模型 (Conditional generation model) $M$
  - 行动空间 (Action Space) $\mathcal{A}$
  - 行动标注代理 (Action Annotation Agent)
- 步骤:
  1. 初始化空数据集 $D_{ACT}$。
  2. 对于 $D$ 中每个对话轮次 (conversation turn) $(x_t, y_t)$:
     3. 令 $a_w$ 为行动标注代理从 $y_t$ 中推断出的行动。  ▷ 推断上下文行动 (Contextual Action)
     4. 令 $a_l = \mathcal{A} \setminus \{a_w\}$。  ▷ 确定被拒绝行动 (Rejected Action)
     5. 令 $y_w = y_t$。  ▷ 将真实响应设为获胜响应 (winning response)
     6. 从 $M(x_t, a_l)$ 中采样 $y_l$。  ▷ 采样失败响应 (losing response),该响应以被拒绝的行动为条件
     7. 构建增强的元组 $(x_t, a_w, a_l, y_w, y_l, y_{goal})$。
     8. 将该元组添加到 $D_{ACT}$ 中。
解释: 这个算法构建了一个包含对比"获胜-失败"行动对 (action pairs) 的偏好数据集 (preference dataset) $D_{ACT}$。对于数据集 $D$ 中的每个对话轮次 $(x_t, y_t)$,它会:

- 推断真实行动: 使用行动标注代理 (Action Annotation Agent)(例如一个 LLM 或人类标注者)来推断真实响应 $y_t$ 所对应的隐式行动 $a_w$。
- 确定被拒绝行动: 从行动空间 (Action Space) $\mathcal{A}$ 中选择与 $a_w$ 不同的行动 $a_l$ 作为"被拒绝行动"。
- 生成失败响应: 使用高能力 LLM (high-capacity LLM) 作为条件生成模型 (conditional generation model) $M$,根据原始提示 $x_t$ 和被拒绝行动 $a_l$ 生成一个"失败响应" $y_l$。这样,$y_l$ 代表了如果智能体采取了错误的行动会产生的响应。
- 构建对比元组: 最终形成一个包含原始信息、真实行动、被拒绝行动、获胜响应和失败响应的增强元组 $(x_t, a_w, a_l, y_w, y_l, y_{goal})$。
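下面用一个简化的 Python 草图复述算法 1 的流程(仅为示意;`annotate_action` 对应行动标注代理,`generate_with_action` 对应条件生成模型 $M$,两者均为假设的接口,数据字段命名也是示意性的):

```python
from dataclasses import dataclass
from typing import Callable, List

# 行动空间 A:与论文一致的二元行动
ACTIONS = ["CLARIFY", "ANSWER"]


@dataclass
class ContrastPair:
    prompt: str    # x_t:任务信息 + 对话上下文
    action_w: str  # a_w:真实响应隐含的行动
    action_l: str  # a_l:被拒绝的行动
    y_w: str       # 获胜响应(真实响应)
    y_l: str       # 失败响应(按被拒绝行动合成)
    y_goal: str    # 轨迹目标:用户原始问题的最终答案


def build_contrast_pairs(
    dataset: List[dict],
    annotate_action: Callable[[str], str],             # 行动标注代理(LLM 或人工标签)
    generate_with_action: Callable[[str, str], str],   # 条件生成模型 M
) -> List[ContrastPair]:
    """按算法 1 为每个对话轮次构建"获胜-失败"对比行动对。"""
    pairs = []
    for turn in dataset:  # 假设每个元素含 prompt、真实响应 response 与目标响应 goal
        a_w = annotate_action(turn["response"])          # 推断上下文行动
        a_l = next(a for a in ACTIONS if a != a_w)       # 行动空间中剩下的那个行动
        y_l = generate_with_action(turn["prompt"], a_l)  # 用被拒绝的行动合成失败响应
        pairs.append(ContrastPair(turn["prompt"], a_w, a_l,
                                  turn["response"], y_l, turn["goal"]))
    return pairs
```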
野外无标签对话的行动优化 (Action optimization for unlabeled conversations "in-the-wild"): 在某些情况下,可能无法获得黄金标准 (gold-standard) 的歧义标注。在这种设置下,可以使用一个分类器 (classifier) 作为行动标注代理 (Action Annotation Agent) 来获取伪标签监督 (pseudo-label supervision),而不是依赖人工标注。
4.2.2.2. 使用在线策略对话轨迹模拟进行自训练 (Self-Training Using On-policy Conversation Trajectory Simulation)
算法 2: ACT: 基于行动的对比自训练 (Action-Based Contrastive Self-Training)
- 输入:
  - 初始策略模型 (Initial Policy Model) $\pi_{\theta_0}$
  - 行动对比数据集 (Action Contrast Dataset) $D_{ACT}$
  - 批次数量 (Number of Batches) $B$
  - 行动分类器 (Action Classifier) $A$
  - 用户模拟器 (User Simulator) $U$
  - 任务启发式 (Task Heuristic) $H$
  - 启发式容忍度 (Heuristic Tolerance) $\tau$
- 步骤:
  1. 对于从 $D_{ACT}$ 中采样的批次 $b$(其中 $b = 1, \ldots, B$)中的每个对话轮次 (conversation turn):
     2. 从当前模型策略中采样一个响应:$\hat{y} \sim \pi_{\theta}(x_t)$。
     3. 如果 $A(\hat{y}) \neq a_w$:  ▷ 隐式实用行动 (pragmatic action) 与真实情况不匹配
        4. 设置 $y_l = \hat{y}$。  ▷ 将当前采样响应作为失败响应 (losing response)
     5. 否则($A(\hat{y}) = a_w$):
        6. 初始化轨迹 (Trajectory),将 $\hat{y}$ 添加到轨迹中。
        7. 当 $A(\hat{y}) = \text{CLARIFY}$ 时循环:  ▷ 模拟用户澄清
           8. 采样澄清答案 (Clarification Answer) $\hat{u} \sim U(x_t, \text{Trajectory})$。
           9. 将 $\hat{u}$ 添加到轨迹中。
           10. 采样 $\hat{y} \sim \pi_{\theta}(x_t \oplus \text{Trajectory})$。  ▷ 模拟下一个策略响应
           11. 将 $\hat{y}$ 添加到轨迹中。
        12. 如果 $H(\text{Trajectory}, y_{goal}) \geq \tau$:
            13. 令 $y_w = \text{Trajectory}$。  ▷ 奖励可接受的轨迹结果,将其设为获胜响应 (winning response)
        14. 否则:
            15. 令 $y_l = \text{Trajectory}$。  ▷ 惩罚不好的轨迹结果,将其设为失败响应 (losing response)
  16. 重复上述过程直到收敛(使用 DPO 目标函数更新 $\pi_{\theta}$)。
- 输出: 微调后的策略模型 $\pi_{\theta}$。
解释: 这个算法是 ACT 的核心,它是一个准在线 (quasi-online) 的训练过程。

- 在线策略采样: 在每个训练批次中,模型 $\pi_{\theta}$ 会根据当前提示 (prompt) $x_t$ 采样一个响应。
- 行动匹配检查: 使用行动分类器 (Action Classifier) $A$ 来判断采样响应的隐式行动 (implicit action) 是否与真实行动 $a_w$ 匹配。
  - 行动不匹配: 如果不匹配,则说明模型采取了错误的实用行动 (pragmatic action)(例如,应该澄清却直接回答),此时将采样响应直接作为失败响应 (losing response),以惩罚这种错误。
  - 行动匹配: 如果匹配,则会进入多轮轨迹模拟 (multi-turn trajectory simulation) 阶段。
- 轨迹模拟:
  - 使用用户模拟器 (User Simulator) $U$ 模拟用户对模型响应的后续回复(特别是当模型提出澄清问题 (clarification question) 时)。
  - 模型 $\pi_{\theta}$ 会继续生成后续响应,直到其行动变为 ANSWER(即尝试回答用户原始问题)。
  - 整个交互序列构成了从模型自身策略生成的对话轨迹 (conversation trajectory)。
- 轨迹结果评估: 使用任务启发式 (Task Heuristic) 评估模拟轨迹的最终结果(例如,答案的语义相似度、SQL 执行结果的正确性)与用户原始目标 $y_{goal}$ 的匹配程度。
  - 轨迹成功: 如果结果满足任务启发式,则整个模拟轨迹被视为获胜响应 (winning response)。
  - 轨迹失败: 否则,整个模拟轨迹被视为失败响应 (losing response)。
- DPO 更新: 根据动态生成的获胜/失败响应 (winning/losing response) 对,使用 DPO 目标函数更新模型参数 $\theta$。Figure A5 展示了一个轨迹级内容评估的示例。
Figure A5 | 使用 Figure 1 示例场景进行轨迹级内容评估(示意图展示了用户提问、模型的澄清问题及最终答案,并通过任务指标为轨迹打分)。轨迹级评估旨在衡量候选 LLM 与"用户"交互以达到目标信息目标的能力。给定实例的"交互式"评估持续进行,直到候选 LLM 尝试通过提供直接答案来解决用户的请求。候选轨迹的最终结果使用下游任务指标进行评分;在此示例中,根据 PACIFIC 的任务指标,使用 DROP F1。
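下面给出算法 2 中单个训练样本处理流程的一个简化 Python 草图(仅为示意,沿用上文 `ContrastPair` 草图中的字段命名;`max_turns` 是为防止模拟不终止而额外加入的假设性上限,论文本身未作此规定;此处也把"轨迹结果"简化为整段轨迹文本后再交给任务启发式打分):

```python
from typing import Callable, List


def act_training_step(
    pair,                                             # 一条 ContrastPair(见算法 1 草图)
    policy_sample: Callable[[str], str],              # 从当前策略 π_θ 采样响应
    classify_action: Callable[[str], str],            # 行动分类器 A
    simulate_user: Callable[[str, List[str]], str],   # 用户模拟器 U
    heuristic: Callable[[str, str], float],           # 任务启发式 H(如 DROP F1)
    tolerance: float,                                  # 启发式容忍度 τ
    max_turns: int = 4,                                # 假设性的模拟轮数上限
):
    """返回本步用于 DPO 更新的 (prompt, y_w, y_l) 三元组。"""
    sampled = policy_sample(pair.prompt)
    y_w, y_l = pair.y_w, pair.y_l

    if classify_action(sampled) != pair.action_w:
        # 隐式行动与真实行动不匹配:直接把采样响应作为失败响应
        y_l = sampled
    else:
        # 行动匹配:继续模拟多轮轨迹,直到模型尝试回答(行动变为 ANSWER)
        trajectory = [sampled]
        turns = 0
        while classify_action(trajectory[-1]) == "CLARIFY" and turns < max_turns:
            trajectory.append(simulate_user(pair.prompt, trajectory))  # 模拟用户的澄清回答
            trajectory.append(policy_sample(pair.prompt + "\n" + "\n".join(trajectory)))
            turns += 1
        outcome = "\n".join(trajectory)
        if heuristic(outcome, pair.y_goal) >= tolerance:
            y_w = outcome   # 轨迹结果可接受:作为获胜响应
        else:
            y_l = outcome   # 轨迹结果不佳:作为失败响应

    return pair.prompt, y_w, y_l
```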
4.2.2.3. 对比强化学习对齐微调 (Contrastive RL Tuning for Alignment)
在通过模拟构建了最新的获胜响应 (winning response) 和失败响应 (losing response) 对之后,本文使用 DPO (Direct Preference Optimization) 训练目标来更新策略模型 (Rafailov et al., 2024)。
DPO 损失函数 (DPO Training Objective)(为简化表示,此处忽略了轮次迭代器 $t$):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]$$

其中:
- $x$ 是一个提示 (prompt),由任务信息 (task info) 和对话历史 (conversation history) 拼接而成,形式为 $x = \text{task info} \oplus [u_1, s_1, \ldots, u_t]$。
- $u_i$ 代表在第 $i$ 轮观察到的用户端话语 (user-side utterance)。
- $s_i$ 代表在第 $i$ 轮观察到的系统端话语 (system-side utterance)。
- $y_w$ 和 $y_l$ 分别是在 4.2.2.2 节中确定的"获胜"和"失败"响应或轨迹 (trajectories)。
- $\pi_{ref}$ 是初始参考策略模型 (initial reference policy model)。
- $\beta$ 是一个超参数 (hyperparameter),用于正则化 $\pi_{\theta}$ 和 $\pi_{ref}$ 之间的比率。
- $\sigma$ 是 sigmoid 函数,$\sigma(z) = \frac{1}{1 + e^{-z}}$。
- $\mathcal{D}$ 是偏好数据集 (preference dataset),包含了 $(x, y_w, y_l)$ 样本。
- $\mathbb{E}$ 表示期望。
DPO 损失函数的梯度 (Gradient of the DPO Loss Function):

$$\nabla_{\theta} \mathcal{L}_{\mathrm{DPO}}(\pi_{\theta}; \pi_{ref}) = -\beta \, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \sigma \left( \hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w) \right) \left( \nabla_{\theta} \log \pi_{\theta}(y_w \mid x) - \nabla_{\theta} \log \pi_{\theta}(y_l \mid x) \right) \right]$$

这里,隐式奖励模型 (implicitly defined reward model) 定义为 $\hat{r}_{\theta}(x, y) = \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{ref}(y \mid x)}$,正如 Rafailov et al. (2024) 中所证明的。
直觉 (Intuition):
DPO 目标函数的直觉是,损失函数的梯度会增加获胜响应 $y_w$ 的似然 (likelihood),并减少失败响应 $y_l$ 的似然 (likelihood)。每个样本的权重由隐式奖励模型对配对响应排名错误的程度决定。如果模型错误地认为失败响应比获胜响应更好(即 $\hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w)$ 为正且较大),那么梯度会更大,从而更强烈地调整模型,使其偏好 $y_w$ 而非 $y_l$。
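下面是该 DPO 损失的一个最小化 PyTorch 草图(仅为示意,假设已分别计算好策略模型与参考模型对获胜/失败响应的序列对数概率):

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_logp_w: torch.Tensor,  # log π_θ(y_w | x),形状 [batch]
    policy_logp_l: torch.Tensor,  # log π_θ(y_l | x)
    ref_logp_w: torch.Tensor,     # log π_ref(y_w | x)
    ref_logp_l: torch.Tensor,     # log π_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """标准 DPO 损失:最大化获胜响应相对失败响应的隐式奖励差。"""
    # 隐式奖励 r_hat = beta * log(π_θ / π_ref)
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # -log σ(r_w - r_l),对批次取平均
    return -F.logsigmoid(reward_w - reward_l).mean()
```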
5. 实验设置
ACT 是一种样本高效 (sample-efficient) 的方法,用于使 LLM 适应对话行动策略 (conversational action policy)。本文主要关注提升 LLM 隐式选择 (implicit selection) 代理端澄清问题 (clarification question) 的能力。因此,研究人员在三个复杂的对话信息寻求任务 (conversational information-seeking tasks) 中评估 ACT 作为一种微调方法。实验使用的基础模型是 `Zephyr 7B-β`,它是 `Mistral 7B` (`Jiang et al., 2023`) 的一个版本,经过 `UltraChat` 的指令微调,并与 `UltraFeedback` (`Cui et al., 2023; Ding et al., 2023; Tunstall et al., 2023`) 的人类偏好对齐。
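作为实验环境的示意,下面给出加载该基础模型并进行一次生成的最小 Python 片段(假设使用 Hugging Face `transformers` 加载公开的 `HuggingFaceH4/zephyr-7b-beta` 权重;这只是环境示意而非论文的训练代码,示例问题也是任意选取的):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceH4/zephyr-7b-beta"  # Zephyr 7B-β 的公开权重
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Show details about singers ordered by age."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
# 只解码新生成的部分
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```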
## 5.1. 数据集
本文调查了三个<strong>混合主导对话任务 (mixed-initiative conversation tasks)</strong>,用户在其中与助手互动以检索信息。在每个任务设置中,用户提出一个可能模糊的查询。助手的任务是提供一个响应,该响应可以是<strong>澄清问题 (clarifying question)</strong>,也可以是直接回答用户查询的尝试。对于每个任务,初始的<strong>被拒绝响应 (rejected responses)</strong> 是通过提示 `Gemini Ultra` 作为<strong>条件生成模型 (conditional generation model)</strong> $M$ 来合成的。
`ACT` 在各种领域的多样化数据集上进行评估:<strong>表格型对话问答 (tabular conversational QA)</strong>、<strong>机器阅读理解的对话问答 (conversational QA for machine reading comprehension)</strong> 和<strong>对话式文本到 `SQL` 生成 (conversational text-to-SQL generation)</strong>。
### 5.1.1. `PACIFIC`: 表格数据的对话问答
`PACIFIC` 是一个<strong>主动式对话问答 (proactive conversational question answering)</strong> 任务,其基础是<strong>表格 (tabular)</strong> 和<strong>文本 (textual)</strong> 金融数据的混合 (`Deng et al., 2022`)。回答可能需要从给定的单个或多个文本片段 (span) 中抽取正确词语,或给出正确的算术表达式。`PACIFIC` 的官方评估使用<strong>以数字为中心的词元重叠度量 (numeracy-focused token overlap metric)</strong>,称为 `DROP F1`。
Figure 1 是 `PACIFIC` 任务的一个简化示例,展示了模糊性如何导致代理需要提出澄清问题。
表格 A17、A18 和 A19 分别展示了在 `PACIFIC` 任务中使用<strong>标准提示 (Standard Prompting)</strong>、<strong>思维链提示 (Chain-of-Thought Prompting)</strong> 和<strong>主动混合主导提示 (Proactive Mixed-Initiative Prompting)</strong> 的上下文示例。
### 5.1.2. `Abg-CoQA`: 机器阅读理解的对话问答
`Abg-CoQA` 是一个用于<strong>机器阅读理解 (machine reading comprehension)</strong> 中<strong>歧义消除 (disambiguation)</strong> 的<strong>对话问答数据集 (conversational question answering dataset)</strong> (`Guo et al., 2021`)。由于没有算术表达式,本文使用基于 `SentenceBERT` (`Reimers & Gurevych, 2019`) 的<strong>嵌入语义距离 (embedding-based semantic distance)</strong> 作为评估指标,这已被用于更灵活地衡量问答性能 (`Risch et al., 2021`)。
Figure A4 展示了 `Abg-CoQA` 中一个对比配对的例子。
### 5.1.3. `AmbigSQL`: 模糊对话式文本到 `SQL` 生成
`AmbigSQL` 是本文提出的一个新任务,用于<strong>`SQL` 基础的对话式歧义消除 (`SQL`-grounded conversational disambiguation)。</strong> 研究人员系统地扰动了 `Spider`(一个流行的单轮<strong>文本到 `SQL` 基准 (text-to-SQL benchmark)</strong>,`Yu et al., 2018`)中的无歧义查询,得到了可以轻松纳入<strong>对比 `RL` 微调 (contrastive RL tuning)</strong> 的配对训练示例。每个<strong>轨迹 (trajectory)</strong> 都通过最终提出的 `SQL` 查询是否与真实查询的<strong>执行结果 (execution result)</strong> 匹配来评估。
**`AmbigSQL` 的动机:** 歧义消除可以提高任务性能。研究人员提示 `LLM` 引入三种类型的模糊信息请求:
1. <strong>请求信息模糊 (requested information is ambiguous)</strong>(例如:“Show details about singers ordered by age from the oldest to the youngest”)。
2. <strong>请求群体模糊 (requested population is ambiguous)</strong>(例如:“Which ones who live in the state of Indiana?”;参见 Table A12)。
3. <strong>请求结果展示模糊 (requested presentation of results is ambiguous)</strong>(例如:“Show name, country, age for all singers ordered by age”;参见 Table A13)。
研究发现,在<strong>欠详尽请求 (underspecified request)</strong> 下构建 `SQL` 查询,有无澄清的性能差距高达 45.8% (参见 Table A14),这表明澄清问题的必要性。
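为说明这种系统性扰动的构建方式,下面给出一个仿照 Table A12/A13 提示格式的 Python 草图(仅为示意;`call_llm` 为假设的 LLM 调用接口,三种模糊类型的描述文字也是示意性概括,并非论文原始提示词):

```python
from typing import Callable, Dict

# 三种模糊类型的示意性描述(对应论文中列举的三类扰动)
AMBIGUITY_TYPES = {
    "information": "underspecifying the requested information",
    "population": "underspecifying the requested population",
    "presentation": "underspecifying the requested presentation of results",
}


def make_ambiguous_example(
    schema: str, clear_request: str, gold_sql: str,
    ambiguity_type: str,
    call_llm: Callable[[str], str],   # 假设的 LLM 调用接口
) -> Dict[str, str]:
    """为一条 Spider 样本合成模糊请求与对应的澄清问题。"""
    prompt = (
        f"[Database Schema]\n{schema}\n"
        f"The target SQL query is the following:\n{gold_sql}\n"
        f"Here is a clear request that would correspond to this SQL query:\n\"{clear_request}\"\n"
        f"Here is the same request converted into an ambiguous format by "
        f"{AMBIGUITY_TYPES[ambiguity_type]}:\n"
    )
    ambiguous_request = call_llm(prompt)
    clarifying_question = call_llm(
        prompt + f"\"{ambiguous_request}\"\n"
        "Here is an appropriate clarifying question to recover the clear request:\n"
    )
    return {
        "ambiguous_request": ambiguous_request,
        "clarifying_question": clarifying_question,
        "gold_sql": gold_sql,
    }
```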
Table A10 | `AmbigSQL` 概览,一个从 `Spider` 合成的模糊<strong>文本到 `SQL` 数据集 (Text-to-SQL dataset)</strong>。
<div class="table-wrapper"><table>
<thead>
<tr>
<th></th>
<th>Train</th>
<th>Dev</th>
<th>Test</th>
</tr>
</thead>
<tbody>
<tr>
<td>Num. Unambiguous Requests</td>
<td>7,000</td>
<td>1,034</td>
<td>1,034</td>
</tr>
<tr>
<td>Num. Ambiguous Requests</td>
<td>7,000</td>
<td>1,034</td>
<td>1,034</td>
</tr>
<tr>
<td>Num. Unique Schemas</td>
<td>1,056</td>
<td>145</td>
<td>145</td>
</tr>
<tr>
<td>Types of Ambiguity</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
</tbody>
</table></div>
Table A12 | 上下文示例,作为提示的一部分,用于创建目标群体模糊的信息请求。黑色文本的格式表示如何使用真实请求来形成目标示例的提示。蓝色文本表示将从 `LLM` 合成的内容。论文中省略了数据库 `schema`。
<div class="table-wrapper"><table>
<thead>
<tr>
<th>[Database Schema Omitted] The target SQL query is the following:</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>SELECT professional_id , last_name , cell_number FROM Professionals WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name , T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON T1.professional_id = T2.professional_id GROUP BY T1.professional_id HAVING count(*) > 2</code></td>
</tr>
<tr>
<td>Here is a clear request that would correspond to this SQL query:</td>
</tr>
<tr>
<td>"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List his or her id, last name and cell phone."</td>
</tr>
<tr>
<td>Here is the same request converted into an ambiguous format by underspecifying the target columns:</td>
</tr>
<tr>
<td>"Which ones who live in the state of Indiana or have done treatment on more than 2 treatments?"</td>
</tr>
<tr>
<td>Here is an appropriate clarifying question to recover the clear request from the ambiguous request:</td>
</tr>
<tr>
<td>"Are you asking about the Professionals?"</td>
</tr>
</tbody>
</table></div>
Table A13 | 上下文示例,作为提示的一部分,用于创建目标列模糊的信息请求。黑色文本的格式表示如何使用真实请求来形成目标示例的提示。蓝色文本表示将从 `LLM` 合成的内容。论文中省略了数据库 `schema`。
<div class="table-wrapper"><table>
<thead>
<tr>
<th>[Database Schema Omitted] The target SQL query is the following:</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>SELECT professional_id , last_name , cell_number FROM Professionals WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name , T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON T1.professional_id = T2.professional_id GROUP BY T1.professional_id HAVING count(*) > 2</code></td>
</tr>
<tr>
<td>Here is a clear request that would correspond to this SQL query:</td>
</tr>
<tr>
<td>"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List his or her id, last name and cell phone."</td>
</tr>
<tr>
<td>Here is the same request converted into an ambiguous format by underspecifying the target columns:</td>
</tr>
<tr>
<td>"Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?"</td>
</tr>
<tr>
<td>Here is an appropriate clarifying question to recover the clear request from the ambiguous request:</td>
</tr>
<tr>
<td>"Which information of the professionals do you want to know?"</td>
</tr>
</tbody>
</table></div>
## 5.2. 评估指标
本文从两个维度评估 `ACT` 在<strong>对话中推理歧义 (reason about ambiguity in conversation)</strong> 以更好地实现<strong>对话目标 (conversational goals)</strong> 的能力:
### 5.2.1. 代理任务性能 (Agent task performance)
评估 `ACT` 是否能改善<strong>多轮任务完成 (multi-turn task completion)</strong> 能力。
* <strong>轮次级评估 (Turn-level evaluation):</strong> 将模型响应与用户查询的真实话语进行比较,使用 4.1 节中定义的<strong>任务特定启发式 (task-specific heuristics)</strong>。
* <strong>多轮评估方案 (Multi-turn evaluation scheme) / 轨迹级评估 (Trajectory-level evaluation):</strong>
* 当 `LLM` 采样响应是<strong>澄清问题 (clarifying question)</strong> 时,模拟用户响应,并再次从 `LLM` 采样另一个响应,直到其尝试回答原始查询。
* 将此结果与用户的<strong>真实信息寻求目标 (ground truth information-seeking goal)</strong> 进行评估。
* 使用 $A$ (行动分类器) 和 $U$ (用户模拟器) 进行模拟,并使用 4.1 节中定义的启发式。Figure A5 展示了一个示例。
* <strong>澄清后性能 (Post-Clarification Performance):</strong> 在 `PACIFIC` 和 `AmbigSQL` 中,还计算模型在之前提出<strong>澄清问题 (clarifying questions)</strong> 的模拟响应上的任务性能,以更精细地衡量模型推理自身<strong>澄清问题 (clarification questions)</strong> 的能力。
**具体内容层面评估指标:**
* **`Turn-level DROP F1`:** `PACIFIC` 任务的<strong>平均即时响应 <code>DROP F1</code></strong>。
* **概念定义:** `DROP F1` (`Deng et al., 2022; Dua et al., 2019`) 是一个用于评估问答系统抽取式答案准确性的指标,衡量模型生成的答案与真实答案之间的词元(token)重叠度。它对数字和实体等关键信息尤其敏感。
* **数学公式:**
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{Precision} = \frac{\text{正确预测的词元数量}}{\text{模型生成的所有词元数量}}$$

$$\mathrm{Recall} = \frac{\text{正确预测的词元数量}}{\text{真实答案中所有词元数量}}$$
* **符号解释:**
* $\mathrm{F1}$:`F1` 分数,精确率和召回率的调和平均值。
* $\mathrm{Precision}$:精确率,模型预测正确部分的比例。
* $\mathrm{Recall}$:召回率,真实答案中被模型正确预测的比例。
* **`Trajectory-level DROP F1`:** `PACIFIC` 任务的<strong>平均轨迹结果 <code>DROP F1</code></strong>。
* **`Post-Clarification DROP F1`:** `PACIFIC` 任务的<strong>澄清后 <code>DROP F1</code></strong>,即仅针对包含代理澄清轮次的轨迹的 `Trajectory-level DROP F1`。
* **`Turn Similarity`:** `Abg-CoQA` 任务的<strong>即时响应嵌入相似度 (Immediate response embedding similarity)</strong>。
* **概念定义:** `Embedding-based Semantic Distance` (`Risch et al., 2021`) 使用句子的嵌入向量来衡量模型生成答案与真实答案之间的语义相似度。它比简单的词元重叠度量更能捕捉文本的意义,允许生成更多样化但语义等效的答案。
* **数学公式:** 通常使用<strong>余弦相似度 (Cosine Similarity)</strong>,对于两个嵌入向量 $A$ 和 $B$:
$$\mathrm{Similarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$
* **符号解释:**
* `A, B`:两个待比较句子的嵌入向量。
* $\cdot$:向量的点积。
* $||\cdot||$:向量的欧几里得范数(L2 范数)。
* **`Trajectory Similarity`:** `Abg-CoQA` 任务的<strong>轨迹结果嵌入相似度 (Trajectory outcome embedding similarity)</strong>。
* **`Trajectory-level Execution Match`:** `AmbigSQL` 任务的<strong>轨迹结果执行匹配百分比 (Percentage of trajectory outcomes with correct execution results)</strong>。
* **概念定义:** `Execution Match` 是衡量<strong>文本到 `SQL` (Text-to-SQL)</strong> 任务中模型生成 `SQL` 查询准确性的客观指标。如果模型生成的 `SQL` 查询在数据库中执行后,其结果与<strong>真实 `SQL` (ground truth SQL)</strong> 查询的执行结果完全一致,则认为匹配成功。
* **`Post-Clarification Execution Match`:** `AmbigSQL` 任务的<strong>澄清后执行匹配百分比 (Percentage of trajectory outcomes with correct execution results out of those that which contain clarification turns)</strong>。
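下面用一个简化的 Python 草图示意上述几个内容层面指标的计算方式(仅为示意:`token_f1` 是简化的词元重叠 F1,官方 `DROP F1` 还包含对数字等的额外归一化处理;嵌入向量假设已由 `SentenceBERT` 等模型事先计算好;`execution_match` 假设两条查询的执行结果以行列表的形式给出):

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """简化的词元重叠 F1(按空白切分词元)。"""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a, b) -> float:
    """两个句子嵌入向量(等长数值列表)的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def execution_match(pred_rows, gold_rows) -> bool:
    """文本到 SQL 的执行匹配:两条查询的执行结果完全一致才算正确。"""
    return sorted(map(tuple, pred_rows)) == sorted(map(tuple, gold_rows))
```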
### 5.2.2. 隐式模糊性识别 (Implicit ambiguity recognition)
为了进一步理解代理的<strong>多轮任务完成能力 (multi-turn task completion ability)</strong>,研究人员考虑了<strong>“对话行为准确率 (dialogue act accuracy)”</strong> (`Chen et al., 2023b`)。假设可以访问<strong>真实模糊性标签 (ground-truth ambiguity label)</strong>,如果一个请求是真正模糊的,模型应该生成一个<strong>澄清问题 (clarification question)</strong>;否则,它应该尝试提供请求的信息。由于 `PACIFIC` 和 `Abg-CoQA` 的类别高度不平衡,主要考虑 `Macro F1` 作为指标。
**具体行动层面评估指标:**
* **`Accuracy`:** <strong>正确隐式行动 (correct implicit actions)</strong> 的百分比。
* **概念定义:** 衡量模型在给定对话上下文下,其隐式选择的对话行动(如“澄清”或“回答”)与真实标注行动一致的比例。
* **数学公式:**
$$\mathrm{Accuracy} = \frac{\text{正确预测的行动数量}}{\text{总行动数量}}$$

* **`Macro F1`:** 各行动类别 `F1` 分数的算术平均值,用于缓解类别不平衡的影响。
    * **数学公式:**

$$\mathrm{Macro\ F1} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{F1}_c$$

其中,$\mathrm{F1}_c$ 是类别 $c$ 的 `F1` 分数,$N$ 是类别的总数。
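行动层面指标的计算可以用如下 Python 草图示意(仅为示意,行动标签取自前文的二元行动空间):

```python
from typing import List


def action_accuracy(pred: List[str], gold: List[str]) -> float:
    """隐式行动准确率:预测行动与真实行动一致的比例。"""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def macro_f1(pred: List[str], gold: List[str], labels=("CLARIFY", "ANSWER")) -> float:
    """Macro F1:分别计算每个行动类别的 F1 再取算术平均,缓解类别不平衡的影响。"""
    f1s = []
    for c in labels:
        tp = sum(p == c and g == c for p, g in zip(pred, gold))
        fp = sum(p == c and g != c for p, g in zip(pred, gold))
        fn = sum(p != c and g == c for p, g in zip(pred, gold))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```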
## 5.3. 对比基线
### 5.3.1. 提示基线 (Prompting baselines)
本文将 `ACT` 微调的小模型与多种<strong>前沿 `LLM` (frontier LLMs)</strong> 的<strong>基于提示的方法 (prompt-based approaches)</strong> 进行比较:
* `Gemini 1.5 Pro`
* `Gemini 1.5 Flash`
* `Claude 3.5 Sonnet`
* `Claude 3.0 Haiku`
所有提示基线使用 10 个对话作为<strong>上下文示例 (in-context examples)</strong>,采用三种不同的提示框架:
1. <strong>`Standard` (标准提示):</strong> 使用与微调相同的指令格式。Table A17 是 `PACIFIC` 的一个示例。
2. <strong>`Chain-of-Thought` (思维链):</strong> 结合 Wei et al. (2022) 提出的<strong>思维链推理 (chain-of-thought reasoning)</strong>。Table A18 是 `PACIFIC` 的一个示例。
3. <strong>`Proactive MIPrompt` (主动混合主导提示):</strong> `Deng et al. (2023c)` 中的提示基线,结合了 `Chen et al. (2023b)` 的<strong>混合主导提示方法 (mixed-initiative prompting approach)</strong> 和 `Deng et al. (2023b)` 的<strong>主动提示 (Proactive Prompting)</strong>。Table A19 是 `PACIFIC` 的一个示例。
### 5.3.2. 微调基线 (Tuning baselines)
* <strong>`Supervised Fine-tuning (SFT)`:</strong> 使用每个数据集训练集的<strong>真实响应 (ground truth responses)</strong> 进行微调。
* <strong>`Iterative Reasoning Preference Optimization (IRPO)`:</strong> 一种最近提出的<strong>在线策略 `DPO` (on-policy DPO)</strong> 变体,在算术等推理任务上受到关注。本文在 `PACIFIC` 和 `AmbigSQL` 这两个<strong>定量推理任务 (quantitative reasoning tasks)</strong> 上评估 `IRPO`。
* <strong>`DPO-Dist` (DPO Distillation):</strong> 一种流行的<strong>离线 `DPO` (off-policy DPO)</strong> 方法,其中<strong>获胜响应 (winning responses)</strong> $Y_w$ 来自能力更强的模型,而<strong>失败响应 (losing responses)</strong> $Y_l$ 来自能力较弱的模型 (`Mitra et al., 2023; Mukherjee et al., 2023; Xu et al., 2024a`)。相关结果在附录 B 中提供。
# 6. 实验结果与分析
为了模拟数据有限的真实世界场景,本文在不同数量的对话样本设置下,对 `ACT` 作为一种微调方法进行了评估。基础模型为 `Zephyr 7B-β`。
6.1. 核心结果分析
6.1.1. 表格型问答 (PACIFIC 任务)
以下是原文 Table 1 的结果:
| Adaptation Setting | | | Action-level | Content-level | | |
| Base Model | Approach | Conversations | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| Gemini Pro | Standard ICL | 10 | 81.4 | 59.7 | 58.7 | 49.7 |
| Claude Sonnet | Standard ICL | 10 | 71.9 | 43.7 | 42.0 | 28.5 |
| Gemini Pro | SFT | 50 | 71.2 | 51.8 | 45.7 | 9.9 |
| Gemini Pro | SFT | 100 | 75.2 | 64.3 | 54.6 | 8.5 |
| Gemini Pro | SFT | 250 | 88.0 | 67.4 | 59.3 | 10.2 |
| Zephyr 7B-β | SFT | 50 | 69.0 | 57.8 | 61.3 | 43.5 |
| Zephyr 7B-β | IRPO | 50 | 67.7 | 59.1 | 56.7 | 34.4 |
| Zephyr 7B-β | ACT (ours) | 50 | 82.2 | 62.8 | 61.9 | 57.2 |
| Zephyr 7B-β | SFT | 100 | 82.3 | 58.6 | 60.3 | 49.9 |
| Zephyr 7B-β | IRPO | 100 | 84.5 | 60.4 | 55.2 | 38.2 |
| Zephyr 7B-β | ACT (ours) | 100 | 86.0 | 65.0 | 62.0 | 57.4 |
| Zephyr 7B-β | SFT | 250 | 86.9 | 65.1 | 63.3 | 56.7 |
| Zephyr 7B-β | IRPO | 250 | 85.4 | 64.9 | 58.4 | 40.3 |
| Zephyr 7B-β | ACT (ours) | 250 | 89.6 | 68.1 | 65.7 | 62.0 |
- ACT 性能领先: Table 1 显示,在所有三种数据效率设置 (data-efficient settings) 下,ACT 在所有指标上均优于 SFT 和 IRPO。值得注意的是,IRPO 具有额外的测试时计算 (test-time computation) 优势 (Snell et al., 2024; Pang et al., 2024),但 ACT 仍能超越。
- 模糊性识别能力显著提升: 在仅有 50 个对话作为微调数据的情况下,ACT 在衡量模型隐式识别模糊性 (implicitly recognize ambiguity) 的能力方面,相比 SFT 实现了高达 19.1% 的相对提升(Macro F1 从 69.0 提高到 82.2)。
- 数据效率: ACT 展现出比使用 Gemini Pro 的基于适配器 (adapter-based) 的 SFT 更高的数据效率。在多轮任务性能方面(轨迹级 DROP F1),相对提升高达 35.7%(从 45.6 提高到 61.9)。
- 媲美或超越前沿 LLM: 在数据有限的设置下,通过 ACT 微调的模型在推理时即使没有上下文示例 (in-context examples),也能达到或超越使用上下文学习 (in-context learning, ICL) 的前沿 LLM 性能。这强调了在线策略学习 (on-policy learning) 和多轮轨迹模拟 (multi-turn trajectory simulation) 对改进多轮任务完成的关键作用。
6.1.2. 机器阅读理解 (Abg-CoQA 任务)
以下是原文 Table 2 的结果:
| Adaptation Setting | | | Action-level | Content-level | |
| Base Model | Approach | Conversations | Macro F1 ↑ | Turn Similarity ↑ | Traj. Similarity ↑ |
| Gemini Pro | Standard ICL | 10 | 55.5 | 67.0 | 72.2 |
| Claude Sonnet | Standard ICL | 10 | 66.0 | 50.1 | 54.3 |
| Zephyr 7B-β | SFT | 50 | 44.6 | 53.3 | 64.2 |
| Zephyr 7B-β | ACT (ours) | 50 | 52.3 | 66.2 | 68.8 |
| Zephyr 7B-β | SFT | 100 | 52.6 | 63.1 | 69.4 |
| Zephyr 7B-β | ACT (ours) | 100 | 51.1 | 69.5 | 71.4 |
| Zephyr 7B-β | SFT | 250 | 53.5 | 64.0 | 66.2 |
| Zephyr 7B-β | ACT (ours) | 250 | 53.3 | 72.5 | 75.1 |
- 任务特定指标表现最佳: Table 2 显示,在所有三种数据设置下,ACT 在任务特定指标(特别是轨迹级嵌入相似度 (trajectory-level embedding similarity))方面表现最佳,这表明其在多轮推理方面的改进。
- 行动层面与内容层面的权衡: 在 100 和 250 个对话的设置中,SFT 微调的 Zephyr 在隐式行动识别 (implicit action recognition) 方面(Macro F1)略优于 ACT,但这并不意味着整体性能更优。作者在附录 A 中进一步讨论了这一点,强调行动层面的性能主要有助于理解澄清推理能力 (clarification reasoning ability)。
- 多轮推理能力改进: ACT 在所有条件下均带来了最强的轮次级 (turn-level) 和轨迹级任务性能 (trajectory-level task performance),这表明其改进了多轮推理能力。
6.1.3. 对话式文本到 SQL 生成 (AmbigSQL 任务)
以下是原文 Table 3 的结果:
| Adaptation Setting | | | Action-level | | Content-level | |
| Base Model | Approach | Conversations | Accuracy ↑ | Macro F1 ↑ | Execution Match ↑ | PC Execution Match ↑ |
| Gemini Pro | Standard ICL | 10 | 72.1 | 70.9 | 63.5 | 75.2 |
| Claude Sonnet | Standard ICL | 10 | 68.5 | 63.8 | 66.5 | 72.4 |
| Zephyr 7B-β | SFT | 50 | 77.4 | 77.4 | 21.9 | 13.9 |
| Zephyr 7B-β | IRPO | 50 | 91.0 | 91.0 | 27.8 | 30.8 |
| Zephyr 7B-β | ACT (ours) | 50 | 80.8 | 80.7 | 43.6 | 38.1 |
| Zephyr 7B-β | SFT | 100 | 97.2 | 97.2 | 43.3 | 34.3 |
| Zephyr 7B-β | IRPO | 100 | 96.2 | 96.1 | 45.0 | 37.0 |
| Zephyr 7B-β | ACT (ours) | 100 | 99.2 | 99.3 | 48.0 | 49.6 |
| Zephyr 7B-β | SFT | 250 | 99.8 | 99.7 | 51.0 | 50.7 |
| Zephyr 7B-β | IRPO | 250 | 97.0 | 97.1 | 49.7 | 45.6 |
| Zephyr 7B-β | ACT (ours) | 250 | 99.9 | 99.8 | 52.3 | 53.0 |
| Zephyr 7B-β | SFT | 14,000 (All) | 99.8 | 99.8 | 63.1 | 60.4 |
- ACT 在任务性能上最强: Table 3 显示,使用 ACT 微调的 Zephyr 在每个数据设置中都能实现最强的任务性能。
- 澄清后 SQL 执行匹配的显著提升: 当数据资源稀缺时,Post-Clarification SQL Execution Match 的性能提升尤为显著。例如,在 50 个对话的设置中,ACT 的 PC Execution Match 为 38.1%,远高于 SFT 的 13.9% 和 IRPO 的 30.8%。
- 行动准确率与 SQL 性能的差异: 尽管提示基线在行动准确率 (Action Accuracy) 上可能不高,但基准测试的前沿 LLM 在执行匹配 (execution match) 方面表现相对较强。相反,SFT 和 ACT 微调的 Zephyr 具有较高的行动准确率 (Action Accuracy),但在文本到 SQL 性能方面低于前沿 LLM。这主要归因于模型规模对 SQL 生成任务的巨大益处 (Sun et al., 2023b)。
- ACT 在多轮任务性能上的相对优势: 总体而言,ACT 在多轮任务性能 (multi-turn task performance) 方面实现了最大的相对性能提升,甚至超过了 IRPO 等用于定量推理的基线方法。这表明,如果将 ACT 应用于更大的模型,可能会进一步提升其多轮性能。
6.1.4. ACT 在野外:无对话行动监督的学习
以下是原文 Table 4 的结果:
| Task Adaptation Environment | | | | Action-level | Content-level | | |
| Base Model | Framework | Action Supervision | Tuning Ex. | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| Zephyr 7B-β | SFT | NA | 50 | 69.0 | 57.8 | 61.3 | 43.5 |
| Zephyr 7B-β | ACT | Crowdsourced | 50 | 82.2 | 62.8 | 61.9 | 57.2 |
| Zephyr 7B-β | ACT | Pseudo-labeled | 50 | 80.1 | 62.4 | 61.1 | 54.7 |
| Zephyr 7B-β | SFT | NA | 100 | 82.3 | 58.6 | 60.3 | 49.9 |
| Zephyr 7B-β | ACT | Crowdsourced | 100 | 86.0 | 65.0 | 62.0 | 57.4 |
| Zephyr 7B-β | ACT | Pseudo-labeled | 100 | 84.8 | 63.5 | 61.5 | 56.1 |
| Zephyr 7B-β | SFT | NA | 250 | 86.9 | 65.1 | 63.3 | 56.7 |
| Zephyr 7B-β | ACT | Crowdsourced | 250 | 89.6 | 68.1 | 65.7 | 62.0 |
| Zephyr 7B-β | ACT | Pseudo-labeled | 250 | 89.0 | 68.1 | 64.9 | 61.0 |
- 伪标签监督的有效性: 尽管在 Table 1-3 中使用了人工标注 (crowdsourced) 的模糊性标签,本文还证明了在没有行动标签监督的情况下执行基于行动的微调 (action-based tuning) 的可能性。
- 高一致性: 使用 Gemini 1.5 Pro 作为零样本行动标注器 (zero-shot action annotator) 重新标记 PACIFIC 语料库中的真实助手端轮次 (Assistant-side turns) 时,与真实行动标签达到了惊人的高一致性(98.5%)。
- 性能无显著差异: Table 4 显示,无论是行动层面 (Action-level) 还是内容层面 (Content-level) 的指标,使用伪标签 (Pseudo-labeled) 的 ACT 与使用人工标注 (Crowdsourced) 的 ACT 之间几乎没有经验上的性能差异。
ACT在“野外”设置 (in-the-wild settings) 中,即使只有少量无标签对话数据 (unlabeled conversational data),也可能非常有效。
6.2. 数据呈现 (表格)
以下是原文 Table A6 的结果:
| Adaptation Setting | | | Action-level | Content-level | |
| Base Model | Approach | Conversations | Macro F1 ↑ | Turn Similarity ↑ | Traj. Similarity ↑ |
| Gemini Pro | ICL | 50 | 56.4 | 64.5 | 68.9 |
| Zephyr 7B-β | ACT (ours) | 50 | 52.3 | 66.2 | 68.8 |
| Gemini Pro | ICL | 100 | 59.2 | 67.0 | 72.0 |
| Zephyr 7B-β | ACT (ours) | 100 | 51.1 | 69.5 | 71.4 |
| Gemini Pro | ICL | 250 | 58.8 | 66.0 | 71.1 |
| Zephyr 7B-β | ACT (ours) | 250 | 53.3 | 72.5 | 75.1 |
以下是原文 Table A7 的结果:
| Adaptation Setting | | | Action-level | Content-level | | |
| Base Model | Approach | Conversations | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| Gemini Pro | Standard Prompt | 10 | 81.4 | 59.7 | 58.7 | 49.7 |
| Gemini Pro | Chain-of-Thought | 10 | 86.3 | 66.3 | 17.1 | 19.2 |
| Gemini Pro | Proactive MIPrompt | 10 | 78.9 | 63.4 | 61.1 | 18.9 |
| Gemini Flash | Standard Prompt | 10 | 67.4 | 58.8 | 58.7 | 17.9 |
| Gemini Flash | Chain-of-Thought | 10 | 77.1 | 62.0 | 16.9 | 20.0 |
| Gemini Flash | Proactive MIPrompt | 10 | 76.8 | 64.0 | 62.0 | 24.4 |
| Claude Sonnet | Standard Prompt | 10 | 71.9 | 43.7 | 42.0 | 28.5 |
| Claude Sonnet | Chain-of-Thought | 10 | 80.0 | 37.2 | 13.0 | 6.8 |
| Claude Sonnet | Proactive MIPrompt | 10 | 74.9 | 47.2 | 45.9 | 7.6 |
| Claude Haiku | Standard Prompt | 10 | 46.9 | 26.4 | 26.2 | — |
| Claude Haiku | Chain-of-Thought | 10 | 48.6 | 23.7 | 12.0 | 2.9 |
| Claude Haiku | Proactive MIPrompt | 10 | 48.3 | 18.6 | 18.2 | 7.3 |
| Gemini Pro | SFT | 50 | 71.2 | 51.8 | 45.7 | 9.9 |
| Gemini Pro | SFT | 100 | 75.2 | 64.3 | 54.6 | 8.5 |
| Gemini Pro | SFT | 250 | 88.0 | 67.4 | 59.3 | 10.2 |
| Zephyr 7B-β | SFT | 50 | 69.0 | 57.8 | 61.3 | 43.5 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 50 | 75.5 | 61.7 | 55.7 | 30.8 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 50 | 74.8 | 62.0 | 56.3 | 31.9 |
| Zephyr 7B-β | IRPO | 50 | 67.7 | 59.1 | 56.7 | 34.4 |
| Zephyr 7B-β | ACT (ours) | 50 | 82.2 | 62.8 | 61.9 | 57.2 |
| Zephyr 7B-β | SFT | 100 | 82.3 | 58.6 | 60.3 | 49.9 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 100 | 68.8 | 53.3 | 53.3 | 31.7 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 100 | 83.0 | 59.0 | 53.7 | 29.3 |
| Zephyr 7B-β | IRPO | 100 | 84.5 | 60.4 | 55.2 | 38.2 |
| Zephyr 7B-β | ACT (ours) | 100 | 86.0 | 65.0 | 62.0 | 57.4 |
| Zephyr 7B-β | SFT | 250 | 86.9 | 65.1 | 63.3 | 56.7 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 250 | 65.6 | 53.6 | 54.1 | 30.9 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 250 | 82.8 | 43.3 | 38.6 | 19.6 |
| Zephyr 7B-β | IRPO | 250 | 85.4 | 64.9 | 58.4 | 40.3 |
| Zephyr 7B-β | ACT (ours) | 250 | 89.6 | 68.1 | 65.7 | 62.0 |
以下是原文 Table A8 的结果:
| Adaptation Setting | | | Action-level | Content-level | |
| Base Model | Approach | Conversations | Macro F1 ↑ | Turn Similarity ↑ | Traj. Similarity ↑ |
| Gemini Pro | Standard Prompt | 10 | 55.5 | 67.0 | 72.2 |
| Gemini Pro | Chain-of-Thought | 10 | 61.2 | 63.4 | 39.1 |
| Gemini Pro | Proactive MIPrompt | 10 | 55.5 | 63.3 | 33.3 |
| Gemini Flash | Standard Prompt | 10 | 52.6 | 62.5 | 67.4 |
| Gemini Flash | Chain-of-Thought | 10 | 61.2 | 56.5 | 36.6 |
| Gemini Flash | Proactive MIPrompt | 10 | 58.1 | 61.7 | 36.1 |
| Claude Sonnet | Standard Prompt | 10 | 66.0 | 50.1 | 54.3 |
| Claude Sonnet | Chain-of-Thought | 10 | 63.7 | 46.2 | 36.8 |
| Claude Sonnet | Proactive MIPrompt | 10 | 57.2 | 60.8 | 32.9 |
| Claude Haiku | Standard Prompt | 10 | 49.3 | 40.9 | 41.7 |
| Claude Haiku | Chain-of-Thought | 10 | 46.2 | 30.7 | 28.0 |
| Claude Haiku | Proactive MIPrompt | 10 | 45.2 | 34.5 | 31.4 |
| Zephyr 7B-β | SFT | 50 | 44.6 | 53.3 | 64.2 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 50 | 46.9 | 57.2 | 61.2 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 50 | 44.7 | 57.9 | 61.5 |
| Zephyr 7B-β | ACT (ours) | 50 | 52.3 | 66.2 | 68.8 |
| Zephyr 7B-β | SFT | 100 | 52.6 | 63.1 | 69.4 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 100 | 47.8 | 61.9 | 67.1 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 100 | 44.8 | 62.0 | 66.4 |
| Zephyr 7B-β | ACT (ours) | 100 | 51.1 | 69.5 | 71.4 |
| Zephyr 7B-β | SFT | 250 | 53.5 | 64.0 | 66.2 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 250 | 46.0 | 61.9 | 66.3 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 250 | 46.3 | 62.6 | 67.0 |
| Zephyr 7B-β | ACT (ours) | 250 | 53.3 | 72.5 | 75.1 |
以下是原文 Table A9 的结果:
| Adaptation Setting | | | Action-level | Content-level | | |
| Base Model | Approach | Conversations | Accuracy ↑ | Execution Match ↑ | PC Execution Match ↑ | |
| Gemini Pro | Standard Prompt | 10 | 72.1 | 63.5 | 75.2 | |
| Gemini Flash | Standard Prompt | 10 | 75.6 | 64.2 | 66.2 | |
| Claude Sonnet | Standard Prompt | 10 | 68.5 | 66.5 | 72.4 | |
| Claude Haiku | Standard Prompt | 10 | 73.8 | 57.3 | 65.3 | |
| Zephyr 7B-β | SFT | 50 | 77.4 | 21.9 | 13.9 | |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 50 | 77.7 | 42.6 | 31.5 | |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 50 | 78.0 | 40.9 | 41.2 | |
| Zephyr 7B-β | IRPO | 50 | 91.0 | 27.8 | 30.8 | |
| Zephyr 7B-β | ACT (ours) | 50 | 80.8 | 43.6 | 38.1 | |
| Zephyr 7B-β | SFT | 100 | 97.2 | 43.3 | 34.3 | |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 100 | 98.7 | 45.1 | 45.3 | |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 100 | 99.8 | 47.8 | 44.8 | |
| Zephyr 7B-β | IRPO | 100 | 96.2 | 45.0 | 37.0 | |
| Zephyr 7B-β | ACT (ours) | 100 | 99.2 | 48.0 | 49.6 | |
| Zephyr 7B-β | SFT | 250 | 99.8 | 51.0 | 50.7 | |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 250 | 97.3 | 49.7 | 44.2 | |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 250 | 99.7 | 50.7 | 50.3 | |
| Zephyr 7B-β | IRPO | 250 | 97.0 | 49.7 | 45.6 | |
| Zephyr 7B-β | ACT (ours) | 250 | 99.9 | 52.3 | 53.0 | |
| Zephyr 7B-β | SFT | 14,000 (All) | 99.8 | 63.1 | 60.4 | |
以下是原文 Table A11 的结果:
| No. | Interacting Party | Utterance |
| User Assistant | Can you list all the singer ids that aren't present in the song table? SELECT Name FROM singer WHERE Singer_ID NOT IN ... | |
| 1 | User Assistant | Thanks! You should ask at least 3 questions |
| 2 | Assistant | Did you want the full name of makers and the number? |
| 3 | Assistant | Do you mean the address of the customer with first name Luis? |
以下是原文 Table A14 的结果:
| Model | Ambiguous Request Execution Match | Execution Match with Clarification Turns |
| Gemini Pro | 28.5 | 68.7 |
| Gemini Ultra | 31.2 | 77.0 |
以下是原文 Table A15 的结果:
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request. |
|||
| User: Show all singers who are not from China or USA. | |||
Assistant: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA' |
|||
Trajectory Goal: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA' |
|||
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request. |
|||
| User: Show all singers who are not from China. | |||
| Assistant: Which country other than China are you asking about? | |||
Trajectory Goal: User: Show all singers who are not from China and USA. Assistant: SELECT Name FROM Singer WHERE Country != 'China' AND Country != 'USA' |
|||
以下是原文 Table A16 的结果:
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year and Stadium with columns Stadium_ID, Location, Name, Capacity. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request. |
|||
| User: What are the names of the stadiums where concerts were held in 2014? | |||
Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014 |
|||
Trajectory Goal: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014 |
|||
Prompt: The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year and Stadium with columns Stadium_ID, Location, Name, Capacity. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request. |
|||
| User: What are the names of the stadiums where concerts were held? | |||
| Assistant: Which year are you asking about? | |||
Trajectory Goal: User: What are the names of the stadiums where concerts were held in 2014? Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014 |
|||
以下是原文 Table A17 的结果:
| You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted] | ||
| User: How much does the company expect to make contributions to the defined plans during the next fiscal year? | ||
| Assistant: ['What kind of defined plans are you asking about?'] | ||
| User: The defined benefit plans and the defined contribution plan respectively. | ||
| Assistant: ['`5.1 million', '`0.6 million'] User: What is the value of the expected defined contribution plan as a percentage of the defined benefit | ||
| plan during the next fiscal year? | ||
| Assistant: 0.6/5.1 = 11.76 User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount | ||
| rate? Assistant: ['What kind of change are you asking about?'] | ||
| User: The pension. Assistant: ['49,361'] | ||
| User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease? | ||
| Assistant: 49,361 - (39,145) = 88506 User: How much would the pension change if there is a 1% increase and a 1% decrease respectively? | ||
| Assistant: ['Which rate of increase and decrease are you talking about?'] | ||
| User: The future salary growth. | ||
| Assistant: ['`7,572', '`(6,919)'] | ||
| User: What is the difference between them? | ||
| Assistant: 7,572 - (6,919) = 14491 | ||
以下是原文 Table A18 的结果:
| You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted] | ||||
| User What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year? | ||||
| Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step | ||||
| by step. Reasoning: The user's question is not ambiguous. Assistant: 0.6/5.1 = 11.76 | ||||
| rate? | User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount | |||
| Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly | ||||
| answer the user's question using the information from the passage context and the table. Let's think step by step. | ||||
| Reasoning: The user's question was ambiguous. Assistant: ['What kind of change are you asking about?] | ||||
| User: The pension. | ||||
| Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly | ||||
| answer the user's question using the information from the passage context and the table. Let's think step | ||||
| by step. | ||||
| Reasoning: The user's question is not ambiguous. Assistant: ['49,361'] | ||||
| User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease? | ||||
| Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly | ||||
| answer the user's question using the information from the passage context and the table. Let's think step | ||||
| by step. | ||||
| Reasoning: The user's question is not ambiguous. Assistant: 49,361 - (39,145) = 88506 User: How much | ||||
| would the pension change if there is a 1% increase and a 1% decrease respectively? | ||||
| Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly | ||||
| answer the user's question using the information from the passage context and the table. Let's think step | ||||
| by step. | ||||
| Reasoning: The user's question was ambiguous. | ||||
| Assistant: ['Which rate of increase and decrease are you talking about?'] |
以下是原文 Table A19 的结果:
| You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. | |
| [Table and Passage Omitted] User: How much does the company expect to make contributions to the defined plans during the next fiscal year? | |
| The user's last question was ambiguous. The Assistant asks a clarifying question. | |
| Assistant: ['What kind of defined plans are you asking about?'] | |
| User: The defined benefit plans and the defined contribution plan respectively. | |
| The user's last question was unambiguous. The Assistant directly answers the question. | |
| Assistant: ['5.1million' ,'0.6 million'] User: What is the value of the expected defined contribution plan as a percentage of the defined benefit | |
| plan during the next fiscal year? | |
| The user's last question was unambiguous. The Assistant directly answers the question. Assistant: 0.6/5.1 = 11.76 | |
| User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount | |
| rate? The user's last question was ambiguous. The Assistant asks a clarifying question. | |
| Assistant: ['What kind of change are you asking about?'] | |
| User: The pension. The user's last question was unambiguous. The Assistant directly answers the question. | |
| Assistant: ['(39, 145)','49,361'] | |
| User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease? | |
| The user's last question was unambiguous. The Assistant directly answers the question. | |
| Assistant: 49,361 - (39,145) = 88506 | |
| User: How much would the pension change if there is a 1% increase and a 1% decrease respectively? | |
| Actions: ["Directly Answer", "Ask a Clarification Question"] | |
| Prompt: Given the task background and the conversation history, please use appropriate actions to | |
| generate the response. | |
| Response: The user's last question was ambiguous. The Assistant asks a clarifying question. | |
| Assistant: ['Which rate of increase and decrease are you talking about?'] |
以下是原文 Table A20 的结果:
| [Table and Passage Omitted] | |
| User: What was the total number of acquisitions in FY19? | |
| Assistant: 35 acquisitions | |
| User: What are the benefits of active growth strategy? | |
| Assistant: strengthens and scales the National Storage operating platform which drives efficiencies across | |
| the business. | |
| User: What was the total NLA? | |
| Assistant: Which region are you asking about? | |
| The last Assistant utterance is a clarifying question. | |
以下是原文 Table A21 的结果:
| [Passage Omitted] | |
| User: And who in the group was anxious? Assistant: Peppe | |
| User: Was she well-rested? | |
| Assistant: no | |
| The last Assistant utterance is a direct answer. | |
以下是原文 Table A22 的结果:
| [Database Schema Omitted] | |
| User: "Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? | |
| List his or her id, last name and cell phone." | |
| Here is the same request converted into an ambiguous format by underspecifying the target columns: | |
| Assistant: | |
SELECT professional_id , last_name , cell_number FROM Professionals |
|
WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name , |
|
T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON |
|
T1.professional_id = T2.professional_id |
|
GROUP BY T1.professional_id HAVING count(*) > 2 |
|
| The last Assistant utterance is a direct answer. | |
以下是原文 Table A23 的结果:
The following is a conversation between an Assistant and a User who is asking some questions about the SQL database. The database has tables Singer with columns Singer_ID, Name, Country, Age and Concert with columns Concert_ID, Concert_Name, Theme, Stadium_ID, Year. The User's question may be ambiguous or underspecified, and the Assistant must clarify until the User's intent is no longer ambiguous. When the User's intent is no longer ambiguous, the Assistant should generate the SQL query which responds to the User's request. |
|
| User: What are the names of the stadiums where concerts were held? | |
| Assistant: Which year are you asking about? | |
| User: What are the names of the stadiums where concerts were held in 2014? | |
Assistant: SELECT T2.Name FROM Concert AS T1 JOIN Stadium AS T2 ON T1.Stadium_ID = T2.Stadium_ID WHERE T1.Year = 2014 |
|
| The following is a summary of the information the User is looking for: 1. Names of stadiums where concerts were held in 2014. | |
以下是原文 Table A24 的结果:
| The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him. | |
| User: What was his ranking? | |
| Assistant: General User: Did someone else have horse fighters? | |
| Assistant: yes | |
| User: Who? | |
| Assistant: Do you want to know who had 2,500 horse fighters initially? User: No, I want to know who had a considerable force of horse fighters west of him. | |
| Assistant: General (Bishop) Polk. | |
以下是原文 Table A25 的结果:
Prompt: The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you want to know who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.
The following is a summary of the information the User is looking for: 1. What was his ranking? 2. Whether someone else had horse fighters? 3. Who had a considerable force of horse fighters west of him?
以下是原文 Table A26 的结果:
The following is a snippet from a conversation between an Assistant and a User who is asking some questions about the Story below. The user only wants to know the following information: 1. What his ranking was. 2. Whether someone else had horse fighters. 3. Who had a considerable force of horse fighters west of him.
[Passage Omitted]
User: What was his ranking?
Assistant: General
User: Did someone else have horse fighters?
Assistant: yes
User: Who?
Assistant: Do you mean who had 2,500 horse fighters initially?
User: No, I want to know who had a considerable force of horse fighters west of him.
Assistant: General (Bishop) Polk.
以下是原文 Table A27 的结果:
A user is asking an assistant to retrieve some information from a SQL database. The command that the assistant should ultimately return is as follows:
SELECT county FROM campuses where campus = 'California State University-Chico'
The assistant will ask some questions to clarify the user's intent. The user should respond with a rephrased request that reflects their desired query.
User: what is the county?
Assistant: Are you asking for a list of all of the counties in the database?
User: I'm looking for the county of the campus 'California State University-Chico'
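Table A27 展示的是用户模拟器 (User Simulator) 的提示词格式:给定目标 SQL,由模拟器扮演用户并回应 Assistant 的澄清问题。下面给出一个最小示意代码,其中 `llm_generate` 是本文假设的通用文本生成接口(并非论文实际使用的 API),示例中用固定返回值的桩函数代替:

```python
# 最小示意:基于目标 SQL 构造用户模拟器提示词,并让 LLM 续写下一句用户发言。
# llm_generate 为假设的占位接口,实际使用时可替换为任意文本生成服务。

from collections.abc import Callable


def simulate_user_turn(gold_sql: str,
                       history: list[str],
                       llm_generate: Callable[[str], str]) -> str:
    """根据目标查询与对话历史,生成下一句模拟用户发言。"""
    prompt = (
        "A user is asking an assistant to retrieve some information from a SQL database. "
        "The command that the assistant should ultimately return is as follows:\n"
        f"{gold_sql}\n"
        "The assistant will ask some questions to clarify the user's intent. "
        "The user should respond with a rephrased request that reflects their desired query.\n\n"
        + "\n".join(history) + "\nUser:"
    )
    return llm_generate(prompt).strip()


if __name__ == "__main__":
    # 用固定返回值的桩函数代替真实 LLM,便于本地验证流程。
    stub = lambda _prompt: " I'm looking for the county of the campus 'California State University-Chico'."
    gold = "SELECT county FROM campuses where campus = 'California State University-Chico'"
    turns = ["User: what is the county?",
             "Assistant: Are you asking for a list of all of the counties in the database?"]
    print(simulate_user_turn(gold, turns, stub))
```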
以下是原文 Table A28 的结果:
相关表格数据 (Relevant Table Data): Year: 2019 || 2018;Total current liabilities: 851 || 167;Other noncurrent liabilities: 5 || 74;Total Liabilities: 1,305

SFT:
User: What were the total liabilities of IMFT?
Assistant: \$909 ×

ACT:
User: What were the total liabilities of IMFT?
Assistant: Which year are you asking about?
User: 2018
Assistant: \$1,305
以下是原文 Table A29 的结果:
相关数据 (Relevant Data): Net profit/(loss) after tax was \$(9.8) million;Net profit/(loss) after tax: (9,819) || 6,639 || (248%)

SFT:
User: What was the net profit/(loss) after tax?
Assistant: \$(9,819) ×

ACT:
User: What was the net profit/(loss) after tax?
Assistant: Which year are you asking the amount of?
User: FY19
Assistant: \$(9,819) ×
以下是原文 Table A30 的结果:
相关表格数据 (Relevant Table Data): Year: 2019 || 2018;Investments: 1,216.0 || 1,212.9

SFT:
User: In which year was the amount of Investments higher?
Assistant: [2018] ×

ACT:
User: In which year was the amount of Investments higher?
Assistant: [2018, 2019] ×
以下是原文 Table A31 的结果:
相关段落 (Relevant Passage): She yawned, stretching out like a lazy cat. Her sister was also awake. "Lizzie", Meghan whispered, "are you awake?"

SFT:
User: What did Meghan ask?
Assistant: Do you mean where did Lizzie see her father?
User: No

ACT:
User: What did Meghan ask?
Assistant: Do you mean that morning or the night before?
User: The night before.
6.3. 消融实验/参数分析
以下是原文 Table 5 的结果:
| | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- |
| Action Importance | | | | |
| ACT w/ Random Actions | 63.2 | 55.3 | 58.7 | 32.8 |
| Ablation of ACT subcomponents | | | | |
| ACT w/o on-policy sampling | 74.8 | 61.5 | 59.1 | 40.5 |
| ACT w/ sampling but w/o simulation | 81.4 | 60.8 | 60.2 | 50.1 |
| ACT (full) | 82.2 | 62.8 | 61.9 | 57.2 |
| ACT with unaligned foundation models | | | | |
| Gemma 2B SFT | 57.7 | 38.0 | 40.5 | 17.0 |
| Gemma 2B ACT | 62.7 | 42.6 | 44.0 | 24.8 |
| Mistral 7B SFT | 57.7 | 53.8 | 51.4 | 27.7 |
| Mistral 7B ACT | 75.7 | 58.1 | 57.6 | 31.9 |
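在逐项讨论消融结果之前,先用一段最小示意代码说明"行动对比偏好对"的构造思路:获胜与失败响应在对话行动(澄清 vs. 直接回答)上互为对照,而消融项 "ACT w/ Random Actions" 相当于随机指定这两个行动。代码中的函数与数据结构(如 `sample_from_policy`、`classify_action`)均为本文假设的说明性桩实现,并非论文官方代码:

```python
# 最小示意:按对话行动 (CLARIFY / ANSWER) 构造 ACT 风格的对比偏好对。
# sample_from_policy 与 classify_action 均为占位桩函数,真实实现应分别
# 调用策略模型采样和行动分类器(论文中由 LLM 充当)。

from dataclasses import dataclass

CLARIFY, ANSWER = "CLARIFY", "ANSWER"


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # 行动与标注一致的获胜响应
    rejected: str  # 行动相反的失败响应


def sample_from_policy(prompt: str, target_action: str) -> str:
    """桩函数:真实场景下应让策略模型在给定行动约束下采样一条响应。"""
    if target_action == CLARIFY:
        return "Which year are you asking about?"
    return "$909"


def classify_action(response: str) -> str:
    """桩函数:真实场景下由行动分类器判断响应是澄清问题还是直接回答。"""
    return CLARIFY if response.strip().endswith("?") else ANSWER


def build_pair(prompt: str, gold_action: str, gold_response: str) -> PreferencePair:
    """获胜响应取标注行动的参考响应;失败响应从策略模型按相反行动采样。"""
    wrong_action = ANSWER if gold_action == CLARIFY else CLARIFY
    rejected = sample_from_policy(prompt, wrong_action)
    assert classify_action(rejected) == wrong_action, "失败响应的行动应与获胜行动相反"
    return PreferencePair(prompt=prompt, chosen=gold_response, rejected=rejected)


if __name__ == "__main__":
    pair = build_pair(
        prompt="User: What were the total liabilities of IMFT?",
        gold_action=CLARIFY,
        gold_response="Which year are you asking about?",
    )
    print(pair)
```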
Table 5 展示了使用 PACIFIC 的 50 组对话设置进行的消融研究,以理解 ACT 各组件的重要性。
- 行动基于偏好是否必要? (Are action-based preferences necessary?)
  - 分析: ACT 的关键因素之一是对比对 (contrastive pairs) 突出了对话行动 (conversational actions) 之间的差异。构建偏好对时,如果随机采样获胜行动 (winning action) 和失败行动 (losing action)(如 Table 5 中的 "ACT w/ Random Actions"),性能会显著下降:Macro F1 从 ACT 的 82.2% 降至 63.2%,Post-Clarify F1 从 57.2% 降至 32.8%(偏好对的构造方式可参见上文示意代码)。
  - 结论: 这表明明确的行动选择 (action selection) 对于 ACT 的有效性至关重要。
- 是否需要在线策略采样? (Do we need on-policy sampling?)
  - 分析: Table 5 中的 "ACT w/o on-policy sampling" 实验评估了在线策略采样 (on-policy sampling) 的重要性,它本质上是在 4.2.2.1 节构建的数据集上进行普通的离线 DPO (off-policy DPO) 训练。
  - 结果: 尽管相比 SFT 有一些改进(例如 Macro F1 从 69.0 提高到 74.8),但与完整的 ACT 相比,整体改进幅度要小得多。
  - 结论: 这可能是因为离线负响应 (off-policy negative responses) 不保证位于策略模型 (policy model) 的语言流形 (language manifold) 中,导致分布漂移 (distribution shift) 难以通过离线学习克服 (Guo et al., 2024)。在线策略采样在 ACT 中起着关键作用。
- 轨迹模拟是否必要? (Is trajectory simulation necessary?)
  - 分析: ACT 通过在线策略轨迹模拟 (on-policy trajectory simulation) 更好地与多轮对话 (multi-turn conversations) 对齐。如果移除多轮模拟(如 Table 5 中的 "ACT w/ sampling but w/o simulation"),该方法类似于在线策略 DPO 变体 (on-policy DPO variants)(如 Pang et al., 2024),但带有考虑对话行动 (conversation actions) 和任务启发式 (task heuristics) 的对话特定奖励信号 (conversation-specific reward signal)。
  - 结果: 轨迹级模拟 (trajectory-level simulation) 对于改进多轮性能 (multi-turn performance) 至关重要,尤其是策略模型 (policy model) 对自身澄清问题 (clarification questions) 的后续推理能力:移除模拟后 Post-Clarify F1 从 57.2% 降至 50.1%。
  - 结论: 这强调了多轮交互 (multi-turn interaction) 和结果评估 (outcome evaluation) 在 ACT 中的核心地位。
- ACT 是否与模型无关? (Is ACT model agnostic?)
  - 分析: 主要实验中的基础模型 Zephyr 是通过对 Mistral 进行对齐得到的。Table 5 中的 "ACT with unaligned foundation models" 部分展示了使用未对齐的基础模型 Gemma 2B 和 Mistral 7B 的结果。
  - 结果: 尽管这两个模型经 ACT 微调后仍与 Zephyr 存在性能差距,但其行动 F1 (Action F1) 和轨迹 F1 (Trajectory F1) 均有明显提升(例如 Gemma 2B 的 Action F1 提高了 5.0,Mistral 7B 提高了 18.0),表明 ACT 无论是否经过人类反馈的预对齐都能提升性能。
  - 结论: 这意味着 ACT 具有模型无关性 (model agnostic),可以改进任何基础模型的性能,尽管更好的模型初始化(预对齐)可能会带来进一步的益处。
7. 总结与思考
7.1. 结论总结
本文提出了 ACT (Action-Based Contrastive Self-Training),一种模型无关 (model agnostic) 的准在线对比微调方法 (quasi-online contrastive tuning approach),旨在实现样本高效 (sample-efficient) 的对话任务适应 (conversational task adaptation)。同时,还提出了一种用于评估对话代理 (conversational agents) 的工作流程。
主要发现和贡献总结如下:
- 有效提升澄清能力: ACT 能够显著提升 LLM 在多轮对话中隐式识别 (implicitly recognize) 和推理模糊性 (reason about ambiguity) 的能力,使其能够更好地提出澄清问题 (clarification questions),而不是过度猜测用户意图。
- 数据高效性: 在有限数据 (limited data regime) 的场景下,ACT 表现出高度有效性,甚至在没有行动标签 (action labels) 的情况下,也能通过伪标签 (pseudo-labels) 实现成功的微调。
- 超越基线方法: ACT 在多个真实世界对话任务(包括表格型问答 PACIFIC、机器阅读理解 Abg-CoQA 和新任务 AmbigSQL)上,相较于 SFT 和 DPO 等标准微调方法,展现出显著的对话建模改进 (conversation modeling improvements)。
- 关键组件的验证: 消融实验证实了行动基于偏好 (action-based preferences)、在线策略采样 (on-policy sampling) 和多轮轨迹模拟 (multi-turn trajectory simulation) 作为 ACT 核心组件的必要性。
7.2. 局限性与未来工作
7.2.1. 局限性 (Limitations)
- 对澄清问题时机的假设: ACT 假设澄清问题 (clarification questions) 的提出是适时的。然而,众包对话数据集 (crowdsourced conversation datasets) 可能存在噪声,导致模型学到次优策略(例如,提出不必要的澄清问题或生成不流畅的语言)。未来的工作可能需要额外的预处理阶段 (preprocessing stage) 来推断某个行动是否有效。
- 标签噪声的影响: 隐式行动识别评估 (implicit action recognition evaluation) 假设基准任务中的行动是"最优"的。在标注者间一致性 (inter-annotator agreement) 低的数据集(如 Abg-CoQA)中,这可能导致评估结果的不一致性。
- 对任务特定启发式的依赖: ACT 依赖于任务特定启发式 (task-specific heuristics) 来评估轨迹结果。虽然这允许根据不同领域定制成功标准,但也增加了定制化 (customization) 和工程专业知识 (engineering expertise) 的需求。
- 对现有 LLM 的依赖: ACT 的实现在很大程度上依赖于现有 LLM(如 Gemini)进行偏好数据集 (preference dataset) 构建、行动分类 (Action Classification) 和用户模拟 (User Simulation)。这可能受限于商业 LLM 的可访问性(成本、隐私)和潜在的偏差。
- 研究范围: 本研究主要关注有限数据 (limited data regime) 场景。在训练数据充足且分布与目标分布高度匹配的情况下,ACT 相对于 SFT 等方法的优势可能不会那么显著。
- "准在线"的界定: ACT 被定义为准在线对比 RL 微调 (quasi-online contrastive RL tuning) 方法,因为它结合了离线方法的固定数据集 (fixed dataset) 和在线方法的在线策略采样 (on-policy sampling)。但其在线探索的程度仍受限于对话轨迹 (dialogue trajectories) 的有限性质(尤其是在有明确正确答案的任务中)。
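针对上面"对任务特定启发式的依赖"一条,下面给出一个假设性的轨迹结果评估示意:对 AmbigSQL 这类有明确参考答案的任务,可以用规范化后的字符串匹配作为成功判据,对抽取式问答则退化为词级 F1。其中 `normalize_sql`、`trajectory_success` 等函数名均为本文为说明而假设的,并非论文中的实际启发式:

```python
# 最小示意:一个假设性的任务特定启发式,用于判断对话轨迹的最终答案是否成功。

import re


def normalize_sql(sql: str) -> str:
    """粗略规范化:压缩空白并统一大小写,仅作示意,不处理语义等价改写。"""
    return re.sub(r"\s+", " ", sql.strip()).lower()


def token_f1(pred: str, gold: str) -> float:
    """词级 F1,常用于抽取式问答的答案评估。"""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if not common:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def trajectory_success(task: str, pred: str, gold: str, f1_threshold: float = 0.5) -> bool:
    """按任务类型选择启发式:SQL 用规范化精确匹配,问答用 F1 阈值。"""
    if task == "ambigsql":
        return normalize_sql(pred) == normalize_sql(gold)
    return token_f1(pred, gold) >= f1_threshold


if __name__ == "__main__":
    print(trajectory_success("ambigsql",
                             "SELECT county FROM campuses WHERE campus = 'X'",
                             "select county from campuses where campus = 'X'"))  # True
    print(trajectory_success("qa", "General Bishop Polk", "General (Bishop) Polk."))  # 按 F1 阈值判断
```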
7.2.2. 未来工作 (Future Work)
- 与其他复杂微调方法的结合: 考虑将 ACT 与现有针对文本到 SQL 生成 (text-to-SQL generation) 等复杂任务的精细微调方法 (sophisticated tuning approaches) 相结合,以进一步提升性能。
- 大规模数据和多任务环境的泛化: 研究 ACT 在大规模数据 (large-scale data) 和多任务环境 (multi-task environments) 中的泛化能力。
- 与检索增强生成 (RAG) 的结合: 鉴于 LLM 的幻觉 (hallucinations) 问题,可以将 ACT 与检索增强生成 (Retrieval-Augmented Generation, RAG) 方法结合,以提升事实准确性。
7.3. 个人启发与批判
7.3.1. 个人启发 (Personal Insights)
- 行动规划的价值: 这篇论文强调了在多轮对话中,单纯生成流畅的文本是不够的,智能体 (agent) 的行动规划 (action planning) 能力(何时澄清、何时回答)是实现对话成功的核心。这为 LLM 的对齐和微调提供了新的视角,超越了传统上关注的文本质量。
- DPO 的灵活性与在线扩展: DPO 作为 RLHF 的轻量级替代品,其潜力不仅在于简化训练,更在于其通过灵活的损失函数设计来融入更复杂的偏好。ACT 将在线策略采样 (on-policy sampling) 和轨迹模拟 (trajectory simulation) 引入 DPO 框架,为数据稀疏 (data-efficient) 场景下的策略学习 (policy learning) 提供了有效途径。
- 弱监督和伪标签的实用性: ACT 在没有行动标签 (action labels) 的情况下,通过伪标签 (pseudo-labels) 也能取得良好性能,这极大地降低了对昂贵人工标注 (human annotation) 的依赖,使其在实际应用中更具可行性。对于新兴或低资源任务,这种方法能够快速启动模型开发。
- 多维度评估的重要性: 论文提出的行动层面 (action-level) 和内容层面 (content-level) 的评估指标,以及轨迹级评估 (trajectory-level evaluation),为全面衡量对话代理 (conversational agents) 性能提供了更丰富的视角,避免了仅关注表面生成质量的局限性。
7.3.2. 批判 (Critique)
- 用户模拟器的可靠性: ACT 严重依赖用户模拟器 (User Simulator, U) 来生成多轮对话轨迹 (multi-turn dialogue trajectories)。如果用户模拟器自身的模拟能力不足或存在偏差,它可能无法准确反映真实用户的行为和意图,从而可能引导 ACT 学习到次优的策略。用户模拟器的鲁棒性和泛化能力(特别是对未见过的策略 (policies) 和错误 (errors) 的反应)是 ACT 成功的关键瓶颈。
- 任务特定启发式的通用性: 论文明确指出 ACT 依赖任务特定启发式 (task-specific heuristics) 进行轨迹评估。虽然这允许在不同领域进行定制,但这也意味着将 ACT 应用到新任务时,需要投入额外的工程专业知识 (engineering expertise) 来设计和验证这些启发式规则。这可能限制了 ACT 在完全开放域或快速变化的复杂任务中的应用。
- "准在线"的理论严格性: 尽管 ACT 被描述为准在线 (quasi-online),但其在线探索 (online exploration) 的性质与纯粹的在线强化学习 (online reinforcement learning) 仍有区别。论文对这种"准在线"模式的理论性质(例如,收敛性、探索-利用权衡)的讨论相对较少。更严格的理论分析可以进一步支持其有效性。
- 对商业 LLM 的依赖与复现性: 实验中大量使用 Gemini Ultra 等商业 LLM 进行偏好数据构建 (preference data construction)、行动分类 (action classification) 和用户模拟 (user simulation)。这对于学术界和资源有限的研究者来说可能是一个挑战,影响了研究的复现性和透明度。如果这些功能由开源模型替代,其性能可能会有所下降。
- 行动空间和复杂对话: 当前的行动空间 (action space) 主要限于 CLARIFY 和 ANSWER。在更复杂的对话中,可能存在更多元的对话行动 (dialogue actions)(如确认、建议、拒绝、提供额外信息等)。如何将 ACT 扩展到更丰富、层级化的行动空间,以及如何处理这些行动之间的依赖关系,是未来的挑战。