Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
TL;DR Summary
This paper introduces Action-Based Contrastive Self-Training (ACT) to enhance LLMs' ability to handle ambiguity. The quasi-online preference optimization algorithm effectively learns dialogue policies in data-sparse scenarios, demonstrating superior performance over traditional fine-tuning approaches such as SFT and DPO.
Abstract
Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
1.2. Authors
Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik. Maximillian Chen is affiliated with Google and Columbia University. The other authors (Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik) are affiliated with Google. Their research backgrounds generally lie in artificial intelligence, machine learning, and natural language processing, particularly in the domain of conversational AI and large language models.
1.3. Journal/Conference
Published at arXiv, a preprint server. arXiv is a widely respected open-access archive for preprints of scientific papers in fields such as mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers on arXiv have not necessarily undergone peer review, but it is a primary platform for disseminating new research rapidly in the AI/ML community.
1.4. Publication Year
2024 (Published at UTC: 2024-05-31T22:44:48.000Z)
1.5. Abstract
The paper addresses the limitation of Large Language Models (LLMs) in handling ambiguity during multi-turn conversations; LLMs often guess user intents or overhedge rather than asking clarifying questions. This issue is exacerbated by the scarcity of high-quality, task-specific conversation samples for training. The authors propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm built upon Direct Preference Optimization (DPO). ACT is designed for data-efficient dialogue policy learning in multi-turn conversation modeling. The paper demonstrates ACT's effectiveness in data-efficient tuning scenarios, even without explicit action labels, across three real-world conversational tasks: tabular-grounded question-answering (PACIFIC), machine reading comprehension (Abg-CoQA), and a novel task called AmbigSQL for disambiguating information-seeking requests for complex SQL generation. Additionally, the paper introduces a method for evaluating LLMs' ability to implicitly recognize and reason about ambiguity. ACT significantly outperforms standard tuning methods like Supervised Fine-Tuning (SFT) and DPO.
1.6. Original Source Link
https://arxiv.org/abs/2406.00222 (Preprint) PDF Link: https://arxiv.org/pdf/2406.00222v2.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the deficiency of Large Language Models (LLMs) in handling ambiguity within multi-turn conversations. While LLMs excel in many benchmarks, they often fail to exhibit crucial conversational skills like disambiguation. When confronted with ambiguous user requests, they tend to either provide vague or overly cautious responses (overhedging) or make implicit guesses about the user's true intent, instead of actively seeking clarification through questions. This behavior leads to suboptimal user experiences and reduces the effectiveness of LLM-based conversational agents.
This problem is particularly important in domains where high-quality conversation samples are limited. Collecting and annotating such data is costly and often faces privacy concerns, creating a bottleneck for LLMs to learn effective dialogue action policies. Existing LLM training paradigms, such as supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), don't always directly optimize for pragmatic conversational skills (e.g., when to clarify), especially in complex, multi-turn interactions.
The paper's entry point or innovative idea is to address this data scarcity and skill gap by proposing a data-efficient adaptation approach. It focuses on enabling LLMs to implicitly select conversational strategies in ambiguous contexts by contrasting pragmatic conversational actions within multi-turn trajectories, rather than just single-turn responses.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Action-Based Contrastive Self-Training (ACT): Introduction of ACT, a novel, data-efficient, quasi-online preference optimization algorithm based on DPO. ACT is designed specifically for multi-turn conversation modeling and focuses on contrasting pragmatic conversational actions (e.g., CLARIFY vs. ANSWER).
- Multi-Turn Trajectory Optimization: ACT explicitly incorporates multi-turn trajectory simulation into its training objective, allowing it to optimize for long-horizon conversational success, a significant improvement over methods focused solely on single-turn responses.
- Data-Efficient Policy Learning: Demonstrated ACT's efficacy in data-efficient tuning scenarios, showing substantial improvements even with limited conversation samples (e.g., 50, 100, 250 conversations) and even when explicit action labels are unavailable (using pseudo-labeling).
- Novel Task and Evaluation Workflow: Introduction of AmbigSQL, a new task specifically designed for disambiguating information-seeking requests to generate complex SQL queries, highlighting the necessity of clarification in data analysis agents. The paper also proposes a workflow for evaluating LLMs' ability to implicitly recognize and reason about ambiguity in conversation.
- Empirical Superiority: ACT demonstrated substantial conversation modeling improvements over standard tuning approaches, including Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) variants like Iterative Reasoning Preference Optimization (IRPO), across diverse tasks like tabular QA (PACIFIC), machine reading comprehension (Abg-CoQA), and text-to-SQL generation (AmbigSQL).
- Model Agnosticism: ACT is shown to be model-agnostic, capable of improving performance regardless of the base model's pre-existing alignment with human feedback.

These findings address LLMs' struggle with conversational ambiguity by providing a method that teaches them to CLARIFY effectively, even with limited task-specific data, leading to more robust and helpful conversational agents.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following fundamental concepts:
- Large Language Models (LLMs): LLMs are deep learning models, typically based on the Transformer architecture, that are pre-trained on vast amounts of text data to learn complex patterns of language. They can generate human-like text, translate languages, answer questions, and perform many other Natural Language Processing (NLP) tasks. The core idea is that through pre-training, they develop a broad understanding of language, which can then be adapted for specific tasks through fine-tuning.
- Supervised Fine-Tuning (SFT): SFT is a common technique used to adapt a pre-trained LLM to a specific downstream task. It involves training the LLM on a dataset of input-output pairs (e.g., instruction-response pairs) for that task. The model's weights are adjusted to minimize a loss function (e.g., cross-entropy) that measures the difference between its generated output and the ground-truth output. SFT teaches the model how to follow instructions or generate responses in a desired format.
- Reinforcement Learning from Human Feedback (RLHF): RLHF is a post-training technique designed to align LLMs with human preferences and values. It typically involves three steps:
  1. Supervised Fine-Tuning (SFT): An LLM is initially fine-tuned on instruction-following data.
  2. Reward Model Training: Human annotators rank or rate different LLM-generated responses to a given prompt. A separate reward model is then trained on this human preference data to predict human preferences.
  3. Reinforcement Learning (RL): The SFT model is further fine-tuned using an RL algorithm (e.g., Proximal Policy Optimization (PPO)) where the reward model provides a scalar reward signal. The LLM learns to generate responses that maximize this reward, thereby aligning its behavior with human preferences. RLHF is crucial for making LLMs helpful, harmless, and honest.
- Direct Preference Optimization (DPO): DPO is an alternative to RLHF that simplifies the alignment process by directly optimizing the LLM policy against human preferences, without needing to explicitly train a separate reward model. Instead of learning a reward function and then using RL, DPO reformulates the RLHF objective into a simple cross-entropy loss. It directly optimizes the LLM to increase the probability of preferred responses and decrease the probability of dispreferred responses, based on a dataset of human-ranked pairs (preferred, rejected). This makes DPO more stable and easier to implement than PPO-based RLHF.
- Multi-turn Conversations/Dialogue Systems: These refer to interactive systems where an agent and a user exchange multiple utterances to achieve a specific goal or engage in natural conversation. Unlike single-turn interactions, multi-turn dialogues require context understanding, memory, and the ability to manage the flow of conversation. Key challenges include maintaining coherence, tracking dialogue state, and handling ambiguity that arises over several turns.
- Dialogue Acts/Pragmatic Intent: A dialogue act is a linguistic unit that describes the function or communicative intention of an utterance in a conversation (e.g., question, answer, clarification, greeting, apology). Recognizing and generating appropriate dialogue acts is crucial for effective conversational agents, as it allows them to understand and influence the pragmatic flow of a conversation. The pragmatic intent refers to the underlying purpose or goal a speaker has when making an utterance.
3.2. Previous Works
The paper builds upon a rich history of research in conversational AI and LLM alignment.
- Mixed-Initiative Conversational Agents: Early neural approaches (Chen et al., 2017, 2022b; Qian et al., 2022; Yu, 2017) often decoupled generation (forming utterances) from planning (deciding what action to take). While LLMs improved generation, planning, especially for multi-turn interactions, remained challenging. Planning modules often involved complex Markov Decision Processes (MDPs) (Wang et al., 2020b; Yang et al., 2021; Yu et al., 2023) or neural models combined with search algorithms (Cheng et al., 2022; Väth et al., 2023; Yu et al., 2023) and simulation (Deng et al., 2023c; Wang et al., 2020a; Yu et al., 2023). However, these modular approaches could incur high computational overhead and error propagation.
- LLM Alignment (RLHF and DPO): The current paradigm for LLM training involves pre-training, SFT, and alignment. RLHF (Ouyang et al., 2022; Christiano et al., 2017) became prominent for alignment, often using PPO. Recent offline RL algorithms like DPO (Rafailov et al., 2024), SLiC (Zhao et al., 2023), and IPO (Azar et al., 2024) have gained traction for bypassing explicit reward modeling, making them easier to optimize while achieving similar empirical results on fixed preference datasets.
- On-Policy/Iterative DPO Variants: Many contemporary works explore "online" or "iterative" variants of DPO to overcome the limitations of fully offline methods. Examples include Iterative DPO (Yuan et al., 2024), variants where ground-truth responses are preferred over sampled ones (Chen et al., 2024), or DPO applied to optimize reasoning chains (Pang et al., 2024). Some works also apply DPO to conversations generally, but often focus on single-turn response optimization (Sun et al., 2024).
- User Simulators: Works like Deng et al. (2023c) and Yu et al. (2023) directly prompt LLMs for goal-oriented tasks conditioned on dialogue context, building on earlier research in user simulation for dialogue systems (Chandramohan et al., 2011; Kreyssig et al., 2018; Shah et al., 2018).
3.3. Technological Evolution
The field of conversational AI has evolved significantly, particularly with the advent of LLMs. Initially, dialogue systems were often rule-based or heavily reliant on statistical models for specific components like Natural Language Understanding (NLU) and Natural Language Generation (NLG). The shift towards neural approaches brought improvements, leading to modular systems where planning (what to say) and generation (how to say it) were often separate.
The rise of LLMs brought about a paradigm shift. Their impressive zero-shot and few-shot capabilities allowed for highly fluent and contextually aware generation. However, simply using LLMs for generation didn't automatically imbue them with robust conversational pragmatics. The need for alignment with human values and specific task behaviors led to SFT and then RLHF. RLHF effectively enhanced LLMs' ability to follow instructions and generate preferred responses, but its complexity (training a reward model, using PPO) spurred the development of simpler, yet effective, alternatives like DPO.
This paper's work (ACT) fits into this timeline by taking the DPO framework and extending it to specifically address the nuanced, multi-turn challenges of conversational AI. It moves beyond general LLM alignment to focus on learning optimal dialogue action policies in scenarios where data is limited, an increasingly common and critical challenge in deploying specialized conversational agents.
3.4. Differentiation Analysis
Compared to the main methods in related work, ACT offers several core differences and innovations:
- Action-Based Preference Optimization: While DPO and its variants optimize LLMs based on general human preferences for responses, ACT explicitly introduces action-based preferences. It focuses on contrasting responses not just by their content, but by the implicit conversational action (e.g., CLARIFY vs. ANSWER) they represent. This directly targets the pragmatic skills needed for effective disambiguation.
- Multi-Turn Trajectory Simulation: Unlike most DPO applications that focus on single-turn response optimization, ACT incorporates multi-turn trajectory simulation into its self-training loop. It evaluates not just the immediate response but the outcome of a simulated conversation trajectory stemming from that response. This allows for long-horizon optimization, crucial for complex conversational tasks where a single "good" turn might lead to a poor overall outcome.
- Quasi-Online Learning: ACT is described as a quasi-online DPO variant. It maintains the ease of use of offline methods (by constructing an initial preference dataset) but incorporates the flexible exploration found during online learning by continuously sampling on-policy responses and updating contrastive pairings based on their simulated conversational success. This strikes a balance between computational efficiency and adaptability.
- Data Efficiency for Conversational Policies: ACT is specifically designed for data-efficient tuning scenarios, making it highly relevant for task-specific conversational AI where high-quality multi-turn dialogue data is scarce. It shows significant improvements even with very limited training data (e.g., 50-250 conversations), outperforming approaches that typically require more data or are less focused on conversational pragmatics.
- Implicit Action Learning: ACT implicitly learns action planning as a subtask of response generation. This avoids the computational overhead and error propagation issues associated with separate planning modules in traditional mixed-initiative agents. It also demonstrates the ability to learn effectively even when explicit action labels are unavailable, leveraging pseudo-labeling.

In essence, ACT innovates by tailoring DPO to the unique demands of multi-turn conversational AI, emphasizing action-based learning and trajectory-level outcomes to teach LLMs how to effectively manage ambiguity and clarify user intent.
4. Methodology
The core idea of ACT is to adapt generic LLMs for dialogue generation, implicitly modeling action planning. It's a quasi-online extension of DPO, focusing on contrasting differences between an agent's possible pragmatic conversational actions within multi-turn contexts.
4.1. Principles
The theoretical basis behind ACT is rooted in three key intuitions:
- Contrastive Preferences for Pragmatic Actions: The pragmatic differences between "winning" (desired) and "losing" (undesired) dialogue responses can be intuitively demonstrated and learned through contrastive preference pairs. This means explicitly showing the model what constitutes a good conversational action (like asking a clarifying question when appropriate) versus a bad one (like guessing the user's intent).
- Multi-turn Optimization for Conversational Improvements: Real conversational improvements require optimizing over sequences of turns (multi-turn optimization), not just individual responses. A single turn's quality might not reflect its contribution to the overall conversational goal. ACT therefore needs to evaluate and learn from entire conversation trajectories.
- On-Policy Response Sampling and DPO Gradient: The gradient of DPO-like algorithms is weighted by the log probabilities of winning and losing responses. By integrating on-policy response sampling (i.e., sampling responses from the current version of the model being trained), ACT ensures that the "losing" responses used in training are often high-probability sequences from the current policy, making the contrastive learning more effective for steering the model's behavior away from its current undesirable tendencies.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Setup
The paper considers the task of tuning an LLM to act as a mixed-initiative conversational agent. This agent's role is to assist users over multiple dialogue interactions, ultimately providing a correct response to their request. Crucially, a mixed-initiative agent must understand when to redirect the flow of interaction by executing specific conversational actions, such as asking clarifying questions.
Notation:
- Let π_θ_j be an LLM's policy, parameterized by θ, at (training) timestep j.
- π_ref is the reference policy model, which is typically the initial policy model (i.e., π_θ_0).
- D is a dataset of conversations.
- Each conversation consists of a sequence of dialogue turns.
- A turn state at timestep i is represented by t_i. This includes observed utterances and actions from both user and agent.
- Each t_i is part of a trajectory that ends when the user's initial question (from an earlier timestep) is answered.
- t_i has two primary components:
  - p_i: A prompt at timestep i. This comprises any task-specific information (e.g., SQL database schema, tabular data, retrieved passages) combined with the existing dialogue context.
  - r_i: The ground truth system-side response at timestep i.
- g_i: The goal response that resolves the implicit trajectory of t_i. In single-turn trajectories, g_i = r_i.
- Each r_i implicitly expresses an action a_i.
- The latent action space of a particular task. For the experiments in this paper, this is [CLARIFY, ANSWER].
- An Action Annotation Agent that can infer a_i from (p_i, r_i). In practice, this can be a human annotator or a pseudo-labeling classifier.
- M: A controllable generation model used for preference data creation. This is typically a high-capacity LLM.
- A: An Action Classifier model used during tuning and evaluation to identify the implicit action of a generated response.
- U: A User Simulator model, also used during tuning and evaluation to simulate user responses and conversational trajectories.

The notation is illustrated in Figure A4.
This image is a schematic showing an example multi-turn conversation about materialism. In the dialogue, the assistant is asked to clarify how materialism relates to other philosophical categories. The chosen response clarifies the user's question, while the rejected response answers it directly, highlighting the contrast between the dialogue intent and the information goal.
Figure A4 shows an example of a contrastive pairing constructed for RL tuning with Abg-CoQA. The example demonstrates how a turn in a conversation is decomposed into its prompt, ground truth response, goal response, and the inferred action. It also shows a winning response (y_wi) and a losing response (y_li), highlighting that the losing response is generated by forcing an alternative action.
User Simulators: The User Simulator (U) is implemented by first prompting an LLM to summarize the user's information-seeking goal. Then, another prompt is formed using this summary and the current dialogue context to simulate a user response. This approach allows more flexibility than directly providing the ground truth objective.
Action Classifiers: For the tasks considered, the possible actions are CLARIFY or ANSWER. The Action Classifier (A) is implemented using few-shot in-context learning with an LLM.
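To make the roles of A and U concrete, here is a minimal Python sketch of how such components can be prompted. The helper `llm_generate` and all prompt wording are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Illustrative sketch of an LLM-backed Action Classifier (A) and User Simulator (U).
# `llm_generate` is a hypothetical wrapper around whatever LLM API is available.

def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API call here")

FEW_SHOT_ACTION_EXAMPLES = (
    "Response: Are you asking about the defined benefit plans or the defined contribution plan?\n"
    "Action: CLARIFY\n"
    "Response: 0.6/5.1 = 11.76\n"
    "Action: ANSWER\n"
)

def classify_action(response: str) -> str:
    """Few-shot in-context classification of a response into CLARIFY or ANSWER."""
    prompt = (
        "Label each assistant response with its dialogue action (CLARIFY or ANSWER).\n"
        + FEW_SHOT_ACTION_EXAMPLES
        + f"Response: {response}\nAction:"
    )
    label = llm_generate(prompt).strip().upper()
    return "CLARIFY" if "CLARIFY" in label else "ANSWER"

def simulate_user(task_prompt: str, dialogue_so_far: list[str]) -> str:
    """Two-stage user simulation: first summarize the user's information-seeking
    goal, then answer the assistant's clarifying question consistently with it."""
    goal_summary = llm_generate(
        "Summarize the user's information-seeking goal in one sentence.\n\n"
        + task_prompt + "\n" + "\n".join(dialogue_so_far)
    )
    return llm_generate(
        "You are the user. Your goal: " + goal_summary + "\n"
        "Answer the assistant's last clarifying question briefly and consistently with your goal.\n\n"
        + "\n".join(dialogue_so_far) + "\nUser:"
    )
```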
4.2.2. ACT: Action-Based Contrastive Self-training
ACT consists of two main phases: action-based contrast dataset construction (Algorithm 1) and contrastive self-training (Algorithm 2). The overall process is summarized in Figure 3.

This image is a schematic showing the workflow of the Action-Based Contrastive Self-Training (ACT) method for policy simulation and evaluation in multi-turn conversations. It depicts example responses and action detection under different scenarios, as well as the policy update process.
Figure 3 provides an overview of the tuning phase of ACT. It shows that for each initial contrastive pairing, an on-policy response is sampled from the model being tuned. This sampled response's trajectory is evaluated, and the contrastive pairing is updated (either the winning or losing response is replaced) before the model policy is updated using the DPO objective.
4.2.2.1. Construction of Preference Data
The preference dataset D_pref is primarily built from contrastive winning-losing action pairs. This process is detailed in Algorithm 1.
The following are the steps for building contrastive action pairs (Algorithm 1):

Explanation:
- An empty dataset D_pref is initialized to store the contrastive pairs.
- For each conversation turn t_i in the input dataset D:
  - Infer Contextual Action: The Action Annotation Agent infers the implicit action a_i of the ground truth response r_i given the prompt p_i. This can be human annotation or a pseudo-labeling classifier.
  - Determine Rejected Action: A rejected action is chosen from the action space such that it is not the inferred action a_i. For [CLARIFY, ANSWER], if a_i is CLARIFY, then the rejected action is ANSWER, and vice versa.
  - Winning Response: The ground truth response r_i is designated as the winning response y_wi.
  - Losing Response Sampling: A losing response y_li is sampled from the conditional generation model M. This sampling is conditioned on the prompt p_i and explicitly instructed to generate a response corresponding to the rejected action. This means M generates a response that performs the opposite pragmatic action of the ground truth.
  - Tuple Creation: An augmented tuple is created containing the prompt p_i, ground truth response r_i, goal response g_i, inferred action a_i, rejected action, winning response y_wi, and losing response y_li.
  - This tuple is added to D_pref.

The paper notes that a high-capacity LLM (Gemini Ultra) is used for M to construct losing responses, which avoids the need for crowdworkers or tuning a smaller model for this specific task.
Action optimization for unlabeled conversations "in-the-wild": Even without gold-standard ambiguity annotations, pseudo-label supervision can be obtained by using a classifier (e.g., a zero-shot LLM) as the Action Annotation Agent.
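Below is an illustrative Python sketch of this dataset-construction procedure (Algorithm 1). The helpers `annotate_action` (standing in for the Action Annotation Agent, e.g., a human label or a zero-shot pseudo-labeling classifier) and `generate_with_action` (standing in for the conditional generation model M) are assumed names for illustration only.

```python
# Illustrative sketch of Algorithm 1: building the action-contrast dataset D_pref.

ACTION_SPACE = ["CLARIFY", "ANSWER"]

def annotate_action(prompt: str, response: str) -> str:
    raise NotImplementedError("human label or pseudo-labeling classifier")

def generate_with_action(prompt: str, action: str) -> str:
    raise NotImplementedError("prompt a high-capacity LLM to respond with the given action")

def build_preference_dataset(turns):
    """turns: iterable of dicts with keys 'prompt' (p_i), 'response' (r_i), 'goal' (g_i)."""
    d_pref = []
    for turn in turns:
        p_i, r_i, g_i = turn["prompt"], turn["response"], turn["goal"]
        a_i = annotate_action(p_i, r_i)                       # implicit action of r_i
        rejected = next(a for a in ACTION_SPACE if a != a_i)  # the opposite action
        d_pref.append({
            "prompt": p_i,
            "response": r_i,
            "goal": g_i,
            "action": a_i,
            "rejected_action": rejected,
            "chosen": r_i,                                    # winning response y_wi
            "rejected": generate_with_action(p_i, rejected),  # losing response y_li from M
        })
    return d_pref
```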
4.2.2.2. Self-Training Using On-policy Conversation Trajectory Simulation
This phase involves continuously sampling batches from the preference dataset D_pref and dynamically updating the winning/losing pairs based on on-policy sampling and multi-turn trajectory simulation. This process is detailed in Algorithm 2.
The following are the steps for ACT: Action-Based Contrastive Self-Training (Algorithm 2):
Algorithm 2 ACT: Action-Based Contrastive Self-Training
input Initial Policy Model π_θ_0, Action Contrast Dataset D_pref, Number of Batches B, Action Classifier A, User Simulator U, Task Heuristic H, Heuristic Tolerance ε
1: for conversation turn t_i in batch b_j sampled from D_pref, where 0 ≤ j ≤ B do
2: Sample y_i ~ π_θ_j(· | p_i) ▷ Sample a response from the current model policy
3: if Action A(y_i) ≠ Action a_i then
4: ▷ Implicit pragmatic action does not match ground truth
5: Set y_li = y_i
6: else
7: Initialize Trajectory
8: Add y_i to Trajectory
9: while A(y_i) ≠ ANSWER do ▷ Simulate User Clarification
10: Clarification Answer = U(p_i, Trajectory) ▷ Simulate user response to clarification
11: Add Clarification Answer to Trajectory
12: y_i+1 = π_θ_j(· | p_i, Trajectory) ▷ Simulate next policy response
13: Add y_i+1 to Trajectory
14: if H(Trajectory outcome, Ground Truth Outcome g_i) > ε then
15: Let y_wi = Trajectory ▷ Reward acceptable trajectory outcome
16: else
17: Let y_li = Trajectory ▷ Penalize bad trajectory outcome
18: θ ← Update(θ) until convergence (eq2)
19: output π_θ_B
Explanation:
- The algorithm iterates over a specified number of batches B, sampling conversation turns from D_pref.
- On-policy Sampling: For each turn t_i, a response y_i is sampled from the current policy model π_θ_j (the model being tuned), conditioned on the prompt p_i. This on-policy sample represents how the current model would actually respond.
- Action Mismatch Check: The Action Classifier A determines the implicit action of y_i. If this implicit action does not match the ground truth action a_i (which was inferred for r_i during D_pref construction):
  - The sampled response y_i is immediately designated as the losing response y_li, replacing the initial y_li from D_pref. This penalizes the model for taking the wrong immediate action.
- Trajectory Simulation (if action matches): If the implicit action of y_i does match a_i (meaning the model took the pragmatically correct immediate action, e.g., asking a clarifying question):
  - A Trajectory is initialized with y_i.
  - A while loop simulates the subsequent turns of the conversation until the ANSWER action is taken by the policy model (i.e., until the model attempts to answer the original user request):
    - The User Simulator U generates a Clarification Answer (user response) based on the current prompt p_i and the ongoing Trajectory.
    - This Clarification Answer is added to the Trajectory.
    - The policy model generates the next policy response y_{i+1}, conditioned on the updated prompt p_i and the extended Trajectory.
    - This y_{i+1} is added to the Trajectory.
  - Trajectory Outcome Evaluation: After the ANSWER action is reached, the Task Heuristic H evaluates the Trajectory outcome (the final answer provided by the model after all clarifying turns) against the Ground Truth Outcome g_i.
    - If the Trajectory outcome is acceptable (i.e., H(Trajectory outcome, g_i) > ε for some Heuristic Tolerance ε), the entire simulated Trajectory is designated as the winning response y_wi.
    - Otherwise (if the outcome is unacceptable), the Trajectory is designated as the losing response y_li. This penalizes the model for leading the conversation to a poor final answer, even if initial actions were correct.
- Policy Update: After determining the updated y_wi and y_li, the model's parameters θ are updated using the DPO objective (Equation 1, explained next). This update step is repeated until convergence or for a fixed number of batches.

The trajectory-level content evaluation is further clarified in Figure A5.
This image is a schematic illustrating the interaction between the evaluated model and the user simulator. Given a prompt, the evaluated model interacts according to the goal information: it first asks a clarifying question, then attempts to resolve the user's request with a direct answer, and the final score is computed with the DROP F1 metric.
Figure A5 illustrates how the trajectory-level content evaluation works. The evaluated LLM interacts with a User Simulator (U) until it provides a direct answer. The candidate trajectory resolution (final answer) is then scored against the user's ground truth information-seeking goal using task-specific metrics like DROP F1 for PACIFIC.
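The following Python sketch ties the pieces above into one ACT tuning step (Algorithm 2). It reuses `classify_action` and `simulate_user` from the earlier sketch, and assumes hypothetical helpers `sample_response`, `heuristic` (the task metric H, e.g., DROP F1 or execution match), and `dpo_update`; it is a simplified illustration under these assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of one ACT self-training step (Algorithm 2).

MAX_CLARIFY_ROUNDS = 3  # assumed cap on simulated clarification rounds

def sample_response(policy, prompt, history=None) -> str:
    raise NotImplementedError("decode a response from the current policy")

def heuristic(outcome: str, goal: str) -> float:
    raise NotImplementedError("task-specific score H, e.g. DROP F1")

def dpo_update(policy, prompt: str, chosen: str, rejected: str) -> None:
    raise NotImplementedError("one DPO gradient step on the (chosen, rejected) pair")

def act_step(example, policy, epsilon=0.5):
    """`example` is one entry from the D_pref sketch shown earlier."""
    p_i, a_i, g_i = example["prompt"], example["action"], example["goal"]
    y_w, y_l = example["chosen"], example["rejected"]

    y_i = sample_response(policy, p_i)                 # on-policy sample
    if classify_action(y_i) != a_i:
        y_l = y_i                                      # wrong immediate action -> new loser
    else:
        trajectory = [y_i]
        rounds = 0
        # simulate clarification rounds until the policy attempts a direct answer
        while classify_action(trajectory[-1]) != "ANSWER" and rounds < MAX_CLARIFY_ROUNDS:
            trajectory.append("User: " + simulate_user(p_i, trajectory))
            trajectory.append(sample_response(policy, p_i, history=trajectory))
            rounds += 1
        if heuristic(trajectory[-1], g_i) > epsilon:
            y_w = "\n".join(trajectory)                # reward acceptable trajectory outcome
        else:
            y_l = "\n".join(trajectory)                # penalize bad trajectory outcome

    dpo_update(policy, prompt=p_i, chosen=y_w, rejected=y_l)
```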
4.2.3. Contrastive RL Tuning for Alignment
After constructing the up-to-date winning (y_w) and losing (y_l) pairing at turn i through the simulation process described above, the policy model is updated using the DPO training objective (Rafailov et al., 2024).
The DPO loss function is given as:
$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid p)}{\pi_{\mathrm{ref}}(y_w \mid p)} - \beta \log \frac{\pi_\theta(y_l \mid p)}{\pi_{\mathrm{ref}}(y_l \mid p)} \right) \right]
$
Explanation of Symbols:
- $\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})$: The Direct Preference Optimization loss.
- $\pi_\theta$: The policy model being trained, parameterized by θ.
- $\pi_{\mathrm{ref}}$: The reference policy model, typically the SFT model or the initial policy before DPO. It serves as a regularization term, preventing the policy from drifting too far from the initial model.
- $\mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}}$: Expectation over (prompt, winning, losing) triplets sampled from the dataset D.
- p: The prompt (context) for generation. This is a concatenation of task information and conversation history, where x_i are user utterances and y_i are system utterances.
- y_w: The winning response or trajectory. This is the response/trajectory that is preferred.
- y_l: The losing response or trajectory. This is the response/trajectory that is dispreferred.
- σ: The sigmoid function, σ(z) = 1/(1 + e^{-z}). It squashes the input value into a range between 0 and 1, effectively representing a probability or preference score.
- β: A hyperparameter that controls the strength of the regularization. It regularizes the ratio between the current policy π_θ and the reference policy π_ref. A larger β means stronger regularization.
- $\beta \log \frac{\pi_\theta(y \mid p)}{\pi_{\mathrm{ref}}(y \mid p)}$: This term measures the log probability ratio of a response y under the current policy relative to the reference policy. It essentially quantifies how much the current policy prefers a response compared to the reference. In DPO, this ratio is proportional to the implicit reward of the response.

The goal of DPO is to minimize this loss, which effectively maximizes the log-likelihood of the winning responses and minimizes the log-likelihood of the losing responses, weighted by their relative preferences.
The gradient of this objective with respect to the policy parameters θ is given as:
$
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\beta \, \mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}} \left[ \sigma\left( \hat{R}_\theta(p, y_l) - \hat{R}_\theta(p, y_w) \right) \left[ \nabla_\theta \log \pi_\theta(y_w \mid p) - \nabla_\theta \log \pi_\theta(y_l \mid p) \right] \right]
$
where $\hat{R}_\theta(p, y) = \beta \log \frac{\pi_\theta(y \mid p)}{\pi_{\mathrm{ref}}(y \mid p)}$.
Explanation of Symbols:
- $\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})$: The gradient of the DPO loss with respect to the policy parameters θ. This is what is used to update the model weights during training.
- $\hat{R}_\theta(p, y)$: An implicitly defined reward for a response y given prompt p under the current policy relative to the reference policy. This reward is derived directly from the policy log probabilities and the hyperparameter β.
- $\sigma(\hat{R}_\theta(p, y_l) - \hat{R}_\theta(p, y_w))$: This term represents the sigmoid of the difference in implicit rewards between the losing and winning responses. It acts as a weighting factor. If the model already correctly ranks y_w higher than y_l (i.e., $\hat{R}_\theta(p, y_w) > \hat{R}_\theta(p, y_l)$), the difference is negative, the sigmoid is small, and the gradient update is small. If the model incorrectly ranks y_l higher than or close to y_w, this term is larger, leading to a stronger gradient and a larger update.
- $\nabla_\theta \log \pi_\theta(y_w \mid p) - \nabla_\theta \log \pi_\theta(y_l \mid p)$: This is the difference between the gradients of the log probabilities of the winning and losing responses. The DPO objective pushes the model to increase the log probability of the winning response and decrease the log probability of the losing response.

The intuition is that the gradient updates the policy to increase the likelihood of winning responses (y_w) and decrease the likelihood of losing responses (y_l). The magnitude of this update is weighted by how "incorrectly" the current policy ranks the paired responses, ensuring that more egregious errors in preference are corrected more strongly.
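As a concrete reference, here is a minimal PyTorch-style sketch of the DPO loss above, assuming the per-sequence log-probabilities log π(y|p) have already been computed (summed over tokens) under both the policy being trained and the frozen reference model; variable names are illustrative.

```python
# Minimal sketch of the DPO objective. Inputs are per-sequence log-probabilities
# for winning (w) and losing (l) responses, shape (batch,).

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,
             policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor,
             ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # implicit rewards R_hat = beta * log(pi_theta / pi_ref)
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # loss = -log sigmoid(R_hat_w - R_hat_l), averaged over the batch
    return -F.logsigmoid(reward_w - reward_l).mean()
```

Calling `loss.backward()` on the returned value reproduces the gradient weighting discussed above: pairs the current policy already ranks correctly contribute little, while badly mis-ranked pairs dominate the update.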
5. Experimental Setup
ACT is evaluated as a tuning approach for three complex conversational information-seeking tasks to assess its ability to learn optimized implicit selection for agent-side clarification question asking in a sample-efficient manner. The base model for tuning experiments is a version of Mistral 7B instruction-tuned on UltraChat and aligned to human preferences on UltraFeedback (i.e., Zephyr 7B).
5.1. Datasets
The experiments investigate three mixed-initiative conversation tasks where a user interacts with an assistant to retrieve information. In each task setup, the user's query may be underspecified, and the assistant's response can either be a clarifying question or a direct answer. For each task, initial rejected responses for the preference dataset construction (Algorithm 1) are synthesized using Gemini Ultra as the conditional generation model (M).
-
PACIFIC: Conversational QA for Tabular Data
- Description:
PACIFIC is a task for proactive conversational question answering grounded on a mixture of tabular and textual financial data (Deng et al., 2022). This involves generating answers that could be correct words from a given span, from multiple spans, or providing a correct arithmetic expression.
- Characteristics: Requires reasoning over structured (tabular) and unstructured (textual) information, often involving numerical calculations. Ambiguity can arise from underspecified financial terms or data points.
- Evaluation: The official evaluation uses
DROP F1, a numeracy-focused token overlap metric.
- Description:
-
Abg-CoQA: Conversational QA for Machine Reading Comprehension
- Description:
Abg-CoQA is a conversational question answering dataset focused on disambiguation in machine reading comprehension (Guo et al., 2021).
- Characteristics: The task involves answering questions based on provided text passages, where questions might be ambiguous and require clarification from the agent.
- Evaluation: Since there are no arithmetic expressions,
embedding-based semantic distance with SentenceBERT is used as an evaluation metric to measure QA performance more flexibly.
- Description:
-
AmbigSQL: Ambiguous Conversational Text-to-SQL Generation
-
Description:
AmbigSQL is a novel task proposed in this paper for SQL-grounded conversational disambiguation. It was constructed by systematically perturbing unambiguous queries from Spider (Yu et al., 2018), a popular single-turn text-to-SQL benchmark. This perturbation results in paired training examples suitable for contrastive RL tuning.
Motivation: The motivation for
AmbigSQL is the idea that disambiguation can significantly improve task performance in complex code generation tasks. The authors found a performance gap of up to 45.8% in SQL generation with and without clarifications (see Table A14), highlighting the necessity of clarifying questions.
Construction:
AmbigSQL introduces three types of ambiguous information requests by prompting an LLM (Gemini Ultra 1.0) to synthesize perturbed versions of Spider queries along with appropriate clarifying questions:
- Ambiguous Requested Population: (e.g., "Which ones who live in the state of Indiana?")
- Ambiguous Presentation of Results: (e.g., "Show name, country, age for all singers ordered by age")
-
For ambiguous queries, conversations are associated with "ground truth" states at two timesteps: (ambiguous user request, synthesized clarification question , action
CLARIFY) and (disambiguated user request, correct ground truthSQLquery , actionANSWER). -
Evaluation: Each trajectory is evaluated by whether the final proposed
SQLquery matches theground truth query's execution result. -
The following are the results from Table A10 of the original paper:
| | Train | Dev | Test |
| --- | --- | --- | --- |
| Num. Unambiguous Requests | 7,000 | 1,034 | 1,034 |
| Num. Ambiguous Requests | 7,000 | 1,034 | 1,034 |
| Num. Unique Schemas | 1,056 | 145 | 145 |
| Types of Ambiguity | 3 | 3 | 3 |
-
The table above provides an overview of the AmbigSQL dataset, showing the number of unambiguous and ambiguous requests, unique database schemas, and types of ambiguity across training, development, and test sets.
* The following are the results from Table A12 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td colspan="4">[Database Schema Omitted] The target SQL query is the following:</td>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>SELECT professional_id , last_name , cell_number FROM Professionals</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="4">WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,</td>
</tr>
<tr>
<td colspan="4">T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON</td>
</tr>
<tr>
<td colspan="4">T1.professional_id = T2.professional_id</td>
</tr>
<tr>
<td colspan="4">GROUP BY T1.professional_id HAVING count(*) > 2</td>
</tr>
<tr>
<td colspan="4">Here is a clear request that would correspond to this SQL query:</td>
</tr>
<tr>
<td colspan="4">Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List</td>
</tr>
<tr>
<td colspan="4">his or her id, last name and cell phone."</td>
</tr>
<tr>
<td colspan="4">Here is the same request converted into an ambiguous format by underspecifying the target columns:</td>
</tr>
<tr>
<td colspan="4">"Which ones live in the state of Indiana or have done treatment on more than 2 treatments?"</td>
</tr>
<tr>
<td colspan="4">Here is an appropriate clarifying question to recover the clear request from the ambiguous request:</td>
</tr>
<tr>
<td colspan="4">"Are you asking about the Professionals?"</td>
</tr>
</tbody>
</table></div>
Table A12 shows an in-context example for creating ambiguous information requests where the target population is underspecified. It demonstrates a clear request, its ambiguous version, and an appropriate clarifying question.
* The following are the results from Table A13 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td colspan="4">[Database Schema Omitted] The target SQL query is the following:</td>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>SELECT professional_id , last_name , cell_number FROM Professionals</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="4">WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,</td>
</tr>
<tr>
<td colspan="4">T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON</td>
</tr>
<tr>
<td colspan="4">T1.professional_id = T2.professional_id</td>
</tr>
<tr>
<td colspan="4">GROUP BY T1.professional_id HAVING count(*) > 2</td>
</tr>
<tr>
<td colspan="4">Here is a clear request that would correspond to this SQL query:</td>
</tr>
<tr>
<td colspan="4">Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List</td>
</tr>
<tr>
<td colspan="4">his or her id, last name and cell phone."</td>
</tr>
<tr>
<td colspan="4">Here is the same request converted into an ambiguous format by underspecifying the target columns:</td>
</tr>
<tr>
<td colspan="4">Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?"</td>
</tr>
<tr>
<td colspan="4">Here is an appropriate clarifying question to recover the clear request from the ambiguous request:</td>
</tr>
<tr>
<td colspan="4">"Which information of the professionals do you want to know?"</td>
</tr>
</tbody>
</table></div>
Table A13 presents another in-context example, this time focusing on ambiguity in the requested columns, illustrating a clear request, its ambiguous counterpart, and the corresponding clarifying question.
* The following are the results from Table A14 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td>Model</td>
<td>Ambiguous Request Execution Match</td>
<td>Execution Match with Clarification Turns</td>
<td></td>
</tr>
</thead>
<tbody>
<tr>
<td>Gemini Pro</td>
<td>28.5</td>
<td>68.7</td>
<td></td>
</tr>
<tr>
<td>Gemini Ultra</td>
<td>31.2</td>
<td>77.0</td>
<td></td>
</tr>
</tbody>
</table></div>
Table A14 shows the impact of ambiguity on SQL generation in AmbigSQL. LLMs (Gemini Pro and Gemini Ultra) struggle to produce correct SQL from ambiguous requests, but their performance significantly improves when disambiguation turns are included in the prompt.
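For intuition on how execution match is computed, here is an illustrative Python sketch that runs a predicted and a gold SQL query against the same database and compares their result sets. The sqlite3 setup is an assumption for illustration; the actual evaluation harness may differ.

```python
# Illustrative execution-match check for text-to-SQL evaluation.

import sqlite3
from collections import Counter

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as a miss
    finally:
        conn.close()
    # compare as order-insensitive multisets of rows
    return Counter(pred_rows) == Counter(gold_rows)
```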
* The following are the results from Table A15 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td colspan="3">Examples from the AmbigSQL test set using the "concert_singer" database. Black tex: context provided to an LLM as a prompt. Blue text: ground-truth single-turn response. Purple text: target trajectory which should be achieved through simulation.</td>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt:</td>
<td colspan="2">Resulting Ground Truth Trajectory:</td>
</tr>
<tr>
<td>"Database schema: (...)</td>
<td colspan="2">Assistant: SELECT Name , Song_Name FROM singer ORDER BY Age DESC</td>
</tr>
<tr>
<td>User: show the names and song names of singers sorted by age from the oldest to the youngest"</td>
<td colspan="2"></td>
</tr>
<tr>
<td>"Database schema: (...)</td>
<td>Assistant: Are you asking about the songs or the singers?</td>
<td>Assistant: Are you asking about the songs or the singers?</td>
</tr>
<tr>
<td>User: show details about singers ordered by age from the oldest to the youngest"</td>
<td>User: The singers.</td>
<td>User: The singers.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Assistant: SELECT Name , Song_Name FROM singer ORDER BY Age DESC</td>
</tr>
</tbody>
</table></div>
Table A15 provides examples from the AmbigSQL test set, demonstrating how an unambiguous request directly leads to an SQL query, while an ambiguous request triggers a clarification question and then a subsequent SQL query.
* The following are the results from Table A16 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<td colspan="3">Examples from the AmbigSQL test set using the "farm" database. Black text: context provided to an LLM as a prompt. Blue text: ground-truth single-turn response. Purple text: target trajectory which should be achieved through simulation.</td>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt:</td>
<td colspan="2">Resulting Ground Truth Trajectory:</td>
</tr>
<tr>
<td>"Database schema: (...)</td>
<td colspan="2">Assistant: SELECT * FROM farm WHERE location_id IN ( SELECT id FROM location WHERE region = 'South')</td>
</tr>
<tr>
<td>User: show all the information of the farm in the south"</td>
<td colspan="2"></td>
</tr>
<tr>
<td>"Database schema: (...)</td>
<td>Assistant: Are you asking about the location or the items?</td>
<td>Assistant: Are you asking about the location or the items?</td>
</tr>
<tr>
<td>User: show details about the farm in the south"</td>
<td>User: The farm location.</td>
<td>User: The farm location.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Assistant: SELECT * FROM farm WHERE location_id IN ( SELECT id FROM location WHERE region = 'South')</td>
</tr>
</tbody>
</table></div>
Table A16 shows another set of AmbigSQL examples, reinforcing the pattern of direct SQL generation for clear requests and clarification followed by SQL for ambiguous ones.
5.2. Evaluation Setup
The evaluation focuses on ACT's ability to reason about ambiguity to achieve conversational goals, assessed along two primary dimensions: Agent task performance and Implicit ambiguity recognition.
5.2.1. Agent Task Performance
This dimension evaluates how well ACT improves the multi-turn task completion capabilities of the agent.
- Turn-level evaluation:
- The model's immediate response is compared to the
ground truth utterancegiven in response to the user's query. - Task-specific heuristics (defined in Section 4.1) are used.
- This mirrors the static single-turn evaluations used in
PACIFICandAbg-CoQA's original proposals.
- The model's immediate response is compared to the
- Multi-turn evaluation (Trajectory outcomes):
- To assess
multi-turn capabilities, amulti-turn evaluation schemeis proposed fortrajectory outcomesacross all three tasks. - If the
sampled responsefrom anLLMis aclarifying question, auser responseis simulated (using ), and another response is sampled from theLLMuntil it attempts toanswer the original query. - The final outcome of this simulated trajectory is evaluated against the
user's ground truth information-seeking goal. - (Action Classifier) and (User Simulator) are used for simulation as described in Section 3.2.2.
- Post-Clarify F1 / Execution Match: For
PACIFICandAmbigSQL,task performanceis also computed specifically on simulated responses where the model has previously askedclarifying questions. This provides a more fine-grained measure of the model's ability to reason about its own clarifications.
- To assess
5.2.2. Implicit Ambiguity Recognition
This dimension examines whether LLMs can implicitly recognize and reason about ambiguity.
- Metric:
Dialogue act accuracy, primarily Macro F1. - Assumption: Given access to
ground-truth ambiguity labels, if a user's request is truly ambiguous, the model should generate a clarifying question; otherwise, it should directly provide the requested information. This directly measures the model's ability to correctly classify the dialogue act required by the context. Macro F1 is chosen due to potentially imbalanced classes (more unambiguous than ambiguous cases) in PACIFIC and Abg-CoQA.
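A minimal sketch of this dialogue-act scoring, assuming gold ambiguity labels and predicted actions from an action classifier (the toy label lists below are illustrative):

```python
# Dialogue-act Macro F1 for implicit ambiguity recognition (toy example).

from sklearn.metrics import f1_score

gold_actions = ["ANSWER", "CLARIFY", "ANSWER", "CLARIFY", "ANSWER"]
pred_actions = ["ANSWER", "ANSWER", "ANSWER", "CLARIFY", "ANSWER"]

macro_f1 = f1_score(gold_actions, pred_actions,
                    labels=["CLARIFY", "ANSWER"], average="macro")
print(f"Dialogue act Macro F1: {macro_f1:.3f}")
```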
5.2.3. Detailed Evaluation Metrics (from Appendix D)
Action-level evaluation metrics:
- Accuracy:
- Conceptual Definition: Measures the percentage of correctly classified implicit actions (e.g., whether the model correctly identified if it should
CLARIFY or ANSWER). - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correct action predictions}}{\text{Total number of predictions}} $
- Symbol Explanation:
Number of correct action predictions: The count of instances where the model's predicted dialogue act matches the ground truth dialogue act. Total number of predictions: The total number of dialogue acts evaluated.
- Conceptual Definition: Measures the percentage of correctly classified implicit actions (e.g., whether the model correctly identified if it should
- Macro F1:
- Conceptual Definition: An unweighted average of the
F1 score for each individual action class (CLARIFY, ANSWER). This is particularly useful when class distributions are imbalanced, as it treats all classes equally, preventing dominant classes from skewing the overall metric. - Mathematical Formula: $ \text{Macro F1} = \frac{1}{|C|} \sum_{c \in C} \text{F1}_c $ where $ \text{F1}_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c} $, $ \text{Precision}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Positives}_c} $, and $ \text{Recall}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Negatives}_c} $
- Symbol Explanation:
- |C|: The total number of unique action classes (e.g., 2 for [CLARIFY, ANSWER]).
- c: An individual action class.
- F1_c: The F1 score calculated for class c.
- Precision_c: The proportion of correctly predicted positive instances for class c out of all instances predicted as positive for class c.
- Recall_c: The proportion of correctly predicted positive instances for class c out of all actual positive instances for class c.
- True Positives_c: Number of instances of class c correctly predicted as class c.
- False Positives_c: Number of instances not of class c incorrectly predicted as class c.
- False Negatives_c: Number of instances of class c incorrectly predicted as not class c.
- : The total number of unique action classes (e.g., 2 for [
- Conceptual Definition: An unweighted average of the
Content-level evaluation metrics:
- Turn-level DROP F1:
- Conceptual Definition: Measures the
F1 overlapbetween the model's immediate response and theground truth utterancefor a single turn, specifically adapted forDiscrete Reasoning Over Paragraphs (DROP)tasks involving numeric answers and span extraction. It's designed to handle complex answers that might involve multiple spans or arithmetic expressions. - Mathematical Formula: (The paper refers to Deng et al. (2022) and Dua et al. (2019) for
DROP F1, but doesn't provide the explicit formula. TheDROP F1metric is a token-level F1 score calculated after normalizing numbers and handling multiple correct answers. It's often defined as the maximumF1over all possible correct answers.) $ \text{DROP F1} = \max_{\text{GT} \in \text{GroundTruthAnswers}} \left( \frac{2 \cdot \text{Precision}(\text{Pred}, \text{GT}) \cdot \text{Recall}(\text{Pred}, \text{GT})}{\text{Precision}(\text{Pred}, \text{GT}) + \text{Recall}(\text{Pred}, \text{GT})} \right) $ WherePrecisionandRecallare calculated at the token level between the predicted answer and the ground truth answer after normalization steps (e.g., number normalization, ignoring punctuation). - Symbol Explanation:
Pred: The predicted answer string by the model.GT: A singleground truth answerstring from the set of possible correct answers.GroundTruthAnswers: The set of all acceptableground truth answersfor a given question.Precision(Pred, GT): Token-level precision betweenPredandGT.Recall(Pred, GT): Token-level recall betweenPredandGT.
- Conceptual Definition: Measures the
- Trajectory-level DROP F1:
- Conceptual Definition: Measures the
DROP F1 of the final answer obtained at the end of a simulated multi-turn conversation trajectory (i.e., after all clarifying questions and subsequent user responses) compared to the ground truth information goal. This assesses the model's ability to achieve the ultimate task goal through conversational interaction.
- Conceptual Definition: Measures the
- Post-Clarification DROP F1:
- Conceptual Definition: A subset of
Trajectory-level DROP F1 specifically for cases where the model did ask clarifying questions during the simulated trajectory. It evaluates the quality of the final answer after the model has engaged in disambiguation, measuring the effectiveness of its clarification strategy.
- Conceptual Definition: A subset of
- Turn-level Similarity (for Abg-CoQA):
- Conceptual Definition: Measures the
semantic similarity between the model's immediate response and the ground truth response for a single turn, typically using embedding-based similarity (e.g., cosine similarity of SentenceBERT embeddings). This is useful when exact token matches are not expected, but semantic equivalence is desired.
- Conceptual Definition: Measures the
- Trajectory-level Similarity (for Abg-CoQA):
- Conceptual Definition: Measures the
semantic similarity between the final answer of a simulated multi-turn conversation trajectory and the ground truth information goal, using embedding-based similarity. Analogous to Trajectory-level DROP F1 but for tasks where semantic meaning is more important than exact factual extraction.
- Conceptual Definition: Measures the
- Trajectory-level Execution Match (for AmbigSQL):
- Conceptual Definition: The percentage of
multi-turn conversation trajectories where the final generated SQL query, when executed against the database, produces the correct result as defined by the ground truth SQL query. This is a strict, objective measure of correctness for text-to-SQL tasks.
- Conceptual Definition: The percentage of
- Post-Clarification Execution Match (for AmbigSQL):
- Conceptual Definition: The percentage of
multi-turn conversation trajectories that included clarification turns and resulted in a SQL query that produced the correct execution result. This highlights the impact of clarification on achieving the correct SQL output.
- Conceptual Definition: The percentage of
5.3. Baselines
The paper compares ACT against various prompting and tuning baselines.
5.3.1. Prompting Baselines
These baselines use frontier LLMs with different in-context learning (ICL) strategies. For all prompting baselines, 10 conversations are used as in-context examples.
- Models:
Gemini 1.5 Pro,Gemini 1.5 Flash,Claude 3.5 Sonnet,Claude 3.0 Haiku. - Prompting Frameworks:
- Standard Prompting: Uses the same instruction formatting as used for
SFT, providing structured examples.-
The following are the results from Table A17 of the original paper:
You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted] User: How much does the company expect to make contributions to the defined plans during the next fiscal year? Assistant: ['What kind of defined plans are you asking about?'] User: The defined benefit plans and the defined contribution plan respectively. Assistant: ['`5.1 million', '`0.6 million'] User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year? Assistant: 0.6/5.1 = 11.76 User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate? Assistant: ['What kind of change are you asking about?'] User: The pension. Assistant: ['49,361'] User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease? Assistant: 49,361 - (39,145) = 88506 User: How much would the pension change if there is a 1% increase and a 1% decrease respectively? Assistant: ['Which rate of increase and decrease are you talking about?'] User: The future salary growth. Assistant: ['`7,572', '`(6,919)'] User: What is the difference between them? Assistant: 7,572 - (6,919) = 14491
-
- Standard Prompting: Uses the same instruction formatting as used for
Table A17 shows an example of Standard in-context prompting for PACIFIC. The purple text (first line) is the system instruction, and the blue text shows the model's expected generation at inference time. The example provides a sequence of user questions and assistant responses, demonstrating both direct answers and clarifying questions.
2. **Chain-of-Thought (CoT) Prompting (Wei et al., 2022):** Integrates `CoT` reasoning into the standard prompt format, asking the `LLM` to first produce a reasoning chain (`e.g.`, stating whether the context is ambiguous) before generating the response.
* The following are the results from Table A18 of the original paper:
You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted]
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question is not ambiguous. Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question was ambiguous. Assistant: ['What kind of change are you asking about?']
User: The pension.
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question is not ambiguous. Assistant: ['49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question is not ambiguous. Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question was ambiguous.
Assistant: ['Which rate of increase and decrease are you talking about?']
Table A18 provides an example of Chain-of-Thought prompting for PACIFIC. The model is instructed to first provide Reasoning about ambiguity before generating an Assistant response, illustrating how CoT is integrated into the dialogue flow.
3. **Proactive MIPrompt (Deng et al., 2023b; Chen et al., 2023b):** A combination of `mixed-initiative prompting` and `proactive prompting`. The `LLM` conditions on a possible set of actions and uses interweaved natural language instructions describing which actions correspond to dialogue turns.
* The following are the results from Table A19 of the original paper:
You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity.
[Table and Passage Omitted]
User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['What kind of defined plans are you asking about?']
User: The defined benefit plans and the defined contribution plan respectively.
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: ['5.1 million', '0.6 million']
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['What kind of change are you asking about?']
User: The pension.
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: ['(39,145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
The user's last question was unambiguous. The Assistant directly answers the question.
Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Actions: ["Directly Answer", "Ask a Clarification Question"]
Prompt: Given the task background and the conversation history, please use appropriate actions to generate the response.
Response: The user's last question was ambiguous. The Assistant asks a clarifying question.
Assistant: ['Which rate of increase and decrease are you talking about?']
Table A19 shows an example of Proactive Mixed-Initiative prompting for PACIFIC. This style explicitly provides the model with available actions and describes the action taken in each turn, guiding the LLM to generate responses with specific pragmatic intents.
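To make the mechanics of these ICL baselines concrete, here is a minimal sketch of how a Standard-style prompt might be assembled. The `Turn` dataclass, the `build_prompt` helper, and the exact formatting strings are illustrative assumptions rather than the authors' released code; the CoT and Proactive MIPrompt variants differ mainly in what extra text (per-turn instructions, `Reasoning:` lines, or action descriptions) is interleaved between turns.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str  # "User" or "Assistant"
    text: str

SYSTEM_INSTRUCTION = (
    "You are an Assistant answering questions from a User. You should either "
    "attempt to answer the question or ask a clarifying question if there is "
    "any ambiguity."
)

def render_conversation(turns: List[Turn]) -> str:
    """Render one conversation as alternating 'User:' / 'Assistant:' lines."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)

def build_prompt(icl_conversations: List[List[Turn]], history: List[Turn]) -> str:
    """Standard-style ICL prompt: system instruction, example conversations,
    then the live dialogue history ending with an open 'Assistant:' turn."""
    blocks = [SYSTEM_INSTRUCTION]
    blocks += [render_conversation(conv) for conv in icl_conversations]
    blocks.append(render_conversation(history) + "\nAssistant:")
    return "\n\n".join(blocks)

# Example: a live history whose last user turn is ambiguous.
history = [Turn("User", "How much would the pension change if there is a 1% "
                        "increase and a 1% decrease respectively?")]
print(build_prompt(icl_conversations=[], history=history))
```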
5.3.2. Tuning Baselines
These approaches involve fine-tuning the model.
- **Supervised Fine-Tuning (SFT):** The model is fine-tuned on the ground-truth responses from each dataset's training split; this is the standard method for adapting LLMs to a specific task.
- **Iterative Reasoning Preference Optimization (IRPO):** An on-policy DPO variant that has shown effectiveness on reasoning tasks. It is evaluated on PACIFIC and AmbigSQL, the two quantitative-reasoning tasks.
- **DPO-Dist (DPO distillation):** An off-policy DPO approach in which winning responses are sampled from a higher-capacity model and losing responses are generated from a lower-capacity model (or by prompting a model to produce a deliberately weaker response), distilling preferences from powerful LLMs into smaller ones. The paper reports DPO-Dist results in Appendix B, using Gemini Pro vs. Flash and Claude Sonnet vs. Haiku to construct the preference pairs. The preference-based baselines, and ACT itself, all build on the DPO objective shown below.
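For context, the standard DPO objective for a prompt $x$ with preferred (winning) response $y_w$ and rejected (losing) response $y_l$, policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, and temperature $\beta$ is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The approaches differ mainly in how the pair $(y_w, y_l)$ is constructed: IRPO builds pairs on-policy, DPO-Dist takes winners from a stronger model and losers from a weaker one, and ACT derives the pair from contrasting dialogue actions (clarify vs. answer), refreshed with on-policy samples.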
6. Results & Analysis
The experimental results demonstrate ACT's effectiveness in data-efficient tuning scenarios across various conversational tasks, especially when dealing with ambiguity. The evaluation covers both action-level performance (how well the model recognizes ambiguity and takes the correct conversational action) and content-level performance (how well it achieves the ultimate task goal).
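As a concrete illustration of the action-level metric, the Macro F1 over implicit dialogue actions (for example, CLARIFY vs. ANSWER) can be computed as below. The label names and the use of scikit-learn are assumptions for illustration; the paper's own evaluation scripts may differ.

```python
from sklearn.metrics import f1_score

# Gold and predicted dialogue actions for each turn (illustrative labels).
gold = ["ANSWER", "CLARIFY", "ANSWER", "ANSWER", "CLARIFY"]
pred = ["ANSWER", "CLARIFY", "CLARIFY", "ANSWER", "ANSWER"]

# Macro F1 averages the per-class F1 scores, so the rarer CLARIFY class
# counts as much as the dominant ANSWER class.
macro_f1 = f1_score(gold, pred, average="macro")
print(f"Action-level Macro F1: {macro_f1:.3f}")
```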
6.1. Core Results Analysis
6.1.1. Conversational QA with Tabular Grounding (PACIFIC)
The following are the results from Table 1 of the original paper:
Action-level metric: Macro F1; content-level metrics: Turn F1, Traj. F1, Post-Clarify F1.

| Base Model | Approach | Conversations | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini Pro | Standard ICL | 10 | 81.4 | 59.7 | 58.7 | 49.7 |
| Claude Sonnet | Standard ICL | 10 | 71.9 | 43.7 | 42.0 | 28.5 |
| Gemini Pro | SFT | 50 | 71.2 | 51.8 | 45.7 | 9.9 |
| Gemini Pro | SFT | 100 | 75.2 | 64.3 | 54.6 | 8.5 |
| Gemini Pro | SFT | 250 | 88.0 | 67.4 | 59.3 | 10.2 |
| Zephyr 7B-β | SFT | 50 | 69.0 | 57.8 | 61.3 | 43.5 |
| Zephyr 7B-β | IRPO | 50 | 67.7 | 59.1 | 56.7 | 34.4 |
| Zephyr 7B-β | ACT (ours) | 50 | 82.2 | 62.8 | 61.9 | 57.2 |
| Zephyr 7B-β | SFT | 100 | 82.3 | 58.6 | 60.3 | 49.9 |
| Zephyr 7B-β | IRPO | 100 | 84.5 | 60.4 | 55.2 | 38.2 |
| Zephyr 7B-β | ACT (ours) | 100 | 86.0 | 65.0 | 62.0 | 57.4 |
| Zephyr 7B-β | SFT | 250 | 86.9 | 65.1 | 63.3 | 56.7 |
| Zephyr 7B-β | IRPO | 250 | 85.4 | 64.9 | 58.4 | 40.3 |
| Zephyr 7B-β | ACT (ours) | 250 | 89.6 | 68.1 | 65.7 | 62.0 |
Table 1 shows experimental results on PACIFIC's public evaluation set. ACT consistently achieves the strongest performance across all data-efficient conditions (50, 100, 250 conversations) and all metrics (Macro F1, Turn F1, Traj. F1, Post-Clarify F1), outperforming SFT and IRPO.
- Key Findings:
  - ACT significantly outperforms SFT and IRPO in all data-efficient settings.
  - With 50 conversations (the most data-scarce setting), ACT achieves an action-level Macro F1 of 82.2, a 19.1% relative improvement over SFT's 69.0. On multi-turn task performance (Traj. F1), ACT reaches 61.9, roughly a 35% relative improvement over Gemini Pro SFT's 45.7 (see the quick check after this list).
  - In these limited-data settings, ACT often matches or outperforms frontier LLMs used with in-context learning (e.g., Gemini Pro and Claude Sonnet), despite the ACT-tuned models seeing zero in-context examples at inference time.
  - The Post-Clarify F1 metric, which measures performance after a clarification turn, is notably higher for ACT (e.g., 57.2 vs. SFT's 43.5 at 50 conversations), indicating ACT's superior ability to reason about and make use of its own clarification questions.
- Conclusion: The results highlight the importance of on-policy learning and multi-turn trajectory simulation for improving multi-turn goal completion.
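The relative improvements quoted above are simple ratios; a quick check against the Table 1 numbers:

```python
# Relative improvement = (new - baseline) / baseline, using Table 1 values.
macro_f1_gain = (82.2 - 69.0) / 69.0   # ACT vs. Zephyr SFT at 50 conversations
traj_f1_gain = (61.9 - 45.7) / 45.7    # ACT vs. Gemini Pro SFT at 50 conversations
print(f"{macro_f1_gain:.1%}, {traj_f1_gain:.1%}")  # -> 19.1%, 35.4%
```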
6.1.2. Abg-CoQA: Conversational QA for Machine Reading Comprehension
The following are the results from Table 2 of the original paper:
Action-level metric: Macro F1; content-level metrics: Turn Similarity, Traj. Similarity.

| Base Model | Approach | Conversations | Macro F1 ↑ | Turn Similarity ↑ | Traj. Similarity ↑ |
| --- | --- | --- | --- | --- | --- |
| Gemini Pro | Standard ICL | 10 | 55.5 | 67.0 | 72.2 |
| Claude Sonnet | Standard ICL | 10 | 66.0 | 50.1 | 54.3 |
| Zephyr 7B-β | SFT | 50 | 44.6 | 53.3 | 64.2 |
| Zephyr 7B-β | ACT (ours) | 50 | 52.3 | 66.2 | 68.8 |
| Zephyr 7B-β | SFT | 100 | 52.6 | 63.1 | 69.4 |
| Zephyr 7B-β | ACT (ours) | 100 | 51.1 | 69.5 | 71.4 |
| Zephyr 7B-β | SFT | 250 | 53.5 | 64.0 | 66.2 |
| Zephyr 7B-β | ACT (ours) | 250 | 53.3 | 72.5 | 75.1 |
Table 2 presents the Abg-CoQA test set evaluation results.
- Key Findings:
  - ACT achieves the strongest task-specific performance (Turn Similarity and Traj. Similarity) across all three data settings.
  - With 50 conversations, ACT substantially improves Turn Similarity (66.2 vs. SFT's 53.3) and Traj. Similarity (68.8 vs. SFT's 64.2).
  - Even when SFT slightly outperforms ACT on Macro F1 (implicit action recognition) at 100 and 250 conversations, ACT consistently leads on multi-turn goal completion (Turn Similarity and Traj. Similarity).
- Conclusion: ACT's on-policy sampling and multi-turn simulation improve multi-turn reasoning and task performance even on tasks where action recognition is competitive with simpler methods. The paper notes that Abg-CoQA's low inter-annotator agreement may affect the action-level metrics.
6.1.3. AmbigSQL: Ambiguous Conversational Text-to-SQL Generation
The following are the results from Table 3 of the original paper:
Action-level metrics: Accuracy, Macro F1; content-level metrics: Execution Match, PC Execution Match.

| Base Model | Approach | Conversations | Accuracy ↑ | Macro F1 ↑ | Execution Match ↑ | PC Execution Match ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini Pro | Standard ICL | 10 | 72.1 | 70.9 | 63.5 | 75.2 |
| Claude Sonnet | Standard ICL | 10 | 68.5 | 63.8 | 66.5 | 72.4 |
| Zephyr 7B-β | SFT | 50 | 77.4 | 77.4 | 21.9 | 13.9 |
| Zephyr 7B-β | IRPO | 50 | 91.0 | 91.0 | 27.8 | 30.8 |
| Zephyr 7B-β | ACT (ours) | 50 | 80.8 | 80.7 | 43.6 | 38.1 |
| Zephyr 7B-β | SFT | 100 | 97.2 | 97.2 | 43.3 | 34.3 |
| Zephyr 7B-β | IRPO | 100 | 96.2 | 96.1 | 45.0 | 37.0 |
| Zephyr 7B-β | ACT (ours) | 100 | 99.2 | 99.3 | 48.0 | 49.6 |
| Zephyr 7B-β | SFT | 250 | 99.8 | 99.7 | 51.0 | 50.7 |
| Zephyr 7B-β | IRPO | 250 | 97.0 | 97.1 | 49.7 | 45.6 |
| Zephyr 7B-β | ACT (ours) | 250 | 99.9 | 99.8 | 52.3 | 53.0 |
| Zephyr 7B-β | SFT | 14,000 (All) | 99.8 | 99.8 | 63.1 | 60.4 |
Table 3 presents the AmbigSQL test set evaluation results.
- Key Findings:
  - ACT achieves the strongest task performance (Execution Match and PC Execution Match) within each data setting, especially when data is scarce.
  - With 50 conversations, ACT reaches 43.6 Execution Match and 38.1 PC Execution Match, substantially outperforming SFT (21.9 and 13.9) and IRPO (27.8 and 30.8).
  - Although frontier LLMs (Gemini Pro, Claude Sonnet) achieve strong downstream SQL performance with ICL, Zephyr tuned with ACT shows impressive gains for a much smaller model.
  - PC Execution Match (performance after clarification) shows particularly large gains for ACT in low-data regimes, indicating that ACT leverages clarifications effectively for correct SQL generation.
- Conclusion: ACT yields significant relative improvements in multi-turn task performance for text-to-SQL. The paper notes that SQL generation benefits heavily from model scale, suggesting that applying ACT to larger models could yield even greater performance. A sketch of the Execution Match computation follows this list.
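Execution Match compares the result of executing the predicted SQL with the result of executing the gold SQL on the same database. The sketch below is one minimal way to implement such a check, assuming a SQLite database and order-insensitive result comparison; the paper's exact matching rules may differ.

```python
import sqlite3
from typing import List, Tuple

def run_query(conn: sqlite3.Connection, sql: str) -> List[Tuple]:
    """Execute a query and return all result rows."""
    return conn.execute(sql).fetchall()

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """True if the predicted and gold SQL yield the same (order-insensitive) result set."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = run_query(conn, pred_sql)
        except sqlite3.Error:
            return False  # unexecutable predictions count as mismatches
        gold_rows = run_query(conn, gold_sql)
    finally:
        conn.close()
    # Compare as sorted multisets so row order does not matter.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```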
6.1.4. ACT In-The-Wild: Learning Without Dialogue Action Supervision
The following are the results from Table 4 of the original paper:
Action-level metric: Macro F1; content-level metrics: Turn F1, Traj. F1, Post-Clarify F1.

| Base Model | Framework | Action Supervision | Tuning Ex. | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zephyr 7B-β | SFT | NA | 50 | 69.0 | 57.8 | 61.3 | 43.5 |
| Zephyr 7B-β | ACT | Crowdsourced | 50 | 82.2 | 62.8 | 61.9 | 57.2 |
| Zephyr 7B-β | ACT | Pseudo-labeled | 50 | 80.1 | 62.4 | 61.1 | 54.7 |
| Zephyr 7B-β | SFT | NA | 100 | 82.3 | 58.6 | 60.3 | 49.9 |
| Zephyr 7B-β | ACT | Crowdsourced | 100 | 86.0 | 65.0 | 62.0 | 57.4 |
| Zephyr 7B-β | ACT | Pseudo-labeled | 100 | 84.8 | 63.5 | 61.5 | 56.1 |
| Zephyr 7B-β | SFT | NA | 250 | 86.9 | 65.1 | 63.3 | 56.7 |
| Zephyr 7B-β | ACT | Crowdsourced | 250 | 89.6 | 68.1 | 65.7 | 62.0 |
| Zephyr 7B-β | ACT | Pseudo-labeled | 250 | 89.0 | 68.1 | 64.9 | 61.0 |
Table 4 examines ACT's performance on PACIFIC when action labels are not available, using zero-shot Gemini Pro for pseudo-labeling.
- Key Findings:
  - When using pseudo-labeled action supervision, there is "astonishingly high agreement" between the zero-shot Gemini 1.5 Pro pseudo-labels and the crowdsourced ground-truth action labels.
  - The empirical difference between crowdsourced and pseudo-labeled ACT is negligible across all metrics and data settings. For example, at 250 conversations, crowdsourced ACT achieves 89.6 Macro F1 and 62.0 Post-Clarify F1, while pseudo-labeled ACT achieves 89.0 and 61.0, respectively.
- Conclusion: ACT can be highly effective for adaptation in "in-the-wild" settings where explicit action-label supervision is unavailable, by leveraging LLMs for pseudo-labeling (a sketch of the pseudo-labeling step follows).
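A minimal sketch of the pseudo-labeling idea: an LLM is prompted zero-shot to judge whether the latest user request is ambiguous, and the resulting verdict stands in for a crowdsourced action annotation. The prompt wording, label names, and the generic `llm` callable are illustrative assumptions, not the authors' exact setup.

```python
from typing import Callable, List

PSEUDO_LABEL_PROMPT = """Given the conversation so far, is the user's last request ambiguous?
Answer with exactly one word: AMBIGUOUS or CLEAR.

Conversation:
{conversation}
"""

def pseudo_label_action(history: List[str], llm: Callable[[str], str]) -> str:
    """Map a zero-shot LLM judgment to a dialogue action label."""
    prompt = PSEUDO_LABEL_PROMPT.format(conversation="\n".join(history))
    verdict = llm(prompt).strip().upper()
    return "CLARIFY" if verdict.startswith("AMBIGUOUS") else "ANSWER"

# Usage with any text-in/text-out model client:
# action = pseudo_label_action(["User: How much would the pension change?"], llm=my_model)
```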
6.2. Ablation Studies
The following are the results from Table 5 of the original paper:
| Variant (PACIFIC, 50 conversations) | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- |
| *Action importance* | | | | |
| ACT w/ Random Actions | 63.2 | 55.3 | 58.7 | 32.8 |
| *Ablation of ACT subcomponents* | | | | |
| ACT w/o on-policy sampling | 74.8 | 61.5 | 59.1 | 40.5 |
| ACT w/ sampling but w/o simulation | 81.4 | 60.8 | 60.2 | 50.1 |
| ACT (full) | 82.2 | 62.8 | 61.9 | 57.2 |
| *ACT with unaligned foundation models* | | | | |
| Gemma 2B SFT | 57.7 | 38.0 | 40.5 | 17.0 |
| Gemma 2B ACT | 62.7 | 42.6 | 44.0 | 24.8 |
| Mistral 7B SFT | 57.7 | 53.8 | 51.4 | 27.7 |
| Mistral 7B ACT | 75.7 | 58.1 | 57.6 | 31.9 |
Table 5 presents ablation studies conducted using PACIFIC's 50-conversation setting, examining the importance of ACT's various components. ACT (full) refers to the standard ACT implementation from Table 1 with 50 conversations (Macro F1: 82.2, Traj. F1: 61.9, Post-Clarify F1: 57.2).
6.2.1. Action Importance
- **ACT w/ Random Actions:** This variant constructs contrastive pairs by randomly sampling both the winning and the losing action, instead of explicitly contrasting the ground-truth action with its opposite (see the sketch after this list).
  - Results: ACT w/ Random Actions achieves a Macro F1 of 63.2, Traj. F1 of 58.7, and Post-Clarify F1 of 32.8, significantly lower than ACT (full).
  - Analysis: This demonstrates that explicit action-based preferences (contrasting the correct pragmatic action with an incorrect one) are crucial; randomly chosen actions do not effectively guide the model toward an optimal conversational policy.
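The random-actions ablation is easiest to see in terms of how the initial preference pair is built. The sketch below contrasts the two constructions; the two-action set, the dictionary layout, and the `generate` helper (a model prompted to respond with a given action) are hypothetical stand-ins for illustration.

```python
import random
from typing import Callable, Dict

ACTIONS = ["CLARIFY", "ANSWER"]

def build_pair(context: str,
               gold_action: str,
               generate: Callable[[str, str], str],
               random_actions: bool = False) -> Dict[str, str]:
    """Build one (chosen, rejected) preference pair for a conversation context.

    ACT-style: the chosen response carries the ground-truth action and the rejected
    response carries the opposite action. The 'random actions' ablation samples both
    actions at random, which removes the pragmatic contrast.
    """
    if random_actions:
        win_action, lose_action = random.choice(ACTIONS), random.choice(ACTIONS)
    else:
        win_action = gold_action
        lose_action = "ANSWER" if gold_action == "CLARIFY" else "CLARIFY"
    return {
        "prompt": context,
        "chosen": generate(context, win_action),     # e.g. a clarifying question
        "rejected": generate(context, lose_action),  # e.g. a guessed direct answer
    }
```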
6.2.2. Ablation of ACT Subcomponents
- **ACT w/o on-policy sampling:** This variant runs DPO on the initially constructed preference dataset D_pref (from Algorithm 1) without dynamically updating the winning/losing pairs through on-policy sampling; it is essentially off-policy DPO applied to the initial action-based preference data.
  - Results: Macro F1 of 74.8, Traj. F1 of 59.1, and Post-Clarify F1 of 40.5. This improves over SFT (Macro F1 69.0) but falls short of ACT (full).
  - Analysis: On-policy sampling is vital. Off-policy negative responses are not guaranteed to lie within the language manifold of the current policy model, making it harder to overcome distribution shift and steer the model effectively; dynamic updates with on-policy generations yield more relevant contrastive examples.
- **ACT w/ sampling but w/o simulation:** This variant keeps on-policy sampling but removes the multi-turn trajectory simulation. The model's immediate response is still sampled, but its impact on the long-term conversational goal is not evaluated or used to update the winning/losing pairs.
  - Results: Macro F1 of 81.4, Traj. F1 of 60.2, and Post-Clarify F1 of 50.1, better than ACT w/o on-policy sampling but still short of ACT (full).
  - Analysis: Trajectory-level simulation is critical for multi-turn performance, especially for the policy model's ability to reason about its own clarification questions. Without it, the model may take correct immediate actions yet fail to carry the conversation to a successful outcome. The Post-Clarify F1 gap (50.1 vs. 57.2) highlights this, since the simulation explicitly penalizes poor outcomes after clarification. A schematic of the full quasi-online loop follows this list.
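Putting the subcomponents together, the quasi-online loop can be schematized as follows. The `policy`, `classify_action`, `simulate_user`, and `trajectory_success` callables stand in for the policy model, the action classifier, the user simulator, and the task-specific outcome heuristic; this is a schematic of the idea described above, not the authors' implementation.

```python
from typing import Callable, Dict

def act_update_pair(example: Dict,
                    policy: Callable[[str], str],
                    classify_action: Callable[[str], str],
                    simulate_user: Callable[[str, str], str],
                    trajectory_success: Callable[[str, Dict], bool]) -> Dict:
    """Refresh one preference pair with an on-policy sample plus trajectory simulation."""
    context = example["prompt"]
    sampled = policy(context)              # on-policy response from the current model
    action = classify_action(sampled)      # e.g. CLARIFY vs. ANSWER

    if action != example["gold_action"]:
        # Wrong dialogue action: the on-policy sample becomes the rejected response.
        return {**example, "rejected": sampled}

    # Right action: roll the conversation forward (via a simulated user reply if the
    # response was a clarifying question) and score the outcome with the task heuristic.
    trajectory = context + "\n" + sampled
    if action == "CLARIFY":
        trajectory += "\n" + simulate_user(context, sampled)
        trajectory += "\n" + policy(trajectory)
    if trajectory_success(trajectory, example):
        return {**example, "chosen": sampled}    # good outcome: promote to winning response
    return {**example, "rejected": sampled}      # bad outcome: treat as losing response
```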
6.2.3. ACT with Unaligned Foundation Models
This section examines whether ACT's benefits depend on the base model already being aligned with human feedback (like Zephyr 7B-β, which is an aligned variant of Mistral 7B).
- Models: Gemma 2B (unaligned) and Mistral 7B (unaligned; the base model for Zephyr).
- Results:
  - Gemma 2B SFT: Macro F1 57.7, Traj. F1 40.5. Gemma 2B ACT: Macro F1 62.7, Traj. F1 44.0 (improvements of 5.0 and 3.5 points, respectively).
  - Mistral 7B SFT: Macro F1 57.7, Traj. F1 51.4. Mistral 7B ACT: Macro F1 75.7, Traj. F1 57.6 (improvements of 18.0 and 6.2 points, respectively).
- Analysis: While Zephyr (an aligned Mistral variant) reaches higher absolute performance with ACT, ACT still provides notable improvements for unaligned foundation models like Gemma 2B and raw Mistral 7B over their respective SFT baselines.
- Conclusion: ACT is model-agnostic and improves performance regardless of prior alignment, although a better initialization (e.g., an already aligned model) leads to a higher overall performance ceiling.
6.3. Additional Experimental Results (from Appendix B)
The following are the results from Table A7 of the original paper:
Action-level metric: Macro F1; content-level metrics: Turn F1, Traj. F1, Post-Clarify F1.

| Base Model | Approach | Conversations | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini Pro | Standard Prompt | 10 | 81.4 | 59.7 | 58.7 | 49.7 |
| Gemini Pro | Chain-of-Thought | 10 | 86.3 | 66.3 | 17.1 | 19.2 |
| Gemini Pro | Proactive MIPrompt | 10 | 78.9 | 63.4 | 61.1 | 18.9 |
| Gemini Flash | Standard Prompt | 10 | 67.4 | 58.8 | 58.7 | 17.9 |
| Gemini Flash | Chain-of-Thought | 10 | 77.1 | 62.0 | 16.9 | 20.0 |
| Gemini Flash | Proactive MIPrompt | 10 | 76.8 | 64.0 | 62.0 | 24.4 |
| Claude Sonnet | Standard Prompt | 10 | 71.9 | 43.7 | 42.0 | 28.5 |
| Claude Sonnet | Chain-of-Thought | 10 | 80.0 | 37.2 | 13.0 | 6.8 |
| Claude Sonnet | Proactive MIPrompt | 10 | 74.9 | 47.2 | 45.9 | 7.6 |
| Claude Haiku | Standard Prompt | 10 | 46.9 | 26.4 | 26.2 | — |
| Claude Haiku | Chain-of-Thought | 10 | 48.6 | 23.7 | 12.0 | 2.9 |
| Claude Haiku | Proactive MIPrompt | 10 | 48.3 | 18.6 | 18.2 | 7.3 |
| Gemini Pro | SFT | 50 | 71.2 | 51.8 | 45.7 | 9.9 |
| Gemini Pro | SFT | 100 | 75.2 | 64.3 | 54.6 | 8.5 |
| Gemini Pro | SFT | 250 | 88.0 | 67.4 | 59.3 | 10.2 |
| Zephyr 7B-β | SFT | 50 | 69.0 | 57.8 | 61.3 | 43.5 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 50 | 75.5 | 61.7 | 55.7 | 30.8 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 50 | 74.8 | 62.0 | 56.3 | 31.9 |
| Zephyr 7B-β | IRPO | 50 | 67.7 | 59.1 | 56.7 | 34.4 |
| Zephyr 7B-β | ACT (ours) | 50 | 82.2 | 62.8 | 61.9 | 57.2 |
| Zephyr 7B-β | SFT | 100 | 82.3 | 58.6 | 60.3 | 49.9 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 100 | 68.8 | 53.3 | 53.3 | 31.7 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 100 | 83.0 | 59.0 | 53.7 | 29.3 |
| Zephyr 7B-β | IRPO | 100 | 84.5 | 60.4 | 55.2 | 38.2 |
| Zephyr 7B-β | ACT (ours) | 100 | 86.0 | 65.0 | 62.0 | 57.4 |
| Zephyr 7B-β | SFT | 250 | 86.9 | 65.1 | 63.3 | 56.7 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 250 | 65.6 | 53.6 | 54.1 | 30.9 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 250 | 82.8 | 43.3 | 38.6 | 19.6 |
| Zephyr 7B-β | IRPO | 250 | 85.4 | 64.9 | 58.4 | 40.3 |
| Zephyr 7B-β | ACT (ours) | 250 | 89.6 | 68.1 | 65.7 | 62.0 |
Table A7 expands on PACIFIC results, including Gemini Flash and Claude Haiku for prompting baselines and DPO-Dist as a tuning baseline. ACT still consistently shows the best performance. Notably, Chain-of-Thought prompting for frontier LLMs can sometimes lead to very low Trajectory F1, indicating that while reasoning chains might be explicit, they don't always translate to better multi-turn goal completion. DPO-Dist generally performs better than SFT but falls short of ACT.
The following are the results from Table A8 of the original paper:
Action-level metric: Macro F1; content-level metrics: Turn Similarity, Traj. Similarity.

| Base Model | Approach | Conversations | Macro F1 ↑ | Turn Similarity ↑ | Traj. Similarity ↑ |
| --- | --- | --- | --- | --- | --- |
| Gemini Pro | Standard Prompt | 10 | 55.5 | 67.0 | 72.2 |
| Gemini Pro | Chain-of-Thought | 10 | 61.2 | 63.4 | 39.1 |
| Gemini Pro | Proactive MIPrompt | 10 | 55.5 | 63.3 | 33.3 |
| Gemini Flash | Standard Prompt | 10 | 52.6 | 62.5 | 67.4 |
| Gemini Flash | Chain-of-Thought | 10 | 61.2 | 56.5 | 36.6 |
| Gemini Flash | Proactive MIPrompt | 10 | 58.1 | 61.7 | 36.1 |
| Claude Sonnet | Standard Prompt | 10 | 66.0 | 50.1 | 54.3 |
| Claude Sonnet | Chain-of-Thought | 10 | 63.7 | 46.2 | 36.8 |
| Claude Sonnet | Proactive MIPrompt | 10 | 57.2 | 60.8 | 32.9 |
| Claude Haiku | Standard Prompt | 10 | 49.3 | 40.9 | 41.7 |
| Claude Haiku | Chain-of-Thought | 10 | 46.2 | 30.7 | 28.0 |
| Claude Haiku | Proactive MIPrompt | 10 | 45.2 | 34.5 | 31.4 |
| Zephyr 7B-β | SFT | 50 | 44.6 | 53.3 | 64.2 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 50 | 46.9 | 57.2 | 61.2 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 50 | 44.7 | 57.9 | 61.5 |
| Zephyr 7B-β | ACT (ours) | 50 | 52.3 | 66.2 | 68.8 |
| Zephyr 7B-β | SFT | 100 | 52.6 | 63.1 | 69.4 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 100 | 47.8 | 61.9 | 67.1 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 100 | 44.8 | 62.0 | 66.4 |
| Zephyr 7B-β | ACT (ours) | 100 | 51.1 | 69.5 | 71.4 |
| Zephyr 7B-β | SFT | 250 | 53.5 | 64.0 | 66.2 |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 250 | 46.0 | 61.9 | 66.3 |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 250 | 46.3 | 62.6 | 67.0 |
| Zephyr 7B-β | ACT (ours) | 250 | 53.3 | 72.5 | 75.1 |
Table A8 shows extended Abg-CoQA results. Again, ACT demonstrates superior multi-turn goal completion (Traj. Similarity) compared to SFT and DPO-Dist, even when ICL with Gemini Pro achieves high Traj. Similarity at 10 examples. The DPO-Dist variants, while outperforming SFT in some cases, do not consistently match ACT's performance.
The following are the results from Table A9 of the original paper:
Action-level metric: Accuracy; content-level metrics: Execution Match, PC Execution Match.

| Base Model | Approach | Conversations | Accuracy ↑ | Execution Match ↑ | PC Execution Match ↑ |
| --- | --- | --- | --- | --- | --- |
| Gemini Pro | Standard Prompt | 10 | 72.1 | 63.5 | 75.2 | |
| Gemini Flash | Standard Prompt | 10 | 75.6 | 64.2 | 66.2 | |
| Claude Sonnet | Standard Prompt | 10 | 68.5 | 66.5 | 72.4 | |
| Claude Haiku | Standard Prompt | 10 | 73.8 | 57.3 | 65.3 | |
| Zephyr 7B-β | SFT | 50 | 77.4 | 21.9 | 13.9 | |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 50 | 77.7 | 42.6 | 31.5 | |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 50 | 78.0 | 40.9 | 41.2 | |
| Zephyr 7B-β | IRPO | 50 | 91.0 | 27.8 | 30.8 | |
| Zephyr 7B-β | ACT (ours) | 50 | 80.8 | 43.6 | 38.1 | |
| Zephyr 7B-β | SFT | 100 | 97.2 | 43.3 | 34.3 | |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 100 | 98.7 | 45.1 | 45.3 | |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 100 | 99.8 | 47.8 | 44.8 | |
| Zephyr 7B-β | IRPO | 100 | 96.2 | 45.0 | 37.0 | |
| Zephyr 7B-β | ACT (ours) | 100 | 99.2 | 48.0 | 49.6 | |
| Zephyr 7B-β | SFT | 250 | 99.8 | 51.0 | 50.7 | |
| Zephyr 7B-β | DPO-Dist (Pro v. Flash) | 250 | 97.3 | 49.7 | 44.2 | |
| Zephyr 7B-β | DPO-Dist (Sonnet v. Haiku) | 250 | 99.7 | 50.7 | 50.3 | |
| Zephyr 7B-β | IRPO | 250 | 97.0 | 49.7 | 45.6 | |
| Zephyr 7B-β | ACT (ours) | 250 | 99.9 | 52.3 | 53.0 | |
| Zephyr 7B-β | SFT | 14,000 (All) | 99.8 | 63.1 | 60.4 | |
Table A9 details extended AmbigSQL results. ACT again consistently achieves the strongest Execution Match and PC Execution Match among the Zephyr models, especially in data-scarce settings. DPO-Dist also shows competitive performance against SFT and IRPO, demonstrating the value of preference-based alignment. However, the largest frontier LLMs still achieve higher Execution Match, underscoring the benefits of model scale for complex tasks like SQL generation.
6.4. Qualitative Examples (from Appendix I)
The qualitative examples provide intuitive insights into ACT's improvements.
- **Table A28 (PACIFIC):**
  - Relevant table data: Year: 2019 || 2018; Total current liabilities: 851 || 167; Other noncurrent liabilities: 5 || 74; Total liabilities: 1,305.
  - SFT trajectory: User: What were the total liabilities of IMFT? → Assistant: $909 ×
  - ACT trajectory: User: What were the total liabilities of IMFT? → Assistant: Which year are you asking about? → User: 2018 → Assistant: $1,305
  Table A28 compares trajectories on PACIFIC. The SFT model makes an incorrect assumption and gives a wrong answer ($909), whereas the ACT model correctly asks a clarifying question ("Which year are you asking about?") and then, after the user clarifies, provides the correct answer ($1,305). This exemplifies ACT's learned ability to disambiguate.
- **Table A29 (PACIFIC, erroneous trajectory):**
  - Relevant data: passage — "Net profit/(loss) after tax was $(9.8) million"; table — Net profit/(loss) after tax: (9,819) || 6,639 || (248%).
  - SFT trajectory: User: What was the net profit/(loss) after tax? → Assistant: $(9,819) ×
  - ACT trajectory: User: What was the net profit/(loss) after tax? → Assistant: Which year are you asking the amount of? → User: FY19 → Assistant: $(9,819) ×
Table A29 illustrates an erroneous trajectory where both SFT and ACT models give incorrect answers for "net profit/(loss) after tax". The ACT model still attempts a clarification ("Which year are you asking the amount of?") and then provides an answer, but it's incorrect. This shows that while ACT improves clarification, it doesn't guarantee perfect accuracy on the underlying task, especially with limited data. The underlying issue is that both models selected a value from the table, but the ground truth was from the passage.
- **Table A30 (PACIFIC, erroneous trajectory):**
  - Relevant table data: Year: 2019 || 2018; Investments: 1,216.0 || 1,212.9.
  - SFT trajectory: User: In which year was the amount of Investments higher? → Assistant: [2018] ×
  - ACT trajectory: User: In which year was the amount of Investments higher? → Assistant: [2018, 2019] ×
Table A30 shows another example where both SFT and ACT fail to produce the correct answer, and neither asked a clarification question, even though the data is present in the table. This suggests room for improvement in task performance even with ACT.
- **Table A31 (Abg-CoQA):**
  - Relevant passage (excerpt): "She yawned, stretching out like a lazy cat. Her sister was also awake. 'Lizzie,' Meghan whispered, 'are you awake?'"
  - SFT trajectory: User: What did Meghan ask? → Assistant: Do you mean where did Lizzie see her father? → User: No
  - ACT trajectory: User: What did Meghan ask? → Assistant: Do you mean that morning or the night before? → User: The night before.
Table A31 compares SFT and ACT trajectories on Abg-CoQA. The SFT model asks an off-topic clarification question, failing to lead the conversation towards the correct information. In contrast, the ACT model asks an on-track clarification question ("Do you mean that morning or the night before?"), demonstrating its improved multi-turn reasoning and ability to eventually reach the ground truth outcome ("Meghan asked Lizzie, 'Are you awake?'").
6.5. Additional Data Analysis (from Appendix A)
The following are the results from Table A6 of the original paper:
Action-level metric: Macro F1; content-level metrics: Turn Similarity, Traj. Similarity.

| Base Model | Approach | Conversations | Macro F1 ↑ | Turn Similarity ↑ | Traj. Similarity ↑ |
| --- | --- | --- | --- | --- | --- |
| Gemini Pro | ICL | 50 | 56.4 | 64.5 | 68.9 |
| Zephyr 7B-β | ACT (ours) | 50 | 52.3 | 66.2 | 68.8 |
| Gemini Pro | ICL | 100 | 59.2 | 67.0 | 72.0 |
| Zephyr 7B-β | ACT (ours) | 100 | 51.1 | 69.5 | 71.4 |
| Gemini Pro | ICL | 250 | 58.8 | 66.0 | 71.1 |
| Zephyr 7B-β | ACT (ours) | 250 | 53.3 | 72.5 | 75.1 |
Table A6 investigates the impact of additional in-context learning data for Gemini Pro on Abg-CoQA.
- Key Findings: While Gemini Pro with ICL achieves competitive Macro F1 for action-level performance, its Trajectory Similarity does not uniformly improve with more ICL examples (72.2 at 10, 68.9 at 50, 72.0 at 100, 71.1 at 250). In contrast, Zephyr tuned with ACT consistently improves its Trajectory Similarity as the number of tuning conversations increases (68.8 at 50, 71.4 at 100, 75.1 at 250).
- Conclusion: This suggests that for corpora with potential label noise (such as Abg-CoQA, with its low inter-annotator agreement), multi-turn task completion is a more robust measure than action-level performance alone. ACT-tuned smaller models can outperform frontier LLMs using many-shot ICL on multi-turn task completion, highlighting ACT's data efficiency and alignment with conversational goals.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Action-Based Contrastive Self-Training (ACT), a novel, model-agnostic, and data-efficient quasi-online contrastive tuning approach specifically designed for multi-turn conversational task adaptation. ACT leverages DPO by incorporating action-based preferences and multi-turn trajectory simulation to teach LLMs how to effectively manage ambiguity and clarify user intent. The research provides compelling evidence that ACT is highly effective in data-limited scenarios, outperforming standard SFT and other DPO variants across diverse tasks, including tabular QA, machine reading comprehension, and text-to-SQL generation. A significant finding is ACT's ability to achieve strong performance even without explicit action labels, by using pseudo-labeling from larger LLMs. The paper also contributes AmbigSQL, a new dataset and task for ambiguous text-to-SQL generation, and proposes a comprehensive evaluation workflow for conversational agents' ambiguity recognition and reasoning abilities.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- **Assumption of timed clarifications:** The approach assumes that clarification questions are appropriately timed. Noisy crowdsourced data can lead to suboptimal learned policies (e.g., unnecessary clarifications or disfluent language), suggesting the need for a preprocessing stage that infers the usefulness of actions.
- **Label noise in evaluation:** Label noise can affect implicit action recognition evaluation, especially in datasets with low inter-annotator agreement (e.g., Abg-CoQA).
- **Task-specific heuristics:** ACT relies on task-specific heuristics for trajectory outcome evaluation. While this allows flexibility, it demands more customization and engineering effort for new domains.
- **Reliance on external LLMs:** The implementation of ACT depends heavily on other LLMs (e.g., Gemini for action classification and user simulation). These prompting approaches are not perfect, may require substantial customization, and might not be accessible to all researchers for cost or privacy reasons.
- **Limited-data focus:** The study concentrates on limited data regimes. It is unclear how ACT's findings generalize to scenarios with abundant in-distribution training data, where simpler objectives like SFT might eventually catch up.
- **Quasi-online nature:** ACT is quasi-online rather than fully online: it samples one response from the policy model during tuning, whereas fully online DPO samples both winning and losing responses. The extent of online exploration is also limited by the nature of the tasks (e.g., specific arithmetic expressions admit only a limited number of unique trajectories).

Future work could explore:

- Combining ACT with existing, more sophisticated tuning approaches for complex tasks like text-to-SQL generation.
- Generalizing ACT to large-scale data and multi-task environments.
- Studying ACT in combination with improved retrieval-augmented generation to address hallucinations in QA contexts.
7.3. Personal Insights & Critique
This paper presents a highly valuable contribution to the field of conversational AI, particularly in addressing a critical practical challenge: LLMs' struggle with ambiguity in multi-turn interactions under data constraints.
Strengths and Innovations:
- **Targeted problem:** The focus on disambiguation in multi-turn conversations is highly relevant. Many current LLM applications fall short when user intent is not perfectly clear, and this paper offers a concrete solution.
- **Effective integration of DPO:** The quasi-online DPO framework, tailored with action-based contrastive learning and multi-turn trajectory simulation, is an elegant and effective extension. It moves DPO beyond simple pairwise preferences toward a more sophisticated treatment of conversational dynamics.
- **Data efficiency:** The demonstration of ACT's performance in low-data regimes is a major plus. This is a common real-world scenario, especially in specialized domains where extensive human-annotated dialogue data is prohibitively expensive or impossible to collect. The success with pseudo-labeling further enhances its practicality.
- **Novel dataset (AmbigSQL):** The creation of AmbigSQL is a significant contribution. It serves as a strong benchmark for the proposed method and provides a valuable resource for future research on ambiguous text-to-SQL generation, highlighting the real-world performance gaps that clarification can address.
- **Comprehensive evaluation:** The dual focus on action-level and content-level metrics, particularly the trajectory-level evaluations (e.g., Traj. F1, PC Execution Match), offers a holistic view of conversational agent performance that goes beyond single-turn accuracy.
Potential Issues and Areas for Improvement:
- **Reliance on external LLMs:** While practical for research, the dependence on Gemini Ultra for conditional generation (M), action classification (A), and user simulation (U) could be a barrier for researchers without access to such frontier LLMs or with strict privacy requirements. Robust open-source alternatives for these components would broaden ACT's applicability.
- **Complexity of heuristics:** The reliance on task-specific heuristics (H) for trajectory outcome evaluation means that deploying ACT to a new domain may require significant domain expertise and engineering effort to define those heuristics accurately. While flexible, the method is not entirely "plug-and-play" across diverse tasks.
- **Defining "optimal" actions:** The paper assumes an "optimal" ground-truth action. In highly nuanced conversations, what constitutes the best clarifying question or direct answer can be subjective; the label-noise discussion for Abg-CoQA touches on this. Handling more complex, subjective dialogue acts, or even sequences of clarifying questions, could be a fruitful direction.
- **Scaling beyond limited data:** The paper focuses on the limited data regime. While this is a common problem, understanding how ACT performs or evolves on truly massive, in-distribution datasets (where simpler methods might eventually catch up) would be an interesting extension.
Broader Applicability:
The principles behind ACT—learning pragmatic actions, optimizing for multi-turn outcomes, and leveraging on-policy experience in a DPO-like framework—could be highly applicable to other complex AI agent behaviors beyond conversation. This could include agents that need to plan sequences of actions in interactive environments, where feedback is multi-step and reward signals are sparse or delayed. For instance, in tool-use or planning agents, distinguishing between "asking for more information/clarification" versus "executing a tool" could be framed as action-based preferences, and the success of a multi-step plan could form the trajectory outcome. Overall, ACT offers a robust framework for teaching LLMs more sophisticated, context-aware, and goal-oriented behaviors in interactive settings.