
Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Published: 06/01/2024

TL;DR Summary

This paper introduces Action-Based Contrastive Self-Training (ACT) to enhance LLMs' ability to handle ambiguity. The quasi-online optimization algorithm effectively learns dialogue policies in data-sparse scenarios, demonstrating superior performance over traditional fine-tuning.

Abstract

Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.

In-depth Reading

1. Bibliographic Information

1.1. Title

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

1.2. Authors

Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik. Maximillian Chen is affiliated with Google and Columbia University. The other authors (Ruoxi Sun, Tomas Pfister, Sercan Ö. Arik) are affiliated with Google. Their research backgrounds generally lie in artificial intelligence, machine learning, and natural language processing, particularly in the domain of conversational AI and large language models.

1.3. Journal/Conference

Published at arXiv, a preprint server. arXiv is a widely respected open-access archive for preprints of scientific papers in fields such as mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers on arXiv have not necessarily undergone peer review, but it is a primary platform for disseminating new research rapidly in the AI/ML community.

1.4. Publication Year

2024 (published on arXiv on 2024-05-31, UTC)

1.5. Abstract

The paper addresses the limitation of Large Language Models (LLMs) in handling ambiguity during multi-turn conversations; LLMs often guess user intents or overhedge rather than asking clarifying questions. This issue is exacerbated by the scarcity of high-quality, task-specific conversation samples for training. The authors propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm built upon Direct Preference Optimization (DPO). ACT is designed for data-efficient dialogue policy learning in multi-turn conversation modeling. The paper demonstrates ACT's effectiveness in data-efficient tuning scenarios, even without explicit action labels, across three real-world conversational tasks: tabular-grounded question-answering (PACIFIC), machine reading comprehension (Abg-CoQA), and a novel task called AmbigSQL for disambiguating information-seeking requests for complex SQL generation. Additionally, the paper introduces a method for evaluating LLMs' ability to implicitly recognize and reason about ambiguity. ACT significantly outperforms standard tuning methods like Supervised Fine-Tuning (SFT) and DPO.

1.6. Original Source Link

https://arxiv.org/abs/2406.00222 (Preprint)
PDF Link: https://arxiv.org/pdf/2406.00222v2.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the deficiency of Large Language Models (LLMs) in handling ambiguity within multi-turn conversations. While LLMs excel in many benchmarks, they often fail to exhibit crucial conversational skills like disambiguation. When confronted with ambiguous user requests, they tend to either provide vague or overly cautious responses (overhedging) or make implicit guesses about the user's true intent, instead of actively seeking clarification through questions. This behavior leads to suboptimal user experiences and reduces the effectiveness of LLM-based conversational agents.

This problem is particularly important in domains where high-quality conversation samples are limited. Collecting and annotating such data is costly and often faces privacy concerns, creating a bottleneck for LLMs to learn effective dialogue action policies. Existing LLM training paradigms, such as supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), don't always directly optimize for pragmatic conversational skills (e.g., when to clarify), especially in complex, multi-turn interactions.

The paper's entry point or innovative idea is to address this data scarcity and skill gap by proposing a data-efficient adaptation approach. It focuses on enabling LLMs to implicitly select conversational strategies in ambiguous contexts by contrasting pragmatic conversational actions within multi-turn trajectories, rather than just single-turn responses.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Action-Based Contrastive Self-Training (ACT): Introduction of ACT, a novel, data-efficient, quasi-online preference optimization algorithm based on DPO. ACT is designed specifically for multi-turn conversation modeling and focuses on contrasting pragmatic conversational actions (e.g., CLARIFY vs. ANSWER).

  • Multi-Turn Trajectory Optimization: ACT explicitly incorporates multi-turn trajectory simulation into its training objective, allowing it to optimize for long-horizon conversational success, which is a significant improvement over methods focused solely on single-turn responses.

  • Data-Efficient Policy Learning: Demonstrated ACT's efficacy in data-efficient tuning scenarios, showing substantial improvements even with limited conversation samples (e.g., 50, 100, 250 conversations) and even when explicit action labels are unavailable (using pseudo-labeling).

  • Novel Task and Evaluation Workflow: Introduction of AmbigSQL, a new task specifically designed for disambiguating information-seeking requests to generate complex SQL queries, highlighting the necessity of clarification in data analysis agents. The paper also proposes a workflow for evaluating LLMs' ability to implicitly recognize and reason about ambiguity in conversation.

  • Empirical Superiority: ACT demonstrated substantial conversation modeling improvements over standard tuning approaches, including Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) variants like Iterative Reasoning Preference Optimization (IRPO), across diverse tasks like tabular QA (PACIFIC), machine reading comprehension (Abg-CoQA), and text-to-SQL generation (AmbigSQL).

  • Model Agnosticism: ACT is shown to be model-agnostic, capable of improving performance regardless of the base model's pre-existing alignment with human feedback.

    These findings solve the problem of LLMs' struggle with conversational ambiguity by providing a method that teaches them to CLARIFY effectively, even with limited task-specific data, leading to more robust and helpful conversational agents.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following fundamental concepts:

  • Large Language Models (LLMs): LLMs are deep learning models, typically based on the Transformer architecture, that are pre-trained on vast amounts of text data to learn complex patterns of language. They can generate human-like text, translate languages, answer questions, and perform many other Natural Language Processing (NLP) tasks. The core idea is that through pre-training, they develop a broad understanding of language, which can then be adapted for specific tasks through fine-tuning.
  • Supervised Fine-Tuning (SFT): SFT is a common technique used to adapt a pre-trained LLM to a specific downstream task. It involves training the LLM on a dataset of input-output pairs (e.g., instruction-response pairs) for that task. The model's weights are adjusted to minimize a loss function (e.g., cross-entropy) that measures the difference between its generated output and the ground-truth output. SFT teaches the model how to follow instructions or generate responses in a desired format.
  • Reinforcement Learning from Human Feedback (RLHF): RLHF is a post-training technique designed to align LLMs with human preferences and values. It typically involves three steps:
    1. Supervised Fine-Tuning (SFT): An LLM is initially fine-tuned on instruction-following data.
    2. Reward Model Training: Human annotators rank or rate different LLM-generated responses to a given prompt. A separate reward model is then trained on this human preference data to predict human preferences.
    3. Reinforcement Learning (RL): The SFT model is further fine-tuned using an RL algorithm (e.g., Proximal Policy Optimization (PPO)) where the reward model provides a scalar reward signal. The LLM learns to generate responses that maximize this reward, thereby aligning its behavior with human preferences. RLHF is crucial for making LLMs helpful, harmless, and honest.
  • Direct Preference Optimization (DPO): DPO is an alternative to RLHF that simplifies the alignment process by directly optimizing the LLM policy against human preferences, without needing to explicitly train a separate reward model. Instead of learning a reward function and then using RL, DPO reformulates the RLHF objective into a simple cross-entropy loss. It directly optimizes the LLM to increase the probability of preferred responses and decrease the probability of dispreferred responses, based on a dataset of human-ranked pairs (preferred, rejected). This makes DPO more stable and easier to implement than PPO-based RLHF. (A minimal illustration of the SFT and DPO data formats appears after this list.)
  • Multi-turn Conversations/Dialogue Systems: These refer to interactive systems where an agent and a user exchange multiple utterances to achieve a specific goal or engage in natural conversation. Unlike single-turn interactions, multi-turn dialogues require context understanding, memory, and the ability to manage the flow of conversation. Key challenges include maintaining coherence, tracking dialogue state, and handling ambiguity that arises over several turns.
  • Dialogue Acts/Pragmatic Intent: A dialogue act is a linguistic unit that describes the function or communicative intention of an utterance in a conversation (e.g., question, answer, clarification, greeting, apology). Recognizing and generating appropriate dialogue acts is crucial for effective conversational agents, as it allows them to understand and influence the pragmatic flow of a conversation. The pragmatic intent refers to the underlying purpose or goal a speaker has when making an utterance.
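
As a concrete illustration of the data formats behind SFT and DPO, here is a minimal sketch; the field names and example contents are illustrative assumptions only, not drawn from the paper or any specific dataset.

```python
# Illustrative data records only; field names are hypothetical.

# SFT learns from a single ground-truth target per prompt.
sft_example = {
    "prompt": "User: Which singers are from France?\nAssistant:",
    "target": "SELECT name FROM singer WHERE country = 'France'",
}

# DPO learns from a preference pair: a preferred ("chosen") and a
# dispreferred ("rejected") response to the same prompt.
dpo_example = {
    "prompt": "User: Show details about the singers.\nAssistant:",
    "chosen": "Which details would you like to see: names, ages, or songs?",
    "rejected": "SELECT * FROM singer",  # guesses the intent instead of clarifying
}
```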

3.2. Previous Works

The paper builds upon a rich history of research in conversational AI and LLM alignment.

  • Mixed-Initiative Conversational Agents: Early neural approaches (Chen et al., 2017, 2022b; Qian et al., 2022; Yu, 2017) often decoupled generation (forming utterances) from planning (deciding what action to take). While LLMs improved generation, planning, especially for multi-turn interactions, remained challenging. Planning modules often involved complex Markov Decision Processes (MDPs) (Wang et al., 2020b; Yang et al., 2021; Yu et al., 2023) or neural models combined with search algorithms (Cheng et al., 2022; Väth et al., 2023; Yu et al., 2023) and simulation (Deng et al., 2023c; Wang et al., 2020a; Yu et al., 2023). However, these modular approaches could incur high computational overhead and error propagation.
  • LLM Alignment (RLHF and DPO): The current paradigm for LLM training involves pre-training, SFT, and alignment. RLHF (Ouyang et al., 2022; Christiano et al., 2017) became prominent for alignment, often using PPO. Recent offline RL algorithms like DPO (Rafailov et al., 2024), SLiC (Zhao et al., 2023), and IPO (Azar et al., 2024) have gained traction for bypassing explicit reward modeling, making them easier to optimize while achieving similar empirical results on fixed preference datasets.
  • On-Policy/Iterative DPO Variants: Many contemporary works explore "online" or "iterative" variants of DPO to overcome the limitations of fully offline methods. Examples include Iterative DPO (Yuan et al., 2024), variants where ground-truth responses are preferred over sampled ones (Chen et al., 2024), or DPO applied to optimize reasoning chains (Pang et al., 2024). Some works also apply DPO to conversations generally, but often focus on single-turn response optimization (Sun et al., 2024).
  • User Simulators: Works like Deng et al. (2023c) and Yu et al. (2023) directly prompt LLMs for goal-oriented tasks conditioned on dialogue context, building on earlier research in user simulation for dialogue systems (Chandramohan et al., 2011; Kreyssig et al., 2018; Shah et al., 2018).

3.3. Technological Evolution

The field of conversational AI has evolved significantly, particularly with the advent of LLMs. Initially, dialogue systems were often rule-based or heavily reliant on statistical models for specific components like Natural Language Understanding (NLU) and Natural Language Generation (NLG). The shift towards neural approaches brought improvements, leading to modular systems where planning (what to say) and generation (how to say it) were often separate.

The rise of LLMs brought about a paradigm shift. Their impressive zero-shot and few-shot capabilities allowed for highly fluent and contextually aware generation. However, simply using LLMs for generation didn't automatically imbue them with robust conversational pragmatics. The need for alignment with human values and specific task behaviors led to SFT and then RLHF. RLHF effectively enhanced LLMs' ability to follow instructions and generate preferred responses, but its complexity (training a reward model, using PPO) spurred the development of simpler, yet effective, alternatives like DPO.

This paper's work (ACT) fits into this timeline by taking the DPO framework and extending it to specifically address the nuanced, multi-turn challenges of conversational AI. It moves beyond general LLM alignment to focus on learning optimal dialogue action policies in scenarios where data is limited, an increasingly common and critical challenge in deploying specialized conversational agents.

3.4. Differentiation Analysis

Compared to the main methods in related work, ACT offers several core differences and innovations:

  • Action-Based Preference Optimization: While DPO and its variants optimize LLMs based on general human preferences for responses, ACT explicitly introduces action-based preferences. It focuses on contrasting responses not just by their content, but by the implicit conversational action (e.g., CLARIFY vs. ANSWER) they represent. This directly targets the pragmatic skills needed for effective disambiguation.

  • Multi-Turn Trajectory Simulation: Unlike most DPO applications that focus on single-turn response optimization, ACT incorporates multi-turn trajectory simulation into its self-training loop. It evaluates not just the immediate response but the outcome of a simulated conversation trajectory stemming from that response. This allows for long-horizon optimization, crucial for complex conversational tasks where a single "good" turn might lead to a poor overall outcome.

  • Quasi-Online Learning: ACT is described as a quasi-online DPO variant. It maintains the ease of use of offline methods (by constructing an initial preference dataset) but incorporates the flexible exploration found during online learning by continuously sampling on-policy responses and updating contrastive pairings based on their simulated conversational success. This strikes a balance between computational efficiency and adaptability.

  • Data Efficiency for Conversational Policies: ACT is specifically designed for data-efficient tuning scenarios, making it highly relevant for task-specific conversational AI where high-quality multi-turn dialogue data is scarce. It shows significant improvements even with very limited training data (e.g., 50-250 conversations), outperforming approaches that typically require more data or are less focused on conversational pragmatics.

  • Implicit Action Learning: ACT implicitly learns action planning as a subtask of response generation. This avoids the computational overhead and error propagation issues associated with separate planning modules in traditional mixed-initiative agents. It also demonstrates the ability to learn effectively even when explicit action labels are unavailable, leveraging pseudo-labeling.

    In essence, ACT innovates by tailoring DPO to the unique demands of multi-turn conversational AI, emphasizing action-based learning and trajectory-level outcomes to teach LLMs how to effectively manage ambiguity and clarify user intent.

4. Methodology

The core idea of ACT is to adapt generic LLMs for dialogue generation, implicitly modeling action planning. It's a quasi-online extension of DPO, focusing on contrasting differences between an agent's possible pragmatic conversational actions within multi-turn contexts.

4.1. Principles

The theoretical basis behind ACT is rooted in three key intuitions:

  1. Contrastive Preferences for Pragmatic Actions: The pragmatic differences between "winning" (desired) and "losing" (undesired) dialogue responses can be intuitively demonstrated and learned through contrastive preference pairs. This means explicitly showing the model what constitutes a good conversational action (like asking a clarifying question when appropriate) versus a bad one (like guessing the user's intent).
  2. Multi-turn Optimization for Conversational Improvements: Real conversational improvements require optimizing over sequences of turns (multi-turn optimization), not just individual responses. A single turn's quality might not reflect its contribution to the overall conversational goal. ACT therefore needs to evaluate and learn from entire conversation trajectories.
  3. On-Policy Response Sampling and DPO Gradient: The gradient of DPO-like algorithms is weighted by the log probabilities of winning and losing responses. By integrating on-policy response sampling (i.e., sampling responses from the current version of the model being trained), ACT ensures that the "losing" responses used in training are often high-probability sequences from the current policy, making the contrastive learning more effective for steering the model's behavior away from its current undesirable tendencies.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Setup

The paper considers the task of tuning an LLM to act as a mixed-initiative conversational agent. This agent's role is to assist users over multiple dialogue interactions, ultimately providing a correct response to their request. Crucially, a mixed-initiative agent must understand when to redirect the flow of interaction by executing specific conversational actions, such as asking clarifying questions.

Notation:

  • Let $\pi_{\theta_i}$ be an LLM's policy, parameterized by $\theta$, at timestep $i \geq 0$.

  • $\pi_{ref}$ is the reference policy model, which is typically the initial policy model (i.e., $\pi_{ref} = \pi_{\theta_0}$).

  • $D$ is a dataset of conversations.

  • Each conversation $c \in D$ consists of $n$ dialogue turns.

  • A turn state at timestep $i$ is represented by $t_i$. This includes observed utterances and actions from both user and agent.

  • Each $t_i$ is part of a trajectory that ends when the user's initial question (from an earlier timestep $j \le i$) is answered.

  • $t_i$ has two primary components:

    • $p_i$: A prompt at timestep $i$. This comprises any task-specific information (e.g., SQL database schema, tabular data, retrieved passages) combined with the existing dialogue context.
    • $r_i$: The ground truth system-side response at timestep $i$.
  • $g_i$: The goal response that resolves the implicit trajectory of $t_i$. In single-turn trajectories, $g_i = r_i$.

  • Each $r_i$ implicitly expresses an action $a_i$.

  • $S$: The latent Action Space of a particular task. For the experiments in this paper, $S =$ [CLARIFY, ANSWER].

  • $G$: An Action Annotation Agent that can infer $a_i$ from $(p_i, r_i)$. In practice, this can be a human annotator or a pseudo-labeling classifier.

  • $M$: A Controllable generation model used for preference data creation. This is typically a high-capacity LLM.

  • $A$: An Action Classifier model used during tuning and evaluation to identify the implicit action of a generated response.

  • $U$: A User Simulator model, also used during tuning and evaluation to simulate user responses and conversational trajectories.

    The notation is illustrated in Figure A4.

    Figure A4 | Example of a contrastive pairing constructed for RL tuning with Abg-CoQA (Guo et al., 2021). The notation used is as described in Section 3.1. (The figure shows a multi-turn dialogue about materialism in which the assistant is asked to clarify how materialism relates to other philosophical categories; the chosen response is a clarifying question while the rejected response answers directly, highlighting the contrast between the dialogue intent and the information goal.)

Figure A4 shows an example of a contrastive pairing constructed for RL tuning with Abg-CoQA. The example demonstrates how a turn in a conversation is decomposed into its prompt, ground truth response, goal response, and the inferred action. It also shows a winning response ($y_w$) and a losing response ($y_l$), highlighting that the losing response is generated by forcing an alternative action.

User Simulators: The User Simulator ($U$) is implemented by first prompting an LLM to summarize the user's information-seeking goal. Then, another prompt is formed using this summary and the current dialogue context to simulate a user response. This approach allows more flexibility than directly providing the ground truth objective.

Action Classifiers: For the tasks considered, the possible actions are CLARIFY or ANSWER. The Action Classifier ($A$) is implemented using few-shot in-context learning with an LLM.
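
To ground the notation and the classifier description above, here is a minimal Python sketch; the dataclass fields, the few-shot prompt, and the `llm_complete` helper are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass

ACTION_SPACE = ["CLARIFY", "ANSWER"]  # the latent action space S used in this paper

@dataclass
class TurnState:
    """Turn state t_i and its contrastive augmentation t_i' (field names are illustrative)."""
    prompt: str            # p_i: task-specific info + dialogue context
    response: str          # r_i: ground-truth system response
    goal: str              # g_i: response that resolves the implicit trajectory
    action: str            # a_i: implicit action of r_i
    rejected_action: str   # a_i': the alternative action
    winning: str           # y_wi
    losing: str            # y_li

FEW_SHOT_ACTION_PROMPT = """Classify the assistant response as CLARIFY or ANSWER.
Response: "Are you asking about the songs or the singers?" -> CLARIFY
Response: "SELECT Name FROM singer ORDER BY Age DESC" -> ANSWER
Response: "{response}" ->"""

def classify_action(response: str, llm_complete) -> str:
    """Action classifier A via few-shot in-context learning; `llm_complete` is a hypothetical LLM call."""
    label = llm_complete(FEW_SHOT_ACTION_PROMPT.format(response=response))
    return "CLARIFY" if "CLARIFY" in label.upper() else "ANSWER"
```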

4.2.2. ACT: Action-Based Contrastive Self-training

ACT consists of two main phases: action-based contrast dataset construction (Algorithm 1) and contrastive self-training (Algorithm 2). The overall process is summarized in Figure 3.

Figure 3 | Overview of the ACT tuning phase: policy simulation and evaluation in multi-turn conversations, showing example responses and action detection under different scenarios, and the policy update process.

Figure 3 provides an overview of the tuning phase of ACT. It shows that for each initial contrastive pairing, an on-policy response is sampled from the model being tuned. This sampled response's trajectory is evaluated, and the contrastive pairing is updated (either the winning or losing response is replaced) before the model policy is updated using the DPO objective.

4.2.2.1. Construction of Preference Data

The preference dataset ($D_{pref}$) is primarily built from contrastive winning-losing action pairs. This process is detailed in Algorithm 1.

The following are the steps for building contrastive action pairs (Algorithm 1):

Algorithm 1: Building Contrastive Action Pairs
Input: Dataset $D$, conditional generation model $M$, action space $S$, action annotation agent $G$
1: Initialize empty dataset $D_{pref}$.
2: for conversation turn $t_i$ in $D$ do
3:   Let $a_i = G(p_i, r_i)$  ▷ Infer contextual action
4:   Let $a_i' = S \setminus a_i$  ▷ Determine rejected action
5:   Let $y_{w_i} = r_i$.
6:   Sample $y_{l_i} \sim P_M(\cdot \mid p_i, a_i')$.
7:   Let $t_i' = (p_i, r_i, g_i, a_i, a_i', y_{w_i}, y_{l_i})$.
8:   Add $t_i'$ to $D_{pref}$.

Explanation:

  1. An empty dataset $D_{pref}$ is initialized to store the contrastive pairs.
  2. For each conversation turn $t_i$ in the input dataset $D$:
    • Infer Contextual Action: The Action Annotation Agent $G$ infers the implicit action $a_i$ of the ground truth response $r_i$ given the prompt $p_i$. This can be human annotation or a pseudo-labeling classifier.

    • Determine Rejected Action: A rejected action $a_i'$ is chosen from the Action Space $S$ such that $a_i'$ is not the inferred action $a_i$. For $S =$ [CLARIFY, ANSWER], if $a_i$ is CLARIFY, then $a_i'$ is ANSWER, and vice-versa.

    • Winning Response: The ground truth response $r_i$ is designated as the winning response $y_{w_i}$.

    • Losing Response Sampling: A losing response $y_{l_i}$ is sampled from the Conditional generation model $M$. This sampling is conditioned on the prompt $p_i$ and explicitly instructed to generate a response corresponding to the rejected action $a_i'$. This means $M$ generates a response that performs the opposite pragmatic action of the ground truth.

    • Tuple Creation: An augmented tuple $t_i'$ is created containing the prompt $p_i$, ground truth response $r_i$, goal response $g_i$, inferred action $a_i$, rejected action $a_i'$, winning response $y_{w_i}$, and losing response $y_{l_i}$.

    • This tuple $t_i'$ is added to $D_{pref}$.

      The paper notes that a high-capacity LLM (Gemini Ultra) is used for $M$ to construct losing responses, which avoids the need for crowdworkers or tuning a smaller model for this specific task.

Action optimization for unlabeled conversations "in-the-wild": Even without gold-standard ambiguity annotations, pseudo-label supervision can be obtained by using a classifier (e.g., a zero-shot LLM) as the Action Annotation Agent $G$.
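
A minimal Python sketch of Algorithm 1, with hypothetical callables `annotate_action` (the annotation agent $G$, which could be a zero-shot pseudo-labeling LLM as just described) and `generate_with_action` (the controllable generation model $M$); it illustrates the pairing logic rather than the paper's exact implementation.

```python
ACTION_SPACE = ["CLARIFY", "ANSWER"]

def build_contrastive_pairs(dataset, annotate_action, generate_with_action):
    """Algorithm 1 sketch: construct the action-contrast preference dataset D_pref."""
    d_pref = []
    for turn in dataset:  # each turn is assumed to expose .prompt (p_i), .response (r_i), .goal (g_i)
        action = annotate_action(turn.prompt, turn.response)            # a_i = G(p_i, r_i)
        rejected_action = next(a for a in ACTION_SPACE if a != action)  # a_i' = S \ {a_i}
        pair = {
            "prompt": turn.prompt,
            "goal": turn.goal,
            "action": action,
            "rejected_action": rejected_action,
            "winning": turn.response,                                      # y_wi = r_i
            "losing": generate_with_action(turn.prompt, rejected_action),  # y_li ~ P_M(. | p_i, a_i')
        }
        d_pref.append(pair)
    return d_pref
```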

4.2.2.2. Self-Training Using On-policy Conversation Trajectory Simulation

This phase involves continuously sampling batches from the preference dataset $D_{pref}$ and dynamically updating the winning/losing pairs based on on-policy sampling and multi-turn trajectory simulation. This process is detailed in Algorithm 2.

The following are the steps for ACT: Action-Based Contrastive Self-Training (Algorithm 2):

Algorithm 2: ACT: Action-Based Contrastive Self-Training
Input: Initial policy model $\pi_{\theta_0}$, action contrast dataset $D_{pref}$, number of batches $B$, action classifier $A$, user simulator $U$, task heuristic $H$, heuristic tolerance $\epsilon$
1: for conversation turn $t_i'$ in batch $b_j$ sampled from $D_{pref}$, where $0 \le j \le B$ do
2:   Sample $y_i \sim \pi_{\theta_j}(\cdot \mid p_i)$  ▷ Sample a response from the current model policy
3:   if action $A(y_i) \ne$ action $a_i$ then  ▷ Implicit pragmatic action does not match ground truth
4:     Set $y_{l_i} = y_i$
5:   else
6:     Initialize Trajectory; add $y_i$ to Trajectory
7:     while $A(y_i) \ne$ ANSWER do  ▷ Simulate user clarification
8:       Clarification answer $= U(p_i, \text{Trajectory})$  ▷ Simulate user response to clarification
9:       Add clarification answer to Trajectory
10:      $y_{i+1} = \pi_{\theta_j}(\cdot \mid p_i, \text{Trajectory})$  ▷ Simulate next policy response
11:      Add $y_{i+1}$ to Trajectory
12:    if $H(\text{Trajectory outcome}, g_i) > \epsilon$ then
13:      Let $y_{w_i} =$ Trajectory  ▷ Reward acceptable trajectory outcome
14:    else
15:      Let $y_{l_i} =$ Trajectory  ▷ Penalize bad trajectory outcome
16:  $\theta \leftarrow$ Update($\theta$) until convergence (Eq. 2)
Output: $\pi_{\theta_B}$

Explanation:

  1. The algorithm iterates for a specified number of batches $B$, sampling conversation turns $t_i'$ from $D_{pref}$.

  2. On-policy Sampling: For each turn $t_i'$, a response $y_i$ is sampled from the current policy model $\pi_{\theta_j}$ (the model being tuned), conditioned on the prompt $p_i$. This on-policy sample represents how the current model would actually respond.

  3. Action Mismatch Check: The Action Classifier $A$ determines the implicit action of $y_i$. If this implicit action does not match the ground truth action $a_i$ (which was inferred for $r_i$ during $D_{pref}$ construction):

    • The sampled response $y_i$ is immediately designated as a losing response $y_{l_i}$, replacing the initial $y_{l_i}$ from $D_{pref}$. This penalizes the model for taking the wrong immediate action.
  4. Trajectory Simulation (if action matches): If the implicit action of $y_i$ does match $a_i$ (meaning the model took the pragmatically correct immediate action, e.g., asking a clarifying question):

    • A Trajectory is initialized with $y_i$.
    • A while loop simulates the subsequent turns of the conversation until the ANSWER action is taken by the policy model (i.e., until the model attempts to answer the original user request):
      • The User Simulator $U$ generates a clarification answer (user response) based on the current prompt $p_i$ and the ongoing Trajectory.
      • This clarification answer is added to the Trajectory.
      • The policy model $\pi_{\theta_j}$ generates the next policy response $y_{i+1}$, conditioned on the updated prompt $p_i$ and the extended Trajectory.
      • This $y_{i+1}$ is added to the Trajectory.
    • Trajectory Outcome Evaluation: After the ANSWER action is reached, the Task Heuristic $H$ evaluates the Trajectory outcome (the final answer provided by the model after all clarifying turns) against the ground truth outcome $g_i$.
      • If the Trajectory outcome is acceptable (i.e., $H(\text{Trajectory outcome}, g_i) > \epsilon$ for some heuristic tolerance $\epsilon$), the entire simulated Trajectory is designated as the winning response $y_{w_i}$.
      • Otherwise (if the outcome is unacceptable), the Trajectory is designated as the losing response $y_{l_i}$. This penalizes the model for leading the conversation to a poor final answer, even if initial actions were correct.
  5. Policy Update: After determining the updated $y_{w_i}$ and $y_{l_i}$, the model's parameters $\theta$ are updated using the DPO objective (Equation 1, explained next). This update step is repeated until convergence or for a fixed number of batches.

    The trajectory-level content evaluation is further clarified in Figure A5.

    Figure A5 | Trajectory-level content evaluation using the example scenario from Figure 1. Trajectory-level evaluation seeks to measure the extent to which a candidate LLM can interact with a "User" to reach a target information goal. The "interactive" evaluation of a given instance continues until the candidate LLM attempts to resolve the User's request by providing a direct answer. The candidate trajectory resolution is scored using downstream task metrics. In this example, DROP F1 is used following the task metrics for PACIFIC. (The figure depicts the interaction between the evaluated model and a user simulator: given a prompt and a target information goal, the model first asks a clarifying question, then attempts to resolve the user's request with a direct answer, and the final response is scored with the DROP F1 metric.)

Figure A5 illustrates how the trajectory-level content evaluation works. The evaluated LLM interacts with a User Simulator ($U$) until it provides a direct answer. The candidate trajectory resolution (final answer) is then scored against the user's ground truth information-seeking goal using task-specific metrics like DROP F1 for PACIFIC.
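
The control flow of Algorithm 2 can be summarized in the following sketch; the policy interface, the batch sampler, and the callables standing in for the classifier $A$, user simulator $U$, heuristic $H$, and DPO update are all hypothetical, and the sketch conveys the loop structure rather than the paper's exact implementation.

```python
import random

def act_self_training(policy, d_pref, num_batches, classify_action, simulate_user,
                      heuristic, epsilon, dpo_update, batch_size=8):
    """Algorithm 2 sketch: quasi-online ACT tuning loop (control flow only)."""
    for _ in range(num_batches):
        batch = random.sample(d_pref, min(batch_size, len(d_pref)))
        for turn in batch:
            y = policy.generate(turn["prompt"])                  # on-policy sample
            if classify_action(y) != turn["action"]:
                turn["losing"] = y                               # wrong immediate action -> penalize
            else:
                trajectory = [y]
                # Simulate user/assistant turns until the policy attempts a direct answer.
                while classify_action(trajectory[-1]) != "ANSWER":
                    user_reply = simulate_user(turn["prompt"], trajectory)
                    trajectory.append(user_reply)
                    trajectory.append(policy.generate(turn["prompt"], trajectory))
                outcome = trajectory[-1]                         # final attempted answer
                if heuristic(outcome, turn["goal"]) > epsilon:
                    turn["winning"] = "\n".join(trajectory)      # reward acceptable outcome
                else:
                    turn["losing"] = "\n".join(trajectory)       # penalize bad outcome
        policy = dpo_update(policy, batch)                       # Eq. (1) on the refreshed pairs
    return policy
```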

4.2.3. Contrastive RL Tuning for Alignment

After constructing the up-to-date winning ($y_w$) and losing ($y_l$) pairing at turn $i$ through the simulation process described above, the policy model $\pi_\theta$ is updated using the DPO training objective (Rafailov et al., 2024).

The DPO loss function is given as:

$$ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref}) = - \mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid p)}{\pi_{ref}(y_w \mid p)} - \beta \log \frac{\pi_\theta(y_l \mid p)}{\pi_{ref}(y_l \mid p)} \right) \right] $$

Explanation of Symbols:

  • $\mathcal{L}_{\mathrm{DPO}}$: The Direct Preference Optimization loss.

  • $\pi_\theta$: The policy model being trained, parameterized by $\theta$.

  • $\pi_{ref}$: The reference policy model, typically the SFT model or the initial policy before DPO. It serves as a regularization term, preventing the policy from drifting too far from the initial model.

  • $\mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}}$: Expectation over triplets $(p, y_w, y_l)$ sampled from the dataset $\mathcal{D}$.

  • $p$: The prompt (context) for generation. This is a concatenation of task information and conversation history: $\{x_1, y_1, ..., x_{i-1}, y_{i-1}, x_i\}$, where $x_i$ are user utterances and $y_i$ are system utterances.

  • $y_w$: The winning response or trajectory. This is the response/trajectory that is preferred.

  • $y_l$: The losing response or trajectory. This is the response/trajectory that is dispreferred.

  • $\sigma$: The sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. It squashes the input value into a range between 0 and 1, effectively representing a probability or preference score.

  • $\beta$: A hyperparameter that controls the strength of the regularization. It regularizes the ratio between the current policy $\pi_\theta$ and the reference policy $\pi_{ref}$. A larger $\beta$ means stronger regularization.

  • $\log \frac{\pi_\theta(y \mid p)}{\pi_{ref}(y \mid p)}$: This term measures the log probability ratio of a response $y$ under the current policy $\pi_\theta$ relative to the reference policy $\pi_{ref}$. It essentially quantifies how much the current policy prefers a response compared to the reference. In DPO, this ratio is proportional to the implicit reward of the response.

    The goal of DPO is to minimize this loss, which effectively maximizes the log-likelihood of the winning responses and minimizes the log-likelihood of the losing responses, weighted by their relative preferences.

The gradient of this objective with respect to the policy parameters $\theta$ is given as:

$$ \nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref}) = -\beta \, \mathbb{E}_{(p, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w) \big) \Big[ \nabla_\theta \log \pi_\theta(y_w \mid p) - \nabla_\theta \log \pi_\theta(y_l \mid p) \Big] \Big] $$

where $\widehat{R}_\theta(p, y) = \beta \log \frac{\pi_\theta(y \mid p)}{\pi_{ref}(y \mid p)}$.

Explanation of Symbols:

  • $\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{ref})$: The gradient of the DPO loss with respect to the policy parameters $\theta$. This is what is used to update the model weights during training.

  • $\widehat{R}_\theta(p, y)$: An implicitly defined reward for a response $y$ given prompt $p$ under the current policy $\pi_\theta$ relative to the reference policy $\pi_{ref}$. This reward is derived directly from the policy log probabilities and the $\beta$ hyperparameter.

  • $\sigma(\widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w))$: This term represents the sigmoid of the difference in implicit rewards between the losing and winning responses. It acts as a weighting factor. If the model already correctly ranks $y_w$ higher than $y_l$ (i.e., $\widehat{R}_\theta(p, y_w) > \widehat{R}_\theta(p, y_l)$), then $\widehat{R}_\theta(p, y_l) - \widehat{R}_\theta(p, y_w)$ will be negative, and $\sigma$ will be small, resulting in a small gradient update. If the model incorrectly ranks $y_l$ higher than or close to $y_w$, then this term will be larger, leading to a stronger gradient and a larger update.

  • $\nabla_\theta \log \pi(y_w \mid p) - \nabla_\theta \log \pi(y_l \mid p)$: This is the difference between the gradients of the log probabilities of the winning and losing responses. The DPO objective pushes the model to increase the log probability of the winning response and decrease the log probability of the losing response.

    The intuition is that the gradient updates the policy to increase the likelihood of winning responses ($y_w$) and decrease the likelihood of losing responses ($y_l$). The magnitude of this update is weighted by how "incorrectly" the current policy ranks the paired responses, ensuring that more egregious errors in preference are corrected more strongly.
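
For concreteness, here is a minimal PyTorch sketch of the DPO loss in Equation (1); the variable names are illustrative, and the per-sequence (token-summed) log-probabilities under the policy and reference models are assumed to have been computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss given summed log-probs of winning (w) and losing (l) responses.

    Each argument is a tensor of shape (batch,) holding log pi(y | p) summed over tokens.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards (useful for monitoring preference accuracy during training).
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    preference_accuracy = (reward_w > reward_l).float().mean()
    return loss, preference_accuracy
```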

5. Experimental Setup

ACT is evaluated as a tuning approach for three complex conversational information-seeking tasks to assess its ability to learn optimized implicit selection for agent-side clarification question asking in a sample-efficient manner. The base model for tuning experiments is Zephyr 7B-β, a version of Mistral 7B instruction-tuned on UltraChat and aligned to human preferences on UltraFeedback.
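
As an illustration of how a multi-turn context might be serialized into a prompt for this base model, the sketch below uses the Hugging Face chat-template API; the checkpoint identifier (HuggingFaceH4/zephyr-7b-beta) and the example messages are assumptions for illustration, not details taken from the paper.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for Zephyr 7B-beta; adjust to the actual model used.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are an Assistant answering questions from a User. "
                                  "Ask a clarifying question if the request is ambiguous."},
    {"role": "user", "content": "Show details about singers ordered by age from the oldest to the youngest."},
]

# Serialize the dialogue context into the model's expected prompt format.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```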

5.1. Datasets

The experiments investigate three mixed-initiative conversation tasks where a user interacts with an assistant to retrieve information. In each task setup, the user's query may be underspecified, and the assistant's response can either be a clarifying question or a direct answer. For each task, initial rejected responses for the preference dataset construction (Algorithm 1) are synthesized using Gemini Ultra as the conditional generation model (M).

  • PACIFIC: Conversational QA for Tabular Data

    • Description: PACIFIC is a task for proactive conversational question answering grounded on a mixture of tabular and textual financial data (Deng et al., 2022). This involves generating answers that could be correct words from a given span, from multiple spans, or providing a correct arithmetic expression.
    • Domain: Financial data.
    • Characteristics: Requires reasoning over structured (tabular) and unstructured (textual) information, often involving numerical calculations. Ambiguity can arise from underspecified financial terms or data points.
    • Evaluation: The official evaluation uses DROP F1, a numeracy-focused token overlap metric.
  • Abg-CoQA: Conversational QA for Machine Reading Comprehension

    • Description: Abg-CoQA is a conversational question answering dataset focused on disambiguation in machine reading comprehension (Guo et al., 2021).
    • Domain: General text understanding.
    • Characteristics: The task involves answering questions based on provided text passages, where questions might be ambiguous and require clarification from the agent.
    • Evaluation: Since there are no arithmetic expressions, embedding-based semantic distance with SentenceBERT is used as an evaluation metric to measure QA performance more flexibly.
  • AmbigSQL: Ambiguous Conversational Text-to-SQL Generation

    • Description: AmbigSQL is a novel task proposed in this paper for SQL-grounded conversational disambiguation. It was constructed by systematically perturbing unambiguous queries from Spider (Yu et al., 2018), a popular single-turn text-to-SQL benchmark. This perturbation results in paired training examples suitable for contrastive RL tuning.

    • Motivation: The motivation for AmbigSQL is the idea that disambiguation can significantly improve task performance in complex code generation tasks. The authors found a performance gap of up to 45.8% in SQL generation with and without clarifications, highlighting the necessity of clarifying questions.

    • Construction: AmbigSQL introduces three types of ambiguous information requests by prompting an LLM (Gemini Ultra 1.0) to synthesize perturbed versions of Spider queries along with appropriate clarifying questions:

      1. Ambiguous Requested Information: (e.g., "Show details about singers ordered by age from the oldest to the youngest")
      2. Ambiguous Requested Population: (e.g., "Which ones who live in the state of Indiana?")
      3. Ambiguous Presentation of Results: (e.g., "Show name, country, age for all singers ordered by age")
    • For ambiguous queries, conversations are associated with "ground truth" states at two timesteps: $t_1$ (ambiguous user request, synthesized clarification question $r_1$, action CLARIFY) and $t_2$ (disambiguated user request, correct ground truth SQL query $r_2$, action ANSWER).

    • Evaluation: Each trajectory is evaluated by whether the final proposed SQL query matches the ground truth query's execution result.

    • The following are the results from Table A10 of the original paper:

      <div class="table-wrapper"><table>
      <thead>
      <tr><td></td><td>Train</td><td>Dev</td><td>Test</td></tr>
      </thead>
      <tbody>
      <tr><td>Num. Unambiguous Requests</td><td>7,000</td><td>1,034</td><td>1,034</td></tr>
      <tr><td>Num. Ambiguous Requests</td><td>7,000</td><td>1,034</td><td>1,034</td></tr>
      <tr><td>Num. Unique Schemas</td><td>1,056</td><td>145</td><td>145</td></tr>
      <tr><td>Types of Ambiguity</td><td>3</td><td>3</td><td>3</td></tr>
      </tbody>
      </table></div>

The table above provides an overview of the AmbigSQL dataset, showing the number of unambiguous and ambiguous requests, unique database schemas, and types of ambiguity across training, development, and test sets.

*   The following are the results from Table A12 of the original paper:

    <div class="table-wrapper"><table>
    <thead>
    <tr>
    <td colspan="4">[Database Schema Omitted] The target SQL query is the following:</td>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td></td>
    <td>SELECT professional_id , last_name , cell_number FROM Professionals</td>
    <td></td>
    <td></td>
    </tr>
    <tr>
    <td colspan="4">WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,</td>
    </tr>
    <tr>
    <td colspan="4">T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON</td>
    </tr>
    <tr>
    <td colspan="4">T1.professional_id = T2.professional_id</td>
    </tr>
    <tr>
    <td colspan="4">GROUP BY T1.professional_id HAVING count(*) > 2</td>
    </tr>
    <tr>
    <td colspan="4">Here is a clear request that would correspond to this SQL query:</td>
    </tr>
    <tr>
    <td colspan="4">Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List</td>
    </tr>
    <tr>
    <td colspan="4">his or her id, last name and cell phone."</td>
    </tr>
    <tr>
    <td colspan="4">Here is the same request converted into an ambiguous format by underspecifying the target population:</td>
    </tr>
    <tr>
    <td colspan="4">"Which ones live in the state of Indiana or have done treatment on more than 2 treatments?"</td>
    </tr>
    <tr>
    <td colspan="4">Here is an appropriate clarifying question to recover the clear request from the ambiguous request:</td>
    </tr>
    <tr>
    <td colspan="4">"Are you asking about the Professionals?"</td>
    </tr>
    </tbody>
    </table></div>

Table A12 shows an in-context example for creating ambiguous information requests where the target population is underspecified. It demonstrates a clear request, its ambiguous version, and an appropriate clarifying question.

*   The following are the results from Table A13 of the original paper:

    <div class="table-wrapper"><table>
    <thead>
    <tr>
    <td colspan="4">[Database Schema Omitted] The target SQL query is the following:</td>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td></td>
    <td>SELECT professional_id , last_name , cell_number FROM Professionals</td>
    <td></td>
    <td></td>
    </tr>
    <tr>
    <td colspan="4">WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name ,</td>
    </tr>
    <tr>
    <td colspan="4">T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON</td>
    </tr>
    <tr>
    <td colspan="4">T1.professional_id = T2.professional_id</td>
    </tr>
    <tr>
    <td colspan="4">GROUP BY T1.professional_id HAVING count(*) > 2</td>
    </tr>
    <tr>
    <td colspan="4">Here is a clear request that would correspond to this SQL query:</td>
    </tr>
    <tr>
    <td colspan="4">Which professionals live in the state of Indiana or have done treatment on more than 2 treatments? List</td>
    </tr>
    <tr>
    <td colspan="4">his or her id, last name and cell phone."</td>
    </tr>
    <tr>
    <td colspan="4">Here is the same request converted into an ambiguous format by underspecifying the target columns:</td>
    </tr>
    <tr>
    <td colspan="4">Which professionals live in the state of Indiana or have done treatment on more than 2 treatments?"</td>
    </tr>
    <tr>
    <td colspan="4">Here is an appropriate clarifying question to recover the clear request from the ambiguous request:</td>
    </tr>
    <tr>
    <td colspan="4">"Which information of the professionals do you want to know?"</td>
    </tr>
    </tbody>
    </table></div>

Table A13 presents another in-context example, this time focusing on ambiguity in the requested columns, illustrating a clear request, its ambiguous counterpart, and the corresponding clarifying question.

*   The following are the results from Table A14 of the original paper:

    <div class="table-wrapper"><table>
    <thead>
    <tr>
    <td>Model</td>
    <td>Ambiguous Request Execution Match</td>
    <td>Execution Match with Clarification Turns</td>
    <td></td>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td>Gemini Pro</td>
    <td>28.5</td>
    <td>68.7</td>
    <td></td>
    </tr>
    <tr>
    <td>Gemini Ultra</td>
    <td>31.2</td>
    <td>77.0</td>
    <td></td>
    </tr>
    </tbody>
    </table></div>

Table A14 shows the impact of ambiguity on SQL generation in AmbigSQL. LLMs (Gemini Pro and Gemini Ultra) struggle to produce correct SQL from ambiguous requests, but their performance significantly improves when disambiguation turns are included in the prompt.

*   The following are the results from Table A15 of the original paper:

    <div class="table-wrapper"><table>
    <thead>
    <tr>
    <td colspan="3">Examples from the AmbigSQL test set using the "concert_singer" database. Black text: context provided to an LLM as a prompt. Blue text: ground-truth single-turn response. Purple text: target trajectory which should be achieved through simulation.</td>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td>Prompt:</td>
    <td colspan="2">Resulting Ground Truth Trajectory:</td>
    </tr>
    <tr>
    <td>"Database schema: (...)</td>
    <td colspan="2">Assistant: SELECT Name , Song_Name FROM singer ORDER BY Age DESC</td>
    </tr>
    <tr>
    <td>User: show the names and song names of singers sorted by age from the oldest to the youngest"</td>
    <td colspan="2"></td>
    </tr>
    <tr>
    <td>"Database schema: (...)</td>
    <td>Assistant: Are you asking about the songs or the singers?</td>
    <td>Assistant: Are you asking about the songs or the singers?</td>
    </tr>
    <tr>
    <td>User: show details about singers ordered by age from the oldest to the youngest"</td>
    <td>User: The singers.</td>
    <td>User: The singers.</td>
    </tr>
    <tr>
    <td></td>
    <td></td>
    <td>Assistant: SELECT Name , Song_Name FROM singer ORDER BY Age DESC</td>
    </tr>
    </tbody>
    </table></div>

Table A15 provides examples from the AmbigSQL test set, demonstrating how an unambiguous request directly leads to an SQL query, while an ambiguous request triggers a clarification question and then a subsequent SQL query.

*   The following are the results from Table A16 of the original paper:

    <div class="table-wrapper"><table>
    <thead>
    <tr>
    <td colspan="3">Examples from the AmbigSQL test set using the "farm" database. Black text: context provided to an LLM as a prompt. Blue text: ground-truth single-turn response. Purple text: target trajectory which should be achieved through simulation.</td>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td>Prompt:</td>
    <td colspan="2">Resulting Ground Truth Trajectory:</td>
    </tr>
    <tr>
    <td>"Database schema: (...)</td>
    <td colspan="2">Assistant: SELECT * FROM farm WHERE location_id IN ( SELECT id FROM location WHERE region = 'South')</td>
    </tr>
    <tr>
    <td>User: show all the information of the farm in the south"</td>
    <td colspan="2"></td>
    </tr>
    <tr>
    <td>"Database schema: (...)</td>
    <td>Assistant: Are you asking about the location or the items?</td>
    <td>Assistant: Are you asking about the location or the items?</td>
    </tr>
    <tr>
    <td>User: show details about the farm in the south"</td>
    <td>User: The farm location.</td>
    <td>User: The farm location.</td>
    </tr>
    <tr>
    <td></td>
    <td></td>
    <td>Assistant: SELECT * FROM farm WHERE location_id IN ( SELECT id FROM location WHERE region = 'South')</td>
    </tr>
    </tbody>
    </table></div>

Table A16 shows another set of AmbigSQL examples, reinforcing the pattern of direct SQL generation for clear requests and clarification followed by SQL for ambiguous ones.

5.2. Evaluation Setup

The evaluation focuses on ACT's ability to reason about ambiguity to achieve conversational goals, assessed along two primary dimensions: Agent task performance and Implicit ambiguity recognition.

5.2.1. Agent Task Performance

This dimension evaluates how well ACT improves the multi-turn task completion capabilities of the agent.

  • Turn-level evaluation:
    • The model's immediate response is compared to the ground truth utterance given in response to the user's query.
    • Task-specific heuristics (defined in Section 4.1) are used.
    • This mirrors the static single-turn evaluations used in PACIFIC and Abg-CoQA's original proposals.
  • Multi-turn evaluation (Trajectory outcomes):
    • To assess multi-turn capabilities, a multi-turn evaluation scheme is proposed for trajectory outcomes across all three tasks.
    • If the sampled response from an LLM is a clarifying question, a user response is simulated (using $U$), and another response is sampled from the LLM until it attempts to answer the original query.
    • The final outcome of this simulated trajectory is evaluated against the user's ground truth information-seeking goal.
    • $A$ (Action Classifier) and $U$ (User Simulator) are used for simulation as described in Section 3.2.2.
    • Post-Clarify F1 / Execution Match: For PACIFIC and AmbigSQL, task performance is also computed specifically on simulated responses where the model has previously asked clarifying questions. This provides a more fine-grained measure of the model's ability to reason about its own clarifications.

5.2.2. Implicit Ambiguity Recognition

This dimension examines whether LLMs can implicitly recognize and reason about ambiguity.

  • Metric: Dialogue act accuracy, primarily Macro F1.
  • Assumption: Given access to ground-truth ambiguity labels, if a user's request is truly ambiguous, the model should generate a clarifying question; otherwise, it should directly provide the requested information. This directly measures the model's ability to correctly classify the dialogue act required by the context. Macro F1 is chosen due to potentially imbalanced classes (more unambiguous than ambiguous cases) in PACIFIC and Abg-CoQA.

5.2.3. Detailed Evaluation Metrics (from Appendix D)

Action-level evaluation metrics:

  • Accuracy:
    • Conceptual Definition: Measures the percentage of correctly classified implicit actions (e.g., whether the model correctly identified if it should CLARIFY or ANSWER).
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correct action predictions}}{\text{Total number of predictions}} $
    • Symbol Explanation:
      • Number of correct action predictions: The count of instances where the model's predicted dialogue act matches the ground truth dialogue act.
      • Total number of predictions: The total number of dialogue acts evaluated.
  • Macro F1:
    • Conceptual Definition: An unweighted average of the F1 score for each individual action class (CLARIFY, ANSWER). This is particularly useful when class distributions are imbalanced, as it treats all classes equally, preventing dominant classes from skewing the overall metric.
    • Mathematical Formula: $ \text{Macro F1} = \frac{1}{|C|} \sum_{c \in C} \text{F1}_c $, where $ \text{F1}_c = 2 \cdot \frac{\text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c} $, $ \text{Precision}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Positives}_c} $, and $ \text{Recall}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Negatives}_c} $.
    • Symbol Explanation:
      • $|C|$: The total number of unique action classes (e.g., 2 for [CLARIFY, ANSWER]).
      • $c$: An individual action class.
      • $\text{F1}_c$: The F1 score calculated for class $c$.
      • $\text{Precision}_c$: The proportion of correctly predicted positive instances for class $c$ out of all instances predicted as positive for class $c$.
      • $\text{Recall}_c$: The proportion of correctly predicted positive instances for class $c$ out of all actual positive instances of class $c$.
      • $\text{True Positives}_c$: Number of instances of class $c$ correctly predicted as class $c$.
      • $\text{False Positives}_c$: Number of instances not of class $c$ incorrectly predicted as class $c$.
      • $\text{False Negatives}_c$: Number of instances of class $c$ incorrectly predicted as not class $c$.
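
The action-level metrics above can be computed directly with scikit-learn; the label lists in this sketch are purely illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score

gold = ["CLARIFY", "ANSWER", "ANSWER", "CLARIFY", "ANSWER"]   # ground-truth actions (illustrative)
pred = ["CLARIFY", "ANSWER", "CLARIFY", "CLARIFY", "ANSWER"]  # actions inferred from model responses

accuracy = accuracy_score(gold, pred)
macro_f1 = f1_score(gold, pred, average="macro")              # unweighted mean of per-class F1
print(f"Accuracy: {accuracy:.3f}, Macro F1: {macro_f1:.3f}")
```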

Content-level evaluation metrics:

  • Turn-level DROP F1:
    • Conceptual Definition: Measures the F1 overlap between the model's immediate response and the ground truth utterance for a single turn, specifically adapted for Discrete Reasoning Over Paragraphs (DROP) tasks involving numeric answers and span extraction. It's designed to handle complex answers that might involve multiple spans or arithmetic expressions.
    • Mathematical Formula: (The paper refers to Deng et al. (2022) and Dua et al. (2019) for DROP F1, but doesn't provide the explicit formula. The DROP F1 metric is a token-level F1 score calculated after normalizing numbers and handling multiple correct answers. It's often defined as the maximum F1 over all possible correct answers.) $ \text{DROP F1} = \max_{\text{GT} \in \text{GroundTruthAnswers}} \left( \frac{2 \cdot \text{Precision}(\text{Pred}, \text{GT}) \cdot \text{Recall}(\text{Pred}, \text{GT})}{\text{Precision}(\text{Pred}, \text{GT}) + \text{Recall}(\text{Pred}, \text{GT})} \right) $ Where Precision and Recall are calculated at the token level between the predicted answer and the ground truth answer after normalization steps (e.g., number normalization, ignoring punctuation).
    • Symbol Explanation:
      • Pred: The predicted answer string by the model.
      • GT: A single ground truth answer string from the set of possible correct answers.
      • GroundTruthAnswers: The set of all acceptable ground truth answers for a given question.
      • Precision(Pred, GT): Token-level precision between Pred and GT.
      • Recall(Pred, GT): Token-level recall between Pred and GT.
  • Trajectory-level DROP F1:
    • Conceptual Definition: Measures the DROP F1 of the final answer obtained at the end of a simulated multi-turn conversation trajectory (i.e., after all clarifying questions and subsequent user responses) compared to the ground truth information goal. This assesses the model's ability to achieve the ultimate task goal through conversational interaction.
  • Post-Clarification DROP F1:
    • Conceptual Definition: A subset of Trajectory-level DROP F1 specifically for cases where the model did ask clarifying questions during the simulated trajectory. It evaluates the quality of the final answer after the model has engaged in disambiguation, measuring the effectiveness of its clarification strategy.
  • Turn-level Similarity (for Abg-CoQA):
    • Conceptual Definition: Measures the semantic similarity between the model's immediate response and the ground truth response for a single turn, typically using embedding-based similarity (e.g., cosine similarity of SentenceBERT embeddings). This is useful when exact token matches are not expected, but semantic equivalence is desired.
  • Trajectory-level Similarity (for Abg-CoQA):
    • Conceptual Definition: Measures the semantic similarity between the final answer of a simulated multi-turn conversation trajectory and the ground truth information goal, using embedding-based similarity. Analogous to Trajectory-level DROP F1 but for tasks where semantic meaning is more important than exact factual extraction.
  • Trajectory-level Execution Match (for AmbigSQL):
    • Conceptual Definition: The percentage of multi-turn conversation trajectories where the final generated SQL query, when executed against the database, produces the correct result as defined by the ground truth SQL query. This is a strict, objective measure of correctness for text-to-SQL tasks.
  • Post-Clarification Execution Match (for AmbigSQL):
    • Conceptual Definition: The percentage of multi-turn conversation trajectories that included clarification turns and resulted in a SQL query that produced the correct execution result. This highlights the impact of clarification on achieving the correct SQL output.
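
To make these content-level checks concrete, here is a simplified sketch (ours, not the paper's evaluation harness, and much lighter than the official DROP scorer): a max-over-references token-overlap F1 in the spirit of DROP F1, and an execution-match check for SQL outputs against an in-memory SQLite database. Turn-level and trajectory-level variants differ only in which prediction is scored (the immediate response vs. the final answer of the simulated trajectory); the toy schema and strings below are hypothetical.

```python
import re
import sqlite3
from collections import Counter

def _normalize(text):
    """Lowercase, strip most punctuation, and split into tokens (far simpler than the official DROP scorer)."""
    return re.sub(r"[^\w\s.]", " ", text.lower()).split()

def token_f1(prediction, references):
    """Max token-overlap F1 between a prediction and any acceptable reference answer."""
    best = 0.0
    pred_tokens = _normalize(prediction)
    for ref in references:
        ref_tokens = _normalize(ref)
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

def execution_match(pred_sql, gold_sql, conn):
    """True if both queries execute and return the same rows (order-insensitive)."""
    try:
        pred_rows = sorted(conn.execute(pred_sql).fetchall())
        gold_rows = sorted(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False
    return pred_rows == gold_rows

# Illustrative usage with a toy schema (not taken from AmbigSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plans (name TEXT, amount REAL)")
conn.executemany("INSERT INTO plans VALUES (?, ?)", [("benefit", 5.1), ("contribution", 0.6)])
print(token_f1("5.1 million", ["$5.1 million", "$0.6 million"]))           # 1.0
print(execution_match("SELECT name FROM plans WHERE amount > 1",
                      "SELECT name FROM plans WHERE amount = 5.1", conn))  # True
```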

5.3. Baselines

The paper compares ACT against various prompting and tuning baselines.

5.3.1. Prompting Baselines

These baselines use frontier LLMs with different in-context learning (ICL) strategies. For all prompting baselines, 10 conversations are used as in-context examples.

  • Models: Gemini 1.5 Pro, Gemini 1.5 Flash, Claude 3.5 Sonnet, Claude 3.0 Haiku.
  • Prompting Frameworks:
    1. Standard Prompting: Uses the same instruction formatting as used for SFT, providing structured examples.
      • The following are the results from Table A17 of the original paper:

        You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted]
        User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
        Assistant: ['What kind of defined plans are you asking about?']
        User: The defined benefit plans and the defined contribution plan respectively.
        Assistant: ['5.1 million', '0.6 million']
        User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
        Assistant: 0.6/5.1 = 11.76
        User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
        Assistant: ['What kind of change are you asking about?']
        User: The pension.
        Assistant: ['(39,145)', '49,361']
        User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
        Assistant: 49,361 - (39,145) = 88506
        User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
        Assistant: ['Which rate of increase and decrease are you talking about?']
        User: The future salary growth.
        Assistant: ['7,572', '(6,919)']
        User: What is the difference between them?
        Assistant: 7,572 - (6,919) = 14491

Table A17 shows an example of Standard in-context prompting for PACIFIC. The purple text (first line) is the system instruction, and the blue text shows the model's expected generation at inference time. The example provides a sequence of user questions and assistant responses, demonstrating both direct answers and clarifying questions.

2.  **Chain-of-Thought (CoT) Prompting (Wei et al., 2022):** Integrates `CoT` reasoning into the standard prompt format, asking the `LLM` to first produce a reasoning chain (`e.g.`, stating whether the context is ambiguous) before generating the response.
    *   The following are the results from Table A18 of the original paper:

        
You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity. [Table and Passage Omitted]
User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question is not ambiguous.
Assistant: 0.6/5.1 = 11.76
User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question was ambiguous.
Assistant: ['What kind of change are you asking about?']
User: The pension.
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question is not ambiguous.
Assistant: ['(39,145)', '49,361']
User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question is not ambiguous.
Assistant: 49,361 - (39,145) = 88506
User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
Instruction: If the user's question is ambiguous, ask an appropriate clarifying question. Otherwise, directly answer the user's question using the information from the passage context and the table. Let's think step by step.
Reasoning: The user's question was ambiguous.
Assistant: ['Which rate of increase and decrease are you talking about?']

Table A18 provides an example of Chain-of-Thought prompting for PACIFIC. The model is instructed to first provide Reasoning about ambiguity before generating an Assistant response, illustrating how CoT is integrated into the dialogue flow.

3.  **Proactive MIPrompt (Deng et al., 2023b; Chen et al., 2023b):** A combination of `mixed-initiative prompting` and `proactive prompting`. The `LLM` conditions on a possible set of actions and uses interweaved natural language instructions describing which actions correspond to dialogue turns.
    *   The following are the results from Table A19 of the original paper:

        You are an Assistant answering questions from a User. You should either attempt to answer the question or ask a clarifying question if there is any ambiguity.
        [Table and Passage Omitted]
        User: How much does the company expect to make contributions to the defined plans during the next fiscal year?
        The user's last question was ambiguous. The Assistant asks a clarifying question.
        Assistant: ['What kind of defined plans are you asking about?']
        User: The defined benefit plans and the defined contribution plan respectively.
        The user's last question was unambiguous. The Assistant directly answers the question.
        Assistant: ['5.1 million', '0.6 million']
        User: What is the value of the expected defined contribution plan as a percentage of the defined benefit plan during the next fiscal year?
        The user's last question was unambiguous. The Assistant directly answers the question.
        Assistant: 0.6/5.1 = 11.76
        User: How much would change if there is a 1% increase and a 1% decrease respectively in the discount rate?
        The user's last question was ambiguous. The Assistant asks a clarifying question.
        Assistant: ['What kind of change are you asking about?']
        User: The pension.
        The user's last question was unambiguous. The Assistant directly answers the question.
        Assistant: ['(39,145)', '49,361']
        User: What is the difference in that amount if the discount rate had a 1% increase and a 1% decrease?
        The user's last question was unambiguous. The Assistant directly answers the question.
        Assistant: 49,361 - (39,145) = 88506
        User: How much would the pension change if there is a 1% increase and a 1% decrease respectively?
        Actions: ["Directly Answer", "Ask a Clarification Question"]
        Prompt: Given the task background and the conversation history, please use appropriate actions to generate the response.
        Response: The user's last question was ambiguous. The Assistant asks a clarifying question.
        Assistant: ['Which rate of increase and decrease are you talking about?']

Table A19 shows an example of Proactive Mixed-Initiative prompting for PACIFIC. This style explicitly provides the model with available actions and describes the action taken in each turn, guiding the LLM to generate responses with specific pragmatic intents.
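
Across these prompting baselines, the prompt is essentially a concatenation of the system instruction, roughly ten in-context example conversations, and the current unanswered turn (with Proactive MIPrompt additionally interleaving action descriptions). As a rough illustration only, here is a sketch of assembling the standard format; the helper names and example data are ours, not the authors' code, and only the system instruction mirrors Table A17.

```python
SYSTEM_INSTRUCTION = (
    "You are an Assistant answering questions from a User. You should either attempt to "
    "answer the question or ask a clarifying question if there is any ambiguity."
)

def format_conversation(grounding, turns):
    """Render one grounded example conversation as alternating User/Assistant lines."""
    lines = [grounding]
    for user_utt, assistant_utt in turns:
        lines.append(f"User: {user_utt}")
        lines.append(f"Assistant: {assistant_utt}")
    return "\n".join(lines)

def build_icl_prompt(examples, current_grounding, current_question):
    """System instruction + in-context conversations + the new, unanswered turn."""
    blocks = [SYSTEM_INSTRUCTION]
    blocks += [format_conversation(grounding, turns) for grounding, turns in examples]
    blocks.append(f"{current_grounding}\nUser: {current_question}\nAssistant:")
    return "\n\n".join(blocks)

# Hypothetical single example (real runs would use ~10 full conversations with grounding).
examples = [("[Table and Passage Omitted]",
             [("How much does the company expect to contribute to the defined plans?",
               "What kind of defined plans are you asking about?")])]
print(build_icl_prompt(examples, "[Table and Passage Omitted]",
                       "How much would the pension change with a 1% increase?"))
```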

5.3.2. Tuning Baselines

These approaches involve fine-tuning the Zephyr 7B-β model.

  • Supervised Fine-Tuning (SFT): The model is fine-tuned on the ground truth responses from each dataset's training split. This is a standard method for adapting LLMs to specific tasks.
  • Iterative Reasoning Preference Optimization (IRPO): An on-policy DPO variant, IRPO has shown effectiveness in reasoning tasks. It's evaluated on PACIFIC and AmbigSQL (quantitative reasoning tasks).
  • DPO-Dist (DPO-Distillation): An off-policy DPO approach where winning responses (YwY_w) are sampled from a higher-capacity model, and losing responses (YlY_l) are also generated, often from a lower-capacity model or by prompting a model to generate a "bad" response. This method distills preferences from powerful LLMs into smaller models. The paper presents DPO-Dist results in Appendix B, specifically using Gemini Pro vs. Flash and Claude Sonnet vs. Haiku for generating preferences.
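
To illustrate the data layout implied by the DPO-Dist baseline, here is a minimal sketch (ours, with hypothetical stand-ins for the higher- and lower-capacity models) of assembling one off-policy preference record; the prompt/chosen/rejected field names follow a convention common in open-source DPO trainers rather than anything specified in the paper.

```python
def build_dpo_dist_pair(context, strong_model_generate, weak_model_generate):
    """Winning response (y_w) from the stronger model, losing response (y_l) from the weaker one."""
    return {
        "prompt": context,
        "chosen": strong_model_generate(context),   # y_w
        "rejected": weak_model_generate(context),   # y_l
    }

# Toy stand-ins for the two models, just to show the record structure.
strong_model = lambda ctx: "Which year are you asking about?"
weak_model = lambda ctx: "$909"
record = build_dpo_dist_pair("User: What were the total liabilities of IMFT?",
                             strong_model, weak_model)
print(record)
```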

6. Results & Analysis

The experimental results demonstrate ACT's effectiveness in data-efficient tuning scenarios across various conversational tasks, especially when dealing with ambiguity. The evaluation covers both action-level performance (how well the model recognizes ambiguity and takes the correct conversational action) and content-level performance (how well it achieves the ultimate task goal).

6.1. Core Results Analysis

6.1.1. Conversational QA with Tabular Grounding (PACIFIC)

The following are the results from Table 1 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Macro F1) | Content-level (Turn F1 / Traj. F1 / Post-Clarify F1)
Base Model Approach Conversations Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Gemini Pro Standard ICL 10 81.4 59.7 58.7 49.7
Claude Sonnet Standard ICL 10 71.9 43.7 42.0 28.5
Gemini Pro SFT 50 71.2 51.8 45.7 9.9
Gemini Pro SFT 100 75.2 64.3 54.6 8.5
Gemini Pro SFT 250 88.0 67.4 59.3 10.2
Zephyr 7B-β SFT 50 69.0 57.8 61.3 43.5
Zephyr 7B-β IRPO 50 67.7 59.1 56.7 34.4
Zephyr 7B-β ACT (ours) 50 82.2 62.8 61.9 57.2
Zephyr 7B-β SFT 100 82.3 58.6 60.3 49.9
Zephyr 7B-β IRPO 100 84.5 60.4 55.2 38.2
Zephyr 7B-β ACT (ours) 100 86.0 65.0 62.0 57.4
Zephyr 7B-β SFT 250 86.9 65.1 63.3 56.7
Zephyr 7B-β IRPO 250 85.4 64.9 58.4 40.3
Zephyr 7B-β ACT (ours) 250 89.6 68.1 65.7 62.0

Table 1 shows experimental results on PACIFIC's public evaluation set. ACT consistently achieves the strongest performance across all data-efficient conditions (50, 100, 250 conversations) and all metrics (Macro F1, Turn F1, Traj. F1, Post-Clarify F1), outperforming SFT and IRPO.

  • Key Findings: ACT significantly outperforms SFT and IRPO in all data-efficient settings.
    • For 50 conversations (most data-scarce): ACT achieves an 82.2 Macro F1 (action-level), a 19.1% relative improvement over SFT's 69.0. In multi-turn task performance (Traj. F1), ACT reaches 61.9, showing a 35.7% relative improvement over Gemini Pro SFT's 45.6.
    • ACT's performance in these limited data settings often matches or outperforms frontier LLMs used with in-context learning (ICL) (e.g., Gemini Pro and Claude Sonnet), despite ACT models having zero in-context examples during inference.
    • The Post-Clarify F1 metric, which measures performance after clarification, is notably higher for ACT (e.g., 57.2 vs. SFT's 43.5 at 50 conversations), indicating ACT's superior ability to reason about and utilize its own clarification questions.
  • Conclusion: The results highlight the importance of on-policy learning and multi-turn trajectory simulation for improving multi-turn goal completion.

6.1.2. Abg-CoQA: Conversational QA for Machine Reading Comprehension

The following are the results from Table 2 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Macro F1) | Content-level (Turn Similarity / Traj. Similarity)
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro Standard ICL 10 55.5 67.0 72.2
Claude Sonnet Standard ICL 10 66.0 50.1 54.3
Zephyr 7B-β SFT 50 44.6 53.3 64.2
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Zephyr 7B-β SFT 100 52.6 63.1 69.4
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Zephyr 7B-β SFT 250 53.5 64.0 66.2
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

Table 2 presents the Abg-CoQA test set evaluation results.

  • Key Findings: ACT achieves the strongest performance in terms of task-specific metrics (Turn Similarity and Traj. Similarity) across all three data settings.
    • For 50 conversations, ACT significantly improves Turn Similarity (66.2 vs. SFT's 53.3) and Traj. Similarity (68.8 vs. SFT's 64.2).
    • Even when SFT slightly outperforms ACT in Macro F1 (implicit action recognition) at 100 and 250 conversations, ACT consistently leads in multi-turn goal completion (Turn Similarity and Traj. Similarity).
  • Conclusion: ACT's approach leads to improved multi-turn reasoning and task performance, indicating that its on-policy sampling and multi-turn simulation are effective even for tasks where action recognition might be competitive with simpler methods. The paper notes that low inter-annotator agreement in Abg-CoQA might affect action-level metrics.

6.1.3. AmbigSQL: Ambiguous Conversational Text-to-SQL Generation

The following are the results from Table 3 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Accuracy / Macro F1) | Content-level (Execution Match / PC Execution Match)
Base Model Approach Conversations Accuracy ↑ Macro F1 ↑ Execution Match ↑ PC Execution Match ↑
Gemini Pro Standard ICL 10 72.1 70.9 63.5 75.2
Claude Sonnet Standard ICL 10 68.5 63.8 66.5 72.4
Zephyr 7B-β SFT 50 77.4 77.4 21.9 13.9
Zephyr 7B-β IRPO 50 91.0 91.0 27.8 30.8
Zephyr 7B-β ACT (ours) 50 80.8 80.7 43.6 38.1
Zephyr 7B-β SFT 100 97.2 97.2 43.3 34.3
Zephyr 7B-β IRPO 100 96.2 96.1 45.0 37.0
Zephyr 7B-β ACT (ours) 100 99.2 99.3 48.0 49.6
Zephyr 7B-β SFT 250 99.8 99.7 51.0 50.7
Zephyr 7B-β IRPO 250 97.0 97.1 49.7 45.6
Zephyr 7B-β ACT (ours) 250 99.9 99.8 52.3 53.0
Zephyr 7B-β SFT 14,000 (All) 99.8 99.8 63.1 60.4

Table 3 presents the AmbigSQL test set evaluation results.

  • Key Findings:
    • ACT achieves the strongest task performance (Execution Match and PC Execution Match) within each data setting, especially with scarce data.
    • For 50 conversations, ACT achieves 43.6 Execution Match and 38.1 PC Execution Match, substantially outperforming SFT (21.9 and 13.9) and IRPO (27.8 and 30.8).
    • Even though frontier LLMs (Gemini Pro, Claude Sonnet) achieve strong downstream SQL performance with ICL, Zephyr tuned with ACT shows impressive gains for a smaller model.
    • The PC Execution Match (performance after clarification) shows particularly large improvements for ACT in low-data regimes, indicating ACT's ability to leverage clarifications effectively for correct SQL generation.
  • Conclusion: ACT leads to significant relative improvements in multi-turn task performance for text-to-SQL. The paper notes that SQL generation heavily benefits from model scale, suggesting that applying ACT to larger models could yield even greater performance.

6.1.4. ACT In-The-Wild: Learning Without Dialogue Action Supervision

The following are the results from Table 4 of the original paper:

Task Adaptation Environment (Base Model / Framework / Action Supervision / Tuning Ex.) | Action-level (Macro F1) | Content-level (Turn F1 / Traj. F1 / Post-Clarify F1)
Base Model Framework Action Supervision Tuning Ex. Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Zephyr 7B-β SFT NA 50 69.0 57.8 61.3 43.5
Zephyr 7B-β ACT Crowdsourced 50 82.2 62.8 61.9 57.2
Zephyr 7B-β ACT Pseudo-labeled 50 80.1 62.4 61.1 54.7
Zephyr 7B-β SFT NA 100 82.3 58.6 60.3 49.9
Zephyr 7B-β ACT Crowdsourced 100 86.0 65.0 62.0 57.4
Zephyr 7B-β ACT Pseudo-labeled 100 84.8 63.5 61.5 56.1
Zephyr 7B-β SFT NA 250 86.9 65.1 63.3 56.7
Zephyr 7B-β ACT Crowdsourced 250 89.6 68.1 65.7 62.0
Zephyr 7B-β ACT Pseudo-labeled 250 89.0 68.1 64.9 61.0

Table 4 examines ACT's performance on PACIFIC when action labels are not available, using 0-shot Gemini Pro for pseudo-labeling.

  • Key Findings:
    • When using pseudo-labeled action supervision, the zero-shot labels produced by Gemini 1.5 Pro show "astonishingly high agreement" (98.5%) with the crowdsourced ground truth action labels.
    • The empirical difference in performance between crowdsourced and pseudo-labeled ACT is negligible across all metrics and data settings. For example, at 250 conversations, crowdsourced ACT achieves 89.6 Macro F1 and 62.0 Post-Clarify F1, while pseudo-labeled ACT achieves 89.0 and 61.0, respectively.
  • Conclusion: This demonstrates that ACT can be highly effective for adaptation in "in-the-wild" settings where explicit action label supervision is unavailable, by leveraging LLMs for pseudo-labeling.
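
As a rough sketch of what the pseudo-labeling step described above might look like, the snippet below classifies an assistant turn as CLARIFY or ANSWER with a zero-shot prompt; the prompt wording is ours, and `call_llm` is a hypothetical stand-in for a Gemini 1.5 Pro API call.

```python
PSEUDO_LABEL_TEMPLATE = """Given the conversation context and the assistant's response below, classify the
assistant's dialogue action as CLARIFY (it asks a clarifying question) or ANSWER
(it directly answers the user). Respond with a single word.

Context:
{context}

Assistant response:
{response}

Action:"""

def pseudo_label_action(context, response, call_llm):
    """Map an unlabeled assistant turn to a dialogue-act label via a zero-shot LLM call."""
    raw = call_llm(PSEUDO_LABEL_TEMPLATE.format(context=context, response=response))
    return "CLARIFY" if "CLARIFY" in raw.upper() else "ANSWER"

# Toy stand-in for the LLM call, just to make the sketch runnable.
fake_llm = lambda prompt: "CLARIFY"
print(pseudo_label_action("User: What were the total liabilities of IMFT?",
                          "Which year are you asking about?", fake_llm))
```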

6.2. Ablation Studies

The following are the results from Table 5 of the original paper:

Approach | Macro F1 ↑ | Turn F1 ↑ | Traj. F1 ↑ | Post-Clarify F1 ↑
Action Importance
ACT w/ Random Actions 63.2 55.3 58.7 32.8
Ablation of ACT subcomponents
ACT w/o on-policy sampling 74.8 61.5 59.1 40.5
ACT w/ sampling but w/o simulation 81.4 60.8 60.2 50.1
ACT (full) 82.2 62.8 61.9 57.2
ACT with unaligned foundation models
Gemma 2B SFT 57.7 38.0 40.5 17.0
Gemma 2B ACT 62.7 42.6 44.0 24.8
Mistral 7B SFT 57.7 53.8 51.4 27.7
Mistral 7B ACT 75.7 58.1 57.6 31.9

Table 5 presents ablation studies conducted using PACIFIC's 50-conversation setting, examining the importance of ACT's various components. ACT (full) refers to the standard ACT implementation from Table 1 with 50 conversations (Macro F1: 82.2, Traj. F1: 61.9, Post-Clarify F1: 57.2).

6.2.1. Action Importance

  • ACT w/ Random Actions: This variant constructs contrastive pairs by randomly sampling both the winning and losing actions, instead of explicitly contrasting the ground truth action with its opposite.
    • Results: ACT w/ Random Actions achieves Macro F1 of 63.2, Traj. F1 of 58.7, and Post-Clarify F1 of 32.8. This is significantly lower than ACT (full).
    • Analysis: This demonstrates that explicit action-based preferences (i.e., contrasting the correct pragmatic action with an incorrect one) are crucial. Randomly chosen actions do not effectively guide the model towards learning optimal conversational policies.

6.2.2. Ablation of ACT Subcomponents

  • ACT w/o on-policy sampling: This variant evaluates DPO on the initially constructed D_pref (from Algorithm 1) without dynamically updating the winning/losing pairs through on-policy sampling. It's essentially an off-policy DPO approach applied to the initial action-based preference dataset.
    • Results: ACT w/o on-policy sampling achieves Macro F1 of 74.8, Traj. F1 of 59.1, and Post-Clarify F1 of 40.5. While it shows some improvement over SFT (Macro F1: 69.0), it performs worse than ACT (full).
    • Analysis: On-policy sampling is vital. The off-policy negative responses are not guaranteed to be within the language manifold of the current policy model, making it harder to overcome distribution shift and effectively steer the model. Dynamic updates with on-policy generated responses ensure more relevant and effective contrastive examples.
  • ACT w/ sampling but w/o simulation: This variant includes on-policy sampling but removes the multi-turn trajectory simulation component. The model's immediate response is still sampled, but its impact on the long-term conversational goal is not evaluated or used to update the winning/losing pairs.
    • Results: ACT w/ sampling but w/o simulation achieves Macro F1 of 81.4, Traj. F1 of 60.2, and Post-Clarify F1 of 50.1. This is better than ACT w/o on-policy sampling but still falls short of ACT (full).
    • Analysis: Trajectory-level simulation is critical for improving multi-turn performance, especially for the policy model's ability to reason about its own clarification questions. Without it, the model might learn to take correct immediate actions but fail to lead the conversation to a successful overall outcome. The Post-Clarify F1 difference (50.1 vs. 57.2) particularly highlights this, as the simulation explicitly penalizes poor outcomes after clarification.

6.2.3. ACT with Unaligned Foundation Models

This section examines if ACT's benefits are dependent on the base model being already aligned with human feedback (like Zephyr 7B-β).

  • Models: Gemma 2B (unaligned) and Mistral 7B (unaligned, the base for Zephyr).
  • Results:
    • Gemma 2B SFT: Macro F1 57.7, Traj. F1 40.5.
    • Gemma 2B ACT: Macro F1 62.7, Traj. F1 44.0. (Improvements of 5.0 and 3.5 respectively).
    • Mistral 7B SFT: Macro F1 57.7, Traj. F1 51.4.
    • Mistral 7B ACT: Macro F1 75.7, Traj. F1 57.6. (Improvements of 18.0 and 6.2 respectively).
  • Analysis: While Zephyr (an aligned Mistral variant) benefits more from ACT (larger absolute performance values), ACT still provides notable improvements for unaligned foundation models like Gemma 2B and raw Mistral 7B over their respective SFT baselines.
  • Conclusion: ACT is model agnostic and can improve performance regardless of prior alignment, although a better initialization (e.g., an already aligned model) can lead to higher overall performance ceilings.

6.3. Additional Experimental Results (from Appendix B)

The following are the results from Table A7 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Macro F1) | Content-level (Turn F1 / Traj. F1 / Post-Clarify F1)
Base Model Approach Conversations Macro F1 ↑ Turn F1 ↑ Traj. F1 ↑ Post-Clarify F1 ↑
Gemini Pro Standard Prompt 10 81.4 59.7 58.7 49.7
Gemini Pro Chain-of-Thought 10 86.3 66.3 17.1 19.2
Gemini Pro Proactive MIPrompt 10 78.9 63.4 61.1 18.9
Gemini Flash Standard Prompt 10 67.4 58.8 58.7 17.9
Gemini Flash Chain-of-Thought 10 77.1 62.0 16.9 20.0
Gemini Flash Proactive MIPrompt 10 76.8 64.0 62.0 24.4
Claude Sonnet Standard Prompt 10 71.9 43.7 42.0 28.5
Claude Sonnet Chain-of-Thought 10 80.0 37.2 13.0 6.8
Claude Sonnet Proactive MIPrompt 10 74.9 47.2 45.9 7.6
Claude Haiku Standard Prompt 10 46.9 26.4 26.2
Claude Haiku Chain-of-Thought 10 48.6 23.7 12.0 2.9
Claude Haiku Proactive MIPrompt 10 48.3 18.6 18.2 7.3
Gemini Pro SFT 50 71.2 51.8 45.7 9.9
Gemini Pro SFT 100 75.2 64.3 54.6 8.5
Gemini Pro SFT 250 88.0 67.4 59.3 10.2
Zephyr 7B-β SFT 50 69.0 57.8 61.3 43.5
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 75.5 61.7 55.7 30.8
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 74.8 62.0 56.3 31.9
Zephyr 7B-β IRPO 50 67.7 59.1 56.7 34.4
Zephyr 7B-β ACT (ours) 50 82.2 62.8 61.9 57.2
Zephyr 7B-β SFT 100 82.3 58.6 60.3 49.9
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 68.8 53.3 53.3 31.7
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 83.0 59.0 53.7 29.3
Zephyr 7B-β IRPO 100 84.5 60.4 55.2 38.2
Zephyr 7B-β ACT (ours) 100 86.0 65.0 62.0 57.4
Zephyr 7B-β SFT 250 86.9 65.1 63.3 56.7
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 65.6 53.6 54.1 30.9
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 82.8 43.3 38.6 19.6
Zephyr 7B-β IRPO 250 85.4 64.9 58.4 40.3
Zephyr 7B-β ACT (ours) 250 89.6 68.1 65.7 62.0

Table A7 expands on PACIFIC results, including Gemini Flash and Claude Haiku for prompting baselines and DPO-Dist as a tuning baseline. ACT still consistently shows the best performance. Notably, Chain-of-Thought prompting for frontier LLMs can sometimes lead to very low Trajectory F1, indicating that while reasoning chains might be explicit, they don't always translate to better multi-turn goal completion. DPO-Dist generally performs better than SFT but falls short of ACT.

The following are the results from Table A8 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Macro F1) | Content-level (Turn Similarity / Traj. Similarity)
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro Standard Prompt 10 55.5 67.0 72.2
Gemini Pro Chain-of-Thought 10 61.2 63.4 39.1
Gemini Pro Proactive MIPrompt 10 55.5 63.3 33.3
Gemini Flash Standard Prompt 10 52.6 62.5 67.4
Gemini Flash Chain-of-Thought 10 61.2 56.5 36.6
Gemini Flash Proactive MIPrompt 10 58.1 61.7 36.1
Claude Sonnet Standard Prompt 10 66.0 50.1 54.3
Claude Sonnet Chain-of-Thought 10 63.7 46.2 36.8
Claude Sonnet Proactive MIPrompt 10 57.2 60.8 32.9
Claude Haiku Standard Prompt 10 49.3 40.9 41.7
Claude Haiku Chain-of-Thought 10 46.2 30.7 28.0
Claude Haiku Proactive MIPrompt 10 45.2 34.5 31.4
Zephyr 7B-β SFT 50 44.6 53.3 64.2
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 46.9 57.2 61.2
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 44.7 57.9 61.5
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Zephyr 7B-β SFT 100 52.6 63.1 69.4
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 47.8 61.9 67.1
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 44.8 62.0 66.4
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Zephyr 7B-β SFT 250 53.5 64.0 66.2
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 46.0 61.9 66.3
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 46.3 62.6 67.0
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

Table A8 shows extended Abg-CoQA results. Again, ACT demonstrates superior multi-turn goal completion (Traj. Similarity) compared to SFT and DPO-Dist, even when ICL with Gemini Pro achieves high Traj. Similarity at 10 examples. The DPO-Dist variants, while outperforming SFT in some cases, do not consistently match ACT's performance.

The following are the results from Table A9 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Accuracy) | Content-level (Execution Match / PC Execution Match)
Base Model Approach Conversations Accuracy ↑ Execution Match ↑ PC Execution Match ↑
Gemini Pro Standard Prompt 10 72.1 63.5 75.2
Gemini Flash Standard Prompt 10 75.6 64.2 66.2
Claude Sonnet Standard Prompt 10 68.5 66.5 72.4
Claude Haiku Standard Prompt 10 73.8 57.3 65.3
Zephyr 7B-β SFT 50 77.4 21.9 13.9
Zephyr 7B-β DPO-Dist (Pro v. Flash) 50 77.7 42.6 31.5
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 50 78.0 40.9 41.2
Zephyr 7B-β IRPO 50 91.0 27.8 30.8
Zephyr 7B-β ACT (ours) 50 80.8 43.6 38.1
Zephyr 7B-β SFT 100 97.2 43.3 34.3
Zephyr 7B-β DPO-Dist (Pro v. Flash) 100 98.7 45.1 45.3
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 100 99.8 47.8 44.8
Zephyr 7B-β IRPO 100 96.2 45.0 37.0
Zephyr 7B-β ACT (ours) 100 99.2 48.0 49.6
Zephyr 7B-β SFT 250 99.8 51.0 50.7
Zephyr 7B-β DPO-Dist (Pro v. Flash) 250 97.3 49.7 44.2
Zephyr 7B-β DPO-Dist (Sonnet v. Haiku) 250 99.7 50.7 50.3
Zephyr 7B-β IRPO 250 97.0 49.7 45.6
Zephyr 7B-β ACT (ours) 250 99.9 52.3 53.0
Zephyr 7B-β SFT 14,000 (All) 99.8 63.1 60.4

Table A9 details extended AmbigSQL results. ACT again consistently achieves the strongest Execution Match and PC Execution Match among the Zephyr models, especially in data-scarce settings. DPO-Dist also shows competitive performance against SFT and IRPO, demonstrating the value of preference-based alignment. However, the largest frontier LLMs still achieve higher Execution Match, underscoring the benefits of model scale for complex tasks like SQL generation.

6.4. Qualitative Examples (from Appendix I)

The qualitative examples provide intuitive insights into ACT's improvements.

  • Table A28 (PACIFIC):

    Relevant Table Data:
      Year: 2019 || 2018
      Total current liabilities: 851 || 167
      Other noncurrent liabilities: 5 || 74
      Total Liabilities: 909 || 1,305
    SFT:
      User: What were the total liabilities of IMFT?
      Assistant: $909 ×
    ACT:
      User: What were the total liabilities of IMFT?
      Assistant: Which year are you asking about?
      User: 2018
      Assistant: $1,305

Table A28 shows a comparison of trajectories on PACIFIC. The SFT model makes an "Incorrect Assumption" and provides a wrong answer ($909), whereas the ACT model correctly asks a clarifying question ("Which year are you asking about?") and then provides the correct answer ($1,305) after the user clarifies. This exemplifies ACT's learned ability to disambiguate.

  • Table A29 (PACIFIC - Erroneous Trajectory):

    Relevant Data:
      Passage: Net profit/(loss) after tax was $(9.8) million
      Table: Net profit/(loss) after tax: (9,819) || 6,639 || (248%)
    SFT:
      User: What was the net profit/(loss) after tax?
      Assistant: $(9,819) ×
    ACT:
      User: What was the net profit/(loss) after tax?
      Assistant: Which year are you asking the amount of?
      User: FY19
      Assistant: $(9,819) ×

Table A29 illustrates an erroneous trajectory where both SFT and ACT models give incorrect answers for "net profit/(loss) after tax". The ACT model still attempts a clarification ("Which year are you asking the amount of?") and then provides an answer, but it's incorrect. This shows that while ACT improves clarification, it doesn't guarantee perfect accuracy on the underlying task, especially with limited data. The underlying issue is that both models selected a value from the table, but the ground truth was from the passage.

  • Table A30 (PACIFIC - Erroneous Trajectory):

    Relevant Table Data:
      Year: 2019 || 2018
      Investments: 1,216.0 || 1,212.9
    SFT:
      User: In which year was the amount of Investments higher?
      Assistant: [2018] ×
    ACT:
      User: In which year was the amount of Investments higher?
      Assistant: [2018, 2019] ×

Table A30 shows another example where both SFT and ACT fail to produce the correct answer, and neither asked a clarification question, even though the data is present in the table. This suggests room for improvement in task performance even with ACT.

  • Table A31 (Abg-CoQA):

    Relevant Passage:
      She yawned, stretching out like a lazy cat. Her sister was also awake. "Lizzie", Meghan whispered, "are you awake?"
    SFT:
      User: What did Meghan ask?
      Assistant: Do you mean where did Lizzie see her father?
      User: No
    ACT:
      User: What did Meghan ask?
      Assistant: Do you mean that morning or the night before?
      User: The night before.

Table A31 compares SFT and ACT trajectories on Abg-CoQA. The SFT model asks an off-topic clarification question, failing to lead the conversation towards the correct information. In contrast, the ACT model asks an on-track clarification question ("Do you mean that morning or the night before?"), demonstrating its improved multi-turn reasoning and ability to eventually reach the ground truth outcome ("Meghan asked Lizzie, 'Are you awake?'").

6.5. Additional Data Analysis (from Appendix A)

The following are the results from Table A6 of the original paper:

Adaptation Setting (Base Model / Approach / Conversations) | Action-level (Macro F1) | Content-level (Turn Similarity / Traj. Similarity)
Base Model Approach Conversations Macro F1 ↑ Turn Similarity ↑ Traj. Similarity ↑
Gemini Pro ICL 50 56.4 64.5 68.9
Zephyr 7B-β ACT (ours) 50 52.3 66.2 68.8
Gemini Pro ICL 100 59.2 67.0 72.0
Zephyr 7B-β ACT (ours) 100 51.1 69.5 71.4
Gemini Pro ICL 250 58.8 66.0 71.1
Zephyr 7B-β ACT (ours) 250 53.3 72.5 75.1

Table A6 investigates the impact of additional in-context learning data for Gemini Pro on Abg-CoQA.

  • Key Findings: While Gemini Pro with ICL achieves competitive Macro F1 for action-level performance, its Trajectory Similarity does not uniformly improve with more ICL examples (e.g., it's 72.2 at 10, 68.9 at 50, 72.0 at 100, 71.1 at 250). In contrast, Zephyr 7B-β tuned with ACT consistently improves its Trajectory Similarity as the number of tuning conversations increases (68.8 at 50, 71.4 at 100, 75.1 at 250).
  • Conclusion: This suggests that for certain corpora with potential label noise (like Abg-CoQA's low inter-annotator agreement), multi-turn task completion is a more robust measure than action-level performance alone. ACT-tuned smaller models can outperform frontier LLMs using many-shot ICL in multi-turn task completion, highlighting ACT's superior data efficiency and alignment with conversational goals.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Action-Based Contrastive Self-Training (ACT), a novel, model-agnostic, and data-efficient quasi-online contrastive tuning approach specifically designed for multi-turn conversational task adaptation. ACT leverages DPO by incorporating action-based preferences and multi-turn trajectory simulation to teach LLMs how to effectively manage ambiguity and clarify user intent. The research provides compelling evidence that ACT is highly effective in data-limited scenarios, outperforming standard SFT and other DPO variants across diverse tasks, including tabular QA, machine reading comprehension, and text-to-SQL generation. A significant finding is ACT's ability to achieve strong performance even without explicit action labels, by using pseudo-labeling from larger LLMs. The paper also contributes AmbigSQL, a new dataset and task for ambiguous text-to-SQL generation, and proposes a comprehensive evaluation workflow for conversational agents' ambiguity recognition and reasoning abilities.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Assumption of Timed Clarifications: The approach assumes that clarification questions are appropriately timed. Noisy crowdsourced data can lead to suboptimal learned policies (e.g., unnecessary clarifications or disfluent language), suggesting a need for an additional preprocessing stage to infer the usefulness of actions.

  • Label Noise in Evaluation: Label noise can affect implicit action recognition evaluation, especially in datasets with low inter-annotator agreement (e.g., Abg-CoQA).

  • Task-Specific Heuristics: ACT relies on task-specific heuristics for trajectory outcome evaluation. While this allows flexibility, it demands more customization and engineering effort for new domains.

  • Reliance on External LLMs: The implementation of ACT heavily depends on other LLMs (e.g., Gemini for Action Classification and User Simulation). These prompting approaches are not perfect, may require substantial customization, and might not be accessible to all researchers due to financial or privacy concerns.

  • Limited Data Regime Focus: The study primarily focuses on limited data regimes. It's unclear how ACT's findings would generalize to scenarios with abundant in-distribution training data, where simpler objectives like SFT might eventually catch up.

  • Quasi-Online Nature: ACT is quasi-online rather than fully online. It samples one response from the policy model during tuning, which differs from fully online DPO where both winning and losing responses are sampled. The extent of online exploration is also limited by the nature of the tasks (e.g., specific arithmetic expressions have limited unique trajectories).

    Future work could explore:

  • Combining ACT with existing sophisticated tuning approaches for complex tasks like text-to-SQL generation.

  • Generalizing ACT to large-scale data and multi-task environments.

  • Further studying ACT in combination with approaches for improved retrieval-augmented generation to address hallucinations in QA contexts.

7.3. Personal Insights & Critique

This paper presents a highly valuable contribution to the field of conversational AI, particularly in addressing a critical practical challenge: LLMs' struggle with ambiguity in multi-turn interactions under data constraints.

Strengths and Innovations:

  • Targeted Problem: The focus on disambiguation in multi-turn conversations is highly relevant. Many current LLM applications fall short when user intent isn't perfectly clear, and this paper offers a concrete solution.
  • Effective Integration of DPO: The quasi-online DPO framework, tailored with action-based contrastive learning and multi-turn trajectory simulation, is an elegant and effective extension. It moves DPO beyond simple pairwise preferences to a more sophisticated understanding of conversational dynamics.
  • Data Efficiency: The demonstration of ACT's performance in low-data regimes is a major plus. This is a common real-world scenario, especially for specialized domains where extensive human-annotated dialogue data is prohibitively expensive or impossible to collect. The success with pseudo-labeling further enhances its practicality.
  • Novel Dataset (AmbigSQL): The creation of AmbigSQL is a significant contribution. It not only serves as a strong benchmark for the proposed method but also provides a valuable resource for future research on ambiguous text-to-SQL generation, highlighting the real-world performance gaps that clarification can address.
  • Comprehensive Evaluation: The dual focus on action-level and content-level metrics, particularly the trajectory-level evaluations (e.g., Traj. F1, PC Execution Match), offers a holistic view of conversational agent performance that goes beyond single-turn accuracy.

Potential Issues and Areas for Improvement:

  • Reliance on External LLMs: While practical for research, the dependence on Gemini Ultra for conditional generation (M), action classification (A), and user simulation (U) could be a barrier for researchers without access to such frontier LLMs or for those with strict privacy requirements. Developing robust open-source alternatives for these components would broaden ACT's applicability.
  • Complexity of Heuristics: The reliance on task-specific heuristics (H) for trajectory outcome evaluation means that deploying ACT to a new domain might require significant domain expertise and engineering effort to define these heuristics accurately. While flexible, it's not entirely "plug-and-play" across diverse tasks.
  • Defining "Optimal" Actions: The paper assumes an "optimal" ground truth action. In highly nuanced conversations, what constitutes the best clarifying question or direct answer might be subjective. The label noise discussion for Abg-CoQA touches on this. Further research into handling more complex, subjective dialogue acts or even sequences of clarifying questions could be fruitful.
  • Scaling to "All" Data: The paper focuses on the limited data regime. While this is a common problem, understanding how ACT performs or evolves when faced with truly massive, in-distribution datasets (where simpler methods might eventually catch up) would be an interesting extension.

Broader Applicability: The principles behind ACT—learning pragmatic actions, optimizing for multi-turn outcomes, and leveraging on-policy experience in a DPO-like framework—could be highly applicable to other complex AI agent behaviors beyond conversation. This could include agents that need to plan sequences of actions in interactive environments, where feedback is multi-step and reward signals are sparse or delayed. For instance, in tool-use or planning agents, distinguishing between "asking for more information/clarification" versus "executing a tool" could be framed as action-based preferences, and the success of a multi-step plan could form the trajectory outcome. Overall, ACT offers a robust framework for teaching LLMs more sophisticated, context-aware, and goal-oriented behaviors in interactive settings.
