
Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter

Published: 03/07/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces Chain-of-Strategy Optimization, using MCTS to build ESC-Pro for fine-grained strategy tuning, improving LLMs' strategy accuracy, bias mitigation, and empathetic response in emotional support conversations.

Abstract

The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it rigidly trains models on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these limitations, we propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes strategy selection preferences at each dialogue turn. We first leverage Monte Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, enabling LLMs to generate more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, highlighting the efficacy of fine-grained, turn-level preference modeling in ESC.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter

1.2. Authors

Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu. The authors are affiliated with:

  • Harbin Institute of Technology
  • Singapore Management University
  • Central South University
  • iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd

1.3. Journal/Conference

The paper was released on arXiv, a preprint server. As a preprint, it has not yet undergone formal peer review for publication in a journal or conference. However, arXiv is a highly influential platform for the rapid dissemination of research in fields like AI and machine learning.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the growing demand for Emotional Support Conversations (ESC) and the limitations of Large Language Models (LLMs) in this domain, specifically their low strategy selection accuracy and preference bias. Existing supervised fine-tuning (SFT) methods are insufficient as they train on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these, the authors propose Chain-of-Strategy Optimization (CSO), a novel two-stage approach that optimizes strategy selection preferences at each dialogue turn. First, they use Monte Carlo Tree Search (MCTS) to create ESC-Pro, a high-quality preference dataset containing turn-level strategy-response pairs. Then, training LLMs on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, leading to more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, emphasizing the effectiveness of fine-grained, turn-level preference modeling in ESC.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the critical need for effective Emotional Support Conversations (ESC) in modern society, where emotional stress is prevalent. Large Language Models (LLMs) show significant potential for ESC due to their advanced conversational abilities. However, existing LLMs face two major limitations when applied to ESC:

  1. Low Strategy Selection Accuracy: LLMs often fail to choose the most contextually appropriate support strategies, leading to less effective or even unhelpful responses.

  2. Preference Bias: LLMs tend to rigidly favor certain strategies, lacking the adaptability to dynamically respond to a user's evolving emotional needs throughout a conversation.

    These issues are exacerbated by Supervised Fine-Tuning (SFT), the mainstream training approach for LLMs, which trains models on single gold-standard responses. This rigidity prevents LLMs from understanding why certain strategies might be inappropriate in specific contexts or how to navigate nuanced strategy trade-offs. The core problem the paper aims to solve is improving LLMs' ability to provide high-quality emotional support by enhancing their strategy selection accuracy and mitigating preference bias in a fine-grained, turn-level manner.

2.2. Main Contributions / Findings

The paper introduces Chain-of-Strategy Optimization (CSO) and makes the following primary contributions:

  1. ESC-Pro Dataset: The authors present ESC-Pro, a novel, high-quality, turn-level preference dataset specifically designed for Emotional Support Conversations. This dataset is constructed using Monte Carlo Tree Search (MCTS) to expand existing gold-standard dialogues into a comprehensive conversation tree, enabling the identification of both preferred (high-quality) and non-preferred (low-quality) strategy-response pairs at each dialogue turn. This addresses the lack of data that models nuanced strategy trade-offs.
  2. Chain-of-Strategy Optimization (CSO) Framework: They propose CSO, a two-stage framework that refines LLM behavior through preference learning. By explicitly targeting strategy-level improvements across turns, CSO teaches LLMs to select better strategies and avoid rigid patterns, fostering local adaptability and long-range strategic coherence.
  3. Demonstrated Efficacy: Extensive experiments on three different LLM backbones (LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B) consistently show that CSO significantly outperforms standard SFT and other decoding-based methods. CSO leads to higher strategy selection accuracy and substantially reduces strategy preference bias. This highlights the effectiveness of fine-grained, turn-level preference modeling for improving ESC quality and adaptability. The findings also suggest that CSO-equipped open-weight models can even surpass several leading closed-source models in certain metrics.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Emotional Support Conversation (ESC): This refers to dialogues specifically aimed at providing psychological relief and guidance to individuals experiencing emotional distress. The goal is to alleviate emotional intensity, offer comfort, and provide constructive advice. In ESC, a seeker (user) expresses their emotional problem, and a supporter (AI or human) uses specific communication strategies to guide the conversation.
  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They can perform various Natural Language Processing (NLP) tasks, including translation, summarization, and conversational AI. Examples include GPT-4, LLaMA, Gemma, and Qwen.
  • Supervised Fine-Tuning (SFT): A common technique used to adapt a pre-trained LLM to a specific task. In SFT, the model is trained on a dataset of input-output pairs (e.g., dialogue history and a desired response) where the "correct" output is provided as a gold standard. The model learns to predict these gold-standard outputs.
  • Preference Optimization Algorithms: These are a family of algorithms designed to align LLM outputs with human preferences. Instead of relying on a single gold-standard response, they learn from pairs of preferred and non-preferred responses, teaching the model to generate outputs that are more likely to be favored by humans.
    • Direct Preference Optimization (DPO): A preference optimization algorithm that simplifies the Reinforcement Learning from Human Feedback (RLHF) process. Instead of explicitly training a separate reward model and then using reinforcement learning (like Proximal Policy Optimization - PPO), DPO directly optimizes the LLM's policy to maximize the probability of generating preferred responses over rejected ones, given a dataset of (prompt, preferred_response, rejected_response) triples. It frames the preference learning objective as a simple classification problem, making it computationally less expensive.
    • Proximal Policy Optimization (PPO): A Reinforcement Learning (RL) algorithm widely used for LLM fine-tuning after a reward model has been trained. PPO aims to update the policy (the LLM) in smaller steps to ensure stable learning, preventing large, disruptive updates that could degrade performance.
    • Monte Carlo Tree Search (MCTS): A search algorithm used for decision-making in complex domains, often seen in games like Go. It simulates future game states to evaluate the potential of different moves. In this paper, MCTS is adapted to explore and evaluate different strategy-response paths in a conversation, building a tree of possible dialogues and identifying high-quality sequences.
      • Selection: In MCTS, this phase involves traversing the tree from the root node to a leaf node using a policy that balances exploration (trying new paths) and exploitation (following promising paths). The Predictor Upper Confidence Bound (PUCB) formula is typically used for this.
      • Expansion: Once a leaf node is reached, it is expanded by adding one or more new child nodes corresponding to possible next actions (e.g., different strategies in ESC).
      • Simulation: A playout or rollout is performed from the newly expanded node to a terminal state (or a fixed depth). This involves randomly (or semi-randomly) choosing actions until the end of the game/dialogue.
      • Backpropagation: The reward obtained from the simulation is then backpropagated up the tree, updating the Q-values (expected rewards) and visit counts of all nodes along the simulation path. This information guides future selections.
  • ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence): A metric for evaluating the quality of text summarization and machine translation. It measures the overlap of n-grams (sequences of $n$ words) or longest common subsequences (LCS) between a generated text and a reference text. ROUGE-L specifically focuses on the LCS to capture sentence-level structural similarity. Higher ROUGE-L scores indicate more similar content.
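
To make the ROUGE-L definition concrete, the following minimal Python sketch (illustrative only, not the paper's evaluation code) computes the LCS-based recall, precision, and F-measure for two whitespace-tokenized strings with β = 1.

```python
# Minimal ROUGE-L sketch (illustrative, not the paper's evaluation code).
def lcs_length(x, y):
    """Length of the longest common subsequence of token lists x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(reference: str, candidate: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure between a reference and a generated response."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    r_lcs = lcs / len(ref)   # LCS-based recall
    p_lcs = lcs / len(cand)  # LCS-based precision
    return (1 + beta**2) * r_lcs * p_lcs / (beta**2 * r_lcs + p_lcs)

print(rouge_l("i feel overwhelmed and alone", "you feel overwhelmed and alone"))
```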

3.2. Previous Works

The paper contextualizes its contributions within existing research on Emotional Support Conversation (ESC) and Preference Optimization.

  • ESC Systems:

    • Early ESC systems focused on techniques like global-to-local hierarchical graph networks (Peng et al., 2022), incorporating commonsense knowledge (Tu et al., 2022), and modeling emotions and semantics (Zhao et al., 2023a,c). These approaches often relied on traditional NLP and machine learning techniques.
    • With the advent of LLMs, recent work has focused on leveraging their inherent conversational capabilities. Liu et al. (2023) applied SFT to LLaMA-7B for ESC, leading to ChatCounselor, a specialized model outperforming general-purpose LLMs. Chen et al. (2023) and Qiu et al. (2023) expanded single-turn empathic responses to multi-turn dialogues and fine-tuned models on these high-quality datasets.
    • Limitation of SFT in ESC: The paper highlights that SFT, while useful, fundamentally struggles with ESC because it lacks the ability to differentiate between optimal and suboptimal strategies. By only training on gold-standard responses, SFT models learn what to say but not why certain alternatives are less suitable, leading to strategy preference bias and low accuracy in dynamic contexts.
  • Preference Optimization Algorithms:

    • Evolution from RL: Traditional approaches to preference optimization involved policy-based reinforcement learning (RL), typically training a separate reward model (Ouyang et al., 2022) followed by online RL methods like PPO (Schulman et al., 2017) for LLM fine-tuning.
    • Direct Preference Optimization (DPO): Rafailov et al. (2023) streamlined this by unifying reward modeling and RL into a single step with DPO, significantly reducing computational costs.
    • Subsequent Optimizations: Further advancements built on DPO to improve efficiency, robustness, or data applicability:
      • SimPO (Meng et al., 2024) eliminates DPO's reference model.
      • IPO (Azar et al., 2024) enhances robustness through regularization.
      • ORPO (Hong et al., 2024) integrates instruction tuning and preference alignment.
      • KTO (Ethayarajh et al., 2024) operates without paired data by leveraging prospect theory.
    • Gap Addressed by This Work: The paper positions its work as bridging the gap between ESC and preference optimization. While preference optimization has advanced, its application to the specific challenges of ESC (strategy accuracy and bias) is a key novelty.

3.3. Technological Evolution

The evolution of AI in conversational agents has moved from rule-based systems to statistical models, then to neural networks, and most recently to Large Language Models (LLMs). In ESC, this has translated from early systems relying on explicit knowledge and hand-crafted rules to LLMs fine-tuned on conversational data. However, the limitation of SFT (training on single gold-standard responses) became apparent for nuanced tasks like ESC, which require dynamic strategic adaptation. Reinforcement Learning from Human Feedback (RLHF) emerged to address this by learning from human preferences. DPO and its variants simplified RLHF. This paper further refines this by applying preference optimization specifically to the turn-level strategy selection in ESC, generating a unique preference dataset via MCTS to capture these nuances. This work fits within the timeline by pushing the boundaries of LLM adaptability and strategic intelligence in emotionally sensitive domains.

3.4. Differentiation Analysis

Compared to existing SFT-based ESC models, this paper's approach, CSO, offers several key differentiators and innovations:

  1. Fine-Grained Preference Modeling: SFT trains on single gold-standard responses, implicitly treating them as the only correct option. CSO, in contrast, explicitly models preferences between preferred and non-preferred strategy-response pairs at a turn-level. This allows the LLM to learn not just what is good, but why other options are suboptimal, addressing the core limitation of SFT in ESC.
  2. MCTS-Generated Preference Dataset (ESC-Pro): Unlike manually annotated datasets or simpler augmentation techniques, ESC-Pro is systematically constructed using Monte Carlo Tree Search. MCTS allows for comprehensive exploration of diverse conversational paths and strategic choices, guided by LLM-based reward functions. This creates a richer, more diverse dataset that captures nuanced strategy trade-offs and explicitly identifies suboptimal strategies as non-preferred samples, which SFT lacks.
  3. Bias Mitigation: SFT tends to reinforce existing strategy biases because it only sees gold-standard examples. CSO, through preference optimization on ESC-Pro, actively mitigates bias by teaching the model to avoid rigid adherence to certain strategies and adapt to users' dynamic emotional needs.
  4. Chain-like Optimization: CSO aims for chain-like optimization across turns, improving both local adaptability (selecting the best strategy for the current turn) and long-range strategic coherence (maintaining a consistent, effective supportive narrative). SFT struggles with this multi-turn strategic consistency.
  5. Robustness and Versatility: The experiments demonstrate CSO's effectiveness across various LLM backbones and even with LoRA fine-tuning, suggesting it is a versatile and practical approach for enhancing ESC quality.

4. Methodology

The proposed Chain-of-Strategy Optimization (CSO) framework is a novel two-stage approach designed to refine LLMs' strategy selection in Emotional Support Conversations (ESC) at each dialogue turn. It first constructs a high-quality preference dataset (ESC-Pro) and then applies a specialized preference optimization technique. The overall framework is illustrated in Figure 2.

4.1. Principles

The core idea behind CSO is to move beyond the limitations of supervised fine-tuning (SFT), which trains models on single gold-standard responses. Instead, CSO aims to teach LLMs the nuances of strategy trade-offs by explicitly modeling preferences between good and bad strategy-response pairs at each conversational turn. This is achieved through two main principles:

  1. Structured Dialogue Exploration and Evaluation: Use an intelligent search mechanism (Monte Carlo Tree Search) to systematically explore different conversational paths and evaluate the quality of various strategy-response pairs, thus generating a comprehensive dataset of preferred and non-preferred interactions.
  2. Turn-Level Preference Optimization: Apply preference learning (like DPO) on this finely-grained dataset to directly optimize the LLM's policy to select better strategies and avoid suboptimal ones, thereby improving strategy accuracy and mitigating preference bias.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Stage 1: Dialogue Expansion with MCTS

The first stage involves enriching existing ESC datasets with turn-level preference annotations by employing Monte Carlo Tree Search (MCTS). MCTS systematically constructs a conversation tree, where each node represents a dialogue state. The search process refines dialogue strategies through four stages: selection, expansion, simulation, and backpropagation. This allows the model to explore various strategy paths and identify optimal responses guided by a predefined value function.

Specialized LLMs for MCTS: To guide the MCTS process, four specialized LLMs are employed, each with a distinct role:

  • Seeker LLM: Generates responses from the seeker's perspective, based on the conversation history. It can also output an end-of-dialogue token. The Seeker LLM used is GPT-4o-mini, with the prompt shown in Appendix A. The Seeker's action $a_{\mathrm{seeker}}$ is defined as $a_{\mathrm{seeker}} := f_{\mathrm{seeker}}(H_{\mathrm{supporter}})$, where $H_{\mathrm{supporter}}$ represents the conversation history up to the last Supporter turn, denoted as $\{ R_{\mathrm{seeker}}^{(0)}, R_{\mathrm{supporter}}^{(0)}, R_{\mathrm{seeker}}^{(1)}, R_{\mathrm{supporter}}^{(1)}, \dots, R_{\mathrm{seeker}}^{(i)}, R_{\mathrm{supporter}}^{(i)} \}$. The function $f_{\mathrm{seeker}}$ generates the Seeker's response or an end-of-conversation token.
  • Supporter LLM: Generates replies from the supporter's perspective, based on the seeker's last statement and a chosen strategy. The Supporter LLM used is GPT-4o-mini, with its prompt shown in Figure 6 of the appendix. The Supporter's action $a_{\mathrm{supporter}}$ is defined as $a_{\mathrm{supporter}} := f_{\mathrm{supporter}}(H_{\mathrm{seeker}}, \mathrm{Strategy})$, where $H_{\mathrm{seeker}}$ represents the conversation history up to the last Seeker turn, denoted as $\{ R_{\mathrm{seeker}}^{(0)}, R_{\mathrm{supporter}}^{(0)}, R_{\mathrm{seeker}}^{(1)}, R_{\mathrm{supporter}}^{(1)}, \dots, R_{\mathrm{supporter}}^{(i-1)}, R_{\mathrm{seeker}}^{(i)} \}$, and $\mathrm{Strategy}$ refers to the selected response strategy. The function $f_{\mathrm{supporter}}$ generates the Supporter's reply.
  • Strategy LLM: Evaluates and ranks the available support strategies at each conversation turn, assigning each a score from 1 to 10. The Strategy LLM used is GPT-4o-mini, with its prompt shown in Figure 7 of the appendix. The Strategy LLM's action $a_{\mathrm{strategy}}$ is defined as $a_{\mathrm{strategy}} := f_{\mathrm{strategy}}(H_{\mathrm{seeker}})$, where $H_{\mathrm{seeker}}$ is the conversation history. The output $a_{\mathrm{strategy}}$ is a score for each available strategy. These scores are then normalized with the softmax function to form a probability distribution used in the PUCB formula.
  • Reward LLM: Assesses the quality of each strategy-response pair using four metrics: Empathy (E), Information (I), Humanoid (H), and Strategy (S). This serves as the value function for MCTS. The Reward LLM used is GPT-4o-mini, with its prompt shown in Figure 8 of the appendix and detailed evaluation criteria in Table 13. The Reward LLM's action $a_{\mathrm{reward}}$ is defined as $a_{\mathrm{reward}} := f_{\mathrm{reward}}(H_{\mathrm{supporter}})$, where $H_{\mathrm{supporter}}$ is the conversation history. The function $f_{\mathrm{reward}}$ returns the four scores $(E^k, I^k, H^k, S^k)$ for the $k$-th turn.
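
As a rough illustration of how these four roles could be wired together, the sketch below wraps a generic chat-completion client behind a hypothetical `call_llm` callable; the prompt strings are paraphrased placeholders, not the paper's actual prompts (which are in Appendix A and Figures 6-8).

```python
# Illustrative sketch of the four MCTS roles (Seeker, Supporter, Strategy, Reward).
# `call_llm` is a placeholder for any chat-completion client (the paper uses
# GPT-4o-mini); the prompts here are paraphrased, not the paper's actual prompts.
from typing import Callable, Dict, List

Turn = Dict[str, str]          # e.g. {"role": "seeker", "text": "..."}
CallLLM = Callable[[str], str]

def _transcript(history: List[Turn]) -> str:
    return "\n".join(f"{t['role']}: {t['text']}" for t in history)

def seeker_turn(call_llm: CallLLM, history: List[Turn]) -> str:
    """a_seeker := f_seeker(H_supporter): reply as the help-seeker, or emit [END]."""
    return call_llm(f"Continue as the seeker (or output [END]):\n{_transcript(history)}")

def supporter_turn(call_llm: CallLLM, history: List[Turn], strategy: str) -> str:
    """a_supporter := f_supporter(H_seeker, Strategy): reply using the chosen strategy."""
    return call_llm(f"Reply as the supporter using the strategy '{strategy}':\n{_transcript(history)}")

def score_strategies(call_llm: CallLLM, history: List[Turn], strategies: List[str]) -> Dict[str, float]:
    """a_strategy := f_strategy(H_seeker): score each candidate strategy from 1 to 10."""
    return {s: float(call_llm(f"Score 1-10 for strategy '{s}' given:\n{_transcript(history)}"))
            for s in strategies}

def reward_turn(call_llm: CallLLM, history: List[Turn]) -> Dict[str, float]:
    """a_reward := f_reward(H_supporter): Empathy/Information/Humanoid/Strategy scores (0-4)."""
    return {m: float(call_llm(f"Rate {m} (0-4) of the last supporter turn:\n{_transcript(history)}"))
            for m in ("empathy", "information", "humanoid", "strategy")}
```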

Node Types and Representation: Each node in the conversation tree represents a state and is classified into four types:

  • Root Node: Represents the initial dialogue turn, containing the Seeker's first response $R_{\mathrm{seeker}_0}$, Q-value $Q$, and visit count $N$: $S_{\mathrm{root}} = (R_{\mathrm{seeker}_0}, Q, N)$
  • Unexpanded Node: Represents an unexplored strategy, with the strategy $a$, score $r$, Q-value (initialized to 0), and visit count (initialized to 0): $S_{\mathrm{unexp}} = (a, r, Q = 0, N = 0)$
  • Expanded Node: Has already been explored; it stores the strategy $a$, score $r$, the Seeker and Supporter responses $R_{\mathrm{seeker}}, R_{\mathrm{supporter}}$, Q-value $Q$, and visit count $N$: $S_{\mathrm{exp}} = (a, r, R_{\mathrm{seeker}}, R_{\mathrm{supporter}}, Q, N)$
  • End Node: Marks the end of the dialogue, storing the strategy $a$, score $r$, the Supporter's response $R_{\mathrm{supporter}}$, Q-value $Q$, and visit count $N$: $S_{\mathrm{end}} = (a, r, R_{\mathrm{supporter}}, Q, N)$
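
A possible in-memory representation of these node types is sketched below; the field names mirror the paper's notation, but the class itself is an assumption rather than released code.

```python
# One possible in-memory representation of the four node types (a sketch;
# field names follow the paper's notation, not released code).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    strategy: Optional[str] = None        # a: chosen support strategy (None for the root)
    score: float = 0.0                    # r: Strategy-LLM score for this strategy
    seeker_resp: Optional[str] = None     # R_seeker (filled once the node is expanded)
    supporter_resp: Optional[str] = None  # R_supporter
    q_value: float = 0.0                  # Q: running average reward
    visits: int = 0                       # N: visit count
    is_end: bool = False                  # True for end nodes (dialogue terminated)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    @property
    def expanded(self) -> bool:
        """A node counts as expanded once its supporter response exists (root is always expanded)."""
        return self.supporter_resp is not None or self.parent is None
```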

MCTS Stages:

  1. Selection: The selection phase identifies the next node to explore using the Predictor Upper Confidence Bound (PUCB) formula, which balances exploration (trying new strategies) and exploitation (focusing on promising ones) based on a node's Q-value (estimated reward), its visit count, and its parent's visit count (see the code sketch after this list): $\mathbf{PUCB}(S) = Q(S) + c \cdot P(S) \cdot \frac{\sqrt{N(\mathbf{Parent}(S))}}{N(S) + 1}$

    • $S$: The current node being evaluated.
    • $Q(S)$: The Q-value (average reward) of node $S$.
    • $c$: A hyperparameter that controls the balance between exploration and exploitation. A higher $c$ encourages more exploration.
    • $P(S)$: The prior probability of selecting the strategy $a$ associated with node $S$, derived from the Strategy LLM's scores (normalized via softmax).
    • $N(S)$: The visit count of node $S$, i.e., how many times this node has been visited during the MCTS process.
    • $N(\mathbf{Parent}(S))$: The visit count of the parent node of $S$. The selection process proceeds layer by layer, starting from the root node and choosing the node with the highest PUCB value at each step.
  2. Expansion: In the expansion phase, the selected node is evaluated.

    • If the node is an expanded node, the Strategy LLM evaluates all possible strategies for the Supporter at this node. It assigns scores to these strategies and generates unexpanded child nodes for each strategy. These child nodes are initialized with $Q$ and $N$ set to 0: $S_{\mathrm{child}} = (a, r, Q = 0, N = 0)$
    • If the selected node is an unexpanded node, the conversation history is generated using the Seeker and Supporter LLMs:
      1. The Supporter LLM generates a response based on the selected strategy and the conversation history.
      2. The Seeker LLM then generates a reply based on the updated conversation history and the Supporter's response. Once the node is expanded, the conversation history is stored. If the Seeker generates an end-of-dialogue token, the reward is computed, and the process moves directly to backpropagation.
  3. Simulation: The simulation phase focuses on the newly expanded child nodes. To reduce computational costs, only the child node with the highest strategy score is selected for simulation.

    • Node Expansion: The chosen child node transitions from an unexpanded node to an expanded node. The Seeker and Supporter LLMs generate the conversation content as needed.
    • Greedy Simulation: A greedy simulation is performed for $n$ steps (a predefined number of turns). At each step:
      1. The Supporter LLM generates a response using the highest-scoring strategy.
      2. The Seeker LLM then replies. The simulation continues for $n$ steps or until the Seeker generates an end-of-dialogue token.
    • Reward Calculation: During the simulation, the Reward LLM evaluates the conversation quality using four metrics: Empathy (E), Information (I), Humanoid (H), and Strategy (S). The reward is computed as (see the code sketch after this list): $R = \frac{E + I + H + \alpha \cdot S}{10} + b$
      • $E, I, H, S$: Scores for Empathy, Information, Humanoid quality, and Strategy effectiveness, respectively, assigned by the Reward LLM for the current conversation turn. These scores range from 0 to 4 (see Table 13).
      • $\alpha$: A scaling hyperparameter for the Strategy score. It adjusts the relative importance of Strategy effectiveness in the overall reward.
      • $b$: A bias term added to the reward. It lets the system treat rewards lower than $-b$ as negative, guiding the search towards higher-scoring nodes. In the experiments, $b = -3$. The reward for the simulation, $R_{\mathrm{sim}}(S)$, is averaged over all Supporter turns during the simulation and is used to update the node's Q-value in the backpropagation phase: $R_{\mathrm{sim}}(S) = \frac{1}{r_{\mathrm{sim\_end}} - r(S)} \sum_{i = r(S)}^{r_{\mathrm{sim\_end}}} R_{\mathrm{sim}}^{(i)}$
      • $r(S)$: The starting turn index of the simulation from node $S$.
      • $r_{\mathrm{sim\_end}}$: The ending turn index of the simulation.
      • $R_{\mathrm{sim}}^{(i)}$: The reward calculated at turn $i$ during the simulation. This formula computes the average reward over the simulated turns.
  4. Backpropagation: In the backpropagation phase, the reward $R_{\mathrm{sim}}(S)$ from the simulation is propagated back up the tree. The Q-value and visit count of each node along the path are updated as follows: $Q_k = \frac{N_k \cdot Q_k + R_{\mathrm{sim}}}{N_k + 1}, \quad N_k = N_k + 1$

    • $Q_k$: The Q-value of node $k$.
    • $N_k$: The visit count of node $k$.
    • $R_{\mathrm{sim}}$: The simulation reward calculated in the previous step. This update refines the MCTS tree's estimated node values and improves future strategy selections.
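
The selection, reward, and backpropagation rules above reduce to a few lines of code. The sketch below (assuming the `Node` dataclass from the earlier sketch) computes PUCB with softmax priors over the Strategy LLM's scores, the turn reward with the paper's α = 7 and b = -3, and the Q-value/visit-count update; it is a minimal illustration, not the authors' implementation.

```python
import math
from typing import List

# Assumes the `Node` dataclass from the earlier sketch.

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def pucb(node: "Node", prior: float, c: float = 1.0) -> float:
    """PUCB(S) = Q(S) + c * P(S) * sqrt(N(Parent(S))) / (N(S) + 1)."""
    parent_visits = node.parent.visits if node.parent else node.visits
    return node.q_value + c * prior * math.sqrt(parent_visits) / (node.visits + 1)

def select(root: "Node", c: float = 1.0) -> "Node":
    """Descend layer by layer, taking the child with the highest PUCB value.
    P(S) is the softmax of the Strategy LLM scores over the sibling set."""
    node = root
    while node.children:
        priors = softmax([ch.score for ch in node.children])
        node = max(zip(node.children, priors), key=lambda cp: pucb(cp[0], cp[1], c))[0]
    return node

def turn_reward(e: float, i: float, h: float, s: float,
                alpha: float = 7.0, b: float = -3.0) -> float:
    """R = (E + I + H + alpha * S) / 10 + b, with the paper's alpha = 7, b = -3."""
    return (e + i + h + alpha * s) / 10.0 + b

def backpropagate(leaf: "Node", r_sim: float) -> None:
    """Q_k <- (N_k * Q_k + R_sim) / (N_k + 1); N_k <- N_k + 1, up to the root."""
    node = leaf
    while node is not None:
        node.q_value = (node.visits * node.q_value + r_sim) / (node.visits + 1)
        node.visits += 1
        node = node.parent
```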

4.2.2. ESC-Pro Preference Data Construction

After the MCTS process is complete, the ESC-Pro dataset is constructed from the generated conversation tree. This involves identifying high-quality strategy-response paths and pairing them with lower-scoring alternatives to create fine-grained preference data.

  • Conversation Decomposition and Expansion: A gold-standard ESC conversation is first decomposed into a conversation tree. MCTS iteratively expands this tree, exploring potential strategy paths. The process stops when a predefined number of iterations ($n_{\mathrm{iter}}$) is reached or a sufficient number of termination nodes ($n_{\mathrm{finish}}$) are identified.
  • Preference Data Extraction: Valid conversation paths are extracted from the MCTS tree. A path $P = \{S_1, S_2, \ldots, S_L\}$ is considered valid if:
    1. $\mathrm{is\_end\_node}(S_L)$: The last node $S_L$ is a termination node (i.e., the conversation has reached a natural end).
    2. $Q(S_i) > \theta$ for all nodes $S_i \in P$: Every node in the path meets a predefined quality threshold $\theta$ on its Q-value. Once valid paths are identified, preference pairs are extracted. For each high-scoring node $S_w$ (with $Q(S_w) > \theta$) on a valid path, its low-scoring sibling nodes $S_l$ (with $Q(S_l) < \theta$) are identified. Each pair $(S_w, S_l)$ encodes relative strategy quality. The resulting dataset $\mathcal{D}$ is constructed as follows (see the sketch below): $\mathcal{D} = \bigcup_{P \in \mathcal{P}} \left\{ (S_w, S_l) \mid S_w \in P,\ S_l \in \mathrm{Siblings}(S_w),\ Q(S_w) > \theta,\ Q(S_l) < \theta \right\}$
    • $\mathcal{P}$: The set of all valid paths extracted from the MCTS tree.
    • $P$: A single valid path from $\mathcal{P}$.
    • $S_w$: A preferred (winning) node within path $P$; its Q-value must exceed the threshold $\theta$.
    • $S_l$: A non-preferred (losing) node that is a sibling of $S_w$; its Q-value must be below the threshold $\theta$.
    • $\mathrm{Siblings}(S_w)$: The set of sibling nodes of $S_w$ (nodes at the same dialogue turn with different strategies). The dataset $\mathcal{D}$ contains all such preference pairs $(S_w, S_l)$, providing a rich training signal for LLMs to learn nuanced strategy trade-offs and adaptive decision-making.
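
Under the same assumed `Node` representation, extracting ESC-Pro preference pairs from a finished MCTS tree could look like the following sketch: enumerate root-to-end paths whose nodes clear the threshold θ, then pair each on-path node with its below-threshold siblings.

```python
from typing import List, Tuple

# Assumes the `Node` dataclass from the earlier sketch.

def valid_paths(root: "Node", theta: float) -> List[List["Node"]]:
    """Root-to-end paths P = {S_1, ..., S_L} with Q(S_i) > theta for all non-root nodes."""
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last.is_end and all(n.q_value > theta for n in path[1:]):
            paths.append(path)
        for child in last.children:
            stack.append(path + [child])
    return paths

def preference_pairs(root: "Node", theta: float) -> List[Tuple["Node", "Node"]]:
    """D = union over valid paths of {(S_w, S_l) : S_l sibling of S_w, Q(S_l) < theta}."""
    pairs = []
    for path in valid_paths(root, theta):
        for s_w in path[1:]:                      # skip the root (it carries no strategy)
            for s_l in s_w.parent.children:
                if s_l is not s_w and s_l.q_value < theta:
                    pairs.append((s_w, s_l))
    return pairs
```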

4.2.3. Stage 2: Chain-of-Strategy Optimization (CSO)

The second stage performs turn-level preference optimization with DPO (Rafailov et al., 2023) on the constructed ESC-Pro dataset $\mathcal{D}$. For the $i$-th conversation round (dialogue turn), the training objective is formulated as: $\mathcal{L}_i(\pi_\theta; \pi_{\mathrm{ref}}) = -\log \sigma\left(\beta \log r_w - \beta \log r_l\right)$ where:

  • $\mathcal{L}_i$: The loss function for the $i$-th conversation round.
  • $\pi_\theta$: The policy of the LLM being trained (with parameters $\theta$).
  • $\pi_{\mathrm{ref}}$: The reference policy, typically the SFT-trained version of the LLM or an earlier checkpoint, used to stabilize training and prevent large deviations from the original language generation capabilities.
  • $\sigma(\cdot)$: The sigmoid function, which squashes its input into the range between 0 and 1.
  • $\beta$: A hyperparameter that controls the strength of the preference alignment. A larger $\beta$ means stronger alignment to the preferences.
  • $r_w$: The ratio of the probability of generating the preferred response under the current policy to that under the reference policy: $r_w = \frac{\pi_\theta(S_w^i \mid x, H^{i-1})}{\pi_{\mathrm{ref}}(S_w^i \mid x, H^{i-1})}$
    • $S_w^i$: The preferred strategy-response pair at the $i$-th turn.
    • $x$: The initial prompt or context of the conversation.
    • $H^{i-1}$: The conversation history up to the $(i-1)$-th turn, i.e., $\{S_w^0, S_w^1, \ldots, S_w^{i-1}\}$. This indicates that CSO is optimized over a chain of preferred strategies.
  • $r_l$: The ratio of the probability of generating the non-preferred response under the current policy to that under the reference policy: $r_l = \frac{\pi_\theta(S_l^i \mid x, H^{i-1})}{\pi_{\mathrm{ref}}(S_l^i \mid x, H^{i-1})}$
    • $S_l^i$: The non-preferred strategy-response pair at the $i$-th turn. The objective is to increase the likelihood of generating preferred responses ($S_w^i$) while decreasing the likelihood of generating non-preferred responses ($S_l^i$), in a turn-level and context-aware manner (through $H^{i-1}$).

The overall training objective across the entire ESC-Pro dataset is the expected value of this turn-level loss (a minimal code sketch follows the symbol list below): $\mathcal{L}_{\mathrm{CSO}}(\pi_\theta; \pi_{\mathrm{ref}}) = \mathbb{E}_{(x, S_w^i, S_l^i, H^{i-1}) \sim \mathcal{D}}\left[\mathcal{L}_i(\pi_\theta; \pi_{\mathrm{ref}})\right]$

  • $\mathcal{L}_{\mathrm{CSO}}$: The overall Chain-of-Strategy Optimization loss.
  • $\mathbb{E}_{(\cdot) \sim \mathcal{D}}[\cdot]$: The expected value (average) over all samples $(x, S_w^i, S_l^i, H^{i-1})$ drawn from the ESC-Pro dataset $\mathcal{D}$. This optimization process enables LLMs to learn nuanced strategy trade-offs, selecting better strategies over weaker ones at each dialogue turn, exploring diverse conversation paths, and avoiding rigid preference patterns, thus achieving both local adaptability and long-range strategic coherence.
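
A minimal PyTorch sketch of this turn-level objective is given below. It assumes the summed token log-probabilities of the preferred and non-preferred responses, conditioned on $(x, H^{i-1})$, have already been computed under both the policy and the frozen reference model; it is not the authors' training code, and the $\beta$ value is illustrative.

```python
import torch
import torch.nn.functional as F

def cso_dpo_loss(policy_logp_w: torch.Tensor,   # log pi_theta(S_w^i | x, H^{i-1})
                 policy_logp_l: torch.Tensor,   # log pi_theta(S_l^i | x, H^{i-1})
                 ref_logp_w: torch.Tensor,      # log pi_ref(S_w^i | x, H^{i-1})
                 ref_logp_l: torch.Tensor,      # log pi_ref(S_l^i | x, H^{i-1})
                 beta: float = 0.1) -> torch.Tensor:  # beta is an illustrative default
    """L_i = -log sigma(beta * log r_w - beta * log r_l), averaged over the batch."""
    log_r_w = policy_logp_w - ref_logp_w        # log r_w (preferred likelihood ratio)
    log_r_l = policy_logp_l - ref_logp_l        # log r_l (non-preferred likelihood ratio)
    return -F.logsigmoid(beta * (log_r_w - log_r_l)).mean()

# Example with dummy log-probabilities for a batch of 4 turn-level preference pairs:
loss = cso_dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(float(loss))
```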

5. Experimental Setup

5.1. Datasets

  • ExTES (Gold-Standard Source): The ExTES dataset (Zheng et al., 2024) serves as the initial gold-standard conversational benchmark for expansion.
    • Characteristics: Contains emotional support conversations with annotated support strategies. It originally contains 16 distinct support strategies.
    • Usage: The first 100 samples from the processed ExTES training set are selected as a high-quality starting point (seed dialogues) for MCTS-based expansion.
  • ESC-Pro (Generated Preference Dataset): This is the dataset generated by applying MCTS to expand the 100 seed dialogues from ExTES.
    • Characteristics: ESC-Pro expands the 100 seed dialogues into 423 dialogues. The total number of utterances grows from 1,613 (in seed) to 14,383. Crucially, over half (8,157 utterances) are classified as non-preference data, demonstrating the generation of suboptimal strategies alongside preferred ones.
      • Average dialogue length remains consistent (14.72 utterances for ESC-Pro vs. 16.13 for original).
      • Average length of preference utterances (29.42) matches seed data (29.03), while non-preference utterances (23.22) are shorter, indicating meaningful distinction.
      • The original 16 ExTES strategies are merged into 8 categories to reduce search space and improve computational efficiency, as detailed in Table 8.
    • Usage: ESC-Pro is the primary dataset used for training with Chain-of-Strategy Optimization (CSO).
  • ESC-Pro-SFT (Fine-tuning Subset): A subset extracted from ESC-Pro specifically for Supervised Fine-Tuning (SFT).
    • Characteristics: It would presumably contain only the preferred strategy-response pairs from ESC-Pro, mimicking the gold-standard nature required by SFT.
    • Usage: Used to train SFT baselines for comparison with CSO.

Example of Data Sample (from Table 16 - excerpt of a dialogue):

Conversation Background

  • Scene: Breakups or Divorce
  • Description: "After five years of marriage, my husband and I have decided to separate due to irreconcilable differences. I am feeling overwhelmed and uncertain as I navigate through this difficult time."

Conversation History (partial)

  • Seeker: I'll try my best. Right now, it feels impossible to imagine moving on. Do you have any advice on how to start the healing process?
    • Preferred Supporter Response: (Affirmation) While it may seem overwhelming now, know that healing and growth are possible. Focus on taking care of yourself both physically and emotionally. Engage in activities that bring you happiness and surround yourself with loved ones who uplift you. Consider seeking support from a therapist or counselor who can help guide you through this process. You are stronger than you realize, and I believe in your ability to navigate through this difficult time.

    • Non-Preferred Supporter Response: (Collaborative Planning) Absolutely! Let's break it down together. What's one small step you think you could take this week to begin healing?

      This example shows a seeker turn followed by the preferred supporter response (Affirmation strategy) and one of the non-preferred responses (Collaborative Planning); the full example in Table 16 also includes non-preferred responses using the Avoid Judgment and Criticism, Reframe Negative Thoughts, and Share Information strategies. This structure, with explicit preferred and non-preferred options per turn, is what ESC-Pro provides for training.

5.2. Evaluation Metrics

The evaluation uses four metrics: strategy selection accuracy (macro F1, $\mathcal{Q}$; weighted F1, $\mathcal{Q}_{\mathcal{W}}$), strategy preference bias ($\mathcal{B}$), and ROUGE-L (R-L) for assessing response semantics.

  1. Strategy Selection Accuracy (macro F1 $\mathcal{Q}$ and weighted F1 $\mathcal{Q}_{\mathcal{W}}$):

    • Conceptual Definition: F1-score is a measure of a model's accuracy on a dataset. It considers both the precision and recall to compute the score. In the context of ESC, it evaluates how accurately the LLM selects the correct support strategy for a given dialogue turn.
      • Macro F1 ($\mathcal{Q}$) calculates the F1-score independently for each class (strategy) and then averages them, treating all classes equally. It is useful for detecting performance across all categories, especially when the class distribution is uneven.
      • Weighted F1 ($\mathcal{Q}_{\mathcal{W}}$) calculates the F1-score for each class and averages them weighted by the number of true instances of each class, making it more representative under class imbalance.
    • Mathematical Formula (general F1-score): $\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ where $\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}}$ and $\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$
    • Symbol Explanation:
      • True Positives: Instances correctly identified as belonging to a specific class.
      • False Positives: Instances incorrectly identified as belonging to a specific class (Type I error).
      • False Negatives: Instances of a class incorrectly identified as not belonging to it (Type II error).
      • For Macro F1, this calculation is performed for each strategy class, and the unweighted average is taken.
      • For Weighted F1, this calculation is performed for each strategy class, and the average is weighted by the number of actual instances of each class.
  2. Strategy Preference Bias ($\mathcal{B}$):

    • Conceptual Definition: This metric quantifies how much an LLM deviates from an ideal or desired strategy distribution. A high bias means the model rigidly favors certain strategies, limiting its adaptability to diverse user emotional needs. The goal is to minimize this bias.
    • Mathematical Formula: The paper states that $\mathcal{B}$ measures deviation from an ideal strategy distribution but does not give an explicit formula in the main text or appendix. Since Kang et al. (2024) is cited for this metric, it most likely compares the model's observed strategy distribution against a reference "ideal" distribution (e.g., from human experts or a balanced dataset) using a statistical distance such as Kullback-Leibler or Jensen-Shannon divergence. Without the specific formula from the cited work, a precise definition cannot be given here; conceptually, the goal is to quantify skew in strategy usage, with lower values indicating a more balanced, adaptive strategy profile. (The sketch after this list uses a KL-divergence proxy purely for illustration.)
  3. ROUGE-L (R-L):

    • Conceptual Definition: ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) is a metric used to evaluate the quality of text generation by comparing a generated response to a reference (gold-standard) response. It specifically measures the longest common subsequence (LCS) between the two texts. A longer LCS indicates more shared content and better semantic overlap. Higher ROUGE-L scores are better.
    • Mathematical Formula: $ \mathrm{ROUGE-L} = \frac{(1 + \beta^2) R_{\mathrm{LCS}} P_{\mathrm{LCS}}}{\beta^2 R_{\mathrm{LCS}} + P_{\mathrm{LCS}}} $ where: $ R_{\mathrm{LCS}} = \frac{\mathrm{Length}(\mathrm{LCS}(X, Y))}{\mathrm{Length}(X)} $ $ P_{\mathrm{LCS}} = \frac{\mathrm{Length}(\mathrm{LCS}(X, Y))}{\mathrm{Length}(Y)} $
    • Symbol Explanation:
      • $X$: The reference (gold-standard) response.
      • $Y$: The generated response from the LLM.
      • $\mathrm{LCS}(X, Y)$: The longest common subsequence of $X$ and $Y$. A subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements.
      • $\mathrm{Length}(\cdot)$: The number of tokens (words) in a sequence.
      • $R_{\mathrm{LCS}}$: The LCS-based recall, which measures how much of the reference text is covered by the LCS.
      • $P_{\mathrm{LCS}}$: The LCS-based precision, which measures how much of the generated text is part of the LCS with the reference.
      • $\beta$: A parameter that weighs recall against precision (e.g., $\beta = 1$ for equal weight, or $\beta \gg 1$ to emphasize recall).
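
The sketch below shows how the strategy-accuracy metrics could be computed with scikit-learn; because the paper does not restate the bias formula from Kang et al. (2024), the KL-divergence proxy used here for $\mathcal{B}$ is purely an illustrative assumption, not the paper's definition.

```python
import math
from collections import Counter
from typing import Dict, List
from sklearn.metrics import f1_score

def strategy_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Macro and weighted F1 over the predicted support strategies."""
    return {"macro_f1": f1_score(gold, pred, average="macro"),
            "weighted_f1": f1_score(gold, pred, average="weighted")}

def strategy_bias_proxy(gold: List[str], pred: List[str], eps: float = 1e-9) -> float:
    """Illustrative stand-in for B: KL(predicted strategy distribution || gold distribution)."""
    strategies = sorted(set(gold) | set(pred))
    g, p = Counter(gold), Counter(pred)
    q = [(g[s] + eps) / (len(gold) + eps * len(strategies)) for s in strategies]
    r = [(p[s] + eps) / (len(pred) + eps * len(strategies)) for s in strategies]
    return sum(ri * math.log(ri / qi) for ri, qi in zip(r, q))
```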

5.3. Baselines

The CSO method is compared against several baselines to demonstrate its efficacy:

  1. Supervised Fine-Tuning (SFT):
    • Description: The standard method of fine-tuning LLMs on gold-standard data. The paper evaluates both LoRA-based SFT (using Low-Rank Adaptation for parameter-efficient fine-tuning) and Full-parameter SFT (updating all model parameters).
    • Representativeness: This is the most common and direct approach for adapting LLMs to specific tasks, including ESC. It serves as a strong benchmark for comparing the impact of preference optimization versus direct supervision.
  2. Decoding Strategies:
    • Direct-Refine: A simple self-optimization method where the model directly refines its output based on its initial response, without external input or complex feedback loops.
    • Self-Refine: Based on Madaan et al. (2024), this two-step process involves the model first reflecting on its own output and generating feedback, then using this feedback to produce an optimized response. It promotes self-correction.
    • w/ Example (In-context Example Prompting): In this method, a randomly selected gold-standard example from the ExTES training set is included in the prompt provided to the model. This serves as an in-context learning signal to guide the model's generation.
    • Representativeness: These methods represent common approaches to improve LLM output quality without explicit fine-tuning on preference data, relying instead on prompting techniques or internal refinement.
  3. Proprietary Models (Closed-Source):
    • Description: The paper benchmarks against leading commercial LLMs, including GPT-4o, Gemini-2.0-Pro, and Claude-3.5-Sonnet/Claude-3.7-Sonnet.
    • Representativeness: These models represent the current state-of-the-art in LLM performance and provide a strong real-world upper bound for comparison, showing whether the proposed CSO on smaller open-source models can achieve competitive results.

5.4. Implementation Details

  • Backbones: The experiments are conducted on three LLM backbones:
    • LLaMA-3.1-8B-Instruct (Dubey et al., 2024)
    • Qwen-2.5-7B-Instruct (Yang et al., 2024)
    • Gemma-2-9B-it (Team et al., 2024)
  • Software/Hardware: All experiments are implemented in PyTorch (Paszke et al., 2019) on 8 NVIDIA Tesla A100 GPUs, using the DeepSpeed library (Rasley et al., 2020) with ZeRO-2 optimization.
  • Training Parameters:
    • Maximum target length for generated responses: 512 tokens across all backbones.
    • LoRA Fine-tuning:
      • alpha: 8
      • dropout rate: 0
      • Targeted modules: All modules.
    • Official chat templates of the respective backbones are used.
    • MCTS Hyperparameters (for ESC-Pro construction):
      • $c$ (PUCB exploration constant): 1 (balances exploration and exploitation).
      • $\alpha$ (reward scaling for the Strategy score): 7 (emphasizes strategy effectiveness).
      • $b$ (reward bias): -3 (rewards with a raw quality score below 3 become negative, guiding the search towards higher-scoring nodes).
      • $\theta$ (Q-value threshold for valid paths in data extraction): 0.5 (with $b = -3$, this ensures the raw quality score of preferred data exceeds 3.5, maintaining quality).
    • Strategy Mapping: The original 16 ExTES strategies are mapped to 8 broader categories (Table 8) to simplify the search space.
  • Preference Optimization Algorithms: In addition to DPO, SimPO, IPO, KTO, and ORPO are also evaluated as alternatives, with algorithm-specific hyperparameters tuned (e.g., $\beta$ values, $\gamma$ for SimPO, and the chosen/rejected weights for KTO). These are detailed in Table 14 and Table 15.
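
For reference, the reported settings can be collected into a single configuration object; the field names below are illustrative, not the authors' actual configuration schema.

```python
# Hypothetical config collecting the reported settings (field names are
# illustrative; the authors' actual config files are not reproduced here).
CSO_CONFIG = {
    "backbones": ["LLaMA-3.1-8B-Instruct", "Qwen-2.5-7B-Instruct", "Gemma-2-9B-it"],
    "max_target_length": 512,
    "lora": {"alpha": 8, "dropout": 0.0, "target_modules": "all"},
    "mcts": {
        "c": 1.0,             # PUCB exploration/exploitation trade-off
        "alpha": 7.0,         # weight on the Strategy score in the reward
        "b": -3.0,            # reward bias
        "theta": 0.5,         # Q-value threshold for valid paths / preferred nodes
        "num_strategies": 8,  # 16 ExTES strategies merged into 8 categories (Table 8)
    },
    "preference_algorithms": ["DPO", "SimPO", "IPO", "KTO", "ORPO"],
}
```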

5.5. Human Evaluation Details

  • Evaluators: Three anonymous crowd workers with prior annotation experience.
  • Training: Evaluators received targeted training to understand metric criteria.
  • Compensation: Paid based on annotated samples.
  • Inter-rater Agreement: Measured using Cohen's Kappa ($\kappa$) scores (reported in Tables 1 and 3).
  • Evaluation Tasks:
    • Pairwise Comparison (Table 1): 100 preference pairs from ESC-Pro are compared, evaluating whether the preference response is superior, inferior, or equal to the non-preference response based on Empathy, Information, Humanoid, and Strategy metrics (from the Reward LLM).
    • Quality Assessment (Table 6): 100 responses from ExTES seed, 50 preference responses, and 50 non-preference responses from ESC-Pro are evaluated across five metrics: Acceptance (Acc), Effectiveness (Eff), Sensitivity (Sen), Satisfaction (Sat), and Alignment (Align).
    • CSO vs. SFT Comparison (Table 3): Human evaluators compare CSO-generated responses against SFT-generated responses on Acceptance, Effectiveness, Sensitivity, and Satisfaction, indicating which is preferred (win), inferior (lose), or equal (tie).

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Automatic Evaluation Results

The following are the results from Table 2 of the original paper:

| Model / Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| --- | --- | --- | --- | --- |
| GPT-4o-mini | 35.68 | 2.65 | 42.08 | 24.91 |
| GPT-4o | 27.98 | 2.65 | 39.55 | 24.26 |
| Gemini-2.0-Pro | 27.00 | 1.90 | 46.59 | 22.71 |
| Claude-3.5-Sonnet | 20.97 | 1.55 | 41.00 | 19.82 |
| Claude-3.7-Sonnet | 31.50 | 1.18 | 48.13 | 21.01 |
| LLaMA-3.1-8B-Instruct | 29.79 | 1.18 | 38.78 | 23.48 |
| Direct-Refine | 16.08 | 1.60 | 18.56 | 19.53 |
| Self-Refine | 17.85 | 1.35 | 24.72 | 19.48 |
| w/ Example | 8.85 | 1.27 | 15.34 | 18.42 |
| SFT (Full) | 30.28 | 2.65 | 37.33 | 23.77 |
| CSO (Full) | 33.11 | 1.11 | 39.21 | 24.24 |
| SFT (LoRA) | 31.25 | 2.65 | 39.27 | 23.30 |
| CSO (LoRA) | 34.51 | 1.11 | 41.11 | 23.89 |
| Qwen-2.5-7B-Instruct | 19.84 | 2.47 | 28.12 | 23.52 |
| Direct-Refine | 12.70 | 1.20 | 24.89 | 22.91 |
| Self-Refine | 11.77 | 1.75 | 19.59 | 20.53 |
| w/ Example | 17.33 | 1.37 | 28.21 | 22.51 |
| SFT (Full) | 21.73 | 2.34 | 31.24 | 23.54 |
| CSO (Full) | 28.78 | 1.92 | 34.39 | 26.16 |
| SFT (LoRA) | 21.54 | 2.45 | 29.11 | 23.72 |
| CSO (LoRA) | 23.16 | 2.09 | 32.26 | 24.17 |
| Gemma-2-9B-it | 31.31 | 1.33 | 44.06 | 25.64 |
| Direct-Refine | 7.79 | 2.55 | 12.86 | 21.67 |
| Self-Refine | 15.95 | 2.47 | 22.93 | 20.63 |
| w/ Example | 20.12 | 2.65 | 13.41 | 19.64 |
| SFT (Full) | 32.52 | 1.29 | 46.45 | 25.25 |
| CSO (Full) | 35.61 | 1.54 | 47.95 | 26.63 |
| SFT (LoRA) | 31.40 | 1.55 | 43.90 | 25.68 |
| CSO (LoRA) | 35.77 | 1.23 | 52.34 | 26.61 |

  • CSO Improves Strategy Accuracy while Reducing Bias: Across all three backbones (LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, Gemma-2-9B-it), CSO consistently demonstrates superior performance compared to SFT. For example, with LLaMA-3.1-8B, CSO (LoRA) achieves a $\mathcal{Q}$ of 34.51 and $\mathcal{B}$ of 1.11, significantly better than SFT (LoRA)'s $\mathcal{Q}$ of 31.25 and $\mathcal{B}$ of 2.65. This clearly shows CSO's ability to not only select more accurate strategies but also to reduce preference bias, making LLMs more adaptable. In contrast, SFT often slightly improves strategy selection accuracy ($\mathcal{Q}$) but simultaneously increases strategy bias ($\mathcal{B}$), limiting the model's flexibility. This highlights the crucial role of preference optimization in ESC.

  • Limitations of Decoding-based Approaches: The Direct-Refine, Self-Refine, and w/ Example decoding strategies generally underperform, often resulting in lower strategy accuracy and increased bias. For instance, LLaMA-3.1-8B with Self-Refine and w/ Example yields very low $\mathcal{Q}$ scores (17.85 and 8.85, respectively) and higher $\mathcal{B}$ scores (1.35 and 1.27), indicating their inadequacy for the nuanced demands of ESC.

  • CSO-equipped Open-weight Models Outperform Closed-source Models: Notably, CSO applied to open-weight models (e.g., LLaMA-3.1-8B with CSO (LoRA) has $\mathcal{Q} = 34.51$, $\mathcal{B} = 1.11$) frequently outperforms several proprietary models such as GPT-4o ($\mathcal{Q} = 27.98$, $\mathcal{B} = 2.65$), Gemini-2.0-Pro ($\mathcal{Q} = 27.00$, $\mathcal{B} = 1.90$), and Claude-3.5-Sonnet ($\mathcal{Q} = 20.97$, $\mathcal{B} = 1.55$) in terms of strategy accuracy and bias. This showcases CSO's strong generalization and effectiveness.

  • Consistency across LoRA and Full Fine-tuning: CSO maintains its performance advantage over SFT in both LoRA-based and full-parameter fine-tuning settings. The effectiveness of CSO even with LoRA (fewer trainable parameters) makes it a practical choice for resource-constrained environments.

  • Scalability to Larger Models: The paper notes consistent improvements when CSO is applied to Qwen2.5-32B (discussed in Appendix G.2), confirming its scalability.

6.1.2. Human Evaluation Results

The following are the results from Table 3 of the original paper:

| CSO vs. SFT | Win | Lose | Tie | κ |
| --- | --- | --- | --- | --- |
| Acceptance | 68.00 | 20.33 | 11.67 | 0.65 |
| Effectiveness | 58.33 | 16.00 | 25.67 | 0.55 |
| Sensitivity | 60.67 | 21.67 | 17.66 | 0.61 |
| Satisfaction | 62.34 | 19.33 | 18.33 | 0.64 |

  • Human evaluators consistently prefer CSO-generated responses over SFT-generated responses across all four metrics: Acceptance, Effectiveness, Sensitivity, and Satisfaction.
  • CSO achieves high win rates, notably 68.00% for Acceptance and 62.34% for Satisfaction, indicating that its responses are perceived as more appropriate, impactful, and generally better for the user.
  • The Kappa ($\kappa$) scores, ranging from 0.55 to 0.65, indicate a moderate to high level of inter-rater agreement, lending credibility to the human evaluation findings.
  • These results strongly validate that preference optimization (via CSO) significantly enhances the qualitative aspects of ESC, making responses more empathetic and better aligned with user needs from a human perspective.

6.1.3. Dataset Quality Evaluation

The following are the results from Table 1 of the original paper:

| ESC-Pro(+) vs. ESC-Pro(-) | Win | Lose | Tie | κ |
| --- | --- | --- | --- | --- |
| Empathy | 46.33 | 32.67 | 21.00 | 0.61 |
| Information | 42.34 | 27.33 | 30.33 | 0.55 |
| Humanoid | 41.67 | 21.33 | 37.00 | 0.49 |
| Strategy | 60.67 | 15.00 | 24.33 | 0.67 |

The following are the results from Table 6 of the original paper:

| Metric | ExTES | ESC-Pro(+) | ESC-Pro(-) | ESC-Pro | κ |
| --- | --- | --- | --- | --- | --- |
| Acc | 3.78 | 3.91 | 3.13 | 3.52 | 0.41 |
| Eff | 3.31 | 3.45 | 3.19 | 3.27 | 0.51 |
| Sen | 3.86 | 3.98 | 3.51 | 3.75 | 0.44 |
| Sat | 3.65 | 3.78 | 3.28 | 3.52 | 0.45 |
| Align | 3.53 | 3.60 | 3.47 | 3.54 | 0.52 |

  • ESC-Pro Quality: Table 1 shows a pairwise comparison between ESC-Pro(+) (preference samples) and ESC-Pro(-) (non-preference samples). ESC-Pro(+) consistently wins across all metrics, especially Strategy (60.67% win vs. 15.00% lose), confirming that the MCTS-generated preference data successfully distinguishes between high-quality and low-quality strategy-response pairs. Kappa scores (0.49-0.67) indicate fair to good agreement.
  • Human Evaluation of Data Quality (Table 6): This table further validates the quality of the ESC-Pro dataset. ESC-Pro(+) (preferred data) scores higher than ESC-Pro(-) (non-preferred data) across all human evaluation metrics (Acceptance, Effectiveness, Sensitivity, Satisfaction, Alignment). Importantly, ESC-Pro(+) also slightly exceeds the scores of the original ExTES seed dataset in most categories (e.g., Acc: 3.91 vs 3.78), indicating that the MCTS expansion not only creates useful non-preference data but also enhances the overall quality of preferred data. The lower Alignment score for ESC-Pro(-) confirms that these samples indeed contain less effective or misaligned strategies.

6.1.4. Strategy Analysis

  • Dynamic Strategy Distribution: Figure 3 (Strategy distribution across conversation stages in ESC-Pro) illustrates that ESC-Pro captures a dynamic and contextually appropriate strategy distribution. For example, Emotional Validation is more prominent in early stages to acknowledge feelings, while Affirmation and Avoid Judgment and Criticism increase towards the end for encouragement.
  • Diversity and Balance: A comparison with the seed dataset (Figure 9 in Appendix C.3) reveals that ESC-Pro employs a more diverse and balanced strategy distribution. This indicates that the MCTS method effectively models varied dialogue strategies, enriching the dataset beyond the original ExTES content.

6.1.5. Toxicity Analysis

The following are the results from Table 7 of the original paper:

| Source | TOXICITY | SEVERE_TOXICITY | IDENTITY_ATTACK | INSULT | PROFANITY | THREAT |
| --- | --- | --- | --- | --- | --- | --- |
| ExTES (seeker) | 0.0281 | 0.0012 | 0.0041 | 0.0108 | 0.0171 | 0.0088 |
| ExTES (supporter) | 0.0173 | 0.0008 | 0.0027 | 0.0093 | 0.0124 | 0.0070 |
| ExTES | 0.0227 | 0.0010 | 0.0034 | 0.0100 | 0.0148 | 0.0079 |
| ESC-Pro (seeker) | 0.0290 | 0.0013 | 0.0042 | 0.0110 | 0.0178 | 0.0097 |
| ESC-Pro(+) | 0.0192 | 0.0008 | 0.0027 | 0.0097 | 0.0130 | 0.0074 |
| ESC-Pro(-) | 0.0223 | 0.0009 | 0.0027 | 0.0107 | 0.0134 | 0.0074 |
| ESC-Pro | 0.0231 | 0.0010 | 0.0030 | 0.0105 | 0.0143 | 0.0079 |

  • The toxicity analysis using Perspective API shows that ESC-Pro maintains a similar toxicity profile to ExTES-seed.
  • Preferred responses (ESC-Pro(+)) exhibit slightly lower toxicity than non-preference responses (ESC-Pro(-)), suggesting that high-quality strategy selection naturally leads to less harmful outputs.
  • Seeker utterances generally show higher toxicity, which is expected in emotional support dialogues as users express distress.
  • Supporter responses in ESC-Pro remain within a reasonable range, confirming that the preference-based expansion does not introduce significant toxicity risks. This ensures the dataset's suitability for safe and supportive ESC generation.

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 4 of the original paper:

| Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| --- | --- | --- | --- | --- |
| LLaMA-3.1-8B-Instruct | 29.79 | 1.18 | 38.77 | 23.48 |
| SFT | 31.25 | 2.65 | 39.27 | 23.30 |
| CSO | 34.51 | 1.11 | 41.11 | 23.89 |
| CSO-Random | 31.79 | 2.65 | 39.24 | 23.65 |
| Qwen-2.5-7B-Instruct | 19.83 | 2.47 | 28.12 | 23.52 |
| SFT | 21.54 | 2.45 | 29.11 | 23.72 |
| CSO | 23.16 | 2.09 | 32.26 | 24.17 |
| CSO-Random | 22.89 | 2.19 | 32.97 | 23.90 |
| Gemma-2-9B-it | 31.31 | 1.33 | 44.06 | 25.64 |
| SFT | 31.40 | 1.55 | 43.90 | 25.68 |
| CSO | 35.77 | 1.23 | 52.34 | 26.61 |
| CSO-Random | 29.86 | 1.22 | 44.58 | 25.75 |

  • The ablation study introduces ESC-Pro-Random, a variant where non-preferred strategies are randomly selected instead of being identified as low-scoring nodes by MCTS. This assesses the importance of structured generation of non-preference data.
  • As shown in Table 4, CSO-Random performs only slightly better than SFT but is consistently inferior to the full CSO across all backbones and metrics (e.g., LLaMA-3.1-8B-Instruct: CSO-Random $\mathcal{Q} = 31.79$, $\mathcal{B} = 2.65$ vs. CSO $\mathcal{Q} = 34.51$, $\mathcal{B} = 1.11$).
  • This confirms that MCTS's ability to search for and verify low-scoring nodes as truly non-preferred data is crucial. The quality of the non-preferred samples directly impacts the effectiveness of preference optimization, reinforcing the importance of structured preference learning in ESC.

6.3. Data Volume Analysis


Figure 4: Impact of ESC-Pro data volume on model performance across three backbones. Data size varies with 0 (baseline), 100, 200, 300, 400, and the full set (423 dialogues).

  • Figure 4 illustrates the impact of varying the number of ESC-Pro dialogues used for fine-tuning (0, 100, 200, 300, 400, and full 423 dialogues). SFT performance is indicated by dashed lines.
  • Upward Trend: Performance generally improves as data volume increases, demonstrating the value of more preference data.
  • Efficiency of Preference Data: Notably, CSO starts to outperform standard SFT with as few as 200-300 dialogues. For example, on LLaMA-3.1-8B-Instruct, CSO with 200 dialogues already surpasses SFT (dashed line) in Q and Qw scores, and significantly reduces B. This highlights the efficiency and scalability of preference data for ESC.
  • The results suggest that even a moderately sized ESC-Pro dataset can yield substantial improvements over SFT, making CSO a promising approach even with computational constraints on data generation.
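
A data-volume sweep of this kind can be set up by subsampling whole dialogues before flattening them into turn-level preference pairs. The loader below is a minimal sketch under an assumed JSON layout (a "pairs" field per dialogue), not the authors' released tooling:

```python
import json
import random

def preference_subsets(path, sizes=(100, 200, 300, 400, None), seed=42):
    """Yield (num_dialogues, flattened preference pairs) for each budget.

    `None` stands for the full set (423 dialogues in ESC-Pro).
    """
    with open(path, encoding="utf-8") as f:
        dialogues = json.load(f)  # assumed: list of dialogues, each with a "pairs" list
    rng = random.Random(seed)
    shuffled = dialogues[:]
    rng.shuffle(shuffled)
    for size in sizes:
        subset = shuffled if size is None else shuffled[:size]
        pairs = [pair for dialogue in subset for pair in dialogue["pairs"]]
        yield len(subset), pairs

# Each subset is then trained with the same CSO (DPO) recipe, producing the
# curves summarized in Figure 4.
```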

6.4. Analysis of Different Preference Optimization Algorithms

The following are the results from Table 9 of the original paper:

| Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| --- | --- | --- | --- | --- |
| LLaMA-3.1-8B-Instruct | 29.79 | 1.18 | 38.78 | 23.48 |
| Full: SFT | 30.28 | 2.65 | 37.33 | 23.77 |
| Full: CSO-DPO | 33.11 | 1.11 | 39.21 | 24.24 |
| Full: CSO-SimPO | 29.12 | 1.53 | 36.27 | 23.59 |
| Full: CSO-IPO | 35.48 | 1.04 | 37.74 | 24.19 |
| Full: CSO-KTO | 32.60 | 0.88 | 36.63 | 25.84 |
| Full: CSO-ORPO | 30.46 | 1.14 | 33.24 | 26.06 |
| LoRA: SFT | 31.25 | 2.65 | 39.27 | 23.30 |
| LoRA: CSO-DPO | 34.51 | 1.11 | 41.11 | 23.89 |
| LoRA: CSO-SimPO | 33.43 | 1.04 | 40.55 | 23.41 |
| LoRA: CSO-IPO | 33.00 | 1.13 | 39.40 | 23.55 |
| LoRA: CSO-KTO | 32.80 | 1.04 | 38.15 | 23.70 |
| LoRA: CSO-ORPO | 31.50 | 1.17 | 39.04 | 23.71 |

The following are the results from Table 10 of the original paper:

| Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| --- | --- | --- | --- | --- |
| Qwen-2.5-7B-Instruct | 19.84 | 2.47 | 28.12 | 23.52 |
| Full: SFT | 21.73 | 2.34 | 31.24 | 23.54 |
| Full: CSO-DPO | 28.78 | 1.92 | 34.39 | 26.16 |
| Full: CSO-SimPO | 23.51 | 2.01 | 34.84 | 21.54 |
| Full: CSO-IPO | 25.10 | 2.15 | 36.54 | 25.16 |
| Full: CSO-KTO | 25.56 | 1.56 | 38.44 | 26.65 |
| Full: CSO-ORPO | 22.20 | 1.24 | 35.91 | 24.58 |
| LoRA: SFT | 21.54 | 2.45 | 29.11 | 23.72 |
| LoRA: CSO-DPO | 23.16 | 2.09 | 32.26 | 24.17 |
| LoRA: CSO-SimPO | 25.91 | 2.02 | 30.45 | 23.32 |
| LoRA: CSO-IPO | 22.71 | 2.21 | 28.75 | 23.49 |
| LoRA: CSO-KTO | 22.91 | 2.17 | 30.82 | 23.61 |
| LoRA: CSO-ORPO | 22.49 | 2.12 | 28.42 | 23.35 |

The following are the results from Table 11 of the original paper:

| Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| --- | --- | --- | --- | --- |
| Gemma-2-9b-it | 31.31 | 1.33 | 44.06 | 25.64 |
| Full: SFT | 32.52 | 1.29 | 46.45 | 25.25 |
| Full: CSO-DPO | 35.61 | 1.54 | 47.95 | 26.63 |
| Full: CSO-SimPO | 26.67 | 2.03 | 48.03 | 25.60 |
| Full: CSO-IPO | 32.02 | 1.70 | 45.29 | 25.81 |
| Full: CSO-KTO | 39.73 | 0.81 | 48.87 | 27.84 |
| Full: CSO-ORPO | 34.80 | 1.24 | 48.28 | 27.52 |
| LoRA: SFT | 31.40 | 1.55 | 43.90 | 25.68 |
| LoRA: CSO-DPO | 35.77 | 1.23 | 52.34 | 26.61 |
| LoRA: CSO-SimPO | 34.95 | 1.19 | 51.62 | 26.22 |
| LoRA: CSO-IPO | 34.16 | 1.29 | 51.62 | 26.20 |
| LoRA: CSO-KTO | 35.89 | 0.99 | 48.53 | 26.45 |
| LoRA: CSO-ORPO | 32.35 | 1.30 | 48.14 | 25.85 |
  • The evaluation of alternative preference optimization algorithms (SimPO, IPO, KTO, ORPO) on ESC-Pro consistently shows that all variants outperform standard SFT. This reinforces the fundamental advantage of preference-driven learning in ESC.
  • While DPO is the default CSO implementation, other algorithms such as CSO-IPO and CSO-KTO achieve even higher performance on certain metrics or backbones. For instance, CSO-IPO (Full) on LLaMA-3.1-8B-Instruct achieves a Q of 35.48 and a B of 1.04, surpassing CSO-DPO's Q of 33.11. Similarly, CSO-KTO (Full) on Gemma-2-9B-it achieves an impressive Q of 39.73 and a very low B of 0.81 (the core pairwise objectives behind these variants are sketched after this list).
  • This analysis validates the versatility of the ESC-Pro dataset itself as a high-quality resource for various preference optimization techniques, suggesting that CSO's benefits are rooted in the turn-level preference data rather than being specific to DPO alone.
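
For orientation, the pairwise objectives behind these variants differ mainly in how the margin between the preferred and non-preferred pair is defined. The standard DPO and SimPO losses (as stated in their original method papers; not reproduced from this work's appendix) are:

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

\mathcal{L}_{\mathrm{SimPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[\log \sigma\!\left(
      \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
      - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma
    \right)\right]
```

Here x is the dialogue context, y_w and y_l are the preferred and non-preferred turn-level strategy-response pairs from ESC-Pro, π_ref is the frozen reference model, σ is the sigmoid, and β, γ are temperature/margin hyperparameters. IPO replaces the log-sigmoid with a squared regression target, KTO learns from unpaired desirable/undesirable examples, and ORPO folds an odds-ratio penalty into the SFT loss without a reference model, which helps explain why different variants lead on different backbones and metrics.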

6.5. Results of Large-Scale Backbone

The following are the results from Table 12 of the original paper:

| Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| --- | --- | --- | --- | --- |
| Qwen2.5-32B-Instruct | 37.37 | 1.40 | 41.97 | 24.37 |
| Direct-Refine | 34.87 | 1.68 | 40.69 | 23.14 |
| Self-Refine | 14.63 | 0.98 | 26.97 | 21.74 |
| w/ Example | 20.66 | 2.65 | 24.40 | 21.93 |
| SFT-LoRA | 37.69 | 1.60 | 42.09 | 24.40 |
| CSO-LoRA | 38.53 | 1.29 | 43.95 | 24.65 |
  • Experiments on a larger model, Qwen2.5-32B-Instruct, using LoRA-based preference optimization, further confirm CSO's effectiveness.
  • CSO-LoRA consistently outperforms all baselines across all four metrics (Q, B, Qw, and ROUGE-L). For example, CSO-LoRA achieves a Q of 38.53 and a B of 1.29, compared to SFT-LoRA's Q of 37.69 and B of 1.60 (a hedged training-configuration sketch follows this list).
  • This demonstrates that CSO remains effective even at larger scales, showcasing its strong scalability and robustness in enhancing both adaptability and emotional intelligence in LLM-based emotional support.
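
As a rough illustration of this kind of LoRA-based preference run (not the authors' released code: hyperparameter values are placeholders, device placement/quantization is omitted, and trl/peft argument names vary across library versions):

```python
# Illustrative LoRA + DPO setup; treat every value below as an assumption.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# ESC-Pro-style turn-level preference pairs: prompt = dialogue context,
# chosen/rejected = strategy-prefixed supporter responses.
train_dataset = Dataset.from_list([
    {"prompt": "...dialogue context...",
     "chosen": "(Affirmation) ...",
     "rejected": "(Share Information) ..."},
])

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="cso-lora-qwen32b",
    beta=0.1,                       # DPO temperature (placeholder value)
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # named `tokenizer` in older trl releases
    peft_config=peft_config,
)
trainer.train()
```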

6.6. Case Study

The following are the results from Table 16 of the original paper:

Conversation Background

Scene: Breakups or Divorce
Description: After five years of marriage, my husband and I have decided to separate due to irreconcilable differences. I am feeling overwhelmed and uncertain as I navigate through this difficult time.

Conversation History

Seeker: Hey, I really need someone to talk to right now. Can you lend me an ear?
Supporter (Emotional Validation): Of course! I'm here for you. What's been going on?
Seeker: My husband and I have decided to separate. It's been really hard to accept this reality.
Supporter (Emotional Validation): I'm so sorry to hear that. Separating from a long-term partner can be incredibly challenging. It's normal to feel overwhelmed and unsure about what lies ahead. Can you share more about your feelings?
Seeker: I feel like everything fell apart so quickly. We had dreams about raising a family together and now it's all shattered.
Supporter (Emotional Validation): It's really tough when our dreams and plans change unexpectedly. You're experiencing a whirlwind of emotions right now. It's completely understandable that you're feeling devastated and lost. Know that you're not alone in going through this difficult process.
Seeker: Thank you. I can't help but blame myself for the breakdown of our marriage. I keep wondering if I did something wrong.
Supporter (Emotional Validation): It's natural to feel a sense of self-blame during such a challenging time. However, I want to emphasize that relationships involve the contributions of both partners, and it's rarely a reflection of just one person's actions. You are worthy of love and happiness, regardless of the outcome of this marriage.
Seeker: I appreciate your kind words. It's just hard to let go of all the memories and years we spent together. I miss him so much.
Supporter (Empathetic Statements): It's completely understandable that you're missing your husband. Remember that healing takes time and it's okay to grieve the loss of your marriage. It's important to be patient with yourself and allow yourself to process the many emotions that may arise. Lean on your support system and engage in self-care activities that bring you comfort and joy.
Seeker: I'll try my best. Right now, it feels impossible to imagine moving on. Do you have any advice on how to start the healing process?

Responses

Preferred (Affirmation): While it may seem overwhelming now, know that healing and growth are possible. Focus on taking care of yourself both physically and emotionally. Engage in activities that bring you happiness and surround yourself with loved ones who uplift you. Consider seeking support from a therapist or counselor who can help guide you through this process. You are stronger than you realize, and I believe in your ability to navigate through this difficult time.
Non-Preferred (Collaborative Planning): Absolutely! Let's break it down together. What's one small step you think you could take this week to begin healing?
Non-Preferred (Avoid Judgment and Criticism): Absolutely! Let's break it down together. What's one small step you think you could take this week to begin healing?
Non-Preferred (Reframe Negative Thoughts): I get that moving on feels impossible right now. But think about it this way: healing is a journey, not a race. Each small step counts.
Non-Preferred (Share Information): Starting the healing process can be tough, but small steps can help. Journaling your feelings, talking to friends, or exploring new hobbies can be beneficial.

The following are the results from Table 17 of the original paper:

Scene: Depression and Low Mood
Description: I've been feeling emotionally exhausted and overwhelmed by everything happening in my life. It's hard to stay positive when it feels like things keep piling up and nothing changes.

Seeker (Sadness): I've been feeling really overwhelmed lately. I don't know how to handle everything that's going on.
Supporter (Emotional Validation): It sounds like you're carrying a lot right now, and it's understandable that you're feeling overwhelmed. It's okay to feel that way.
Seeker (Frustration): Yeah, it's just so much, and I feel like I can't get a grip on it.
Supporter (Empathetic Statements): It's tough when everything seems like it's piling up. I can hear how much you're struggling with it.
Seeker (Hopelessness): I'm trying to stay positive, but it's hard.
Supporter (Affirmation): It's really impressive that you're trying to stay positive despite everything. That shows real strength.
Seeker (Overwhelm): I just don't know where to start.
Supporter (Collaborative Planning): Let's take it step by step. What's one small thing you could focus on right now?
Seeker (Hope): Maybe I can tackle some smaller tasks to get the ball rolling.
Supporter (Provide Different Perspectives): That's a great perspective. Starting small can help you feel more in control and build momentum.
Seeker (Defeat): I feel like I've been stuck in this cycle, and nothing changes.
Supporter (Reframe Negative Thoughts): It's easy to feel stuck when it seems like things aren't improving. But every small step forward counts, even if it doesn't feel like it.
Seeker (Acceptance): You're right. Maybe I need to be kinder to myself through this.
Supporter (Avoid Judgment and Criticism): Absolutely. Being kind to yourself is key. You're doing your best, and that's enough.
Seeker (Hopeful): I think I can manage it now, one step at a time.
Supporter (Emotional Validation): That's a wonderful mindset. Taking it one step at a time will help you regain your balance.
  • Preference Dialogue Example (Table 16): This table provides a concrete example from ESC-Pro, showcasing a seeker's statement and multiple supporter responses classified as preferred or non-preferred. When the seeker says, "I'll try my best. Right now, it feels impossible to imagine moving on. Do you have any advice on how to start the healing process?", the preferred response uses an Affirmation strategy, focusing on possibilities ("healing and growth are possible") and practical self-care advice. In contrast, non-preferred responses include Collaborative Planning (which might be too direct at this stage), Avoid Judgment and Criticism (too generic), Reframe Negative Thoughts (potentially premature), or Share Information (less personalized). This clearly illustrates how ESC-Pro distinguishes between contextually appropriate and less effective strategies.

  • Emotional Continuity Case Study (Table 17): This case study demonstrates CSO's ability to implicitly preserve emotional continuity over multiple turns by selecting context-appropriate strategies. The supporter (model trained with CSO) adapts its strategies to the seeker's changing emotional state. It starts with Emotional Validation and Empathetic Statements when the seeker expresses sadness and frustration. As the conversation progresses and the seeker moves from hopelessness to hope, the supporter shifts to Affirmation and Collaborative Planning. When the seeker shows acceptance and hope, the supporter responds with Avoid Judgment and Criticism and Emotional Validation. This smooth progression highlights that CSO doesn't just react to the current emotion but supports the overall emotional flow and trajectory of the conversation.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively addresses the critical challenges of low strategy selection accuracy and preference bias in Emotional Support Conversations (ESC) generated by Large Language Models (LLMs). The authors introduce a novel two-stage framework, Chain-of-Strategy Optimization (CSO). The first stage leverages Monte Carlo Tree Search (MCTS) to systematically construct ESC-Pro, a high-quality, turn-level preference dataset that explicitly captures both preferred and non-preferred strategy-response pairs. The second stage applies preference optimization (specifically DPO) on ESC-Pro, enabling LLMs to learn nuanced strategy trade-offs at each dialogue turn. Extensive experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B consistently demonstrate that CSO significantly outperforms standard Supervised Fine-Tuning (SFT) and decoding-based methods in terms of strategy accuracy and bias mitigation. Human evaluations further confirm CSO's superiority in generating more acceptable, effective, and sensitive responses, validating the efficacy of fine-grained, turn-level preference modeling for enhancing ESC quality and adaptability.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Computational Resources: The experiments were conducted on mid-scale LLMs (7B-32B parameters). Larger models (e.g., 70B+) could offer deeper insights into CSO's scalability and potentially yield even better performance.

  • ESC-Pro Dataset Size: The ESC-Pro dataset was generated from a seed set of 100 dialogues. While effective, a larger seed dataset or alternative expansion strategies (e.g., human-in-the-loop validation) could further enhance the diversity and quality of the preference data.

    Future work suggestions include:

  • Personalization: Greater emphasis on personalization (Liu et al., 2025; Qiu et al., 2025b,a) to tailor ESC responses to user-specific needs.

  • Safety: More focus on safety (Zhao et al., 2025) to ensure CSO-driven dialogue systems are robust against potential risks and ethical concerns, especially in emotionally sensitive applications. This includes addressing potential biases inherent in LLMs.

7.3. Personal Insights & Critique

This paper presents a robust and well-thought-out approach to a critical problem in AI-powered emotional support. The use of MCTS to generate a preference dataset is particularly innovative. Instead of relying solely on scarce human annotations for preferred vs. non-preferred pairs, which is common in DPO-like setups, MCTS with LLM-based reward functions provides a scalable and systematic way to create such data. This is a significant contribution to the broader field of RLHF and preference learning, not just ESC.

The explicit focus on turn-level strategy optimization and bias mitigation is also highly relevant. Emotional support is inherently dynamic, requiring an LLM to adapt its approach as the seeker's emotional state and needs evolve. CSO's ability to model these nuances across turns moves LLMs beyond simple response generation to more sophisticated conversational strategy.

One area for further exploration, beyond the authors' mention, could be the robustness of the Reward LLM and Strategy LLM. The quality of the ESC-Pro dataset hinges entirely on these LLMs' ability to accurately score empathy, information, humanoid quality, and strategy effectiveness. While GPT-4o-mini is powerful, its inherent biases or limitations could propagate into the generated dataset. A detailed analysis or ablation study on the sensitivity of CSO's performance to the choice and fine-tuning of these LLMs would be valuable. For instance, what if a less capable Reward LLM was used? How would the dataset quality and subsequent CSO performance be affected?

Furthermore, while the paper demonstrates improved strategy accuracy and bias mitigation, the subjective nature of emotional support means that "optimal" strategies can vary greatly across individuals and cultures. Future work could explore incorporating user-specific preference models or cultural context into the MCTS reward function or the DPO objective to achieve even more personalized and universally acceptable emotional support.

The finding that CSO allows smaller open-weight models to outperform larger closed-source models in certain ESC metrics is very inspiring. It suggests that domain-specific preference learning, even with relatively smaller models, can be more impactful than simply scaling up LLM size, especially for specialized and sensitive applications like ESC. This opens up avenues for more accessible and efficient deployment of AI-powered emotional support tools.
