Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
TL;DR Summary
The study introduces Chain-of-Strategy Optimization (CSO), which uses MCTS to build the ESC-Pro preference dataset for fine-grained, turn-level strategy tuning, improving LLMs' strategy selection accuracy, reducing preference bias, and yielding more empathetic responses in emotional support conversations.
Abstract
The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it rigidly trains models on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these limitations, we propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes strategy selection preferences at each dialogue turn. We first leverage Monte Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, enabling LLMs to generate more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, highlighting the efficacy of fine-grained, turn-level preference modeling in ESC.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
1.2. Authors
Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu. The authors are affiliated with:
- Harbin Institute of Technology
- Singapore Management University
- Central South University
- iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd
1.3. Journal/Conference
The paper is available on arXiv, a preprint server. As a preprint, it has not yet undergone formal peer review for journal or conference publication. However, arXiv is a highly influential platform for the rapid dissemination of research in fields such as AI and machine learning.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the growing demand for Emotional Support Conversations (ESC) and the limitations of Large Language Models (LLMs) in this domain, specifically their low strategy selection accuracy and preference bias. Existing supervised fine-tuning (SFT) methods are insufficient as they train on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these, the authors propose Chain-of-Strategy Optimization (CSO), a novel two-stage approach that optimizes strategy selection preferences at each dialogue turn. First, they use Monte Carlo Tree Search (MCTS) to create ESC-Pro, a high-quality preference dataset containing turn-level strategy-response pairs. Then, training LLMs on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, leading to more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, emphasizing the effectiveness of fine-grained, turn-level preference modeling in ESC.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2503.05362
- PDF Link: https://arxiv.org/pdf/2503.05362v3.pdf
- Publication Status: Preprint (published on arXiv).
2. Executive Summary
2.1. Background & Motivation
The paper addresses the critical need for effective Emotional Support Conversations (ESC) in modern society, where emotional stress is prevalent. Large Language Models (LLMs) show significant potential for ESC due to their advanced conversational abilities. However, existing LLMs face two major limitations when applied to ESC:
- Low Strategy Selection Accuracy: LLMs often fail to choose the most contextually appropriate support strategies, leading to less effective or even unhelpful responses.
- Preference Bias: LLMs tend to rigidly favor certain strategies, lacking the adaptability to respond dynamically to a user's evolving emotional needs throughout a conversation.

These issues are exacerbated by Supervised Fine-Tuning (SFT), the mainstream training approach for LLMs, which trains models on single gold-standard responses. This rigidity prevents LLMs from understanding why certain strategies are inappropriate in specific contexts or how to navigate nuanced strategy trade-offs. The core problem the paper aims to solve is improving LLMs' ability to provide high-quality emotional support by enhancing their strategy selection accuracy and mitigating preference bias in a fine-grained, turn-level manner.
2.2. Main Contributions / Findings
The paper introduces Chain-of-Strategy Optimization (CSO) and makes the following primary contributions:
- ESC-Pro Dataset: The authors present ESC-Pro, a novel, high-quality, turn-level preference dataset specifically designed for Emotional Support Conversations. It is constructed by using Monte Carlo Tree Search (MCTS) to expand existing gold-standard dialogues into a conversation tree, enabling the identification of both preferred (high-quality) and non-preferred (low-quality) strategy-response pairs at each dialogue turn. This addresses the lack of data that models nuanced strategy trade-offs.
- Chain-of-Strategy Optimization (CSO) Framework: They propose CSO, a two-stage framework that refines LLM behavior through preference learning. By explicitly targeting strategy-level improvements across turns, CSO teaches LLMs to select better strategies and avoid rigid patterns, fostering local adaptability and long-range strategic coherence.
- Demonstrated Efficacy: Extensive experiments on three LLM backbones (LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B) consistently show that CSO outperforms standard SFT and decoding-based methods, yielding higher strategy selection accuracy and substantially lower strategy preference bias. This highlights the effectiveness of fine-grained, turn-level preference modeling for improving ESC quality and adaptability. The findings also show that CSO-equipped open-weight models can surpass several leading closed-source models on certain metrics.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Emotional Support Conversation (ESC): Dialogues aimed at providing psychological relief and guidance to individuals experiencing emotional distress. The goal is to alleviate emotional intensity, offer comfort, and provide constructive advice. In ESC, a seeker (user) expresses an emotional problem, and a supporter (AI or human) uses specific communication strategies to guide the conversation.
- Large Language Models (LLMs): Advanced AI models trained on vast amounts of text to understand, generate, and process human language. They can perform various Natural Language Processing (NLP) tasks, including translation, summarization, and conversational AI. Examples include GPT-4, LLaMA, Gemma, and Qwen.
- Supervised Fine-Tuning (SFT): A common technique for adapting a pre-trained LLM to a specific task. In SFT, the model is trained on input-output pairs (e.g., dialogue history and a desired response) where the "correct" output is provided as a gold standard, and it learns to reproduce these gold-standard outputs.
- Preference Optimization Algorithms: A family of algorithms designed to align LLM outputs with human preferences. Instead of relying on a single gold-standard response, they learn from pairs of preferred and non-preferred responses, teaching the model to generate outputs that are more likely to be favored by humans.
  - Direct Preference Optimization (DPO): A preference optimization algorithm that simplifies the Reinforcement Learning from Human Feedback (RLHF) pipeline. Instead of training a separate reward model and then applying reinforcement learning (such as Proximal Policy Optimization, PPO), DPO directly optimizes the LLM's policy to favor preferred responses over rejected ones, given a dataset of (prompt, preferred response, rejected response) triples. It frames preference learning as a simple classification-style objective, making it computationally cheaper.
  - Proximal Policy Optimization (PPO): A Reinforcement Learning (RL) algorithm widely used for LLM fine-tuning once a reward model has been trained. PPO updates the policy (the LLM) in small, constrained steps to keep learning stable and avoid large, disruptive updates that could degrade performance.
- Monte Carlo Tree Search (MCTS): A search algorithm for decision-making in complex domains, best known from games such as Go. It simulates future states to evaluate the potential of different moves. In this paper, MCTS is adapted to explore and evaluate different strategy-response paths in a conversation, building a tree of possible dialogues and identifying high-quality sequences. It proceeds in four phases:
  - Selection: Traverse the tree from the root node to a leaf node using a policy that balances exploration (trying new paths) and exploitation (following promising paths); the Predictor Upper Confidence Bound (PUCB) formula is typically used for this.
  - Expansion: Once a leaf node is reached, expand it by adding one or more child nodes corresponding to possible next actions (e.g., different strategies in ESC).
  - Simulation: Perform a playout (rollout) from the newly expanded node to a terminal state (or a fixed depth), choosing actions randomly or semi-randomly until the dialogue ends.
  - Backpropagation: Propagate the reward obtained from the simulation back up the tree, updating the Q-values (expected rewards) and visit counts of all nodes along the simulated path. This information guides future selections.
- ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence): A metric for evaluating generated text against a reference. It measures the longest common subsequence (LCS) between a generated text and a reference text to capture sentence-level structural similarity; higher ROUGE-L scores indicate more similar content.
3.2. Previous Works
The paper contextualizes its contributions within existing research on Emotional Support Conversation (ESC) and Preference Optimization.
- ESC Systems:
  - Early ESC systems relied on techniques such as global-to-local hierarchical graph networks (Peng et al., 2022), incorporating commonsense knowledge (Tu et al., 2022), and modeling emotions and semantics (Zhao et al., 2023a,c), typically built on traditional NLP and machine learning methods.
  - With the advent of LLMs, recent work has leveraged their conversational capabilities. Liu et al. (2023) applied SFT to LLaMA-7B for ESC, producing ChatCounselor, a specialized model that outperforms general-purpose LLMs. Chen et al. (2023) and Qiu et al. (2023) expanded single-turn empathic responses to multi-turn dialogues and fine-tuned models on these high-quality datasets.
  - Limitation of SFT in ESC: The paper highlights that SFT, while useful, fundamentally struggles with ESC because it cannot differentiate between optimal and suboptimal strategies. By training only on gold-standard responses, SFT models learn what to say but not why certain alternatives are less suitable, leading to strategy preference bias and low accuracy in dynamic contexts.
- Preference Optimization Algorithms:
  - Evolution from RL: Traditional approaches to preference optimization involved policy-based reinforcement learning (RL), typically training a separate reward model (Ouyang et al., 2022) followed by online RL methods such as PPO (Schulman et al., 2017) for LLM fine-tuning.
  - Direct Preference Optimization (DPO): Rafailov et al. (2023) streamlined this by unifying reward modeling and RL into a single step with DPO, significantly reducing computational costs.
  - Subsequent Optimizations: Later methods built on DPO to improve efficiency, robustness, or data applicability: SimPO (Meng et al., 2024) eliminates DPO's reference model; IPO (Azar et al., 2024) enhances robustness through regularization; ORPO (Hong et al., 2024) integrates instruction tuning and preference alignment; KTO (Ethayarajh et al., 2024) operates without paired data by leveraging prospect theory.
- Gap Addressed by This Work: The paper positions itself as bridging ESC and preference optimization. While preference optimization has advanced rapidly, its application to the specific challenges of ESC (strategy accuracy and preference bias) is the key novelty here.
3.3. Technological Evolution
The evolution of AI in conversational agents has moved from rule-based systems to statistical models, then to neural networks, and most recently to Large Language Models (LLMs). In ESC, this has translated from early systems relying on explicit knowledge and hand-crafted rules to LLMs fine-tuned on conversational data. However, the limitation of SFT (training on single gold-standard responses) became apparent for nuanced tasks like ESC, which require dynamic strategic adaptation. Reinforcement Learning from Human Feedback (RLHF) emerged to address this by learning from human preferences. DPO and its variants simplified RLHF. This paper further refines this by applying preference optimization specifically to the turn-level strategy selection in ESC, generating a unique preference dataset via MCTS to capture these nuances. This work fits within the timeline by pushing the boundaries of LLM adaptability and strategic intelligence in emotionally sensitive domains.
3.4. Differentiation Analysis
Compared to existing SFT-based ESC models, this paper's approach, CSO, offers several key differentiators and innovations:
- Fine-Grained Preference Modeling: SFT trains on single gold-standard responses, implicitly treating them as the only correct option. CSO, in contrast, explicitly models preferences between preferred and non-preferred strategy-response pairs at the turn level. This allows the LLM to learn not only what is good but also why other options are suboptimal, addressing the core limitation of SFT in ESC.
- MCTS-Generated Preference Dataset (ESC-Pro): Unlike manually annotated datasets or simpler augmentation techniques, ESC-Pro is systematically constructed with Monte Carlo Tree Search. MCTS enables comprehensive exploration of diverse conversational paths and strategic choices, guided by LLM-based reward functions. The result is a richer, more diverse dataset that captures nuanced strategy trade-offs and explicitly marks suboptimal strategies as non-preferred samples, which SFT data lacks.
- Bias Mitigation: SFT tends to reinforce existing strategy biases because it only sees gold-standard examples. CSO, through preference optimization on ESC-Pro, actively mitigates bias by teaching the model to avoid rigid adherence to certain strategies and to adapt to users' dynamic emotional needs.
- Chain-like Optimization: CSO aims for chain-like optimization across turns, improving both local adaptability (selecting the best strategy for the current turn) and long-range strategic coherence (maintaining a consistent, effective supportive narrative). SFT struggles with this multi-turn strategic consistency.
- Robustness and Versatility: Experiments demonstrate CSO's effectiveness across multiple LLM backbones and even with LoRA fine-tuning, suggesting it is a versatile and practical approach for enhancing ESC quality.
4. Methodology
The proposed Chain-of-Strategy Optimization (CSO) framework is a novel two-stage approach designed to refine LLMs' strategy selection in Emotional Support Conversations (ESC) at each dialogue turn. It first constructs a high-quality preference dataset (ESC-Pro) and then applies a specialized preference optimization technique. The overall framework is illustrated in Figure 2.
4.1. Principles
The core idea behind CSO is to move beyond the limitations of supervised fine-tuning (SFT), which trains models on single gold-standard responses. Instead, CSO aims to teach LLMs the nuances of strategy trade-offs by explicitly modeling preferences between good and bad strategy-response pairs at each conversational turn. This is achieved through two main principles:
- Structured Dialogue Exploration and Evaluation: Use an intelligent search mechanism (Monte Carlo Tree Search) to systematically explore different conversational paths and evaluate the quality of various strategy-response pairs, generating a comprehensive dataset of preferred and non-preferred interactions.
- Turn-Level Preference Optimization: Apply preference learning (e.g., DPO) on this fine-grained dataset to directly optimize the LLM's policy to select better strategies and avoid suboptimal ones, thereby improving strategy accuracy and mitigating preference bias.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Stage 1: Dialogue Expansion with MCTS
The first stage involves enriching existing ESC datasets with turn-level preference annotations by employing Monte Carlo Tree Search (MCTS). MCTS systematically constructs a conversation tree, where each node represents a dialogue state. The search process refines dialogue strategies through four stages: selection, expansion, simulation, and backpropagation. This allows the model to explore various strategy paths and identify optimal responses guided by a predefined value function.
Specialized LLMs for MCTS:
To guide the MCTS process, four specialized LLMs are employed, each with a distinct role:
- Seeker LLM: Generates responses from the seeker's perspective based on the conversation history, and may instead emit an end-of-dialogue token. The Seeker LLM is GPT-4o-mini, with its prompt given in the paper's appendix. The Seeker's action is defined as:
  $a_{\mathrm{seeker}} := f_{\mathrm{seeker}}(H_{\mathrm{supporter}})$
  where $H_{\mathrm{supporter}}$ is the conversation history up to the last Supporter turn. The function returns the Seeker's next response or an end-of-conversation token.
- Supporter LLM: Generates replies from the supporter's perspective, conditioned on the seeker's last statement and a chosen strategy. The Supporter LLM is GPT-4o-mini, with its prompt shown in Figure 6 of the appendix. The Supporter's action is defined as:
  $a_{\mathrm{supporter}} := f_{\mathrm{supporter}}(H_{\mathrm{seeker}}, \mathrm{Strategy})$
  where $H_{\mathrm{seeker}}$ is the conversation history up to the last Seeker turn and $\mathrm{Strategy}$ is the selected response strategy.
- Strategy LLM: Evaluates and ranks the available support strategies at each conversation turn, assigning each a score from 1 to 10. The Strategy LLM is GPT-4o-mini, with its prompt shown in Figure 7 of the appendix. Its action is defined as:
  $a_{\mathrm{strategy}} := f_{\mathrm{strategy}}(H_{\mathrm{seeker}})$
  The output scores are normalized with the softmax function into a probability distribution that feeds the PUCB formula.
- Reward LLM: Assesses the quality of each strategy-response pair on four metrics: Empathy (E), Information (I), Humanoid (H), and Strategy (S); this serves as the value function for MCTS. The Reward LLM is GPT-4o-mini, with its prompt shown in Figure 8 and detailed evaluation criteria in Table 13 of the appendix. Its action is defined as:
  $a_{\mathrm{reward}} := f_{\mathrm{reward}}(H_{\mathrm{supporter}})$
  returning the scores $(E_i, I_i, H_i, S_i)$ for the $i$-th turn.
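To make the division of labor concrete, the four roles above can be viewed as thin wrappers around a single judge/generator model. The sketch below is illustrative only: `chat` is a hypothetical helper standing in for a GPT-4o-mini call, and the prompt strings are not the paper's actual appendix prompts.

```python
# Minimal sketch of the four MCTS roles as thin wrappers around one chat model.
# `chat` is a hypothetical single-call helper; the prompts are illustrative only.
from typing import Dict, List

def chat(system: str, user: str) -> str:
    """Placeholder for one chat-completion call to the judge/generator model."""
    raise NotImplementedError

def seeker_turn(history: str) -> str:
    # f_seeker(H_supporter): next seeker utterance, or an end-of-dialogue marker.
    return chat("You are the help seeker. Reply, or output [END] if the issue feels resolved.", history)

def supporter_turn(history: str, strategy: str) -> str:
    # f_supporter(H_seeker, Strategy): supporter reply conditioned on the chosen strategy.
    return chat(f"You are the supporter. Respond using the strategy: {strategy}.", history)

def score_strategies(history: str, strategies: List[str]) -> Dict[str, float]:
    # f_strategy(H_seeker): one 1-10 score per candidate strategy, one "name: score" per line.
    raw = chat("Rate each candidate strategy from 1 to 10 for the next supporter turn, "
               "one 'strategy: score' per line.",
               history + "\nCandidates: " + ", ".join(strategies))
    scores = {}
    for line in raw.splitlines():
        if ":" in line:
            name, value = line.rsplit(":", 1)
            scores[name.strip()] = float(value)
    return scores

def reward_turn(history: str) -> Dict[str, float]:
    # f_reward(H_supporter): Empathy, Information, Humanoid, Strategy scores in [0, 4].
    raw = chat("Score the last supporter turn on Empathy, Information, Humanoid and Strategy "
               "(0-4), one 'metric: score' per line.", history)
    return {k.strip(): float(v) for k, v in
            (line.rsplit(":", 1) for line in raw.splitlines() if ":" in line)}
```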
Node Types and Representation:
Each node in the conversation tree represents a state and is classified into four types:
- Root Node: Represents the initial dialogue turn, containing the Seeker's first response, a Q-value, and a visit count:
  $S_{\mathrm{root}} = (R_{\mathrm{seeker}_0}, Q, N)$
- Unexpanded Node: Represents an unexplored strategy, with the strategy $a$, its score $r$, a Q-value initialized to 0, and a visit count initialized to 0:
  $S_{\mathrm{unexp}} = (a, r, Q = 0, N = 0)$
- Expanded Node: Has been explored, and stores the strategy $a$, score $r$, the Seeker and Supporter responses, the Q-value, and the visit count:
  $S_{\mathrm{exp}} = (a, r, R_{\mathrm{seeker}}, R_{\mathrm{supporter}}, Q, N)$
- End Node: Marks the end of the dialogue, and stores the strategy $a$, score $r$, the Supporter's response, the Q-value, and the visit count:
  $S_{\mathrm{end}} = (a, r, R_{\mathrm{supporter}}, Q, N)$
MCTS Stages:
- Selection: The selection phase identifies the next node to explore using the Predictor Upper Confidence Bound (PUCB) formula, which balances exploration (trying new strategies) and exploitation (focusing on promising ones) based on a node's Q-value (estimated reward), its visit count, and its parent's visit count:
  $\mathbf{PUCB}(S) = Q(S) + c \cdot P(S) \cdot \frac{\sqrt{N(\mathbf{Parent}(S))}}{N(S) + 1}$
  - $S$: the node being evaluated.
  - $Q(S)$: the Q-value (average reward) of node $S$.
  - $c$: a hyperparameter controlling the exploration-exploitation balance; a larger $c$ encourages more exploration.
  - $P(S)$: the prior probability of selecting the strategy associated with node $S$, derived from the Strategy LLM's scores (normalized via softmax).
  - $N(S)$: the visit count of node $S$, i.e., how many times it has been visited during MCTS.
  - $N(\mathbf{Parent}(S))$: the visit count of the parent of $S$.
  Selection proceeds layer by layer, starting from the root node and choosing the node with the highest PUCB value at each step.
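A minimal sketch of PUCB-based selection, reusing the `Node` container sketched earlier (its `prior` field holds the softmax-normalized Strategy-LLM score $P(S)$); the exploration constant defaults to the paper's reported $c = 1$.

```python
import math

def pucb(node: "Node", c: float = 1.0) -> float:
    # PUCB(S) = Q(S) + c * P(S) * sqrt(N(Parent(S))) / (N(S) + 1); c = 1 in the paper's setup.
    parent_visits = node.parent.N if node.parent is not None else 0
    return node.Q + c * node.prior * math.sqrt(parent_visits) / (node.N + 1)

def select(root: "Node") -> "Node":
    # Descend layer by layer, always taking the child with the highest PUCB value,
    # until a leaf (unexpanded or terminal) node is reached.
    node = root
    while node.children:
        node = max(node.children, key=pucb)
    return node
```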
- Expansion: In the expansion phase, the selected node is evaluated.
  - If the node is an expanded node, the Strategy LLM scores all possible strategies for the Supporter at this node and generates an unexpanded child node for each strategy, initialized with zero Q-value and visit count:
    $S_{\mathrm{child}} = (a, r, Q = 0, N = 0)$
  - If the selected node is an unexpanded node, the conversation history is generated with the Seeker and Supporter LLMs: the Supporter LLM produces a response based on the selected strategy and the conversation history, and the Seeker LLM then replies to the updated history. Once the node is expanded, the conversation history is stored. If the Seeker generates an end-of-dialogue token, the reward is computed and the process moves directly to backpropagation.
- Simulation: The simulation phase operates on the newly expanded child nodes. To reduce computational cost, only the child node with the highest strategy score is simulated.
  - Node Expansion: The chosen child node transitions from an unexpanded node to an expanded node, with the Seeker and Supporter LLMs generating the conversation content as needed.
  - Greedy Simulation: A greedy simulation is run for a predefined number of turns: at each step the Supporter LLM responds with the highest-scoring strategy and the Seeker LLM replies. The simulation continues for the fixed number of steps or until the Seeker emits an end-of-dialogue token.
  - Reward Calculation: During the simulation, the Reward LLM evaluates conversation quality with four metrics: Empathy (E), Information (I), Humanoid (H), and Strategy (S). The per-turn reward is:
    $R = \frac{E + I + H + \alpha \cdot S}{10} + b$
    - $E, I, H, S$: scores for Empathy, Information, Humanoid quality, and Strategy effectiveness, each ranging from 0 to 4 (see Table 13), computed by the Reward LLM for the current turn.
    - $\alpha$: a scaling hyperparameter for the Strategy score, adjusting its relative weight in the overall reward.
    - $b$: a bias term; raw rewards below $-b$ become negative, steering the search towards higher-scoring nodes. In the experiments, $\alpha = 7$ and $b = -3$.
    The simulation reward $R_{\mathrm{sim}}(S)$ is averaged over all Supporter turns of the simulation and is used to update the node's Q-value in the backpropagation phase:
    $R_{\mathrm{sim}}(S) = \frac{1}{r_{\mathrm{sim\_end}} - r(S)} \sum_{i = r(S)}^{r_{\mathrm{sim\_end}}} R_{\mathrm{sim}}^{(i)}$
    - $r(S)$: the turn index at which the simulation starts from node $S$.
    - $r_{\mathrm{sim\_end}}$: the turn index at which the simulation ends.
    - $R_{\mathrm{sim}}^{(i)}$: the reward computed at turn $i$ of the simulation.
- Backpropagation: The simulation reward is propagated backward up the tree, updating the Q-value and visit count of every node $k$ along the simulated path:
  $Q_k = \frac{N_k \cdot Q_k + R_{\mathrm{sim}}}{N_k + 1}, \quad N_k = N_k + 1$
  - $Q_k$: the Q-value of node $k$.
  - $N_k$: the visit count of node $k$.
  - $R_{\mathrm{sim}}$: the simulation reward computed in the previous step.
  This process refines the estimated node values and improves future strategy selections.
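The per-turn reward and the backpropagation update fit in a few lines. The sketch below uses the hyperparameter values reported in the implementation details ($\alpha = 7$, $b = -3$) and the `Node` container from earlier; it is a simplified illustration rather than the authors' code.

```python
def turn_reward(E: float, I: float, H: float, S: float,
                alpha: float = 7.0, b: float = -3.0) -> float:
    # R = (E + I + H + alpha * S) / 10 + b, with each judge score in [0, 4];
    # alpha = 7 and b = -3 are the values reported in the implementation details.
    return (E + I + H + alpha * S) / 10.0 + b

def backpropagate(node, R_sim: float) -> None:
    # R_sim is the per-turn reward averaged over the simulated supporter turns.
    # Walk back to the root: Q_k <- (N_k * Q_k + R_sim) / (N_k + 1); N_k <- N_k + 1.
    while node is not None:
        node.Q = (node.N * node.Q + R_sim) / (node.N + 1)
        node.N += 1
        node = node.parent
```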
4.2.2. ESC-Pro Preference Data Construction
After the MCTS process is complete, the ESC-Pro dataset is constructed from the generated conversation tree. This involves identifying high-quality strategy-response paths and pairing them with lower-scoring alternatives to create fine-grained preference data.
- Conversation Decomposition and Expansion: A gold-standard ESC conversation is first decomposed into a conversation tree, which MCTS then iteratively expands to explore alternative strategy paths. The process stops when a predefined number of iterations is reached or a sufficient number of termination nodes has been identified.
- Preference Data Extraction: Valid conversation paths are extracted from the MCTS tree. A path is valid if:
  - its last node is a termination node (the conversation reaches a natural end), and
  - every node on the path meets a predefined quality threshold on its Q-value.
  Once valid paths are identified, preference pairs are extracted: for each high-scoring node $S_w$ on a valid path (with $Q(S_w) > \theta$), its low-scoring sibling nodes $S_l$ (with $Q(S_l) < \theta$) are collected. Each pair encodes relative strategy quality. The resulting dataset is:
  $\mathcal{D} = \bigcup_{P \in \mathcal{P}} \left\{ (S_w, S_l) \;\middle|\; S_w \in P,\; S_l \in \mathrm{Siblings}(S_w),\; Q(S_w) > \theta,\; Q(S_l) < \theta \right\}$
  - $\mathcal{P}$: the set of all valid paths extracted from the MCTS tree.
  - $P$: a single valid path in $\mathcal{P}$.
  - $S_w$: a preferred (winning) node in path $P$, whose Q-value exceeds the threshold $\theta$.
  - $S_l$: a non-preferred (losing) node that is a sibling of $S_w$, whose Q-value is below $\theta$.
  - $\mathrm{Siblings}(S_w)$: the set of sibling nodes of $S_w$, i.e., nodes at the same dialogue turn with different strategies.
  This dataset contains all preference pairs and provides a rich training signal for LLMs to learn nuanced strategy trade-offs and adaptive decision-making.
4.2.3. Stage 2: Chain-of-Strategy Optimization (CSO)
The second stage performs turn-level preference optimization with DPO (Rafailov et al., 2023) on the constructed ESC-Pro dataset $\mathcal{D}$.
For the $i$-th conversation round (dialogue turn), the training objective is formulated as:
$
\mathcal{L}_i(\pi_\theta; \pi_{\mathrm{ref}}) = -\log \sigma \left( \beta \log r_w - \beta \log r_l \right)
$
where:
- $\mathcal{L}_i$: the loss for the $i$-th conversation round.
- $\pi_\theta$: the policy of the LLM being trained (with parameters $\theta$).
- $\pi_{\mathrm{ref}}$: the reference policy, typically the SFT-trained (or initial) version of the LLM, used to stabilize training and prevent large deviations from the original language-generation capabilities.
- $\sigma$: the sigmoid function, which squashes its input into the range (0, 1).
- $\beta$: a hyperparameter controlling the strength of preference alignment; a larger $\beta$ enforces stronger alignment.
- $r_w$: the ratio of the probability of generating the preferred response under the current policy to that under the reference policy:
  $r_w = \frac{\pi_\theta(S_w^i \mid x, H^{i-1})}{\pi_{\mathrm{ref}}(S_w^i \mid x, H^{i-1})}$
  - $S_w^i$: the preferred strategy-response pair at the $i$-th turn.
  - $x$: the initial prompt or context of the conversation.
  - $H^{i-1}$: the conversation history up to the $(i-1)$-th turn, i.e., the chain of previously preferred turns; this conditioning is what makes CSO an optimization over a chain of preferred strategies.
- $r_l$: the corresponding ratio for the non-preferred response:
  $r_l = \frac{\pi_\theta(S_l^i \mid x, H^{i-1})}{\pi_{\mathrm{ref}}(S_l^i \mid x, H^{i-1})}$
  - $S_l^i$: the non-preferred strategy-response pair at the $i$-th turn.

This loss increases the likelihood of generating preferred responses $S_w^i$ while decreasing the likelihood of non-preferred responses $S_l^i$, in a turn-level and context-aware manner (through the conditioning on $H^{i-1}$).
The overall training objective across the entire ESC-Pro dataset is the expected value of this turn-level loss:
$
\mathcal{L}_{\mathrm{CSO}}(\pi_\theta; \pi_{\mathrm{ref}}) = \mathbb{E}_{(x, S_w^i, S_l^i, H^{i-1}) \sim \mathcal{D}} \left[ \mathcal{L}_i(\pi_\theta; \pi_{\mathrm{ref}}) \right]
$
- $\mathcal{L}_{\mathrm{CSO}}$: the overall Chain-of-Strategy Optimization loss.
- $\mathbb{E}_{(\cdot) \sim \mathcal{D}}$: the expectation over samples drawn from the ESC-Pro dataset $\mathcal{D}$.

This optimization enables LLMs to learn nuanced strategy trade-offs, selecting better strategies over weaker ones at each dialogue turn, exploring diverse conversation paths, and avoiding rigid preference patterns, thereby achieving local adaptability and long-range strategic coherence.
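A minimal PyTorch sketch of the turn-level objective is shown below. It assumes the summed token log-probabilities of each preferred and non-preferred strategy-response pair, conditioned on the prompt $x$ and the preferred history $H^{i-1}$, have already been computed under the policy and the frozen reference model; the $\beta$ value is an assumption, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def cso_turn_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
                  ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Turn-level DPO-style loss: -log sigma(beta * log r_w - beta * log r_l).

    Each input is the summed token log-probability of the preferred (w) or
    non-preferred (l) strategy-response pair given (x, H^{i-1}), under the
    trained policy or the frozen reference model. beta = 0.1 is an assumed value.
    """
    log_r_w = policy_logp_w - ref_logp_w   # log r_w
    log_r_l = policy_logp_l - ref_logp_l   # log r_l
    return -F.logsigmoid(beta * (log_r_w - log_r_l)).mean()

# Toy call with dummy log-probabilities for a batch of three turns:
loss = cso_turn_loss(torch.tensor([-12.3, -9.8, -15.1]),
                     torch.tensor([-14.0, -11.2, -15.9]),
                     torch.tensor([-12.9, -10.1, -15.3]),
                     torch.tensor([-13.1, -10.9, -15.0]))
```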
5. Experimental Setup
5.1. Datasets
- ExTES (Gold-Standard Source): The ExTES dataset (Zheng et al., 2024) serves as the initial gold-standard conversational benchmark for expansion.
  - Characteristics: Contains emotional support conversations annotated with support strategies; it originally defines 16 distinct support strategies.
  - Usage: The first 100 samples from the processed ExTES training set are selected as a high-quality starting point (seed dialogues) for MCTS-based expansion.
- ESC-Pro (Generated Preference Dataset): The dataset produced by applying MCTS to expand the 100 seed dialogues from ExTES.
  - Characteristics: ESC-Pro expands the 100 seed dialogues into 423 dialogues, and the total number of utterances grows from 1,613 (seed) to 14,383. Crucially, over half of these (8,157 utterances) are classified as non-preference data, showing that suboptimal strategies are generated alongside preferred ones.
    - Average dialogue length remains consistent (14.72 utterances for ESC-Pro vs. 16.13 for the original).
    - The average length of preference utterances (29.42) matches the seed data (29.03), while non-preference utterances (23.22) are shorter, indicating a meaningful distinction.
    - The original 16 ExTES strategies are merged into 8 categories to reduce the search space and improve computational efficiency, as detailed in Table 8.
  - Usage: ESC-Pro is the primary dataset used for training with Chain-of-Strategy Optimization (CSO).
- ESC-Pro-SFT (Fine-tuning Subset): A subset extracted from ESC-Pro specifically for Supervised Fine-Tuning (SFT).
  - Characteristics: It presumably contains only the preferred strategy-response pairs from ESC-Pro, mimicking the gold-standard format required by SFT.
  - Usage: Used to train the SFT baselines that are compared against CSO.
Example of Data Sample (from Table 16 - excerpt of a dialogue):
Conversation Background
- Scene: Breakups or Divorce
- Description: "After five years of marriage, my husband and I have decided to separate due to irreconcilable differences. I am feeling overwhelmed and uncertain as I navigate through this difficult time."
Conversation History (partial)
- Seeker: I'll try my best. Right now, it feels impossible to imagine moving on. Do you have any advice on how to start the healing process?
- Preferred Supporter Response: (Affirmation) While it may seem overwhelming now, know that healing and growth are possible. Focus on taking care of yourself both physically and emotionally. Engage in activities that bring you happiness and surround yourself with loved ones who uplift you. Consider seeking support from a therapist or counselor who can help guide you through this process. You are stronger than you realize, and I believe in your ability to navigate through this difficult time.
- Non-Preferred Supporter Response: (Collaborative Planning) Absolutely! Let's break it down together. What's one small step you think you could take this week to begin healing?

This example shows a seeker turn followed by a preferred supporter response (using the Affirmation strategy) and several non-preferred responses (using the Collaborative Planning, Avoid Judgment and Criticism, Reframe Negative Thoughts, and Share Information strategies). This explicit pairing of preferred and non-preferred options is what ESC-Pro provides for training.
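Such a turn could be serialized as a chosen/rejected record for preference training. The layout below is hypothetical, shown only to illustrate the shape of a turn-level preference sample; it is not the released ESC-Pro schema.

```python
# Hypothetical layout of one turn-level ESC-Pro preference record (field names are
# illustrative, not the released schema).
sample = {
    "scene": "Breakups or Divorce",
    "history": [
        {"role": "seeker",
         "text": "Do you have any advice on how to start the healing process?"},
    ],
    "chosen": {
        "strategy": "Affirmation",
        "response": "While it may seem overwhelming now, know that healing and growth are possible...",
    },
    "rejected": {
        "strategy": "Collaborative Planning",
        "response": "Absolutely! Let's break it down together. What's one small step...",
    },
}
```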
5.2. Evaluation Metrics
The evaluation uses four metrics: strategy selection accuracy (macro F1 and weighted F1), strategy preference bias (B), and ROUGE-L (R-L) for assessing response semantics.
- Strategy Selection Accuracy (macro F1 and weighted F1):
  - Conceptual Definition: The F1-score measures a model's accuracy by combining precision and recall. In ESC, it evaluates how accurately the LLM selects the correct support strategy for a given dialogue turn. Macro F1 computes the F1-score independently for each strategy class and averages them, treating all classes equally; it is useful for checking performance across all categories, especially under uneven class distributions. Weighted F1 averages the per-class F1-scores weighted by the number of true instances per class, which is more representative under class imbalance.
  - Mathematical Formula (general F1-score):
    $\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, where $\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$ and $\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$
  - Symbol Explanation:
    - $\mathrm{TP}$ (true positives): instances correctly identified as belonging to a specific class.
    - $\mathrm{FP}$ (false positives): instances incorrectly identified as belonging to a specific class (Type I error).
    - $\mathrm{FN}$ (false negatives): instances incorrectly identified as not belonging to a specific class (Type II error).
    - For macro F1, this computation is done per strategy class and the unweighted average is taken; for weighted F1, the per-class scores are averaged with weights equal to the number of actual instances of each class.
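In practice, these scores can be computed with scikit-learn's `f1_score` over predicted versus gold strategy labels; the labels below are dummy examples.

```python
from sklearn.metrics import f1_score

# Dummy gold vs. predicted strategy labels for four turns.
gold = ["Affirmation", "Emotional Validation", "Share Information", "Affirmation"]
pred = ["Affirmation", "Affirmation", "Share Information", "Affirmation"]

macro_f1 = f1_score(gold, pred, average="macro")        # every strategy class weighted equally
weighted_f1 = f1_score(gold, pred, average="weighted")  # classes weighted by their support
```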
- Strategy Preference Bias (B):
  - Conceptual Definition: This metric quantifies how much an LLM's strategy usage deviates from an ideal or desired strategy distribution. A high bias means the model rigidly favors certain strategies, limiting its adaptability to diverse user emotional needs; the goal is to minimize it.
  - Mathematical Formula: The paper states that B measures deviation from ideal strategy distributions but does not give an explicit formula in the main text or appendix. Since Kang et al. (2024) is cited for this metric, it likely compares the observed strategy distribution of the model's outputs against a reference "ideal" distribution (e.g., from human experts or a balanced dataset) using a statistical distance such as Jensen-Shannon or Kullback-Leibler divergence, or a simpler aggregate deviation from a uniform or target distribution. Without the specific formula from the cited work, a precise definition cannot be given here; conceptually, B captures how non-uniform or skewed the model's strategy usage is.
- ROUGE-L (R-L):
  - Conceptual Definition: ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) evaluates generated text by comparing it to a reference (gold-standard) response via their longest common subsequence (LCS). A longer LCS indicates more shared content and better semantic overlap; higher ROUGE-L scores are better.
  - Mathematical Formula:
    $\mathrm{ROUGE\text{-}L} = \frac{(1 + \beta^2) \, R_{\mathrm{LCS}} \, P_{\mathrm{LCS}}}{R_{\mathrm{LCS}} + \beta^2 P_{\mathrm{LCS}}}$, where $R_{\mathrm{LCS}} = \frac{\mathrm{Length}(\mathrm{LCS}(X, Y))}{\mathrm{Length}(X)}$ and $P_{\mathrm{LCS}} = \frac{\mathrm{Length}(\mathrm{LCS}(X, Y))}{\mathrm{Length}(Y)}$
  - Symbol Explanation:
    - $X$: the reference (gold-standard) response.
    - $Y$: the response generated by the LLM.
    - $\mathrm{LCS}(X, Y)$: the longest common subsequence of $X$ and $Y$; a subsequence is derived from a sequence by deleting zero or more elements without reordering the rest.
    - $\mathrm{Length}(\cdot)$: the number of tokens (words) in a sequence.
    - $R_{\mathrm{LCS}}$: LCS-based recall, i.e., how much of the reference is covered by the LCS.
    - $P_{\mathrm{LCS}}$: LCS-based precision, i.e., how much of the generated text lies in the LCS.
    - $\beta$: a parameter weighting recall against precision; it is typically set above 1 so that recall dominates ($\beta = 1$ gives equal weight).
5.3. Baselines
The CSO method is compared against several baselines to demonstrate its efficacy:
- Supervised Fine-Tuning (SFT):
  - Description: The standard method of fine-tuning LLMs on gold-standard data. The paper evaluates both LoRA-based SFT (parameter-efficient fine-tuning via Low-Rank Adaptation) and full-parameter SFT (updating all model parameters).
  - Representativeness: This is the most common and direct approach for adapting LLMs to specific tasks, including ESC, and serves as a strong benchmark for isolating the impact of preference optimization versus direct supervision.
- Decoding Strategies:
  - Direct-Refine: A simple self-optimization method in which the model directly refines its initial output, without external input or complex feedback loops.
  - Self-Refine: Based on Madaan et al. (2024), a two-step process in which the model first reflects on its own output and generates feedback, then uses that feedback to produce an optimized response, promoting self-correction.
  - w/ Example (In-context Example Prompting): A randomly selected gold-standard example from the ExTES training set is included in the prompt as an in-context learning signal to guide generation.
  - Representativeness: These methods represent common ways to improve LLM output quality without fine-tuning on preference data, relying instead on prompting or internal refinement.
- Proprietary Models (Closed-Source):
  - Description: The paper benchmarks against leading commercial LLMs, including GPT-4o, Gemini-2.0-Pro, and Claude-3.5-Sonnet/Claude-3.7-Sonnet.
  - Representativeness: These models represent the current state of the art and provide a strong real-world upper bound, showing whether CSO on smaller open-source models can achieve competitive results.
5.4. Implementation Details
- Backbones: Experiments are conducted on three LLM backbones:
  - LLaMA-3.1-8B-Instruct (Dubey et al., 2024)
  - Qwen-2.5-7B-Instruct (Yang et al., 2024)
  - Gemma-2-9B-it (Team et al., 2024)
- Software/Hardware: All experiments are implemented in PyTorch (Paszke et al., 2019) on 8 NVIDIA Tesla A100 GPUs, using the DeepSpeed repository (Rasley et al., 2020) with ZeRO-2 optimization.
- Training Parameters:
  - Maximum target length for generated responses: 512 tokens across all backbones.
  - LoRA fine-tuning: alpha = 8, dropout rate = 0, with all modules targeted (a configuration sketch follows after this list).
  - The official chat templates of the respective backbones are used.
- MCTS Hyperparameters (for ESC-Pro construction):
  - c (PUCB exploration constant): 1, balancing exploration and exploitation.
  - α (reward scaling for the Strategy score): 7, emphasizing strategy effectiveness.
  - b (reward bias): -3, which makes raw rewards below $-b = 3$ negative and guides the search towards higher-scoring nodes.
  - Score threshold for valid paths in data extraction: 0.5, ensuring that the Q-values of preferred data are at least 3.5 and maintaining quality.
- Strategy Mapping: The original 16 ExTES strategies are mapped to 8 broader categories (Table 8) to simplify the search space.
- Preference Optimization Algorithms: In addition to DPO, the alternatives SimPO, IPO, KTO, and ORPO are also evaluated, with algorithm-specific hyperparameters tuned (e.g., β values, the gamma margin for SimPO, and the chosen/rejected weights for KTO), as detailed in Tables 14 and 15.
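The reported LoRA settings could be expressed with Hugging Face `peft`'s `LoraConfig` roughly as below; the rank $r$ is not given in this summary, so the value used here is an assumption.

```python
from peft import LoraConfig

# LoRA settings reported above: alpha = 8, dropout = 0, all modules targeted.
# The rank r is not reported in this summary, so r = 8 below is an assumption,
# and "all-linear" approximates "all modules" with peft's shortcut.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.0,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```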
5.5. Human Evaluation Details
- Evaluators: Three anonymous crowd workers with prior annotation experience.
- Training: Evaluators received targeted training to understand metric criteria.
- Compensation: Paid based on annotated samples.
- Inter-rater Agreement: Measured with Cohen's Kappa (κ) scores, reported in Tables 1 and 3.
- Evaluation Tasks:
  - Pairwise Comparison (Table 1): 100 preference pairs from ESC-Pro are compared, judging whether the preferred response is superior, inferior, or equal to the non-preferred response on the Empathy, Information, Humanoid, and Strategy metrics (from the Reward LLM).
  - Quality Assessment (Table 6): 100 responses from the ExTES seed, 50 preference responses, and 50 non-preference responses from ESC-Pro are rated on five metrics: Acceptance (Acc), Effectiveness (Eff), Sensitivity (Sen), Satisfaction (Sat), and Alignment (Align).
  - CSO vs. SFT Comparison (Table 3): Human evaluators compare CSO-generated responses against SFT-generated responses on Acceptance, Effectiveness, Sensitivity, and Satisfaction, indicating whether the CSO response wins, loses, or ties.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Automatic Evaluation Results
The following are the results from Table 2 of the original paper:
| Method | Q↑ | B↓ | Qw↑ | R-L↑ |
| GPT-4o-mini | 35.68 | 2.65 | 42.08 | 24.91 |
| GPT-4o | 27.98 | 2.65 | 39.55 | 24.26 |
| Gemini-2.0-Pro | 27.00 | 1.90 | 46.59 | 22.71 |
| Claude-3.5-Sonnet | 20.97 | 1.55 | 41.00 | 19.82 |
| Claude-3.7-Sonnet | 31.50 | 1.18 | 48.13 | 21.01 |
| LLaMA-3.1-8B-Instruct | 29.79 | 1.18 | 38.78 | 23.48 |
| Direct-Refine | 16.08 | 1.60 | 18.56 | 19.53 |
| Self-Refine | 17.85 | 1.35 | 24.72 | 19.48 |
| w/ Example | 8.85 | 1.27 | 15.34 | 18.42 |
| SFT (Full) | 30.28 | 2.65 | 37.33 | 23.77 |
| CSO (Full) | 33.11 | 1.11 | 39.21 | 24.24 |
| SFT (LoRA) | 31.25 | 2.65 | 39.27 | 23.30 |
| CSO (LoRA) | 34.51 | 1.11 | 41.11 | 23.89 |
| Qwen-2.5-7B-Instruct | 19.84 | 2.47 | 28.12 | 23.52 |
| Direct-Refine | 12.70 | 1.20 | 24.89 | 22.91 |
| Self-Refine | 11.77 | 1.75 | 19.59 | 20.53 |
| w/ Example | 17.33 | 1.37 | 28.21 | 22.51 |
| SFT (Full) | 21.73 | 2.34 | 31.24 | 23.54 |
| CSO (Full) | 28.78 | 1.92 | 34.39 | 26.16 |
| SFT (LoRA) | 21.54 | 2.45 | 29.11 | 23.72 |
| CSO (LoRA) | 23.16 | 2.09 | 32.26 | 24.17 |
| Gemma-2-9B-it | 31.31 | 1.33 | 44.06 | 25.64 |
| Direct-Refine | 7.79 | 2.55 | 12.86 | 21.67 |
| Self-Refine | 15.95 | 2.47 | 22.93 | 20.63 |
| w/ Example | 20.12 | 2.65 | 13.41 | 19.64 |
| SFT (Full) | 32.52 | 1.29 | 46.45 | 25.25 |
| CSO (Full) | 35.61 | 1.54 | 47.95 | 26.63 |
| SFT (LoRA) | 31.40 | 1.55 | 43.90 | 25.68 |
| CSO (LoRA) | 35.77 | 1.23 | 52.34 | 26.61 |
- CSO Improves Strategy Accuracy while Reducing Bias: Across all three backbones (LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, Gemma-2-9B-it), CSO consistently outperforms SFT. For example, on LLaMA-3.1-8B, CSO (LoRA) achieves Q = 34.51 and B = 1.11, clearly better than SFT (LoRA) at Q = 31.25 and B = 2.65. CSO thus selects more accurate strategies while reducing preference bias, making the models more adaptable. In contrast, SFT often slightly improves strategy accuracy (Q) but simultaneously increases strategy bias (B), limiting flexibility; this highlights the crucial role of preference optimization in ESC.
- Limitations of Decoding-based Approaches: The Direct-Refine, Self-Refine, and w/ Example decoding strategies generally underperform, often yielding lower strategy accuracy and higher bias. For instance, LLaMA-3.1-8B with Self-Refine and w/ Example reaches only Q = 17.85 and 8.85, with B = 1.35 and 1.27 respectively, indicating that such methods are inadequate for the nuanced demands of ESC.
- CSO-equipped Open-weight Models Outperform Closed-source Models: Notably, CSO applied to open-weight models (e.g., LLaMA-3.1-8B with CSO (LoRA): Q = 34.51, B = 1.11) frequently surpasses several proprietary models, such as GPT-4o (Q = 27.98, B = 2.65), Gemini-2.0-Pro (Q = 27.00, B = 1.90), and Claude-3.5-Sonnet (Q = 20.97, B = 1.55), in strategy accuracy and bias, showcasing CSO's strong generalization and effectiveness.
- Consistency across LoRA and Full Fine-tuning: CSO maintains its advantage over SFT in both LoRA-based and full-parameter fine-tuning settings. Its effectiveness even with LoRA (far fewer trainable parameters) makes it practical for resource-constrained environments.
- Scalability to Larger Models: Consistent improvements are also observed when CSO is applied to Qwen2.5-32B (Appendix G.2), confirming its scalability.
6.1.2. Human Evaluation Results
The following are the results from Table 3 of the original paper:
| CSO vs. SFT | win | lose | tie | κ |
| Acceptance | 68.00 | 20.33 | 11.67 | 0.65 |
| Effectiveness | 58.33 | 16.00 | 25.67 | 0.55 |
| Sensitivity | 60.67 | 21.67 | 17.66 | 0.61 |
| Satisfaction | 62.34 | 19.33 | 18.33 | 0.64 |
- Human evaluators consistently prefer CSO-generated responses over SFT-generated responses across all four metrics: Acceptance, Effectiveness, Sensitivity, and Satisfaction.
- CSO achieves high win rates, notably 68.00% for Acceptance and 62.34% for Satisfaction, indicating that its responses are perceived as more appropriate, impactful, and generally better for the user.
- The Kappa (κ) scores, ranging from 0.55 to 0.65, indicate moderate to substantial inter-rater agreement, lending credibility to the human evaluation.
- These results validate that preference optimization via CSO substantially enhances the qualitative aspects of ESC, making responses more empathetic and better aligned with user needs from a human perspective.
6.1.3. Dataset Quality Evaluation
The following are the results from Table 1 of the original paper:
| ESC-Pro(+) vs. ESC-Pro(-) | Win | Lose | Tie | κ |
| Empathy | 46.33 | 32.67 | 21.00 | 0.61 |
| Information | 42.34 | 27.33 | 30.33 | 0.55 |
| Humanoid | 41.67 | 21.33 | 37.00 | 0.49 |
| Strategy | 60.67 | 15.00 | 24.33 | 0.67 |
The following are the results from Table 6 of the original paper:
| | ExTES | ESC-Pro(+) | ESC-Pro(-) | ESC-Pro | κ |
| Acc | 3.78 | 3.91 | 3.13 | 3.52 | 0.41 |
| Eff | 3.31 | 3.45 | 3.19 | 3.27 | 0.51 |
| Sen | 3.86 | 3.98 | 3.51 | 3.75 | 0.44 |
| Sat | 3.65 | 3.78 | 3.28 | 3.52 | 0.45 |
| Align | 3.53 | 3.60 | 3.47 | 3.54 | 0.52 |
- ESC-Pro Quality (Table 1): In the pairwise comparison between ESC-Pro(+) (preference samples) and ESC-Pro(-) (non-preference samples), ESC-Pro(+) consistently wins across all metrics, especially Strategy (60.67% win vs. 15.00% lose), confirming that the MCTS-generated preference data successfully separates high-quality from low-quality strategy-response pairs. Kappa scores of 0.49-0.67 indicate fair to good agreement.
- Human Evaluation of Data Quality (Table 6): ESC-Pro(+) (preferred data) scores higher than ESC-Pro(-) (non-preferred data) on all human evaluation metrics (Acceptance, Effectiveness, Sensitivity, Satisfaction, Alignment). Importantly, ESC-Pro(+) also slightly exceeds the original ExTES seed data in most categories (e.g., Acc: 3.91 vs. 3.78), indicating that the MCTS expansion not only produces useful non-preference data but also improves the quality of the preferred data. The lower Alignment score for ESC-Pro(-) confirms that these samples indeed contain less effective or misaligned strategies.
6.1.4. Strategy Analysis
- Dynamic Strategy Distribution: Figure 3 (strategy distribution across conversation stages in ESC-Pro) shows that ESC-Pro captures a dynamic and contextually appropriate strategy distribution. For example, Emotional Validation is more prominent in early stages to acknowledge feelings, while Affirmation and Avoid Judgment and Criticism increase towards the end to offer encouragement.
- Diversity and Balance: A comparison with the seed dataset (Figure 9 in Appendix C.3) shows that ESC-Pro employs a more diverse and balanced strategy distribution, indicating that the MCTS expansion effectively models varied dialogue strategies and enriches the dataset beyond the original ExTES content.
6.1.5. Toxicity Analysis
The following are the results from Table 7 of the original paper:
| | TOXICITY | SEVERE_TOXICITY | IDENTITY_ATTACK | INSULT | PROFANITY | THREAT |
| ExTES(seeker) | 0.0281 | 0.0012 | 0.0041 | 0.0108 | 0.0171 | 0.0088 |
| ExTES(supporter) | 0.0173 | 0.0008 | 0.0027 | 0.0093 | 0.0124 | 0.0070 |
| ExTES | 0.0227 | 0.0010 | 0.0034 | 0.0100 | 0.0148 | 0.0079 |
| ESC - Pro(seeker) | 0.0290 | 0.0013 | 0.0042 | 0.0110 | 0.0178 | 0.0097 |
| ESC - Pro(+) | 0.0192 | 0.0008 | 0.0027 | 0.0097 | 0.0130 | 0.0074 |
| ESC - Pro(-) | 0.0223 | 0.0009 | 0.0027 | 0.0107 | 0.0134 | 0.0074 |
| ESC - Pro | 0.0231 | 0.0010 | 0.0030 | 0.0105 | 0.0143 | 0.0079 |
- The toxicity analysis with the Perspective API shows that ESC-Pro maintains a toxicity profile similar to the ExTES seed. Preferred responses (ESC-Pro(+)) exhibit slightly lower toxicity than non-preferred responses (ESC-Pro(-)), suggesting that high-quality strategy selection naturally leads to less harmful outputs. Seeker utterances show higher toxicity overall, which is expected in emotional support dialogues where users express distress, while Supporter responses in ESC-Pro remain within a reasonable range. The preference-based expansion therefore introduces no significant toxicity risks, and the dataset is suitable for safe and supportive ESC generation.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 4 of the original paper:
| | Q↑ | B↓ | Qw↑ | R-L↑ |
| LLaMA-3.1-8B-Instruct | 29.79 | 1.18 | 38.77 | 23.48 |
| SFT | 31.25 | 2.65 | 39.27 | 23.30 |
| CSO | 34.51 | 1.11 | 41.11 | 23.89 |
| CSO - Random | 31.79 | 2.65 | 39.24 | 23.65 |
| Qwen-2.5-7B-Instruct | 19.83 | 2.47 | 28.12 | 23.52 |
| SFT | 21.54 | 2.45 | 29.11 | 23.72 |
| CSO | 23.16 | 2.09 | 32.26 | 24.17 |
| CSO - Random | 22.89 | 2.19 | 32.97 | 23.90 |
| Gemma-2-9b-it | 31.31 | 1.33 | 44.06 | 25.64 |
| SFT | 31.40 | 1.55 | 43.90 | 25.68 |
| CSO | 35.77 | 1.23 | 52.34 | 26.61 |
| CSO - Random | 29.86 | 1.22 | 44.58 | 25.75 |
- The ablation introduces ESC-Pro-Random, a variant in which non-preferred strategies are sampled at random rather than identified as low-scoring nodes by MCTS, to assess the importance of structured non-preference data.
- As Table 4 shows, CSO-Random performs only slightly better than SFT and is consistently inferior to full CSO across all backbones and metrics (e.g., on LLaMA-3.1-8B-Instruct: CSO-Random Q = 31.79, B = 2.65 vs. CSO Q = 34.51, B = 1.11).
- This confirms that MCTS's ability to search for and verify low-scoring nodes as genuinely non-preferred data is crucial: the quality of the non-preferred samples directly determines the effectiveness of preference optimization, reinforcing the importance of structured preference learning in ESC.
6.3. Data Volume Analysis

The figure is a chart showing the impact of ESC-Pro data volume on the performance of three language models: the x-axis is the data scale from 0 to the full 423 dialogues, and the y-axes show the F1/ROUGE-L scores and the preference-bias metric, illustrating how increasing the data volume improves strategy selection accuracy and preference bias.
Figure 4: Impact of ESC-Pro data volume on model performance across three backbones. Data size varies with 0 (baseline), 100, 200, 300, 400, and the full set (423 dialogues).
- Figure 4 shows the impact of varying the number of ESC-Pro dialogues used for fine-tuning (0, 100, 200, 300, 400, and the full 423 dialogues); SFT performance is indicated by dashed lines.
- Upward Trend: Performance generally improves as the data volume increases, demonstrating the value of additional preference data.
- Efficiency of Preference Data: CSO starts to outperform standard SFT with as few as 200-300 dialogues. For example, on LLaMA-3.1-8B-Instruct, CSO with 200 dialogues already surpasses the SFT dashed line on the Q and Qw scores and significantly reduces B, highlighting the efficiency and scalability of preference data for ESC.
- These results suggest that even a moderately sized ESC-Pro dataset yields substantial improvements over SFT, making CSO promising even when data generation is computationally constrained.
6.4. Analysis of Different Preference Optimization Algorithms
The following are the results from Table 9 of the original paper:
| | | Q↑ | B↓ | Qw↑ | R-L↑ |
| LLaMA-3.1-8B-Instruct | 29.79 | 1.18 | 38.78 | 23.48 | |
| Full | SFT | 30.28 | 2.65 | 37.33 | 23.77 |
| CSO-DPO | 33.11 | 1.11 | 39.21 | 24.24 | |
| CSO-SimPO | 29.12 | 1.53 | 36.27 | 23.59 | |
| CSO-IPO | 35.48 | 1.04 | 37.74 | 24.19 | |
| CSO-KTO | 32.60 | 0.88 | 36.63 | 25.84 | |
| CSO-ORPO | 30.46 | 1.14 | 33.24 | 26.06 | |
| LoRA | SFT | 31.25 | 2.65 | 39.27 | 23.30 |
| CSO-DPO | 34.51 | 1.11 | 41.11 | 23.89 | |
| CSO-SimPO | 33.43 | 1.04 | 40.55 | 23.41 | |
| CSO-IPO | 33.00 | 1.13 | 39.40 | 23.55 | |
| CSO-KTO | 32.80 | 1.04 | 38.15 | 23.70 | |
| CSO-ORPO | 31.50 | 1.17 | 39.04 | 23.71 | |
The following are the results from Table 10 of the original paper:
| | | Q↑ | B↓ | Qw↑ | R-L↑ |
| Qwen-2.5-7B-Instruct | 19.84 | 2.47 | 28.12 | 23.52 | |
| Full | SFT | 21.73 | 2.34 | 31.24 | 23.54 |
| CSO-DPO | 28.78 | 1.92 | 34.39 | 26.16 | |
| CSO-SimPO | 23.51 | 2.01 | 34.84 | 21.54 | |
| CSO-IPO | 25.10 | 2.15 | 36.54 | 25.16 | |
| CSO-KTO | 25.56 | 1.56 | 38.44 | 26.65 | |
| CSO-ORPO | 22.20 | 1.24 | 35.91 | 24.58 | |
| LoRA | SFT | 21.54 | 2.45 | 29.11 | 23.72 |
| CSO-DPO | 23.16 | 2.09 | 32.26 | 24.17 | |
| CSO-SimPO | 25.91 | 2.02 | 30.45 | 23.32 | |
| CSO-IPO | 22.71 | 2.21 | 28.75 | 23.49 | |
| CSO-KTO | 22.91 | 2.17 | 30.82 | 23.61 | |
| CSO-ORPO | 22.49 | 2.12 | 28.42 | 23.35 | |
The following are the results from Table 11 of the original paper:
| | | Q↑ | B↓ | Qw↑ | R-L↑ |
| Gemma-2-9b-it | 31.31 | 1.33 | 44.06 | 25.64 | |
| Full | SFT | 32.52 | 1.29 | 46.45 | 25.25 |
| CSO-DPO | 35.61 | 1.54 | 47.95 | 26.63 | |
| CSO-SimPO | 26.67 | 2.03 | 48.03 | 25.60 | |
| CSO-IPO | 32.02 | 1.70 | 45.29 | 25.81 | |
| CSO-KTO | 39.73 | 0.81 | 48.87 | 27.84 | |
| CSO-ORPO | 34.80 | 1.24 | 48.28 | 27.52 | |
| LoRA | SFT | 31.40 | 1.55 | 43.90 | 25.68 |
| CSO-DPO | 35.77 | 1.23 | 52.34 | 26.61 | |
| CSO-SimPO | 34.95 | 1.19 | 51.62 | 26.22 | |
| CSO-IPO | 34.16 | 1.29 | 51.62 | 26.20 | |
| CSO-KTO | 35.89 | 0.99 | 48.53 | 26.45 | |
| CSO-ORPO | 32.35 | 1.30 | 48.14 | 25.85 | |
- The evaluation of alternative preference optimization algorithms (SimPO, IPO, KTO, ORPO) on ESC-Pro consistently shows that all variants outperform standard SFT. This reinforces the fundamental advantage of preference-driven learning in ESC (a minimal loss sketch follows this list).
- While DPO is the default CSO implementation, other algorithms like CSO-IPO and CSO-KTO can achieve even higher performance on certain metrics or backbones. For instance, CSO-IPO (Full) on LLaMA-3.1-8B-Instruct achieves a Q of 35.48 and a B of 1.04, surpassing CSO-DPO's Q of 33.11. Similarly, CSO-KTO (Full) on Gemma-2-9B-it achieves an impressive Q of 39.73 and a very low B of 0.81.
- This analysis validates the versatility of the ESC-Pro dataset itself as a high-quality resource for various preference optimization techniques, suggesting that CSO's benefits are rooted in the turn-level preference data rather than being specific to DPO alone.
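For reference, the sketch below writes out two of the losses compared above, DPO and IPO, over per-sequence log-probabilities. The tensor shapes, beta/tau values, and dummy numbers are assumptions; the paper's default CSO implementation corresponds to the DPO form.

```python
# Minimal sketch of the DPO and IPO objectives over per-sequence log-probs.
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # h = policy log-ratio of chosen vs. rejected, measured against the frozen reference
    h = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * h).mean()


def ipo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, tau=0.1):
    # IPO replaces the sigmoid with a squared margin of 1/(2*tau)
    h = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()


# Dummy per-example sequence log-probs for a batch of two preference pairs.
lp_c, lp_r = torch.tensor([-12.3, -10.1]), torch.tensor([-14.0, -11.9])
ref_c, ref_r = torch.tensor([-12.5, -10.4]), torch.tensor([-13.2, -11.0])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r), ipo_loss(lp_c, lp_r, ref_c, ref_r))
```

Because all of these objectives consume the same chosen/rejected pairs, swapping the loss leaves the ESC-Pro data untouched, which is consistent with the observation that the gains stem from the turn-level preference data rather than from DPO specifically.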
6.5. Results of Large-Scale Backbone
The following are the results from Table 12 of the original paper:
| Method | Q ↑ | B ↓ | Qw ↑ | R-L ↑ |
| Qwen2.5-32B-Instruct | 37.37 | 1.40 | 41.97 | 24.37 |
| Direct-Refine | 34.87 | 1.68 | 40.69 | 23.14 |
| Self-Refine | 14.63 | 0.98 | 26.97 | 21.74 |
| w/ Example | 20.66 | 2.65 | 24.40 | 21.93 |
| SFT-LoRA | 37.69 | 1.60 | 42.09 | 24.40 |
| CSO-LoRA | 38.53 | 1.29 | 43.95 | 24.65 |
- Experiments on a larger model, Qwen2.5-32B-Instruct, using LoRA-based preference optimization, further confirm CSO's effectiveness (a configuration sketch follows this list).
- CSO-LoRA consistently outperforms all baselines across all four metrics. For example, CSO-LoRA achieves a Q of 38.53 and a B of 1.29, compared to SFT-LoRA's Q of 37.69 and B of 1.60.
- This demonstrates that CSO remains effective even at larger scales, showcasing its strong scalability and robustness in enhancing both adaptability and emotional intelligence in LLM-based emotional support.
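For illustration, the following sketch shows the kind of LoRA adapter one might attach to Qwen2.5-32B-Instruct before preference optimization, using Hugging Face PEFT. The rank, alpha, dropout, and target modules are assumed values, not hyperparameters reported in the paper.

```python
# Minimal sketch (assumed hyperparameters): attaching a LoRA adapter to a
# large backbone so that only the adapter weights are updated during
# preference optimization. Loading a 32B model requires substantial memory.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
lora_cfg = LoraConfig(
    r=16,                    # low-rank dimension (assumed)
    lora_alpha=32,           # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # confirms only the adapter is trainable
```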
6.6. Case Study
The following are the results from Table 16 of the original paper:
| Conversation Background | |
| Scene | Breakups or Divorce |
| Description | After five years of marriage, my husband and I have decided to separate due to irreconcilable differences. I am feeling overwhelmed and uncertain as I navigate through this difficult time. |
| Conversation History | |
| Seeker | Hey, I really need someone to talk to right now. Can you lend me an ear? |
| Supporter | (Emotional Validation) Of course! I'm here for you. What's been going on? |
| Seeker | My husband and I have decided to separate. It's been really hard to accept this reality. |
| Supporter | (Emotional Validation) I'm so sorry to hear that. Separating from a long-term partner can be incredibly challenging. It's normal to feel overwhelmed and unsure about what lies ahead. Can you share more about your feelings? |
| Seeker | I feel like everything fell apart so quickly. We had dreams about raising a family together and now it's all shattered. |
| Supporter | (Emotional Validation) It's really tough when our dreams and plans change unexpectedly. You're experiencing a whirlwind of emotions right now. It's completely understandable that you're feeling devastated and lost. Know that you're not alone in going through this difficult process. |
| Seeker | Thank you. I can't help but blame myself for the breakdown of our marriage. I keep wondering if I did something wrong. |
| Supporter | (Emotional Validation) It's natural to feel a sense of self-blame during such a challenging time. However, I want to emphasize that relationships involve the contributions of both partners, and it's rarely a reflection of just one person's actions. You are worthy of love and happiness, regardless of the outcome of this marriage. |
| Seeker | I appreciate your kind words. It's just hard to let go of all the memories and years we spent together. I miss him so much. |
| Supporter | (Empathetic Statements) It's completely understandable that you're missing your husband. Remember that healing takes time and it's okay to grieve the loss of your marriage. It's important to be patient with yourself and allow yourself to process the many emotions that may arise. Lean on your support system and engage in self-care activities that bring you comfort and joy. |
| Seeker | I'll try my best. Right now, it feels impossible to imagine moving on. Do you have any advice on how to start the healing process? |
| Responses | |
| Preferred | (Affirmation) While it may seem overwhelming now, know that healing and growth are possible. Focus on taking care of yourself both physically and emotionally. Engage in activities that bring you happiness and surround yourself with loved ones who uplift you. Consider seeking support from a therapist or counselor who can help guide you through this process. You are stronger than you realize, and I believe in your ability to navigate through this difficult time. |
| Non-Preferred | (Collaborative Planning) Absolutely! Let's break it down together. What's one small step you think you could take this week to begin healing? |
| Non-Preferred | (Avoid Judgment and Criticism) Absolutely! Let's break it down together. What's one small step you think you could take this week to begin healing? |
| Non-Preferred | (Reframe Negative Thoughts) I get that moving on feels impossible right now. But think about it this way: healing is a journey, not a race. Each small step counts. |
| Non-Preferred | (Share Information) Starting the healing process can be tough, but small steps can help. Journaling your feelings, talking to friends, or exploring new hobbies can be beneficial. |
The following are the results from Table 17 of the original paper:
| Scene | Depression and Low Mood | |
| Description | I've been feeling emotionally exhausted and overwhelmed by everything happening in my life. It's hard to stay positive when it feels like things keep piling up and nothing changes. | |
| Role | Emotion / Strategy | Utterance |
| Seeker | Sadness | I've been feeling really overwhelmed lately. I don't know how to handle everything that's going on. |
| Supporter | Emotional Validation | It sounds like you're carrying a lot right now, and it's understand- able that you're feeling overwhelmed. It's okay to feel that way. |
| Seeker | Frustration | Yeah, it's just so much, and I feel like I can't get a grip on it. |
| Supporter | Empathetic Statements | It's tough when everything seems like it's piling up. I can hear how much you're struggling with it. |
| Seeker | Hopelessness | I'm trying to stay positive, but it's hard. |
| Supporter | Affirmation | It's really impressive that you're trying to stay positive despite everything. That shows real strength. |
| Seeker | Overwhelm | I just don't know where to start. |
| Supporter | Collaborative Planning | Let's take it step by step. What's one small thing you could focus on right now? |
| Seeker | Hope | Maybe I can tackle some smaller tasks to get the ball rolling. |
| Supporter | Provide Different Perspectives | That's a great perspective. Starting small can help you feel more in control and build momentum. |
| Seeker | Defeat | I feel like I've been stuck in this cycle, and nothing changes. |
| Supporter | Reframe Negative Thoughts | It's easy to feel stuck when it seems like things aren't improving. But every small step forward counts, even if it doesn't feel like it. |
| Seeker | Acceptance | You're right. Maybe I need to be kinder to myself through this. |
| Supporter | Avoid Judgment and Criticism | Absolutely. Being kind to yourself is key. You're doing your best, and that's enough. |
| Seeker | Hopeful | I think I can manage it now, one step at a time. |
| Supporter | Emotional Validation | That's a wonderful mindset. Taking it one step at a time will help you regain your balance. |
- Preference Dialogue Example (Table 16): This table provides a concrete example from ESC-Pro, showcasing a seeker's statement and multiple supporter responses classified as preferred or non-preferred. When the seeker says, "I'll try my best. Right now, it feels impossible to imagine moving on. Do you have any advice on how to start the healing process?", the preferred response uses an Affirmation strategy, focusing on possibilities ("healing and growth are possible") and practical self-care advice. In contrast, non-preferred responses include Collaborative Planning (which might be too direct at this stage), Avoid Judgment and Criticism (too generic), Reframe Negative Thoughts (potentially premature), or Share Information (less personalized). This clearly illustrates how ESC-Pro distinguishes between contextually appropriate and less effective strategies; a serialization sketch follows this list.
- Emotional Continuity Case Study (Table 17): This case study demonstrates CSO's ability to implicitly preserve emotional continuity over multiple turns by selecting context-appropriate strategies. The supporter (model trained with CSO) adapts its strategies to the seeker's changing emotional state. It starts with Emotional Validation and Empathetic Statements when the seeker expresses sadness and frustration. As the conversation progresses and the seeker moves from hopelessness to hope, the supporter shifts to Affirmation and Collaborative Planning. When the seeker shows acceptance and hope, the supporter responds with Avoid Judgment and Criticism and Emotional Validation. This smooth progression highlights that CSO doesn't just react to the current emotion but supports the overall emotional flow and trajectory of the conversation.
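To connect Table 16 back to training, the snippet below sketches how this single turn could be serialized into prompt/chosen/rejected records of the kind consumed by DPO-style trainers. The field names and the truncated response texts are illustrative assumptions.

```python
# Minimal sketch (assumed record format): one preferred response paired with
# each non-preferred strategy/response at the same dialogue turn.
history = (
    "Seeker: I'll try my best. Right now, it feels impossible to imagine "
    "moving on. Do you have any advice on how to start the healing process?"
)
chosen = "(Affirmation) While it may seem overwhelming now, know that healing and growth are possible. ..."
rejected = [
    "(Collaborative Planning) Absolutely! Let's break it down together. ...",
    "(Reframe Negative Thoughts) I get that moving on feels impossible right now. ...",
    "(Share Information) Starting the healing process can be tough, but small steps can help. ...",
]

# One training record per (chosen, rejected) combination at this turn.
records = [{"prompt": history, "chosen": chosen, "rejected": r} for r in rejected]
```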
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper effectively addresses the critical challenges of low strategy selection accuracy and preference bias in Emotional Support Conversations (ESC) generated by Large Language Models (LLMs). The authors introduce a novel two-stage framework, Chain-of-Strategy Optimization (CSO). The first stage leverages Monte Carlo Tree Search (MCTS) to systematically construct ESC-Pro, a high-quality, turn-level preference dataset that explicitly captures both preferred and non-preferred strategy-response pairs. The second stage applies preference optimization (specifically DPO) on ESC-Pro, enabling LLMs to learn nuanced strategy trade-offs at each dialogue turn. Extensive experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B consistently demonstrate that CSO significantly outperforms standard Supervised Fine-Tuning (SFT) and decoding-based methods in terms of strategy accuracy and bias mitigation. Human evaluations further confirm CSO's superiority in generating more acceptable, effective, and sensitive responses, validating the efficacy of fine-grained, turn-level preference modeling for enhancing ESC quality and adaptability.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Computational Resources: The experiments were conducted on mid-scale LLMs (7B-32B parameters). Larger models (e.g., 70B+) could offer deeper insights into CSO's scalability and potentially yield even better performance.
- ESC-Pro Dataset Size: The ESC-Pro dataset was generated from a seed set of 100 dialogues. While effective, a larger seed dataset or alternative expansion strategies (e.g., human-in-the-loop validation) could further enhance the diversity and quality of the preference data.

Future work suggestions include:

- Personalization: Greater emphasis on personalization (Liu et al., 2025; Qiu et al., 2025b,a) to tailor ESC responses to user-specific needs.
- Safety: More focus on safety (Zhao et al., 2025) to ensure CSO-driven dialogue systems are robust against potential risks and ethical concerns, especially in emotionally sensitive applications. This includes addressing potential biases inherent in LLMs.
7.3. Personal Insights & Critique
This paper presents a robust and well-thought-out approach to a critical problem in AI-powered emotional support. The use of MCTS to generate a preference dataset is particularly innovative. Instead of relying solely on scarce human annotations for preferred vs. non-preferred pairs, which is common in DPO-like setups, MCTS with LLM-based reward functions provides a scalable and systematic way to create such data. This is a significant contribution to the broader field of RLHF and preference learning, not just ESC.
The explicit focus on turn-level strategy optimization and bias mitigation is also highly relevant. Emotional support is inherently dynamic, requiring an LLM to adapt its approach as the seeker's emotional state and needs evolve. CSO's ability to model these nuances across turns moves LLMs beyond simple response generation to more sophisticated conversational strategy.
One area for further exploration, beyond the authors' mention, could be the robustness of the Reward LLM and Strategy LLM. The quality of the ESC-Pro dataset hinges entirely on these LLMs' ability to accurately score empathy, information, humanoid quality, and strategy effectiveness. While GPT-4o-mini is powerful, its inherent biases or limitations could propagate into the generated dataset. A detailed analysis or ablation study on the sensitivity of CSO's performance to the choice and fine-tuning of these LLMs would be valuable. For instance, what if a less capable Reward LLM was used? How would the dataset quality and subsequent CSO performance be affected?
Furthermore, while the paper demonstrates improved strategy accuracy and bias mitigation, the subjective nature of emotional support means that "optimal" strategies can vary greatly across individuals and cultures. Future work could explore incorporating user-specific preference models or cultural context into the MCTS reward function or the DPO objective to achieve even more personalized and universally acceptable emotional support.
The finding that CSO allows smaller open-weight models to outperform larger closed-source models in certain ESC metrics is very inspiring. It suggests that domain-specific preference learning, even with relatively smaller models, can be more impactful than simply scaling up LLM size, especially for specialized and sensitive applications like ESC. This opens up avenues for more accessible and efficient deployment of AI-powered emotional support tools.