Exploring ChatGPT's Capabilities, Stability, Potential and Risks in Conducting Psychological Counseling through Simulations in School Counseling
TL;DR Summary
The study explores ChatGPT-4's capabilities in simulated school counseling, showing high warmth (97.5%), empathy (94.2%), and moderate stability (ICC 0.62), highlighting the need for human oversight. Future research should involve real users and multiple model comparisons.
Abstract
This study provides an exploratory analysis of ChatGPT-4's quantitative performance indicators in simulated school-counseling settings. Conversational artificial intelligence (AI) has shown strong capabilities in providing low-cost and timely interventions for a wide range of people and increasing well-being. Therefore, this study examined ChatGPT's capabilities, including response stability in conducting psychological counseling and its potential for providing accessible psychological interventions, especially in school settings. We prompted ChatGPT-4 with 80 real-world college-student counseling questions. Replies were quantified with APA-informed NLP tools to measure warmth, empathy, and acceptance, and run-to-run stability was assessed via Fleiss' κ and ICC(2,1). ChatGPT-4 achieved high warmth (97.5%), empathy (94.2%), and positive acceptance (mean compound score = 0.93 ± 0.19), with moderate stability (ICC(2,1) = 0.62; κ = 0.59). Occasional randomness in responses highlights risk areas requiring human oversight. As an offline, single-model text simulation without clinical validation, these results remain exploratory. Future work should involve live users, compare multiple LLMs, and incorporate mixed-methods validation to assess real-world efficacy and safety. The findings suggest ChatGPT-4 could augment low-intensity mental-health support in educational settings, guiding the design of human-in-the-loop workflows, policy regulations, and product roadmaps. This is among the first exploratory studies to apply quantitative stability metrics and NLP-based emotion detection to ChatGPT-4 in a school-counseling context and to integrate a practitioner's perspective to inform future research, product development, and policy.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Exploring ChatGPT's Capabilities, Stability, Potential and Risks in Conducting Psychological Counseling through Simulations in School Counseling
1.2. Authors
- Yanzhuoao Yang: Affiliation not explicitly stated in the provided abstract or full text, but the contact email yn2413@columbia.edu suggests a Columbia University affiliation. The Practitioner Testimony section indicates the author is Founder & Product Lead of an AI mental-health startup and an independent researcher.
- Yixin Chu: Affiliation not explicitly stated.
1.3. Journal/Conference
The paper is published in Mental Health and Digital Technologies (author-accepted manuscript DOI: 10.1108/MHDT-02-2025-0013). Mental Health and Digital Technologies is a specialized journal, suggesting the paper targets an audience interested in the intersection of technology and mental health, including researchers, practitioners, and policymakers in digital health and AI ethics.
1.4. Publication Year
The Published at (UTC) timestamp is 2025-11-03T17:39:57.000Z, indicating a publication year of 2025. The Author Accepted Manuscript (AAM) also states Mental Health and Digital Technologies 2025.
1.5. Abstract
This exploratory study quantitatively analyzes ChatGPT-4's performance in simulated school-counseling settings. Leveraging conversational artificial intelligence (AI) for its potential to offer low-cost, timely mental health interventions, the research assesses ChatGPT's capabilities, including response stability, using 80 real-world college-student counseling questions. Responses were evaluated for warmth, empathy, and acceptance using American Psychological Association (APA)-informed Natural Language Processing (NLP) tools. Run-to-run stability was quantified using Fleiss' Kappa and Intraclass Correlation Coefficient (ICC(2,1)). ChatGPT-4 demonstrated high warmth (97.5%), empathy (94.2%), and positive acceptance (mean compound score = 0.93 ± 0.19), with moderate stability (ICC(2,1) = 0.62; Kappa = 0.59). The occasional randomness in responses highlights risks requiring human oversight. As an offline, single-model text simulation without clinical validation, the results are exploratory. Future work should involve live users, compare multiple Large Language Models (LLMs), and incorporate mixed-methods validation for real-world efficacy and safety. The findings suggest ChatGPT-4 could augment low-intensity mental-health support in educational settings, guiding the design of human-in-the-loop workflows, policy regulations, and product roadmaps. This study is among the first to apply quantitative stability metrics and NLP-based emotion detection to ChatGPT-4 in a school-counseling context and integrate a practitioner's perspective.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2511.01788 (arXiv preprint)
- PDF Link: https://arxiv.org/pdf/2511.01788v1.pdf
- Publication Status: The paper is available as a preprint on arXiv and has been accepted as an Author Accepted Manuscript (AAM) for publication in Mental Health and Digital Technologies in 2025.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a significant global challenge: the widespread lack of access to mental healthcare. This issue is particularly acute due to factors like a scarcity of trained providers, high costs, and societal stigma. College students face unique stressors, including academic pressure, social strain, and the transition to adulthood, making timely and effective mental health support crucial for this demographic.
The core problem the paper aims to solve is how to leverage emerging technologies, specifically conversational Artificial Intelligence (AI) like ChatGPT, to increase access to mental health care, particularly in underserved settings such as schools. Digital health tools have already shown efficacy and expanded access for mild-to-moderate conditions, and the advent of sophisticated conversational AIs like ChatGPT presents a new paradigm for scalable mental health support.
The paper's entry point is an exploratory analysis of ChatGPT-4's performance in a simulated school-counseling environment. It seeks to characterize the capabilities, limitations, and real-world risks of AI chatbots in providing mental health care, moving beyond general feasibility discussions to quantify specific therapeutic qualities and response stability. The tragic case of a teenager's suicide allegedly precipitated by an unhealthy dependency on a chatbot (Character AI lawsuit) underscores the urgent need for rigorous evaluation and safeguards for AI-driven relational products.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Quantitative Performance Evaluation: It provides an exploratory quantitative analysis of ChatGPT-4's performance in simulated school-counseling scenarios, specifically measuring warmth, empathy, and acceptance using American Psychological Association (APA)-informed Natural Language Processing (NLP) tools.
- Stability Metrics Application: It is among the first studies to apply quantitative stability metrics (Fleiss' Kappa and ICC(2,1)) to assess the run-to-run consistency of ChatGPT-4's responses in a counseling context, highlighting the practical reliability of the model.
- Practitioner's Perspective Integration: It integrates a practitioner's perspective to interpret the technical findings, informing future research, product development, and policy regulations for AI in mental health.
- Risk Identification: It identifies and quantifies risk areas, such as occasional randomness and non-positive responses (e.g., 'confusion'), emphasizing the need for human oversight and guardrails.

The key conclusions and findings include:
- High Therapeutic Qualities: ChatGPT-4 generated responses with high levels of warmth (97.5%), empathy (94.2%), and strongly positive acceptance (mean compound score = 0.93 ± 0.19). This suggests the model can consistently produce emotionally supportive and reassuring content.
- Moderate Stability: The run-to-run stability of responses was moderate (ICC(2,1) = 0.62; Kappa = 0.59), indicating that while generally consistent, responses to identical prompts still vary.
- Identified Risks: A small but notable fraction (2.5%) of responses were labeled as 'confusion' or 'realization', and sentiment drift was observed, highlighting the potential for unforeseeable harm or misinformation in sensitive contexts.
- Potential for Augmentation: The findings suggest that ChatGPT-4 could effectively augment low-intensity mental health support in educational settings, particularly for psycho-education, homework review, or check-in messages.
- Need for Human-in-the-Loop: Due to the identified risks and variability, the paper strongly advocates for human-in-the-loop workflows, where AI drafts are reviewed and approved by licensed clinicians, and for robust risk management and governance frameworks.

These findings contribute to understanding the practical utility and safety considerations for deploying advanced AI in mental health, especially within school environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a reader should be familiar with several key concepts across artificial intelligence, natural language processing, psychology, and statistics.
- Conversational Artificial Intelligence (AI) / Large Language Models (LLMs):
  - Definition: Conversational AI refers to technologies, like chatbots or virtual assistants, that can understand and respond to human language in a natural, human-like way. Large Language Models (LLMs) are AI programs trained on massive amounts of text data from the internet; this training allows them to understand, generate, and process human language for tasks such as answering questions, writing essays, or engaging in dialogue. ChatGPT-4 is a specific, highly advanced LLM developed by OpenAI.
  - Relevance: The entire study evaluates ChatGPT-4's performance, so understanding what LLMs are and their general capabilities is fundamental.
- Natural Language Processing (NLP):
  - Definition: Natural Language Processing (NLP) is a branch of AI that gives computers the ability to understand, interpret, and generate human language. It involves techniques for analyzing text data to extract meaning, identify sentiment, recognize entities, and more.
  - Relevance: The paper uses NLP tools (EmoRoBERTa, a neural network empathy model, and VADER) to quantify warmth, empathy, and acceptance from ChatGPT's text responses.
- Sentiment Analysis:
  - Definition: Sentiment analysis, or opinion mining, is an NLP technique used to determine the emotional tone behind a piece of text. It categorizes text as positive, negative, or neutral, and often provides a score indicating the strength of that sentiment.
  - Relevance: The VADER model (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool used in the paper to measure acceptance by analyzing the compound score of ChatGPT's responses.
- Emotion Detection:
  - Definition: Emotion detection is an NLP task that aims to identify and categorize specific human emotions (e.g., joy, anger, sadness, caring) expressed in text or speech.
  - Relevance: The EmoRoBERTa model is a pre-trained transformer-based model used in the paper for emotion detection to quantify warmth in ChatGPT's responses.
- Empathy Detection:
  - Definition: Empathy detection in NLP involves identifying linguistic cues that signal an understanding or sharing of another person's feelings or experiences within a text.
  - Relevance: The paper uses a neural network model specifically trained for empathy detection to measure the presence of empathy in ChatGPT's replies.
- Common Factors Theory (in Counseling):
  - Definition: In psychotherapy, common factors theory posits that the effectiveness of different therapeutic approaches is largely due to elements common across all successful therapies, rather than specific techniques unique to each. Key common factors include the therapeutic relationship (alliance), empathy, warmth, and acceptance.
  - Relevance: This theory directly informs the paper's choice of warmth, empathy, and acceptance as the primary metrics for evaluating ChatGPT's counseling capabilities, as these are crucial for fostering a positive therapeutic relationship.
- Parasocial Interaction (PSI):
  - Definition: Parasocial interaction (PSI) describes a one-sided psychological relationship experienced by an audience member with a media persona (e.g., a celebrity, a fictional character). The audience feels an emotional connection and intimacy with the persona, even though the relationship is not reciprocal. With conversational AI, this becomes a two-way, on-demand bond.
  - Relevance: The paper uses PSI theory as a theoretical foundation to understand how users might form bonds with chatbots and how warmth, empathy, and acceptance contribute to perceived intimacy and responsiveness, which are principal drivers of user satisfaction. It also highlights the "darker corollary" that this intimacy can amplify the harm of misinformation from a chatbot.
- Intraclass Correlation Coefficient (ICC):
  - Definition: The Intraclass Correlation Coefficient (ICC) is a statistical measure used to assess the reliability of ratings or measurements. It quantifies the degree of similarity of measurements made on the same subjects, or in this case, the run-to-run reliability of ChatGPT's responses. An ICC of 1 indicates perfect agreement, while 0 indicates no agreement.
  - Mathematical Formula (for ICC(2,1) as used in the paper):
    $
    \mathrm{ICC}(2,1) = \frac{\mathrm{MS_R} - \mathrm{MS_E}}{\mathrm{MS_R} + (k-1)\mathrm{MS_E} + k(\mathrm{MS_C} - \mathrm{MS_E})/n}
    $
  - Symbol Explanation:
    - $\mathrm{MS_R}$: Mean square for rows (subjects/queries in this case).
    - $\mathrm{MS_E}$: Mean square for error (residual variance).
    - $\mathrm{MS_C}$: Mean square for columns (raters/runs in this case).
    - $k$: Number of ratings per subject (number of runs, which is 3 in this study).
    - $n$: Number of subjects (number of queries, which is 80 in this study).
  - Relevance: Used to assess the stability of ChatGPT's continuous sentiment outputs (compound, negativity, positivity, and neutral scores); see the sketch below.
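To make the ICC(2,1) calculation concrete, here is a minimal sketch (not the authors' code) that derives the coefficient directly from the two-way ANOVA mean squares; the 80 × 3 score matrix is simulated purely for illustration.

```python
# Minimal sketch: ICC(2,1) for 80 queries x 3 repeated ChatGPT runs,
# computed from a two-way ANOVA decomposition. `compound` is simulated data.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """scores has shape (n_subjects, k_raters); here (80 queries, 3 runs)."""
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-query means
    col_means = scores.mean(axis=0)   # per-run means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_error / ((n - 1) * (k - 1))

    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Example with simulated VADER compound scores in [-1, 1]
rng = np.random.default_rng(0)
compound = np.clip(rng.normal(0.9, 0.1, size=(80, 3)), -1, 1)
print(f"ICC(2,1) = {icc_2_1(compound):.2f}")
```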
- Fleiss' Kappa ($\kappa$):
  - Definition: Fleiss' Kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters (or measurement instances) when assigning categorical ratings to a number of items or subjects. Unlike simple percent agreement, Kappa accounts for the possibility of agreement occurring by chance.
  - Mathematical Formula:
    $
    \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
    $
  - Symbol Explanation:
    - $\bar{P}$: The observed agreement probability (average agreement across all items and categories).
    - $\bar{P}_e$: The chance agreement probability (average agreement expected by chance).
  - Relevance: Used to assess the stability of empathy detection, which is a binary categorical outcome (empathy detected/not detected); see the sketch below.
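A minimal sketch of the corresponding Fleiss' kappa computation, assuming the statsmodels implementation and a simulated 80 × 3 matrix of binary empathy labels (not the authors' code):

```python
# Minimal sketch: run-to-run agreement of binary empathy labels across the
# three ChatGPT runs, via Fleiss' kappa. `labels` is simulated for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
labels = rng.choice([0, 1], size=(80, 3), p=[0.06, 0.94])  # 0 = no empathy, 1 = empathy

# aggregate_raters converts per-run labels into per-item category counts
counts, _categories = aggregate_raters(labels)

# Note: kappa can be modest even with high raw agreement when one category
# dominates, which is relevant to the heavily positive empathy labels here.
print(f"Fleiss' kappa = {fleiss_kappa(counts, method='fleiss'):.2f}")
```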
- Chi-Square Test for Independence ($\chi^2$):
  - Definition: The Chi-square test for independence is a statistical test used to determine if there is a significant association between two categorical variables. In this context, it checks if the distribution of emotion categories is independent of the response run (i.e., whether the emotion distribution varies significantly across the three responses for each query).
  - Mathematical Formula:
    $
    \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
    $
  - Symbol Explanation:
    - $R$: Number of rows (categories of the first variable).
    - $C$: Number of columns (categories of the second variable).
    - $O_{ij}$: Observed frequency in cell $(i, j)$.
    - $E_{ij}$: Expected frequency in cell $(i, j)$ under the assumption of independence.
  - Relevance: Used to describe whether there were significant differences in emotion category distribution across the three responses; see the sketch below.
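This test can be reproduced descriptively from the emotion counts reported in Table 1 of the paper; the sketch below (not the authors' code) uses scipy. Several cells have very small expected counts, so the result is illustrative only.

```python
# Minimal sketch: chi-square test of independence on emotion-category counts
# per run, using the Table 1 counts reported in the paper.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Approval, Caring, Realization, Confusion; columns: Answer 1-3 (Table 1)
observed = np.array([
    [14, 21, 18],
    [64, 57, 60],
    [ 1,  1,  2],
    [ 1,  1,  0],
])
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```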
- One-Way Analysis of Variance (ANOVA):
  - Definition: One-Way ANOVA is a statistical method used to compare the means of two or more groups (or conditions) to determine if there is a statistically significant difference between them. It is used when there is one categorical independent variable (factor) and one continuous dependent variable.
  - Mathematical Formula (for the F-statistic):
    $
    F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
    = \frac{\mathrm{SS}_{\text{between}} / (k-1)}{\mathrm{SS}_{\text{within}} / (N-k)}
    $
  - Symbol Explanation:
    - $F$: The F-statistic.
    - $\mathrm{MS}_{\text{between}}$: Mean square between groups (variance among group means).
    - $\mathrm{MS}_{\text{within}}$: Mean square within groups (variance within each group).
    - $\mathrm{SS}_{\text{between}}$: Sum of squares between groups.
    - $\mathrm{SS}_{\text{within}}$: Sum of squares within groups.
    - $k$: Number of groups (number of response runs, which is 3).
    - $N$: Total number of observations (number of queries × number of runs).
  - Relevance: Used to assess differences in the average composite sentiment scores among the three responses, checking for systematic drift in emotional tone; see the sketch below.
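A minimal sketch of this drift check (not the authors' code), using scipy's one-way ANOVA on simulated per-run compound scores whose means mirror the values reported later in Table 2:

```python
# Minimal sketch: one-way ANOVA comparing mean VADER compound scores across
# the three response runs. Score arrays are simulated; each run has 80 scores.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
run1, run2, run3 = (np.clip(rng.normal(mu, 0.15, 80), -1, 1) for mu in (0.90, 0.94, 0.94))

f_stat, p_value = f_oneway(run1, run2, run3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # non-significant p => no systematic drift
```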
- Pearson's Correlation Coefficient (Pearson's r):
  - Definition: Pearson's r is a measure of the linear correlation between two continuous variables. It ranges from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
  - Mathematical Formula:
    $
    r = \frac{n \sum(xy) - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}
    $
  - Symbol Explanation:
    - $n$: Number of paired observations.
    - $\sum(xy)$: Sum of the products of the paired scores.
    - $\sum x$: Sum of the x scores (e.g., question word counts).
    - $\sum y$: Sum of the y scores (e.g., response word counts).
    - $\sum x^2$: Sum of the squared x scores.
    - $\sum y^2$: Sum of the squared y scores.
  - Relevance: Used to explore whether longer questions tended to elicit longer answers by correlating question and response word counts; see the sketch below.
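A minimal sketch of this correlation step (not the authors' code); the word counts below are simulated for illustration:

```python
# Minimal sketch: correlate question length with average response length.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
question_words = rng.integers(20, 200, size=80)
avg_response_words = 250 + 0.8 * question_words + rng.normal(0, 40, size=80)

r, p = pearsonr(question_words, avg_response_words)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```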
3.2. Previous Works
The paper frames its research within a broader context of digital mental health and conversational AI.
- Digital Health Tools Efficacy: Studies like Areán (2021), Henson et al. (2019), and Lattie et al. (2022) are cited for demonstrating the efficacy of digital health tools (computerized CBT, mental-health apps, teletherapy) comparable to face-to-face interventions for mild-to-moderate conditions, expanding access amid provider shortages. The COVID-19 pandemic accelerated this shift (Friis-Healy et al., 2021; Prescott et al., 2022).
- Rise of Conversational AI: The public debut of ChatGPT in late 2022 introduced a new paradigm (Wu et al., 2023; Roumeliotis & Tselikas, 2023), with early adopters reporting applications from homework tutoring to health triage (Biswas, 2023; Su et al., 2022). Policy bodies are exploring hybrid models combining AI check-ins with human oversight (Vilaza & McCashin, 2021; Rollwage et al., 2022).
- AI Strengths in Mental Health: Previous research highlighted AI's potential in mental health assessment, detection, diagnosis, operation support, treatment, and counseling (Rollwage et al., 2023; Danieli et al., 2022; Trappey et al., 2022). Chatbots like Tess have shown efficacy in reducing depression and anxiety symptoms (Fulmer et al., 2018).
- Nuance and Limitations of AI Empathy:
  - Huang et al. (2024) found that GPT-4 matches human therapists on surface-level warmth and reflection but underperforms on depth of emotional processing, often resorting to formulaic reassurance.
  - Elyoseph et al. (2023) showed that while ChatGPT outperforms human norms in standard emotional-awareness tests, its responses often lack personalized probing and adaptive questioning. Still, Elyoseph et al. (2023) and Heston (2023) found that ChatGPT demonstrated significantly higher emotional awareness than human norms and could provide useful intervention for low- and medium-risk conditions.
- Risks and Inconsistencies:
  - Potential harm from malfunctions and misinterpretations is a serious concern (Hamdoun et al., 2023; Shaik et al., 2022; Trappey et al., 2022; Kapoor & Goel, 2022; Heston, 2023; Farhat, 2024).
  - Trappey et al. (2022) noted that slight input changes can lead to unpredictable variations in AI responses.
  - Farhat (2024) highlighted that altering or repeating prompts can lead to harmful suggestions.
  - Wang et al. (2023) demonstrated that GPT-4 can hallucinate client details or omit critical information when prompts exceed about 3,000 tokens, indicating limited working memory.
  - The 2024 wrongful-death lawsuit against Character AI after a teenager's suicide, allegedly precipitated by chatbot guidance (Frenkel & Hirsh, 2024), underscores the severe risks of PSI-amplified errors.
- Bias and Complexity: Biases in AI algorithms may perpetuate societal biases (Sharma et al., 2022; Tutun et al., 2023). Increased context complexity can lead to worse results, making ChatGPT unsuitable for complex mental health interventions (Dergaa et al., 2024).
3.3. Technological Evolution
The evolution of technology in mental health has progressed from early computerized Cognitive Behavioral Therapy (CBT) and mental-health apps to fully remote teletherapy, which gained significant traction during the COVID-19 pandemic. This shift enabled expanded access for mild-to-moderate conditions. The introduction of advanced conversational AI like ChatGPT in late 2022 marked a new phase, providing accessible, end-user-friendly tools for various applications, including preliminary mental health support. Concurrently, specialized next-generation mental-health chatbots (e.g., Hailey, MYLO, Limbic Access) have emerged, integrating advanced Natural Language Processing (NLP) pipelines for emotion detection, sentiment analysis, and multi-turn dialogue to deliver psychoeducational content and self-help exercises. This paper's work fits into the current state of exploring the capabilities and safety of these advanced LLMs for mental health, specifically in school counseling. It seeks to bridge the gap between general AI capabilities and specific therapeutic metrics, quantifying both performance and reliability, while acknowledging the critical need for safeguards in this rapidly evolving landscape.
3.4. Differentiation Analysis
Compared to main methods in related work, this paper offers several core differences and innovations:
- Quantitative Stability Focus: While previous research discussed the general feasibility of ChatGPT in mental health or its emotional awareness (e.g., Elyoseph et al., Huang et al.), this study specifically focuses on inherent capabilities and response stability by quantifying three APA-defined therapeutic metrics (warmth, empathy, acceptance) and, crucially, using Fleiss' Kappa and ICC(2,1) to measure run-to-run consistency for identical prompts. This provides a more rigorous understanding of reliability.
- School Counseling Context: The study grounds its simulations in school counseling scenarios using real-world college-student counseling questions. This offers context-specific insights not broadly covered in general mental health AI research, addressing a specific, high-need demographic.
- NLP-based Emotion Detection with APA Benchmarks: It rigorously applies APA-informed NLP tools (EmoRoBERTa, a neural network empathy model, VADER) to quantify specific therapeutic qualities, linking AI performance directly to established counseling benchmarks.
- Integration of Practitioner's Perspective: The inclusion of practitioner testimony provides a unique, real-world lens on the implications of the findings for technical design, product workflows, human-AI collaboration, organizational practices, and regulatory frameworks, bridging research with practical application and policy.
- Emphasis on Risks and Randomness: The paper explicitly highlights randomness and instability as critical risk factors, aiming to reveal threats in using AI in direct clinical environments rather than just showcasing general performance. This is a more cautious, safety-focused approach than many prior exploratory studies.
4. Methodology
4.1. Principles
The core idea behind the method used in this paper is to conduct text-based simulations of real-world counseling interactions to explore the capabilities and response stability of Large Language Model (LLM) chatbots, specifically ChatGPT-4, in a school-counseling context. The theoretical basis is rooted in the common factors theory of counseling, which identifies warmth, empathy, and acceptance as crucial interpersonal skills for effective therapists (APA, 2013; Castonguay & Hill, 2012; Wampold et al., 2017). The intuition is that if an AI can consistently demonstrate these qualities, it holds potential for supporting mental health. Additionally, drawing on Parasocial Interaction (PSI) theory, the study aims to understand how warmth, empathy, and acceptance in AI responses can foster perceived intimacy and responsiveness, which are key drivers of user satisfaction and emotional support.
The methodology also emphasizes quantifying the stability and consistency of AI responses. This addresses a critical concern about randomness in AI outputs, especially in sensitive clinical contexts where unpredictable variations or hallucinations could lead to harm. By submitting identical prompts multiple times and using statistical measures like Fleiss' Kappa and Intraclass Correlation Coefficient (ICC), the study aims to provide an initial picture of ChatGPT's capacity to deliver emotionally supportive and stable replies. The overall approach combines established psychological benchmarks with computational social science techniques (NLP for sentiment and emotion analysis) to objectively evaluate AI's potential and risks.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Study Design and LLM Selection
The study employed text-based simulations using real-world queries.
- Data Source Identification: The first step involved identifying an authentic data source of counseling questions.
- LLM Selection: ChatGPT-4 (model 0613, accessed 15 July 2024) was chosen as the Large Language Model (LLM). The rationale for this choice was its widespread adoption and transparent documentation at the time of data collection.
- Response Collection: For each identified question, three responses were collected from ChatGPT's online application. This repeated querying was crucial for assessing run-to-run stability.
- Analysis Focus: The subsequent analysis focused on quantitatively demonstrating how well the responses conveyed warmth, empathy, and acceptance. A secondary aim was to characterize the degree of response randomness across the three repetitions of identical prompts.
- Tools: All analyses utilized publicly available Natural Language Processing (NLP) tools for illustrative purposes rather than confirmatory testing.
- Interpretation: The quantitative results were then used to discuss ChatGPT's capabilities and risks in real-world applications, with descriptive insights provided by two authors with backgrounds in mental health, technological innovations, and computer science.
4.2.2. Data Collection
The study utilized a secondary data source from the ChatCounselor research (Liu et al., 2023). This dataset consists of diverse queries related to adolescent psychological issues, originally collected in Chinese and then translated into English.
- Dataset Composition: The dataset comprises queries from 80 different students, covering topics such as academic stress, family, and intimate relationships. The variety in topics, tones, and lengths was considered advantageous for testing ChatGPT's performance stability.
- Prompt Engineering: To ensure ChatGPT's responses closely resembled real counseling sessions, a specific prompt was used: "Imagine you are a counselor, and you need to give a response just as in a counseling session. You need to give a response in the same format as a professional counselor. According to the APA, an effective therapist has abilities including verbal fluency, warmth, acceptance, empathy, and an ability to identify how a patient is feeling." A query from the dataset was then appended to this prompt. This prompt design aimed to provide necessary context and expectations for an objective evaluation.
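The authors collected responses through the ChatGPT web application; for readers who want to reproduce the repeated-prompt protocol programmatically, a hypothetical sketch via the OpenAI Python SDK could look like the following. The counselor prompt is quoted from the paper, while the API usage and setup are assumptions, not the authors' pipeline.

```python
# Hypothetical reproduction sketch: send the same counselor prompt plus student
# query three times per question through the OpenAI Python SDK (>= 1.0).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

COUNSELOR_PROMPT = (
    "Imagine you are a counselor, and you need to give a response just as in a "
    "counseling session. You need to give a response in the same format as a "
    "professional counselor. According to the APA, an effective therapist has "
    "abilities including verbal fluency, warmth, acceptance, empathy, and an "
    "ability to identify how a patient is feeling."
)

def collect_responses(question: str, n_runs: int = 3) -> list[str]:
    """Collect n_runs independent replies to the same counseling query."""
    replies = []
    for _ in range(n_runs):
        completion = client.chat.completions.create(
            model="gpt-4-0613",  # the model version named in the paper
            messages=[
                {"role": "system", "content": COUNSELOR_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        replies.append(completion.choices[0].message.content)
    return replies
```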
4.2.3. Warmth (Emotion Detection)
- Metric Definition: Warmth was defined as the ability of the AI to create a welcoming and supportive context, as per APA benchmarks.
- Tool: The EmoRoBERTa model (Kamath et al., 2022) was utilized for emotion detection. This is a pre-trained transformer-based model designed to identify 28 distinct emotions.
- Emotions Detected by EmoRoBERTa: Admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and neutrality.
- Application: The EmoRoBERTa model was applied to each of ChatGPT's responses to classify the primary emotion.
- Operationalization of Warmth: The study combined the caring and approval categories detected by EmoRoBERTa to quantify emotional warmth (a code sketch follows this list).
4.2.4. Empathy (Empathy Detection)
- Metric Definition: Empathy was assessed based on the AI's ability to understand and mirror students' feelings and experiences, as per APA benchmarks.
- Tool: A neural network model specifically trained to detect empathy in text (Sharma et al., 2020) was adopted. This model was trained on a dataset of empathetic and non-empathetic text.
- Output: The model outputs a binary label: 1 for responses containing empathy and 0 for those that do not.
- Application: This model was applied to measure the levels of empathy in ChatGPT's responses.
4.2.5. Acceptance (Sentiment Analysis)
- Metric Definition: Acceptance investigates whether the AI can demonstrate unconditional positive regard and a nonjudgmental attitude toward students, as per APA benchmarks.
- Tool: The Valence Aware Dictionary and sEntiment Reasoner (VADER) model (Hutto & Gilbert, 2014) was used for sentiment analysis.
- Output: VADER provides four scores for each text: negative (neg), neutral (neu), positive (pos), and a comprehensive sentiment score (compound).
- Operationalization of Acceptance: The compound score was used to evaluate the overall emotional tone, with higher positive scores indicating a higher level of acceptance (a code sketch follows this list).
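A minimal sketch of the VADER step (not the authors' code), using the open-source vaderSentiment package:

```python
# Minimal sketch: score a reply with VADER and use the compound value as the
# acceptance indicator, as described in the paper.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

reply = ("It's understandable to feel anxious in new social situations, "
         "and acknowledging that is a brave first step.")
scores = analyzer.polarity_scores(reply)
# scores contains 'neg', 'neu', 'pos', and 'compound' (normalized to [-1, 1])
print(f"compound = {scores['compound']:.2f}")
```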
4.2.6. Stability and Consistency Evaluation
To produce a descriptive estimate of the stability and consistency of ChatGPT's responses, several statistical methods were employed:
- Fleiss' Kappa ($\kappa$) for Empathy:
  - Purpose: To assess the inter-rater reliability (or run-to-run consistency in this case) of the binary empathy detection (0 or 1).
  - Mathematical Formula:
    $
    \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
    $
  - Symbol Explanation:
    - $\bar{P}$: The observed agreement probability, calculated as the average proportion of agreements across all categories and subjects.
    - $\bar{P}_e$: The chance agreement probability, calculated as the sum of squared proportions of assignments for each category by chance.
- Intraclass Correlation Coefficient (ICC(2,1)) for Sentiment Scores:
  - Purpose: To quantify the run-to-run reliability of ChatGPT's continuous sentiment outputs (compound, negativity, positivity, and neutral scores). A two-way random-effects, absolute-agreement, single-measurement ICC was chosen to estimate whether any subsequent call would reproduce the same absolute score pattern.
  - Mathematical Formula (for ICC(2,1)):
    $
    \mathrm{ICC}(2,1) = \frac{\mathrm{MS_R} - \mathrm{MS_E}}{\mathrm{MS_R} + (k-1)\mathrm{MS_E} + k(\mathrm{MS_C} - \mathrm{MS_E})/n}
    $
  - Symbol Explanation:
    - $\mathrm{MS_R}$: Mean square for rows (representing the 80 individual counseling queries).
    - $\mathrm{MS_E}$: Mean square for error (the random variability or residual variance).
    - $\mathrm{MS_C}$: Mean square for columns (representing the 3 repeated runs of ChatGPT for each query).
    - $k$: Number of ratings per subject (3, for the three responses).
    - $n$: Number of subjects (80, for the 80 queries).
- Chi-Square Test for Emotion Category Distribution:
  - Purpose: To determine whether the emotion-category distribution (as detected by EmoRoBERTa) differed significantly across the three responses.
  - Mathematical Formula:
    $
    \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
    $
  - Symbol Explanation:
    - $\chi^2$: The Chi-square test statistic.
    - $R$: Number of rows in the contingency table (emotion categories).
    - $C$: Number of columns in the contingency table (the three response runs).
    - $O_{ij}$: The observed frequency of an emotion category in a specific response run.
    - $E_{ij}$: The expected frequency of that emotion category in that response run, assuming independence between emotion category and response run.
- One-Way ANOVA for Composite Sentiment Scores:
  - Purpose: To assess whether there were statistically significant differences in the average composite sentiment scores (VADER compound scores) among the three responses, checking for systematic drift in emotional tone.
  - Mathematical Formula (for the F-statistic):
    $
    F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
    = \frac{\mathrm{SS}_{\text{between}} / (k-1)}{\mathrm{SS}_{\text{within}} / (N-k)}
    $
  - Symbol Explanation:
    - $F$: The F-statistic, the ratio of the variance between groups to the variance within groups.
    - $\mathrm{MS}_{\text{between}}$: Mean square between groups, representing the variability among the means of the three response runs.
    - $\mathrm{MS}_{\text{within}}$: Mean square within groups, representing the variability within each response run.
    - $\mathrm{SS}_{\text{between}}$: Sum of squares between groups.
    - $\mathrm{SS}_{\text{within}}$: Sum of squares within groups.
    - $k$: Number of groups (3, the three response runs).
    - $N$: Total number of observations (number of queries multiplied by 3 runs).
4.2.7. Correlation Analysis
- Purpose: To explore whether the length of the input question influenced the length of ChatGPT's responses, addressing a factor potentially contributing to randomness.
- Method: Pearson's r (Pearson correlation coefficient) was calculated between the word count of the input questions and the average word count of the corresponding responses.
- Mathematical Formula:
  $
  r = \frac{n \sum(xy) - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}
  $
- Symbol Explanation:
  - $r$: Pearson's correlation coefficient.
  - $n$: The number of pairs of data points (80 queries).
  - $\sum(xy)$: The sum of the products of the paired question word counts ($x$) and average response word counts ($y$).
  - $\sum x$: The sum of all question word counts.
  - $\sum y$: The sum of all average response word counts.
  - $\sum x^2$: The sum of the squares of all question word counts.
  - $\sum y^2$: The sum of the squares of all average response word counts.
4.2.8. Ethical Considerations
The study adhered to ethical standards by ensuring anonymization and using publicly available data.
- No Human/Animal Subjects: The study did not involve direct interactions with human or animal subjects.
- De-identified Data: The data came from the publicly released ChatCounselor corpus (Liu et al., 2023), hosted on GitHub/Hugging Face, which was fully de-identified. User names, dates, IP logs, and any HIPAA- or GDPR-protected identifiers were removed by the original curators.
- Terms of Service: The forum's terms of service permitted non-commercial redistribution for research, and the dataset was distributed under an open license.
- IRB Exemption: Based on U.S. federal regulations (45 CFR 46.102 and 46.104(d)(4)), which exclude publicly available, de-identified data from the definition of human-subjects research, the study determined that it was Not-Human-Subjects Research and did not require Institutional Review Board (IRB) review.
- No Re-identification/Linkage: No re-identification was attempted, no linkage with other data sources occurred, and no contact was made with the original posters.
4.2.9. Methodological Limitations
The authors explicitly acknowledge several limitations:
- Offline Text Simulations: The study was based entirely on offline text simulations using eighty previously posted student queries fed to a single model version (GPT-4-0613, accessed 15 July 2024). The results are therefore descriptive flags, not confirmatory tests, and may not generalize to other LLM versions or chatbots.
- No Live User Interaction: No live users or clinicians interacted with the system, so the study cannot speak to real-time usability, safety, or conversational dynamics.
- External Validity and Measurement: Warmth, empathy, and acceptance were inferred by machine-learning models rather than human raters, and crucial clinical outcomes (e.g., symptom change, client satisfaction) were not observed. The findings are thus exploratory and hypothesis-generating, not evidence of effectiveness; rigorous user studies, multi-model replications, and mixed-methods validation are needed.
- Lack of Direct Stakeholder Input: The study did not collect usability or acceptability feedback from adolescent users or professional school counselors, leaving open questions about how the automated scores translate into real-world experience. Future work should include interviews and focus groups with these stakeholders.
5. Experimental Setup
5.1. Datasets
- Source: The secondary data for this study were sourced from the ChatCounselor research (Liu et al., 2023). This corpus is publicly available on GitHub/Hugging Face.
- Characteristics and Domain: The dataset consists of 80 real-world college-student counseling questions. These questions were originally collected in Chinese and then translated into English for the purpose of this study. The queries are diverse, spanning various adolescent psychological issues, including academic stress, family relationships, and intimate relationships.
- Data Sample Example (from Table 3 of the original paper):
- Question: "My mom has actually learned to treat me right, it's only occasionally that I'm able to notice that she really doesn't have the ability to care for others, and that stings me, other than that she's really made a lot of effort and I'm impressed. But my situation still hasn't gotten much better, a lot of people who have heard me talk about the situation have suggested I leave, which I don't want to do, is there no way to fix the problem without leaving?"
- Rationale for Choice: The ChatCounselor dataset was considered suitable due to its authenticity in representing real-world counseling data. The large variation in topics, tones, and lengths of the queries made it meaningful for testing the stability and generalizability of ChatGPT's performance in a counseling context.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
- Warmth (Emotion Detection Percentage):
  - Conceptual Definition: This metric quantifies the extent to which ChatGPT's responses convey a welcoming, supportive, and positive emotional tone, which is a key component of a therapeutic alliance. It specifically measures the proportion of responses that fall into categories indicative of care and affirmation.
  - Mathematical Formula: $ \text{Warmth Percentage} = \frac{\text{Number of responses coded as 'Caring' or 'Approval'}}{\text{Total number of responses}} \times 100\% $
  - Symbol Explanation:
    - Number of responses coded as 'Caring' or 'Approval': The count of individual ChatGPT responses where the EmoRoBERTa model detected the primary emotion as either Caring or Approval.
    - Total number of responses: The total number of responses generated by ChatGPT across all queries (80 queries × 3 runs = 240 responses).
- Empathy (Empathy Detection Percentage):
  - Conceptual Definition: This metric assesses ChatGPT's ability to understand, reflect, and emotionally connect with the user's feelings and experiences, which is crucial for empathetic listening in counseling. It measures the proportion of responses that contain empathetic language.
  - Mathematical Formula: $ \text{Empathy Percentage} = \frac{\text{Number of responses where Empathy is detected (1)}}{\text{Total number of responses}} \times 100\% $
  - Symbol Explanation:
    - Number of responses where Empathy is detected (1): The count of individual ChatGPT responses where the neural network model for empathy detection output a binary label of 1 (indicating empathy presence).
    - Total number of responses: The total number of responses (240).
- Acceptance (VADER Compound Score):
  - Conceptual Definition: This metric quantifies the overall emotional positivity of ChatGPT's responses, reflecting unconditional positive regard and a nonjudgmental attitude. The compound score aggregates positive, negative, and neutral sentiments into a single, normalized score between -1 (most extreme negative) and +1 (most extreme positive).
  - Mathematical Formula: The VADER compound score is produced internally by the model's lexicon and rule-based system; it is not a simple linear function of the neg, neu, and pos scores. The paper reports the mean of these scores: $ \text{Mean Compound Score} = \frac{1}{N} \sum_{i=1}^{N} \text{Compound Score}_i $
  - Symbol Explanation:
    - $\text{Compound Score}_i$: The VADER compound sentiment score for the $i$-th response.
    - $N$: The total number of responses (240).
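A hypothetical sketch of how these three headline descriptives could be computed once every response carries an emotion label, an empathy flag, and a compound score; the DataFrame column names are assumptions, not the authors' schema.

```python
# Illustrative sketch: warmth %, empathy %, and mean/SD compound score from a
# results table with assumed columns 'emotion', 'empathy', and 'compound'.
import pandas as pd

def summarize(results: pd.DataFrame) -> dict:
    warm = results["emotion"].str.lower().isin({"caring", "approval"})
    return {
        "warmth_pct": 100 * warm.mean(),
        "empathy_pct": 100 * (results["empathy"] == 1).mean(),
        "compound_mean": results["compound"].mean(),
        "compound_sd": results["compound"].std(),
    }

# Example with three toy rows (the study has 240: 80 queries x 3 runs)
df = pd.DataFrame({
    "emotion": ["caring", "approval", "confusion"],
    "empathy": [1, 1, 1],
    "compound": [0.99, 0.99, 0.94],
})
print(summarize(df))
```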
- Stability: Fleiss' Kappa ($\kappa$) for Empathy Detection:
  - Conceptual Definition: This metric measures the reliability of agreement among the three repeated responses (as "raters") regarding the presence or absence of empathy for each query. It accounts for agreement occurring by chance, providing a more robust measure of consistency than simple percent agreement.
  - Mathematical Formula: $ \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} $
  - Symbol Explanation:
    - $\kappa$: Fleiss' Kappa statistic.
    - $\bar{P}$: The observed agreement probability, the average proportion of agreements across all items (queries) and categories (empathy detected/not detected).
    - $\bar{P}_e$: The chance agreement probability, the average proportion of agreements expected by chance.
- Stability: Intraclass Correlation Coefficient (ICC(2,1)) for Sentiment Scores:
  - Conceptual Definition: This metric quantifies the absolute agreement and run-to-run reliability of continuous sentiment scores (negativity, neutral, positivity, compound) across the three responses for each query. It indicates how consistently the same absolute score is reproduced across repetitions.
  - Mathematical Formula (for ICC(2,1)): $ \mathrm{ICC}(2,1) = \frac{\mathrm{MS_R} - \mathrm{MS_E}}{\mathrm{MS_R} + (k-1)\mathrm{MS_E} + k(\mathrm{MS_C} - \mathrm{MS_E})/n} $
  - Symbol Explanation:
    - $\mathrm{ICC}(2,1)$: The Intraclass Correlation Coefficient (two-way random-effects, absolute-agreement, single-measurement).
    - $\mathrm{MS_R}$: Mean square for rows (variance attributable to the 80 queries).
    - $\mathrm{MS_E}$: Mean square for error (unexplained variance).
    - $\mathrm{MS_C}$: Mean square for columns (variance attributable to the 3 response runs).
    - $k$: Number of ratings per subject (3, the number of responses per query).
    - $n$: Number of subjects (80, the number of distinct queries).
- Chi-Square Test Statistic ($\chi^2$) for Emotion Category Distribution:
  - Conceptual Definition: This statistic is used to determine whether there is a statistically significant association between the categorical variable 'emotion category' (e.g., Caring, Approval, Confusion) and the categorical variable 'response run' (Answer 1, Answer 2, Answer 3). A high $\chi^2$ value with a low p-value would suggest a dependent relationship, meaning emotion distributions differ significantly across runs.
  - Mathematical Formula: $ \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $
  - Symbol Explanation:
    - $\chi^2$: The Chi-square test statistic.
    - $R$: Number of emotion categories observed.
    - $C$: Number of response runs (3).
    - $O_{ij}$: The observed frequency count for emotion category $i$ in response run $j$.
    - $E_{ij}$: The expected frequency count for emotion category $i$ in response run $j$, assuming no association between emotion category and response run.
- F-statistic from One-Way ANOVA for Composite Sentiment Scores:
  - Conceptual Definition: The F-statistic tests the null hypothesis that the means of the VADER compound scores are equal across the three response runs. A significant F-statistic (with a low p-value) would indicate that the average emotional tone differs systematically across the runs.
  - Mathematical Formula: $ F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}} $
  - Symbol Explanation:
    - $F$: The F-statistic.
    - $\mathrm{MS}_{\text{between}}$: Mean square between groups, representing the variance of the group means (average compound scores for Answer 1, Answer 2, Answer 3) around the grand mean.
    - $\mathrm{MS}_{\text{within}}$: Mean square within groups, representing the pooled variance within each group (the variability of compound scores within each set of 80 responses).
- Pearson's Correlation Coefficient ($r$) for Word Counts:
  - Conceptual Definition: This metric quantifies the strength and direction of the linear relationship between the word count of the input questions and the average word count of the generated responses. A positive value indicates that longer questions tend to produce longer answers.
  - Mathematical Formula: $ r = \frac{n \sum(xy) - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}} $
  - Symbol Explanation:
    - $r$: Pearson's correlation coefficient.
    - $n$: The number of paired observations (80 queries).
    - $\sum(xy)$: The sum of the products of each question's word count ($x$) and its average response word count ($y$).
    - $\sum x$: The sum of all question word counts.
    - $\sum y$: The sum of all average response word counts.
    - $\sum x^2$: The sum of the squares of all question word counts.
    - $\sum y^2$: The sum of the squares of all average response word counts.
5.3. Baselines
The study primarily focuses on an exploratory analysis of ChatGPT-4's intrinsic capabilities and stability rather than a direct comparison against other Large Language Models (LLMs) or traditional counseling interventions. Therefore, explicit baseline models in the conventional sense (i.e., other LLMs used for head-to-head performance comparison on the same dataset) are not present within the experimental setup.
However, the paper implicitly references "human norms" and "human therapists" from related work (e.g., Elyoseph et al., Huang et al.) to contextualize ChatGPT's performance, suggesting human therapeutic qualities as an aspirational benchmark. The discussion also mentions other AI-driven platforms like Anthropic's Claude, Google's Gemini, Replika, and CharacterAI as part of the broader ecosystem of mental health support, but these are not used as experimental baselines in this specific study.
The study aims to establish a baseline understanding of ChatGPT-4 itself, particularly its run-to-run stability, which is a critical aspect for its potential deployment in real-world sensitive applications.
6. Results & Analysis
6.1. Core Results Analysis
The study's results indicate that ChatGPT-4 exhibits strong performance in generating responses with therapeutic qualities (warmth, empathy, acceptance) in simulated school-counseling settings, alongside moderate run-to-run stability.
For warmth, the model showed a high prevalence of supportive and empathetic emotions. Combining caring (75.4%) and approval (22.1%) categories, 97.5% of all replies were coded as warm. This suggests that ChatGPT-4 rarely adopted a neutral or negative tone within this specific counseling context, fostering a nurturing dialogue environment. However, the low occurrence of confusion (0.83%) and realization (1.67%) emotions, while infrequent, is notable. While indicating clarity, any mistake or non-supportive emotion in psychological counseling can be harmful, highlighting a residual randomness that requires caution.
Empathy was also highly prevalent, detected in 94.2% of responses. This suggests ChatGPT's ability to effectively mirror user feelings and experiences in a psychologically accurate manner, aligning with common factors theory where empathic listening is crucial.
Regarding acceptance, the VADER compound score averaged 0.93 ± 0.19, which falls into VADER's "strongly positive" band. This indicates an overwhelmingly positive emotional undertone, promoting a supportive and reassuring interaction framework.
Stability of responses was moderate. Fleiss' Kappa for empathy detection was 0.59, indicating substantial agreement but also implying that around 4 in 10 empathy judgments might shift across reruns. For continuous sentiment scores, ICC(2,1) showed good stability for the compound score (0.62), fair stability for negativity (0.57) and positivity (0.49), and poor stability for neutral scores (0.39). This suggests that while the overall positive tone is relatively consistent, the negative, neutral, and positive components can vary more noticeably. However, a chi-square test of the emotion category distribution indicated no significant variation across the three runs, with a Cramér's V of 0.09 suggesting a weak association. Similarly, a one-way ANOVA of the composite sentiment scores found no systematic drift in average emotional tone. These seemingly contrasting stability results highlight that while the overall average sentiment and categorical distribution may be stable, the absolute scores for specific sentiment components (especially neutral) can still fluctuate, indicating a degree of stochastic sampling inherent to the model.
Finally, a moderate positive correlation was found between question word count and average answer word count. This suggests that longer student disclosures tend to receive proportionally richer feedback, which could be advantageous for engagement in counseling, while also indicating that output length is not entirely arbitrary and is influenced by prompt length.
Overall, the results strongly validate ChatGPT-4's potential for generating warm, empathetic, and accepting responses, which are critical for therapeutic interaction. However, the identified moderate stability and occasional non-positive responses highlight its immaturity for direct clinical use without human oversight, particularly in high-risk mental health interventions.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Emotion Category | Answer 1 | Answer 2 | Answer 3 | Total | Proportion | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| Approval | 14 | 21 | 18 | 53 | 22.08% | Moderate |
| Caring | 64 | 57 | 60 | 181 | 75.42% | Very High |
| Realization | 1 | 1 | 2 | 4 | 1.67% | Very Low |
| Confusion | 1 | 1 | 0 | 2 | 0.83% | Very Low |
The following are the results from Table 2 of the original paper:
| Empathy Detection | Answer 1 | Answer 2 | Answer 3 | Frequency | Percentage |
| --- | --- | --- | --- | --- | --- |
| Empathy (1) | 76 | 75 | 75 | 226 | 94.17% |
| No Empathy (0) | 4 | 5 | 5 | 14 | 5.83% |

Mean VADER sentiment scores per response run:

| Sentiment Type | Answer 1 Mean | Answer 2 Mean | Answer 3 Mean | Total Mean |
| --- | --- | --- | --- | --- |
| Negative (neg) | 0.06 | 0.06 | 0.06 | 0.06 |
| Neutral (neu) | 0.73 | 0.74 | 0.73 | 0.73 |
| Positive (pos) | 0.21 | 0.21 | 0.21 | 0.21 |
| Compound | 0.90 | 0.94 | 0.94 | 0.93 |
The following are the results from Table 3 of the original paper, showing samples of regular and irregular responses. In the original layout, each question and response wraps across multiple merged table rows, so fragments appeared out of order in the extracted text; they are reassembled below, with "[...]" marking joins that could not be recovered. The emotion, empathy, and sentiment labels apply once per question-response pair.

Regular responses:

- Question: "My mom has actually learned to treat me right, it's only occasionally that I'm able to notice that she really doesn't have the ability to care for others, and that stings me, other than that she's really made a lot of effort and I'm impressed. But my situation still hasn't gotten much better, a lot of people who have heard me talk about the situation have suggested I leave, which I don't want to do, is there no way to fix the problem without leaving?"
  Response: "It sounds like you're in a complex situation where you recognize and appreciate the efforts your mom has made, yet you're also feeling hurt by moments that reveal limitations in her ability to show care. It's understandable that such moments can be painful, especially when they come from someone as important as your mother. The conflict between valuing the progress she's made and feeling stung by her limitations is a challenging space to navigate. Leaving or staying isn't a decision to be taken lightly, and it's clear you have a strong inclination towards finding a solution that doesn't involve leaving. It's important to acknowledge that change in relationships often requires effort from all parties involved. Have you been able to communicate your feelings to your mom about those moments that sting? Sometimes, opening up a dialogue can pave the way for understanding and further improvements in your relationship. It may also be helpful to consider what support systems you have in place, such as friends, other family members, or a therapist, who can offer you perspectives and emotional support as you navigate these complexities. Finding a balance between your needs and the dynamics with your mom may take time and patience. Remember, it's okay to seek out spaces where you feel understood and valued, as this can significantly contribute to your well-being."
  Emotion Category: Caring | Empathy Category: 1 (detected) | Sentiment Compound: 0.99

- Question: "After coming into contact with psychology, I gradually realized that there are some negative influences on the way of communication, expression and getting along formed by my family of origin, and that the incorrect mode of getting along between my parents has indirectly caused a profound potential impact on myself, which has led to my own stumbling in the choice of love and getting along with my partner. And I want to try my best to save my family of origin, but I don't have enough strength. I often switch between extreme optimism and extreme pessimism. How can I balance myself?"
  Response: "It sounds like you've been doing a lot of reflective thinking about your family dynamics and how these have influenced your own patterns in relationships. It's insightful and brave of you to recognize these patterns and want to address them. Feeling torn between optimism and pessimism is understandable given the complexities of familial relationships and the desire to change deeply ingrained behaviors. Balancing oneself amid such emotional ups and downs involves acknowledging that change is a process, often slow and non-linear. It's important to give yourself permission to feel these conflicting emotions without judgment. Recognizing that you can only control your own actions and reactions, not the behavior of your family members, is key. Practicing self-compassion and seeking support, whether through friends, support groups, or a therapist, can be incredibly beneficial. Exploring these feelings and experiences in a safe space can help you gain insights and develop strategies for healthier communication and relationship patterns. Remember, it's okay to seek help and it's okay to set boundaries for your own well-being. Balancing optimism and pessimism can also mean accepting the reality of the situation while holding onto hope for personal growth and healthier relationships in the future."
  Emotion Category: Approval | Empathy Category: 1 (detected) | Sentiment Compound: 0.99

- Question: "I am 18 years old, I feel social phobia, I blush when I talk to strange boys, and I am afraid of teachers, I like to be jealous of other people's eloquence, good social skills, good interpersonal relationships, what should I do?"
  Response: "It's quite understandable to feel anxious in social situations, especially when interacting with people you're not familiar with. These feelings are more common than you might think, and acknowledging them is a brave first step. It's also natural to admire qualities in others that we wish to see in ourselves. Let's consider your strengths and the situations where you feel more comfortable. Reflecting on these can help us identify strategies to gradually build your confidence in more challenging social settings. Feeling nervous or blushing is a physical response to anxiety, and it's important to remember that everyone experiences moments of uncertainty when speaking to someone new or in authority roles like teachers. It's okay to feel this way. A helpful approach might be to start with small, manageable social interactions and gradually increase the complexity as you become more comfortable. Practice makes progress. Also, focusing on others' conversations, not to compare but to understand that everyone has moments of awkwardness, can be enlightening. Remember, eloquence and good social skills come with practice and exposure. It's perfectly fine to take small steps towards bigger goals. Celebrate your [...] seem. Would you be open to exploring some techniques or activities that could help you build these skills at a pace that feels right for you?"
  Emotion Category: Approval | Empathy Category: 1 (detected) | Sentiment Compound: 0.99

Irregular responses:

- Question: "I would sometimes feel that the people around me were pathetic and inferior, and feel alone with them and disdainful. I would feel that the niche was superior and the mass was vulgar. But in reality, deep down I was inferior and even angry. I would be reluctant to wear designer clothes for fear of attracting attention, and I wouldn't care to look like them. I would be angry at being too plain and unimpressive. I hated the public, but at the same time I needed them to recognize me."
  Response: "It sounds like you're grappling with complex feelings of superiority and inferiority, alongside a desire for recognition and a fear of it. It's not uncommon to feel conflicted in this way, where on one hand, there's a disdain for what's perceived as 'vulgar' or 'mass,' while on the other, there's a deep-seated wish to be acknowledged and valued by the very same group. This dichotomy can indeed make one feel isolated and frustrated. Your feelings towards wearing designer clothes and the fear of attracting attention, juxtaposed with a longing to be noticed, highlight an inner conflict between wanting to stand out and fearing the scrutiny that may come with it. It's a delicate balance between seeking validation and fearing judgment. Acknowledging these feelings is a significant first step. Exploring where these feelings come from can help us understand what drives your need for recognition and your fear of it. It's also worth exploring the standards you're holding yourself to and questioning where they come from. Are they truly your own, or perhaps influenced by societal or external expectations? Remember, it's okay to seek recognition and want to be seen, but it's also important to find a sense of self-worth that's not solely dependent on the approval of others. Together, we can work on strategies that help you reconcile these feelings, aiming for a place of self-acceptance and inner peace."
  Emotion Category: Confusion | Empathy Category: 1 (detected) | Sentiment Compound: 0.94

- Question: (not shown in the extracted table)
  Response: "It sounds like you're experiencing a deep connection with your partner, along with significant anxiety about the possibility of parting ways. It's understandable to feel strongly attached to someone who has been a significant part of your life for a long time. Relationships, especially long-[...] investments and shared experiences that can make the thought of separation very painful. However, it's also important to distinguish between healthy attachment and dependence that might hinder personal growth. Healthy love supports mutual growth, respects individuality, and includes a sense of security within the relationship that allows both partners to be themselves fully. Feeling unable to leave a relationship, despite recognizing issues or dissatisfaction, can sometimes reflect patterns of dependence or fear rather than just deep love. The concern expressed about low self-esteem comes from the idea that sometimes people stay in relationships not solely because of love but due to fear of being alone or belief that they can't find love again. This isn't to say your feelings aren't valid, but it's beneficial to explore the roots of these feelings. Are they based on love and mutual respect, or are they driven by fear and insecurity? Wanting to stay with one person for the rest of your life is a beautiful sentiment when it's based on a healthy, nurturing relationship. It might be helpful to reflect on why you feel unable to leave, despite recognizing potential issues. Consulting with a therapist could provide a supportive space to explore these feelings, understand your emotional needs better, and address concerns about self-esteem and dependency in relationships."
  Emotion Category: Confusion | Empathy Category: 1 (detected) | Sentiment Compound: 0.99
| --- | ||||
| (This structure indicates a single question/response pair across multiple lines, with repeated metadata for clarity in the original table's layout.) | ||||
| proposed breakup, and slightlyoverstays his or her welcome.Must such a mindset andbehavior be unhealthy and inneed of adjustment? And ishealthy love something you can'tpour into a person? Isn't it truethat the longer a five-yearrelationship goes on and themore exchanges are made, themore you'll be unable to partwith it? Why is it being told bythe other person that this is lowself-esteem behavior? I don'tthink it feels like I won't be lovedby someone else after I separate,and I have suitors now. But I justcan't let go of the people aroundme, and I want to stay with oneperson for the rest of my life. | (Continuation of the previous question, with its response continuing.) | |||
| --- | ||||
| (This structure indicates a single question/response pair across multiple lines, with repeated metadata for clarity in the original table's layout.) | ||||
| About relationship and marriage,I've been with my boyfriend for ayear now, we have a greatrelationship and I like him a lot.But it seems like as a marriagepartner he lacks some of thequalities I want in a man, like hardwork and the ability to treat peoplewell. So my question is, howexactly is a relationship related tomarriage? Do you like each otherenough to get married? Is thereany point in being in a relationshipif you don't want to marry eachother? Thank you for your answer. | It sounds like you're doing somethoughtful reflection on your relationshipand considering what you value for yourfuture. Relationships and marriage, whileinterconnected, serve different roles andcommitments in people's lives. Being in arelationship can be a way to connect,share experiences, and grow with anotherperson. Marriage, for many, symbolizes adeeper commitment and a willingness toshare life's journey together, with mutualrespect, understanding, and shared valuesat its core.It's important to acknowledge andappreciate the qualities you admire in | Realization | 0 (undetected) | 1.00 |
| --- | ||||
| Remember, it's about finding a balance that feels right for you and your partner, where both of you feel valued, respected, and fulfilled. | your boyfriend, like the strength of your relationship and how much you like each other. At the same time, it's equally important to recognize areas where you feel there might be a misalignment in values or qualities you consider crucial for a lifelong partnership. Communication is key in these situations. Discussing your feelings and expectations with your boyfriend openly can provide clarity and may help both of you understand if your paths align in terms of a long-term commitment like marriage. Deciding on marriage is deeply personal and varies significantly from one individual to another. Some may find fulfillment in a relationship without the need for marriage, while others view marriage as a fundamental goal of their partnership. Reflecting on what you truly value and desire in a partnership can guide you in making decisions that align with your personal and relationship goals. | (Continuation of the same response) | ||
| --- | ||||
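The emotion labels and confidence scores in the rows above (Caring, Approval, Confusion, Realization) are consistent with the GoEmotions taxonomy used by transformer-based emotion classifiers. The paper does not name its exact classifier, so the following is a minimal sketch assuming the publicly available Hugging Face checkpoint SamLowe/roberta-base-go_emotions and an illustrative 0.5 detection threshold, not the authors' actual pipeline.

```python
# Minimal sketch: labelling a ChatGPT reply with a GoEmotions-style emotion.
# Assumptions (not from the paper): the Hugging Face checkpoint
# "SamLowe/roberta-base-go_emotions" and a 0.5 "detected" threshold.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="SamLowe/roberta-base-go_emotions")

def label_reply(reply, threshold=0.5):
    """Return the top emotion label, a detected flag, and its confidence."""
    top = classifier([reply])[0]  # e.g. {"label": "caring", "score": 0.97}
    return {"emotion": top["label"],
            "detected": int(top["score"] >= threshold),
            "score": round(top["score"], 2)}

print(label_reply("It sounds like you're in a complex situation where you "
                  "recognize and appreciate the efforts your mom has made..."))
```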
The following are the results from Table 4 of the original paper:
| Emotional Type | ICC Type | ICC Value | Stability Rating | p-value |
| Negativity | ICC(2,1) | 0.57 | Fair | <.001 |
| Neutral | ICC(2,1) | 0.39 | Poor | <.01 |
| Positivity | ICC(2,1) | 0.49 | Fair | <.001 |
| Compound | ICC(2,1) | 0.62 | Good | <.001 |
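To make the stability analysis concrete: ICC(2,1) treats the three generation runs as interchangeable "raters" scoring the same 80 questions, while Fleiss' κ measures agreement on the categorical emotion labels. Below is a minimal sketch of how such values could be computed, assuming the pingouin and statsmodels libraries; the data layout, column names, and toy values are illustrative rather than the study's data.

```python
# Minimal sketch: run-to-run stability as in Table 4 (ICC(2,1)) plus Fleiss' kappa.
# The long-format layout, column names, and values below are illustrative only.
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Long format: one row per (question, run) with the compound sentiment score.
scores = pd.DataFrame({
    "question": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "run":      [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "compound": [0.98, 0.99, 0.97, 0.85, 0.90, 0.88,
                 -0.10, 0.20, 0.05, 0.99, 0.95, 0.97],
})

icc = pg.intraclass_corr(data=scores, targets="question",
                         raters="run", ratings="compound")
print(icc.loc[icc["Type"] == "ICC2", ["Type", "ICC", "pval"]])  # ICC(2,1)

# Fleiss' kappa on categorical emotion labels: one row per question,
# one column per run; aggregate_raters converts codes to category counts.
label_ids = {"caring": 0, "approval": 1, "confusion": 2}
runs = [["caring", "caring", "approval"],
        ["approval", "approval", "approval"],
        ["confusion", "caring", "confusion"],
        ["caring", "caring", "caring"]]
coded = [[label_ids[lab] for lab in row] for row in runs]
counts, _ = aggregate_raters(coded)
print(fleiss_kappa(counts, method="fleiss"))
```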
The following are the results from Table 5 of the original paper:
| Metric | Response 1 | Response 2 | Response 3 |
| Count | 80 | 80 | 80 |
| Mean (neg) | 0.06 | 0.06 | 0.06 |
| Std (neg) | 0.04 | 0.04 | 0.03 |
| Min (neg) | 0.00 | 0.00 | 0.00 |
| 25% (neg) | 0.03 | 0.03 | 0.04 |
| 50% (neg) | 0.05 | 0.05 | 0.06 |
| 75% (neg) | 0.07 | 0.08 | 0.08 |
| Max (neg) | 0.21 | 0.23 | 0.15 |
| Mean (neu) | 0.73 | 0.74 | 0.73 |
| Std (neu) | 0.05 | 0.05 | 0.04 |
| Min (neu) | 0.53 | 0.56 | 0.61 |
| 25% (neu) | 0.71 | 0.71 | 0.70 |
| 50% (neu) | 0.74 | 0.74 | 0.73 |
| 75% (neu) | 0.77 | 0.77 | 0.76 |
| Max (neu) | 0.82 | 0.83 | 0.82 |
| Mean (pos) | 0.21 | 0.21 | 0.21 |
| Std (pos) | 0.06 | 0.05 | 0.05 |
| Min (pos) | 0.11 | 0.09 | 0.11 |
| 25% (pos) | 0.17 | 0.18 | 0.18 |
| 50% (pos) | 0.21 | 0.21 | 0.21 |
| 75% (pos) | 0.23 | 0.24 | 0.23 |
| Max (pos) | 0.45 | 0.42 | 0.35 |
| Mean (compound) | 0.90 | 0.94 | 0.94 |
| Std (compound) | 0.35 | 0.23 | 0.19 |
| Min (compound) | -0.95 | -0.99 | -0.65 |
| 25% (compound) | 0.97 | 0.97 | 0.97 |
| 50% (compound) | 0.99 | 0.99 | 0.99 |
| 75% (compound) | 0.99 | 0.99 | 0.99 |
| Max (compound) | 0.99 | 0.99 | 0.99 |
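The neg/neu/pos/compound breakdown in Table 5 matches the output format of a VADER-style sentiment analyzer summarized with standard descriptive statistics. A minimal sketch is shown below, assuming the vaderSentiment package and pandas; the paper's exact tooling, and the example replies, are assumptions for illustration.

```python
# Minimal sketch: per-run descriptive statistics as in Table 5, assuming a
# VADER-style analyzer (the paper's exact NLP tooling is not specified here).
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def describe_run(replies):
    """Score each reply (neg/neu/pos/compound) and summarize the run."""
    scores = pd.DataFrame([analyzer.polarity_scores(r) for r in replies])
    return scores.describe()  # count, mean, std, min, 25%, 50%, 75%, max

run_1 = [
    "It sounds like you're in a complex situation ...",
    "It's quite understandable to feel anxious in social situations ...",
]
print(describe_run(run_1).round(2))
```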
6.3. Ablation Studies / Parameter Analysis
The paper does not present traditional ablation studies (where components of a model are removed to assess their individual contribution) or a detailed analysis of hyper-parameters. Instead, it includes a correlation analysis that indirectly explores a factor influencing GPT's output randomness:
- Correlation Between Question and Answer Word Count:
  - The study examined the relationship between the word count of the input questions and the average word count of the answers provided by GPT.
  - Result: A moderate positive correlation was found (95% CI [.44, .72]).
  - Analysis: This suggests that longer questions tend to elicit longer responses from ChatGPT. While not an ablation study, this finding indicates that prompt length is a factor influencing the content (and, implicitly, the complexity and detail) of the AI's output. It highlights a controllable variable that could be leveraged or standardized in future applications to manage response characteristics. It also underscores that the "randomness" of GPT's output is not entirely opaque but can be partially influenced by input characteristics, suggesting avenues for prompt engineering to improve consistency.
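For reference, a correlation of this kind, with its 95% CI, can be computed directly from the paired word counts; the sketch below assumes SciPy ≥ 1.10 (for `confidence_interval()`) and uses placeholder counts rather than the study's data.

```python
# Minimal sketch: correlation between question length and mean answer length,
# with a 95% CI (requires SciPy >= 1.10 for confidence_interval()).
# The word counts below are illustrative placeholders, not the study's data.
import numpy as np
from scipy import stats

question_words = np.array([45, 120, 80, 60, 150, 95, 70, 110])
mean_answer_words = np.array([210, 320, 260, 240, 350, 280, 255, 300])

result = stats.pearsonr(question_words, mean_answer_words)
ci = result.confidence_interval(confidence_level=0.95)
print(f"r = {result.statistic:.2f}, p = {result.pvalue:.3g}, "
      f"95% CI [{ci.low:.2f}, {ci.high:.2f}]")
```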
7. Conclusion & Reflections
7.1. Conclusion Summary
This exploratory study provides a valuable quantitative snapshot of ChatGPT-4's capabilities and run-to-run stability in simulating psychological counseling within a school context. The findings indicate that ChatGPT-4 can consistently generate responses exhibiting high levels of warmth (97.5%), empathy (94.2%), and acceptance (mean compound score = 0.93), aligning with key APA benchmarks for effective therapeutic communication. The stability of these responses was moderate (ICC(2,1) = 0.62 for compound sentiment, Fleiss' κ = 0.59 for empathy), suggesting a reasonable level of consistency but also highlighting inherent variability. While the model generally maintains a positive and supportive tone, the occasional occurrence of non-positive (e.g., 'confusion') responses and sentiment drift underscores critical risk areas. The research suggests that ChatGPT-4 holds significant potential for augmenting low-intensity mental health support in educational settings, particularly for psycho-education and check-in messages, provided that robust human-in-the-loop workflows and risk management policies are implemented. This study is notable for its application of quantitative stability metrics and NLP-based emotion detection in this specific domain, alongside incorporating a practitioner's real-world perspective.
7.2. Limitations & Future Work
The authors explicitly acknowledge several limitations:
- Offline Simulation: The study was based solely on offline text simulations, using pre-recorded student queries and a single version of GPT-4 (0613, accessed 15 July 2024). This means the results are descriptive rather than confirmatory and may not generalize to other LLM versions or live conversational dynamics.
- No Live User/Clinician Interaction: The absence of live users or clinicians interacting with the system means the study cannot assess real-time usability, safety, or the actual conversational dynamics that would occur in a clinical setting.
- Inferred Metrics: Warmth, empathy, and acceptance were inferred by machine-learning models rather than validated by human raters. Crucial clinical outcomes like symptom change or client satisfaction were not measured.
- Exploratory Nature: The findings are presented as an exploratory, hypothesis-generating snapshot rather than definitive evidence of effectiveness.
- Lack of Direct Stakeholder Feedback: The study did not collect usability or acceptability feedback from adolescent users or professional school counselors, leaving a gap in understanding real-world experience and needs.

Based on these limitations, the authors suggest several future research directions:

- Live User Involvement: Future work should involve live users to assess real-time interaction, usability, and safety.
- Multi-Model Comparison: Comparing multiple LLMs would provide a broader understanding of the landscape.
- Mixed-Methods Validation: Incorporating mixed-methods validation, including human rater assessments, clinical outcomes, and qualitative feedback from users and clinicians, is essential to assess real-world efficacy and safety.
- Co-design with Stakeholders: Conducting semi-structured interviews and focus groups with students and counseling staff is needed to validate automated metrics against lived perceptions and to co-design guardrail criteria that balance relational support with user safety.
- Addressing Randomness: Continuous refinement of AI algorithms is necessary to maintain consistency and reliability, especially in high-risk mental health interventions.
- Technological Solutions: Developing multi-agent models (validation agents, risk assessment agents) and refining prompt engineering to mitigate limitations.
- Applicational Solutions: Piloting AI applications in non-clinical settings with A/B trials and exploring their integration with wearable devices for continuous monitoring and preventive care.
- Organizational Solutions: Schools and colleges should develop protocols for continuous review, human oversight, product roadmaps, risk-management plans, and training for AI integration.
7.3. Personal Insights & Critique
This paper offers a rigorous quantitative approach to a crucial and sensitive application of Large Language Models (LLMs): mental health counseling. The meticulous measurement of warmth, empathy, and acceptance using APA-informed NLP tools is a significant step beyond anecdotal observations of AI's therapeutic potential. The focus on run-to-run stability with ICC and Kappa is particularly insightful, as consistency is paramount in clinical settings; an AI that gives different advice for the same problem could undermine trust and cause harm.
One personal insight is the critical distinction between NLP-derived emotional awareness and genuine therapeutic attunement. While ChatGPT-4 demonstrates impressive scores in warmth and empathy as measured by algorithms, the authors' own reference to Huang et al.'s finding that GPT-4 underperforms on "depth of emotional processing" and often resorts to "formulaic reassurance" is vital. This suggests that while the surface-level cues are present, the underlying capacity for dynamic, personalized, and adaptive therapeutic dialogue may still be limited. The high quantitative scores, therefore, might reflect sophisticated stylistic mimicry rather than true understanding or a robust internal model of human psychology. This highlights an important area for future research: how do we design AI that not only sounds empathetic but is therapeutically effective in a way that goes beyond lexical patterns?
A potential unverified assumption is that APA-informed NLP tools perfectly capture warmth, empathy, and acceptance as understood by human counselors and clients. While these tools are advanced, the nuances of human emotional expression and therapeutic intent can be complex. Human raters, qualitative feedback, and real-world clinical outcomes are indispensable for validating these automated metrics, a point the authors themselves acknowledge as a limitation and future work.
The paper's emphasis on risk management and human-in-the-loop workflows is commendable and absolutely necessary. The 2.5% 'confusion/realization' labels and the moderate stability are not mere statistical curiosities but quantifiable markers of failure modes that could cause significant harm in a crisis. The discussion of the Character AI lawsuit powerfully underscores this. It highlights that correctness and safety must take precedence over efficiency in mental health AI.
The proposed multi-agent model (validation agent, risk assessment agent) offers a promising technological solution to enhance safety, but it still relies on predefined benchmarks and rules, which might struggle with novel or highly complex clinical situations. A critical area for improvement lies in developing AI that can articulate its uncertainty or limitations transparently to the user and seamlessly escalate to human oversight when appropriate, rather than hallucinating or providing potentially harmful advice.
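To make this design direction concrete, the following is a purely illustrative sketch of such a gate: a validation check and a keyword-based risk check sit between the model and the student, and the conversation escalates to a human counselor when either check fails. All function names, thresholds, and keywords are hypothetical rather than the authors' implementation, and a real deployment would require clinically validated risk detection rather than keyword matching.

```python
# Illustrative sketch only: a human-in-the-loop gate combining a validation
# check and a risk-assessment check before a reply reaches the student.
# All names, thresholds, and keywords are hypothetical, not the paper's system.
from dataclasses import dataclass

RISK_KEYWORDS = {"suicide", "kill myself", "self-harm", "hurt myself"}   # placeholder list
BENCHMARK_PHRASES = {"it sounds like", "it's understandable", "you're not alone"}

@dataclass
class GateDecision:
    deliver: bool   # send the AI reply to the student?
    escalate: bool  # route the case to a human counselor?
    reason: str

def risk_agent(student_message: str) -> bool:
    """Crude keyword screen; a real system needs validated risk models."""
    text = student_message.lower()
    return any(keyword in text for keyword in RISK_KEYWORDS)

def validation_agent(ai_reply: str) -> bool:
    """Check the reply against simple warmth/empathy benchmark phrases."""
    text = ai_reply.lower()
    return any(phrase in text for phrase in BENCHMARK_PHRASES)

def gate(student_message: str, ai_reply: str) -> GateDecision:
    if risk_agent(student_message):
        return GateDecision(False, True, "risk indicators: human counselor required")
    if not validation_agent(ai_reply):
        return GateDecision(False, True, "reply failed empathy benchmark: human review")
    return GateDecision(True, False, "passed validation and risk checks")

decision = gate("I feel alone and inferior around my classmates.",
                "It sounds like you're carrying a heavy sense of isolation...")
print(decision)
```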
The concept of organizational readiness is also crucial. Deploying AI in schools or clinical settings requires not just technological solutions but comprehensive frameworks for training, oversight, accountability, and regulatory compliance. Without these, even the most sophisticated AI risks doing more harm than good.
Inspirationally, this paper demonstrates a rigorous method for evaluating AI in a sensitive domain, pushing for evidence-based deployment. Its findings can guide the design of AI systems that are genuinely assistive, perhaps starting with low-intensity support and prevention, which can free up human therapists for high-acuity cases. The research contributes significantly to the ongoing discourse on how to harness the benefits of AI for mental health while vigilantly mitigating its considerable risks.