Exploring ChatGPT's Capabilities, Stability, Potential and Risks in Conducting Psychological Counseling through Simulations in School Counseling
TL;DR Summary
The study explores ChatGPT-4's capabilities in simulated school counseling, showing high warmth (97.5%), empathy (94.2%), and moderate stability (ICC 0.62), highlighting the need for human oversight. Future research should involve real users and multiple model comparisons.
Abstract
This study provides an exploratory analysis of ChatGPT-4's quantitative performance indicators in simulated school-counseling settings. Conversational artificial intelligence (AI) has shown strong capabilities in providing low-cost and timely interventions for a wide range of people and increasing well-being. Therefore, this study examined ChatGPT's capabilities, including response stability in conducting psychological counseling and its potential for providing accessible psychological interventions, especially in school settings. We prompted ChatGPT-4 with 80 real-world college-student counseling questions. Replies were quantified with APA-informed NLP tools to measure warmth, empathy, and acceptance, and run-to-run stability was assessed via Fleiss' κ and ICC(2,1). ChatGPT-4 achieved high warmth (97.5%), empathy (94.2%), and positive acceptance (mean compound score = 0.93 ± 0.19), with moderate stability (ICC(2,1) = 0.62; κ = 0.59). Occasional randomness in responses highlights risk areas requiring human oversight. As an offline, single-model text simulation without clinical validation, these results remain exploratory. Future work should involve live users, compare multiple LLMs, and incorporate mixed-methods validation to assess real-world efficacy and safety. The findings suggest ChatGPT-4 could augment low-intensity mental-health support in educational settings, guiding the design of human-in-the-loop workflows, policy regulations, and product roadmaps. This is among the first exploratory studies to apply quantitative stability metrics and NLP-based emotion detection to ChatGPT-4 in a school-counseling context and to integrate a practitioner's perspective to inform future research, product development, and policy.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Exploring ChatGPT's Capabilities, Stability, Potential and Risks in Conducting Psychological Counseling through Simulations in School Counseling
1.2. Authors
- Yanzhuoao Yang: Affiliation not explicitly stated in the provided abstract or full text, but the contact email yn2413@columbia.edu suggests a Columbia University affiliation. The Practitioner Testimony section indicates the author is Founder & Product Lead of an AI mental-health startup and an independent researcher.
- Yixin Chu: Affiliation not explicitly stated.
1.3. Journal/Conference
The paper is published in Mental Health and Digital Technologies (author-accepted manuscript DOI: 10.1108/MHDT-02-2025-0013). Mental Health and Digital Technologies is a specialized journal, suggesting the paper targets an audience interested in the intersection of technology and mental health, including researchers, practitioners, and policymakers in digital health and AI ethics.
1.4. Publication Year
The Published at (UTC) timestamp is 2025-11-03T17:39:57.000Z, indicating a publication year of 2025. The Author Accepted Manuscript (AAM) also states Mental Health and Digital Technologies 2025.
1.5. Abstract
This exploratory study quantitatively analyzes ChatGPT-4's performance in simulated school-counseling settings. Leveraging conversational artificial intelligence (AI) for its potential to offer low-cost, timely mental health interventions, the research assesses ChatGPT's capabilities, including response stability, using 80 real-world college-student counseling questions. Responses were evaluated for warmth, empathy, and acceptance using American Psychological Association (APA)-informed Natural Language Processing (NLP) tools. Run-to-run stability was quantified using Fleiss' Kappa and Intraclass Correlation Coefficient (ICC(2,1)). ChatGPT-4 demonstrated high warmth (97.5%), empathy (94.2%), and positive acceptance (mean compound score = 0.93 ± 0.19), with moderate stability (ICC(2,1) = 0.62; Kappa = 0.59). The occasional randomness in responses highlights risks requiring human oversight. As an offline, single-model text simulation without clinical validation, the results are exploratory. Future work should involve live users, compare multiple Large Language Models (LLMs), and incorporate mixed-methods validation for real-world efficacy and safety. The findings suggest ChatGPT-4 could augment low-intensity mental-health support in educational settings, guiding the design of human-in-the-loop workflows, policy regulations, and product roadmaps. This study is among the first to apply quantitative stability metrics and NLP-based emotion detection to ChatGPT-4 in a school-counseling context and integrate a practitioner's perspective.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2511.01788 (arXiv preprint)
- PDF Link: https://arxiv.org/pdf/2511.01788v1.pdf
- Publication Status: The paper is available as a preprint on arXiv and has been accepted as an Author Accepted Manuscript (AAM) for publication in Mental Health and Digital Technologies in 2025.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a significant global challenge: the widespread lack of access to mental healthcare. This issue is particularly acute due to factors like a scarcity of trained providers, high costs, and societal stigma. College students face unique stressors, including academic pressure, social strain, and the transition to adulthood, making timely and effective mental health support crucial for this demographic.
The core problem the paper aims to solve is how to leverage emerging technologies, specifically conversational Artificial Intelligence (AI) like ChatGPT, to increase access to mental health care, particularly in underserved settings such as schools. Digital health tools have already shown efficacy and expanded access for mild-to-moderate conditions, and the advent of sophisticated conversational AIs like ChatGPT presents a new paradigm for scalable mental health support.
The paper's entry point is an exploratory analysis of ChatGPT-4's performance in a simulated school-counseling environment. It seeks to characterize the capabilities, limitations, and real-world risks of AI chatbots in providing mental health care, moving beyond general feasibility discussions to quantify specific therapeutic qualities and response stability. The tragic case of a teenager's suicide allegedly precipitated by an unhealthy dependency on a chatbot (Character AI lawsuit) underscores the urgent need for rigorous evaluation and safeguards for AI-driven relational products.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Quantitative Performance Evaluation: It provides an exploratory quantitative analysis of ChatGPT-4's performance in simulated school-counseling scenarios, specifically measuring warmth, empathy, and acceptance using American Psychological Association (APA)-informed Natural Language Processing (NLP) tools.
- Stability Metrics Application: It is among the first studies to apply quantitative stability metrics (Fleiss' Kappa and ICC(2,1)) to assess the run-to-run consistency of ChatGPT-4's responses in a counseling context, highlighting the practical reliability of the model.
- Practitioner's Perspective Integration: It integrates a practitioner's perspective to interpret the technical findings, informing future research, product development, and policy regulations for AI in mental health.
- Risk Identification: It identifies and quantifies risk areas, such as occasional randomness and non-positive responses (e.g., 'confusion'), emphasizing the need for human oversight and guardrails.

The key conclusions and findings include:
- High Therapeutic Qualities: ChatGPT-4 generated responses with high levels of warmth (97.5%), empathy (94.2%), and strongly positive acceptance (mean compound score = 0.93 ± 0.19). This suggests the model can consistently produce emotionally supportive and reassuring content.
- Moderate Stability: The run-to-run stability of responses was moderate (ICC(2,1) = 0.62; Kappa = 0.59), indicating that while generally consistent, responses to identical prompts still vary.
- Identified Risks: A small but notable fraction (2.5%) of responses were labeled as 'confusion' or 'realization', and sentiment drift was observed, highlighting the potential for unforeseeable harm or misinformation in sensitive contexts.
- Potential for Augmentation: The findings suggest that ChatGPT-4 could effectively augment low-intensity mental health support in educational settings, particularly for psycho-education, homework review, or check-in messages.
- Need for Human-in-the-Loop: Due to the identified risks and variability, the paper strongly advocates for human-in-the-loop workflows, where AI drafts are reviewed and approved by licensed clinicians, and for robust risk management and governance frameworks.

These findings contribute to understanding the practical utility and safety considerations for deploying advanced AI in mental health, especially within school environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a reader should be familiar with several key concepts across artificial intelligence, natural language processing, psychology, and statistics.
- Conversational Artificial Intelligence (AI) / Large Language Models (LLMs):
  - Definition: Conversational AI refers to technologies, like chatbots or virtual assistants, that can understand and respond to human language in a natural, human-like way. Large Language Models (LLMs) are AI programs trained on massive amounts of text data from the internet; this training allows them to understand, generate, and process human language for tasks such as answering questions, writing essays, or engaging in dialogue. ChatGPT-4 is a specific, highly advanced LLM developed by OpenAI.
  - Relevance: The entire study evaluates ChatGPT-4's performance, so understanding what LLMs are and their general capabilities is fundamental.
- Natural Language Processing (NLP):
  - Definition: Natural Language Processing (NLP) is a branch of AI that gives computers the ability to understand, interpret, and generate human language. It involves techniques for analyzing text data to extract meaning, identify sentiment, recognize entities, and more.
  - Relevance: The paper uses NLP tools (EmoRoBERTa, a neural network empathy model, and VADER) to quantify warmth, empathy, and acceptance from ChatGPT's text responses.
- Sentiment Analysis:
  - Definition: Sentiment analysis, or opinion mining, is an NLP technique used to determine the emotional tone behind a piece of text. It categorizes text as positive, negative, or neutral, and often provides a score indicating the strength of that sentiment.
  - Relevance: The VADER model (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool used in the paper to measure acceptance by analyzing the compound score of ChatGPT's responses.
- Emotion Detection:
  - Definition: Emotion detection is an NLP task that aims to identify and categorize specific human emotions (e.g., joy, anger, sadness, caring) expressed in text or speech.
  - Relevance: The EmoRoBERTa model is a pre-trained transformer-based model used in the paper for emotion detection to quantify warmth in ChatGPT's responses.
- Empathy Detection:
  - Definition: Empathy detection in NLP involves identifying linguistic cues that signal an understanding or sharing of another person's feelings or experiences within a text.
  - Relevance: The paper uses a neural network model specifically trained for empathy detection to measure the presence of empathy in ChatGPT's replies.
- Common Factors Theory (in Counseling):
  - Definition: In psychotherapy, common factors theory posits that the effectiveness of different therapeutic approaches is largely due to elements common across all successful therapies, rather than specific techniques unique to each. Key common factors include the therapeutic relationship (alliance), empathy, warmth, and acceptance.
  - Relevance: This theory directly informs the paper's choice of warmth, empathy, and acceptance as the primary metrics for evaluating ChatGPT's counseling capabilities, as these are crucial for fostering a positive therapeutic relationship.
- Parasocial Interaction (PSI):
  - Definition: Parasocial interaction (PSI) describes a one-sided psychological relationship experienced by an audience member with a media persona (e.g., a celebrity, a fictional character). The audience feels an emotional connection and intimacy with the persona, even though the relationship is not reciprocal. With conversational AI, this becomes a two-way, on-demand bond.
  - Relevance: The paper uses PSI theory as a theoretical foundation to understand how users might form bonds with chatbots and how warmth, empathy, and acceptance contribute to perceived intimacy and responsiveness, which are principal drivers of user satisfaction. It also highlights the "darker corollary" that this intimacy can amplify the harm of misinformation from a chatbot.
- Intraclass Correlation Coefficient (ICC):
  - Definition: The Intraclass Correlation Coefficient (ICC) is a statistical measure used to assess the reliability of ratings or measurements. It quantifies the degree of similarity of measurements made on the same subjects, or in this case, the run-to-run reliability of ChatGPT's responses. An ICC of 1 indicates perfect agreement, while 0 indicates no agreement.
  - Mathematical Formula (for ICC(2,1) as used in the paper):
    $
    \mathrm{ICC}(2,1) = \frac{\mathrm{MS_R} - \mathrm{MS_E}}{\mathrm{MS_R} + (k-1)\mathrm{MS_E} + k(\mathrm{MS_C} - \mathrm{MS_E})/n}
    $
  - Symbol Explanation:
    - $\mathrm{MS_R}$: Mean square for rows (subjects/queries in this case).
    - $\mathrm{MS_E}$: Mean square for error (residual variance).
    - $\mathrm{MS_C}$: Mean square for columns (raters/runs in this case).
    - $k$: Number of ratings per subject (number of runs, which is 3 in this study).
    - $n$: Number of subjects (number of queries, which is 80 in this study).
  - Relevance: Used to assess the stability of ChatGPT's continuous sentiment outputs (compound, negativity, positivity, and neutral scores); see the sketch below.
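To make the ICC(2,1) calculation concrete, here is a minimal sketch (not the authors' code) that derives the coefficient directly from the two-way ANOVA mean squares; the 80 × 3 score matrix is simulated purely for illustration.

```python
# Minimal sketch: ICC(2,1) for 80 queries x 3 repeated ChatGPT runs,
# computed from a two-way ANOVA decomposition. `compound` is simulated data.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """scores has shape (n_subjects, k_raters); here (80 queries, 3 runs)."""
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-query means
    col_means = scores.mean(axis=0)   # per-run means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_error / ((n - 1) * (k - 1))

    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Example with simulated VADER compound scores in [-1, 1]
rng = np.random.default_rng(0)
compound = np.clip(rng.normal(0.9, 0.1, size=(80, 3)), -1, 1)
print(f"ICC(2,1) = {icc_2_1(compound):.2f}")
```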
- Fleiss' Kappa ($\kappa$):
  - Definition: Fleiss' Kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters (or measurement instances) when assigning categorical ratings to a number of items or subjects. Unlike simple percent agreement, Kappa accounts for the possibility of agreement occurring by chance.
  - Mathematical Formula:
    $
    \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
    $
  - Symbol Explanation:
    - $\bar{P}$: The observed agreement probability (average agreement across all items and categories).
    - $\bar{P}_e$: The chance agreement probability (average agreement expected by chance).
  - Relevance: Used to assess the stability of empathy detection, which is a binary categorical outcome (empathy detected/not detected); see the sketch below.
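A minimal sketch of the corresponding Fleiss' kappa computation, assuming the statsmodels implementation and a simulated 80 × 3 matrix of binary empathy labels (not the authors' code):

```python
# Minimal sketch: run-to-run agreement of binary empathy labels across the
# three ChatGPT runs, via Fleiss' kappa. `labels` is simulated for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
labels = rng.choice([0, 1], size=(80, 3), p=[0.06, 0.94])  # 0 = no empathy, 1 = empathy

# aggregate_raters converts per-run labels into per-item category counts
counts, _categories = aggregate_raters(labels)

# Note: kappa can be modest even with high raw agreement when one category
# dominates, which is relevant to the heavily positive empathy labels here.
print(f"Fleiss' kappa = {fleiss_kappa(counts, method='fleiss'):.2f}")
```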
- Chi-Square Test for Independence ($\chi^2$):
  - Definition: The Chi-square test for independence is a statistical test used to determine if there is a significant association between two categorical variables. In this context, it checks if the distribution of emotion categories is independent of the response run (i.e., whether the emotion distribution varies significantly across the three responses for each query).
  - Mathematical Formula:
    $
    \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
    $
  - Symbol Explanation:
    - $R$: Number of rows (categories of the first variable).
    - $C$: Number of columns (categories of the second variable).
    - $O_{ij}$: Observed frequency in cell $(i, j)$.
    - $E_{ij}$: Expected frequency in cell $(i, j)$ under the assumption of independence.
  - Relevance: Used to describe whether there were significant differences in emotion category distribution across the three responses; see the sketch below.
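This test can be reproduced descriptively from the emotion counts reported in Table 1 of the paper; the sketch below (not the authors' code) uses scipy. Several cells have very small expected counts, so the result is illustrative only.

```python
# Minimal sketch: chi-square test of independence on emotion-category counts
# per run, using the Table 1 counts reported in the paper.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Approval, Caring, Realization, Confusion; columns: Answer 1-3 (Table 1)
observed = np.array([
    [14, 21, 18],
    [64, 57, 60],
    [ 1,  1,  2],
    [ 1,  1,  0],
])
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```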
- One-Way Analysis of Variance (ANOVA):
  - Definition: One-Way ANOVA is a statistical method used to compare the means of two or more groups (or conditions) to determine if there is a statistically significant difference between them. It is used when there is one categorical independent variable (factor) and one continuous dependent variable.
  - Mathematical Formula (for the F-statistic):
    $
    F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
    = \frac{\mathrm{SS}_{\text{between}} / (k-1)}{\mathrm{SS}_{\text{within}} / (N-k)}
    $
  - Symbol Explanation:
    - $F$: The F-statistic.
    - $\mathrm{MS}_{\text{between}}$: Mean square between groups (variance among group means).
    - $\mathrm{MS}_{\text{within}}$: Mean square within groups (variance within each group).
    - $\mathrm{SS}_{\text{between}}$: Sum of squares between groups.
    - $\mathrm{SS}_{\text{within}}$: Sum of squares within groups.
    - $k$: Number of groups (number of response runs, which is 3).
    - $N$: Total number of observations (number of queries × number of runs).
  - Relevance: Used to assess differences in the average composite sentiment scores among the three responses, checking for systematic drift in emotional tone; see the sketch below.
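A minimal sketch of this drift check (not the authors' code), using scipy's one-way ANOVA on simulated per-run compound scores whose means mirror the values reported later in Table 2:

```python
# Minimal sketch: one-way ANOVA comparing mean VADER compound scores across
# the three response runs. Score arrays are simulated; each run has 80 scores.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
run1, run2, run3 = (np.clip(rng.normal(mu, 0.15, 80), -1, 1) for mu in (0.90, 0.94, 0.94))

f_stat, p_value = f_oneway(run1, run2, run3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # non-significant p => no systematic drift
```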
- Pearson's Correlation Coefficient (Pearson's r):
  - Definition: Pearson's r is a measure of the linear correlation between two continuous variables. It ranges from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
  - Mathematical Formula:
    $
    r = \frac{n \sum(xy) - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}
    $
  - Symbol Explanation:
    - $n$: Number of paired observations.
    - $\sum(xy)$: Sum of the products of the paired scores.
    - $\sum x$: Sum of the x scores (e.g., question word counts).
    - $\sum y$: Sum of the y scores (e.g., response word counts).
    - $\sum x^2$: Sum of the squared x scores.
    - $\sum y^2$: Sum of the squared y scores.
  - Relevance: Used to explore whether longer questions tended to elicit longer answers by correlating question and response word counts; see the sketch below.
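A minimal sketch of this correlation step (not the authors' code); the word counts below are simulated for illustration:

```python
# Minimal sketch: correlate question length with average response length.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
question_words = rng.integers(20, 200, size=80)
avg_response_words = 250 + 0.8 * question_words + rng.normal(0, 40, size=80)

r, p = pearsonr(question_words, avg_response_words)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```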
3.2. Previous Works
The paper frames its research within a broader context of digital mental health and conversational AI.
- Digital Health Tools Efficacy: Studies like Areán (2021), Henson et al. (2019), and Lattie et al. (2022) are cited for demonstrating the efficacy of digital health tools (computerized CBT, mental-health apps, teletherapy) comparable to face-to-face interventions for mild-to-moderate conditions, expanding access amid provider shortages. The COVID-19 pandemic accelerated this shift (Friis-Healy et al., 2021; Prescott et al., 2022).
- Rise of Conversational AI: The public debut of ChatGPT in late 2022 introduced a new paradigm (Wu et al., 2023; Roumeliotis & Tselikas, 2023), with early adopters reporting applications from homework tutoring to health triage (Biswas, 2023; Su et al., 2022). Policy bodies are exploring hybrid models combining AI check-ins with human oversight (Vilaza & McCashin, 2021; Rollwage et al., 2022).
- AI Strengths in Mental Health: Previous research highlighted AI's potential in mental health assessment, detection, diagnosis, operation support, treatment, and counseling (Rollwage et al., 2023; Danieli et al., 2022; Trappey et al., 2022). Chatbots like Tess have shown efficacy in reducing depression and anxiety symptoms (Fulmer et al., 2018).
- Nuance and Limitations of AI Empathy:
  - Huang et al. (2024) found that GPT-4 matches human therapists on surface-level warmth and reflection but underperforms on depth of emotional processing, often resorting to formulaic reassurance.
  - Elyoseph et al. (2023) showed that while ChatGPT outperforms human norms in standard emotional-awareness tests, its responses often lack personalized probing and adaptive questioning. Still, Elyoseph et al. (2023) and Heston (2023) found that ChatGPT demonstrated significantly higher emotional awareness than human norms and could provide useful intervention for low- and medium-risk conditions.
- Risks and Inconsistencies:
  - Potential harm from malfunctions and misinterpretations is a serious concern (Hamdoun et al., 2023; Shaik et al., 2022; Trappey et al., 2022; Kapoor & Goel, 2022; Heston, 2023; Farhat, 2024).
  - Trappey et al. (2022) noted that slight input changes can lead to unpredictable variations in AI responses.
  - Farhat (2024) highlighted that altering or repeating prompts can lead to harmful suggestions.
  - Wang et al. (2023) demonstrated that GPT-4 can hallucinate client details or omit critical information when prompts exceed about 3,000 tokens, indicating limited working memory.
  - The 2024 wrongful-death lawsuit against Character AI after a teenager's suicide, allegedly precipitated by chatbot guidance (Frenkel & Hirsh, 2024), underscores the severe risks of PSI-amplified errors.
- Bias and Complexity: Biases in AI algorithms may perpetuate societal biases (Sharma et al., 2022; Tutun et al., 2023). Increased context complexity can lead to worse results, making ChatGPT unsuitable for complex mental health interventions (Dergaa et al., 2024).
3.3. Technological Evolution
The evolution of technology in mental health has progressed from early computerized Cognitive Behavioral Therapy (CBT) and mental-health apps to fully remote teletherapy, which gained significant traction during the COVID-19 pandemic. This shift enabled expanded access for mild-to-moderate conditions. The introduction of advanced conversational AI like ChatGPT in late 2022 marked a new phase, providing accessible, end-user-friendly tools for various applications, including preliminary mental health support. Concurrently, specialized next-generation mental-health chatbots (e.g., Hailey, MYLO, Limbic Access) have emerged, integrating advanced Natural Language Processing (NLP) pipelines for emotion detection, sentiment analysis, and multi-turn dialogue to deliver psychoeducational content and self-help exercises. This paper's work fits into the current state of exploring the capabilities and safety of these advanced LLMs for mental health, specifically in school counseling. It seeks to bridge the gap between general AI capabilities and specific therapeutic metrics, quantifying both performance and reliability, while acknowledging the critical need for safeguards in this rapidly evolving landscape.
3.4. Differentiation Analysis
Compared to main methods in related work, this paper offers several core differences and innovations:
- Quantitative Stability Focus: While previous research discussed the general feasibility of ChatGPT in mental health or its emotional awareness (e.g., Elyoseph et al., Huang et al.), this study specifically focuses on inherent capabilities and response stability by quantifying three APA-defined therapeutic metrics (warmth, empathy, acceptance) and, crucially, using Fleiss' Kappa and ICC(2,1) to measure run-to-run consistency for identical prompts. This provides a more rigorous understanding of reliability.
- School Counseling Context: The study grounds its simulations in school counseling scenarios using real-world college-student counseling questions. This offers context-specific insights not broadly covered in general mental health AI research, addressing a specific, high-need demographic.
- NLP-based Emotion Detection with APA Benchmarks: It rigorously applies APA-informed NLP tools (EmoRoBERTa, a neural network empathy model, VADER) to quantify specific therapeutic qualities, linking AI performance directly to established counseling benchmarks.
- Integration of Practitioner's Perspective: The inclusion of practitioner testimony provides a unique, real-world lens on the implications of the findings for technical design, product workflows, human-AI collaboration, organizational practices, and regulatory frameworks, bridging research with practical application and policy.
- Emphasis on Risks and Randomness: The paper explicitly highlights randomness and instability as critical risk factors, aiming to reveal threats in using AI in direct clinical environments rather than just showcasing general performance. This is a more cautious, safety-focused approach than many prior exploratory studies.
4. Methodology
4.1. Principles
The core idea behind the method used in this paper is to conduct text-based simulations of real-world counseling interactions to explore the capabilities and response stability of Large Language Model (LLM) chatbots, specifically ChatGPT-4, in a school-counseling context. The theoretical basis is rooted in the common factors theory of counseling, which identifies warmth, empathy, and acceptance as crucial interpersonal skills for effective therapists (APA, 2013; Castonguay & Hill, 2012; Wampold et al., 2017). The intuition is that if an AI can consistently demonstrate these qualities, it holds potential for supporting mental health. Additionally, drawing on Parasocial Interaction (PSI) theory, the study aims to understand how warmth, empathy, and acceptance in AI responses can foster perceived intimacy and responsiveness, which are key drivers of user satisfaction and emotional support.
The methodology also emphasizes quantifying the stability and consistency of AI responses. This addresses a critical concern about randomness in AI outputs, especially in sensitive clinical contexts where unpredictable variations or hallucinations could lead to harm. By submitting identical prompts multiple times and using statistical measures like Fleiss' Kappa and Intraclass Correlation Coefficient (ICC), the study aims to provide an initial picture of ChatGPT's capacity to deliver emotionally supportive and stable replies. The overall approach combines established psychological benchmarks with computational social science techniques (NLP for sentiment and emotion analysis) to objectively evaluate AI's potential and risks.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Study Design and LLM Selection
The study employed text-based simulations using real-world queries.
- Data Source Identification: The first step involved identifying an authentic data source of counseling questions.
- LLM Selection: ChatGPT-4 (model 0613, accessed 15 July 2024) was chosen as the Large Language Model (LLM). The rationale for this choice was its widespread adoption and transparent documentation at the time of data collection.
- Response Collection: For each identified question, three responses were collected from ChatGPT's online application. This repeated querying was crucial for assessing run-to-run stability.
- Analysis Focus: The subsequent analysis focused on quantitatively demonstrating how well the responses conveyed warmth, empathy, and acceptance. A secondary aim was to characterize the degree of response randomness across the three repetitions of identical prompts.
- Tools: All analyses utilized publicly available Natural Language Processing (NLP) tools for illustrative purposes rather than confirmatory testing.
- Interpretation: The quantitative results were then used to discuss ChatGPT's capabilities and risks in real-world applications, with descriptive insights provided by two authors with backgrounds in mental health, technological innovations, and computer science.
4.2.2. Data Collection
The study utilized a secondary data source from the ChatCounselor research (Liu et al., 2023). This dataset consists of diverse queries related to adolescent psychological issues, originally collected in Chinese and then translated into English.
- Dataset Composition: The dataset comprises queries from 80 different students, covering topics such as academic stress, family, and intimate relationships. The variety in topics, tones, and lengths was considered advantageous for testing ChatGPT's performance stability.
- Prompt Engineering: To ensure ChatGPT's responses closely resembled real counseling sessions, a specific prompt was used: "Imagine you are a counselor, and you need to give a response just as in a counseling session. You need to give a response in the same format as a professional counselor. According to the APA, an effective therapist has abilities including verbal fluency, warmth, acceptance, empathy, and an ability to identify how a patient is feeling." A query from the dataset was then appended to this prompt. This prompt design aimed to provide necessary context and expectations for an objective evaluation.
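The authors collected responses through the ChatGPT web application; for readers who want to reproduce the repeated-prompt protocol programmatically, a hypothetical sketch via the OpenAI Python SDK could look like the following. The counselor prompt is quoted from the paper, while the API usage and setup are assumptions, not the authors' pipeline.

```python
# Hypothetical reproduction sketch: send the same counselor prompt plus student
# query three times per question through the OpenAI Python SDK (>= 1.0).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

COUNSELOR_PROMPT = (
    "Imagine you are a counselor, and you need to give a response just as in a "
    "counseling session. You need to give a response in the same format as a "
    "professional counselor. According to the APA, an effective therapist has "
    "abilities including verbal fluency, warmth, acceptance, empathy, and an "
    "ability to identify how a patient is feeling."
)

def collect_responses(question: str, n_runs: int = 3) -> list[str]:
    """Collect n_runs independent replies to the same counseling query."""
    replies = []
    for _ in range(n_runs):
        completion = client.chat.completions.create(
            model="gpt-4-0613",  # the model version named in the paper
            messages=[
                {"role": "system", "content": COUNSELOR_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        replies.append(completion.choices[0].message.content)
    return replies
```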
4.2.3. Warmth (Emotion Detection)
- Metric Definition: Warmth was defined as the ability of the AI to create a welcoming and supportive context, as per APA benchmarks.
- Tool: The EmoRoBERTa model (Kamath et al., 2022) was utilized for emotion detection. This is a pre-trained transformer-based model designed to identify 28 distinct emotions.
- Emotions Detected by EmoRoBERTa: Admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and neutrality.
- Application: The EmoRoBERTa model was applied to each of ChatGPT's responses to classify the primary emotion.
- Operationalization of Warmth: The study combined the caring and approval categories detected by EmoRoBERTa to quantify emotional warmth (a code sketch follows this list).
4.2.4. Empathy (Empathy Detection)
- Metric Definition: Empathy was assessed based on the AI's ability to understand and mirror students' feelings and experiences, as per APA benchmarks.
- Tool: A neural network model specifically trained to detect empathy in text (Sharma et al., 2020) was adopted. This model was trained on a dataset of empathetic and non-empathetic text.
- Output: The model outputs a binary label: 1 for responses containing empathy and 0 for those that do not.
- Application: This model was applied to measure the levels of empathy in ChatGPT's responses.
4.2.5. Acceptance (Sentiment Analysis)
- Metric Definition: Acceptance investigates whether the AI can demonstrate unconditional positive regard and a nonjudgmental attitude toward students, as per APA benchmarks.
- Tool: The Valence Aware Dictionary and sEntiment Reasoner (VADER) model (Hutto & Gilbert, 2014) was used for sentiment analysis.
- Output: VADER provides four scores for each text: negative (neg), neutral (neu), positive (pos), and a comprehensive sentiment score (compound).
- Operationalization of Acceptance: The compound score was used to evaluate the overall emotional tone, with higher positive scores indicating a higher level of acceptance (a code sketch follows this list).
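A minimal sketch of the VADER step (not the authors' code), using the open-source vaderSentiment package:

```python
# Minimal sketch: score a reply with VADER and use the compound value as the
# acceptance indicator, as described in the paper.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

reply = ("It's understandable to feel anxious in new social situations, "
         "and acknowledging that is a brave first step.")
scores = analyzer.polarity_scores(reply)
# scores contains 'neg', 'neu', 'pos', and 'compound' (normalized to [-1, 1])
print(f"compound = {scores['compound']:.2f}")
```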
4.2.6. Stability and Consistency Evaluation
To produce a descriptive estimate of the stability and consistency of ChatGPT's responses, several statistical methods were employed:
- Fleiss' Kappa ($\kappa$) for Empathy:
  - Purpose: To assess the inter-rater reliability (or run-to-run consistency in this case) of the binary empathy detection (0 or 1).
  - Mathematical Formula:
    $
    \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
    $
  - Symbol Explanation:
    - $\bar{P}$: The observed agreement probability, calculated as the average proportion of agreements across all categories and subjects.
    - $\bar{P}_e$: The chance agreement probability, calculated as the sum of squared proportions of assignments for each category by chance.
- Intraclass Correlation Coefficient (ICC(2,1)) for Sentiment Scores:
  - Purpose: To quantify the run-to-run reliability of ChatGPT's continuous sentiment outputs (compound, negativity, positivity, and neutral scores). A two-way random-effects, absolute-agreement, single-measurement ICC was chosen to estimate whether any subsequent call would reproduce the same absolute score pattern.
  - Mathematical Formula (for ICC(2,1)):
    $
    \mathrm{ICC}(2,1) = \frac{\mathrm{MS_R} - \mathrm{MS_E}}{\mathrm{MS_R} + (k-1)\mathrm{MS_E} + k(\mathrm{MS_C} - \mathrm{MS_E})/n}
    $
  - Symbol Explanation:
    - $\mathrm{MS_R}$: Mean square for rows (representing the 80 individual counseling queries).
    - $\mathrm{MS_E}$: Mean square for error (the random variability or residual variance).
    - $\mathrm{MS_C}$: Mean square for columns (representing the 3 repeated runs of ChatGPT for each query).
    - $k$: Number of ratings per subject (3, for the three responses).
    - $n$: Number of subjects (80, for the 80 queries).
- Chi-Square Test for Emotion Category Distribution:
  - Purpose: To determine whether the emotion-category distribution (as detected by EmoRoBERTa) differed significantly across the three responses.
  - Mathematical Formula:
    $
    \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
    $
  - Symbol Explanation:
    - $\chi^2$: The Chi-square test statistic.
    - $R$: Number of rows in the contingency table (emotion categories).
    - $C$: Number of columns in the contingency table (the three response runs).
    - $O_{ij}$: The observed frequency of an emotion category in a specific response run.
    - $E_{ij}$: The expected frequency of that emotion category in that response run, assuming independence between emotion category and response run.
- One-Way ANOVA for Composite Sentiment Scores:
  - Purpose: To assess whether there were statistically significant differences in the average composite sentiment scores (VADER compound scores) among the three responses, checking for systematic drift in emotional tone.
  - Mathematical Formula (for the F-statistic):
    $
    F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
    = \frac{\mathrm{SS}_{\text{between}} / (k-1)}{\mathrm{SS}_{\text{within}} / (N-k)}
    $
  - Symbol Explanation:
    - $F$: The F-statistic, the ratio of the variance between groups to the variance within groups.
    - $\mathrm{MS}_{\text{between}}$: Mean square between groups, representing the variability among the means of the three response runs.
    - $\mathrm{MS}_{\text{within}}$: Mean square within groups, representing the variability within each response run.
    - $\mathrm{SS}_{\text{between}}$: Sum of squares between groups.
    - $\mathrm{SS}_{\text{within}}$: Sum of squares within groups.
    - $k$: Number of groups (3, the three response runs).
    - $N$: Total number of observations (number of queries multiplied by 3 runs).
4.2.7. Correlation Analysis
- Purpose: To explore whether the length of the input question influenced the length of ChatGPT's responses, addressing a factor potentially contributing to randomness.
- Method: Pearson's r (Pearson correlation coefficient) was calculated between the word count of the input questions and the average word count of the corresponding responses.
- Mathematical Formula:
  $
  r = \frac{n \sum(xy) - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}
  $
- Symbol Explanation:
  - $r$: Pearson's correlation coefficient.
  - $n$: The number of pairs of data points (80 queries).
  - $\sum(xy)$: The sum of the products of the paired question word counts ($x$) and average response word counts ($y$).
  - $\sum x$: The sum of all question word counts.
  - $\sum y$: The sum of all average response word counts.
  - $\sum x^2$: The sum of the squares of all question word counts.
  - $\sum y^2$: The sum of the squares of all average response word counts.
4.2.8. Ethical Considerations
The study adhered to ethical standards by ensuring anonymization and using publicly available data.
- No Human/Animal Subjects: The study did not involve direct interactions with human or animal subjects.
- De-identified Data: The data came from the publicly released ChatCounselor corpus (Liu et al., 2023), hosted on GitHub/Hugging Face, which was fully de-identified. User names, dates, IP logs, and any HIPAA- or GDPR-protected identifiers were removed by the original curators.
- Terms of Service: The forum's terms of service permitted non-commercial redistribution for research, and the dataset was distributed under an open license.
- IRB Exemption: Based on U.S. federal regulations (45 CFR 46.102 and 46.104(d)(4)), which exclude publicly available, de-identified data from the definition of human-subjects research, the study determined that it was Not-Human-Subjects Research and did not require Institutional Review Board (IRB) review.
- No Re-identification/Linkage: No re-identification was attempted, no linkage with other data sources occurred, and no contact was made with the original posters.
4.2.9. Methodological Limitations
The authors explicitly acknowledge several limitations:
- Offline Text Simulations: The study was based entirely on offline text simulations using eighty previously posted student queries fed to a single model version (GPT-4-0613, accessed 15 July 2024). The results are therefore descriptive flags, not confirmatory tests, and may not generalize to other LLM versions or chatbots.
- No Live User Interaction: No live users or clinicians interacted with the system, so the study cannot speak to real-time usability, safety, or conversational dynamics.
- External Validity and Measurement: Warmth, empathy, and acceptance were inferred by machine-learning models rather than human raters, and crucial clinical outcomes (e.g., symptom change, client satisfaction) were not observed. The findings are thus exploratory and hypothesis-generating, not evidence of effectiveness; rigorous user studies, multi-model replications, and mixed-methods validation are needed.
- Lack of Direct Stakeholder Input: The study did not collect usability or acceptability feedback from adolescent users or professional school counselors, leaving open questions about how the automated scores translate into real-world experience. Future work should include interviews and focus groups with these stakeholders.
5. Experimental Setup
5.1. Datasets
- Source: The secondary data for this study were sourced from the ChatCounselor research (Liu et al., 2023). This corpus is publicly available on GitHub/Hugging Face.
- Characteristics and Domain: The dataset consists of 80 real-world college-student counseling questions. These questions were originally collected in Chinese and then translated into English for the purpose of this study. The queries are diverse, spanning various adolescent psychological issues, including academic stress, family relationships, and intimate relationships.
- Data Sample Example (from Table 3 of the original paper):
- Question: "My mom has actually learned to treat me right, it's only occasionally that I'm able to notice that she really doesn't have the ability to care for others, and that stings me, other than that she's really made a lot of effort and I'm impressed. But my situation still hasn't gotten much better, a lot of people who have heard me talk about the situation have suggested I leave, which I don't want to do, is there no way to fix the problem without leaving?"
- Rationale for Choice: The ChatCounselor dataset was considered suitable due to its authenticity in representing real-world counseling data. The large variation in topics, tones, and lengths of the queries made it meaningful for testing the stability and generalizability of ChatGPT's performance in a counseling context.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
- Warmth (Emotion Detection Percentage):
  - Conceptual Definition: This metric quantifies the extent to which ChatGPT's responses convey a welcoming, supportive, and positive emotional tone, which is a key component of a therapeutic alliance. It specifically measures the proportion of responses that fall into categories indicative of care and affirmation.
  - Mathematical Formula: $ \text{Warmth Percentage} = \frac{\text{Number of responses coded as 'Caring' or 'Approval'}}{\text{Total number of responses}} \times 100\% $
  - Symbol Explanation:
    - Number of responses coded as 'Caring' or 'Approval': The count of individual ChatGPT responses where the EmoRoBERTa model detected the primary emotion as either Caring or Approval.
    - Total number of responses: The total number of responses generated by ChatGPT across all queries (80 queries × 3 runs = 240 responses).
- Empathy (Empathy Detection Percentage):
  - Conceptual Definition: This metric assesses ChatGPT's ability to understand, reflect, and emotionally connect with the user's feelings and experiences, which is crucial for empathetic listening in counseling. It measures the proportion of responses that contain empathetic language.
  - Mathematical Formula: $ \text{Empathy Percentage} = \frac{\text{Number of responses where Empathy is detected (1)}}{\text{Total number of responses}} \times 100\% $
  - Symbol Explanation:
    - Number of responses where Empathy is detected (1): The count of individual ChatGPT responses where the neural network model for empathy detection output a binary label of 1 (indicating empathy presence).
    - Total number of responses: The total number of responses (240).
- Acceptance (VADER Compound Score):
  - Conceptual Definition: This metric quantifies the overall emotional positivity of ChatGPT's responses, reflecting unconditional positive regard and a nonjudgmental attitude. The compound score aggregates positive, negative, and neutral sentiments into a single, normalized score between -1 (most extreme negative) and +1 (most extreme positive).
  - Mathematical Formula: The VADER compound score is produced internally by the model's lexicon and rule-based system; it is not a simple linear function of the neg, neu, and pos scores. The paper reports the mean of these scores: $ \text{Mean Compound Score} = \frac{1}{N} \sum_{i=1}^{N} \text{Compound Score}_i $
  - Symbol Explanation:
    - $\text{Compound Score}_i$: The VADER compound sentiment score for the $i$-th response.
    - $N$: The total number of responses (240).
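A hypothetical sketch of how these three headline descriptives could be computed once every response carries an emotion label, an empathy flag, and a compound score; the DataFrame column names are assumptions, not the authors' schema.

```python
# Illustrative sketch: warmth %, empathy %, and mean/SD compound score from a
# results table with assumed columns 'emotion', 'empathy', and 'compound'.
import pandas as pd

def summarize(results: pd.DataFrame) -> dict:
    warm = results["emotion"].str.lower().isin({"caring", "approval"})
    return {
        "warmth_pct": 100 * warm.mean(),
        "empathy_pct": 100 * (results["empathy"] == 1).mean(),
        "compound_mean": results["compound"].mean(),
        "compound_sd": results["compound"].std(),
    }

# Example with three toy rows (the study has 240: 80 queries x 3 runs)
df = pd.DataFrame({
    "emotion": ["caring", "approval", "confusion"],
    "empathy": [1, 1, 1],
    "compound": [0.99, 0.99, 0.94],
})
print(summarize(df))
```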
- Stability: Fleiss' Kappa ($\kappa$) for Empathy Detection:
  - Conceptual Definition: This metric measures the reliability of agreement among the three repeated responses (as "raters") regarding the presence or absence of empathy for each query. It accounts for agreement occurring by chance, providing a more robust measure of consistency than simple percent agreement.
  - Mathematical Formula: $ \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} $
  - Symbol Explanation:
    - $\kappa$: Fleiss' Kappa statistic.
    - $\bar{P}$: The observed agreement probability, the average proportion of agreements across all items (queries) and categories (empathy detected/not detected).
    - $\bar{P}_e$: The chance agreement probability, the average proportion of agreements expected by chance.
- Stability: Intraclass Correlation Coefficient (ICC(2,1)) for Sentiment Scores:
  - Conceptual Definition: This metric quantifies the absolute agreement and run-to-run reliability of continuous sentiment scores (negativity, neutral, positivity, compound) across the three responses for each query. It indicates how consistently the same absolute score is reproduced across repetitions.
  - Mathematical Formula (for ICC(2,1)): $ \mathrm{ICC}(2,1) = \frac{\mathrm{MS_R} - \mathrm{MS_E}}{\mathrm{MS_R} + (k-1)\mathrm{MS_E} + k(\mathrm{MS_C} - \mathrm{MS_E})/n} $
  - Symbol Explanation:
    - $\mathrm{ICC}(2,1)$: The Intraclass Correlation Coefficient (two-way random-effects, absolute-agreement, single-measurement).
    - $\mathrm{MS_R}$: Mean square for rows (variance attributable to the 80 queries).
    - $\mathrm{MS_E}$: Mean square for error (unexplained variance).
    - $\mathrm{MS_C}$: Mean square for columns (variance attributable to the 3 response runs).
    - $k$: Number of ratings per subject (3, the number of responses per query).
    - $n$: Number of subjects (80, the number of distinct queries).
- Chi-Square Test Statistic ($\chi^2$) for Emotion Category Distribution:
  - Conceptual Definition: This statistic is used to determine whether there is a statistically significant association between the categorical variable 'emotion category' (e.g., Caring, Approval, Confusion) and the categorical variable 'response run' (Answer 1, Answer 2, Answer 3). A high $\chi^2$ value with a low p-value would suggest a dependent relationship, meaning emotion distributions differ significantly across runs.
  - Mathematical Formula: $ \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $
  - Symbol Explanation:
    - $\chi^2$: The Chi-square test statistic.
    - $R$: Number of emotion categories observed.
    - $C$: Number of response runs (3).
    - $O_{ij}$: The observed frequency count for emotion category $i$ in response run $j$.
    - $E_{ij}$: The expected frequency count for emotion category $i$ in response run $j$, assuming no association between emotion category and response run.
- F-statistic from One-Way ANOVA for Composite Sentiment Scores:
  - Conceptual Definition: The F-statistic tests the null hypothesis that the means of the VADER compound scores are equal across the three response runs. A significant F-statistic (with a low p-value) would indicate that the average emotional tone differs systematically across the runs.
  - Mathematical Formula: $ F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}} $
  - Symbol Explanation:
    - $F$: The F-statistic.
    - $\mathrm{MS}_{\text{between}}$: Mean square between groups, representing the variance of the group means (average compound scores for Answer 1, Answer 2, Answer 3) around the grand mean.
    - $\mathrm{MS}_{\text{within}}$: Mean square within groups, representing the pooled variance within each group (the variability of compound scores within each set of 80 responses).
- Pearson's Correlation Coefficient ($r$) for Word Counts:
  - Conceptual Definition: This metric quantifies the strength and direction of the linear relationship between the word count of the input questions and the average word count of the generated responses. A positive value indicates that longer questions tend to produce longer answers.
  - Mathematical Formula: $ r = \frac{n \sum(xy) - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}} $
  - Symbol Explanation:
    - $r$: Pearson's correlation coefficient.
    - $n$: The number of paired observations (80 queries).
    - $\sum(xy)$: The sum of the products of each question's word count ($x$) and its average response word count ($y$).
    - $\sum x$: The sum of all question word counts.
    - $\sum y$: The sum of all average response word counts.
    - $\sum x^2$: The sum of the squares of all question word counts.
    - $\sum y^2$: The sum of the squares of all average response word counts.
5.3. Baselines
The study primarily focuses on an exploratory analysis of ChatGPT-4's intrinsic capabilities and stability rather than a direct comparison against other Large Language Models (LLMs) or traditional counseling interventions. Therefore, explicit baseline models in the conventional sense (i.e., other LLMs used for head-to-head performance comparison on the same dataset) are not present within the experimental setup.
However, the paper implicitly references "human norms" and "human therapists" from related work (e.g., Elyoseph et al., Huang et al.) to contextualize ChatGPT's performance, suggesting human therapeutic qualities as an aspirational benchmark. The discussion also mentions other AI-driven platforms like Anthropic's Claude, Google's Gemini, Replika, and CharacterAI as part of the broader ecosystem of mental health support, but these are not used as experimental baselines in this specific study.
The study aims to establish a baseline understanding of ChatGPT-4 itself, particularly its run-to-run stability, which is a critical aspect for its potential deployment in real-world sensitive applications.
6. Results & Analysis
6.1. Core Results Analysis
The study's results indicate that ChatGPT-4 exhibits strong performance in generating responses with therapeutic qualities (warmth, empathy, acceptance) in simulated school-counseling settings, alongside moderate run-to-run stability.
For warmth, the model showed a high prevalence of supportive and empathetic emotions. Combining caring (75.4%) and approval (22.1%) categories, 97.5% of all replies were coded as warm. This suggests that ChatGPT-4 rarely adopted a neutral or negative tone within this specific counseling context, fostering a nurturing dialogue environment. However, the low occurrence of confusion (0.83%) and realization (1.67%) emotions, while infrequent, is notable. While indicating clarity, any mistake or non-supportive emotion in psychological counseling can be harmful, highlighting a residual randomness that requires caution.
Empathy was also highly prevalent, detected in 94.2% of responses. This suggests ChatGPT's ability to effectively mirror user feelings and experiences in a psychologically accurate manner, aligning with common factors theory where empathic listening is crucial.
Regarding acceptance, the VADER compound score averaged 0.93 ± 0.19, which falls into VADER's "strongly positive" band. This indicates an overwhelmingly positive emotional undertone, promoting a supportive and reassuring interaction framework.
Stability of responses was moderate. Fleiss' Kappa for empathy detection was 0.59, indicating substantial agreement but also implying that around 4 in 10 empathy judgments might shift across reruns. For continuous sentiment scores, ICC(2,1) showed good stability for the compound score (0.62), fair stability for negativity (0.57) and positivity (0.49), and poor stability for neutral scores (0.39). This suggests that while the overall positive tone is relatively consistent, the negative, neutral, and positive components can vary more noticeably. However, a chi-square test of the emotion category distribution indicated no significant variation across the three runs, with a Cramér's V of 0.09 suggesting a weak association. Similarly, a one-way ANOVA of the composite sentiment scores found no systematic drift in average emotional tone. These seemingly contrasting stability results highlight that while the overall average sentiment and categorical distribution may be stable, the absolute scores for specific sentiment components (especially neutral) can still fluctuate, indicating a degree of stochastic sampling inherent to the model.
Finally, a moderate positive correlation was found between question word count and average answer word count. This suggests that longer student disclosures tend to receive proportionally richer feedback, which could be advantageous for engagement in counseling, while also indicating that output length is not entirely arbitrary and is influenced by prompt length.
Overall, the results strongly validate ChatGPT-4's potential for generating warm, empathetic, and accepting responses, which are critical for therapeutic interaction. However, the identified moderate stability and occasional non-positive responses highlight its immaturity for direct clinical use without human oversight, particularly in high-risk mental health interventions.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Emotion Category | Answer 1 | Answer 2 | Answer 3 | Total | Proportion | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| Approval | 14 | 21 | 18 | 53 | 22.08% | Moderate |
| Caring | 64 | 57 | 60 | 181 | 75.42% | Very High |
| Realization | 1 | 1 | 2 | 4 | 1.67% | Very Low |
| Confusion | 1 | 1 | 0 | 2 | 0.83% | Very Low |
The following are the results from Table 2 of the original paper:
| Empathy Detection | Answer 1 | Answer 2 | Answer 3 | Frequency | Percentage |
| --- | --- | --- | --- | --- | --- |
| Empathy (1) | 76 | 75 | 75 | 226 | 94.17% |
| No Empathy (0) | 4 | 5 | 5 | 14 | 5.83% |

Mean VADER sentiment scores per response run:

| Sentiment Type | Answer 1 Mean | Answer 2 Mean | Answer 3 Mean | Total Mean |
| --- | --- | --- | --- | --- |
| Negative (neg) | 0.06 | 0.06 | 0.06 | 0.06 |
| Neutral (neu) | 0.73 | 0.74 | 0.73 | 0.73 |
| Positive (pos) | 0.21 | 0.21 | 0.21 | 0.21 |
| Compound | 0.90 | 0.94 | 0.94 | 0.93 |
The following are the results from Table 3 of the original paper, showing samples of regular and irregular responses. In the original layout, each question and response wraps across multiple merged table rows, so fragments appeared out of order in the extracted text; they are reassembled below, with "[...]" marking joins that could not be recovered. The emotion, empathy, and sentiment labels apply once per question-response pair.

Regular responses:

- Question: "My mom has actually learned to treat me right, it's only occasionally that I'm able to notice that she really doesn't have the ability to care for others, and that stings me, other than that she's really made a lot of effort and I'm impressed. But my situation still hasn't gotten much better, a lot of people who have heard me talk about the situation have suggested I leave, which I don't want to do, is there no way to fix the problem without leaving?"
  Response: "It sounds like you're in a complex situation where you recognize and appreciate the efforts your mom has made, yet you're also feeling hurt by moments that reveal limitations in her ability to show care. It's understandable that such moments can be painful, especially when they come from someone as important as your mother. The conflict between valuing the progress she's made and feeling stung by her limitations is a challenging space to navigate. Leaving or staying isn't a decision to be taken lightly, and it's clear you have a strong inclination towards finding a solution that doesn't involve leaving. It's important to acknowledge that change in relationships often requires effort from all parties involved. Have you been able to communicate your feelings to your mom about those moments that sting? Sometimes, opening up a dialogue can pave the way for understanding and further improvements in your relationship. It may also be helpful to consider what support systems you have in place, such as friends, other family members, or a therapist, who can offer you perspectives and emotional support as you navigate these complexities. Finding a balance between your needs and the dynamics with your mom may take time and patience. Remember, it's okay to seek out spaces where you feel understood and valued, as this can significantly contribute to your well-being."
  Emotion Category: Caring | Empathy Category: 1 (detected) | Sentiment Compound: 0.99

- Question: "After coming into contact with psychology, I gradually realized that there are some negative influences on the way of communication, expression and getting along formed by my family of origin, and that the incorrect mode of getting along between my parents has indirectly caused a profound potential impact on myself, which has led to my own stumbling in the choice of love and getting along with my partner. And I want to try my best to save my family of origin, but I don't have enough strength. I often switch between extreme optimism and extreme pessimism. How can I balance myself?"
  Response: "It sounds like you've been doing a lot of reflective thinking about your family dynamics and how these have influenced your own patterns in relationships. It's insightful and brave of you to recognize these patterns and want to address them. Feeling torn between optimism and pessimism is understandable given the complexities of familial relationships and the desire to change deeply ingrained behaviors. Balancing oneself amid such emotional ups and downs involves acknowledging that change is a process, often slow and non-linear. It's important to give yourself permission to feel these conflicting emotions without judgment. Recognizing that you can only control your own actions and reactions, not the behavior of your family members, is key. Practicing self-compassion and seeking support, whether through friends, support groups, or a therapist, can be incredibly beneficial. Exploring these feelings and experiences in a safe space can help you gain insights and develop strategies for healthier communication and relationship patterns. Remember, it's okay to seek help and it's okay to set boundaries for your own well-being. Balancing optimism and pessimism can also mean accepting the reality of the situation while holding onto hope for personal growth and healthier relationships in the future."
  Emotion Category: Approval | Empathy Category: 1 (detected) | Sentiment Compound: 0.99

- Question: "I am 18 years old, I feel social phobia, I blush when I talk to strange boys, and I am afraid of teachers, I like to be jealous of other people's eloquence, good social skills, good interpersonal relationships, what should I do?"
  Response: "It's quite understandable to feel anxious in social situations, especially when interacting with people you're not familiar with. These feelings are more common than you might think, and acknowledging them is a brave first step. It's also natural to admire qualities in others that we wish to see in ourselves. Let's consider your strengths and the situations where you feel more comfortable. Reflecting on these can help us identify strategies to gradually build your confidence in more challenging social settings. Feeling nervous or blushing is a physical response to anxiety, and it's important to remember that everyone experiences moments of uncertainty when speaking to someone new or in authority roles like teachers. It's okay to feel this way. A helpful approach might be to start with small, manageable social interactions and gradually increase the complexity as you become more comfortable. Practice makes progress. Also, focusing on others' conversations, not to compare but to understand that everyone has moments of awkwardness, can be enlightening. Remember, eloquence and good social skills come with practice and exposure. It's perfectly fine to take small steps towards bigger goals. Celebrate your [...] seem. Would you be open to exploring some techniques or activities that could help you build these skills at a pace that feels right for you?"
  Emotion Category: Approval | Empathy Category: 1 (detected) | Sentiment Compound: 0.99

Irregular responses:

- Question: "I would sometimes feel that the people around me were pathetic and inferior, and feel alone with them and disdainful. I would feel that the niche was superior and the mass was vulgar. But in reality, deep down I was inferior and even angry. I would be reluctant to wear designer clothes for fear of attracting attention, and I wouldn't care to look like them. I would be angry at being too plain and unimpressive. I hated the public, but at the same time I needed them to recognize me."
  Response: "It sounds like you're grappling with complex feelings of superiority and inferiority, alongside a desire for recognition and a fear of it. It's not uncommon to feel conflicted in this way, where on one hand, there's a disdain for what's perceived as 'vulgar' or 'mass,' while on the other, there's a deep-seated wish to be acknowledged and valued by the very same group. This dichotomy can indeed make one feel isolated and frustrated. Your feelings towards wearing designer clothes and the fear of attracting attention, juxtaposed with a longing to be noticed, highlight an inner conflict between wanting to stand out and fearing the scrutiny that may come with it. It's a delicate balance between seeking validation and fearing judgment. Acknowledging these feelings is a significant first step. Exploring where these feelings come from can help us understand what drives your need for recognition and your fear of it. It's also worth exploring the standards you're holding yourself to and questioning where they come from. Are they truly your own, or perhaps influenced by societal or external expectations? Remember, it's okay to seek recognition and want to be seen, but it's also important to find a sense of self-worth that's not solely dependent on the approval of others. Together, we can work on strategies that help you reconcile these feelings, aiming for a place of self-acceptance and inner peace."
  Emotion Category: Confusion | Empathy Category: 1 (detected) | Sentiment Compound: 0.94

- Question: (not shown in the extracted table)
  Response: "It sounds like you're experiencing a deep connection with your partner, along with significant anxiety about the possibility of parting ways. It's understandable to feel strongly attached to someone who has been a significant part of your life for a long time. Relationships, especially long-[...] investments and shared experiences that can make the thought of separation very painful. However, it's also important to distinguish between healthy attachment and dependence that might hinder personal growth. Healthy love supports mutual growth, respects individuality, and includes a sense of security within the relationship that allows both partners to be themselves fully. Feeling unable to leave a relationship, despite recognizing issues or dissatisfaction, can sometimes reflect patterns of dependence or fear rather than just deep love. The concern expressed about low self-esteem comes from the idea that sometimes people stay in relationships not solely because of love but due to fear of being alone or belief that they can't find love again. This isn't to say your feelings aren't valid, but it's beneficial to explore the roots of these feelings. Are they based on love and mutual respect, or are they driven by fear and insecurity? Wanting to stay with one person for the rest of your life is a beautiful sentiment when it's based on a healthy, nurturing relationship. It might be helpful to reflect on why you feel unable to leave, despite recognizing potential issues. Consulting with a therapist could provide a supportive space to explore these feelings, understand your emotional needs better, and address concerns about self-esteem and dependency in relationships."
  Emotion Category: Confusion | Empathy Category: 1 (detected) | Sentiment Compound: 0.99
| --- | ||||
| (This structure indicates a single question/response pair across multiple lines, with repeated metadata for clarity in the original table's layout.) | ||||
| proposed breakup, and slightlyoverstays his or her welcome.Must such a mindset andbehavior be unhealthy and inneed of adjustment? And ishealthy love something you can'tpour into a person? Isn't it truethat the longer a five-yearrelationship goes on and themore exchanges are made, themore you'll be unable to partwith it? Why is it being told bythe other person that this is lowself-esteem behavior? I don'tthink it feels like I won't be lovedby someone else after I separate,and I have suitors now. But I justcan't let go of the people aroundme, and I want to stay with oneperson for the rest of my life. | (Continuation of the previous question, with its response continuing.) | |||
| --- | ||||
| (This structure indicates a single question/response pair across multiple lines, with repeated metadata for clarity in the original table's layout.) | ||||
| About relationship and marriage,I've been with my boyfriend for ayear now, we have a greatrelationship and I like him a lot.But it seems like as a marriagepartner he lacks some of thequalities I want in a man, like hardwork and the ability to treat peoplewell. So my question is, howexactly is a relationship related tomarriage? Do you like each otherenough to get married? Is thereany point in being in a relationshipif you don't want to marry eachother? Thank you for your answer. | It sounds like you're doing somethoughtful reflection on your relationshipand considering what you value for yourfuture. Relationships and marriage, whileinterconnected, serve different roles andcommitments in people's lives. Being in arelationship can be a way to connect,share experiences, and grow with anotherperson. Marriage, for many, symbolizes adeeper commitment and a willingness toshare life's journey together, with mutualrespect, understanding, and shared valuesat its core.It's important to acknowledge andappreciate the qualities you admire in | Realization | 0 (undetected) | 1.00 |
| --- | ||||
| Remember, it's about finding a balance that feels right for you and your partner, where both of you feel valued, respected, and fulfilled. | your boyfriend, like the strength of your relationship and how much you like each other. At the same time, it's equally important to recognize areas where you feel there might be a misalignment in values or qualities you consider crucial for a lifelong partnership. Communication is key in these situations. Discussing your feelings and expectations with your boyfriend openly can provide clarity and may help both of you understand if your paths align in terms of a long-term commitment like marriage. Deciding on marriage is deeply personal and varies significantly from one individual to another. Some may find fulfillment in a relationship without the need for marriage, while others view marriage as a fundamental goal of their partnership. Reflecting on what you truly value and desire in a partnership can guide you in making decisions that align with your personal and relationship goals. | (Continuation of the same response) | ||
| --- | ||||
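The emotion labels and confidence scores in the rows above (Caring, Approval, Confusion, Realization) are consistent with the GoEmotions taxonomy used by transformer-based emotion classifiers. The paper does not name its exact classifier, so the following is a minimal sketch assuming the publicly available Hugging Face checkpoint SamLowe/roberta-base-go_emotions and an illustrative 0.5 detection threshold, not the authors' actual pipeline.

```python
# Minimal sketch: labelling a ChatGPT reply with a GoEmotions-style emotion.
# Assumptions (not from the paper): the Hugging Face checkpoint
# "SamLowe/roberta-base-go_emotions" and a 0.5 "detected" threshold.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="SamLowe/roberta-base-go_emotions")

def label_reply(reply, threshold=0.5):
    """Return the top emotion label, a detected flag, and its confidence."""
    top = classifier([reply])[0]  # e.g. {"label": "caring", "score": 0.97}
    return {"emotion": top["label"],
            "detected": int(top["score"] >= threshold),
            "score": round(top["score"], 2)}

print(label_reply("It sounds like you're in a complex situation where you "
                  "recognize and appreciate the efforts your mom has made..."))
```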
The following are the results from Table 4 of the original paper:
| Emotional Type | ICC Type | ICC Value | Stability Rating | p-value |
| Negativity | ICC(2,1) | 0.57 | Fair | <.001 |
| Neutral | ICC(2,1) | 0.39 | Poor | <.01 |
| Positivity | ICC(2,1) | 0.49 | Fair | <.001 |
| Compound | ICC(2,1) | 0.62 | Good | <.001 |
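To make the stability analysis concrete: ICC(2,1) treats the three generation runs as interchangeable "raters" scoring the same 80 questions, while Fleiss' κ measures agreement on the categorical emotion labels. Below is a minimal sketch of how such values could be computed, assuming the pingouin and statsmodels libraries; the data layout, column names, and toy values are illustrative rather than the study's data.

```python
# Minimal sketch: run-to-run stability as in Table 4 (ICC(2,1)) plus Fleiss' kappa.
# The long-format layout, column names, and values below are illustrative only.
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Long format: one row per (question, run) with the compound sentiment score.
scores = pd.DataFrame({
    "question": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "run":      [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "compound": [0.98, 0.99, 0.97, 0.85, 0.90, 0.88,
                 -0.10, 0.20, 0.05, 0.99, 0.95, 0.97],
})

icc = pg.intraclass_corr(data=scores, targets="question",
                         raters="run", ratings="compound")
print(icc.loc[icc["Type"] == "ICC2", ["Type", "ICC", "pval"]])  # ICC(2,1)

# Fleiss' kappa on categorical emotion labels: one row per question,
# one column per run; aggregate_raters converts codes to category counts.
label_ids = {"caring": 0, "approval": 1, "confusion": 2}
runs = [["caring", "caring", "approval"],
        ["approval", "approval", "approval"],
        ["confusion", "caring", "confusion"],
        ["caring", "caring", "caring"]]
coded = [[label_ids[lab] for lab in row] for row in runs]
counts, _ = aggregate_raters(coded)
print(fleiss_kappa(counts, method="fleiss"))
```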
The following are the results from Table 5 of the original paper:
| Metric | Response 1 | Response 2 | Response 3 |
| Count | 80 | 80 | 80 |
| Mean (neg) | 0.06 | 0.06 | 0.06 |
| Std (neg) | 0.04 | 0.04 | 0.03 |
| Min (neg) | 0.00 | 0.00 | 0.00 |
| 25% (neg) | 0.03 | 0.03 | 0.04 |
| 50% (neg) | 0.05 | 0.05 | 0.06 |
| 75% (neg) | 0.07 | 0.08 | 0.08 |
| Max (neg) | 0.21 | 0.23 | 0.15 |
| Mean (neu) | 0.73 | 0.74 | 0.73 |
| Std (neu) | 0.05 | 0.05 | 0.04 |
| Min (neu) | 0.53 | 0.56 | 0.61 |
| 25% (neu) | 0.71 | 0.71 | 0.70 |
| 50% (neu) | 0.74 | 0.74 | 0.73 |
| 75% (neu) | 0.77 | 0.77 | 0.76 |
| Max (neu) | 0.82 | 0.83 | 0.82 |
| Mean (pos) | 0.21 | 0.21 | 0.21 |
| Std (pos) | 0.06 | 0.05 | 0.05 |
| Min (pos) | 0.11 | 0.09 | 0.11 |
| 25% (pos) | 0.17 | 0.18 | 0.18 |
| 50% (pos) | 0.21 | 0.21 | 0.21 |
| 75% (pos) | 0.23 | 0.24 | 0.23 |
| Max (pos) | 0.45 | 0.42 | 0.35 |
| Mean (compound) | 0.90 | 0.94 | 0.94 |
| Std (compound) | 0.35 | 0.23 | 0.19 |
| Min (compound) | -0.95 | -0.99 | -0.65 |
| 25% (compound) | 0.97 | 0.97 | 0.97 |
| 50% (compound) | 0.99 | 0.99 | 0.99 |
| 75% (compound) | 0.99 | 0.99 | 0.99 |
| Max (compound) | 0.99 | 0.99 | 0.99 |
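The neg/neu/pos/compound breakdown in Table 5 matches the output format of a VADER-style sentiment analyzer summarized with standard descriptive statistics. A minimal sketch is shown below, assuming the vaderSentiment package and pandas; the paper's exact tooling, and the example replies, are assumptions for illustration.

```python
# Minimal sketch: per-run descriptive statistics as in Table 5, assuming a
# VADER-style analyzer (the paper's exact NLP tooling is not specified here).
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def describe_run(replies):
    """Score each reply (neg/neu/pos/compound) and summarize the run."""
    scores = pd.DataFrame([analyzer.polarity_scores(r) for r in replies])
    return scores.describe()  # count, mean, std, min, 25%, 50%, 75%, max

run_1 = [
    "It sounds like you're in a complex situation ...",
    "It's quite understandable to feel anxious in social situations ...",
]
print(describe_run(run_1).round(2))
```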
6.3. Ablation Studies / Parameter Analysis
The paper does not present traditional ablation studies (where components of a model are removed to assess their individual contribution) or a detailed analysis of hyper-parameters. Instead, it includes a correlation analysis that indirectly explores a factor influencing GPT's output randomness:
- Correlation Between Question and Answer Word Count:
  - The study examined the relationship between the word count of the input questions and the average word count of the answers provided by GPT.
  - Result: A moderate positive correlation was found (95% CI [.44, .72]).
  - Analysis: This suggests that longer questions tend to elicit longer responses from ChatGPT. While not an ablation study, this finding indicates that prompt length is a factor influencing the content (and, implicitly, the complexity and detail) of the AI's output. It highlights a controllable variable that could be leveraged or standardized in future applications to manage response characteristics. It also underscores that the "randomness" of GPT's output is not entirely opaque but can be partially influenced by input characteristics, suggesting avenues for prompt engineering to improve consistency.
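For reference, a correlation of this kind, with its 95% CI, can be computed directly from the paired word counts; the sketch below assumes SciPy ≥ 1.10 (for `confidence_interval()`) and uses placeholder counts rather than the study's data.

```python
# Minimal sketch: correlation between question length and mean answer length,
# with a 95% CI (requires SciPy >= 1.10 for confidence_interval()).
# The word counts below are illustrative placeholders, not the study's data.
import numpy as np
from scipy import stats

question_words = np.array([45, 120, 80, 60, 150, 95, 70, 110])
mean_answer_words = np.array([210, 320, 260, 240, 350, 280, 255, 300])

result = stats.pearsonr(question_words, mean_answer_words)
ci = result.confidence_interval(confidence_level=0.95)
print(f"r = {result.statistic:.2f}, p = {result.pvalue:.3g}, "
      f"95% CI [{ci.low:.2f}, {ci.high:.2f}]")
```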
7. Conclusion & Reflections
7.1. Conclusion Summary
This exploratory study provides a valuable quantitative snapshot of ChatGPT-4's capabilities and run-to-run stability in simulating psychological counseling within a school context. The findings indicate that ChatGPT-4 can consistently generate responses exhibiting high levels of warmth (97.5%), empathy (94.2%), and acceptance (mean compound score = 0.93), aligning with key APA benchmarks for effective therapeutic communication. The stability of these responses was moderate (ICC(2,1) = 0.62 for compound sentiment, Fleiss' κ = 0.59 for empathy), suggesting a reasonable level of consistency but also highlighting inherent variability. While the model generally maintains a positive and supportive tone, the occasional occurrence of non-positive (e.g., 'confusion') responses and sentiment drift underscores critical risk areas. The research suggests that ChatGPT-4 holds significant potential for augmenting low-intensity mental health support in educational settings, particularly for psycho-education and check-in messages, provided that robust human-in-the-loop workflows and risk management policies are implemented. This study is notable for its application of quantitative stability metrics and NLP-based emotion detection in this specific domain, alongside incorporating a practitioner's real-world perspective.
7.2. Limitations & Future Work
The authors explicitly acknowledge several limitations:
- Offline Simulation: The study was based solely on offline text simulations, using pre-recorded student queries and a single version of GPT-4 (0613, accessed 15 July 2024). This means the results are descriptive rather than confirmatory and may not generalize to other LLM versions or live conversational dynamics.
- No Live User/Clinician Interaction: The absence of live users or clinicians interacting with the system means the study cannot assess real-time usability, safety, or the actual conversational dynamics that would occur in a clinical setting.
- Inferred Metrics: Warmth, empathy, and acceptance were inferred by machine-learning models rather than validated by human raters. Crucial clinical outcomes like symptom change or client satisfaction were not measured.
- Exploratory Nature: The findings are presented as an exploratory, hypothesis-generating snapshot rather than definitive evidence of effectiveness.
- Lack of Direct Stakeholder Feedback: The study did not collect usability or acceptability feedback from adolescent users or professional school counselors, leaving a gap in understanding real-world experience and needs.

Based on these limitations, the authors suggest several future research directions:

- Live User Involvement: Future work should involve live users to assess real-time interaction, usability, and safety.
- Multi-Model Comparison: Comparing multiple LLMs would provide a broader understanding of the landscape.
- Mixed-Methods Validation: Incorporating mixed-methods validation, including human rater assessments, clinical outcomes, and qualitative feedback from users and clinicians, is essential to assess real-world efficacy and safety.
- Co-design with Stakeholders: Conducting semi-structured interviews and focus groups with students and counseling staff is needed to validate automated metrics against lived perceptions and to co-design guardrail criteria that balance relational support with user safety.
- Addressing Randomness: Continuous refinement of AI algorithms is necessary to maintain consistency and reliability, especially in high-risk mental health interventions.
- Technological Solutions: Developing multi-agent models (validation agents, risk assessment agents) and refining prompt engineering to mitigate limitations.
- Applicational Solutions: Piloting AI applications in non-clinical settings with A/B trials and exploring their integration with wearable devices for continuous monitoring and preventive care.
- Organizational Solutions: Schools and colleges should develop protocols for continuous review, human oversight, product roadmaps, risk-management plans, and training for AI integration.
7.3. Personal Insights & Critique
This paper offers a rigorous quantitative approach to a crucial and sensitive application of Large Language Models (LLMs): mental health counseling. The meticulous measurement of warmth, empathy, and acceptance using APA-informed NLP tools is a significant step beyond anecdotal observations of AI's therapeutic potential. The focus on run-to-run stability with ICC and Kappa is particularly insightful, as consistency is paramount in clinical settings; an AI that gives different advice for the same problem could undermine trust and cause harm.
One personal insight is the critical distinction between NLP-derived emotional awareness and genuine therapeutic attunement. While ChatGPT-4 demonstrates impressive scores in warmth and empathy as measured by algorithms, the authors' own reference to Huang et al.'s finding that GPT-4 underperforms on "depth of emotional processing" and often resorts to "formulaic reassurance" is vital. This suggests that while the surface-level cues are present, the underlying capacity for dynamic, personalized, and adaptive therapeutic dialogue may still be limited. The high quantitative scores, therefore, might reflect sophisticated stylistic mimicry rather than true understanding or a robust internal model of human psychology. This highlights an important area for future research: how do we design AI that not only sounds empathetic but is therapeutically effective in a way that goes beyond lexical patterns?
A potential unverified assumption is that APA-informed NLP tools perfectly capture warmth, empathy, and acceptance as understood by human counselors and clients. While these tools are advanced, the nuances of human emotional expression and therapeutic intent can be complex. Human raters, qualitative feedback, and real-world clinical outcomes are indispensable for validating these automated metrics, a point the authors themselves acknowledge as a limitation and future work.
The paper's emphasis on risk management and human-in-the-loop workflows is commendable and absolutely necessary. The 2.5% 'confusion/realization' labels and the moderate stability are not mere statistical curiosities but quantifiable markers of failure modes that could cause significant harm in a crisis. The discussion of the Character AI lawsuit powerfully underscores this. It highlights that correctness and safety must take precedence over efficiency in mental health AI.
The proposed multi-agent model (validation agent, risk assessment agent) offers a promising technological solution to enhance safety, but it still relies on predefined benchmarks and rules, which might struggle with novel or highly complex clinical situations. A critical area for improvement lies in developing AI that can articulate its uncertainty or limitations transparently to the user and seamlessly escalate to human oversight when appropriate, rather than hallucinating or providing potentially harmful advice.
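To make this design direction concrete, the following is a purely illustrative sketch of such a gate: a validation check and a keyword-based risk check sit between the model and the student, and the conversation escalates to a human counselor when either check fails. All function names, thresholds, and keywords are hypothetical rather than the authors' implementation, and a real deployment would require clinically validated risk detection rather than keyword matching.

```python
# Illustrative sketch only: a human-in-the-loop gate combining a validation
# check and a risk-assessment check before a reply reaches the student.
# All names, thresholds, and keywords are hypothetical, not the paper's system.
from dataclasses import dataclass

RISK_KEYWORDS = {"suicide", "kill myself", "self-harm", "hurt myself"}   # placeholder list
BENCHMARK_PHRASES = {"it sounds like", "it's understandable", "you're not alone"}

@dataclass
class GateDecision:
    deliver: bool   # send the AI reply to the student?
    escalate: bool  # route the case to a human counselor?
    reason: str

def risk_agent(student_message: str) -> bool:
    """Crude keyword screen; a real system needs validated risk models."""
    text = student_message.lower()
    return any(keyword in text for keyword in RISK_KEYWORDS)

def validation_agent(ai_reply: str) -> bool:
    """Check the reply against simple warmth/empathy benchmark phrases."""
    text = ai_reply.lower()
    return any(phrase in text for phrase in BENCHMARK_PHRASES)

def gate(student_message: str, ai_reply: str) -> GateDecision:
    if risk_agent(student_message):
        return GateDecision(False, True, "risk indicators: human counselor required")
    if not validation_agent(ai_reply):
        return GateDecision(False, True, "reply failed empathy benchmark: human review")
    return GateDecision(True, False, "passed validation and risk checks")

decision = gate("I feel alone and inferior around my classmates.",
                "It sounds like you're carrying a heavy sense of isolation...")
print(decision)
```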
The concept of organizational readiness is also crucial. Deploying AI in schools or clinical settings requires not just technological solutions but comprehensive frameworks for training, oversight, accountability, and regulatory compliance. Without these, even the most sophisticated AI risks doing more harm than good.
Inspirationally, this paper demonstrates a rigorous method for evaluating AI in a sensitive domain, pushing for evidence-based deployment. Its findings can guide the design of AI systems that are genuinely assistive, perhaps starting with low-intensity support and prevention, which can free up human therapists for high-acuity cases. The research contributes significantly to the ongoing discourse on how to harness the benefits of AI for mental health while vigilantly mitigating its considerable risks.