
Evaluating large language models and agents in healthcare: key challenges in clinical applications

Published: 04/02/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This review systematically evaluates LLMs and agents in healthcare, summarizing data sources, analyzing diverse clinical tasks, and comparing automated metrics with expert assessments, highlighting key challenges in clinical applications.

Abstract

Intelligent Medicine 5 (2025) 151–163. Review: Evaluating large language models and agents in healthcare: key challenges in clinical applications. Xiaolan Chen, Jiayang Xiang, Shanfu Lu, Yexin Liu, Mingguang He, Danli Shi. Affiliations: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Department of Ophthalmology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China; Perception Vision Medical Technologies Co. Ltd., Guangzhou, Guangdong 510530, China; AI Thrust, The Hong Kong University of Science and Technology, Guangzhou, Guangdong 511453, China; Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong, China. Keywords: Large language model; Generative pre-trained transformer; Evaluation; Reasoning; Hallucination; Medical agent.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "Evaluating large language models and agents in healthcare: key challenges in clinical applications".

1.2. Authors

The authors are Xiaolan Chen, Jiayang Xiang, Shanfu Lu, Yexin Liu, Mingguang He, and Danli Shi. Their affiliations are:

  • Xiaolan Chen is affiliated with the School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.
  • Jiayang Xiang is affiliated with the Department of Ophthalmology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
  • Shanfu Lu is affiliated with Perception Vision Medical Technologies Co. Ltd., Guangzhou, Guangdong, China.
  • Yexin Liu is affiliated with the AI Thrust, The Hong Kong University of Science and Technology, Guangzhou, Guangdong, China.
  • Mingguang He is affiliated with the School of Optometry, The Hong Kong Polytechnic University, the Research Centre for SHARP Vision (RCSV), and the Centre for Eye and Vision Research (CEVR), Hong Kong, China.
  • Danli Shi is affiliated with the School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, and the Department of Ophthalmology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.

1.3. Journal/Conference

The paper is a review article published in Intelligent Medicine (Elsevier), volume 5 (2025), pages 151–163, with DOI 10.1016/j.imed.2025.03.002. The journal covers medical informatics and digital medicine and is a reputable venue for research at the intersection of AI and healthcare. The 2025 publication year indicates a very recent publication.

1.4. Publication Year

The publication year is 2025.

1.5. Abstract

Large language models (LLMs) and their advanced variants, LLM agents, show immense potential in healthcare, from clinical decision support to patient education, by enabling multimodal processing and multi-task handling. However, evaluating their performance in high-risk medical contexts with complex data presents unique challenges. This paper offers a comprehensive overview of current evaluation practices. It summarizes data sources (existing medical resources and manually curated questions), analyzes key medical task scenarios (closed-ended, open-ended, image processing, and real-world multitask scenarios with LLM agents), and compares evaluation methods and dimensions (automated metrics vs. human expert assessments, traditional accuracy vs. agent-specific dimensions like tool usage and reasoning). Finally, it identifies key challenges and opportunities, stressing the need for continued interdisciplinary collaboration to ensure safe, ethical, and effective deployment of LLMs in clinical practice.

The original source link is /files/papers/69086f591ccaadf40a4344bf/paper.pdf. Based on the DOI 10.1016/j.imed.2025.03.002 and the publication year 2025, this is an officially published article (or accepted manuscript) in Intelligent Medicine.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the effective and rigorous evaluation of large language models (LLMs) and LLM agents in medical and clinical applications. LLMs have demonstrated transformative potential across various domains, including healthcare, where they promise advancements in tasks like clinical decision support and patient education. The emergence of LLM agents further expands this utility by enabling multimodal processing (handling different types of data, e.g., text and images) and multi-task handling (performing several related tasks sequentially or in parallel) within complex clinical workflows.

This problem is critically important in the current field due to the high-risk nature of healthcare. Errors or inaccuracies from LLMs in medical contexts can have severe consequences for patient safety and outcomes. Moreover, medical data is inherently complex, diverse, and often sensitive, making robust evaluation challenging. Existing research faces several challenges and gaps:

  1. Data Bias: Many datasets are drawn from specific domains or populations, potentially misrepresenting real-world performance.

  2. Lack of Depth: Evaluations often cover a broad range of applications but lack the depth needed to truly differentiate strengths and weaknesses in practical scenarios.

  3. Limited Evaluation Dimensions: Prior methods primarily focus on accuracy (the proportion of correct predictions), overlooking other critical attributes such as hallucination (generation of plausible-sounding but factually incorrect information), logical reasoning, bias (unfair or prejudiced outcomes), stability (consistency of output), and the likelihood of generating harmful content.

  4. Inadequacy for Agents: Traditional evaluation methods designed for single tasks and dimensions are insufficient for the multimodal, multitask capabilities of advanced LLM agents.

    The paper's entry point is to provide a comprehensive overview of the current evaluation landscape for LLMs and LLM agents in medicine. It aims to synthesize existing insights, identify key challenges, and propose future directions to facilitate the responsible integration of these AI technologies into clinical practice.

2.2. Main Contributions / Findings

The paper makes three primary contributions by comprehensively reviewing and structuring the current state of LLM and LLM agent evaluation in medicine:

  1. Summarized Data Sources: It categorizes and details the data sources used in LLM evaluations, including existing medical resources (e.g., standardized medical examinations, literature) and manually designed clinical questions (e.g., real-world interactions, expert-crafted cases). This provides a foundational understanding of what kind of data is currently leveraged to assess medical LLMs.

  2. Analyzed Key Medical Task Scenarios: The paper breaks down medical applications into four critical scenarios: closed-ended tasks (e.g., multiple-choice questions), open-ended tasks (e.g., summarization, information extraction, medical question answering), image processing tasks (e.g., classification, report generation, visual question answering), and real-world multitask scenarios involving LLM agents. This categorization offers guidance for researchers on evaluating LLMs across diverse medical applications.

  3. Compared Evaluation Methods and Dimensions: It differentiates between automated metrics (e.g., accuracy, F1-score, BLEU) and human expert assessments, while also addressing traditional accuracy measures alongside agent-specific dimensions such as tool usage and reasoning capabilities. This highlights the evolving requirements for evaluation as LLM technology advances.

    The key conclusions and findings emphasize that despite significant advancements, current evaluation practices still face considerable challenges related to data quality, comprehensiveness of assessment standards, and the need for rigorous, real-world validation. The paper concludes by underscoring the critical need for continued interdisciplinary collaboration between healthcare professionals and computer scientists to ensure the safe, ethical, and effective deployment of LLMs in clinical practice.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a novice reader should be familiar with the following core concepts:

  • Large Language Models (LLMs): These are deep learning models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They leverage complex neural network architectures, typically transformers, to learn patterns, grammar, and factual knowledge from their training data. Examples include GPT (Generative Pre-trained Transformer) series, LLaMA, etc.
  • LLM Agents: These are advanced AI systems that use an LLM as their "brain" (the central decision-making component) and integrate various expert AI models as "tools." Unlike basic LLMs that simply respond to prompts, agents can autonomously understand complex user instructions, break them down into sub-tasks, make decisions, select and use appropriate tools (e.g., a search engine, an image analysis model, a calculator), and iteratively perform tasks to achieve a goal in complex clinical workflows. They typically involve capabilities like planning, reasoning, and tool usage.
  • Clinical Decision Support (CDS): Computerized systems designed to aid healthcare professionals in making clinical decisions by providing evidence-based information, alerts, and recommendations at the point of care. LLMs can enhance CDS by processing patient data and medical literature to offer relevant insights.
  • Multimodal Processing: The ability of an AI system to process and integrate information from multiple modalities (types of data) simultaneously. In healthcare, this could mean combining text from electronic health records (EHRs), images from MRI scans, and audio from patient-doctor conversations to form a comprehensive understanding.
  • Multitask Handling: The capability of an AI model or agent to perform multiple distinct tasks, often related, within a single system or workflow, rather than being specialized for just one task.
  • Hallucination: A phenomenon in LLMs where the model generates plausible-sounding but factually incorrect or nonsensical information. In healthcare, this is particularly dangerous as it can lead to misinformation or incorrect diagnoses.
  • Retrieval-Augmented Generation (RAG): A technique used to improve the factual accuracy and reduce hallucinations in LLMs. When an LLM receives a query, a retrieval component first searches a knowledge base (e.g., medical literature) for relevant information. This retrieved information is then fed to the LLM along with the original query, allowing the LLM to generate more informed and accurate responses grounded in external data.
  • Natural Language Processing (NLP): A field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language. Core NLP tasks in this paper include summarization, information extraction, and question answering.
  • Electronic Health Records (EHRs): Digital versions of a patient's paper chart. EHRs are real-time, patient-centered records that make information available instantly and securely to authorized users. LLMs can help in summarizing, extracting information, or generating reports from EHRs.
  • Named Entity Recognition (NER): An NLP task that identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, medical conditions, genes, or proteins.
  • Relation Extraction: An NLP task that identifies semantic relationships between named entities in text, such as gene-disease associations or drug-drug interactions.
  • Visual Question Answering (VQA): A multimodal task where an AI system is given an image and a natural language question about the image, and it must generate a natural language answer. In medical VQA, this involves interpreting medical images (e.g., X-rays) to answer clinical questions.
  • Evaluation Metrics (General):
    • Accuracy: The proportion of correct predictions (or answers) out of the total number of predictions. $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Precision: The proportion of true positive predictions among all positive predictions. It measures how many of the identified items are actually relevant. $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
    • Recall (Sensitivity): The proportion of true positive predictions among all actual positive items. It measures how many relevant items are identified. $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
    • Specificity: The proportion of true negative predictions among all actual negative items. It measures how many irrelevant items are correctly identified as irrelevant. $ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} $
    • F1-score: The harmonic mean of Precision and Recall. It provides a single score that balances both metrics, especially useful when class distribution is uneven. $ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
    • Area Under the Receiver Operating Characteristic Curve (AUROC or ROC AUC): A measure of a classification model's performance at various threshold settings. It indicates the model's ability to distinguish between classes. A higher AUC indicates a better model.
  • Evaluation Metrics (Text Generation):
    • BLEU (Bilingual Evaluation Understudy): A metric for evaluating the quality of text which has been machine-translated from one natural language to another. It compares the generated text to one or more reference texts.
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation with a set of reference summaries or translations.
    • CIDEr (Consensus-based Image Description Evaluation): A metric used for evaluating image captioning models. It measures the consensus between the generated caption and a set of human-generated reference captions.
    • METEOR (Metric for Evaluation of Translation with Explicit Ordering): A metric for the evaluation of machine translation output that correlates well with human judgment. It considers exact word matches, stemmed word matches, and synonym matches between the machine translation and reference translations.
    • BERTScore: A metric that leverages contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers) to compute a similarity score between generated and reference sentences. It measures semantic similarity rather than just surface-level word overlap.
    • Moverscore: Another metric for text generation that uses word embeddings (numerical representations of words) to measure the semantic distance between generated and reference texts, considering the "cost" of moving words from one text to another.
  • Likert Scale: A psychometric scale commonly used in surveys to measure attitudes, opinions, or perceptions. Respondents indicate their level of agreement or disagreement with a statement (e.g., 1=Strongly Disagree, 5=Strongly Agree).
  • Objective Structured Clinical Examination (OSCE): A modern type of examination used in health sciences to assess clinical competence, often using simulated patient encounters or tasks. It emphasizes objective and standardized evaluation of practical skills.
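
To make the classification metrics under "Evaluation Metrics (General)" above concrete, here is a minimal, self-contained Python sketch (not from the paper) that computes accuracy, precision, recall (sensitivity), specificity, and F1-score from binary labels, mirroring the formulas given in that list.

```python
from typing import Sequence

def binary_classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Compute basic classification metrics from binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # max(..., 1) guards against division by zero on empty classes.
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)          # sensitivity
    specificity = tn / max(tn + fp, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Example: ground-truth diagnoses vs. an LLM's yes/no predictions
print(binary_classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```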

3.2. Previous Works

The paper extensively references prior studies to contextualize its review. Key previous works mentioned or implied include:

  • General LLM Applications: Early works explored LLMs in diverse domains like software engineering [1,2] and higher education [3], demonstrating their broad natural language understanding and content generation capabilities.
  • Medical LLM Potential: Researchers like Singhal et al. [4] and Thirunavukarasu et al. [5] began exploring LLMs specifically for medical tasks, from clinical decision-making to patient education. This laid the groundwork for specialized medical LLMs.
  • Tailored Medical LLMs: Recognizing limitations of general LLMs in medical contexts (e.g., interpreting medical images, grasping clinical nuances) [7,8], studies developed LLMs specifically trained or fine-tuned for medical applications, such as EyeGPT [9], FFA-GPT [10], HuatuoGPT [11], and Generalist Biomedical AI (e.g., Med-PaLM) [4,12]. These demonstrated improved performance on medical tasks.
  • MultiMedQA [4]: A notable evaluation dataset proposed by Singhal et al. for assessing medical LLMs, combining multiple medical question-answering datasets.
  • MultiMedEval [16]: An open-source evaluation toolkit for comprehensively assessing LLMs on medical multiple-choice question (MCQ) tasks, covering general medical knowledge and radiological image interpretation.
  • CMedExam [40]: A comprehensive Chinese medical exam dataset used to benchmark LLMs, revealing that models like GPT-4, while capable, still fall short of human-level performance.
  • LLM Agent Development: The paper highlights the emergence of AI agent systems driven by LLMs to address multimodal, multitask demands in real-world medical workflows. This includes architectures that integrate expert AI models as tools.
  • Agent Benchmarks:
    • AgentBench [104]: The first cross-domain benchmark designed to evaluate LLM agents across eight simulated environments, revealing significant performance gaps in open-ended decision-making.
    • RadABench [105]: A pioneering work in evaluating radiology-specific agents by simulating tool-rich workflows across various anatomies, imaging modalities, tool categories, and radiology tasks.
  • OSCE-style Evaluation: Tu et al. [106] innovatively assessed LLM performance using a remote Objective Structured Clinical Examination (OSCE) style, involving patient actors and specialists evaluating conversations.
  • CRAFT-MD [107]: The CRAFT-MD evaluation framework by Johri et al. shifted benchmarking from traditional structured exams to natural dialogue with AI agents, focusing on realistic doctor-patient interactions, comprehensive history collection, and combined automated/expert evaluation.

Background on Text Generation Metrics (as they are crucial for understanding this paper's evaluation methods, even if not fully detailed in the provided text):

  • BLEU (Bilingual Evaluation Understudy) [109]:

    • Conceptual Definition: BLEU measures the similarity between a candidate text (generated by an LLM) and a set of reference texts (human-written). It focuses on n-gram precision, counting how many short sequences of words (n-grams) in the candidate text appear in the reference text. A penalty is applied for brevity to prevent very short, high-precision outputs.
    • Mathematical Formula: $ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ Where:
      • $\text{BP}$ is the brevity penalty, which penalizes candidate sentences that are too short compared to the reference.
      • $N$ is the maximum $n$-gram order (typically 4).
      • $w_n$ is the weight for each $n$-gram precision (typically $1/N$).
      • $p_n$ is the modified $n$-gram precision, calculated as: $ p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \min(\text{Count}(\text{n-gram}), \text{Max\_Ref\_Count}(\text{n-gram}))}{\sum_{\text{n-gram} \in \text{candidate}} \text{Count}(\text{n-gram})} $, where $\text{Max\_Ref\_Count}(\text{n-gram})$ is the maximum count of the $n$-gram in any single reference translation.
    • Symbol Explanation:
      • $\text{BP}$: Brevity penalty, a factor to penalize overly short generated texts.
      • $N$: The highest $n$-gram order considered (e.g., 4, covering 1-grams through 4-grams).
      • $w_n$: Weight assigned to the precision of $n$-grams of length $n$.
      • $p_n$: Modified precision for $n$-grams of length $n$.
      • $\text{Count}(\text{n-gram})$: The count of a specific $n$-gram in the candidate text.
      • $\text{Max\_Ref\_Count}(\text{n-gram})$: The maximum count of that specific $n$-gram in any of the reference texts.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • Conceptual Definition: ROUGE measures the overlap between a candidate summary/translation and one or more reference summaries/translations. Unlike BLEU, which focuses on precision, ROUGE primarily focuses on recall, measuring how many $n$-grams from the reference text are present in the candidate text. Different variants exist (ROUGE-N for $n$-grams, ROUGE-L for longest common subsequence, ROUGE-S for skip-bigrams).
    • Mathematical Formula (ROUGE-N, where $N$ is the length of the $n$-gram): $ \text{ROUGE-N} = \frac{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})} $
    • Symbol Explanation:
      • $\text{Count}_{\text{match}}(\text{n-gram})$: The number of $n$-grams co-occurring in both the candidate and reference summaries.
      • $\text{Count}(\text{n-gram})$: The number of $n$-grams in the reference summary.
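
As a computational illustration of the BLEU and ROUGE definitions above, the following sketch (not from the paper) implements clipped n-gram precision with a brevity penalty (a simplified sentence-level BLEU) and ROUGE-N recall using only the Python standard library; real evaluations typically rely on established toolkits with smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n: candidate counts are clipped by the max count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[g] = max(max_ref_counts[g], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def sentence_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU with uniform weights and a brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) undefined; real implementations apply smoothing instead
    log_avg = sum(math.log(p) for p in precisions) / max_n
    closest_ref = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(closest_ref) else math.exp(1 - len(closest_ref) / len(candidate))
    return bp * math.exp(log_avg)

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that also appear in the candidate."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

cand = "the patient shows signs of pneumonia".split()
ref = "the patient has signs of community acquired pneumonia".split()
# max_n=2 keeps the toy example non-zero (no 4-gram overlap here)
print(sentence_bleu(cand, [ref], max_n=2), rouge_n_recall(cand, ref, n=1))
```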

3.3. Technological Evolution

The evolution of AI in healthcare, as implied by the paper, has progressed through several stages:

  1. Early AI/Traditional Machine Learning: Focused on specific, narrow tasks with structured data, often requiring extensive feature engineering. Evaluations were typically single-task, accuracy-centric.

  2. General LLMs: The advent of large transformer models (e.g., early GPT versions) revolutionized NLP, demonstrating impressive capabilities in language understanding and generation across diverse domains. Initial medical applications involved basic information retrieval or question answering.

  3. Specialized Medical LLMs: Recognizing the unique demands and high-stakes nature of healthcare, researchers developed or fine-tuned LLMs specifically for medical tasks (e.g., Med-PaLM, EyeGPT). These models were trained on vast medical datasets to gain domain-specific knowledge, significantly improving performance on tasks like medical exam questions.

  4. Multimodal LLMs: As healthcare data is inherently multimodal (text, images, signals), the integration of vision and language models became crucial. These models (e.g., GPT-4V) could process both medical images and textual queries, paving the way for tasks like visual question answering and automated report generation.

  5. LLM Agents: The latest evolution involves LLM agents, which are LLMs enhanced with reasoning, planning, and tool-use capabilities. These agents act as orchestrators, using the LLM's "brain" to understand complex instructions, interact with specialized AI models (tools), and manage multi-step workflows. This development is crucial for simulating real-world clinical scenarios that involve sequential decision-making and diverse data types.

    This paper's work fits into the current frontier of this timeline, specifically addressing the challenges of evaluating these sophisticated LLM agents and multimodal LLMs in complex, real-world medical applications, moving beyond single-task, accuracy-focused evaluations.

3.4. Differentiation Analysis

Compared to previous studies, the core differences and innovations of this paper's approach lie in its comprehensive, structured, and forward-looking review of evaluation practices rather than proposing a new model or algorithm.

  • Breadth and Structure: While many prior works focus on evaluating a specific LLM on a specific task or dataset (e.g., ChatGPT on USMLE [13], GPT-4 on ophthalmology images [7]), this paper provides a broad, systematic overview of the entire evaluation landscape. It rigorously categorizes data sources, task scenarios, and evaluation methods/dimensions, offering a holistic framework (as visualized in Figure 1 and Figure 2).

  • Focus on LLM Agents: The paper explicitly incorporates the evaluation of LLM agents, which is a relatively newer and more complex area compared to standalone LLMs. It identifies specific evaluation dimensions for agents, such as tool usage capability, reasoning capability, and workflow management (Figure 6), which are often overlooked in traditional LLM evaluations.

  • Multimodal Emphasis: It dedicates a significant section to image processing tasks, acknowledging the multimodal nature of medical data and the increasing importance of visual-language models.

  • Beyond Accuracy: The paper moves beyond traditional accuracy metrics to discuss the need for evaluating more nuanced aspects like hallucination, bias, interpretability, empathy, safety, and real-world applicability, which are crucial in high-stakes medical contexts.

  • Call for Interdisciplinary Collaboration: It strongly emphasizes the necessity of interdisciplinary collaboration between healthcare professionals and computer scientists, not just in development but critically in establishing robust evaluation frameworks and benchmarks.

  • Future-Oriented Critique: It not only summarizes current practices but also critically identifies existing limitations and proposes concrete future research directions (Figure 7), including the need for better datasets (especially VQA), refined assessment standards, and more realistic clinical trial designs.

    Essentially, this paper acts as a meta-analysis and a roadmap for the responsible development and deployment of medical AI, distinguishing itself by its comprehensive scope and proactive identification of evolving evaluation needs.

4. Methodology

4.1. Principles

The core idea of the method used in this paper is a systematic review of existing literature on the evaluation of large language models (LLMs) and LLM agents in medical applications. The theoretical basis is to synthesize current knowledge, identify common practices, pinpoint gaps and challenges, and propose future directions for robust and safe AI integration into healthcare. This approach allows for a comprehensive understanding of the evolving landscape of medical AI evaluation.

4.2. Core Methodology In-depth (Layer by Layer)

The paper employed a systematic review methodology to gather and analyze relevant studies. The steps are as follows:

  1. Adherence to Guidelines: The research explicitly states that it followed the systematic review guidelines set forth by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). PRISMA is an evidence-based minimum set of items for reporting in systematic reviews and meta-analyses. It helps authors report why the review was done, what they did, and what they found.

  2. Systematic Search Strategy:

    • Databases: A systematic search was carried out across three major scientific databases: PubMed, Google Scholar, and Web of Science. These databases are widely recognized for their comprehensive coverage of peer-reviewed journal articles and conference proceedings in medical and computer science fields.
    • Publication Period: The initial search focused on articles published between January 1, 2023, and November 13, 2024. This specific timeframe ensures that the review captures the most recent advancements in the rapidly evolving field of LLMs.
    • Keywords: Specific keywords were employed to ensure relevance: "Large Language Model," "ChatGPT," "AI Agent," "LLM Agent," "Medical," "Medicine," "Evaluation," and "Assess." The use of multiple related terms helps to cast a wide net and capture all pertinent studies.
  3. Inclusion Criteria: Studies were included in the review if they met the following conditions:

    • They had applied LLMs in the medical field.
    • They performed an adequate assessment of the LLMs' performance.
  4. Exclusion Criteria: Studies were excluded if they met any of these conditions:

    • They were not relevant to medical applications.
    • They demonstrated methodological limitations, particularly those lacking formal evaluation protocols, statistical validation, or sample sizes smaller than 20. This criterion ensures the quality and rigor of the included studies.
  5. Study Selection and Refinement:

    • Representative studies from different task scenarios and evaluation methods were selected and cited as examples throughout the paper.
    • To minimize recency bias and capture the very latest developments, an updated search was conducted on February 25, 2025. This identified an additional 5 eligible peer-reviewed articles and 4 relevant preprints that met the inclusion criteria. This iterative search process is crucial in fast-moving fields.
    • After the filtering process, a total of 256 studies were included in the literature review. The process of article identification, screening, eligibility, and inclusion is typically summarized in a PRISMA flow diagram (mentioned as Supplementary Figure 1, though the figure itself is not provided in the prompt).
  6. Synthesis and Categorization of Findings: Once the relevant studies were identified, the authors synthesized their findings and structured the review around three main aspects, as outlined in the abstract and main sections of the paper:

    • Data Sources: How are medical data sets prepared and used for evaluation? (Section 3)

    • Task Scenarios: What are the common medical tasks LLMs are applied to and evaluated against? (Section 4)

    • Evaluation Methods and Dimensions: What techniques and criteria are used to assess LLM performance? (Sections 5 and 6)

      This structured approach allows the paper to provide a comprehensive and organized overview of the complex topic of LLM evaluation in medicine.
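
The keyword strategy described above can be expressed as a boolean search string; the snippet below is an illustrative reconstruction (not the authors' exact query), using PubMed-style syntax whose field tags may differ across the three databases.

```python
model_terms = ["Large Language Model", "ChatGPT", "AI Agent", "LLM Agent"]
domain_terms = ["Medical", "Medicine"]
eval_terms = ["Evaluation", "Assess"]

def or_block(terms):
    """Join a keyword group into a quoted OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Combine the three keyword groups with AND; the date range mirrors the review's initial window.
query = " AND ".join([or_block(model_terms), or_block(domain_terms), or_block(eval_terms)])
date_filter = '("2023/01/01"[Date - Publication] : "2024/11/13"[Date - Publication])'
print(f"{query} AND {date_filter}")
```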

The overall illustration of the potential LLM evaluation framework in medicine is conceptualized in Figure 1:

Figure 1. Illustration of the potential LLM evaluation framework in medicine. LLM: large language model.

This figure provides a high-level overview, showing that LLM evaluation involves Data Sources, Task Scenarios, Evaluation Methods, and Evaluation Dimensions, all feeding into the assessment of medical LLMs.

Figure 2 further details the relationships between data sources, task scenarios, and evaluation dimensions:

Figure 2. Relationships between data sources, task scenarios, and evaluation dimensions. EHR: electronic health record; OSCE: objective structured clinical examination.

This Sankey diagram illustrates the flow from Data Sources (Existing Medical Resources, Manually Curated Questions) to Task Scenarios (Closed-ended, Open-ended, Image Processing, Real-world Multitask Scenarios) and finally to Evaluation Dimensions (Accuracy, Safety, Completeness, Readability, Bias, Hallucination, Tool Usage, Reasoning, etc.). It visually represents how different data types are used for various tasks, which in turn are assessed using a range of metrics.

5. Experimental Setup

As this paper is a systematic review, it does not describe its own experimental setup in the traditional sense (i.e., it didn't conduct experiments with LLMs itself). Instead, it reviews the experimental setups, datasets, evaluation metrics, and baselines used by other studies in the field. This section will summarize those elements as presented in the paper.

5.1. Datasets

The paper identifies two broad categories of data sources used in the evaluation of LLMs and LLM agents in medical applications:

  1. Existing Medical Resources: These are readily available, often standardized materials that assess the competency of healthcare professionals.

    • Medical Examinations:
      • Source: Crafted through generations of expertise, accompanied by standardized answers.
      • Scale & Characteristics: Provide a substantial volume of validated material. Cover diverse medical knowledge.
      • Examples:
        • General Medical Knowledge: United States Medical Licensing Examination (USMLE) [13,17], National Medical Licensing Examination in China [18], National Pharmacist Licensing Examination in China, National Nurse Licensing Examination in China [19], Chinese Master's Degree Entrance Examination [20].
        • Specific Subspecialties: Ophthalmic Knowledge Assessment Program examination [21], Basic Science and Clinical Science Self-Assessment Program [22], oral and written board examinations for the American Board of Neurological Surgery (ABNS) [23], Otolaryngology-Head and Neck Surgery Certification Examinations [24], Royal College of General Practitioners Applied Knowledge Test [25], European Board of Radiology exam [26].
      • Purpose: Assess LLMs' general medical knowledge and specialty knowledge depth and breadth.
    • Medical Literature:
      • Source: Peer-reviewed journal articles and conference papers [27-29].
      • Characteristics: Offer cutting-edge medical insights and research findings.
      • Purpose: Assess whether LLMs can rapidly update medical knowledge and summarize research evidence.
  2. Manually Curated Questions: These datasets are designed to overcome the limitations of standardized exams in reflecting real-world interactions and dynamic capabilities.

    • Real-world Interactions and Discussions:
      • Source: Medical forums and social media [30-32].
      • Purpose: Evaluate LLMs' conversational and consultative skills.
    • Medical Images with Expert Reports:
      • Source: Clinically derived image data often accompanied by expert medical reports.
      • Examples: X-rays, magnetic resonance imaging (MRI), computed tomography (CT) in radiology [33]; fundus photographs, fundus fluorescein angiography (FFA), and optical coherence tomography (OCT) images in ophthalmology [34].
      • Purpose: Crucial for constructing multimodal datasets to test LLMs' ability to handle complex visual and textual information for tasks like disease diagnosis and image analysis.
    • Expert-Crafted Questions:
      • Source: Carefully formulated by healthcare professionals based on clinical expertise [35-37].
      • Examples: MultiMedQA [4] (a composite evaluation dataset), questions centered around symptoms, examinations, and treatments of uveitis [38], or a dataset of 314 clinical questions spanning 9 medical specialties generated by clinicians and healthcare practitioners [39].
      • Characteristics: Highly specialized, practical insights into nuanced understanding required in real-world clinical practice, ensuring integrity by being excluded from training data.
      • Purpose: To provide specialized and practical insights into LLM proficiency in handling specific content.

Example of a data sample (implied through task examples):

  • Closed-ended task (MCQ): A multiple-choice question from a medical exam, e.g., "Which of the following is the most common cause of community-acquired pneumonia in adults?" with options A, B, C, D.

  • Open-ended task (Summarization): A long medical research paper or an electronic health record note that needs to be condensed into a concise summary.

  • Image Processing task (VQA): An X-ray image paired with a question like "Is there evidence of cardiomegaly in this chest X-ray?"

    The paper emphasizes that collecting appropriate data can be challenging due to scarce resources in certain disciplines, as well as ethical considerations and data privacy issues.
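
To make these data-sample formats concrete, the snippet below sketches hypothetical record structures for a closed-ended MCQ item, an open-ended summarization item, and a VQA item; the field names and file path are illustrative assumptions, not taken from the paper or any specific benchmark.

```python
# Hypothetical evaluation records; all field names are illustrative only.
mcq_item = {
    "task": "closed_ended_mcq",
    "question": "Which of the following is the most common cause of community-acquired pneumonia in adults?",
    "options": {"A": "Streptococcus pneumoniae", "B": "Mycoplasma pneumoniae",
                "C": "Haemophilus influenzae", "D": "Legionella pneumophila"},
    "answer": "A",  # ground-truth key used for automated accuracy scoring
}

summarization_item = {
    "task": "open_ended_summarization",
    "source_text": "<full discharge note or research abstract>",
    "reference_summary": "<expert-written summary used for ROUGE/BLEU and human review>",
}

vqa_item = {
    "task": "image_vqa",
    "image_path": "images/chest_xray_0001.png",  # placeholder path
    "question": "Is there evidence of cardiomegaly in this chest X-ray?",
    "reference_answer": "No",
}
```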

5.2. Evaluation Metrics

The paper reviews a wide array of evaluation metrics, categorized by task type and evaluation method (automated vs. human).

5.2.1. Automated Evaluation Metrics

These metrics objectively assess LLM performance using algorithms.

  • Classification Tasks: Used for tasks where the LLM predicts a category or label.

    • Accuracy: The proportion of correct predictions out of the total number of predictions. $ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $
      • $\text{TP}$: True positives (correctly predicted positive).
      • $\text{TN}$: True negatives (correctly predicted negative).
      • $\text{FP}$: False positives (incorrectly predicted positive).
      • $\text{FN}$: False negatives (incorrectly predicted negative).
    • Specificity: The proportion of actual negatives that are correctly identified as negative. $ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $
      • $\text{TN}$: True negatives.
      • $\text{FP}$: False positives.
    • Precision: The proportion of positive predictions that are actually correct. $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $
      • $\text{TP}$: True positives.
      • $\text{FP}$: False positives.
    • Sensitivity (Recall): The proportion of actual positives that are correctly identified. $ \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
      • $\text{TP}$: True positives.
      • $\text{FN}$: False negatives.
    • F1-score: The harmonic mean of Precision and Recall, balancing both metrics. $ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
      • $\text{Precision}$: Defined above.
      • $\text{Recall}$: Defined above.
    • AUROC (Area Under the Receiver Operating Characteristic Curve): A measure of the overall diagnostic ability of a binary classifier system, indicating its ability to discriminate between positive and negative classes across various thresholds. (Its formula involves integration and is generally represented by the area under the ROC curve, which plots Sensitivity vs. (1-Specificity)).
  • Long-Text Generation Tasks (e.g., summarization, report generation):

    • BLEU (Bilingual Evaluation Understudy) [29,91,109,110]: Measures n-gram overlap between generated and reference texts, primarily focusing on precision. $ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $
      • $\text{BP}$: Brevity penalty, penalizes short candidate sentences.
      • $N$: Maximum $n$-gram order.
      • $w_n$: Weight for $n$-gram precision.
      • $p_n$: Modified $n$-gram precision.
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [29,91,109,110]: Measures overlap between generated and reference texts, primarily focusing on recall. $ \text{ROUGE-N} = \frac{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \text{Reference Summaries}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})} $
      • $\text{Count}_{\text{match}}(\text{n-gram})$: Number of $n$-grams co-occurring in the candidate and the reference.
      • $\text{Count}(\text{n-gram})$: Number of $n$-grams in the reference summary.
    • CIDEr (Consensus-based Image Description Evaluation) [91]: Specifically used for image captioning. For $n$-grams of a given length $n$: $ \text{CIDEr}_n(c, S) = \frac{1}{|S|} \sum_{s \in S} \frac{g^n(c) \cdot g^n(s)}{\lVert g^n(c) \rVert \, \lVert g^n(s) \rVert} $, and the overall score averages $\text{CIDEr}_n$ over $n = 1, \dots, 4$.
      • $c$: Candidate caption.
      • $S$: Set of reference captions.
      • $g^n(\cdot)$: The vector of TF-IDF weights over all $n$-grams of length $n$ in a caption; each term in the sum is the cosine similarity between the candidate's and a reference's TF-IDF $n$-gram vectors.
    • METEOR (Metric for Evaluation of Translation with Explicit Ordering) [109]: Considers exact word matches, stemmed word matches, and synonym matches. $ \text{METEOR} = (1 - \text{Penalty}) \times F_{\text{mean}} $, with $ F_{\text{mean}} = \frac{10 \times P \times R}{R + 9 \times P} $
      • $\text{Penalty}$: A fragmentation penalty for disfluencies (gaps and reordering) in the matching.
      • $P$: Unigram precision.
      • $R$: Unigram recall.
    • BERTScore [10,111]: Measures semantic similarity using contextual embeddings. In its recall form: $ R_{\text{BERT}} = \frac{1}{|r|} \sum_{r_i \in r} \max_{c_j \in c} \text{cosine}(\text{embedding}(r_i), \text{embedding}(c_j)) $, where $r$ is the reference and $c$ the candidate. (This is a simplified representation; the full BERTScore computes precision, recall, and F1 by greedily matching tokens based on their BERT embeddings and taking the maximum cosine similarity.)
    • Moverscore [10,111]: Measures semantic distance using word embeddings, considering "cost" of moving words. (Detailed formula is complex and involves optimal transport, but conceptually it measures the minimum 'work' to transform one text's word embeddings into another's.)
    • QAEval and QAFactEval [46]: Specialized tools used for evaluating factual consistency in open-ended question-answering scenarios, especially for medical text generation.
  • Readability Metrics: Used to assess the fluency and ease of understanding of generated text [37]. These do not typically have simple, single formulas but involve algorithms based on sentence length, word length, and syllable count.

    • Flesch Reading Ease Score
    • Flesch-Kincaid grade level
    • Gunning Fog Index
    • Coleman-Liau Index
    • Simple Measure of Gobbledygook (SMOG)
  • Image Segmentation Metrics [97]:

    • Dice coefficient (or F1-score): Measures the overlap between the predicted segmentation and the ground truth. $ \text{Dice} = \frac{2 \times |\text{Predicted} \cap \text{Ground Truth}|}{|\text{Predicted}| + |\text{Ground Truth}|} $
      • $|\text{Predicted} \cap \text{Ground Truth}|$: The number of overlapping pixels/voxels between the predicted and ground-truth segmentations.
      • $|\text{Predicted}|$: The total number of pixels/voxels in the predicted segmentation.
      • $|\text{Ground Truth}|$: The total number of pixels/voxels in the ground-truth segmentation.
    • Intersection Over Union (IoU): Also known as Jaccard Index, measures the ratio of the intersection to the union of the predicted and ground truth segmentations. $ \text{IoU} = \frac{|\text{Predicted} \cap \text{Ground Truth}|}{|\text{Predicted} \cup \text{Ground Truth}|} $
      • $|\text{Predicted} \cap \text{Ground Truth}|$: The number of overlapping pixels/voxels.
      • $|\text{Predicted} \cup \text{Ground Truth}|$: The total number of pixels/voxels covered by either the predicted or the ground-truth segmentation.
    • 95th percentile of Hausdorff distance (95-HD): Measures the maximum distance of a point in one set to the nearest point in the other set, focusing on the boundary distances. The 95th percentile ignores a small fraction of outliers. (Formula is complex, involves calculating distances between points of two sets.)
  • Cross-modal Retrieval Metrics [98]:

    • R@K (Recall@K): Measures the proportion of queries for which the correct item is found within the top K retrieved results. $ \text{R@K} = \frac{\text{Number of queries where correct item is in top K}}{\text{Total number of queries}} $
    • MRR (Mean Reciprocal Rank): The average of the reciprocal ranks of the correct answer for a set of queries. If the correct answer is at rank $k$, its reciprocal rank is $1/k$. $ \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} $
      • $|Q|$: Total number of queries.
      • $\text{rank}_i$: Rank of the first relevant document for the $i$-th query.
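
As a concrete illustration of the segmentation and retrieval metrics listed above, the following sketch (not from the paper) computes the Dice coefficient and IoU on binary masks with NumPy, and MRR over ranked retrieval results.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for boolean masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom else 1.0  # both masks empty -> treat as perfect match

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """IoU (Jaccard) = |P ∩ G| / |P ∪ G| for boolean masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

def mean_reciprocal_rank(ranked_results, relevant_items) -> float:
    """MRR over queries: ranked_results[i] is the ranked list for query i, relevant_items[i] its correct item."""
    reciprocal_ranks = []
    for ranking, target in zip(ranked_results, relevant_items):
        rank = next((k + 1 for k, item in enumerate(ranking) if item == target), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return float(np.mean(reciprocal_ranks)) if reciprocal_ranks else 0.0

pred_mask = np.array([[1, 1, 0], [0, 1, 0]])
true_mask = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_coefficient(pred_mask, true_mask), iou(pred_mask, true_mask))
print(mean_reciprocal_rank([["b", "a", "c"]], ["a"]))  # correct item at rank 2 -> MRR = 0.5
```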

5.2.2. Human Evaluation Metrics

These involve human experts or users to assess qualitative aspects.

  • Qualitative Methods: Case studies [112] allow evaluators to compare LLM output against ground truth.
  • Standardized Scales:
    • DISCERN scale [113]: For judging the quality of written consumer health information.
    • JAMA benchmark criteria [114]: For assessing the quality of medical information.
    • Global Quality scale [115].
  • Error Analysis: Identifying and statistically analyzing factual errors, logical errors, unrelated information, incomplete responses, and faulty logic [9,109].
  • Custom Grading Rules / Likert-style Scales: Used to assess text quality across multiple levels.
    • Example: Samaan et al. [31] used a 4-point scale for accuracy and comprehensiveness (1=comprehensive, 4=completely incorrect).
    • Example: Chen et al. [91] used a 5-point Likert scale (1=strongly disagree to 5=strongly agree) for completeness and correctness.
  • Evaluator Diversity: Assessments are conducted by professional physicians [53,117-119], but also non-professionals such as patients and the general public [32,35] to capture user-centered perspectives.
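
A minimal sketch of how Likert-style ratings such as those above might be aggregated across evaluators; the rating data, dimension names, and the simple exact-agreement statistic are illustrative assumptions rather than a protocol from any reviewed study.

```python
from statistics import mean

# Hypothetical 5-point Likert ratings (1 = strongly disagree ... 5 = strongly agree)
# ratings[dimension][evaluator] -> list of per-response scores
ratings = {
    "completeness": {"physician_1": [5, 4, 3, 5], "physician_2": [4, 4, 3, 5]},
    "correctness":  {"physician_1": [5, 5, 2, 4], "physician_2": [5, 4, 2, 4]},
}

for dimension, by_rater in ratings.items():
    raters = list(by_rater)
    all_scores = [s for scores in by_rater.values() for s in scores]
    # Fraction of items on which both raters gave the identical score (crude agreement check)
    pairs = zip(by_rater[raters[0]], by_rater[raters[1]])
    agreement = mean(1.0 if a == b else 0.0 for a, b in pairs)
    print(f"{dimension}: mean={mean(all_scores):.2f}, exact agreement={agreement:.0%}")
```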

5.3. Baselines

The baselines used in the reviewed studies vary widely depending on the task and the LLM being evaluated. Common baselines include:

  • Other LLM Versions/Models: For example, comparing ChatGPT-3.5 with ChatGPT-4 [20,41] or GPT-4V against traditional convolutional neural network (CNN) models [86].

  • Human Performance: Often, LLMs are benchmarked against the performance of medical professionals (e.g., physicians, radiologists) or human-level performance on medical exams [13,40,55].

  • Traditional AI Models: In image processing, vision-language models (VLMs) are compared to specialized CNN models that might have been trained for specific tasks [86].

  • No Intervention / Existing Practices: In clinical utility assessments, LLMs might be compared to current healthcare practices or the absence of AI assistance.

  • Open-source vs. Closed-source Models: Comparisons are made between closed-source models (e.g., proprietary GPT models) and open-source models (e.g., LLaMA-based models) [16].

    These baselines are chosen to demonstrate the relative capabilities of LLMs, show their advantages over traditional methods, and highlight areas where they still fall short of human expertise.

6. Results & Analysis

The paper synthesizes findings from 256 studies, providing a comprehensive overview of LLM and LLM agent evaluation across various dimensions.

6.1. Core Results Analysis

6.1.1. Data Sources Used in Evaluations

The review highlights that evaluation data for medical LLMs primarily comes from two categories:

  • Existing Medical Resources: Standardized medical examinations (e.g., USMLE, Chinese national exams) are widely used due to their validated material and established answers. These are excellent for assessing general medical knowledge and subspecialty expertise. Medical literature also serves as a knowledge repository to test LLMs' ability to update knowledge.
  • Manually Curated Questions: To address the limitations of standardized exams in reflecting real-world interactions, researchers create questions based on clinical expertise or collect data from medical forums/social media. Multimodal datasets are constructed using medical images (e.g., X-rays, MRI, OCT) paired with expert reports, crucial for evaluating visual-textual understanding. Challenges in data collection include scarcity in certain disciplines, ethical considerations, and data privacy.

6.1.2. Key Medical Task Scenarios

The paper categorizes tasks into four main scenarios:

  1. Closed-ended Tasks:

    • Nature: Typically multiple-choice questions (MCQ) with definite answers, common in medical education and suitable for large-scale, quantifiable evaluation (Figure 3, left).
    • Performance: LLMs (e.g., GPT-4) show promising accuracy (e.g., 61.6% on CMedExam [40], 69% on radiology board questions [42]) and can pass admission thresholds in some exams (e.g., Chinese Master's exams for ChatGPT-3.5 and ChatGPT-4 [20]). However, their performance can still be significantly lower than human level (e.g., 71.6% for humans vs. 61.6% for GPT-4 on CMedExam [40]) and sometimes fail to meet passing thresholds in specialized fields (e.g., gastroenterology self-assessment [41]).
    • Limitations: LLMs exhibit biases, such as sensitivity to answer positions in bilingual MCQs [44]. They focus on procedural knowledge and may lack deep assessment of complex situations or real-world scenarios.
  2. Open-ended Tasks:

    • Nature: Require advanced natural language processing (NLP) and the ability to generate diverse, expressive responses (Figure 3, right). Evaluation dimensions are more complex.
    • Summarization:
      • Capabilities: LLMs can summarize medical research evidence and electronic health records (EHRs). Studies show they can reduce time consumption and achieve high factual correctness for discharge summaries and operative reports [51]. They can also transform discharge summaries into patient-friendly language [52].
      • Limitations: Susceptible to generating incorrect information, with potential for harm due to misinformation [29]; summaries may also contain inappropriate omissions or insertions. Notably, however, some LLM-generated summaries have been judged to outperform those of medical experts in clinical text summarization [55].
    • Information Extraction:
      • Named Entity Recognition (NER): LLMs show strong performance in identifying medical entities (genes, proteins, diseases) from text, even achieving high F1-scores (e.g., 0.990 for stroke severity extraction from EHRs [58]). They can extract social determinants of health (SDoH) from clinic notes [59].
      • Relation Extraction: LLMs are evaluated for relations between entities like drug-drug interaction or gene-disease associations [56,57,60]. Fine-tuned models like ChIP-GPT demonstrate high accuracy (90-94%) in extracting specific data from biomedical databases [61].
    • Medical Question Answering (QA):
      • Capabilities: LLMs provide responses across various medical disciplines (disease concepts, etiology, diagnosis, treatment). They demonstrate empathy in answering patient questions [68,64].
      • Reliability: Focus is on response reliability due to impact on healthcare decisions.
      • Limitations: Performance varies by subspecialty; for example, ChatGPT's performance on lacrimal drainage queries was average [72], and it struggled with treatment options for retinal diseases [73,74].
      • Evaluation: Diverse sources (author-designed, professional societies, social media, clinical cases). Human evaluation is often necessary due to the inability of automated metrics to fully assess correctness, medical knowledge updates, and patient implications. Beyond accuracy, completeness [31,64,65,67,68,76-80] and readability [64,76,77,81-83] are crucial. Safety (potential negative consequences) [84] and humanistic care/emotional support [68,64] are also evaluated. Concerns exist about LLM safeguards against generating health disinformation [85].
  3. Image Processing Tasks:

    • Nature: Involves jointly modeling image and text information for tasks like diagnosis and report generation (Figure 4).
    • Image Classification:
      • Capabilities: LLMs perform well on various medical image classification tasks (e.g., using F1, AUROC, accuracy on MIMIC-CXR, Pad-UFES20 [16]). Vision-language models (VLMs) show promise in zero-shot and few-shot scenarios compared to traditional CNNs, especially with cross-lingual regularization (Med-UniC framework) [87].
      • Limitations: Need improvement in recognizing fine-grained visual concepts [16].
    • Report Generation:
      • Capabilities: Aim to automate diagnostic report generation from images. GPT-4V can generate descriptive reports with structured prompts [88]. Models like R2GenGPT [89] and LLM-CXR [90] show promise in capturing lesion characteristics. ICGA-GPT achieved satisfactory automated metrics and high human accuracy/completeness scores for ophthalmic reports [91].
      • Limitations: Accuracy of generating specific medical terms needs improvement [88]. Many evaluations lack human assessment.
    • Visual Question Answering (VQA):
      • Nature: Multimodal task requiring image comprehension, contextual understanding, and medical knowledge reasoning. High demands on cross-modal understanding and natural language expression (Figure 4, panel D).
      • Performance: GPT-4V excels in distinguishing question types [93] and identifying ophthalmic examination types (e.g., 95.6% accuracy [7]) but shows significant limitations in lesion identification (e.g., 25.6% accuracy for GPT-4V [7], 65% for ChatGPT-4 [15]).
      • Challenges: Hallucination is a critical issue; models generate plausible but incorrect answers, especially for detailed observations or complex reasoning [94,95,96]. Lack of large-scale, high-quality VQA datasets in specific medical specialties (e.g., ophthalmology) [7].
      • Evaluation: Combines automated (accuracy, F1, BLEU) and manual assessments for expertise and appropriateness.
    • Other Tasks: LLMs show potential in medical image segmentation (e.g., LLMSeg [97] for target delineation in radiotherapy) and cross-modal retrieval (e.g., GPT-CMR [98] for medical video QA). Evaluation benchmarks for these emerging tasks are still inadequate.
  4. Real-world Multitask Scenarios involving LLM Agents:

    • Nature: Simulate complex clinical workflows involving interdependent subtasks (e.g., diagnosis, image segmentation, report generation, multimodal QA) (Figure 5). Traditional LLMs fall short; agents are more suitable.
    • Agent Architecture Levels:
      • Level 1: Generator agent (with RAG).
      • Level 2: Tool-integrating agent (expert model toolkits).
      • Level 3: Planning agent (reasoning, multi-step workflows).
      • Level 4: High autonomy agent (tool calls, reasoning, planning) [101-103].
    • Evaluation Frameworks: AgentBench [104] (cross-domain, open-ended decision-making gaps). RadABench [105] (radiology-specific, tool-rich workflows).
    • Process-centered Evaluation: Shift toward simulating fluidity of real-world workflows. Remote OSCE-style evaluations (Tu et al. [106]) and CRAFT-MD framework [107] emphasize realistic doctor-patient interactions and comprehensive medical history collection. These frameworks integrate automated and expert evaluation.
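
To ground the agent taxonomy above, here is a minimal sketch of a Level-1 generator agent with retrieval-augmented generation; retrieve_guidelines and call_llm are hypothetical placeholders for a retriever and an LLM API, not components described in the paper.

```python
from typing import Callable, List

def rag_answer(question: str,
               retrieve_guidelines: Callable[[str, int], List[str]],
               call_llm: Callable[[str], str],
               top_k: int = 3) -> str:
    """Level-1 generator agent: ground the LLM's answer in retrieved medical evidence."""
    # 1. Retrieve supporting passages from a trusted knowledge base (e.g., clinical guidelines).
    passages = retrieve_guidelines(question, top_k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # 2. Ask the LLM to answer using only the retrieved context, citing the passages it used.
    prompt = (
        "You are a clinical assistant. Answer the question using ONLY the evidence below, "
        "cite passage numbers, and say 'insufficient evidence' if the context does not cover it.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)

# Usage with toy stand-ins for the retriever and the model:
toy_kb = ["Community-acquired pneumonia in adults is most often caused by Streptococcus pneumoniae."]
answer = rag_answer(
    "What is the most common cause of community-acquired pneumonia in adults?",
    retrieve_guidelines=lambda q, k: toy_kb[:k],
    call_llm=lambda prompt: "Streptococcus pneumoniae [1].",
)
print(answer)
```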

6.1.3. Evaluation Methods and Dimensions

The review distinguishes between automatic and human evaluation, and traditional versus agent-specific dimensions.

  1. Automatic Evaluation:

    • Classification Metrics: Accuracy, specificity, precision, sensitivity, F1-score (for predictions) [108].
    • Text Generation Metrics: BLEU, ROUGE, CIDEr, METEOR (for text overlap/literal accuracy) [29,91,109,110]. BERTScore, Moverscore (for semantic similarity) [10,111]. QAEval, QAFactEval (for factual consistency in QA) [46].
    • Readability Metrics: Flesch Reading Ease Score, Flesch-Kincaid grade level, etc. (for fluency and ease of understanding) [37].
  2. Human Evaluation:

    • Qualitative Methods: Case studies for detailed comparison with ground truth [112].
    • Scoring Protocols: Standardized scales like DISCERN [113], JAMA criteria [114], Global Quality scale [115].
    • Error Analysis: Quantifying factual errors, logical errors, incompleteness [9,109].
    • Custom Scales: Likert-style scales for multi-level assessment (e.g., 4-point for accuracy/comprehensiveness [31], 5-point for completeness/correctness [91]).
    • Evaluator Diversity: Involves professional physicians [53,117-119] and non-professionals (patients, public) [32,35] to capture varied perspectives.
  3. Traditional Evaluation Dimensions (Figure 2):

    • Accuracy: Most common, measured by correct answers or semantic consistency [10,91,120].
    • Completeness, Safety, Communication, User Satisfaction.
    • Hallucination: A critical issue, but no standardized medical evaluation method. Quantified by accuracy below a threshold or categorized errors [9,27,94,95,96].
    • Limited assessment of bias, stability, cost-effectiveness [47,121,122].
    • Comprehensive frameworks (e.g., Singhal et al. [4]) involving 12 aspects (scientific consensus, harm extent/likelihood, comprehension, retrieval, reasoning, bias).
  4. LLM Agent-Specific Evaluation Dimensions (Figure 6):

    • Beyond traditional metrics, agent evaluation requires assessing intermediate processes.
    • Tool Usage Capability (for Level 2+ agents): Tool correctness (was the right tool called?), tool sequencing (optimal combination?), tool efficiency (redundant calls?).
    • Reasoning Capability (for Level 3+ agents): Assesses metacognition and confidence scores [123], as well as the logical chain from patient data to diagnosis or from extracted entities to treatment plans [124].
    • Workflow Management (for Level 3+ agents): Rationale for task decomposition, completion rate of planned actions.
    • Autonomous Assessment (for Level 4 agents): The agent's safety rate during autonomous operation, its ability to improve continuously from feedback, and its recognition of boundaries, i.e., identifying cases it cannot solve and deferring them to humans.
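
As a concrete illustration of the automatic metrics under item 1 above, the sketch below computes classification accuracy, a ROUGE-1-style unigram-overlap F1, and the Flesch Reading Ease score using only the Python standard library. These are simplified approximations for illustration; in practice, validated packages (e.g., scikit-learn, rouge-score, textstat) would be used.

```python
# Simplified, standard-library approximations of three automatic metrics:
# classification accuracy, ROUGE-1-style token-overlap F1, and Flesch Reading Ease.
import re
from collections import Counter


def accuracy(preds, labels):
    """Fraction of predictions that exactly match the reference labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


def token_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style unigram-overlap F1 between a generated and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def _count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real readability tools use dictionaries."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier-to-read text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(_count_syllables(w) for w in words)
    n_words = max(1, len(words))
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)


if __name__ == "__main__":
    print(accuracy(["pneumonia", "normal"], ["pneumonia", "effusion"]))  # 0.5
    print(round(token_f1("bilateral lower lobe pneumonia",
                         "pneumonia in both lower lobes"), 2))           # ~0.44
    print(round(flesch_reading_ease("Take one tablet twice a day with food."), 1))
```

None of these scores captures clinical correctness on its own, which is why the review pairs automated metrics with human evaluation.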

6.1.4. Key Challenges and Opportunities (Discussion Section)

The paper identifies significant challenges and proposes strategies:

  • Data Challenges: Lack of high-quality medical QA and VQA datasets. Burden on medical professionals for dataset construction due to time constraints [125].
    • Opportunity: Human-LLM collaboration to build and optimize evaluation datasets to enhance scalability and reduce manual workload.
  • Assessment Standards Challenges: Traditional focus on accuracy is insufficient. Need for refined classification of hallucinations, assessment of empathy and interpretability [126-128].
    • Opportunity: Develop hierarchical classification systems for hallucinations, integrate agent-specific dimensions (tool usage, reasoning) into targeted frameworks.
  • Evaluation Metrics Challenges: Over-reliance on traditional metrics persists. Advanced LLMs (e.g., GPT-4) are being explored as automated assessors [11,75,129,130] but require validation for stability and robustness; human evaluation remains indispensable. (A minimal LLM-as-judge sketch follows this list.)
    • Opportunity: Develop and validate automated evaluation systems that balance rigor with efficiency, while acknowledging the continued need for human oversight.
  • Practical Application Challenges: Most studies remain at the preclinical validation stage. Rigorous clinical trials, including randomized controlled trials (RCTs), are needed to compare LLMs with existing practices and to assess their value in practical applications [131]. Appropriate endpoints (reducing morbidity, improving efficiency, satisfaction) are also needed.
    • Opportunity: Design trials aligned with real-world clinics, incorporate diverse evaluators (patients, students, public) [32,35], and leverage user feedback for continuous optimization, similar to generalist LLMs [132].
  • Frameworks: Mention of a five-element evaluation framework (models, environments, interfaces, interactive users, leaderboards) [133] as a valuable reference.
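
Because the review highlights advanced LLMs (e.g., GPT-4) as candidate automated assessors, a minimal LLM-as-judge sketch is given below. The rubric wording, the JSON schema, and the call_llm placeholder are assumptions made for illustration, not a protocol from the paper; any such judge would still need validation against expert ratings for stability and robustness.

```python
# Illustrative sketch of LLM-as-judge scoring for a model-generated answer.
# `call_llm` is a placeholder for whatever chat-completion client is available;
# the rubric and JSON schema here are assumptions, not a protocol from the paper.
import json

RUBRIC_PROMPT = """You are a medical evaluation assistant.
Score the candidate answer against the reference on a 1-5 scale for:
accuracy, completeness, and safety. Respond with JSON only, e.g.
{{"accuracy": 4, "completeness": 3, "safety": 5, "rationale": "..."}}

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}"""


def call_llm(prompt: str) -> str:
    """Placeholder: wire this to an actual LLM API; here it returns a fixed reply."""
    return '{"accuracy": 4, "completeness": 3, "safety": 5, "rationale": "stub"}'


def judge(question: str, reference: str, candidate: str) -> dict:
    prompt = RUBRIC_PROMPT.format(question=question, reference=reference,
                                  candidate=candidate)
    reply = call_llm(prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Malformed judge output should be flagged for human review, not guessed at.
        return {"error": "unparseable judge output", "raw": reply}


if __name__ == "__main__":
    print(judge("What is first-line therapy for uncomplicated hypertension?",
                "Lifestyle modification plus a thiazide, ACE inhibitor, ARB, or CCB.",
                "Start lifestyle changes and consider a thiazide diuretic."))
```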

6.2. Data Presentation (Tables)

The provided paper content does not include any data tables in the main body. It refers to a "Supplementary Table 1" which is not provided. The figures included are conceptual diagrams (Figure 1, Figure 2, Figure 6), illustrative examples (Figure 3, Figure 4), an architectural diagram (Figure 5), and a summary chart of challenges and future directions (Figure 7). Therefore, there are no tables to transcribe.

6.2.1. Analysis of Figures

The paper leverages several figures to visually summarize its conceptual framework and task scenarios:

  • Figure 1. Illustration of the potential LLM evaluation framework in medicine. This figure serves as a high-level conceptual map, depicting the interconnected components of LLM evaluation: Data Sources, Task Scenarios, Evaluation Methods, and Evaluation Dimensions. It underscores the multidimensional nature of robust assessment in healthcare.

  • Figure 2. Relationships between Data Sources, Task Scenarios, and Evaluation Dimensions. This Sankey diagram provides a more detailed visualization of the flow. It shows how Existing Medical Resources and Manually Curated Questions feed into different Task Scenarios (Closed-ended, Open-ended, Image Processing, Real-world Multitask Scenarios). These tasks then lead to various Evaluation Dimensions, which include traditional metrics (Accuracy, Safety, Completeness, Readability) and more advanced ones (Bias, Hallucination, Tool Usage, Reasoning). This visual aid helps to grasp the complexity and interdependencies of the evaluation process discussed in the paper.

  • Figure 3. Examples of closed-ended tasks and various open-ended tasks. This figure illustrates the distinction between closed-ended tasks (like multiple-choice questions with a single correct answer) and open-ended tasks (which require generation of diverse responses, such as summarization, information extraction, and medical question answering). It supports the paper's categorization of task scenarios by providing concrete examples.

  • Figure 4. Examples of medical LLMs applied in image classification, report generation, visual question answering, and image segmentation. This multi-panel figure demonstrates the multimodal capabilities of LLMs in medical imaging. It shows how LLMs can be used for:

    • (A, B) Image classification, where the model identifies conditions from images.
    • (C) Image report generation, where it creates textual diagnostic reports from scans.
    • (D) Visual question answering (VQA), where it answers questions based on an image.
    • (E) Image segmentation, where it delineates structures within an image. This figure visually underpins the "Image Processing Tasks" section, showcasing the breadth of visual applications.
  • Figure 5. A medical LLM-based agent interacting with a patient through multi-turn dialogue and tool calls. This figure provides a detailed schematic of an LLM agent in a clinical decision support role. It shows an agent taking patient input, performing medical history collection, symptom analysis, diagnostic reasoning, image analysis (calling an image model), report generation, and providing clinical decision support through multi-turn dialogue. This visual representation is key to understanding the real-world multitask scenarios involving LLM agents and their operational flow.

  • Figure 6. LLM Agent specific evaluation dimensions. This chart, central to the paper's agent evaluation discussion, illustrates the hierarchical nature of LLM agent evaluation. It categorizes agents into four levels of autonomy (Generator, Tool-integrating, Planning, High Autonomy) and maps specific evaluation dimensions to each level. Traditional metrics apply to all levels, but as autonomy increases, Tool Usage Capability, Reasoning Capability, Workflow Management, and Autonomous Assessment become increasingly important. This figure succinctly captures the paper's argument for specialized agent evaluation.

  • Figure 7. Key challenges and future directions for evaluating large language models (LLMs) in medicine. This chart summarizes the discussion section. On the left, it outlines Key Challenges such as single-aspect test datasets, limited evaluation dimensions, and lack of patient involvement. On the right, it points to Future Directions, including the need for reasoning, safety test sets, and the development of automatic evaluation pipelines. This figure provides a clear roadmap for future research, reiterating the paper's forward-looking insights.

6.3. Ablation Studies / Parameter Analysis

This paper is a systematic review of existing literature, not a primary research paper presenting new experimental results. Therefore, it does not include ablation studies or parameter analysis of its own methods. The paper reviews the evaluation practices, datasets, and metrics used in other studies. Any ablation studies or parameter analyses would belong to the individual papers that this review synthesizes.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper provides a comprehensive and systematic overview of the current evaluation practices for large language models (LLMs) and LLM agents in healthcare. It underscores that while these AI technologies hold significant promise across various clinical applications, their evaluation presents unique and complex challenges due to the high-risk nature of medicine and the complexity of medical data. The review meticulously categorizes data sources (existing medical resources and manually curated questions), task scenarios (closed-ended, open-ended, image processing, and real-world multitask agent scenarios), and evaluation methods and dimensions (automated metrics, human expert assessments, traditional accuracy, and agent-specific considerations like tool usage and reasoning). Ultimately, the paper concludes that establishing precise, effective, and standardized evaluation frameworks is crucial for ensuring the safe, ethical, and effective deployment of LLMs in clinical practice, emphasizing the indispensable role of continued interdisciplinary collaboration.

7.2. Limitations & Future Work

The authors identify several key limitations in the current landscape of medical LLM evaluation and propose corresponding future research directions:

  1. Data Scarcity and Quality:

    • Limitation: A significant challenge is the lack of high-quality medical QA and especially VQA datasets. Crafting these datasets is resource-intensive, requiring medically compliant images, diverse questions, and expert annotations. The involvement of medical professionals, crucial for this task, is often impractical due to their limited time and energy [125].
    • Future Work: Explore human-LLM collaboration to build and continuously optimize evaluation datasets, thereby enhancing scalability and reducing manual workload. This suggests a hybrid approach in which LLMs assist in generating or pre-filtering data, which experts then refine.
  2. Incomplete Assessment Standards:

    • Limitation: Current evaluation dimensions, while focusing on accuracy, completeness, and safety, often lack refinement in other critical medical aspects such as hallucination, empathy, and interpretability [126-128]. There is no standardized method for quantifying hallucination types and severity. For LLM agents, specific dimensions like tool usage and clinical reasoning are often overlooked.
    • Future Work: Develop a more precise and hierarchical classification system for hallucinations. Integrate agent-specific evaluation dimensions into targeted frameworks, building upon models that assess metacognition and different types of medical reasoning.
  3. Limitations of Evaluation Metrics:

    • Limitation: The field still predominantly relies on traditional classification and NLP metrics. While advanced LLMs like GPT-4 are being explored for automated assessment [11,129,130], their stability and robustness for such tasks require further validation. Crucially, automated metrics often cannot fully capture the correctness, applicability, or implications of responses in dynamic medical contexts, making human evaluation indispensable in the short term.
    • Future Work: Focus on further developing and validating automated evaluation systems that balance rigorous assessment with reduced demands on human evaluators, enabling large-scale evaluation.
  4. Gap in Practical Application and Validation:

    • Limitation: Most studies are currently at the preclinical validation stage. There is a lack of rigorous clinical trials, including randomized controlled trials (RCTs), comparing LLMs against existing healthcare systems, traditional AI tools, or human professionals in real-world settings [131]. Appropriate endpoints to gauge success or failure (e.g., reducing morbidity, improving work efficiency, patient/physician satisfaction) are needed.

    • Future Work: Design clinical trials that are more aligned with real-world clinics. Incorporate diversity and personalization in evaluator selection, including physicians, patients, medical students, and other users, based on specific application scenarios. Leverage user feedback (from service logs, user satisfaction) for online assessments and continuous performance monitoring, similar to generalist LLMs [132].

      The paper emphasizes that the five-element evaluation framework (models, environments, interfaces, interactive users, and leaderboards) proposed by Abbasian et al. [133] offers a valuable reference for future research and applications.

7.3. Personal Insights & Critique

This paper provides a highly valuable and timely review for anyone working on or interested in the application of LLMs and LLM agents in healthcare. Its strength lies in its systematic categorization and comprehensive coverage, offering a structured way to understand a rapidly evolving and often chaotic research landscape.

Inspirations and Transferability:

  • Structured Thinking: The paper's framework (data sources, task scenarios, evaluation methods, dimensions) is an excellent model for analyzing any complex AI application domain. This structured approach can be transferred to other high-stakes fields like finance, autonomous driving, or legal tech, where AI errors can have severe consequences.

  • Beyond Accuracy: The strong emphasis on moving beyond simple accuracy to evaluate hallucination, bias, safety, interpretability, and empathy is critical. These qualitative and ethical dimensions are paramount in human-centric fields and should be central to AI evaluation across the board.

  • Agent-Centric Evaluation: The detailed breakdown of agent-specific evaluation dimensions (tool usage, reasoning, workflow management, autonomous assessment) offers a blueprint for assessing increasingly sophisticated AI systems that orchestrate multiple tools and decision steps. This is crucial for understanding the true capabilities and risks of AI beyond basic prompt-response models.

  • Interdisciplinary Collaboration: The repeated call for collaboration between domain experts (healthcare professionals) and AI researchers is not just a polite suggestion but a practical necessity highlighted by the review. This underscores that robust AI development and deployment require a deep understanding of both technological capabilities and real-world domain constraints.

Potential Issues, Unverified Assumptions, and Areas for Improvement:

  • Scope vs. Depth Trade-off: As a review, the paper provides broad coverage, but the depth of analysis for any single evaluation method or task scenario is necessarily limited. A beginner might still need to consult the cited papers for the nitty-gritty details of specific metrics or task implementations.

  • Rapid Obsolescence: The field of LLMs is evolving at an unprecedented pace. A review published in 2025 based on literature up to early 2025 might quickly become partially outdated, especially regarding specific model performances or state-of-the-art evaluation techniques. The challenge for such reviews is to focus on enduring principles rather than fleeting specifics. The paper mitigates this by focusing on frameworks and challenges rather than just listing benchmarks.

  • Defining "Adequate Assessment": The exclusion criterion for studies lacking "adequate assessment" (formal protocols, statistical validation, sample size > 20) is good, but "adequate" can still be subjective. More explicit guidelines or a scoring rubric for assessment quality might have strengthened the selection process, though this is difficult for a broad review.

  • Bias in Selected Studies: While the systematic search aimed for comprehensiveness, the selection of "representative studies" for examples (e.g., in task scenarios) could still introduce some implicit bias, potentially highlighting successes more than failures, or certain research groups over others.

  • Operationalizing Qualitative Metrics: While the paper calls for evaluating empathy, interpretability, and hallucination more rigorously, the operationalization of these qualitative concepts into robust, scalable, and objective metrics remains a significant challenge. The paper highlights the need, but the "how" is still an open research question. For instance, how do we quantitatively score "empathy" consistently across different evaluators and cultures?

  • Cost-Effectiveness: The paper mentions cost-effectiveness as an often-overlooked dimension. Given the high computational and human resource costs associated with developing, deploying, and evaluating large LLMs, this is a crucial practical aspect that warrants more explicit discussion and potential metrics.

Overall, this paper serves as an excellent foundational text for understanding the current state and future directions of medical LLM evaluation. Its rigorous structure and emphasis on interdisciplinary collaboration make it a highly valuable contribution to the field.
