Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology
TL;DR Summary
This study developed an autonomous AI agent that integrates GPT-4 with multimodal precision oncology tools. Evaluated on 20 realistic patient cases, it selected appropriate tools with 87.5% accuracy and reached correct conclusions in 91.0% of cases, boosting decision-making accuracy from 30.3% (GPT-4 alone) to 87.2% and laying the groundwork for personalized, AI-driven oncology support.
Abstract
Clinical decision-making in oncology is complex, requiring the integration of multimodal data and multidomain expertise. We developed and evaluated an autonomous clinical artificial intelligence (AI) agent leveraging GPT-4 with multimodal precision oncology tools to support personalized clinical decision-making. The system incorporates vision transformers for detecting microsatellite instability and KRAS and BRAF mutations from histopathology slides, MedSAM for radiological image segmentation and web-based search tools such as OncoKB, PubMed and Google. Evaluated on 20 realistic multimodal patient cases, the AI agent autonomously used appropriate tools with 87.5% accuracy, reached correct clinical conclusions in 91.0% of cases and accurately cited relevant oncology guidelines 75.5% of the time. Compared to GPT-4 alone, the integrated AI agent drastically improved decision-making accuracy from 30.3% to 87.2%. These findings demonstrate that integrating language models with precision oncology and search tools substantially enhances clinical accuracy, establishing a robust foundation for deploying AI-driven personalized oncology support systems.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology
1.2. Authors
Dyke Ferber, Omar S. M. El Nahhas, Georg Wölflein, Isabella C. Wiest, Jan Clusmann, Marie-Elisabeth Leßmann, Sebastian Foersch, Jacqueline Lammert, Maximilian Tschochohei, Dirk Jäger, Manuel Salto-Tellez, Nikolaus Schultz, Daniel Truhn, and Jakob Nikolas Kather
1.3. Journal/Conference
The paper is published in Nature Cancer. Nature Cancer is a highly reputable peer-reviewed scientific journal focusing on all areas of cancer research. As part of the Nature Portfolio, it holds significant influence and prestige in the oncology and biomedical research communities.
1.4. Publication Year
2025
1.5. Abstract
Clinical decision-making in oncology is complex, requiring the integration of multimodal data and multidomain expertise. We developed and evaluated an autonomous clinical artificial intelligence (AI) agent leveraging GPT-4 with multimodal precision oncology tools to support personalized clinical decision-making. The system incorporates vision transformers for detecting microsatellite instability and KRAS and BRAF mutations from histopathology slides, MedSAM for radiological image segmentation and web-based search tools such as OncoKB, PubMed and Google. Evaluated on 20 realistic multimodal patient cases, the AI agent autonomously used appropriate tools with 87.5% accuracy, reached correct clinical conclusions in 91.0% of cases and accurately cited relevant oncology guidelines 75.5% of the time. Compared to GPT-4 alone, the integrated AI agent drastically improved decision-making accuracy from 30.3% to 87.2%. These findings demonstrate that integrating language models with precision oncology and search tools substantially enhances clinical accuracy, establishing a robust foundation for deploying AI-driven personalized oncology support systems.
1.6. Original Source Link
/files/papers/694d2b6f1bec792f52e46409/paper.pdf (This is a local file path, indicating it's likely a PDF accessed through a specific system or provided directly. Its publication status is "Published online: 6 June 2025").
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the increasing complexity of clinical decision-making in oncology. This complexity stems from the need to integrate diverse data types (multimodal data) such as genomic, histopathological, and radiological information, and from the need for expertise across multiple medical domains. Traditional Large Language Models (LLMs) like GPT-4 have shown promise in medical reasoning but often fall short on tasks that are highly specialized, require multi-step reasoning, or involve multimodal data. They can also "hallucinate" (generate factually incorrect but plausible-sounding information) when asked questions outside their knowledge base.
Current research often evaluates LLMs on single, specific tasks, which doesn't reflect the real-world clinical need for holistic, personalized patient care. Furthermore, regulatory policies in regions like the US and EU restrict the approval of universal multipurpose AI models, favoring medical devices with singular, well-defined purposes. Prior attempts to enhance LLMs with domain-specific information, such as fine-tuning or Retrieval-Augmented Generation (RAG), primarily position LLMs as information extraction tools rather than true clinical assistants capable of reasoning, strategizing, and performing actions.
The paper's innovative idea is to overcome these limitations by developing an autonomous clinical AI agent that leverages the reasoning capabilities of GPT-4 and equips it with a suite of specialized precision oncology tools and web-based search functionalities. This approach aims to bridge the gap between generalist LLMs and specialized AI models, allowing for robust, evidence-based, and personalized clinical decision-making.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Development of an Autonomous AI Agent: The authors developed a novel AI agent that integrates GPT-4 with multimodal precision oncology tools and web search capabilities. This agent can autonomously select and use appropriate tools, perform multi-step reasoning, and generate evidence-based clinical recommendations.
- Multimodal Tool Integration: The system successfully incorporates a diverse set of tools, including vision transformers for detecting microsatellite instability (MSI) and KRAS/BRAF mutations from histopathology slides, MedSAM for radiological image segmentation, and web-based search tools (OncoKB, PubMed, Google) for information retrieval.
- Enhanced Decision-Making Accuracy: The AI agent significantly outperforms GPT-4 alone. When evaluated on 20 realistic multimodal patient cases, the integrated agent achieved a clinical conclusion accuracy of 91.0%, compared to only 30.3% for GPT-4 alone.
- Effective Tool Use and Citation: The agent autonomously used appropriate tools with 87.5% accuracy and accurately cited relevant oncology guidelines 75.5% of the time, demonstrating its capability for robust reasoning and transparent, evidence-based responses.
- Robust Foundation for Deployment: The findings establish a robust foundation for deploying AI-driven personalized oncology support systems, addressing the need for specialized AI capabilities within a regulated medical context.

The key conclusions are that combining LLMs with precision medicine tools and Retrieval-Augmented Generation (RAG) substantially enhances clinical accuracy and reliability in oncology decision-making. This modular, agent-based approach offers advantages over monolithic generalist models, including ease of updating medical knowledge and integrating validated medical devices, while mitigating issues like hallucination.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a reader should be familiar with several foundational concepts in artificial intelligence, machine learning, and oncology.
- Large Language Models (LLMs):
  - Conceptual Definition: LLMs are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They learn complex patterns in language, enabling them to perform tasks like translation, summarization, question answering, and even creative writing.
  - How They Work: LLMs typically employ a transformer architecture, which uses attention mechanisms to weigh the importance of different words in a sequence. They undergo two main phases: pre-training on a massive text corpus to learn general language understanding, and then fine-tuning on smaller, specialized datasets for specific tasks or domains.
  - Relevance to Paper: GPT-4 is the LLM backbone of the AI agent, serving as its "reasoning engine" to interpret patient cases, formulate strategies, and orchestrate the use of various tools.
- Multimodal AI:
  - Conceptual Definition: Multimodal AI refers to artificial intelligence systems that can process and integrate information from multiple types of data inputs, or "modalities," such as text, images, audio, and structured data.
  - Relevance to Paper: In oncology, this means combining textual patient reports, radiological images (CT/MRI), histopathological images, and genomic data to form a holistic understanding of a patient's condition. The AI agent developed in this paper is a multimodal AI system.
- Precision Oncology:
  - Conceptual Definition: Precision oncology, also known as personalized medicine in cancer, is an approach to cancer treatment that considers the individual variability in genes, environment, and lifestyle for each person. Instead of a one-size-fits-all approach, it aims to tailor treatments to the unique genetic and molecular characteristics of a patient's tumor.
  - Relevance to Paper: The AI agent supports personalized clinical decision-making by integrating genomic information and other patient-specific data to suggest highly targeted therapies.
- Retrieval-Augmented Generation (RAG):
  - Conceptual Definition: RAG is a technique used to enhance LLMs by allowing them to access, retrieve, and incorporate information from external knowledge bases during the generation process. This contrasts with traditional LLMs, which rely only on the knowledge encoded during their training.
  - How It Works: When an LLM receives a query, a retrieval component first searches an external database (e.g., a collection of medical guidelines) for relevant documents or passages. These retrieved documents are then provided as additional context to the LLM, which uses this new information to generate a more informed and accurate response, reducing the likelihood of hallucinations.
  - Relevance to Paper: The paper's AI agent uses RAG to ground its responses in authoritative medical guidelines and documents, ensuring accuracy and providing citations.
- AI Agents / Autonomous AI Systems:
  - Conceptual Definition: An AI agent (or autonomous AI system) is an LLM equipped with a set of tools (e.g., calculators, search engines, specialized AI models) and the ability to autonomously decide when and how to use these tools to achieve a complex goal. It involves reasoning, planning, and action steps, often in an iterative fashion.
  - Relevance to Paper: The core of this paper is the development and evaluation of such an AI agent in the oncology domain.
- Vision Transformers:
  - Conceptual Definition: Vision transformers are a type of neural network that applies the transformer architecture (originally developed for natural language processing) to image processing tasks. Unlike traditional Convolutional Neural Networks (CNNs), which process images using convolutional layers, vision transformers treat image patches as sequences of tokens, similar to words in a sentence, and use self-attention mechanisms to understand relationships between these patches.
  - Relevance to Paper: The AI agent uses vision transformers for specialized tasks like detecting microsatellite instability (MSI) and KRAS/BRAF mutations directly from histopathology slides.
- Medical Image Segmentation:
  - Conceptual Definition: Image segmentation is a computer vision technique that divides a digital image into multiple segments (sets of pixels), often to identify specific objects or regions of interest. Medical image segmentation applies this to medical scans (like CT or MRI) to delineate organs, tumors, or other anatomical structures.
  - Relevance to Paper: The AI agent uses MedSAM (a specialized medical image segmentation model) to segment tumors from radiological images, allowing precise measurement of tumor size and tracking of changes over time.
- Microsatellite Instability (MSI) / Microsatellite Stability (MSS):
  - Conceptual Definition: Microsatellites are short, repetitive sequences of DNA. Microsatellite instability (MSI) is a condition in which errors accumulate in these repetitive DNA sequences due to a dysfunctional DNA mismatch repair (MMR) system. MSI-high tumors are often more responsive to immunotherapy. Microsatellite stability (MSS) indicates an intact MMR system.
  - Relevance to Paper: Detecting MSI status from histopathology slides is a critical biomarker for guiding immunotherapy decisions in oncology, and the AI agent incorporates a vision transformer for this purpose.
- KRAS and BRAF Mutations:
  - Conceptual Definition: KRAS and BRAF are genes that encode proteins involved in cell growth and differentiation. Mutations in these genes (KRAS mutations and the BRAF V600E mutation) are common oncogenic drivers in various cancers, including colorectal cancer, melanoma, and lung cancer. These mutations can be actionable targets for specific targeted therapies.
  - Relevance to Paper: The AI agent uses vision transformers to detect KRAS and BRAF mutations from histopathology slides and then leverages the OncoKB database to suggest appropriate targeted therapies.
- RECIST Criteria (Response Evaluation Criteria In Solid Tumors):
  - Conceptual Definition: RECIST is a set of standardized rules used in clinical trials to objectively measure how much a cancer has shrunk or disappeared after treatment. It involves measuring target lesions (tumors) and classifying the response as complete response, partial response, stable disease, or progressive disease based on predefined percentage changes in tumor size.
  - Relevance to Paper: The AI agent uses MedSAM to measure tumor size and a calculator to determine changes according to RECIST criteria, informing decisions about disease progression or response.
3.2. Previous Works
The paper contextualizes its work by referencing several previous studies and existing AI models:
- General LLM Capabilities:
  - GPT-4 (OpenAI et al., 2023): Acknowledged for its remarkable advancements, mimicking human reasoning and showing knowledge across professional disciplines. Cited for achieving a passing score on the United States Medical Licensing Examination (USMLE) and providing detailed explanations (Nori et al., 2023).
  - GPT-4 for medical recommendation suggestions from official guidelines (Ferber et al., 2024): This previous work from some of the current authors highlighted LLMs as rapid and reliable reference tools in oncology.
- Multimodal AI Systems:
  - Acosta et al. (2022): Discussed the rise of multimodal biomedical AI.
  - Examples of AI systems analyzing radiology images with clinical data (Khader et al., 2023) or integrating histopathology with genomic (Chen et al., 2022) or text-based information (Lu et al., 2024). These advancements fueled anticipation of "generalist multimodal AI systems."
  - Tu et al. (2024) and MedGemini (Saab et al., 2024): Mentioned as examples of generalist LLMs in medicine, highlighting the challenge of developing such models to truly perform on par with precision medicine tools.
- Enhancing LLMs with Domain-Specific Information:
  - Fine-tuning (Chen et al., 2023, for MEDITRON-70B) and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020): Strategies for enriching LLMs with domain-specific knowledge.
  - Almanac (Zakka et al., 2024) and LongHealth (Adams et al., 2024): Recent datasets and models targeting open-ended responses and patient-related content, though limited to text.
- Autonomous AI Systems (Agents):
  - Previous work showed that LLMs equipped with tools (calculators, web search) are superior in multi-step reasoning and planning tasks (Schick et al., 2023).
  - Arasteh et al. (2024) used the integrated data-analysis tools of an LLM to analyze scientific data, achieving results on par with human researchers.
- Specialist Unimodal Deep Learning Models in Precision Medicine:
  - Kather et al. (2019): Predicted microsatellite instability from histology.
  - MedSAM (Ma et al., 2024): A state-of-the-art medical image segmentation model.
  - Wagner et al. (2023) and El Nahhas et al. (2024): Developed vision transformer models for KRAS and BRAF mutation detection from histopathology slides.
  - OncoKB (Chakravarty et al., 2017): A precision oncology knowledge base.
  - MSIntuit (Saillard et al., 2023): An AI-based pre-screening tool for MSI detection, mentioned as a clinical-grade alternative.
- Regulatory Context:
  - Derraz et al. (2024): Pointed out the need for new regulatory thinking for AI-based personalized drug/cell therapies in precision oncology.
  - Gilbert et al. (2023): Discussed that LLM AI chatbots require approval as medical devices.

The paper also refers to CTranspath (Wang et al., 2022) for feature extraction in computational pathology pipelines (El Nahhas et al., 2025).
3.3. Technological Evolution
The field of AI in medicine has rapidly evolved, particularly with the advent of Large Language Models (LLMs).
- Early AI in Medicine: Initially, AI applications in medicine focused on rule-based expert systems or simpler machine learning models for specific tasks (e.g., image analysis for pathology or radiology).
- Rise of Deep Learning: With the success of deep learning, particularly Convolutional Neural Networks (CNNs), AI capabilities expanded, leading to highly effective unimodal models for tasks like disease detection from medical images or biomarker prediction from histopathology. These specialist models (MedSAM, vision transformers for MSI and mutation detection) represent the current state of the art in specialized medical AI.
- Emergence of LLMs: The development of transformer architectures and massive LLMs like GPT-4 revolutionized AI's language understanding and generation capabilities. These models demonstrated broad knowledge and reasoning abilities, even passing medical exams.
- Multimodal LLMs and Generalist AI: The next step was integrating LLMs with other modalities (e.g., text and images) to create multimodal LLMs and generalist medical AI systems (e.g., MedGemini). The ambition was for a single model to handle any medical information. However, these faced challenges in specialization, rapid knowledge updates, and regulatory approval.
- LLM Agents with Tools (Retrieval-Augmented Generation): To address the limitations of standalone LLMs (hallucinations, lack of real-time knowledge, inability to perform actions), the concept of LLM agents emerged. These agents leverage LLMs as "reasoning engines" and augment them with external tools (web search, calculators) and knowledge bases (RAG). This allows LLMs to perform multi-step reasoning, retrieve up-to-date information, and execute specialized tasks.

This paper's work fits within the LLM-agent-with-tools paradigm. It leverages the advanced reasoning of GPT-4 and specifically integrates it with highly specialized, validated precision oncology tools, augmented by RAG for evidence grounding.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several core differences and innovations:
- Modular vs. Monolithic Generalist AI:
  - Related Work (Generalist AI): Efforts like MedGemini and the generalist model of Tu et al. (2024) aim to create single, all-encompassing multimodal AI models that can understand and reason across any dimension of medical information.
  - This Paper's Differentiation: The paper explicitly opts for a modular, tool-based approach. Instead of trying to develop a single "generalist foundation model," it equips a powerful LLM (GPT-4) with an array of specialist unimodal deep learning models (e.g., vision transformers for histology, MedSAM for radiology) and external search/knowledge tools (OncoKB, PubMed, Google). This acknowledges the achievements of specialist models and circumvents the challenges of training and maintaining a single, complex generalist model.
- LLM as a "Reasoning Engine" with Actions, Not Just Information Extraction:
  - Related Work (RAG/Fine-tuning): Previous RAG or fine-tuning approaches often position LLMs primarily as enhanced information extraction tools.
  - This Paper's Differentiation: The AI agent in this study goes beyond mere information retrieval. It acts as a "reasoning engine" that can strategize, plan, and perform actions by autonomously selecting, invoking, and integrating outputs from various tools. This enables customized decision-making rather than just providing suggestions.
- Addressing Regulatory and Maintenance Challenges:
  - Related Work (Generalist AI): Developing and regulating universal multipurpose AI models is challenging due to policy restrictions (medical devices should fulfill a singular purpose) and the difficulty of continuous alignment with evolving medical knowledge.
  - This Paper's Differentiation: The modular approach simplifies updates. Medical knowledge can be rapidly updated by replacing documents in the RAG database or through web searches, without retraining the core LLM. State-of-the-art, regulatory-approved medical devices can be integrated as tools, facilitating compliance and allowing for independent validation of each component.
- Mitigating Hallucination through RAG and Tool Use:
  - Related Work (LLMs alone): Standalone LLMs are prone to hallucination, providing plausible but incorrect answers.
  - This Paper's Differentiation: The combination of RAG (grounding responses in authoritative sources with citations) and the use of verifiable tools for factual data generation (e.g., tumor measurements, mutation detection) drastically reduces hallucinations, which is critical in healthcare.
- Focus on Multimodal, Multi-step Clinical Scenarios:
  - Related Work (Benchmarks): Existing biomedical benchmarks are often limited to one or two data modalities and closed question-and-answer formats.
  - This Paper's Differentiation: The paper develops and evaluates its agent on 20 realistic, multidimensional patient cases that require multimodal data integration (CT/MRI, histopathology, genomic, text) and multi-step reasoning, more closely mimicking real-world clinical decision-making.

In essence, this paper proposes a pragmatic and robust strategy for developing medical AI by orchestrating specialized AI components with a general-purpose LLM, rather than attempting to build a single AI that masters all medical tasks.
4. Methodology
4.1. Principles
The core principle behind the AI agent is to leverage a Large Language Model (LLM), specifically GPT-4, not merely as a knowledge database, but as a "reasoning engine" (as described by Truhn et al., 2023). This LLM is augmented with a curated knowledge database (through Retrieval-Augmented Generation, RAG) and a suite of specialized tools. This integrated framework allows the AI agent to:
- Interpret complex multimodal patient data: Understand clinical vignettes, radiological images, histopathology slides, and genomic reports.
- Autonomously strategize and execute actions: Decide which specialized tools are needed to extract further insights (e.g., segment tumors, detect mutations, search literature).
- Synthesize information: Integrate data from the patient context, tool outputs, and retrieved medical guidelines.
- Generate evidence-based clinical conclusions: Provide personalized recommendations, complete with citations to authoritative sources, reducing hallucinations.

The philosophy is that while a generalist LLM provides broad reasoning, specialized precision medicine tools offer superior accuracy for very specific tasks, and RAG ensures responses are grounded in current medical evidence.
4.2. Core Methodology In-depth (Layer by Layer)
The AI agent's pipeline operates in a two-stage process, as illustrated in Figure 1, involving tool usage and response generation augmented by RAG.
The following figure (Figure 1 from the original paper) shows the high-level overview of the LLM agent framework:
The figure is a schematic of the autonomous AI agent's workflow for clinical decision-making in oncology, showing the knowledge database, the patient case, the LLM agent, and selectable tools such as PubMed and OncoKB. It illustrates how medical information is queried and processed to provide support, including the selection of specific tools and the use of medical image segmentation.
4.2.1. Dataset Composition and Data Collection
The foundation of the agent's knowledge lies in its comprehensive dataset, designed for correctness, up-to-dateness, and contextual relevance, with a focus on oncology.
- Sources: Data was collected from six official medical sources:
  - MDCalc (https://www.mdcalc.com/): For clinical scores.
  - UpToDate and MEDITRON (Chen et al., 2023): For general-purpose medical recommendations.
  - Clinical Practice Guidelines from:
    - American Society of Clinical Oncology (ASCO)
    - European Society of Medical Oncology (ESMO)
    - Onkopedia guidelines (German and English subset) from the German Society for Hematology and Medical Oncology
- Retrieval: Documents were retrieved and downloaded as HTML-extracted text or raw PDF files.
- Filtering: To manage the volume of data, keyword-based filtering was applied to the document contents, targeting terms relevant to oncology use cases. MEDITRON guidelines were directly accessible as preprocessed jsonlines files.
4.2.2. Information Extraction and Data Curation from PDF Files
Extracting usable text from PDFs is challenging due to their display-oriented structure.
- GROBID Application: The authors used GROBID (GeneRation Of BIbliographic Data), a Java application and machine-learning library, to convert unstructured PDF data into the standardized TEI (Text Encoding Initiative) format (Ide & Véronis, 1995). GROBID, trained on scientific articles, helps preserve text hierarchy and extracts metadata (titles, authorship, publication dates, URLs).
- Data Cleansing: Raw document text was programmatically retrieved from the TEI's XML fields. This included:
  - Removal of extraneous information: hyperlinks, graphical elements, corrupted tabular data, malformed characters, and inadvertently extracted IP addresses.
  - Reformatting and standardization: headers denoted with #, blank lines for paragraph separation.
- Archiving: The purified text and its metadata were archived in jsonlines format.
4.2.3. Agent Composition: Retrieval-Augmented Generation (RAG)
The RAG component is crucial for providing domain-specific medical knowledge to the LLM.
- Embedding Creation and Indexing:
  - Conversion to Embeddings: Raw text data from the curated guidelines is converted into numerical vector representations, called embeddings.
  - Embedding Model: OpenAI's text-embedding-3-large model is used for this purpose.
  - Text Segmentation: Text is segmented into chunks of varying lengths (512, 256, and 128 tokens), each with a 50-token overlap to maintain context across chunk boundaries.
  - Storage: These vector embeddings, along with metadata and original text, are stored in Chroma, a local vector database that supports efficient lookup via vector-based similarity measures.
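The segmentation scheme above (fixed-size chunks with a 50-token overlap, indexed at three granularities) can be sketched as follows. This is a simplified illustration: real pipelines count model tokens via a tokenizer, whereas here a pre-tokenized list stands in.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into chunks of `size` tokens,
    where consecutive chunks share `overlap` tokens."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap  # stride between chunk start positions
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks

tokens = [f"t{i}" for i in range(1000)]  # stand-in for a tokenized guideline
# The paper indexes each document at three granularities.
for size in (512, 256, 128):
    print(size, len(chunk_tokens(tokens, size)))
```

With a 512-token chunk size, the last 50 tokens of one chunk are exactly the first 50 of the next, so no sentence is cut off without context on at least one side.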
4.2.4. Agent Composition: Tools
The LLM is equipped with an array of tools, allowing it to interact with and draw conclusions from multimodal patient data. The LLM autonomously determines the need for and timing of tool usage. Each tool has JSON specifications detailing its function and required input parameters.
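A tool specification of the kind described might look like the sketch below, written in the style of the widely used OpenAI function-calling schema. The tool name, field layout, and parameters are illustrative assumptions, not the paper's actual (unpublished) specification.

```python
import json

# Hypothetical JSON specification for a MedSAM-style segmentation tool.
medsam_tool_spec = {
    "name": "medsam_segmentation",
    "description": "Segment a region of interest in a single-slice CT/MRI image "
                   "and return the segmented surface area.",
    "parameters": {
        "type": "object",
        "properties": {
            "image_path": {
                "type": "string",
                "description": "Filesystem path to the image.",
            },
            "bounding_box": {
                "type": "array",
                "items": {"type": "number"},
                "description": "Pixel coordinates [x_min, y_min, x_max, y_max].",
            },
        },
        "required": ["image_path", "bounding_box"],
    },
}

# The spec must serialize cleanly, since it is passed to the LLM as JSON.
print(json.dumps(medsam_tool_spec, indent=2).splitlines()[1])
```

Given such a schema, the LLM emits a JSON argument object matching `parameters`, which the framework validates before dispatching the actual tool call.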
The following figure (Figure 3 from the original paper) details the agent's pipeline in patient case evaluation:
The figure is a schematic of the tool selection and usage workflow of the GPT-4-based AI agent for clinical decision-making in oncology, including tumor localization, target definition, and querying steps.
- Web Search Capabilities:
  - Google Custom Search API: For general web searches. Retrieved information is text-extracted, purified, and integrated directly as context for the LLM.
  - PubMed Queries: For scientific-literature searches. Responses are processed similarly to the RAG procedure and stored in a separate database for retrieval.
- Radiology Processing Tools:
  - GPT-4 Vision (GPT-4V) API: Used to generate comprehensive, detailed, and structured radiology reports from CT or MRI scans, particularly when no pre-existing report is provided in the patient case.
  - MedSAM (Medical Segment Anything Model) (Ma et al., 2024): A state-of-the-art medical image segmentation model.
    - Function: MedSAM generates a segmentation mask for a specified region of interest (ROI) within an image, given pixel-wise bounding-box coordinates. This mask allows calculation of the overall surface area of a tumor or lesion.
    - Workflow: GPT-4 first locates relevant patient images on the filesystem, identifies their chronological order (from filenames), and extracts reference lesion locations from the patient vignette. It then sends this information to MedSAM. After receiving tumor segmentation sizes from MedSAM, GPT-4 uses a calculator tool to determine percentage changes according to RECIST criteria, assessing disease progression, stability, or response.
    - Limitation: Currently restricted to single-slice images taken at the same magnification.
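The calculator step in that workflow amounts to a percentage-change computation plus a threshold-based call. The sketch below uses the standard RECIST cutoffs (a decrease of at least 30% for partial response, an increase of at least 20% for progressive disease) but is deliberately simplified: full RECIST 1.1 works on sums of target-lesion diameters, takes the nadir as the reference for progression, and adds an absolute-increase rule, whereas the surface areas here are only illustrative stand-ins for MedSAM output.

```python
def percent_change(baseline: float, current: float) -> float:
    """Percentage change of a lesion measurement relative to baseline."""
    return (current - baseline) / baseline * 100.0

def recist_label(change_pct: float) -> str:
    """Simplified RECIST-style call from a single percentage change.
    Real RECIST 1.1 additionally uses diameter sums, the nadir as
    reference for progression, and a >=5 mm absolute-increase rule."""
    if change_pct <= -30.0:
        return "partial response"
    if change_pct >= 20.0:
        return "progressive disease"
    return "stable disease"

baseline_mm2, followup_mm2 = 450.0, 270.0  # illustrative segmented areas
change = percent_change(baseline_mm2, followup_mm2)
print(f"{change:.1f}% -> {recist_label(change)}")  # -40.0% -> partial response
```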
- Calculator Tool:
  - Function: A simplified tool for elementary arithmetic operations (addition, subtraction, multiplication, division), executed locally via a Python interpreter. Essential for quantitative analysis, such as comparing tumor sizes over time.
- OncoKB Database (Chakravarty et al., 2017):
  - Function: Provides critical information on medical evidence for treating a wide range of genetic anomalies (mutations, copy-number alterations, structural rearrangements) in oncology.
  - Workflow: GPT-4 sends the HUGO symbol (gene name), the change of interest (mutation, amplification, variant), and the specific alteration (e.g., BRAF V600E) to the OncoKB API. The API returns a structured JSON object containing potential Food and Drug Administration (FDA)-approved or investigational drug options, along with their evidence levels.
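The query/response round trip described above can be sketched with a mock payload. The query fields (gene symbol and alteration) are the ones named in the text, but the response shape below is purely illustrative and does not reproduce OncoKB's actual API schema.

```python
import json

# Fields the agent sends, per the paper: HUGO gene symbol and alteration.
query = {"hugoSymbol": "BRAF", "alteration": "V600E"}

# Mock response; the real OncoKB schema differs -- this layout is illustrative.
mock_response = json.dumps({
    "query": query,
    "treatments": [
        {"drugs": ["Dabrafenib", "Trametinib"], "level": "LEVEL_1"},
        {"drugs": ["Vemurafenib"], "level": "LEVEL_1"},
    ],
})

def summarize_treatments(raw: str) -> list[str]:
    """Flatten a treatments payload into 'drug(s) (evidence level)' strings
    that can be handed back to the LLM as plain-text context."""
    data = json.loads(raw)
    return [
        f"{' + '.join(t['drugs'])} ({t['level']})"
        for t in data.get("treatments", [])
    ]

print(summarize_treatments(mock_response))
```

Flattening the structured response into short strings like this keeps the token cost low while preserving the drug names and evidence levels the agent needs for its recommendation.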
- Histopathological Analysis (Specialized Vision Transformers):
  - Function: GPT-4 can engage in-house-developed vision transformer models for detecting phenotypic alterations directly from routine histopathology slides. These models are specialized for:
    - Distinguishing microsatellite instability (MSI) from microsatellite stability (MSS) (El Nahhas et al., 2024).
    - Detecting KRAS and BRAF mutations (Wagner et al., 2023).
  - Workflow: For efficiency in a research setting, histopathology features are pre-extracted from images using CTranspath (Wang et al., 2022), a self-supervised transformer model. GPT-4 transmits these pre-extracted features to the respective MSI, KRAS, and BRAF vision transformers and receives a binary prediction (e.g., MSI versus MSS, KRAS mutant versus wild type) along with a mutation probability in percent. This implementation detail (pre-extracted features) is hidden from GPT-4.
4.2.5. Agent Composition: Combine, Retrieve and Generate Responses
The final stage involves synthesizing all gathered information into a structured, evidence-based response using the DSPy library (Khattab et al., 2023).
- Input: The LLM receives the original patient context, the user's posed question, and all outcomes from the tool applications.
- Chain-of-Thought Reasoning (Wei et al., 2022): The LLM is instructed to decompose the initial user query into up to 12 more granular subqueries, derived from both the initial patient context and the results obtained from tool applications. This decomposition helps retrieve more targeted information.
- Document Retrieval from Vector Database:
  - For each generated subquery, the LLM transforms it into a numerical embedding using the same text-embedding-3-large model used for the medical guidelines.
  - Cosine Distance: The query embedding is compared against all embedded chunks in the Chroma vector database using cosine distance (lower distance indicates higher similarity). The top 40 most similar document passages are retrieved.
  - Reranking: Cohere's Rerank 3 English model is then applied to reorder these 40 retrieved passages based on their semantic similarity to the LLM's query. This step is crucial for filtering out passages that have high cosine similarity but are contextually irrelevant (e.g., a query about approved drugs matching a guideline passage stating a drug is not approved).
  - Selection: Only the top 10 most relevant passages from the reranked results are kept for each subquery.
- Consolidation and Formatting:
  - All retrieved guideline text chunks from all subqueries are combined.
  - Deduplication: Duplicate passages are removed to reduce token usage.
  - Citation Prefixing: Each unique passage is prefixed with an enumeration like Source [x]: ... to facilitate accurate citations in the final response.
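The retrieval-and-consolidation steps above can be sketched in a few lines. This is a minimal toy version under stated assumptions: 3-dimensional vectors stand in for text-embedding-3-large embeddings, and the Cohere reranking stage is omitted.

```python
# Minimal sketch of the retrieval stage: rank guideline chunks by cosine
# distance to a subquery embedding, keep the top-k, deduplicate, and
# prefix each unique passage with "Source [x]: ". Toy vectors only.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # lower distance = higher similarity

def retrieve(query_vec, chunks, k):
    """chunks: list of (embedding, text); return top-k texts by distance."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c[0]))
    return [text for _, text in ranked[:k]]

def consolidate(passages):
    """Deduplicate while preserving order, then add citation prefixes."""
    unique = list(dict.fromkeys(passages))
    return [f"Source [{i + 1}]: {p}" for i, p in enumerate(unique)]

chunks = [
    ((1.0, 0.1, 0.0), "FOLFOX plus bevacizumab is a first-line option."),
    ((0.0, 1.0, 0.2), "MSI-high tumors may respond to immunotherapy."),
    ((0.9, 0.2, 0.1), "FOLFOX plus bevacizumab is a first-line option."),
]
hits = retrieve((1.0, 0.0, 0.0), chunks, k=2)
print(consolidate(hits))
```

Note how the two nearest chunks carry identical text, so deduplication collapses them into a single cited passage, mirroring the token-saving step described above.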
- Response Generation Strategy:
  - Before generating the final answer, the LLM is instructed to generate a step-by-step strategy for building a structured response. This strategy also includes identifying any missing information that could further refine and personalize the recommendations.
  - Synthesis: The final model output is synthesized from all available information, strictly following the generated hierarchical strategy.
- Citations and Self-Evaluation:
  - Programmatic Citations: To enhance reliability and fact-checking, the model is programmatically configured to incorporate citations for each statement (defined as a maximum of two consecutive sentences) using DSPy suggestions.
  - Self-Evaluation: The LLM performs a self-evaluation step, comparing its own generated output against the respective context from the database within a window of one to two sentences. This procedure is performed for a single iteration. All prompts are implemented using DSPy's signatures.
4.3. LLM Tool Use Benchmarking: Llama-3 70B and Mixtral 8x7B
To assess the performance of the proprietary GPT-4 model against open-weight alternatives, the authors performed a comparison with Llama-3 70B (Meta AI, 2024) and Mixtral 8x7B (Jiang et al., 2024).
- Prompt Simplification: The original prompt used for GPT-4 was slightly simplified and shortened for Llama-3 and Mixtral based on initial testing. Other parameters related to tool composition remained unaltered.
- Metrics: Function-calling performance was evaluated on the 20 patient cases using the following metrics:
  - Required/successful: Fraction of required tool calls that completed successfully.
  - Required/failed: Fraction of required tool calls that were invoked but failed.
  - Required/unused: Fraction of necessary tool calls that the model failed to invoke.
  - Not required/failed: Ratio of superfluous tool calls invoked by the models that resulted in failures.
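These four fractions can be computed from raw counts, as in the sketch below. The counts reproduce GPT-4's reported result (56 of 64 required calls successful, i.e., 87.5%); normalizing every fraction by the number of required calls is an assumption about how the paper defines the denominators.

```python
# Sketch of the four function-calling metrics from raw counts.
# Denominator choice (all required calls) is an assumption.
def tool_call_metrics(req_success, req_failed, req_unused, not_req_failed):
    n_required = req_success + req_failed + req_unused
    return {
        "required/successful": req_success / n_required,
        "required/failed": req_failed / n_required,
        "required/unused": req_unused / n_required,
        "not required/failed": not_req_failed / n_required,
    }

# GPT-4's reported counts: 56 successes, 8 missed, 2 superfluous failures.
m = tool_call_metrics(req_success=56, req_failed=0, req_unused=8, not_req_failed=2)
print(round(m["required/successful"], 3))  # 0.875
```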
4.4. Model Specifications
- Core Agent/Tools LLM: gpt-4-0125-preview (GPT-4) via the OpenAI Python API.
- Visual Processing LLM: gpt-4-vision-preview (GPT-4V) via the chat completions endpoint.
- Temperature: Empirically set to 0.2 for the agent phase and 0.1 during RAG (lower temperatures make LLMs more deterministic and less creative).
- Text Embeddings: text-embedding-3-large (OpenAI's latest embedding model), producing 3,072-dimensional embeddings.
- Baseline GPT-4 (without tools/RAG): Evaluated with identical hyperparameters, using a chain-of-thought reasoning module.
- Open-weight LLMs for Benchmarking: Meta Llama-3 70B (llama3-70b-8192) and Mistral's Mixtral 8x7B (mixtral-8x7b-32768), accessed via the Groq API. Temperature set to 0.2, maximum output tokens to 4,096.
4.5. Clinical Case Generation
To rigorously test the agent in realistic scenarios, a unique dataset of patient cases was created.
- Composition: 20 distinct, multimodal, and entirely fictional patient cases were compiled, primarily focusing on gastrointestinal oncology (colorectal, pancreatic, cholangiocellular, and hepatocellular cancers).
- Patient Profile: Each case included a comprehensive medical history: diagnoses, notable medical events, and previous treatments.
- Imaging Data: Each patient was paired with either one or two slices of CT or MRI imaging. These served as sequential follow-up staging scans (e.g., liver or lungs) or simultaneous staging scans of both liver and lungs. Images were predominantly from the Department of Diagnostic and Interventional Radiology, University Hospital Aachen, with some samples from public datasets (The Cancer Imaging Archive, Clark et al., 2013).
- Histology Images: Obtained from The Cancer Genome Atlas (TCGA).
- Genomic Information: Several patient descriptions included details on genomic variations (mutations and gene fusions).
- Query Structure: Queries were not simple single questions but were structured with multiple subtasks, subquestions, and instructions, requiring the model to handle an average of three to four subtasks per round.
- Robustness Evaluation: To investigate bias, 15 random combinations of age, sex, and origin were generated for each of the 20 base cases, resulting in 300 total combinations.
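The robustness setup can be sketched as follows. The attribute values are illustrative placeholders, not the categories actually used in the paper.

```python
# Sketch of the bias/robustness setup: draw 15 random (age, sex, origin)
# combinations for each of the 20 base cases, giving 300 variants total.
# Attribute values are hypothetical placeholders.
import random

random.seed(0)  # reproducible sketch

AGES = range(30, 85)
SEXES = ["female", "male"]
ORIGINS = ["European", "African", "East Asian", "South Asian", "Hispanic"]

def make_variants(base_cases, n_per_case=15):
    variants = []
    for case_id in base_cases:
        for _ in range(n_per_case):
            variants.append({
                "case": case_id,
                "age": random.choice(AGES),
                "sex": random.choice(SEXES),
                "origin": random.choice(ORIGINS),
            })
    return variants

variants = make_variants(base_cases=[f"case_{i:02d}" for i in range(20)])
print(len(variants))  # 300
```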
4.6. Human Results Evaluation
A structured evaluation framework, inspired by Singhal et al. (2023), was developed for blinded manual evaluation by a panel of four certified clinicians with oncology expertise. The evaluators were blinded to the model responses when establishing the ground truth for tool necessity and completeness.
- Tool Use Assessment:
  - Baseline: A manual baseline of the tools expected to be necessary to solve each patient task was established.
  - Metric: Ratio of actual versus expected (required) tool uses.
  - Definition of Tool Requirement: A tool was considered "required" if the model was explicitly instructed to use it, or if its output was essential for answering the question (the default in almost all situations).
  - Helpfulness: Quantified as the proportion of user instructions and subquestions directly addressed and resolved by the model.
- Textual Output Quality Assessment:
  - Segmentation: Each reply was segmented into smaller evaluable items (statements), defined as a segment concluding with either a literature reference or a topic shift in the subsequent sentence.
  - Factual Correctness:
    - Correctness: Proportion of correct replies relative to all model outputs.
    - Incorrectness: Responses that were factually wrong but not clinically detrimental (e.g., superfluous diagnostic procedures, irrelevant patient information).
    - Harmfulness: Responses that were factually incorrect and clinically judged as potentially deleterious (e.g., advising suboptimal or contraindicated treatments).
  - Completeness:
    - Method: Specific keywords and terms representing expected interventions (treatments, diagnostic procedures) were identified for each medical scenario. These keywords were highly specific (e.g., "FOLFOX and bevacizumab" instead of "chemotherapy").
    - Metric: Measures the extent to which the agent's response covers these essential, anticipated pieces of information.
- Citation Adherence Assessment:
  - Method: For each reference in the model's output, the corresponding original document segment (by source identifier) was reviewed.
  - Critical Dimensions:
    - Citation Correctness: The model's statements faithfully mirror the content of the original document.
    - Irrelevance: The model's assertions are not substantiated by the source material.
    - Incorrect Citation: Information attributed to a source diverges from its actual content.
- Consensus: For all benchmarks, the majority vote across the four observers was reported. In cases of a tie, the most adverse outcome was selected, following a hierarchical schema:
  - Citations: correct > irrelevant > wrong
  - Correctness: correct > wrong > harmful
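The consensus rule described above (majority vote, ties broken toward the most adverse label) can be sketched in a few lines; this is an illustrative implementation, not the authors' code.

```python
# Consensus across four observers: majority vote; on a tie, the most
# adverse label in the hierarchy wins (hierarchies listed best -> worst).
from collections import Counter

def consensus(votes, hierarchy):
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    # Among tied labels, pick the one furthest down the hierarchy.
    return max(tied, key=hierarchy.index)

CORRECTNESS = ["correct", "wrong", "harmful"]
CITATIONS = ["correct", "irrelevant", "wrong"]

print(consensus(["correct", "correct", "correct", "wrong"], CORRECTNESS))  # correct
print(consensus(["correct", "correct", "wrong", "wrong"], CORRECTNESS))    # wrong
```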
4.7. Statistics and Reproducibility
- Sample Size: 20 patient cases. No statistical methods were used to predetermine sample sizes.
- Randomization: Experiments were not randomized, as only one test group (the AI agent) existed.
- Blinding: Investigators were not blinded to allocation during experiments and outcome assessment; however, they were blinded to the model responses while establishing the ground truth for tool necessity and completeness.
- Experiment Repetition: Repetitions per case were limited due to stringent GPT-4V API rate and access limitations during the preview phase.
- Data Exclusion: No data were excluded from the analysis.
- Reproducibility Concerns: Silent changes by OpenAI to their models and potential future model deprecations may affect reproducibility. However, results are expected to remain reproducible with other current state-of-the-art models.
- Guardrails: Built-in guardrails in GPT-4 occasionally led to refusals to address medical queries. In these instances, the sample was rerun in newly instantiated settings until no such refusals occurred.
5. Experimental Setup
5.1. Datasets
The study utilized a combination of existing medical knowledge bases, real-world medical imaging, and synthetically generated patient cases.
- Knowledge Database for RAG:
  - Sources: MDCalc, UpToDate, MEDITRON, ASCO Clinical Practice Guidelines, ESMO Clinical Practice Guidelines, and Onkopedia guidelines.
  - Characteristics: These sources comprise a vast collection of medical documents, clinical guidelines, and clinical scores, specifically tailored to oncology. They represent authoritative, up-to-date medical knowledge. The total number of documents after keyword-based filtering was approximately 6,800.
  - Purpose: To provide the LLM with an extensive, evidence-based foundation for its reasoning and to ground its responses in medical literature via RAG.
- Radiology Images:
  - Sources: Predominantly in-house CT and MRI chest and abdomen series from the Department of Diagnostic and Interventional Radiology, University Hospital Aachen. A few cases were supplemented from public datasets, including The Cancer Imaging Archive (TCIA).
  - Characteristics: Images were representative slices selected by an experienced radiologist. They included sequential follow-up scans or simultaneous staging scans, allowing the AI agent to track disease development over time.
  - Example (conceptual): A CT scan of the liver showing a known lesion, followed by a subsequent MRI scan a few months later to assess changes in size.
  - Purpose: To enable the AI agent to use GPT-4V for report generation or MedSAM for quantitative tumor segmentation and measurement.
- Histopathology Images:
  - Source: The Cancer Genome Atlas (TCGA) Research Network (https://www.cancer.gov/tcga).
  - Characteristics: Images of colorectal cancer tissue, containing features relevant for predicting microsatellite instability (MSI) and KRAS/BRAF mutations.
  - Purpose: To provide visual input for the specialized vision transformer models, enabling the detection of genetic alterations directly from routine histopathology slides.
- Patient Cases for Evaluation:
  - Source: 20 distinct, comprehensive, and entirely fictional patient cases, primarily focused on gastrointestinal oncology (colorectal, pancreatic, cholangiocellular, and hepatocellular cancers).
  - Characteristics: Each case included a full patient profile: a concise medical history (diagnoses, events, treatments) paired with radiology images (CT/MRI) and, in several cases, information on genomic variations (mutations, gene fusions). Queries for each case involved multiple subtasks, subquestions, and instructions (an average of three to four per round), designed to necessitate multimodal data integration and multi-step reasoning.
  - Purpose: To provide a realistic and multidimensional benchmark for quantitatively testing the AI agent's performance in complex clinical decision-making scenarios. For robustness evaluation, 15 random permutations of age, sex, and origin were generated for each of the 20 base cases, totaling 300 combinations.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the AI agent's performance across different aspects of clinical decision-making.
- Tool Use Accuracy
  - Conceptual Definition: This metric quantifies how effectively the AI agent identifies and successfully executes the tools necessary to derive the supplementary insights required to solve a given patient case. It reflects the agent's ability to reason about tool applicability and manage tool invocation.
  - Mathematical Formula: $ \text{Tool Use Accuracy} = \frac{N_{\text{required, successful}}}{N_{\text{required}}} $
  - Symbol Explanation:
    - $N_{\text{required, successful}}$: The number of tools that were expected to be used and ran successfully.
    - $N_{\text{required}}$: The total number of tools considered necessary to fully solve a patient case.
- Clinical Conclusion Accuracy (Correctness)
  - Conceptual Definition: This metric assesses the factual truthfulness of the AI agent's statements within its generated responses. It also differentiates between incorrect statements that are merely wrong (e.g., suboptimal suggestions) and those that are potentially harmful (e.g., contraindicated treatments).
  - Mathematical Formula: $ \text{Correctness} = \frac{N_{\text{correct}}}{N_{\text{correct}} + N_{\text{wrong}} + N_{\text{harmful}}} $
  - Symbol Explanation:
    - $N_{\text{correct}}$: The number of factually correct statements.
    - $N_{\text{wrong}}$: The number of incorrect statements that are not detrimental.
    - $N_{\text{harmful}}$: The number of incorrect statements deemed potentially damaging or clinically deleterious.
- Completeness
  - Conceptual Definition: This metric measures how comprehensively the AI agent's response covers the essential information, interventions, or diagnostic procedures that human oncologists would expect in a well-rounded answer for a given clinical scenario. It is based on a pre-defined set of specific keywords and terms.
  - Mathematical Formula: $ \text{Completeness} = \frac{N_{\text{covered\_keywords}}}{N_{\text{total\_expected\_keywords}}} $
  - Symbol Explanation:
    - $N_{\text{covered\_keywords}}$: The number of pre-defined expected keywords or terms that the model accurately identified or proposed in its response.
    - $N_{\text{total\_expected\_keywords}}$: The total number of keywords or terms considered essential by human experts for a complete answer.
- Helpfulness
  - Conceptual Definition: This metric quantifies the degree to which the AI agent effectively addresses and resolves all the subquestions and instructions provided by the user within the clinical query.
  - Mathematical Formula: $ \text{Helpfulness} = \frac{N_{\text{answered\_subquestions}}}{N_{\text{total\_subquestions}}} $
  - Symbol Explanation:
    - $N_{\text{answered\_subquestions}}$: The number of subquestions or instructions that the model effectively addressed.
    - $N_{\text{total\_subquestions}}$: The total number of subquestions or instructions given in the user's query.
- Citation Accuracy
  - Conceptual Definition: This metric evaluates the reliability and validity of the citations provided by the AI agent. It assesses whether the cited sources genuinely support the statements made by the model, distinguishing between accurate, irrelevant, and conflicting citations.
  - Mathematical Formula: $ \text{Citation Accuracy} = \frac{N_{\text{correct\_citations}}}{N_{\text{correct\_citations}} + N_{\text{irrelevant\_citations}} + N_{\text{incorrect\_citations}}} $
  - Symbol Explanation:
    - $N_{\text{correct\_citations}}$: The number of citations that accurately align with the model's assertions.
    - $N_{\text{irrelevant\_citations}}$: The number of citations where the reference content does not support the model's statement.
    - $N_{\text{incorrect\_citations}}$: The number of citations where the information attributed to a source conflicts with its actual content.
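As a sanity check, the headline percentages can be recomputed from the counts reported in the Results section (56/64 tool uses, 223/245 correct statements, 63/67 addressed queries, 194/257 aligned citations):

```python
# Recomputing the headline metrics from the counts reported in the
# Results section, using the ratio formulas defined above.
def ratio(n, d):
    return round(n / d, 3)

tool_use = ratio(56, 64)              # required tools used successfully
correctness = ratio(223, 245)         # factually correct statements
helpfulness = ratio(63, 67)           # subquestions effectively addressed
citation_accuracy = ratio(194, 257)   # citations aligned with assertions

print(tool_use, correctness, helpfulness, citation_accuracy)
# 0.875 0.91 0.94 0.755
```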
5.3. Baselines
The paper compared its proposed AI agent against the following baselines:
- GPT-4 Alone: This serves as the primary baseline to demonstrate the incremental value of integrating specialized tools and RAG with a powerful LLM. The GPT-4 model (gpt-4-0125-preview) was evaluated with identical hyperparameters to the agent but without access to the external tools or the RAG pipeline, utilizing only a chain-of-thought reasoning module. This comparison highlights the limitations of LLMs when used "out of the box" for complex, multimodal medical tasks.
- Llama-3 70B (Meta AI): An advanced open-weight large language model developed by Meta AI. It was included to assess how proprietary models like GPT-4 compare to state-of-the-art open-source alternatives in tool-calling performance.
- Mixtral 8x7B (Mistral): Another prominent open-weight large language model, known for its Mixture-of-Experts (MoE) architecture. This baseline further diversified the comparison against open-weight models regarding their ability to effectively identify and utilize external tools.

These baselines are representative because GPT-4 is considered a best-in-class proprietary LLM, while Llama-3 70B and Mixtral 8x7B represent leading open-weight alternatives, which is crucial for evaluating the generalizability and accessibility of such AI agent architectures.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly validate the effectiveness of the proposed AI agent, demonstrating significant improvements in accuracy, reliability, and capability compared to GPT-4 alone and open-weight models.
6.1.1. Tool Use and Retrieval Improve LLM Responses
The integrated AI agent drastically improved decision-making accuracy compared to GPT-4 alone.
- Qualitative Improvement: As illustrated in Figure 2a, GPT-4 without tool use often failed to detect the current disease state, provided generic responses, or drew incorrect conclusions (e.g., falsely assuming "disease progression" or "no evidence of disease"). In contrast, the agent, by leveraging tools, correctly identified treatment responses, measured tumor surfaces, and made appropriate decisions.
- Quantitative Improvement (Completeness): The AI agent's responses achieved an 87.2% success rate on the completeness benchmark (95 out of 109 expected answers covered), compared to only 30.3% for GPT-4 alone (33 out of 109 expected answers covered). This highlights that enhancing LLMs with tools and RAG is crucial for generating precise and comprehensive solutions to complex medical cases.

The following figure (Figure 2 from the original paper) shows an example of responses from GPT-4 alone in comparison to GPT-4 with tool use and RAG, illustrating improved LLM performance.
Figure: A schematic containing multiple case analyses, illustrating the application of AI in clinical decision-making for cancer. It depicts analyses of disease progression for different patients and emphasizes the role of AI tools in data processing and treatment recommendations.
6.1.2. GPT-4 Handles Complex Chains of Tool Use
The AI agent, specifically GPT-4, demonstrated a strong ability to autonomously manage and chain complex tool invocations.
- Overall Tool Use Accuracy: Out of 64 required tool invocations across all 20 patient cases, the agent successfully used 56, achieving an overall tool use accuracy of 87.5%. There were no failures among the tools that were invoked when required.
  - Only 8 required tool invocations (12.5%) were missed by the model.
  - There were two instances where the model attempted to call superfluous tools without the necessary data, resulting in failures.
- Sequential Chaining Examples:
  - Tumor Progression Calculation: For patient G, GPT-4 invoked MedSAM twice (for images at different time points) and used the MedSAM output (segmentation masks/areas) as input for the calculator function to determine a tumor growth factor of 2.14. This shows the agent's ability to feed the results of one tool into another for multi-step reasoning.
  - Mutation-Specific Management: For patient W, the model used the vision transformer models to confirm a suspected BRAF mutation. It then queried the OncoKB database to retrieve medical information on the appropriate management for that specific mutation.
- Robustness Across Patient Factors: An investigation into the agent's robustness across different combinations of sex, age, and origin (Extended Data Fig. 1) revealed that the primary source of performance variation was the number of tools required for each patient case, rather than intrinsic patient demographic factors. Cases requiring more tools showed higher variability in tool-calling behavior.
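The MedSAM-to-calculator chain for patient G reduces to simple arithmetic on two segmented areas. In the sketch below the area values are hypothetical, chosen so the growth factor matches the reported 2.14; note that clinical RECIST is formally defined on lesion diameters, whereas this is a simplified surface-area comparison.

```python
# Sketch of the MedSAM -> calculator chain: two segmented tumor surface
# areas from scans at different time points are combined into a growth
# factor and a percentage change. Area values are hypothetical.
def growth_factor(area_t0, area_t1):
    return area_t1 / area_t0

def percent_change(area_t0, area_t1):
    return 100.0 * (area_t1 - area_t0) / area_t0

a0, a1 = 400.0, 856.0  # mm^2, hypothetical segmentation outputs
print(round(growth_factor(a0, a1), 2))   # 2.14
print(round(percent_change(a0, a1), 1))  # 114.0
```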
6.1.3. Radiology Tools Improve GPT-4's Treatment Accuracy
The integration of radiology processing tools was crucial for the agent's decision-making.
- GPT-4 Vision (GPT-4V): When no radiology report was provided, GPT-4V generated comprehensive reports, effectively guiding the agent's decisions despite occasional omissions or extraneous details.
- MedSAM for Quantitative Analysis: For cases with reference lesions, GPT-4 successfully used MedSAM to generate surface segmentation masks. It identified image locations and their chronological order, sent requests to MedSAM, received tumor segmentation sizes, and then used the calculator to determine percentage changes according to RECIST criteria. This autonomous multi-step process allowed the agent to accurately determine disease progression, stability, or response.

In summary, the pipeline autonomously handled complex sequences: determining tool need, locating data, understanding chronological order, sending requests, receiving results, and integrating these into subsequent decision-making steps, all without human supervision.
6.1.4. Evaluations Show Accurate, Helpful and Reliable Responses
The manual evaluation by four medical experts confirmed the high quality of the agent's responses across multiple dimensions.
- Clinical Conclusion Accuracy (Correctness):
- Out of 245 assessable statements, 223 (91.0%) were factually correct.
- 16 (6.5%) were incorrect but not detrimental.
- Only 6 (2.4%) were flagged as potentially harmful.
- Remarkably, the agent could resolve issues even with contradictory information in patient descriptions (e.g., discrepancies between reported mutations and tool results), pointing out inconsistencies and recommending further confirmation.
- Helpfulness:
- Among 67 queries, 63 (94.0%) were categorized as effectively addressed, demonstrating the agent's ability to provide sophisticated answers to user instructions and subquestions.
- Citation Accuracy:
  - Of 257 citations, 194 (75.5%) were accurately aligned with the model's assertions.
  - 59 (23.0%) were irrelevant.
  - Only 4 (1.6%) were found to conflict with the model's statement. This low rate of incorrect citations highlights the effectiveness of the RAG and self-evaluation mechanisms in mitigating hallucinations and ensuring transparency.

The following figure (Figure 4 from the original paper) summarizes the performance of the agent's pipeline in patient case evaluation:

Figure: A chart summarizing the autonomous AI agent's performance in patient case evaluation across four metrics (tool use, correctness, completeness, and helpfulness), with the respective success rates clearly indicated, reflecting the agent's effectiveness and accuracy in handling cases.
6.1.5. Comparison with Open-Weight LLMs (Llama-3 70B, Mixtral 8x7B)
The benchmarking against open-weight LLMs revealed substantial shortcomings in their tool-calling performance compared to GPT-4.
- Overall Success Rates:
  - GPT-4: 87.5% required/successful.
  - Llama-3 70B: Only 39.1% required/successful.
  - Mixtral 8x7B: A mere 7.8% required/successful.
- Specific Failures:
  - Identifying Necessary Tools: Both Llama-3 (18.8% unused) and Mixtral (42.2% unused) struggled to identify the necessary tools.
  - Supplying Correct Arguments: Even when a correct tool was identified, Llama-3 (42.2%) and Mixtral (50.0%) frequently failed to supply the necessary and accurate function arguments, leading to invalid requests and program crashes. GPT-4 showed none of these failures.
  - Superfluous Tool Calls: Llama-3 frequently made unnecessary tool calls (e.g., random calculations on nonsense values, hallucinated tumor locations), resulting in 62 failures across 20 cases. Mixtral's major issue was generally disregarding tool use.

The following figure (Figure 6 from the original paper) shows the tool-calling performance of GPT-4, Llama-3 70B, and Mixtral 8x7B:

Figure: A chart of tool-calling performance, showing the success and failure rates of GPT-4, Llama-3 70B, and Mixtral 8x7B across the different categories; GPT-4 achieves an 87.5% success rate on required tools, substantially higher than the two open-weight models.
These results underscore GPT-4's superior and reliable performance in identifying and correctly applying relevant tools to patient cases, establishing its suitability as the backbone for this agentic AI workflow.
6.2. Data Presentation (Tables)
The main results are presented in charts (Figure 2b, Figure 4, Figure 6b) and descriptive text. There are no tables in the main body of the paper that require transcription according to the provided rules. The figures depict aggregated numerical data which have been analyzed in the text above.
6.3. Ablation Studies / Parameter Analysis
While the paper does not present explicit "ablation studies" in the traditional sense (removing components to test their contribution), it does perform an analysis of the agent's robustness across various patient factors and compares the full agent to GPT-4 alone, which serves a similar purpose in demonstrating component efficacy.
6.3.1. Impact of Patient Factors on Tool-Calling Behavior
The authors investigated whether patient characteristics such as sex, age, and origin influence the model's tool-calling behavior.
- Methodology: 15 random permutations of age, sex, and origin were generated for each of the 20 base patient cases, creating a total of 300 combinations. The agent's tool usage was then evaluated across these diverse patient populations.
- Findings (Extended Data Fig. 1): The evaluation showed that the primary source of variation in tool-calling behavior was the number of tools required for each patient case, rather than intrinsic patient factors (gender, age, or ethnicity/origin).
  - Patient cases requiring relatively few tools (e.g., patients Adams, Lopez, Williams) showed lower variability in tool-calling.
  - Patient cases requiring more tools (e.g., patient Ms Xing) exhibited higher variability in tool-calling behavior, regardless of the combination of age, sex, and ethnicity/origin.
- Implication: This suggests that task complexity (the number of tools needed) is a more dominant factor in the agent's performance than patient demographics, indicating relatively stable performance across diverse patient populations for a given task complexity.
The following figure (Extended Data Fig. 1 from the original paper) shows details on the simulated patient cases and bias investigation:
Figure: A schematic of the autonomous AI agent integrating multiple tools, including GPT-4, MedSAM, and oncology search tools such as OncoKB and PubMed, listing the main tools and techniques that support clinical decision-making.
6.3.2. Contribution of Tools and RAG (Implicit Ablation)
The comparison between GPT-4 alone and the GPT-4 integrated AI agent with tools and RAG serves as an implicit ablation study, clearly demonstrating the individual contributions of these components.
- Result: GPT-4 alone achieved only 30.3% completeness, while the integrated AI agent achieved 87.2% completeness.
- Analysis: This stark difference highlights that the LLM itself, while capable of general reasoning, is severely limited in complex medical scenarios without external, specialized tools for data acquisition (e.g., image analysis, mutation detection) and without RAG for grounding its knowledge in authoritative medical guidelines. The integration of these components is not merely additive but synergistic, enabling the LLM to function as an effective clinical assistant.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully developed and validated an autonomous AI agent designed for clinical decision-making in oncology. By leveraging GPT-4 as a "reasoning engine" and augmenting it with specialized multimodal precision oncology tools (such as vision transformers for MSI, KRAS, BRAF detection, MedSAM for radiological image segmentation) and web-based search tools (OncoKB, PubMed, Google), the agent significantly enhanced clinical accuracy. Evaluated on 20 realistic multimodal patient cases, the integrated AI agent achieved a remarkable clinical conclusion accuracy of 91.0% and tool use accuracy of 87.5%, while GPT-4 alone managed only 30.3% decision-making accuracy. The agent also demonstrated high helpfulness (94.0%) and citation accuracy (75.5%), addressing crucial concerns about reliability and transparency. These findings demonstrate that a modular, tool-based approach, combined with Retrieval-Augmented Generation (RAG), provides a robust and reliable foundation for deploying AI-driven personalized oncology support systems, overcoming limitations of generalist LLMs and addressing practical challenges like rapid knowledge updates and regulatory compliance.
7.2. Limitations & Future Work
The authors openly acknowledge several limitations of their current work and propose future research directions:
- Small Sample Size for Evaluation: The study used 20 patient cases. Constructing these realistic, multimodal cases while adhering to data protection standards is laborious.
  - Future Work: Encourage future efforts to develop larger-scale benchmark cases.
- Medical Tool Selection and Optimization: The core focus was the LLM agent's tool-using abilities, not the independent optimization of each tool.
  - Future Work: Individual tools require better independent optimization and validation (e.g., using annotated ground-truth segmentation masks for MedSAM, or clinical-grade models like MSIntuit for MSI detection in a production environment). Prioritizing clinical endpoints to validate the agent pipeline would offer directly relevant evidence.
- Agent Limitations and Radiology Interpretation: The current agent is at an experimental stage, limiting clinical applicability. It uses only single slices of radiology images, and GPT-4V's capabilities in interpreting medical images are limited.
  - Future Work: Anticipate progress in generalist foundation models with enhanced capabilities in interpreting three-dimensional (3D) medical images, and integrate advanced, specialized medical image-text models (like the Merlin model for 3D CT imaging) as tools. These could incorporate patient history for a holistic disease-state evaluation beyond lesion size changes alone.
- Static Agent Architecture: The current agent architecture is a static design choice.
  - Future Work: Explore integrating RAG and tool use concurrently, where RAG could guide the agent through complex tool application steps. This synergy might improve complex workflows with more challenging tools.
- Single Interaction Evaluation: The evaluation was confined to a single interaction.
  - Future Work: Incorporate multiturn conversations, including human feedback for refinement (a "human-in-the-loop" concept).
- Domain Restriction: Patient scenarios were restricted to oncology.
  - Future Work: The underlying framework could be adapted to virtually any medical specialty with appropriate tools and data.
- Data Protection and
LLMDeployment:GPT-4's cloud-based nature makes it unsuitable for real-world settings due to sensitive patient data transfer concerns.- Future Work: Explore
open-weight models(e.g.,Llama-3 405B,Hermes-3-Llama-3.1) that can be deployed on local servers, and medicallyfine-tuned models.
- Future Work: Explore
- RAG Pipeline Improvements: The current RAG pipeline uses generalist embedding, retrieval, and reranking models.
- Future Work: Explore domain-specific models for enhanced retrieval performance, hybrid search (combining exact keyword and similarity searches), and models with larger context windows (e.g., Gemini 1.5) for handling patient information distributed across hundreds of documents.
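As an illustration, hybrid search can be sketched by fusing a keyword ranking and an embedding-similarity ranking with reciprocal rank fusion. All functions below are toy stand-ins (an exact-term counter instead of BM25, precomputed vectors instead of an embedding model), not the paper's pipeline:

```python
from collections import defaultdict

def keyword_rank(query, docs):
    """Rank document indices by count of query terms they contain (toy keyword search)."""
    terms = set(query.lower().split())
    scored = [(sum(t in d.lower() for t in terms), i) for i, d in enumerate(docs)]
    return [i for score, i in sorted(scored, key=lambda x: -x[0])]

def vector_rank(query_vec, doc_vecs):
    """Rank document indices by dot-product similarity to the query embedding."""
    scored = [(sum(q * d for q, d in zip(query_vec, v)), i) for i, v in enumerate(doc_vecs)]
    return [i for score, i in sorted(scored, key=lambda x: -x[0])]

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each document accumulates 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank fusion is attractive here because it needs no score calibration between the keyword and vector retrievers; only their rank orders are combined.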
- Temporal Dependencies: LLMs need to handle temporal dependencies in treatment recommendations (e.g., rapidly changing lung cancer guidelines versus trial results).
- Future Work: The multitool agent could cross-reference information from official guidelines with more up-to-date information from internet/PubMed searches, building on prior work showing that LLMs can identify temporal differences.
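The cross-referencing idea reduces to comparing sources by publication date and surfacing disagreements rather than silently preferring one. A toy sketch, assuming a simple (source, date, recommendation) record that is not the paper's actual data model:

```python
from datetime import date

def most_recent_recommendation(recommendations):
    """Given (source, publication_date, text) tuples, prefer the newest one
    and return the older sources that disagree with it, so a conflict
    between a guideline and a newer trial is surfaced rather than hidden."""
    newest = max(recommendations, key=lambda rec: rec[1])
    conflicts = [rec for rec in recommendations if rec[2] != newest[2]]
    return newest, conflicts
```

Returning the conflicts explicitly matters: the agent can then cite both the guideline and the newer evidence instead of presenting a single answer as settled.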
- Enhanced AI Agent Capabilities:
- Future Work: Develop AI systems in which LLMs act as central "orchestrators" (e.g., MedVersa), trainable to determine when to complete tasks independently and when to delegate to specialist vision models. Train systems to learn concepts such as uncertainty so they can recognize their own limitations. Task-specific fine-tuning and few-shot prompting could further improve performance, and human feedback on edge cases could be incorporated for continuous improvement.
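A minimal sketch of the orchestrator pattern, assuming a hypothetical tool registry; the paper's actual agent wires GPT-4 function calling to vision transformers, MedSAM, and web search, and every name and return value below is made up for illustration:

```python
# Hypothetical specialist tools the LLM planner can delegate to.
TOOLS = {
    "msi_classifier": lambda slide: {"msi_high": slide.endswith("_msi")},
    "tumor_segmentation": lambda scan: {"lesion_mm": 14.2},
    "guideline_search": lambda query: ["hypothetical guideline hit"],
}

def orchestrate(tool_calls):
    """Execute the tool calls an LLM planner emitted (as name/argument pairs)
    and collect their outputs for the final reasoning step."""
    results = {}
    for name, arg in tool_calls:
        if name not in TOOLS:
            # Surface unknown tools as explicit errors instead of guessing.
            results[name] = {"error": "unknown tool"}
            continue
        results[name] = TOOLS[name](arg)
    return results
```

Keeping every tool output as a verifiable, structured result (rather than free text folded into the prompt) is what enables the fact-checking and hallucination mitigation the review highlights.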
7.3. Personal Insights & Critique
This paper presents a compelling and pragmatic vision for the future of AI in clinical medicine. Its core strength lies in its modular, agent-based approach, which directly addresses several critical limitations of current monolithic LLMs in healthcare:
- Specialization without Generalist Constraints: Instead of attempting to build an AI that is equally expert in all medical domains, the paper successfully demonstrates that a powerful LLM can orchestrate a collection of highly specialized, potentially independently validated AI tools. This allows for precision medicine capabilities without the immense data and computational burden of training one LLM for everything.
- Explainability and Trust: The modularity inherently offers superior explainability. Physicians can scrutinize the output from each individual tool, fostering greater trust and enabling easier fact-checking compared to a large, "black-box" generalist model. This is paramount in a high-stakes domain like healthcare.
- Adaptability and Regulatory Compliance: The ability to swap out or update individual tools (e.g., a better MSI detection model) or knowledge sources (latest guidelines) without retraining the entire system is a significant advantage for AI longevity and for regulatory pathways for medical devices. Each tool can undergo its own validation and approval process.
- Mitigation of Hallucinations: The combination of RAG and verifiable tool outputs (e.g., actual tumor measurements, mutation reports) provides a robust mechanism to reduce LLM hallucinations, which is a critical safety feature in clinical applications.
Potential Issues and Areas for Improvement:
- Tool Integration Overhead: While modular, the complexity of integrating and maintaining numerous diverse tools, ensuring their interoperability, and handling potential versioning conflicts could become a significant engineering challenge in a scaled production environment.
- Orchestration Complexity: The LLM's ability to "reason" about which tool to use, when, and how to interpret its output is impressive but inherently complex. As the number and complexity of tools grow, the LLM's orchestration logic might require increasingly sophisticated prompting or even specialized fine-tuning. The paper's finding that tool-calling behavior becomes more variable as the number of required tools increases hints at this challenge.
- Human-in-the-Loop Implementation: The paper acknowledges the need for multiturn conversations and human-in-the-loop feedback. Designing intuitive interfaces for clinicians to interact with such an agent, provide feedback, and maintain ultimate decision authority will be crucial for adoption.
- Ethical and Legal Frameworks: The paper touches on regulatory and data privacy concerns. Before deployment, robust legal and ethical frameworks for AI accountability, liability, and data governance (especially with cloud-based LLMs) must be established. The "human-in-the-loop" concept is a good starting point for ensuring human oversight.
Transference to Other Domains:
The methodology's core framework—an LLM as a reasoning engine, augmented by specialized tools and RAG—is highly transferable.
- Other Medical Specialties: Easily adaptable to cardiology (integrating ECG analysis tools and cardiac imaging segmentation), neurology (integrating EEG analysis and brain imaging tools), or pathology (integrating more biomarker detection tools).
- Legal Domain: An LLM agent could integrate legal databases, case-law search tools, and document analysis tools to assist in legal research.
- Engineering/Scientific Research: Integrating simulation software, data analysis packages, and scientific literature databases could create powerful research assistants, similar to what Arasteh et al. demonstrated.

Overall, this paper serves as a significant blueprint for agent-based generalist medical AI, demonstrating that a pragmatic, modular approach can deliver highly accurate and reliable AI support in complex clinical settings, potentially accelerating the deployment of AI in healthcare while navigating its inherent challenges. The emphasis on individual tool validation and explainability is a key takeaway for building trust in AI systems.