BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
TL;DR Summary
BiomedXPro uses LLMs to generate diverse interpretable prompts, improving explainability and performance in biomedical diagnosis, especially under few-shot conditions, with strong semantic alignment to clinical features.
Abstract
The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model's performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
- Authors: Kaushitha Silva, Mansitha Eashwara, Sanduni Ubayasiri, Ruwan Tennakoon, and Damayanthi Herath. The authors are affiliated with the University of Peradeniya, Sri Lanka, and RMIT University, Australia.
- Journal/Conference: This paper is an arXiv preprint. Preprints are preliminary versions of academic papers that have not yet undergone peer review for publication in a formal conference or journal. They allow for rapid dissemination of research findings.
- Publication Year: The paper was submitted to arXiv in 2025 (the arXiv ID 2510.15866 indicates an October 2025 submission).
- Abstract: The paper introduces BiomedXPro, an evolutionary framework designed to automatically generate an ensemble of diverse, human-readable natural-language prompts for disease diagnosis from biomedical images. The authors argue that current prompt optimization methods for Vision-Language Models (VLMs) are a barrier to clinical adoption because they produce either uninterpretable "soft prompts" (latent vectors) or a single textual prompt, failing to capture the complex, multi-faceted nature of clinical diagnosis. BiomedXPro uses a Large Language Model (LLM) to extract biomedical knowledge and iteratively optimize a set of prompt pairs. Experiments show BiomedXPro outperforms state-of-the-art methods, especially in few-shot (low-data) scenarios. A key finding is that the generated prompts align with clinically significant features, making the model's predictions more transparent and trustworthy.
- Original Source Link: https://arxiv.org/abs/2510.15866v1
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: AI models used for medical image diagnosis, especially modern Vision-Language Models (VLMs), often function as "black boxes." The methods used to guide them (prompting) either create uninterpretable mathematical vectors (soft prompts) or rely on a single, simplistic text description. This lack of transparency is a major hurdle for adoption in high-stakes clinical environments where doctors need to understand why a model made a particular diagnosis.
  - Importance & Gaps: Real-world clinical diagnosis is not based on a single observation. A pathologist or radiologist integrates multiple visual cues (e.g., cell shape, tissue patterns, color variations) to reach a conclusion. Existing AI prompting methods fail to replicate this multi-perspective reasoning, limiting their robustness and trustworthiness. There is a critical need for AI systems that are not only accurate but also provide explanations that align with clinical reasoning.
  - Innovation: The paper introduces BiomedXPro, a novel framework that uses an evolutionary algorithm powered by a Large Language Model (LLM). Instead of finding one "best" prompt, it evolves a diverse collection (ensemble) of human-readable prompt pairs. Each pair describes the presence vs. absence of a specific clinical feature (e.g., "regular cell shape" vs. "irregular and lobulated cell shape"). This approach aims to make the AI's diagnostic process both interpretable and more aligned with how human experts work.
- Main Contributions / Findings (What):
  - A Novel Evolutionary Framework (BiomedXPro): The paper proposes a new method that uses an LLM to automatically generate, evaluate, and refine a diverse set of natural language prompts for biomedical image classification.
  - Superior Performance in Data-Scarce Scenarios: BiomedXPro consistently outperforms existing state-of-the-art prompt-tuning methods (like CoOp, BiomedCoOp, and XCoOp) on multiple biomedical benchmarks, with its most significant advantages seen in "few-shot" settings where only a handful of training examples are available.
  - Enhanced Interpretability and Clinical Alignment: The framework produces a final set of prompts that are not only high-performing but also semantically meaningful. The authors show that these prompts correspond to statistically significant clinical features, grounding the model's predictions in verifiable medical concepts and making them more trustworthy.
  - Ensemble-Based Robustness: By using a diverse ensemble of prompts for the final prediction, the model mimics the multi-faceted diagnostic process of clinicians, enhancing its robustness.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Vision-Language Models (VLMs): These are AI models trained to understand the relationship between images and text. A prime example is CLIP (Contrastive Language-Image Pre-training), which learns to map images and their corresponding text descriptions to a shared "embedding space." In this space, an image of a cat and the sentence "a photo of a cat" will be located very close to each other. This allows for powerful "zero-shot" classification, where the model can classify images into categories it has never explicitly been trained on, simply by comparing the image to text descriptions of those categories.
  - Prompt Optimization: This is the process of finding the best possible text input (a "prompt") to guide a VLM to perform a specific task accurately. There are two main types:
    - Soft Prompt Tuning: Instead of using human-written text, this method learns a set of continuous vectors (a "soft prompt") that are fed directly into the model. These are highly effective but uninterpretable to humans (e.g., CoOp, BiomedCoOp).
    - Hard Prompt Tuning: This involves optimizing actual, human-readable text strings. While interpretable, it has traditionally been a manual, labor-intensive process. Recent methods use LLMs to automate the generation of these text prompts. BiomedXPro is a form of automated hard prompt tuning.
  - Evolutionary Algorithms: These are optimization techniques inspired by natural selection. They start with an initial "population" of potential solutions (in this case, prompts), evaluate their "fitness" (how well they perform), and then use operations like selection, mutation, and crossover to generate a new, hopefully better, population over many "generations."
  - Chain-of-Thought (CoT) Prompting: A technique to improve the reasoning ability of LLMs. Instead of just asking for an answer, you instruct the LLM to "think step-by-step" or "formulate a strategy" first. This often leads to more coherent and accurate outputs.
- Previous Works:
  - VLMs in Biomedical Imaging: Standard VLMs like CLIP struggle with medical images due to specialized terminology and subtle visual features. To address this, domain-specific models like BiomedCLIP were developed. BiomedCLIP was pre-trained on millions of biomedical image-text pairs, making it a powerful foundation model for medical tasks. BiomedXPro builds on top of a frozen BiomedCLIP model.
  - Limitations of Soft Prompt Tuning: Methods like CoOp learn uninterpretable soft prompts. Biomedical adaptations like BiomedCoOp and XCoOp use an LLM to generate some initial knowledge but still optimize these uninterpretable vectors, failing to provide true transparency.
  - Hard Prompt Tuning: Early methods were manual. Automated approaches like APE and OPRO use LLMs as black-box optimizers. Evolutionary approaches like EvoPrompt and ProAPO iteratively refine prompts. However, the paper argues these methods are not directly suited for the biomedical domain, may lack prompt diversity, or do not fully integrate best practices from prompt optimization research.
- Differentiation:
  - BiomedXPro vs. BiomedCoOp / XCoOp: BiomedCoOp and XCoOp use an LLM once to generate knowledge but ultimately learn an uninterpretable soft prompt. In contrast, BiomedXPro uses the LLM iteratively within an evolutionary loop to produce an entire set of fully interpretable, human-readable hard prompts.
  - BiomedXPro vs. Other Hard Prompt Optimizers: Unlike methods that optimize for a single best prompt, BiomedXPro is explicitly designed to generate a diverse ensemble of prompts. This is crucial for capturing the multi-faceted nature of clinical diagnosis. It also incorporates specific strategies like roulette wheel selection for diversity and crowding to remove semantic redundancy, which are tailored for this goal.
  - BiomedXPro vs. Xplainer: Xplainer uses a structured, descriptor-based prompting approach that offers interpretability but requires significant manual engineering by experts. BiomedXPro automates the discovery of these descriptive prompts, making it more generalizable across different tasks.
4. Methodology (Core Technology & Implementation)
The core of BiomedXPro is an evolutionary algorithm that searches for an optimal set of diverse and interpretable text prompts. The overall workflow is shown in Figure 1.
(Figure 1: workflow of the BiomedXPro framework. An LLM generates biomedical text prompt pairs; the BiomedCLIP multimodal encoder is used to evaluate each pair's fitness; high-fitness pairs are then filtered, yielding a final, diverse set of optimal prompt pairs.)
- Principles: The central idea is to treat prompt discovery as a multi-objective optimization problem: maximizing classification accuracy while also promoting prompt diversity. The framework uses an LLM not as a static knowledge source but as an active, adaptive optimizer that generates new "mutations" (prompt variations) based on performance feedback.
- Steps & Procedures:
1. Problem Formulation
The goal is to classify a medical image $x$ into a binary label $y \in \{0, 1\}$. The system uses a pre-trained VLM with an image encoder $E_I$ and a text encoder $E_T$. The objective is to find a set of $N$ human-readable prompt pairs:
$$\mathcal{P} = \{(p_i^{+},\, p_i^{-})\}_{i=1}^{N}$$
Here, $p_i^{+}$ is a text description of a visual feature indicating the presence of a disease (e.g., "Atypical Giant cells"), and $p_i^{-}$ describes its absence (e.g., "No atypical giant cells").
For a given image $x$, the classification decision for a single prompt pair is made by comparing the similarity of the image embedding to the positive and negative prompt embeddings:
$$\hat{y}_i(x) = \mathbb{1}\big[\mathrm{sim}(E_I(x), E_T(p_i^{+})) > \mathrm{sim}(E_I(x), E_T(p_i^{-}))\big]$$
- $E_I(x)$: The vector representation (embedding) of the image $x$.
- $E_T(p_i^{+})$ and $E_T(p_i^{-})$: The vector representations of the positive and negative text prompts.
- $\mathrm{sim}(\cdot,\cdot)$: Cosine similarity, which measures how close two vectors are in the embedding space.
- $\mathbb{1}[\cdot]$: The indicator function, which is 1 if the condition inside is true (the image is more similar to the positive prompt) and 0 otherwise.
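To make this decision rule concrete, here is a minimal Python sketch; the embedding arguments stand in for outputs of the frozen BiomedCLIP encoders ($E_I$, $E_T$), which are not shown:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_with_pair(img_emb: np.ndarray,
                       pos_emb: np.ndarray,
                       neg_emb: np.ndarray) -> int:
    """Per-pair decision rule: 1 if the image is more similar to the
    'presence' prompt than to the 'absence' prompt, else 0."""
    return int(cosine_sim(img_emb, pos_emb) > cosine_sim(img_emb, neg_emb))
```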
2. Evolutionary Prompt Optimization
The framework iteratively refines a population of prompt pairs over several generations.
- Initialization: The process starts by providing a meta-prompt to an LLM. This meta-prompt instructs the LLM to generate an initial population of distinct prompt pairs based on a task description (e.g., "identify melanoma in dermoscopy images").
- Fitness Evaluation: Each prompt pair in the current population is evaluated on the training data. Its "fitness" score $s_i$ is calculated using a performance metric (e.g., accuracy, F1-score, or inverse binary cross-entropy).
- Population Update: High-performing prompt pairs (those with a fitness score above a threshold $\tau$) are stored in a memory buffer $\mathcal{M}$. This buffer aggregates the best prompts found across all generations.
- LLM-guided Mutation (The "Evolution" Step):
- Selection: A subset of prompt pairs is selected from the memory buffer $\mathcal{M}$ using roulette wheel selection. This method probabilistically selects prompts, giving higher-fitness prompts a greater chance of being chosen, but still allowing lower-fitness prompts to be selected occasionally. This balances exploiting good solutions (exploitation) with exploring new ones (exploration); a minimal code sketch of this step follows this list.
- Mutation: The selected prompts and their scores are fed into a new meta-prompt. This meta-prompt instructs the LLM to generate new prompt pairs that are different from the provided examples and are expected to achieve a higher score. It uses several clever techniques:
- It frames the examples as "best performing pairs" to give the LLM a clear optimization goal.
- It sorts the examples by score in ascending order to mitigate the LLM's "recency bias" (tendency to focus on the last examples it sees).
- It includes Chain-of-Thought (CoT) cues ("Formulate a strategy," "Let's think step-by-step") to encourage the LLM to reason about how to create better prompts.
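As noted in the Selection step above, here is a minimal sketch of roulette-wheel selection; using `random.choices` (sampling with replacement) is a simplifying assumption, but it preserves the fitness-proportional, exploration-friendly behavior described:

```python
import random

# Each buffer entry is (positive_prompt, negative_prompt, fitness).
PromptPair = tuple[str, str, float]

def roulette_wheel_select(buffer: list[PromptPair], k: int) -> list[PromptPair]:
    """Sample k prompt pairs with probability proportional to fitness,
    so strong pairs are favored but weak pairs still get picked."""
    weights = [fitness for _, _, fitness in buffer]
    return random.choices(buffer, weights=weights, k=k)
```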
- Crowding for Diversity: After the evolutionary process runs for $T$ generations, the final memory buffer may contain many prompts that are linguistically different but semantically similar. To create a final, diverse set, a "crowding" mechanism is applied. An LLM is prompted to group together prompts that describe the same medical concept. From each group, only the prompt pair with the highest fitness score is kept. This results in the final optimized set $\mathcal{P}^{*}$.
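The paper delegates this grouping to an LLM; purely as an illustrative, deterministic stand-in, a greedy version built on any sentence-embedding function (the `embed` callable below is hypothetical) might look like this:

```python
import numpy as np

def crowd(pairs: list[tuple[str, str, float]],
          embed,                       # hypothetical: str -> np.ndarray
          sim_threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Greedy semantic crowding: visiting pairs from highest to lowest
    fitness, keep a pair only if its positive prompt is not a near-duplicate
    (cosine similarity >= sim_threshold) of an already-kept prompt."""
    kept, kept_embs = [], []
    for pos, neg, fitness in sorted(pairs, key=lambda t: -t[2]):
        e = embed(pos)
        e = e / np.linalg.norm(e)
        if all(float(e @ other) < sim_threshold for other in kept_embs):
            kept.append((pos, neg, fitness))
            kept_embs.append(e)
    return kept
```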
- Final Prediction: To classify a new image, the predictions from all prompt pairs in the final set are combined using weighted majority voting. The vote of each prompt pair is weighted by its fitness score $s_i$, so more reliable prompts have a greater influence on the final decision:
$$\hat{y}(x) = \mathbb{1}\Big[\textstyle\sum_{i} w_i\, \hat{y}_i(x) \;\geq\; \tfrac{1}{2} \sum_{i} w_i\Big]$$
where the weight $w_i$ is the fitness score $s_i$ of the $i$-th prompt pair.
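A minimal sketch of the fitness-weighted vote; the half-of-total-weight decision threshold matches the formula above, and `votes[i]` would come from `classify_with_pair` in the earlier snippet:

```python
def weighted_vote(votes: list[int], fitness_weights: list[float]) -> int:
    """Weighted majority vote over per-pair binary decisions: predict 1
    when the fitness-weighted mass of positive votes reaches half the
    total weight."""
    total = sum(fitness_weights)
    positive = sum(w for v, w in zip(votes, fitness_weights) if v == 1)
    return int(positive >= total / 2)
```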
The entire process is summarized in Algorithm 1 of the paper.
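Putting the pieces together, here is a compressed, hypothetical schematic of that loop; Algorithm 1 in the paper is authoritative. `llm.generate_initial_pairs`, `llm.mutate_pairs`, `fitness_fn`, and `embed` are assumed interfaces, the hyperparameter defaults are placeholders, and `roulette_wheel_select` / `crowd` refer to the sketches above:

```python
def biomedxpro_loop(llm, fitness_fn, embed,
                    generations: int = 50, pop_size: int = 10,
                    tau: float = 0.5) -> list[tuple[str, str, float]]:
    """Schematic of the evolutionary search: evaluate, buffer, select, mutate."""
    buffer: list[tuple[str, str, float]] = []
    population = llm.generate_initial_pairs(n=pop_size)     # hypothetical call
    for _ in range(generations):
        scored = [(p, n, fitness_fn(p, n)) for p, n in population]
        buffer += [t for t in scored if t[2] > tau]          # population update
        # Selection (assumes at least one pair has cleared the threshold).
        parents = roulette_wheel_select(buffer, k=pop_size)
        population = llm.mutate_pairs(parents)               # hypothetical call
    return crowd(buffer, embed)                              # crowding for diversity
```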
5. Experimental Setup
- Datasets:
  - Derm7pt: A dataset of dermoscopy images for skin lesion classification. The paper uses it for a binary task: melanoma vs. non-melanoma.
  - WBCAtt: A dataset of peripheral blood smear images for white blood cell classification. It's a multiclass problem, which the authors tackle using a one-vs-rest approach (training a binary classifier for each cell type against all others).
  - Camelyon17-WILDS: A histopathology dataset of lymph node tissues for detecting metastatic cancer. This dataset is designed to test domain generalization, as the training and testing data come from different hospitals (domains).
- Evaluation Metrics:
  - F1-macro score: This is the primary metric used. It's well-suited for classification tasks with class imbalance, which is common in medical datasets.
    - Conceptual Definition: The F1-score for a single class is the harmonic mean of precision and recall. F1-macro calculates the F1-score for each class independently and then takes the unweighted average of these scores. This means it treats all classes as equally important, regardless of how many samples they have.
    - Mathematical Formula: For a single class,
      $$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
      where $\mathrm{Precision} = \frac{TP}{TP + FP}$ and $\mathrm{Recall} = \frac{TP}{TP + FN}$. For a problem with $K$ classes,
      $$F1_{\text{macro}} = \frac{1}{K} \sum_{i=1}^{K} F1_i$$
    - Symbol Explanation:
      - $TP$: True Positives (correctly identified positives).
      - $FP$: False Positives (negatives incorrectly identified as positive).
      - $FN$: False Negatives (positives incorrectly identified as negative).
      - $F1_i$: The F1-score for the $i$-th class.
      - $K$: The total number of classes.
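For reference, scikit-learn computes exactly this unweighted per-class mean with `average="macro"`:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
# Per-class F1 scores are computed independently, then averaged without
# class-frequency weighting, so minority classes count equally.
print(f1_score(y_true, y_pred, average="macro"))
```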
- Baselines:
  - Zero-shot BiomedCLIP: The performance of the BiomedCLIP model without any task-specific tuning, using a simple prompt like "a photo of [CLASS]".
  - CoOp & CoCoOp: Gradient-based soft prompt tuning methods.
  - BiomedCoOp & XCoOp: Biomedical adaptations of CoOp that use an LLM to initialize the context for the soft prompts.
6. Results & Analysis
- Core Results: The primary results are presented in Table 1, which compares BiomedXPro to baselines in few-shot settings (from 1 to 16 training samples per class). Manual transcription of Table 1 from the paper (F1-macro, %; the zero-shot baseline uses no training samples, so it has a single value per dataset):

| Dataset | Method | 1-Shot | 2-Shot | 4-Shot | 8-Shot | 16-Shot |
| --- | --- | --- | --- | --- | --- | --- |
| Camelyon17-WILDS | Zero-shot | 41.93 | – | – | – | – |
| | CoOp | 78.79 | 70.76 | 74.15 | 84.25 | 88.47 |
| | CoCoOp | 76.43 | 66.13 | 75.19 | 85.13 | 86.88 |
| | BiomedCoOp | 53.15 | 61.06 | 58.69 | 63.46 | 56.94 |
| | XCoOp | 66.90 | 35.60 | 45.00 | 64.20 | 84.50 |
| | BiomedXPro (Ours) | 72.06 | 86.95 | 90.20 | 90.87 | 90.38 |
| Derm7pt | Zero-shot | 27.86 | – | – | – | – |
| | CoOp | 33.91 | 55.88 | 58.70 | 54.92 | 61.38 |
| | CoCoOp | 33.89 | 56.70 | 50.97 | 54.36 | 57.06 |
| | BiomedCoOp | 52.49 | 57.22 | 45.96 | 51.30 | 61.46 |
| | XCoOp | 39.50 | 58.70 | 60.10 | 41.90 | 54.80 |
| | BiomedXPro (Ours) | 64.54 | 61.45 | 60.87 | 58.51 | 64.17 |
| WBCAtt | Zero-shot | 10.50 | – | – | – | – |
| | CoOp | 33.08 | 41.74 | 55.10 | 67.81 | 75.50 |
| | CoCoOp | 31.86 | 41.91 | 55.96 | 62.02 | 72.19 |
| | BiomedCoOp | 11.24 | 10.83 | 10.48 | 10.57 | 10.91 |
| | XCoOp | 26.20 | 22.10 | 22.80 | 25.20 | 28.90 |
| | BiomedXPro (Ours) | 41.39 | 47.31 | 58.33 | 69.63 | 72.18 |
Analysis: BiomedXPro consistently and significantly outperforms all baselines across all datasets, especially in the most data-scarce 1-, 2-, and 4-shot settings. For example, on Camelyon17-WILDS, it achieves an F1-macro score of 90.20% with just 4 shots, far exceeding CoOp (74.15%) and the underperforming biomedical variants BiomedCoOp (58.69%) and XCoOp (45.00%). The authors hypothesize that BiomedCoOp and XCoOp underperform because they rely on a single, static LLM query, which can lock the optimization into a suboptimal space. BiomedXPro's iterative refinement avoids this trap.
In the full-data regime (Table 2), BiomedXPro (71.14%) is competitive with the best baseline, XCoOp (75.60%), while offering the crucial advantage of full interpretability. Manual transcription of Table 2 from the paper:

| Method | F1-macro |
| --- | --- |
| BiomedXPro (Ours) | 71.14 |
| BiomedCoOp | 59.02 |
| CoOp | 67.91 |
| XCoOp | 75.60 |
| CoCoOp | 67.68 |
- Clinical Relevance: The analysis shows a strong alignment between the highest-scoring prompts discovered by BiomedXPro and clinically relevant features. For example, in the Derm7pt melanoma task, a high-fitness prompt (F1: 0.6523) correctly identified the contrast between a 'regular, linear arrangement' and a 'chaotic, branching pattern' of vascular structures, which is a known indicator. This demonstrates that the evolutionary process is not just finding random high-performing text strings but is successfully articulating statistically significant visual concepts in a way that is meaningful to a clinician.
- Ablation Studies: These studies analyze the impact of different components of the BiomedXPro framework.
  - Effect of Prompt Selection Criteria (Figure 2):
    (Figure 2: performance over iterations for different prompt selection strategies; roulette-wheel selection achieves the best exploration-exploitation balance, improving more stably than the Best-N and Random strategies.)
    This experiment compares roulette_wheel selection (the chosen method) with selecting the best-performing prompts and random prompts. The best strategy converges quickly to a suboptimal solution (pure exploitation), while random is inefficient (pure exploration). roulette_wheel strikes an effective balance, leading to the most stable and highest performance over time.
  - Impact of Generation Size (Figure 3):
    (Figure 3: effect of the number of prompt pairs generated per iteration; 10 pairs gives the best balance between performance and stability, outperforming the 5- and 50-pair settings.)
    The authors tested generating 5, 10, and 50 new prompt pairs per iteration. Generating only 5 led to slow convergence. Generating 50 caused performance to plateau after an initial jump. The chosen size of 10 provided the best trade-off between optimization stability and refinement diversity.
  - Effect of Chain-of-Thought (CoT) Prompting (Figure 4):
    (Figure 4: line chart of the top-10 average score across iterations with and without CoT prompting; the CoT variant performs markedly better, especially in early iterations.)
    This chart shows that including CoT instructions ("formulate a strategy," "Let's think step-by-step") in the mutation meta-prompt consistently improves performance compared to a direct instruction baseline. This suggests that prompting the LLM to "reason" about its task helps it generate higher-quality prompts.
  - Effect of Initial Population Size (Figure 5):
    (Figure 5: effect of initial population size; as iterations increase, the larger initial population of 50 consistently outperforms smaller ones on top-10 average score, indicating that diversity aids optimization.)
    This chart compares initial population sizes of 10, 30, and 50. A larger initial population (50) leads to better and more stable performance. This indicates that starting the evolutionary search from a broader and more diverse semantic space helps prevent premature convergence and leads to a better final solution.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces BiomedXPro, an evolutionary framework that addresses the critical need for interpretability and trustworthiness in biomedical AI. By using an LLM as an adaptive optimizer, the framework automatically generates a diverse ensemble of human-readable prompts that align with clinical reasoning. BiomedXPro demonstrates superior performance over existing methods, especially in low-data scenarios, without sacrificing interpretability. The authors conclude that this represents a significant step towards the safe and reliable clinical deployment of VLMs.
- Limitations & Future Work: The authors provide a candid discussion of the framework's limitations:
- Model Dependencies: The quality of the generated prompts is capped by the biomedical knowledge encoded in the LLM, and the overall performance depends on the quality of the VLM's embedding space. The choice of LLM is also a trade-off between capability (e.g., GPT-4) and cost, as the iterative process is computationally expensive.
- Architectural Limitations: The framework is designed for binary classification and is extended to multiclass problems via a one-vs-rest scheme, which can be suboptimal. A native multiclass extension is a complex future direction. Additionally, diversity is only enforced at the end; integrating it into each iteration proved unstable.
- Validation Assumptions: The validation relies on statistical features from the training data as a proxy for clinical importance, which may be biased. True clinical validation requires expert review.
- Need for Deeper Grounding: Future work should visually ground the prompts (e.g., using Grad-CAM) to verify that the model is actually looking at the features described in the text. This, along with evaluation by clinical experts, is essential for building full clinical trust.
- Personal Insights & Critique:
- Novelty and Significance: The paper's core contribution—using an LLM-driven evolutionary algorithm to generate a diverse ensemble of interpretable prompts—is a strong and timely idea. It directly tackles the "black box" problem in a practical way that resonates with the needs of the medical community. The focus on diversity as a proxy for multi-faceted clinical reasoning is particularly insightful.
- Practicality: The computational cost of running hundreds of generations with repeated LLM calls is a significant practical barrier. While the authors use a smaller open-source model (Gemma) to manage this, it highlights the trade-off between performance and accessibility. The cost-effectiveness of this approach compared to simply collecting more labeled data or using a more powerful (but expensive) base model remains an open question.
- Transferability: The BiomedXPro framework is highly general and could be applied to other specialized domains beyond biomedicine where expert knowledge is crucial and interpretability is paramount (e.g., geological surveys from satellite imagery, fault detection in manufacturing).
- Potential for Improvement: The one-vs-rest approach for multiclass problems is a clear weakness. A future version that could natively handle multiple classes by optimizing sets of prompts (e.g., one prompt per class) would be a powerful extension. Additionally, integrating a deterministic semantic similarity metric (like sentence embeddings) directly into the selection step could provide a more stable way to enforce diversity throughout the optimization process rather than just at the end.