Abstract

The study presents an early diagnostic model for mycosis fungoides and five inflammatory skin conditions, utilizing a convolutional neural network (CNN) that integrates multi-modal data. The motivation behind this research stems from the clinical challenge posed by the accurate and timely diagnosis of these skin diseases. By leveraging advanced data processing techniques, the model aims to enhance diagnostic precision and improve patient outcomes.

1. Bibliographic Information

1.1. Title

Early diagnosis model of mycosis fungoides and five inflammatory skin diseases based on multi-modal data-based convolutional neural network

1.2. Authors

Zhaorui Liu (Department of Dermatology, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Science and Peking Union Medical College, Beijing, China)
Yilan Zhang (Image Processing Center, School of Astronautics, Beihang University, Beijing, China)
Ke Wang (Image Processing Center, School of Astronautics, Beihang University, Beijing, China)
Fengying Xie (Corresponding Author; Image Processing Center, School of Astronautics, Beihang University, Beijing, China)
Jie Liu (Department of Dermatology, Peking Union Medical College Hospital, China)

1.3. Journal/Conference

British Journal of Dermatology (Published by Oxford University Press on behalf of the British Association of Dermatologists).

Comment: The British Journal of Dermatology is one of the world's top dermatology journals, known for high-impact clinical and experimental research.

1.4. Publication Year

Not stated explicitly in the provided text. Funding years of 2022 and 2023 are mentioned, so the study was likely published in 2023 or 2024.

1.5. Abstract

This study addresses the difficulty in differentiating early-stage Mycosis Fungoides (MF)—a type of cutaneous T-cell lymphoma—from common inflammatory skin diseases like eczema and psoriasis. The authors developed a multi-modal Deep Learning model that integrates three types of data: patient clinical metadata, clinical images (macroscopic), and dermoscopic images (microscopic). Using a Convolutional Neural Network (CNN) architecture named RegNetY-400MF, the model was trained on 1,157 cases. The results demonstrated that the AI model significantly outperformed dermatologists in diagnostic accuracy and that an "AI-assisted" workflow (Doctor + AI) achieved the highest diagnostic performance, improving sensitivity and specificity notably for early MF detection.

1.6. Original Source Link

/files/papers/691c3dde25edee2b759f32d2/paper.pdf
Status: Published Academic Paper.

2. Executive Summary

2.1. Background & Motivation

The Clinical Challenge: Mycosis Fungoides (MF) is a rare but serious malignancy (cancer) of the skin's immune cells (T-cells). In its early stages, MF looks deceptively similar to benign inflammatory skin conditions such as eczema, psoriasis, and lichen planus. This similarity leads to frequent misdiagnoses, delayed treatment, and poorer patient outcomes.
The Gap: While biopsy is the gold standard, it is invasive. Non-invasive methods like dermoscopy (using a specialized magnifier to see skin structures) help, but interpretation is subjective and difficult.
The Innovation: Existing Artificial Intelligence (AI) models in dermatology mostly focus on melanoma (skin cancer) and typically use only one type of image. This paper proposes a multi-modal approach, combining clinical photos, dermoscopic images, and patient metadata (age, gender) to mimic a dermatologist's comprehensive diagnostic process.

2.2. Main Contributions / Findings

First Multi-modal MF Model: This is the first study to apply multi-modal learning (Clinical + Dermoscopy + Metadata) specifically for the early diagnosis of MF and its differentiation from five other inflammatory diseases.
Algorithm Selection: The study compared 13 different CNN architectures and identified RegNetY-400MF as the optimal backbone due to its balance of high accuracy and low computational complexity.
Clinical Validation: The study conducted a "Human vs. AI" comparison. The AI model achieved higher accuracy than the 23 participating dermatologists.
AI-Assisted Improvement: Most importantly, the study showed that when dermatologists used the AI's predictions as a reference, their own diagnostic accuracy improved significantly (from ~71.5% to ~82.9%), supporting the model's value as a clinical decision support tool.

3.1. Foundational Concepts

1. Convolutional Neural Network (CNN): A Deep Learning algorithm specifically designed for processing grid-like data, such as images. It uses "filters" or "kernels" to scan the image and automatically learn features (like edges, textures, and shapes) without human instruction.

Analogy: Imagine a detective using a magnifying glass to scan a picture. The detective starts by finding simple clues (lines, colors) and combines them to identify complex objects (lesions, tumors).

2. Multi-modal Learning: A machine learning technique where the model learns from multiple different types of input data (modalities) simultaneously.

In this paper: The modalities are Clinical Images (what the naked eye sees), Dermoscopic Images (magnified skin structures), and Metadata (text data like age/gender). Combining these provides a richer context than any single source alone.

3. Dermoscopy: A non-invasive skin examination technique using a dermoscope—a handheld device with a magnifier and light source. It allows visualization of subsurface skin structures (like blood vessel patterns) that are invisible to the naked eye.

4. Mycosis Fungoides (MF): The most common form of Cutaneous T-Cell Lymphoma (CTCL). It is a cancer of the white blood cells that manifests in the skin. It is often called a "great imitator" because its early patches can closely resemble eczema.

5. RegNet (Regular Network): A family of CNN architectures derived by designing a network design space. Rather than manually designing every layer or searching for one best network (as in Neural Architecture Search, NAS), researchers defined a "design space" of possible network structures and narrowed it down to simple, regular rules that generate efficient models. RegNetY includes a "Squeeze-and-Excitation" (SE) block, which reweights feature channels so the network can emphasize the most informative ones.
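
To make the Squeeze-and-Excitation idea concrete, the following is a minimal pure-Python sketch of channel reweighting. It is illustrative only and not the paper's implementation: the function name `squeeze_excite`, the bias-free layers, and the toy weights are all assumptions.

```python
import math


def squeeze_excite(feature_maps, w_reduce, w_expand):
    """Reweight channels of a [C][H][W] feature map (toy SE block, no biases)."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excitation, part 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    # Excitation, part 2: expand back to C channels and squash with a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]
    return scaled, gates


# Toy input: 2 channels of size 2x2 (values are hypothetical).
feature_maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w_reduce = [[1.0, 0.5]]      # 2 channels -> 1 hidden unit
w_expand = [[0.0], [1.0]]    # 1 hidden unit -> 2 channel gates

scaled, gates = squeeze_excite(feature_maps, w_reduce, w_expand)
```

In this toy case the second channel receives the larger gate, so its values are attenuated less than those of the first channel.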

3.2. Previous Works

Melanoma Focus: The authors note that most prior multi-modal skin AI research focused on melanoma (e.g., using the Derm7pt dataset). Models like FusionM4Net and TFormer were designed for tumors but not tested on inflammatory diseases or MF.
MF Diagnosis: Previous work by Lallas et al. and Bilgic et al. established specific dermoscopic criteria for MF (e.g., "sperm-like vessels"). This paper builds on that biological knowledge but automates the detection using AI.

3.3. Differentiation Analysis

Unimodal vs. Multimodal: While previous studies aimed at psoriasis or lupus used single images, this study integrates three data streams.
Disease Scope: This is explicitly the first AI model targeting the specific differential diagnosis of Early-Stage MF vs. five mimics (Eczema, Psoriasis, Lichen Planus, Pityriasis Rosea, Seborrheic Dermatitis).
Clinical Integration: The study goes beyond model metrics to measure the real-world impact on dermatologists' performance when assisted by the AI.

4. Methodology

4.1. Principles

The core principle is Information Fusion. The hypothesis is that different data modalities contain complementary information:

Clinical Images: Capture the overall distribution and location of the rash.
Dermoscopic Images: Capture fine details like vascular patterns (blood vessels).
Metadata: Provides demographic priors (e.g., MF is more common in older adults, while Atopic Dermatitis is common in children).

By processing these streams separately and then merging their features, the model makes a holistic decision.
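
The idea of processing each stream separately and then merging can be sketched as a simple concatenation followed by one classification layer. This is a hedged illustration, not the authors' code: the helper names `fuse_features` and `linear_classifier`, the vector sizes, and all numeric values are hypothetical.

```python
def fuse_features(clinical_feat, dermoscopic_feat, metadata_feat):
    """Concatenate per-modality feature vectors into one fused vector."""
    return list(clinical_feat) + list(dermoscopic_feat) + list(metadata_feat)


def linear_classifier(fused, weights, biases):
    """Apply one fully connected layer, producing one logit per class."""
    return [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]


clinical = [0.2, 0.7]      # stand-in for the clinical-image feature vector
dermoscopic = [0.9, 0.1]   # stand-in for the dermoscopic-image feature vector
metadata = [0.5]           # stand-in for the metadata feature vector

fused = fuse_features(clinical, dermoscopic, metadata)

# Two toy classes; each row holds one weight per fused feature.
weights = [
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
]
biases = [0.0, 0.0]
logits = linear_classifier(fused, weights, biases)
```

Because the classifier sees all three feature vectors at once, a single weight row can combine evidence from every modality.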

4.2. Core Methodology In-depth (Layer by Layer)

The following figure (Figure 1 from the original paper) illustrates the complete architecture of the diagnostic model, showing how the three data streams are processed and fused.

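This image is a schematic diagram showing the architecture of the multi-modal convolutional neural network model for the early diagnosis of mycosis fungoides and five inflammatory skin diseases. It includes the processing pipelines for dermoscopic images, clinical images, and metadata.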

Step 1: Data Input and Preprocessing

The model accepts three inputs for a single patient case:

Clinical Image: Cropped to the lesion area.
Dermoscopic Image: Captured with polarized light.
Metadata: Age and Gender.

Preprocessing:

Images are augmented (random contrast, gamma adjustment) to handle lighting variations.
Metadata (Categorical) is converted using One-Hot Encoding.
- Concept: If "Gender" has 2 categories (Male, Female), "Male" might be encoded as [1, 0] and "Female" as [0, 1]. This converts text categories into a numerical format the neural network can process.
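
A minimal sketch of this encoding step follows. The gender categories mirror the example above; the age groups and the helper names `one_hot`, `age_group`, and `encode_metadata` are assumptions made for illustration, since the text does not say how age is binned.

```python
GENDER_CATEGORIES = ["Male", "Female"]
AGE_GROUPS = ["<18", "18-59", ">=60"]  # hypothetical bins, not from the paper


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def age_group(age):
    """Map an age in years to one of the hypothetical AGE_GROUPS."""
    if age < 18:
        return "<18"
    if age < 60:
        return "18-59"
    return ">=60"


def encode_metadata(age, gender):
    """Concatenate the one-hot age group and the one-hot gender."""
    return one_hot(age_group(age), AGE_GROUPS) + one_hot(gender, GENDER_CATEGORIES)


encoded = encode_metadata(65, "Female")
```

Here a 65-year-old female patient becomes a five-element vector: three positions for the age group and two for gender.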

Step 2: Feature Extraction (The Backbone)

The system uses RegNetY-400MF as the primary feature extractor for images.

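Why RegNetY-400MF? It is a specific variant of the RegNet family with a computational cost of about 400 MegaFLOPs, i.e., roughly 400 million floating-point operations per forward pass (a count of operations, not a rate per second). Note that "MF" in the model name refers to MegaFLOPs, not Mycosis Fungoides. This makes it lightweight and fast compared to larger models like ResNet50, while maintaining high accuracy.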

The Extraction Process:

Image Streams: The Clinical Image and Dermoscopic Image are passed through two separate (but structurally identical) RegNetY-400MF networks.
Weight Sharing Integration: The text mentions a "weight sharing integration module." This typically means the networks might share some parameters or have a mechanism to align the features from both visual sources, allowing the model to learn common patterns across modalities.
Output: Each RegNetY outputs a Feature Vector—a list of numbers representing the abstract visual content of the image.

Step 3: Metadata Processing

The One-Hot encoded metadata is passed through a Multi-Layer Perceptron (MLP).

MLP Structure: A series of fully connected layers that transform the simple demographic data into a feature vector compatible with the image features.
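
As a rough illustration of such an MLP, the sketch below runs a two-layer forward pass on a one-hot metadata vector. The layer sizes, the weights, the ReLU activation, and the helper names are all assumptions; the text does not give the actual MLP configuration.

```python
def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]


def mlp(x, layers):
    """Apply dense layers in order, with ReLU between (not after) layers."""
    for index, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if index < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical one-hot metadata vector (5 inputs) and toy weights.
metadata_one_hot = [0, 0, 1, 0, 1]
layers = [
    ([[1.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0, 1.0],
      [0.0, 0.0, -1.0, 0.0, 0.0]], [0.0, 0.0, 0.0]),   # 5 -> 3
    ([[0.5, 0.5, 0.5],
      [1.0, -1.0, 0.0]], [0.1, 0.0]),                  # 3 -> 2
]

metadata_features = mlp(metadata_one_hot, layers)
```

The output is a short dense vector that can be concatenated with the image feature vectors.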

Step 4: Fusion and Classification

Concatenation: The feature vectors from the Clinical stream, Dermoscopic stream, and Metadata stream are joined together (concatenated) into one long "Master Vector."
Optimization: The model is trained using the Adam Optimizer.
- Formula: The Adam optimizer updates the network weights ($\theta$) using the following equations based on gradients ($g_t$):
  $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
  $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
  $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
  $\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
- Symbol Explanation:
  - $m_t$ : First moment estimate (moving average of gradients).
  - $v_t$ : Second moment estimate (moving average of squared gradients).
  - $\beta_1, \beta_2$ : Exponential decay rates (set to 0.9 and 0.999 in this paper).
  - $\alpha$ : Learning rate (set to 2e-4 with cosine annealing).
  - $\epsilon$ : A small constant to prevent division by zero.
Loss Function: Although the text does not state it explicitly, multi-class classification tasks typically use Cross-Entropy Loss, which penalizes the model when its predicted probability for the correct class is low.
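
The update rule and loss described above can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions: $\epsilon = 10^{-8}$, a minimum learning rate of 0, the total step count, and the toy objective $\theta^2$ are not taken from the paper, and cross-entropy is the assumed (not confirmed) loss.

```python
import math


def cosine_annealed_lr(step, total_steps, base_lr=2e-4, min_lr=0.0):
    """Cosine annealing from base_lr (step 0) down to min_lr (final step)."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step index."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias-corrected first moment
    v_hat = [vi / (1 - beta2 ** t) for vi in v]   # bias-corrected second moment
    theta = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


def cross_entropy(logits, target):
    """Negative log-probability of the target class (stable log-sum-exp)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Toy example: one Adam step on f(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = [1.0], [0.0], [0.0]
lr = cosine_annealed_lr(step=0, total_steps=100)
grad = [2.0 * theta[0]]
theta, m, v = adam_step(theta, grad, m, v, t=1, lr=lr)
```

On the first step the bias correction makes the update size approximately equal to the learning rate, regardless of the gradient's magnitude.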

Step 5: Interpretability (Grad-CAM)

To understand where the model is looking, the authors use Grad-CAM (Gradient-weighted Class Activation Mapping).

Principle: It looks at the gradients flowing into the final convolutional layer to generate a heatmap. Red areas on the heatmap indicate regions that strongly influenced the model's decision (e.g., a specific blood vessel pattern).
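
The core Grad-CAM computation can be sketched as follows, assuming the activations of the final convolutional layer and the gradients of the class score with respect to them are already available (in practice they come from a framework's backward pass). The function name `grad_cam` and the toy arrays are hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap.

    activations, gradients: nested lists shaped [K][H][W], where K is the
    number of feature maps in the final convolutional layer.
    """
    num_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel importance: spatial average of the gradients for each map.
    alphas = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    heatmap = [
        [
            max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalize to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Toy example: map 0 supports the class, map 1 argues against it.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
```

Only the top-left location stays "hot" here, because the ReLU discards the region whose feature map has a negative weight.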

5. Experimental Setup

The following flowchart (Figure 2 from the original paper) details the patient selection and exclusion criteria, showing how the final dataset of 1,157 cases was formed.

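This image is a schematic diagram of the patient selection criteria and data processing workflow, showing how the early diagnosis model was built. The model is based on 1,157 verified cases, including clinical images, dermoscopic images, and associated metadata, divided into training, validation, and test sets.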

5.1. Datasets

Source: Department of Dermatology, Peking Union Medical College Hospital (2016–2020).
Scale: 1,157 total cases.

Images: 2,452 Clinical images, 6,550 Dermoscopic images.
Classes:
1. MF: Mycosis Fungoides (114 cases)
2. ECZ: Eczema (347 cases)
3. PSO: Psoriasis (213 cases)
4. LP: Lichen Planus (243 cases)
5. PR: Pityriasis Rosea (131 cases)
6. SD: Seborrheic Dermatitis (109 cases)

Data Split: The dataset was divided into Training, Validation, and Test sets. The test set specifically contained 118 cases for the "Human vs. AI" comparison.

5.2. Evaluation Metrics

The paper uses standard classification metrics. Here are their definitions and formulas:

1. Accuracy:

Definition: The percentage of total cases correctly classified.
Formula: $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
Symbols: TP (True Positive), TN (True Negative), FP (False Positive), FN (False Negative).

2. Precision (Positive Predictive Value):

Definition: Out of all cases the model predicted as "Disease X", how many were actually "Disease X"?
Formula: $Precision = \frac{TP}{TP + FP}$

3. Sensitivity (Recall):

Definition: Out of all actual "Disease X" cases, how many did the model correctly detect? (Crucial for MF to avoid missed diagnoses).
Formula: $Sensitivity = \frac{TP}{TP + FN}$

4. Specificity:

Definition: Out of all cases that were not "Disease X", how many did the model correctly identify as negative?
Formula: $Specificity = \frac{TN}{TN + FP}$

5. F1-Score:

Definition: The harmonic mean of Precision and Sensitivity. It provides a balanced view when datasets are imbalanced.
Formula: $F1 = 2 \cdot \frac{Precision \cdot Sensitivity}{Precision + Sensitivity}$

6. Kappa Coefficient:

Definition: A statistical measure of inter-rater agreement (or agreement between model and ground truth) that accounts for the possibility of the agreement occurring by chance.
Interpretation: By one common convention, values $>0.75$ indicate excellent agreement, 0.40-0.75 fair to good agreement, and $<0.40$ poor agreement.
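
The metrics above can all be derived from a multi-class confusion matrix by treating one disease at a time as "positive" (one-vs-rest). The sketch below uses a made-up 3-class matrix; the numbers are not from the paper, and the function names are illustrative.

```python
def one_vs_rest_counts(cm, k):
    """Return (TP, TN, FP, FN) for class k; cm[i][j] = true i, predicted j."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def class_metrics(cm, k):
    """Precision, sensitivity, specificity, and F1 for class k."""
    tp, tn, fp, fn = one_vs_rest_counts(cm, k)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


def overall_accuracy(cm):
    """Fraction of all cases on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    p_expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)
    ) / (total * total)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical confusion matrix for three classes (rows: truth, cols: prediction).
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
metrics_class0 = class_metrics(cm, 0)
accuracy = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

For this toy matrix, class 0 has precision and sensitivity of 0.8 and specificity of 0.9, overall accuracy is about 0.733, and kappa is 0.6.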

5.3. Baselines

Network Baselines: The authors compared RegNetY-400MF against 12 other well-known architectures, including ResNet50 (a widely used baseline), EfficientNet (known for efficiency), Inception-v3, and Swin Transformer (a modern attention-based model).
Human Baselines: A group of 23 Dermatologists with varying experience levels served as the human benchmark.

6. Results & Analysis

6.1. Core Results Analysis

1. Model Architecture Comparison: RegNetY-400MF was chosen not just for its speed but also for its performance. It ranked first in 4 out of 5 metrics among the 13 tested networks.

Sensitivity: 73.69%
Specificity: 94.26%
F1-score: 72.18%
Accuracy: 73.33%
Note: These numbers represent the baseline performance before full optimization or fusion in the validation phase.

2. Human vs. AI vs. Doctor+AI: This is the most critical result. The table below synthesizes the performance data presented in the text for the test dataset (118 cases).

The following are the results from the comparative experiments described in the "Results" section of the original paper:

Metric	Doctor Only	Doctor + AI Assistance	Improvement
Average Accuracy	71.52%	82.94%	+11.42%
Sensitivity	74.56%	86.16%	+11.61%
Specificity	94.06%	96.45%	+2.38%
F1-Score	Not stated (lower)	Not stated (higher)	+12.48%
Kappa	Not stated (moderate)	Not stated (higher)	+14.72%

Analysis: The AI model alone achieved higher accuracy than the doctors. However, the combination (Doctor + AI) was superior to both. The 11.61-percentage-point increase in sensitivity is vital for MF, as it means fewer cancer cases are missed.

3. Impact by Experience Level:

Junior Clinicians: Saw the largest improvements (Accuracy +14.48%).
Senior Practitioners: Saw smaller gains in overall accuracy. Somewhat paradoxically, MF accuracy decreased slightly in some metrics while MF sensitivity increased. This suggests that AI assistance nudges experts toward "safer" decisions (catching more cases), possibly at the cost of slightly more false alarms.

6.2. ROC Curve Analysis

The following figure (Figure 3 from the original paper) displays the Receiver Operating Characteristic (ROC) curves for the six diseases.

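This image is a chart showing the relationship between sensitivity and specificity for the six skin diseases: mycosis fungoides (MF), lichen planus (LP), psoriasis (PSO), eczema (ECZ), seborrheic dermatitis (SD), and pityriasis rosea (PR). The curve for each disease compares the diagnostic performance of the AI model with that of the dermatologists.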

Interpretation: The ROC curve plots the True Positive Rate (y-axis) vs. False Positive Rate (x-axis).
Curves: The curves (colored lines) represent the AI model's performance. The closer a curve is to the top-left corner (coordinate 0,1), the better the model.
Dots: The yellow dots represent individual dermatologists.
Observation: Most yellow dots fall below the AI's curves. This indicates that, at a comparable false positive rate, the AI model achieves a higher true positive rate than most of the individual dermatologists. The performance gap is particularly noticeable for Eczema, where the AI significantly outperforms humans.
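
To show how such a plot is constructed, the sketch below builds ROC points by sweeping a threshold over model scores, computes the area under the curve with the trapezoidal rule, and derives the single (FPR, TPR) point that one reader's yes/no decisions would produce. All scores, labels, and decisions are invented for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct threshold, from (0, 0) up."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def reader_point(decisions, labels):
    """(FPR, TPR) for one reader's binary decisions: a single dot on the plot."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 1)
    fp = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 0)
    return fp / negatives, tp / positives


# Hypothetical model scores for "disease X" and the true labels (1 = disease X).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
dot = reader_point([1, 1, 1, 0, 0, 0], labels)
```

A reader's dot lies below the curve when, at the same false positive rate, the curve reaches a higher true positive rate.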

6.3. Visualization & Interpretability

The following figure (Figure 4 from the original paper) shows the Grad-CAM visualization results.

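This image is an illustration showing the true diagnosis and the AI prediction for different patients, comparing five skin presentations (panels A to E). The true diagnosis and the AI's predicted probability are given below each image; the predicted probability for MF (mycosis fungoides) is above 99% in every panel, demonstrating the model's effectiveness.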

Success Cases (High Accuracy): For classic MF cases (Images A-E), the model achieves >99% probability. The heatmaps (red/yellow overlays) align closely with dermoscopic features known to doctors, such as atypical vascular architectures (abnormal blood vessels) and lymphocyte clusters.
Failure Cases (<90% Accuracy): The authors note that errors occur when lesions have "non-specific inflammation patterns" (e.g., generic redness) or poor contrast. In these cases, the heatmap shows the model focusing on irrelevant features, highlighting the limitation of image-based diagnosis when visual cues are weak.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully established a robust, multi-modal AI framework for the early detection of Mycosis Fungoides. By integrating clinical photos, dermoscopy, and patient data, the model:

Achieved superior diagnostic accuracy compared to the participating dermatologists.
Served as an effective "second opinion," significantly boosting the performance of doctors, especially junior ones.
Validated the use of RegNetY-400MF as an efficient backbone for dermatological AI.

7.2. Limitations & Future Work

Limitations:

Retrospective Design: The study used past data. It needs to be tested in a "live" clinical setting (prospective study) to prove it works in real-time.
Interpretability Gap: While Grad-CAM helps, the model only gives a probability score. It cannot explain why it thinks a lesion is MF in natural language (e.g., "I see linear vessels here").
Sample Size: 1,157 cases is a decent size but relatively small for deep learning, limiting the model's exposure to rare variations.

Future Work:

Generative AI: The authors suggest using Large Language Models (LLMs) to generate descriptive diagnostic reports, not just scores.
Deployment: Packaging the model into a software interface for large-scale clinical testing.

7.3. Personal Insights & Critique

Clinical Reality Check: The most impressive part of this paper is the "Doctor + AI" experiment. Many AI papers stop at "Model beats Human," which creates an adversarial narrative. This paper correctly frames AI as an assistive tool. The result that Junior doctors improved the most suggests this tool could be a powerful equalizer in education and training.
The "Black Box" Issue: The reliance on Grad-CAM is standard but limited. Heatmaps often just highlight the "lesion" without telling us what feature in the lesion triggered the response. Future work integrating "concept bottleneck models" (which explicitly predict features like 'scaling' or 'vessels' before the diagnosis) would be more clinically trustworthy.
Data Imbalance: MF had 114 cases vs. 347 for Eczema. While metrics like F1-score help, the class imbalance is a risk. The model might be biased towards calling ambiguous cases "Eczema" simply because it is more common. The high sensitivity achieved suggests the authors managed this well, possibly through the loss function or sampling, though the text provided does not describe how imbalance was handled (e.g., with a weighted loss).
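
As one illustration of the weighted-loss idea mentioned above, the sketch below computes inverse-frequency class weights from the case counts reported in Section 5.1. This is not something the paper is known to have done: the weighting formula and the function name `inverse_frequency_weights` are assumptions, and in practice the counts would come from the training split rather than the full dataset.

```python
# Case counts per class as reported for the full dataset (Section 5.1).
CASE_COUNTS = {
    "MF": 114,
    "ECZ": 347,
    "PSO": 213,
    "LP": 243,
    "PR": 131,
    "SD": 109,
}


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count).

    Rare classes get weights above 1 and common classes below 1, so that
    every class contributes equally to a weighted loss on average.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * count) for name, count in counts.items()}


weights = inverse_frequency_weights(CASE_COUNTS)

for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name}: {weight:.3f}")
```

With these counts, MF receives roughly three times the weight of Eczema, which would make a missed MF case cost more during training than a missed Eczema case.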