Integrating large language models with explainable fuzzy inference systems for trusty steel defect detection
TL;DR Summary
LE-FIS integrates LLMs with an explainable fuzzy inference system, combining locally trained, globally predicted (LTGP) detection with genetic-algorithm optimization to provide efficient, transparent, and trustworthy steel defect detection.
Abstract
The extracted header identifies the venue as Pattern Recognition Letters (Elsevier, ScienceDirect).
Authors: Kening Zhang (a), Yung Po Tsang (a), Carman K.M. Lee (a), C.H. Wu (b, corresponding author). Affiliations: (a) Department of Industrial and Systems Engineering, Research Institute for Advanced Manufacturing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; (b) Department of Supply Chain and Information Management, The Hang Seng University of Hong Kong, Siu Lek Yuen, N.T., Hong Kong.
Keywords: Fuzzy inference system (FIS); Steel defect detection; Explainable artificial intelligence (XAI); Black-box model.
Abstract (truncated in the extracted text): "In industrial applications, the complexity of machine learning models often makes their decision-making processes difficult to interpret and lack transparency, particularly in the steel manufacturing […]"
In-depth Reading
1. Bibliographic Information
1.1. Title
Integrating large language models with explainable fuzzy inference systems for trusty steel defect detection.
1.2. Authors
The authors of this paper are Kening Zhang, Yung Po Tsang, Carman K.M. Lee, and C.H. Wu. The first three are affiliated with The Hong Kong Polytechnic University, and C.H. Wu is with The Hang Seng University of Hong Kong. Their research backgrounds appear to be in machine learning, industrial applications, and explainable AI.
1.3. Journal/Conference
The header preserved in the extracted text indicates the paper appears in Pattern Recognition Letters (Elsevier).
1.4. Publication Year
The paper references works published as recently as 2024. Therefore, the publication year of this paper is likely 2024 or later.
1.5. Abstract
The paper introduces a novel framework named LE-FIS (Large language models-based Explainable Fuzzy Inference System) designed to interpret "black-box" machine learning models used for detecting defects in steel. The methodology involves a two-stage process. First, a deep learning model is trained using a LTGP (locally trained, globally predicted) approach, where images are broken into smaller segments for training, and the model is then used to detect defects on entire images. Second, the LE-FIS framework is used to explain the decisions of this deep learning model by automatically generating fuzzy rules and membership functions. These components are then optimized using a genetic algorithm (GA). Finally, state-of-the-art large language models (LLMs) are employed to translate the technical outputs of LE-FIS into human-understandable explanations. The experimental results indicate that the LTGP model performs effectively in defect detection, and the LE-FIS framework, supported by LLMs, successfully provides a trustworthy and interpretable solution, enhancing transparency and reliability for industrial applications.
1.6. Original Source Link
The original source is a PDF file located at /files/papers/690485299f2f7e6b6c47c52d/paper.pdf. The extracted header points to Pattern Recognition Letters, but no official publisher link is included in the provided material.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper addresses is the "black-box" nature of advanced machine learning models, particularly deep learning models like Convolutional Neural Networks (CNNs), when applied to critical industrial tasks such as steel defect detection. While these models achieve high accuracy, their internal decision-making processes are opaque and difficult for humans to understand.
This lack of transparency is a major barrier to adoption in high-stakes environments for several reasons:
- Quality Control: Engineers need to understand why a defect was flagged to take appropriate corrective action.
- Regulatory Compliance: Many industries require transparent and justifiable processes for quality assurance.
- Stakeholder Trust: Building trust with engineers, managers, and customers is difficult if the technology making crucial decisions cannot be explained.
Prior research has either focused on high-accuracy black-box models or simpler, more interpretable models that lack accuracy. Existing Explainable AI (XAI) methods like LIME or SHAP provide some insight but do not generate inherent, rule-based explanations that are easy to follow. The paper identifies a gap: the need for a system that is not only accurate but can also logically explain its reasoning and provide actionable guidance. This paper's innovative idea is to bridge this gap by building an explainable surrogate model (LE-FIS) that mimics the black-box detector and then uses the power of Large Language Models (LLMs) to translate its logic into natural language.
The overall workflow is illustrated in Figure 1 from the paper, which contrasts machine learning and non-machine learning methods in terms of their accuracy and explainability trade-off.
The image is a schematic from the paper showing the overall steel defect detection workflow: image acquisition, raw image input, detection via machine learning and non-machine-learning fuzzy inference methods, and the final defect classification output, highlighting the trade-off between accuracy and explainability.
2.2. Main Contributions / Findings
The paper outlines four primary contributions:
- A Novel Detection Method (LTGP): The paper proposes a locally trained, globally predicted deep detection method. This approach involves segmenting large steel surface images into smaller patches for training a deep learning model, which is then used to perform detection on the complete, unsegmented images. This is designed to handle large-scale industrial imagery efficiently.
- An Explainable Surrogate Model (LE-FIS): The paper introduces LE-FIS, an explainable fuzzy inference system designed to interpret the LTGP black-box model. LE-FIS automatically generates fuzzy rules and membership functions to approximate the behavior of the deep learning model. A genetic algorithm (GA) is used to optimize its parameters for a better fit.
- LLM-Powered Interpretation: The authors utilize state-of-the-art LLMs to interpret the fuzzy rules and results from LE-FIS. This step translates the technical, mathematical output of the fuzzy system into clear, professional, and accessible natural language explanations. The paper also establishes evaluation metrics to compare the explanatory performance of different LLMs.
- Experimental Validation: The paper provides experimental results demonstrating that the LTGP model achieves strong performance in defect detection, and the complete LE-FIS framework successfully provides a trustworthy and interpretable model. This enhances transparency for industrial steel defect detection, balancing accuracy with explainability.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, one must be familiar with the following concepts:
- Black-Box vs. White-Box Models:
- Black-Box Model: An artificial intelligence system whose internal workings are hidden or too complex for humans to understand. Users can see the inputs and outputs but not the decision-making process. Deep neural networks are a prime example.
- White-Box Model: A model whose internal logic is transparent and easily interpretable. Examples include decision trees or simple linear regression, where the rules and weights are clear.
- Explainable AI (XAI): A field of artificial intelligence focused on developing methods and techniques that make the results and decisions of AI systems understandable to humans. The goal of XAI is to open up the "black box" and provide insights into how a model arrives at a specific conclusion.
- Fuzzy Inference System (FIS): A system based on fuzzy logic, which is a form of logic that deals with reasoning that is approximate rather than fixed and exact. An FIS consists of three main stages:
  - Fuzzification: Crisp (numerical) input values (e.g., pixel intensity, texture score) are converted into fuzzy values by determining their degree of membership in predefined fuzzy sets (e.g., "low brightness," "medium texture," "high contrast"). This is done using membership functions.
  - Inference Engine (Rule Evaluation): A set of human-like IF-THEN rules are applied to the fuzzy values to produce a fuzzy output. For example: IF texture is 'rough' AND brightness is 'low', THEN defect is 'likely'.
  - Defuzzification: The fuzzy output from the rule evaluation is converted back into a single, crisp numerical value (e.g., a defect probability score).
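  - Worked Example: if a patch's texture score has membership 0.7 in 'rough' and its brightness has membership 0.9 in 'low', the rule above fires with strength 0.7 × 0.9 = 0.63 under the product AND operator; defuzzification then aggregates all such weighted rule outputs into one crisp defect score.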
- Genetic Algorithm (GA): A search and optimization technique inspired by the process of natural selection. A GA works by evolving a population of candidate solutions to a problem over generations. The basic steps are as follows (a toy implementation appears after this list):
  - Initialization: Create an initial population of random solutions (chromosomes).
  - Fitness Evaluation: Evaluate how good each solution is using a "fitness function." In this paper, the fitness is how well the FIS output matches the black-box model's output.
  - Selection: Select the fittest individuals to be "parents" for the next generation.
  - Crossover: Combine the genetic material (parameters) of pairs of parents to create new offspring.
  - Mutation: Introduce small, random changes into the offspring's genes to maintain diversity.
  - Repeat: Repeat the process of evaluation, selection, crossover, and mutation for many generations until the solution converges.
- Convolutional Neural Network (CNN) and ResNet:
  - CNN: A type of deep learning model especially effective for image analysis. It uses layers of convolutions (filters) to automatically learn hierarchical features from images, such as edges, textures, and shapes.
  - ResNet (Residual Network): A specific architecture of CNN that introduces "shortcut connections" or "residual blocks." These connections allow the model to skip one or more layers, which helps to mitigate the "vanishing gradient" problem and enables the training of much deeper and more powerful networks. The paper uses ResNet-18.
- Large Language Models (LLMs): Massive deep learning models trained on vast amounts of text data (e.g., GPT-4, Llama). They excel at understanding, generating, and translating human language. In this paper, they are used as an "explainer" to convert the technical rules of the FIS into fluent, understandable prose.
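To make the GA steps concrete, the following is a toy real-coded GA in Python. It is a minimal sketch, not the authors' implementation; the truncation selection, one-point crossover, and Gaussian mutation are generic textbook choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ga(fitness, dim, pop_size=50, generations=100, mut_rate=0.1):
    """Toy real-coded GA that minimizes `fitness` over dim-dimensional vectors."""
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))       # initialization
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])     # fitness evaluation
        parents = pop[np.argsort(scores)[: pop_size // 2]]   # selection (keep best half)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, dim)                        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            mask = rng.random(dim) < mut_rate                 # Gaussian mutation
            child[mask] += rng.normal(0.0, 0.1, mask.sum())
            children.append(child)
        pop = np.vstack([parents] + children)                 # next generation
    return pop[np.argmin([fitness(ind) for ind in pop])]      # best solution found

# Example: minimize the sphere function f(x) = sum(x^2); the optimum is the zero vector.
best = run_ga(lambda x: float((x ** 2).sum()), dim=8)
```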
3.2. Previous Works
The authors position their work against existing methods for steel defect detection and explainability.
- Machine Learning for Defect Detection: The paper acknowledges a rich history of using machine learning for this task.
  - Early methods used neural networks like two-layer feedforward networks [13].
  - More recent and successful methods leverage deep learning, especially CNNs [14, 15], which excel at image-based feature extraction.
  - Unsupervised methods, such as Generative Adversarial Networks (GANs) and autoencoders [11, 16], have been used to detect defects without needing labeled data.
  - Reinforcement learning has also been applied to enhance detection [17]. The common thread is that while these models can be highly accurate, they are typically black boxes.
- Explainable AI (XAI) in Defect Detection: The paper critiques previous XAI approaches in this domain.
  - Some methods claim explainability but do not fundamentally address the black-box problem [18].
  - Other works [19-21] apply general-purpose XAI tools like LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations). These tools are post-hoc, meaning they explain a model's prediction by analyzing how inputs affect the output, but they do not reveal the model's internal logic. The authors argue these methods are not "inherently explainable."
  - Other approaches are limited in scope, for instance, by only addressing classification and not segmentation [22], or by providing interpretations that are still unclear about the exact logic used [23].
3.3. Technological Evolution
The field of automated defect detection has evolved significantly:
- Traditional Image Processing: Early methods relied on manually crafted filters and algorithms to detect features like edges, blobs, and textures. These were highly interpretable but brittle and lacked accuracy on complex defects.
- Classical Machine Learning: Models like Support Vector Machines (SVMs) and Random Forests were used with hand-engineered features, improving accuracy but still requiring significant domain expertise for feature design.
- Deep Learning Revolution: CNNs and other deep models automated feature extraction, leading to a massive leap in accuracy. However, this came at the cost of interpretability, creating the "black-box" problem.
- The Rise of XAI: In response, the field of XAI emerged, with post-hoc methods like LIME and SHAP becoming popular. These methods provide valuable insights but are often approximations and do not create a fully transparent model.

This paper represents a step forward in the XAI timeline by attempting to create an inherently interpretable surrogate model (FIS) and then further enhancing its accessibility through natural language generation (LLM).
3.4. Differentiation Analysis
The core innovation of this paper compared to previous work is its hybrid, multi-stage approach to explainability:
- Beyond Post-Hoc Explanations: Instead of just applying LIME or SHAP to a black-box model, this paper constructs a completely new, interpretable "white-box" model (the FIS) that is trained to mimic the black-box model. The explanation is then derived from this interpretable surrogate.
- Generation of Logical Rules: The FIS generates explicit IF-THEN rules, which are more intuitive to a human operator than feature importance scores or heatmaps. This provides a causal, logical explanation.
- Leveraging LLMs for the "Last Mile" of Explanation: The paper uniquely recognizes that even IF-THEN rules can be technical. It uses LLMs to bridge the final gap, translating these rules into professional, easy-to-read reports with recommendations. This is a novel application of LLMs in the industrial XAI pipeline.
- Integrated Framework: The LE-FIS is not just a single component but a full framework, from the LTGP training strategy to the GA-based optimization and the final LLM interpretation layer, as shown in the paper's overall architecture diagram (Figure 2).
The image is a schematic of the overall method for steel defect detection, showing the pipeline of data preprocessing, the FLEX module (containing fuzzy inference and genetic-algorithm optimization), and the interpretation of output results, illustrating the LE-FIS model structure and the LLM explanation process.
4. Methodology
4.1. Principles
The core idea of the methodology is to create a trustworthy and explainable system for steel defect detection by combining the predictive power of a deep learning model with the interpretability of a fuzzy inference system, and then making that interpretation accessible through a large language model. The approach can be broken down into four main stages:
- Image Preprocessing and Segmentation: Prepare the data by segmenting large images into smaller, manageable patches for efficient training.
- Deep Learning for Detection: Train a high-performance "black-box" model (ResNet-18) on these patches to serve as the primary defect detector. This is the LTGP method.
- Fuzzy System Fitting: Train an FIS to mimic the predictions of the ResNet-18 model. This FIS serves as an interpretable "surrogate" model. Its parameters are optimized using a Genetic Algorithm (GA).
- LLM-based Explanation: Use a prompted LLM to translate the rules and outputs of the FIS into natural language.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Step 1: Preprocessing and Segmentation
To handle large industrial images (e.g., 1600 × 256 pixels in the Severstal dataset used here), the authors employ a locally trained, globally predicted (LTGP) strategy. The training phase starts with preprocessing and segmentation.
- Patch Extraction: The original high-resolution image $I_{raw}$, with dimensions $H \times W$, is cropped into smaller patches of size $h_c \times w_c$. This is done using a sliding window with vertical stride $s_h$ and horizontal stride $s_w$. A specific patch is extracted using the formula: $ I_{crop}^{i,j} = I_{raw}(i \cdot s_h : i \cdot s_h + h_c, \; j \cdot s_w : j \cdot s_w + w_c) $
  - Symbol Explanation:
    - $I_{crop}^{i,j}$: The cropped patch at the (i, j)-th window position.
    - $I_{raw}$: The original input image.
    - $i, j$: The row and column indices of the sliding window.
    - $s_h, s_w$: The vertical and horizontal strides, determining how far the window moves in each step.
    - $h_c, w_c$: The height and width of the cropped patch.
  The paper mentions that zero-padding is applied if the image dimensions are not perfectly divisible by the window size and stride, ensuring the entire image is covered.
- Normalization: After cropping, each patch is normalized to scale pixel values to a standard range, which helps stabilize the training process. The formula in the extracted text is incomplete ($ I_n(l) = I_r(l) $); given the symbols defined alongside it, the intended standard normalization is: $ I_n(l) = \frac{I_r(l) - \mu}{\sigma} $
  - Symbol Explanation:
    - $I_n(l)$: The $l$-th normalized image patch.
    - $I_r(l)$: The $l$-th raw image patch.
    - $\mu$: The mean pixel value across the dataset.
    - $\sigma$: The standard deviation of pixel values across the dataset.
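A minimal NumPy sketch of this cropping-and-normalization step, assuming grayscale images; the function names and the exact padding arithmetic are illustrative, since the paper only states that zero-padding ensures full coverage:

```python
import numpy as np

def crop_patches(i_raw: np.ndarray, h_c: int, w_c: int, s_h: int, s_w: int):
    """Slide an (h_c, w_c) window over a grayscale image with strides (s_h, s_w),
    zero-padding the bottom/right edges so every window yields a full patch."""
    H, W = i_raw.shape
    n_i = int(np.ceil(max(H - h_c, 0) / s_h)) + 1   # window positions per column
    n_j = int(np.ceil(max(W - w_c, 0) / s_w)) + 1   # window positions per row
    padded = np.zeros(((n_i - 1) * s_h + h_c, (n_j - 1) * s_w + w_c),
                      dtype=i_raw.dtype)
    padded[:H, :W] = i_raw                          # zero-padding on the borders
    return [padded[i * s_h: i * s_h + h_c, j * s_w: j * s_w + w_c]
            for i in range(n_i) for j in range(n_j)]

def normalize(patch: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """I_n(l) = (I_r(l) - mu) / sigma with dataset-level mean and std."""
    return (patch.astype(np.float32) - mu) / sigma
```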
4.2.2. Step 2: Defect Detection using ResNet-18 (The Black-Box Model)
The cropped and normalized patches are used to train a ResNet-18 model for defect classification. ResNet-18 is chosen for its ability to train deep networks effectively.
- Residual Blocks: The core component of ResNet is the residual block, which allows gradients to flow more easily through the network. Its operation is defined by: $ y = F(x, \{W_i\}) + x $
  - Symbol Explanation:
    - $x$: The input feature map to the block.
    - $y$: The output feature map.
    - $F(x, \{W_i\})$: The residual mapping (a series of convolutional transformations) that the block learns. The model learns the difference, or residual, between the input and the desired output.
    - $\{W_i\}$: The weights of the convolutional layers within the block.
  The transformation $F(x)$ typically consists of two or more convolutional layers. The paper specifies a two-layer block: $ F(x, \{W_i\}) = W_2 \cdot \sigma(W_1 \cdot x) $
  - Symbol Explanation:
    - $W_1, W_2$: The weight matrices of the two convolutional layers.
    - $\sigma$: The ReLU (Rectified Linear Unit) activation function, which introduces non-linearity.
- Final Classification Layers: After passing through multiple residual blocks, the resulting feature map is processed for final classification (a code sketch of the residual block and this head follows the list).
  - Global Average Pooling (GAP): The spatial dimensions of the feature map are reduced to a single vector. This reduces the number of parameters and helps prevent overfitting. $ f_{gap} = \frac{1}{H_f \times W_f} \sum_{i=1}^{H_f} \sum_{j=1}^{W_f} f_{rn}(i, j) $
    - Symbol Explanation:
      - $f_{gap}$: The output feature vector after pooling.
      - $f_{rn}(i, j)$: The value of the feature map at position (i, j).
      - $H_f, W_f$: The height and width of the final feature map before GAP.
  - Fully Connected (FC) Layer: The pooled feature vector is fed into a standard FC layer to produce logits (raw scores) for each defect class. $ z = W_{fc} \cdot f_{gap} + b_{fc} $
    - Symbol Explanation:
      - $z$: The output vector of logits.
      - $W_{fc}$: The weight matrix of the FC layer.
      - $b_{fc}$: The bias term of the FC layer.
  - Softmax Function: The logits are converted into class probabilities. $ \hat{y}_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}, \quad i = 1, 2, \ldots, C $
    - Symbol Explanation:
      - $\hat{y}_i$: The predicted probability for class $i$.
      - $z_i$: The logit for class $i$.
      - $C$: The total number of defect classes.
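To make the architecture concrete, here is a minimal PyTorch-style sketch of a two-layer residual block and the GAP, FC, and softmax head described above. This is an illustrative reconstruction, not the authors' code; the batch-normalization layers follow the usual ResNet convention rather than anything stated in the summary, and in practice the softmax is usually folded into the training loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x, {W_i}) + x with F(x) = W2 · sigma(W1 · x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # sigma(W1 · x)
        out = self.bn2(self.conv2(out))         # W2 · sigma(W1 · x)
        return F.relu(out + x)                  # residual (shortcut) connection

class ClassificationHead(nn.Module):
    """Global average pooling, fully connected layer, softmax."""
    def __init__(self, feat_channels: int, num_classes: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(feat_channels, num_classes)

    def forward(self, feat):                    # feat: (B, C, H_f, W_f)
        f_gap = self.gap(feat).flatten(1)       # pooled vector f_gap: (B, C)
        z = self.fc(f_gap)                      # logits z
        return F.softmax(z, dim=1)              # class probabilities y_hat
```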
4.2.3. Step 3: FIS Module for Black-Box Fitting (The Surrogate Model)
This is the core of the LE-FIS framework. An FIS is trained to approximate the output ($\hat{y}$) of the ResNet-18 model, using the same input features.
- Fuzzification (Membership Functions): Each input feature (e.g., texture, color from the image) is mapped to a degree of membership in a fuzzy set (e.g., 'low', 'medium', 'high') using a membership function. The paper uses a generalized bell-shaped function: $ \mu_{A_i}(x_i) = \frac{1}{1 + \left( \frac{x_i - c_i}{\sigma_i} \right)^2} $
  - Symbol Explanation:
    - $\mu_{A_i}(x_i)$: The membership degree of input $x_i$ in the fuzzy set $A_i$.
    - $c_i$: The center of the bell curve for this function.
    - $\sigma_i$: The width of the bell curve, controlling its steepness. These parameters ($c_i$, $\sigma_i$) are learned during optimization.
- Fuzzy Rules and Inference: The system uses a set of IF-THEN rules. A typical Sugeno-style rule is structured as: IF $x_1$ is $A_1$ AND $x_2$ is $A_2$, THEN the rule output is $f_i(x_1, x_2)$. The "firing strength" or weight ($w_i$) of each rule is calculated by combining the membership degrees of the inputs for that rule, typically using a product (AND operator): $ w_i = \prod_{j=1}^{n} \mu_{A_j}(x_j) $
  - Symbol Explanation:
    - $w_i$: The firing strength of the $i$-th rule.
    - $\mu_{A_j}(x_j)$: The membership degree of input $x_j$ in the fuzzy set associated with this rule.
    - $n$: The number of input features.
- Defuzzification: The final output of the FIS is a crisp value, calculated as the weighted average of the outputs of all rules. $ z = \frac{\sum_{i=1}^{N} w_i f_i(x_1, x_2)}{\sum_{i=1}^{N} w_i} $
  - Symbol Explanation:
    - $z$: The final crisp output of the FIS.
    - $N$: The total number of fuzzy rules.
    - $w_i$: The firing strength of the $i$-th rule.
    - $f_i$: The output function of the $i$-th rule (for a Sugeno-type FIS, this is often a linear combination of inputs).
- Optimization with Genetic Algorithm (GA): The goal is to make the FIS output ($z$) as close as possible to the black-box model's prediction ($\hat{y}$). The GA is used to tune the FIS parameters (the membership function parameters and the rule parameters) by minimizing the Mean-Squared Error (MSE) loss function (a combined code sketch of the FIS and this fitness follows the list): $ \mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} (z_i - \hat{y}_i)^2 $
  - Symbol Explanation:
    - $\mathcal{L}$: The loss function to be minimized.
    - $m$: The number of training samples.
    - $z_i$: The output of the FIS for the $i$-th sample.
    - $\hat{y}_i$: The prediction of the ResNet-18 black-box model for the $i$-th sample.
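The three components above combine into a compact sketch: bell-shaped fuzzification, product-rule firing with first-order Sugeno consequents, weighted-average defuzzification, and the MSE fitness the GA minimizes. This is an illustrative NumPy reconstruction; the flat chromosome layout and the linear rule consequents are assumptions consistent with the Sugeno note above, not the authors' exact implementation.

```python
import numpy as np

def bell_membership(x, c, sigma):
    """mu_A(x) = 1 / (1 + ((x - c) / sigma)^2)."""
    return 1.0 / (1.0 + ((x - c) / sigma) ** 2)

def sugeno_infer(x, centers, sigmas, consequents):
    """One first-order Sugeno inference step.
    x: (n,) features; centers/sigmas: (N, n) bell-MF parameters per rule;
    consequents: (N, n+1) linear rule outputs f_i(x) = p_i · x + r_i."""
    mu = bell_membership(x, centers, sigmas)          # (N, n) membership degrees
    w = mu.prod(axis=1)                               # firing strengths w_i
    f = consequents[:, :-1] @ x + consequents[:, -1]  # rule outputs f_i(x)
    return float((w * f).sum() / w.sum())             # defuzzified crisp output z

def ga_fitness(params, X, y_hat, n_rules, n_feats):
    """MSE between FIS outputs z_i and black-box predictions y_hat_i;
    this is the loss L the GA minimizes."""
    k = n_rules * n_feats
    centers = params[:k].reshape(n_rules, n_feats)
    sigmas = params[k:2 * k].reshape(n_rules, n_feats)
    consequents = params[2 * k:].reshape(n_rules, n_feats + 1)
    z = np.array([sugeno_infer(x, centers, sigmas, consequents) for x in X])
    return np.mean((z - y_hat) ** 2)
```

For example, `bell_membership(0.8, 0.5, 0.2)` evaluates to roughly 0.31, and a chromosome for N rules over n features concatenates N·n centers, N·n widths, and N·(n+1) consequent coefficients.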
4.2.4. Step 4: Prompt Design for LLMs
After the FIS is trained, its rules and parameters represent an interpretable model. However, to make it truly accessible, the authors use an LLM. They employ Prompt Engineering (PE) to guide the LLM to generate a high-quality explanation. The prompt is structured into six sections:
- Task: Clearly state the goal: use an LLM to explain the results of a fuzzy logic system for steel defect detection.
- Context: Provide background information, including the dataset (Severstal), the task (classification based on image features like color and texture), and the LLM's role.
- Example: Give a concrete example of the desired output, such as "Based on the fuzzy logic rules, the color variation and irregularities in texture in the image suggest the presence of a scratch."
- Roles: Define the roles: the LE-FIS is the "detector," and the LLM is the "explainer."
- Format: Specify the output structure, which should include the detection result, a detailed explanation, and a possible recommendation.
- Tone: Instruct the LLM to use a professional yet accessible tone, explaining technical terms clearly for non-experts.
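A hypothetical assembly of this six-section prompt might look as follows; the paper does not give the authors' exact wording, so every string here is illustrative:

```python
# Illustrative six-section prompt template; the authors' exact wording is not
# given in the text, so the content below is an assumption for demonstration.
PROMPT_TEMPLATE = """\
Task: Explain the results of a fuzzy logic system used for steel defect detection.
Context: Severstal steel-surface images; defects are classified from image
features such as color and texture. You are given the system's rules and outputs.
Example: "Based on the fuzzy logic rules, the color variation and irregularities
in texture in the image suggest the presence of a scratch."
Roles: LE-FIS is the detector; you, the LLM, are the explainer.
Format: (1) detection result, (2) detailed explanation, (3) possible recommendation.
Tone: Professional yet accessible; explain technical terms for non-experts.

Fuzzy rules and firing strengths:
{rules}

Detection output:
{result}
"""

prompt = PROMPT_TEMPLATE.format(
    rules="IF texture is 'rough' AND brightness is 'low' THEN defect is 'likely' (w=0.63)",
    result="scratch (score 0.92)",
)
```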
5. Experimental Setup
5.1. Datasets
The study uses the Kaggle Steel Defect Detection dataset. This is a well-known public dataset for benchmarking industrial defect detection models.
- Source and Scale: The dataset contains 12,568 training images, each with a resolution of 1600 × 256 pixels.
- Characteristics and Domain: The images are of flat steel surfaces. The key feature is that defects are annotated with pixel-wise masks, which allows for training and evaluating segmentation models, not just classification.
- Data Sample: The dataset contains four types of surface defects:
- Type 1 (Scratches): Linear, narrow markings.
- Type 2 (Patches): Irregular, larger defects with rough textures.
- Type 3 (Dents): Depressions or surface distortions.
- Type 4 (Cracks): Thin, elongated fractures.
  The paper notes a significant challenge with this dataset: class imbalance, where some defect types are much more common than others. This requires careful handling during model training to avoid bias.
- Choice Rationale: This dataset was chosen because it is a standard benchmark in the field and provides the precise pixel-level annotations needed to validate the segmentation performance of the proposed model.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate the performance of the defect detection and segmentation.
- Dice Coefficient: This metric measures the overlap or similarity between the predicted segmentation mask and the ground-truth mask. It is particularly useful for imbalanced datasets where the region of interest (the defect) might be very small compared to the background.
  - Conceptual Definition: The Dice coefficient calculates twice the area of the intersection of the two masks, divided by the sum of their areas. A score of 1 means a perfect match, and 0 means no overlap.
  - Mathematical Formula: $ \mathrm{Dice} = \frac{2 \times |X \cap Y|}{|X| + |Y|} $
  - Symbol Explanation:
    - $X$: The set of pixels in the predicted segmentation mask.
    - $Y$: The set of pixels in the ground-truth mask.
    - $|\cdot|$: The number of elements in a set (i.e., the area of the mask).
    - $\cap$: The intersection operator, representing the pixels common to both masks.
- Precision: This metric measures the accuracy of the positive predictions.
  - Conceptual Definition: Of all the pixels that the model predicted as defects, what proportion were actual defects? High precision means the model makes few false positive errors.
  - Mathematical Formula: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $
  - Symbol Explanation:
    - TP (True Positives): The number of defect pixels correctly identified as defects.
    - FP (False Positives): The number of non-defect pixels incorrectly identified as defects.
- Recall (Sensitivity): This metric measures the model's ability to find all actual positive instances.
  - Conceptual Definition: Of all the actual defect pixels present in the image, what proportion did the model successfully identify? High recall means the model makes few false negative errors.
  - Mathematical Formula: $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
  - Symbol Explanation:
    - TP (True Positives): The number of defect pixels correctly identified as defects.
    - FN (False Negatives): The number of actual defect pixels that the model failed to identify.
- F1-score: This metric provides a single score that balances both Precision and Recall.
  - Conceptual Definition: The F1-score is the harmonic mean of Precision and Recall. It is a good overall measure of a model's accuracy, especially when the class distribution is uneven.
  - Mathematical Formula: $ \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
  - Symbol Explanation:
    - Precision: The precision score.
    - Recall: The recall score.
5.3. Baselines
The paper compares its proposed LE-FIS method against a baseline model described as a "Complex deep learning method [26]". The reference [26] is "A deep learning model for steel surface defect detection" by Li et al. (2024). This baseline represents the state of the art in pure predictive accuracy and serves as the reference against which LE-FIS trades some performance for explainability.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive analysis of the performance of its proposed LTGP detector and the LE-FIS framework.
Visual Detection Results: Figure 3 shows qualitative results of the model's predictions. The images display the original steel surface, the ground-truth mask of the defect, and the mask predicted by the model. These examples visually confirm that the model is capable of accurately localizing different types of defects, such as linear scratches and larger "crazing" defects.

Training Performance: Figure 4 shows the training progress of the underlying deep learning model.
- Figure 4(a) shows that the validation accuracy increases rapidly and stabilizes at a high level, indicating that the model learns effectively and does not suffer from severe overfitting.
- Figure 4(b) shows that the Dice coefficient also steadily improves, confirming that the model's segmentation accuracy gets progressively better at matching the ground-truth defect shapes.

Genetic Algorithm Optimization: Figure 5 illustrates the optimization process of the FIS parameters by the Genetic Algorithm.
- Figure 5(a) plots the best and mean fitness values over generations. The sharp drop in the early generations shows that the GA quickly finds good parameter combinations. The subsequent stabilization suggests the algorithm has converged to an optimal solution.
- Figure 5(b) shows the stall generations, indicating that while the optimization slows down in later stages, it avoids getting stuck in poor local optima. This confirms the effectiveness of the GA for tuning the LE-FIS.

Fuzzy System Analysis: Figure 6 provides insights into the learned FIS.
- Figure 6(a) is a t-SNE visualization, which shows that the features used by the FIS create a clear separation between the different defect categories (scratches, patches, dents, cracks). This means the features are discriminative.
- Figure 6(b) displays the learned membership functions. The smooth curves show how the FIS handles uncertainty by allowing gradual transitions between fuzzy sets (e.g., from 'low' to 'medium' defect likelihood).
The image is Figure 6 of the paper, containing two subplots: (a) a t-SNE visualization of the true steel defect categories, showing the distribution of scratches, patches, dents, and cracks; (b) the three membership functions of input variable 1 plotted against the image feature value.
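For readers who want to reproduce a visualization like Figure 6(a), the following scikit-learn sketch shows how such a t-SNE projection is typically produced. The random features and labels are stand-ins, since the paper's exact feature-extraction step is not specified here.

```python
# Illustrative t-SNE projection of FIS input features, colored by defect class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))    # stand-in for extracted image features
labels = rng.integers(0, 4, size=200)   # stand-in for the 4 defect classes

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
for cls, name in enumerate(["scratches", "patches", "dents", "cracks"]):
    pts = emb[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=10, label=name)
plt.legend()
plt.title("t-SNE of FIS input features")
plt.show()
```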
6.2. Data Presentation (Tables)
Performance Comparison: The following are the results from Table 1 of the original paper:
| Class | [26] Precision (%) | [26] Recall (%) | [26] F1-score (%) | LE-FIS Precision (%) | LE-FIS Recall (%) | LE-FIS F1-score (%) |
|---|---|---|---|---|---|---|
| Scratches | 92.75 | 91.92 | 92.33 | 82.18 | 80.45 | 81.30 |
| Patches | 90.63 | 89.74 | 90.18 | 79.58 | 78.22 | 78.89 |
| Dents | 88.92 | 87.88 | 88.40 | 77.45 | 75.22 | 76.32 |
| Cracks | 93.85 | 92.96 | 93.40 | 83.12 | 81.24 | 82.17 |
| Macro Avg | 91.54 | 90.62 | 91.08 | 80.58 | 78.78 | 79.67 |
| Weighted Avg | 91.97 | 91.75 | 91.86 | 81.03 | 79.18 | 80.09 |

Class support (identical for both methods): Scratches 100, Patches 30, Dents 15, Cracks 50.

| Time cost (s) | Complex deep learning method [26] | LE-FIS |
|---|---|---|
| Training | 223.21 | 54.23 |
| Inference | 4.23 | 1.20 |
Analysis: This table highlights the central trade-off of the paper. The complex deep learning method [26] achieves superior performance across all metrics (e.g., a weighted average F1-score of 91.86%). The proposed LE-FIS method has lower scores (weighted average F1-score of 80.09%). However, LE-FIS offers a massive advantage in efficiency: its training time is over 4x faster (54.23s vs. 223.21s) and its inference time is over 3x faster (1.20s vs. 4.23s). The key argument of the paper is that this performance drop is an acceptable price for gaining full model transparency and reliability, which LE-FIS provides.
LLM Explanation Performance: The following are the results from Table 2 of the original paper:
| | Copilot | ChatGPT-4o | ChatGPT-4o mini | ChatGPT-o1 mini | ChatGPT-4 | Qwen | Mistral | Llama |
|---|---|---|---|---|---|---|---|---|
| Words (IQ) | 700 | 928 | 714 | 1125 | 619 | 754 | 572 | 598 |
| Words (DQ) | 1036 | 1924 | 789 | 1357 | 788 | 738 | 1033 | 1253 |
| Membership function explanation | Yes | Yes | No | No | No | Yes | Yes | No |
| Rules explanation | Yes | Yes | No | Yes | Yes | Yes | No | Yes |
| Feature explanation | No | Yes | No | Yes | No | Yes | No | No |
Analysis: This table compares the ability of different LLMs to explain the LE-FIS output. The authors evaluate them based on the length of the explanation (word count for an initial question, IQ, and a detailed question, DQ) and their ability to explain three key aspects: membership functions, rules, and features.
- Verbosity: ChatGPT-4o and ChatGPT-o1 mini provide the most detailed responses (DQ word counts of 1924 and 1357, respectively).
- Capability: There are clear differences in capability. Copilot, ChatGPT-4o, Qwen, and Mistral were able to explain the membership functions, while the others were not. Most models could explain the fuzzy rules, but only ChatGPT-4o, ChatGPT-o1 mini, and Qwen provided feature-level explanations. Based on this synthesis, the authors selected GPT-4o as the best-performing model for generating the final explanation, as it demonstrated a strong ability to provide detailed, multi-faceted explanations. The paper then presents the refined output from GPT-4o as the final explanation.
6.3. Ablation Studies / Parameter Analysis
The paper does not contain a formal ablation study section where individual components of the LE-FIS model (e.g., GA optimization, specific rule structures) are removed to measure their impact. However, the analysis of the GA optimization process in Figure 5 functions as a form of parameter analysis, demonstrating that the GA effectively converges and finds optimal parameters for the FIS, which is crucial for the model's performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents a novel framework, LE-FIS, for building an explainable and trustworthy steel defect detection system. The authors combined a powerful deep learning detection method (LTGP) with an interpretable fuzzy inference system (FIS) that acts as a surrogate model. The FIS parameters are optimized with a genetic algorithm to closely mimic the black-box detector. The framework's major innovation is the use of Large Language Models (LLMs) to translate the technical fuzzy logic into human-readable explanations.
The experimental results validate the approach. The LE-FIS surrogate achieves reasonable detection performance (81.03% weighted average precision), and while this is lower than the pure black-box model, it is presented as a satisfactory trade-off for achieving full interpretability and faster computation. The LLM component was shown to effectively generate comprehensive explanations, enhancing the transparency and reliability of the entire system for industrial use.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their work:
- Data Dependency: The performance of LE-FIS is highly dependent on large volumes of high-quality, labeled training data. Its accuracy may degrade significantly in data-scarce scenarios.
- Domain Specificity: While effective for steel defect detection, the model's performance may not directly transfer to other industries or defect types without recalibration and fine-tuning, as data distributions and defect characteristics can vary widely.
Based on these limitations, they propose the following directions for future work:
- Extend to other defect types and materials.
- Further optimize the fuzzy inference system.
- Integrate a real-time feedback loop for improved adaptability on the production line.
7.3. Personal Insights & Critique
This paper offers a thoughtful and practical approach to the pressing challenge of XAI in industrial settings.
Strengths:
- The core concept of using an FIS as a surrogate model and then an LLM as a "translator" is highly innovative. It creates a multi-layered explanation that is both logically sound (the fuzzy rules) and easily consumable (the natural language output).
- The LTGP training strategy is a clever and practical solution for handling the very large images common in industrial inspection.
- The paper's focus on trustworthiness and transparency, not just accuracy, is crucial for real-world adoption of AI in critical systems.
Potential Issues and Areas for Improvement:
- The Fidelity-Interpretability Trade-off: The central premise rests on a significant trade-off: a roughly 10-point drop in F1-score in exchange for explainability. In a safety-critical or high-cost manufacturing process, this might be an unacceptable compromise. The paper justifies it, but a deeper analysis of the cost-benefit in an industrial context would strengthen the argument.
- The Surrogate Model Gap: The explanation provided is for the LE-FIS surrogate model, not the original ResNet-18 black-box. While the MSE loss is minimized to ensure fidelity, there is no guarantee that the FIS reasons in the same way as the ResNet. The paper could be strengthened by including metrics that specifically measure the "fidelity" or "faithfulness" of the surrogate model to the black-box.
- Subjectivity in LLM Evaluation: The method for comparing LLMs is somewhat superficial. Word count is a poor proxy for explanation quality, and the decision of whether a model "explains" a concept is subjective. A more rigorous evaluation, perhaps involving human operators rating the usefulness and clarity of the explanations, would be more compelling.
- Genericity of the Final Explanation: The "best explanation" example provided in the text feels quite generic. It explains the general principles of the system rather than providing a specific, instance-based explanation for a particular defect detection. A more powerful system would generate a unique report for each detected defect, e.g., "This area was flagged as a 'scratch' because the fuzzy rules for 'high edge response' and 'low texture variance' fired strongly, with a confidence of 92%." This would be more actionable for an operator.
Overall, this paper is a valuable contribution that pushes the boundaries of practical XAI. Its à-la-carte approach of combining different AI techniques (ResNet, FIS, GA, LLM) into a single pipeline is a powerful paradigm that could be adapted to many other domains where accuracy and transparency are both required.